
 SpamAssassin vs. Spastic
 by Keith Winston, in Editorials - Sat, May 24th 2003 00:00 PDT

SpamAssassin has emerged as the most popular antispam tool in the Open Source world. It has gained such momentum that it has even crossed over into the commercial world as SpamKiller by Network Associates, and other commercial products are also based on it. This article is a short comparison of real world results between two antispam tools, SpamAssassin and Spastic.


Copyright notice: All reader-contributed material on freshmeat.net is the property and responsibility of its author; for reprint rights, please contact the author directly.

Disclaimer: I am the current project leader and main developer for Spastic.

Types of antispam programs

Without getting into all the intricacies of email RFCs, I should mention that spam can be fought in many places throughout the system. Most mail servers, or Mail Transfer Agents (MTAs), have some antispam capabilities, but most users don't have the ability or desire to run their own mail servers. The Mail Delivery Agents (MDAs) are programs that take mail from an MTA and deliver it to local mailboxes. procmail is a very popular MDA and is the means by which both SpamAssassin and Spastic are usually invoked. Finally, many mail clients, or Mail User Agents (MUAs), have some antispam capabilities. One promising new trend is Bayesian filtering, which is built into the latest version of the Mozilla mail client (among others). However, this article is focused on two tools which filter at the MDA level using procmail.

Overview of SpamAssassin

SpamAssassin is a collection of Perl modules which test elements of an email message and assign a numeric ranking to it. The higher the ranking, the more likely that the message is spam. The default settings define a spam message as anything with a score of 5.0 or higher. SpamAssassin also checks Realtime Blackhole Lists and has many other advanced features. It is usually called through procmail, although newer versions come with a powerful spamd/spamc client-server interface as well.
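To make the ranking idea concrete, here is a minimal shell sketch of additive scoring against a threshold. It is not SpamAssassin's code: the function names total_score and is_spam and the three example scores are made up for illustration, and only the 5.0 cutoff comes from the text above.

```shell
#!/bin/sh
# Sketch of additive scoring: every rule that fires contributes a score,
# and the message counts as spam once the total reaches the threshold.
# The scores passed in below are hypothetical, not SpamAssassin's.

THRESHOLD=5.0   # default cutoff mentioned in the article

# total_score: sum the numeric scores given as arguments.
total_score() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.1f\n", sum }'
}

# is_spam: succeed (exit 0) when the total is at or above the threshold.
is_spam() {
    awk -v s="$(total_score "$@")" -v t="$THRESHOLD" 'BEGIN { exit !(s >= t) }'
}

# Example: three hypothetical rules fired with these scores.
if is_spam 2.1 1.7 1.5; then
    echo "spam (score $(total_score 2.1 1.7 1.5))"
else
    echo "not spam (score $(total_score 2.1 1.7 1.5))"
fi
```

Because the verdict depends on a sum, no single rule has to be decisive; several weak signals can add up to a flag, and a negative score can pull a message back under the cutoff.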

Overview of Spastic

About two years ago, the level of spam I began to receive crossed my pain threshold, and I was motivated to take control of the problem. I tried several Open Source spam solutions, including SpamAssassin. At the time, the numeric ranking method of determining spam by SpamAssassin seemed counterintuitive. How do you know how to effectively weigh each setting? In time, I stumbled across SPAST, which was a relatively simple-to-understand procmail script which used word lists to match against elements of an incoming message. It was simple to set up, understand, and customize. The problem was that SPAST was no longer supported by its author, Chrissie LeMaire. I tracked Chrissie down and asked her permission to take over the SPAST project and develop it. Thus, Spastic was born.

Spastic uses procmail and common system utilities like formail, dig, and egrep to scan elements of an email message for patterns, check for valid domains and address formats, etc. One big difference between Spastic and SpamAssassin is that Spastic rules are binary. When a Spastic rule fires, the message is flagged as spam. If a message passes all the tests, it is not flagged. There is no ranking system. The Spastic distribution also includes bash scripts for reporting statistics and rotating spam archives.
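The binary approach can be sketched in a few lines of shell. This is an illustration under assumptions, not Spastic's actual recipes: the flag_message function and the two patterns are hypothetical, and the real project runs its checks from procmail rather than from a standalone function.

```shell
#!/bin/sh
# Sketch of binary rule matching: if any pattern matches, the message is
# flagged; there is no score to add up.
# The patterns are hypothetical examples, not Spastic's word lists.

PATTERNS='^Subject:.*(free money|act now)
^Subject:.*viagra'

# flag_message: read a message on stdin, print the verdict, and return 0
# if any pattern matches (case-insensitively), 1 otherwise.
flag_message() {
    if grep -E -i -q -e "$PATTERNS"; then
        echo "flagged"
        return 0
    fi
    echo "clean"
    return 1
}

# Example: a made-up message whose subject trips the first pattern.
printf 'From: a@example.com\nSubject: Act NOW for free money\n\nhello\n' | flag_message
```

The trade-off is visible in the code: one matching pattern is enough, so the rules are easy to reason about, but a single overly broad pattern produces a false positive with nothing to offset it.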

Testing Method

The way I tested each program was to set it up to filter all incoming email for a seven day period and log the success rate of each. I made no configuration changes or tweaks to either program during the test. The main configuration I did for SpamAssassin was setting up my whitelist and a couple of cosmetic settings. Since I am on several mailing lists, I receive about 300 messages a day. In this mix is usually a small number of spam messages which come from a variety of sources. I usually receive about 10-20 spam messages a week, which I consider low by most standards today.

I tested SpamAssassin from April 14-20, 2003 and Spastic from April 21-27, 2003.

While my test results are accurate for the email I typically receive, I can't generalize my results to other email users. Please keep in mind that your results may vary.

Test Results

                                           SpamAssassin      Spastic
Spam messages correctly stopped            16                10
False positives                            1                 0
Missed spam messages                       1                 2
Messages processed outside of whitelists   51                49
Incorrect results                          2 out of 69       2 out of 65
Percent correct                            67/69 = 97.10%    63/65 = 96.92%

Unfortunately, I realized too late that I should have saved the messages with which each program made an error and cross-tested them against the other one to see if it would have done better. I made a note of it for the next time I run a comparison test.

The results were very close, with SpamAssassin ending with a slightly higher percentage for correctly processing messages. If you are more concerned with false positives, Spastic came out slightly ahead, since many people would rather see a spam message slip through their filter than take a chance on losing an important message. Keep in mind that these sample sets are very small, so drawing firm conclusions is difficult.
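The percentages in the results can be reproduced with a one-line calculation. The helper name percent_correct is an assumption for this sketch; the inputs are the counts reported above (67 of 69 for SpamAssassin, 63 of 65 for Spastic).

```shell
#!/bin/sh
# percent_correct CORRECT TOTAL: print CORRECT/TOTAL as a percentage
# with two decimal places.
percent_correct() {
    awk -v c="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 100 * c / n }'
}

percent_correct 67 69   # SpamAssassin: 2 errors out of 69
percent_correct 63 65   # Spastic: 2 errors out of 65
```

With totals this small, one more error would move either figure by about one and a half percentage points, which is larger than the gap between the two programs.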

Strengths and Weaknesses

After using the programs back-to-back, I have some observations about the strengths and weaknesses of each.

SpamAssassin strengths

A nice spamd/spamc interface, efficient, easy to use.
This feature is intended to make the program easier to use and improve performance. It is one of my favorite features.
It's easy to customize the whitelist and add other rules in the ~/.spamassassin/user_prefs file.
By adding or modifying rules in your personal user_prefs, you can customize the behavior and weightings if you don't like the defaults.
More tests, more generalized, more accurate.
It is much more sophisticated in testing elements for spam qualities than Spastic, and a better generalized solution for filtering an entire site.
Very easy to implement under Red Hat 9 by selecting it during installation.
This makes installation in Red Hat 9 drop dead easy.
A large community supporting and testing it.
A more detailed report of spam triggers.
SpamAssassin provides a detailed report of each spam rule that adds up to the final ranking for the message.
Support for other antispam tools like Vipul's Razor and RBLs.

SpamAssassin weaknesses

Depends on Perl.
Since SpamAssassin is written in Perl, it requires a recent version of Perl to be installed on the local machine. It depends on many modules and Perl packages, and may be affected if Perl is upgraded on the machine.
You may not be able to use it if you do not have rights to install Perl.
If you don't have rights to install Perl on the target machine, you can't use SpamAssassin. In most cases, this is not an issue, since Perl is installed on the majority of *nix machines.
The default setting mangles messages flagged as spam (by changing MIME types).
I hesitate to mention this as a drawback because the default setting is this way to protect users from Web bugs and malicious HTML content like JavaScript. When SpamAssassin flags a message as spam, it changes the MIME type of all attachments to text so they are no longer executable. However, if the mail was a false positive, it may be difficult to recover the original message format if it was base64 encoded or was a multi-part MIME message. This default can be changed by setting the "defang_mime" option to 0.
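As a rough illustration of the per-user customization mentioned above, the following shell sketch writes a small sample preferences file. The function name, the scratch file, and the address are placeholders; whitelist_from and defang_mime are option names from the SpamAssassin releases of this period, so check the documentation for your version before copying them into your own user_prefs.

```shell
#!/bin/sh
# write_sample_prefs FILE: write a small sample preferences file to FILE.
# The address is a placeholder; option names may differ between versions.
write_sample_prefs() {
    cat > "$1" <<'EOF'
# Never treat mail from this sender as spam (placeholder address).
whitelist_from friend@example.com

# Leave the MIME types of flagged messages untouched.
defang_mime 0
EOF
}

# Write to a scratch file and show it, rather than touching live settings.
sample=$(mktemp)
write_sample_prefs "$sample"
cat "$sample"
rm -f "$sample"
```

Writing to a scratch file first makes it easy to review the settings before merging them into the preferences file you actually use.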

Spastic strengths

Very easy to implement on any Linux distribution, easy on most *nixes.
In most cases, you can download the 60k tar.gz file, unzip it, run the setup script, and be ready to filter spam in about 5-10 minutes.
Depends on common system utilities (procmail, grep, and dig).
Since Spastic uses procmail and common system utilities, it is unlikely that additional software installation or configuration will be required to run it. Unless it is used as a site filter, root access is not required. It may be the best choice to use on a hosted server if Perl/SpamAssassin is not available.
It's easy to customize the whitelist and change rules and filter lists.
Customizing the whitelists and rules is a simple matter of editing a few text files.
A rotate-spam script to archive spam folders and produce statistical reports.
Spastic includes an optional bash script which can be run from cron to rotate the spam mailbox and keep up to nine archives. It also summarizes the reasons that messages were flagged and provides totals so you can see who is sending you the most spam. Note: with a few small tweaks, I was able to use the rotate-spam script with SpamAssassin to provide similar functions.
Basic antivirus recipes.
Spastic can flag any message carrying executable content to prevent it from reaching a vulnerable Windows box and causing damage.
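The archive rotation described above might look roughly like the sketch below. It is not the rotate-spam script that ships with Spastic; rotate_mbox is a hypothetical function that keeps up to nine numbered copies by default and leaves out compression, statistics, and mailbox locking.

```shell
#!/bin/sh
# rotate_mbox FILE [KEEP]: shift FILE.1 -> FILE.2 and so on, up to KEEP
# archives, then move FILE to FILE.1 and start a fresh empty FILE.
# A sketch only: no compression, no locking against concurrent delivery.
rotate_mbox() {
    file=$1
    keep=${2:-9}
    [ -f "$file" ] || return 1

    i=$keep
    while [ "$i" -gt 1 ]; do
        prev=$((i - 1))
        # The oldest archive (FILE.KEEP) is overwritten, which drops it.
        [ -f "$file.$prev" ] && mv -f "$file.$prev" "$file.$i"
        i=$prev
    done

    mv -f "$file" "$file.1"
    : > "$file"
}
```

A cron entry could call it as rotate_mbox "$HOME/mail/spam", where the mailbox path is again only an example. On a live system, delivery would need to be paused or the mailbox locked while the files are moved.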

Spastic weaknesses

Not as accurate as SpamAssassin.
SpamAssassin does more tests and is more thorough. The default weights (determined by a genetic algorithm, no less) in SpamAssassin are very good and proved to be slightly more accurate in my testing. For a sitewide antispam solution, I have no doubt that SpamAssassin is more accurate than Spastic. For individuals who tune their filter files to the email they receive, Spastic and SpamAssassin are about equally effective.
Small community supporting and testing it.
Since SpamAssassin has a much larger community, it is better tested and supported.

Conclusion

SpamAssassin is the king of spam filtering for a reason. It is very sophisticated, well designed, and effective. For a sitewide filtering solution, I would strongly recommend SpamAssassin over Spastic. If you can't use SpamAssassin on a particular box (like a hosted box), or if you want a simpler solution for a small number of users, Spastic will also serve you well.

If you want to explore further, here are two other interesting antispam tools:

Editor's Note

This is just the tip of the growing iceberg of antispam tools in circulation today. I've been very happy with SpamAssassin for the last year or so. What are you using? What's your experience been with it? What's still slipping through? Where do you think the spam war is headed?


Author's bio:

Keith Winston would like to hear about all the latest Nigerian breast enlargement techniques at slippery@users.sourceforge.net.



 Referenced categories

Topic :: Communications :: Email :: Filters

 Referenced projects

Apache SpamAssassin - An extensible email filter that is used to identify spam.
Mozilla Seamonkey - A Web browser and email client.
POPFile - An automatic email classifier using a Naive Bayes algorithm.
procmail - Versatile e-mail processor.
SPASTIC - Simple Procmail Anti-Spam Templates (improved code).
Vipul's Razor - Spam detection and filtering network.

 Comments

[»] 17 spam in 6 days?
by Serge Knystautas - Aug 25th 2003 21:48:34

The sample size is so minimal, the tests are pretty much meaningless. But more importantly, you're getting 3 spam a day and you care about spam email?



[»] SpamAssassin vs. SPASTIC vs. Bayesian in another posting
by era - Aug 24th 2003 23:30:38

You'll notice that a more proper test is now at http://freshmeat.net/articles/view/964/



[»] Why is Perl a drawback but Procmail isn't?
by dozer - Jul 11th 2003 21:44:33

Pretty much every Unix system on the planet now has Perl installed. This is certainly not true of Procmail. So explain, please, why you consider that being implemented in Perl is a drawback? Performance? No. Availability? No. Compatibility? Maybe, but not with SpamAssassin. I don't understand.

Now, in my experience, a Procmail implementation is certainly a drawback. Procmail is a tool that peaked in 1998. Now we have easier to use, more capable and, most importantly, more secure solutions (Sieve + Amavis is one notable one). It's time to put Procmail's security holes and awful syntax to pasture.



[»] A four letter word
by David Collantes - Jun 27th 2003 05:21:28

TMDA, actually, the best Spam reducer tool. Clean, professional, accurate.



    [»] Re: A four letter word
    by Macdaddy - Aug 11th 2003 14:23:02


    > TMDA, actually, the best Spam reducer
    > tool. Clean, professional, accurate.

    I too am thinking of a 4-letter word that describes TMDA. Unfortunately it's not "TMDA."



[»] You don't need root to install Perl
by era - Jun 16th 2003 23:51:40

In addition to the comments about the lack of statistical validity for such a small sample (come on, it's not hard to get samples of spam, thousands of'em!) I'd like to remark that it is by no means impossible (though also not necessarily very straightforward) to install Perl for your own use, without any administrator privileges. I believe the Perl installer offers this as an option (but of course I haven't compiled Perl myself in eons ... apt-get rules :-).



[»] Plain bogofilter a simple effective alternative
by Mat Farrington - Jun 13th 2003 01:57:50

Rather than upgrade spamassassin again, I replaced it with bogofilter alone.

After a few thousand training emails it now outperforms that version of spamassassin (which admittedly was ageing). I expect performance to improve further with ongoing training.

Local email aliases allow users on my system to maintain personal bogofilter databases.

I appreciate that recent versions of spamassassin have bayesian learning and that bogofilter can be trained using spamassassin output, but see little reason to complicate an already-effective and elegant solution.



[»] Ask for another solution
by I. B. Turner - Jun 11th 2003 23:13:06

I've used ask (Active Spam Killer) for exactly a year. It has let through 2 spams in that time, while I get 15-30 per day.

Advantages: people that you care about can easily get through. Very small percentage of spams get through. White list is easy to set-up.


Disadvantages: takes too much disk space in its queue; it saves the spam for too long by default and should probably compress/decompress it.

Seeming disadvantage that I haven't encountered: it relies on replies which should make spammers think your address is real and increase the amount of spam you get. I haven't found this to be the case.

All that said I'm happy with it.



[»] Scientific method anyone?
by Marcelo E. Magallon - Jun 7th 2003 10:57:09

Catchy title. I mean, basically "SpamAssassin vs Anything" is catchy :-)

The following made some alarms trigger:

I tested SpamAssassin from April 14-20, 2003 and Spastic from April 21-27, 2003.

bzzzzzt! Wrong! You don't compare spam filtering tools like that. You compare them with the same corpus, otherwise the comparison is worthless. Since SpamAssassin is quoted as producing one false positive, it would have been nice to see the message that did that and the reason why SpamAssassin thought it was spam. The version of SpamAssassin is also missing, which complicates things further in the reproducibility department.

The later Perl-bashing is also not welcome. If you don't like Perl for whatever reason, please write a disclaimer at the top of your article ("I have a bias against Perl, nevertheless I'm going to write a comparison that involves a Perl program") so that readers can know what to expect. In particular, SpamAssassin is not the kind of thing people would want to install on their own. Ask your system administrator to set it up on the machine you use to receive email. Any competent system administrator will set up spamd and that won't make it necessary to run N copies on the machine in question, which I'm sure has better things to do with the CPU (like, uhm, receiving email).

Regarding the perceived advantages of Spastic, rotating email in a folder is a no-brainer given that you have grep-mail handy. The same goes for summarizing the reasons why mail got flagged as spam or non-spam. The bit about executables can be done with a system wide procmail setup and it really doesn't have anything to do with classifying spam.



[»] SpamAssassin
by hnoesekabel - Jun 4th 2003 04:26:00

My email address at work is protected by SpamAssassin, and it actually does a pretty good job of cutting down spam. As for the false positives: SpamAssassin only detects and tags spam; whatever you do with these messages is up to you. You can delete messages with a score of, say, 20+ automatically and store the rest in a folder. That way, you cut your 'losses'.

As for the installation of SpamAssassin: install it with CPAN. Quick and simple. I could do it ;)

Right now, I keep the highest scoring spam mails for fun. Current top score is 60 (with the default scores).



[»] complexity is good
by Florin Andrei - Jun 2nd 2003 16:45:16

Spam is a complex thing. Sometimes even humans have issues identifying it as such - I've heard people saying they actually got good deals from spam. So no wonder it's extremely hard for a computer to tell spam from ham.
Complex problems require complex solutions. Therefore, don't expect a simple one-off solution to be good at catching spam.
Take a look at this article:
Fairly-Secure Anti-SPAM Gateway Using OpenBSD, Postfix, Amavisd-new, SpamAssassin, Razor and DCC
It describes a method to combine SpamAssassin with other anti-spam techniques (Vipul's Razor, DCC) and with anti-virus stuff to better handle bad e-mail. Worth reading!



[»] some other approach
by karellen - May 24th 2003 23:18:19

I rather like the other approach. Using blacklists to waste spammer's time on a phony mail transport agent and drive the cost of spam skyhigh. I think the OpenBSD community did something in this direction. This combined with some kind of bayesian filtering that kills the spam *before* it reaches the MTA (or built into the MTA via some kind of hook that calls an external program). Nobody likes to queue spam. My system is not a spammer trash can.



    [»] Re: some other approach
    by cloudmaster - May 27th 2003 14:54:56


    > I rather like the other approach. Using
    > blacklists to waste spammer's time on a
    > phony mail transport agent and drive the
    > cost of spam skyhigh. I think the
    > OpenBSD community did something in this
    > direction. This combined with some kind
    > of bayesian filtering that kills the
    > spam *before* it reaches the MTA (or
    > built into the MTA via some kind of hook
    > that calls an external program). Nobody
    > likes to queue spam. My system is not a
    > spammer trash can.

    Here's a couple of commands that might help you:

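    # Pipe all incoming mail to procmail (prints its path instead of running it):
    printf '|%s\n' "$(which procmail)" > ~/.forward

    # Discard anything SpamAssassin has tagged as spam:
    (echo ':0';echo '* ^X-Spam-Status: Yes';echo '/dev/null';echo) >> ~/.procmailrc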

    Then, assuming your mail's going through spamassassin, the message is instantly deleted. You don't waste any space on it.

    Seriously, a lot of spam makes it past the checks that an MTA can make on the connection info (RBLs, validity checks), so you need *something* to check the body and client-supplied headers for bad stuff. That has to be done after the message is received because of the way SMTP works. If you don't want to waste the space, you can use a maximum message size limit in combination with a spam checker (I use SpamAssassin) and something that throws away messages marked as spam.

    I've been using SA for a few months now - it tags around 100 messages daily (between the few accounts that I use). I've had 0 false positives. I've got some basic system-wide rules set up, and per-user whitelists stored in a database, managed with a simple PHP form that even our dullest users can handle. It's received only praise for letting users filter out their spam. My only advice is to look over the config settings, and change some of the defaults - the default setup is not ideal for the average user, IMHO.

    --
    ----------------------------- Light in the absence of eyes illuminates nothing



    [»] Re: some other approach
    by dystopia - Jun 18th 2003 12:31:19


    > I rather like the other approach. Using
    > blacklists to waste spammer's time on a
    > phony mail transport agent and drive the
    > cost of spam skyhigh. I think the
    > OpenBSD community did something in this
    > direction. This combined with some kind
    > of bayesian filtering that kills the
    > spam *before* it reaches the MTA (or
    > built into the MTA via some kind of hook
    > that calls an external program). Nobody
    > likes to queue spam. My system is not a
    > spammer trash can.

    OpenBSD uses 'spamd', which combines the 'spews' blacklist with
    a fake MTA that uses high tarpitting settings, in
    conjunction with its PF (Packet Filter).

    Read (not Reed) more about it at:
    http://www.benzedrine.cx/relaydb.html



[»] SpamAssassin weakness IMHO
by Gilgongo - May 24th 2003 16:02:56

I've been using SA for about 18 months on a mail server with 15 people on it getting personal mail, and have trialled it on a server for users getting business mail.

With any anti-spam system, false positives are a problem. This is compounded by the fact that very often, one man's spam is another man's legitimate communication. When I trialled SA with 10 users in our company, with a default score of 10 (which I thought was quite high) I spent about 4 hours in the first three weeks having to tweak SA scores and populate whitelists for these users. I was completely unprepared for the amount of stuff they were getting that they regarded as legit, which I myself would simply have binned if it arrived in my inbox. This meant that if I'd rolled it out to the remaining 90 users in the company, I'd be spending a hell of a lot of time maintaining SA rules.

Now, I know that SA has Bayesian filters that can be trained per user, but this isn't practical when all our users are non-technical and their access to the mail server is purely via POP3 with Outlook 2000.

I would therefore say that a weakness of SA is that it relies too heavily on system-wide rules that in turn produce too many false positives.

I have since looked at DSPAM, which is a purely Bayesian filter with no system-wide rules at all. The trouble is that it won't work with our new mail server config, which is running Mailscanner as a proxy to MS Exchange (don't ask, don't ask).

--
Gone are the days when you could say "Those were the days."



    [»] Re: SpamAssassin weakness IMHO
    by Macdaddy - May 29th 2003 16:39:46


    > I've been using SA for about 18 months
    > on a mail server with 15 people on it
    > getting personal mail, and have trialled
    > it on a server for users getting
    > business mail.
    >
    > With any anti-spam system, false
    > positives are a problem. This is
    > compounded by the fact that very often,
    > one man's spam is another man's
    > legitimate communication. When I
    > trialled SA with 10 users in our
    > company, with a default score of 10
    > (which I thought was quite high) I spent
    > about 4 hours in the first three weeks
    > having to tweak SA scores and populate
    > whitelists for these users. I was
    > completely unprepared for the amount of
    > stuff they were getting that they
    > regarded as legit, which I myself would
    > simply have binned if it arrived in my
    > inbox. This meant that if I'd rolled it
    > out to the remaining 90 users in the
    > company, I'd be spending a hell of a lot
    > of time maintaining SA rules.

    I rolled out SpamAssassin on a 3000 user production system. I'm still waiting to hear a single valid complaint on its accuracy. I'm no longer maintaining that system however. It's now running a dated copy of SA. My primary mail account is still there though. The amount of spam getting through that old copy of SA is steadily increasing. What people don't realize is that SA has to be kept up to date. No ifs ands or buts about it. YOU HAVE TO KEEP IT UP TO DATE. You can't be a lazy admin that only works on something when it's broken. A 2 year old FTP daemon will still work fine as long as no security holes have been found. A 3 month old copy of SA is out of date and must be upgraded. No excuses. Spammers are specifically targeting the negative scoring rules in older copies of SA to lower the overall score of their spam. It's a no brainer. Update your copy of SA or don't bitch and moan when it stops working as expected.



    [»] Re: SpamAssassin weakness IMHO
    by A'rpi/ESP-team - Jun 7th 2003 05:34:15

    I'm running SA 2.53 on a production server in a school (including teachers, administration, students), for ~1500 users. The flag limit is left at score 5.0, but it deletes mails with 10.0+ points. In the first weeks I got a few complaints (actually I had asked users to send them) about false positives and negatives. By manually tuning some SA rules and the bayesian filter I got rid of them. Since then, I haven't had a single complaint. Looking at statistics, there are around 1600 spam messages with a 10+ score and around 150 with a score of 5..10, weekly. So this filtering reduced the delivered spam level to about 10%, while still flagging spam-looking mails so users can separate them more easily. A'rpi



[»] I can only read this article if I'm not logged in to freshmeat.
by riddley - May 24th 2003 12:04:08

subject sez all



    [»] Re: I can only read this article if I'm not logged in to freshmeat.
    by jeff covey - May 25th 2003 12:33:12

    If you believe you've found a bug in freshmeat's code, you should report it at http://freshmeat.net/contact/.

    Thanks.

    --
    vs lbh pna ernq guvf, lbh'er n trrx.



      [»] Re: I can only read this article if I'm not logged in to freshmeat.
      by riddley - Jun 5th 2003 09:54:08


      >
      >
      > If you believe you've found a bug in
      > freshmeat's code, you should
      > report it at
      > http://freshmeat.net/contact/.
      >
      >
      >
      > Thanks.
      >

      I usually only submit bug reports to Open Source projects...



[»] Pros and cons
by Gustavo Muslera - May 24th 2003 10:11:01

It's good to put two fairly good spam detectors on the table, but...

- Extremely small sample (not putting the mailing lists on the whitelist could give a better hint on false positives, even with such a small sample). The accuracy results could be very different in the long run.

- AFAIK spamassassin includes bayesian filtering by now, and can use bogofilter anyway. Filtering by keywords or searching for duplicate messages a la razor is mostly useless by now for most spam, as spammers include random text, insert random html comments inside keywords, or even replace letters with symbols (i.e. w0rd instead of word).

- The article puts a reference to POPFile (a perl pop3 proxy with bayesian classification) but not to bogofilter, which works the same way as spamassassin and spastic. I'm not saying that popfile is bad; in fact, it is THE way I'm filtering spam right now, but I don't think it should be used at server level like the other two.



[»] Moderate View...
by antrik - May 24th 2003 09:25:43

Seeing all this flaming here, I think the author deserves to hear my somewhat
more positive opinion.

To someone like me, who has heard about SpamAssassin (actually, even read some
article -- which wasn't terribly useful...), but knows hardly anything about
its *practical* application, this article actually *is* very informative. Only
the title is somewhat misleading in this regard...

On the other hand, I agree that the "comparison test" is silly. Ever heard of
a thing called "statistical relevance"?...

To the other flamers: No, it's *not* necessary to use an identical test set. If
the test sets are large enough to give meaningful results, it's statistically
completely irrelevant whether they are the same. On the other hand, if the test
sets are too small (as they definitely are here), it doesn't help to use the
same test set -- the probability of one of the programs being handicapped by an
accumulation of messages it doesn't like is just the same.



[»] Geez...
by Chris Carlin - May 24th 2003 03:18:38

I mean seriously, no offense but this article completely failed to live up to its potential. I mean, objective (if not significant) comparison of server side spam filtering implementations? Great!

But not only did this one not cover many systems, the ones it did cover weren't explored in a meaningful way even in terms of this guy's specific case.

Next time use procmail (or whatever) to feed the same messages to each of the filtering systems and let the whole thing run for two weeks. THEN there will be something approaching worthy of the tshirt.



[»] More SpamAssassin features
by Bastian Kleineidam - May 24th 2003 03:13:27

SpamAssassin also supports pyzor (a free razor clone). And with the spamd/spamc feature and the ifspamh script I can use it with qmail which does not use procmail filtering but its own .qmail configuration.



[»] worthless?
by tooar - May 24th 2003 01:15:32

oh boy, fm should really care more about their articles. how intelligent does one have to be to know that a spam filter test with different emails is completely use- and worthless?

as it seems, in the end, the author realized about half of the truth: "Unfortunately, I realized too late that I should have saved the messages with which each program made an error and cross-tested them against the other one to see if it would have done better. I made a note of it for the next time I run a comparison test."

hey, not only the error messages, ALL messages.



    [»] Re: worthless?
    by Grant K Rauscher - May 24th 2003 02:23:06

    Active Spam Killer is nice - you manage your queue by e-mail... an HTML interface on the web for processing your queue would be easy, so it could integrate with webmail well. have meant to try SpamAssassin, though. ASK fm project page



    [»] Re: worthless?
    by David Necas (Yeti) - May 24th 2003 04:53:59

    I agree the article is bogus.

    To author:

    First, It's not so hard to set up procmail so that both filters can be tested simultaneously (by duplicating the queue) and thus in fair conditions. Anyone taking the testing seriously would do it.

    Second, how accurate do you expect results from 16 spams to be (leaving aside that they are different)? I get about 20 spams a day -- and I've never seen any false positive from SpamAssassin, while I see approximately one false negative per week. While I didn't perform any exact measurement, my claims are based on experiences with a sample of 10k+ e-mails.

    Then, spamassassin can remove its markup from the messages, just run it on the message again, with -d. If the markup really annoys you, it's not hard to automate this action.

    Then, you can always install perl to home. Well, not always, only if you have a reasonable disk quota. But who wants an account on a machine without perl and with a low disk quota? ;-)

    Then, ... OK, I read ``don't flame and insult others'' above, so I stop here ;-)



      [»] Re: worthless?
      by slippery - May 24th 2003 11:36:07

      > First, It's not so hard to set up procmail so that both filters can be tested simultaneously (by duplicating the queue) and thus in fair conditions. Anyone taking the testing seriously would do it.

      It was not meant to be an exhaustive test, but more anecdotal. I mention several times not to draw any firm conclusions from the test results. However, if I run any future comparisons, I will be much more careful and thorough.

      Spam is more topical today than two years ago since the problem has become so much worse recently. What I hoped to do was share my experience, and pass on a few things I learned about spamassassin and the issues surrounding spam in general.

      > Then, ... OK, I read ``don't flame and insult others'' above, so I stop here ;-)

      Flame on! The negative comments mean I failed to communicate my goal, my message, or both. I will learn from all the feedback, positive or otherwise.

      Best Regards, Keith



        [»] Re: worthless?
        by David Necas (Yeti) - May 24th 2003 14:32:26

        > It was not meant to be an exhaustive
        > test, but more anecdotal.

        If it was anecdotical, you probably shouldn't list the efficiencies with four significant digits ;-) (have you any idea how many e-mails you have to test to achieve this precision?) OK, normal people don't care... they also don't mind drawing conclusions from graphs w/o units and w/o a zero axis on TV... :-)
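
        As a rough way to put a number on that question, here is a minimal sketch using the usual normal approximation for a binomial proportion. The 90% accuracy, the 95% confidence level (z = 1.96), and the reading of "four significant digits" as a margin of +/-0.05 percentage points are all assumptions picked for illustration, not figures from the article.

```python
import math

def messages_needed(accuracy, margin, z=1.96):
    """Messages needed so a measured accuracy is within +/-margin.

    Normal approximation to the binomial: margin = z * sqrt(p * (1 - p) / n).
    z = 1.96 corresponds to roughly 95% confidence (an assumed choice).
    """
    return math.ceil(z * z * accuracy * (1.0 - accuracy) / (margin * margin))

def margin_for(accuracy, n, z=1.96):
    """Margin of error for an accuracy measured over n messages."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)

# Assumed example: 90% accuracy, wanted to within +/-0.05 percentage points.
print(messages_needed(0.90, 0.0005))
# The same accuracy measured on only 17 messages is far less certain.
print(round(margin_for(0.90, 17), 3))
```

        Under these assumptions the first figure runs to more than a million messages, while 17 messages leave a margin of roughly 14 percentage points. The approximation is crude for such a small sample, so treat the second number as an order of magnitude only.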

        An important thing is being up to date (you don't even mention the versions!). Spammers adapt, and old spam filter versions give considerably worse results than recent ones (I have experience mainly with SpamAssassin, but unless the filter is quite stupid and inefficient, this rule should be quite general). I would even suggest upgrading the filters during the test if/when a new version is released -- reality doesn't wait.

        > Flame on!

        So, at least one more SpamAssassin note: A weakness that is definitely worth mentioning is its speed -- or better, slowness. As someone pertinently commented: I don't want to compute the Universe, I just want to check for spam. Spamd slows down an SMTP server a lot, not to mention spamassassin run by individual users via procmail, which is even worse.



          [»] Re: worthless?
          by slippery - May 24th 2003 15:43:49

          >If it was anecdotical, you probably shouldn't list the efficiencies with four significant digits ;-)

          Remember, I stated the results could not be generalized. What I reported was what actually happened to 4 significant digits ;)

          > An important thing is being up to date (you don't even mention the versions!).

          The versions were in the original text (SpamAssassin 2.44 and Spastic 3.0), but removed by the editor.

          >So, at least one more SpamAssassin note: A weakness that is definitely worth mentioning is its speed -- or better, slowness.

          You have hit on what I think is one of the real harms of spam. It wastes resources, both computing and human. The more spam there is, the more resources are wasted dealing with it.

          Although there are some U.S. state laws prohibiting spam, legislation can't be effective unless it can be enforced globally. A redesign of SMTP that ensures e-mail headers can't be forged would be very hard or impossible to implement and would take years or decades to roll out. This is why I believe spam will be with us for a long time.

          The only way I can see to deal with it is to make it cost the spammer money. If it costs something to send a million spams, the spammers will be much more selective and targeted. It would not eliminate spam, but it would make it more like the junk mail you get in snail mail. The levels would drop to something more sane. How do you meter e-mail? I have no idea.



            [»] Re: worthless?
            by Sam - May 29th 2003 20:45:05


            > >If it was anecdotical, you probably
            > shouldn't list the efficiencies with
            > four significant digits ;-)
            >
            > Remember, I stated the results could not
            > be generalized. What I reported was
            > what actually happened to 4 significant
            > digits ;)


            You said it couldn't be generalised to other users, but that is a whole different thing from not being generalisable to a higher volume of your own email.

            It shouldn't be too hard to run each on the same corpus of at least a few hundred emails, if not a few thousand.

            If you do so in the future, then I would also suggest measuring recall and precision separately and not just reporting percent "correct" (and not reporting more sig figs than you have). Information retrieval is a reasonably old field with lots of previous work; recall and precision have served the field well and provide useful metrics.

            You may know all of this, but a ranting I shall go...

            Recall is the spam found divided by the spam that was present, 16/17 and 10/12 in your sample. Precision is the number of spams found divided by the number of items flagged, 16/17 and 10/10 in your sample.

            Measuring accuracy (which you did) and error (1-accuracy) is useful for doing a comparison; however, there needs to be some weighting of false positives to false negatives. A weighting of 1 is simply not useful for real world email filtering. After all, I suspect everyone would gladly trade a recall:100%, precision:95% for recall:95%, precision:100% without any hesitation.
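
            To make the definitions concrete, here is a small sketch that recomputes the fractions quoted above and adds a weighted error count. The true-positive, false-positive and false-negative counts are the ones implied by those fractions. The filter labels and the weight of 10 are purely hypothetical, chosen only to show how a weighting changes the ranking.

```python
from fractions import Fraction

def recall(true_pos, false_neg):
    """Spam caught divided by all spam that was present."""
    return Fraction(true_pos, true_pos + false_neg)

def precision(true_pos, false_pos):
    """Spam caught divided by everything the filter flagged."""
    return Fraction(true_pos, true_pos + false_pos)

def weighted_cost(false_pos, false_neg, fp_weight):
    """Error count in which each false positive counts fp_weight times."""
    return fp_weight * false_pos + false_neg

# Counts implied by the quoted fractions: (true_pos, false_pos, false_neg).
samples = {"first filter": (16, 1, 1), "second filter": (10, 0, 2)}

FP_WEIGHT = 10  # hypothetical: one lost real mail is as bad as ten missed spams

for name, (tp, fp, fn) in samples.items():
    print(name,
          "recall", recall(tp, fn),
          "precision", precision(tp, fp),
          "weighted cost", weighted_cost(fp, fn, FP_WEIGHT))
```

            With a weight of 1 both samples come out at 2 errors; with the assumed weight of 10 they come out at 11 and 2, which is the kind of difference plain accuracy hides.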

            Searching for spam filtering on CiteSeer (http://citeseer.nj.nec.com/cs) will provide a few interesting papers which show some good ways of doing comparisons. Obviously a freshmeat article doesn't need to be anywhere near as scientifically written :)

            Since you're the project leader of Spastic I'll mention that it would be nice if spam filter projects provided a simple testing interface. A script (or C code if that's what they like) that takes a mail spool file as input and outputs two mail spools - spam and non-spam - would be useful. Even just a mode where the program reads an email from stdin and produces appropriate exit codes for spam and non-spam would help.

            Of course Spastic might already do that, I haven't checked, I'm just raving :)
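
            The testing interface suggested above might look something like the following sketch. It uses Python's standard mailbox module and assumes the spool is in mbox format. The looks_like_spam rule is a hypothetical stand-in for a real filter, and the exit status convention (1 for spam, 0 for non-spam) is likewise an assumption, not anything Spastic or SpamAssassin defines.

```python
import mailbox
from email import message_from_string

def looks_like_spam(msg):
    """Hypothetical stand-in for a real filter: flags 'viagra' in the subject."""
    return "viagra" in str(msg.get("Subject", "")).lower()

def split_spool(spool_path, spam_path, ham_path, is_spam=looks_like_spam):
    """Read an mbox spool and copy each message to a spam or non-spam mbox."""
    spool = mailbox.mbox(spool_path, create=False)
    spam_box = mailbox.mbox(spam_path)
    ham_box = mailbox.mbox(ham_path)
    counts = {"spam": 0, "non-spam": 0}
    try:
        for msg in spool:
            if is_spam(msg):
                spam_box.add(msg)
                counts["spam"] += 1
            else:
                ham_box.add(msg)
                counts["non-spam"] += 1
    finally:
        # close() flushes any pending writes to disk.
        spam_box.close()
        ham_box.close()
        spool.close()
    return counts

def exit_code_for(raw_message, is_spam=looks_like_spam):
    """Exit status for one message: 1 for spam, 0 for non-spam (assumed convention)."""
    return 1 if is_spam(message_from_string(raw_message)) else 0
```

            A stdin mode would then be a one-line wrapper, for example sys.exit(exit_code_for(sys.stdin.read())), and any real filter could be passed in through the is_spam parameter.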



    [»] Re: worthless?
    by jeff covey - May 24th 2003 10:37:21

    I do realize this is short on scientific method; Keith says as much. I didn't see good statistical analysis as its purpose; it just introduces some spam fighting tools and methods to people who may be looking for them, and I hoped others would chime in with information about how they deal with spam and the problems and solutions they've had to evolve to stay ahead of the ever-changing spam wave. I didn't see the article as offering many conclusions itself, but as a springboard to discussion.

    I'd be interested in a more extensive overview of antispam tools in a category review of Topic :: Communications :: Email :: Filters, if anyone feels up to it.

    --
    vs lbh pna ernq guvf, lbh'er n trrx.



      [»] Re: worthless?
      by Eric Kilfoil - Jun 1st 2003 19:29:02


      > I'd be interested in a more extensive
      > overview of antispam tools in a
      > category review of Topic ::
      > Communications :: Email :: Filters, if
      > anyone feels up to it.

      I would like to see this as well. Most people are concerned about things such as false positives and accuracy. Personally, my biggest concern is per-user configurability. I think that these tools can provide me with the level of flexibility that I need to help users fight spam. My problem then becomes scalability. In a large production environment, say upwards of 50k users, i can't afford for a spam filter solution to drop my CPU resources to zero. I can't (literally) afford to add 10 more MX servers because my spam solution hogs all of the resources.

      PERL is nice. It has great flexibility and amazing text processing power. Unfortunately, it is slow. I would really love to see an open source spam fighting solution written in a compiled language to help improve scalability. Perhaps spastic can provide that to me.

      The nicest feature that I can see about SpamAssassin is that I can provide a web interface to my users to let them choose how aggressively they want their spam to be filtered. Then if they complain about a false positive, I can just tell them to decrease the aggressiveness of the filter.

      I liked the article quite a bit. I would really have liked to see information about MANY spam solutions rather than just these two. Brightmail is a decent commercial offering. Fortinet makes anti-spam hardware based firewalls, and there are tons of others.



        [»] Re: worthless?
        by slippery - Jun 4th 2003 04:54:09

        > Personally, my biggest concern is
        > per-user configurability. I think that
        > these tools can provide me with the
        > level of flexibility that I need to help
        > users fight spam. My problem then
        > becomes scalability. In a large
        > production environment, say upwards of
        > 50k users, i can't afford for a spam
        > filter solution to drop my CPU resources
        > to zero. I can't (literally) afford to
        > add 10 more MX servers because my spam
        > solution hogs all of the resources.

        Wow, 50,000 users puts you in a category far above most environments. Any global solution would have to be very fast and would likely span many incoming mail servers.

        > PERL is nice. It has great flexibility
        > and amazing text processing power.
        > Unfortunately, it is slow. I would
        > really love to see an open source spam
        > fighting solution written in a compiled
        > language to help improve scalability.
        > Perhaps spastic can provide that to me.

        Spastic is not compiled, per se; it uses native procmail commands and shells out to grep for regexps, so I don't think it would scale to the level you need.

        > I liked the article quite a bit. I
        > would really have liked to see
        > information about MANY spam solutions
        > rather than just these two. Brightmail
        > is a decent commercial offering.
        > Fortinet makes anti-spam hardware based
        > firewalls, and there are tons of others.

        A complete examination of ALL the spam programs, commercial, open source, and hardware solutions would be daunting. There are probably 50-100 open source solutions alone. The most recent version of Imail Server 8.0 from Ipswitch (which is used at one of my clients with 500 users) includes a decent anti-spam filter. Even dedicated testing labs like those at ZDnet/Cnet usually limit their testing to 8-10 products at a time.

        I still think that current anti-spam solutions are more bandaids than cures. Until e-mail is metered like snail mail, the economics of spam will keep spammers in business. And I'm not sure I want e-mail metered.

        Best Regards, Keith



        [»] Re: worthless?
        by A'rpi/ESP-team - Jun 7th 2003 05:44:13


        >
        > PERL is nice. It has great flexibility
        > and amazing text processing power.
        > Unfortunately, it is slow. I would
        > really love to see an open source spam
        > fighting solution written in a compiled
        > language to help improve scalability.

        I started a project, spamassassin-c, a rewrite of the SpamAssassin engine in plain hand-optimized C/asm code, using libpcre for regexp matching, with a precompiled regexp ruleset compiled into the binary.
        It was around 20 times faster than SA, with somewhat limited features (I had to leave some complicated regexps out, as libpcre couldn't handle them, and SA also has some rules implemented as perl code). Finally it turned out that perl SA is slow because it has to re-compile regexps and do the whole perl startup for every mail. They also noticed this, and created a client-server approach, i.e. spamd+spamc. So the rule matching code runs in spamd, with precompiled regexps and hashed searches initialized at startup, resulting in 10 times faster performance. So my stripped-down C version was only 2 times faster than spamd. I guess if I implemented all the SA features, it wouldn't be more than 20-30% faster, so it simply isn't worth it. I've stopped my project.
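
        The compile-once idea behind the client-server approach described above can be sketched roughly as follows. This is an illustration in Python, not SpamAssassin's actual code: the rule names, patterns and scores are hypothetical, and only the 5.0 threshold comes from the article's description of the default settings.

```python
import re

# Hypothetical rules as (name, pattern, score); not taken from SpamAssassin.
RULES = [
    ("FREE_MONEY", r"\bfree\s+money\b", 3.0),
    ("CLICK_HERE", r"\bclick\s+here\b", 2.5),
    ("UNSUBSCRIBE", r"\bunsubscribe\b", 0.5),
]

class RuleEngine:
    """Long-lived scorer: patterns are compiled once, then reused per message."""

    def __init__(self, rules, threshold=5.0):
        self.threshold = threshold
        # Compilation happens here, at startup, not once per message.
        self.compiled = [(name, re.compile(pattern, re.IGNORECASE), score)
                         for name, pattern, score in rules]

    def score(self, text):
        """Sum the scores of every rule whose pattern matches the text."""
        return sum(score for _, rx, score in self.compiled if rx.search(text))

    def is_spam(self, text):
        return self.score(text) >= self.threshold

# One engine serves many messages, much like a daemon serving many clients.
engine = RuleEngine(RULES)
for body in ["Get FREE money, click here", "lunch at noon?"]:
    print(body, engine.score(body), engine.is_spam(body))
```

        Python's re module keeps its own small cache of compiled patterns, so this shows the structure of the approach rather than a measurable speedup in this particular language.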

        A'rpi



        [»] Re: worthless?
        by rmemmons - Aug 10th 2003 10:12:59


        > My problem then
        > becomes scalability. In a large
        > production environment, say upwards of
        > 50k users, i can't afford for a spam
        > filter solution to drop my CPU resources
        > to zero. I can't (literally) afford to
        > add 10 more MX servers because my spam
        > solution hogs all of the resources.

        I understand the need for scalability, and faster code would be great. I would however challenge the notion that you can't afford 10 more MX servers. I think that it's more like "management does not want to pay for 10 more MX servers."

        I don't know your costs, but if you just take some simple numbers regarding the true man-hour costs of spam on your end users you'll get into the millions, or 10's of millions, of dollars a year in lost man hours. For example, 10 emails a day at 10 seconds an email across 50000 users comes to 10.1 hours a year per employee, which translates to 75 million. This is just an example--and it is huge.
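
        The arithmetic behind that example can be checked with a short sketch. The comment does not state an hourly rate or how many days per year were counted. Counting 365 days is an assumption, made because it reproduces the 10.1 hours, and the hourly rate below is simply back-calculated from the 75 million figure.

```python
def hours_lost_per_user(emails_per_day, seconds_per_email, days_per_year=365):
    """Hours per year one user spends on these emails (365 days is assumed)."""
    return emails_per_day * seconds_per_email * days_per_year / 3600.0

def yearly_cost(users, hours_per_user, hourly_rate):
    """Total yearly cost of the lost time at a given hourly rate."""
    return users * hours_per_user * hourly_rate

per_user = hours_lost_per_user(10, 10)      # figures from the comment
total_hours = 50000 * per_user              # 50000 users, from the comment
implied_rate = 75_000_000 / total_hours     # rate needed to reach 75 million

print(round(per_user, 1), round(total_hours), round(implied_rate))
```

        Under these assumptions the 75 million figure corresponds to roughly 148 per hour of lost time; with a lower hourly rate the total shrinks in proportion.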

        If this is true, you have the money, your organization just lacks the will.

        I bring this up because I work for a company of similar size, and I constantly see IT saying "too expensive"... but at the same time being happy to push much larger costs onto the budgets of others due to inaction or stupid policies. I don't know if your org is like that, but mine certainly is.

        Rob



          [»] Re: worthless?
          by Eric Kilfoil - Aug 28th 2003 16:58:04


          > I understand the need for scalability
          > and faster code would be great. I would
          > however challenge the notion that you
          > can't afford 10 more MX servers. I
          > think that it's more like "management
          > does not want to pay for 10 more MX
          > servers."


          I suppose we're getting off topic, but here goes anyway.

          Yes it's management, but even I agree with them (for a change). We're talking about serving 50k customers, not 50k employees. The cost of providing the service outweighs the benefits gained from the cost. As we've all seen in the telecom bust, providing services at a lower cost than what you pay is a bad idea :). funny that the engineers realized it and marketing didn't... oh well. Anyway, it's basically an ROI decision.





© Copyright 2007 SourceForge, Inc., All Rights Reserved.