Spam is a growing problem for email users, and many solutions have been proposed, from a postage fee for email to Turing tests to simply not accepting email from people you don't know. Spam filtering is one way to reduce the impact of the problem on the individual user (though it does nothing to reduce the effect of the network traffic generated by spam). In its simplest form, a spam filter is a mechanism for classifying a message as either spam or not spam.

Copyright notice: All reader-contributed material on freshmeat.net is the property and responsibility of its author; for reprint rights, please contact the author directly.

There are many techniques for classifying a message. It can be examined for "spam-markers" such as common spam subjects, known spammer addresses, known mail forwarding machines, or simply common spam phrases. The header and/or the body can be examined for these markers. Another method is to classify all messages not from known addresses as spam. Another is to compare with messages that others have received, and find common spam messages. And another technique, probably the most popular at the moment, is to apply machine learning techniques in an email classifier.

Bayesian Filtering

Paul Graham kicked off a flood of mail filters implementing Bayesian filtering with his "A Plan for Spam" article in August 2002, though it was far from a new concept. In fact, ifile has used a Naive Bayes classification algorithm since August 1996 to automatically file mail into folders. In academic circles, Bayesian methods have been used in text classification for many years, and for spam detection prior to Graham, as evidenced by the 1998 workshop paper A Bayesian Approach to Filtering Junk E-Mail by Sahami, et al.

In a nutshell, the approach is to tokenize a large corpus of spam and a large corpus of non-spam.
Certain tokens will be common in spam messages and uncommon in non-spam messages, and certain other tokens will be common in non-spam messages and uncommon in spam messages. When a message is to be classified, we tokenize it and see whether the tokens are more like those of a spam message or those of a non-spam message. How we determine this similarity is what the math is all about. It isn't complicated, but it has a number of variations. There's a lot more to it than that (Bayesian methods are used a lot in the AI field, for example, in machine learning and user modelling), but that's all we need to know.

Some Spam Filters

In order to compare some spam filters, a number of filters had to be selected from the large list that is the Freshmeat Topic :: Communications :: Email :: Filters category. The selection was restricted by only considering free software and only filters that didn't use network resources in their classification. The filters were further restricted to those that could be executed as standalone programs, read a message from standard input, and indicate via their output or their exit value whether it was spam or not. Several satisfying the restrictions were downloaded, and a few of those removed due to problems with installation or execution. In the end, seven filters were used, five of which were Bayesian.

The version of each filter that was available for download on the Third of July 2003 was used. This was done because, though the email was filtered in bulk in August, the actual email was received during July; it should be used with July's versions of the programs. The filters are:
The Email Data

The email used in the testing consisted of my email from the month of July 2003. The mail consisted of 1,273 messages, of which 1,073 were spam. For the Bayesian filters, a training set of 68 spam messages and 68 non-spam messages was used (my email from the second half of June, with a random sample of spam messages from the same period). The messages used were all hand-classified as spam or non-spam.

Methodology

Each program was installed according to its documentation. For the filters that required training, the training set data was supplied. Each filter was then taken in turn and executed once for each email in the spam and non-spam sets, and the classification it gave was recorded. Default options were used for the filters in all cases. The aim was to examine the filtering abilities of the packages. Hence, whitelists were not used, even though, in practice, they probably would be. Some analysis was done to see how much performance would be improved by whitelists.

Results

The standard metrics for text classification are recall and precision. For spam filtering, we are trying to correctly classify spam messages as spam and not incorrectly classify non-spam messages as spam. Spam classified as non-spam is known as a false negative. Non-spam classified as spam is known as a false positive.

Precision is the percentage of messages that were classified as spam that actually are spam. High precision is essential to prevent the messages we want to read being classified as spam. A low precision indicates that there are many false positives.

Recall is the percentage of actual spam messages that were classified as spam messages. High recall is necessary in order to prevent our inbox filling with spam. A low recall indicates that there are many false negatives.

False positives are generally considered far worse than false negatives. Viewing a spam is better than not getting an important message.
Hence, precision is a more important measure than recall, though, of course, a low recall makes a filter useless.

Experiment One

For the first test of the filters, the 68 spam and 68 non-spam training messages were used to train the filters that required training. Then, the set of 1,273 messages was classified by each of the filters, the results of which are shown in Table 1:
SpamAssassin is the only filter that has a recall rate worth using. I think it's reasonably clear that the Bayesian filters did not have large enough training sets, and hence are only achieving low recall rates.

Experiment Two

For the second test, the training data consisted of the original 68 spam and 68 non-spam training messages, plus the first 100 non-spam messages and the first 500 spam messages of the email data. All the filters were run on the remaining email data, 100 non-spam messages and 573 spam messages, producing the results shown in Table 2:
Those results are more along the lines of how Bayesian filters are expected to perform. Quick Spam Filter and Bogofilter have noticeably lower recall than the other Bayesian filters, and Quick Spam Filter's precision is too low to be useful.

SpamAssassin is now showing a significantly lower recall rate than most of the Bayesian filters. It should be noted that, in practice, SpamAssassin will likely use a few more metrics (using network resources), and hence should do a little better than these results indicate. Also, SpamAssassin has a Bayesian classifier built in, but it wasn't used in these tests, since having five was enough.

That SpamAssassin is not better than the bulk of the other filters is a good sign for email filtering. Bayesian filters are reasonably easy to implement and require no knowledge of what differentiates spam from other email. SpamAssassin's rules, on the other hand, need to be developed by people and probably account for most of the work in creating the software.

SPASTIC has both significantly lower precision and recall than the other filters. Since people actually do use it to filter mail, it must be suitable for some email profiles, but for my email, it isn't usable.

Examining the False Positives

SpamProbe and SpamAssassin both generated one false positive, and it was caused by the same message. That message was essentially an advertisement for a conference, and many people would classify it as spam. However, I attended the previous conference, and I don't mind this showing up in my inbox. It has a number of spam-like properties. "HTML only" is a big one. It is also generically addressed ("Dear Friends"). The From: address looks like it might be auto-generated due to some digits (icce2003@...). Basically, it's spam that I didn't mind receiving. The address it's from could easily be entered into a whitelist to solve the problem, but it could also be argued that it should be classified as spam.
I actually didn't read it when it turned up in my inbox in real life (I don't bother with HTML-only email), though it did remind me of the conference.

Bayesian Mail Filter also misclassified the message discussed above, as well as a message from my Web hosting provider announcing a server move and a little resulting downtime. Clearly, that is a message I want to receive. However, it was sent from the email address of my hosting provider, an address from which I expect to receive mail I want and which could easily be entered into a whitelist. In fact, it's the type of address that should be put on a whitelist, since valid commercial messages look a lot like unsolicited commercial messages.

dbacl gave four false positives, one of which was the conference advertisement mentioned above. Another was a message detailing administrative responsibilities of staff. It was from someone who doesn't send spam, and that address could easily be added to a whitelist. It also flagged a forwarded IBM PhD Program nomination advertisement. This is another message that is essentially spam, but it was intentionally sent to a list I am on by a staff member. Again, a whitelist would catch this. The final false positive was a second copy of the IBM PhD Program email, this time forwarded by someone else to another list I am on.

Quick Spam Filter produced 23 false positives. These included the conference announcement and the hosting provider announcement mentioned above. A dozen or so newsletters were flagged as spam, as were a few commercial messages that were not unsolicited and a couple of messages from my wife. Whitelists can solve these problems quickly and easily. The false positives that are not easily fixed are the problem, so I'll focus on those. An email bounce notice was flagged as spam. A whitelist can't solve this without a fair amount of effort, since the address is determined by the machine on which I happen to run the "netfile" command.
A message requesting I contact a person about something which "needs urgent attention" was flagged as spam. This is what spam filtering nightmares are made of, especially when the email originates from an Associate Dean. Whitelists don't help, since Associate Deans change and I had never heard of this person before I received this message. The reply to my reply to this message was also flagged as spam. Four seminar announcements were flagged as spam. Since the sender is often different, a whitelist won't fix this.

SPASTIC produced 30 false positives. The vast majority of these were newsletters, solicited commercial messages, and "calendar" reminder messages (which have no subjects), all of which cause problems easily solved by a whitelist. SPASTIC also flagged an important message as spam, this time from my supervisor with the subject "URGENT". Putting my supervisor in a whitelist is reasonable, I guess, but this highlights the problem with SPASTIC's method of tagging a message as spam if any single test for spam succeeds. This particular message was not spam-like in any way, except for the subject. Two more messages were tagged as spam which were not spam, but not from people I would put on a whitelist, since I wouldn't expect email from them.

So, allowing for whitelists, we generate the false positives shown in Table 3:
Experiment Three

For the third test, the 1,273 pieces of July's mail were used as the training set. The testing set was the first week of August's mail: 252 mails, 210 of which were spam. The results are shown in Table 4. The low SpamAssassin and SPASTIC recalls indicate that my spam was quite different from what they expect spam to look like.
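The precision and recall figures quoted throughout can be computed directly from a filter's raw counts. Here is a minimal sketch in Python, using the size of the Experiment Three test set (252 mails, 210 of them spam). The error counts in the example are made up for illustration and are not results from these tests.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) as fractions between 0 and 1.

    true_pos:  spam correctly classified as spam
    false_pos: non-spam wrongly classified as spam
    false_neg: spam wrongly classified as non-spam
    """
    flagged = true_pos + false_pos      # everything the filter called spam
    actual_spam = true_pos + false_neg  # everything that really was spam
    # By convention here, an empty denominator counts as a perfect score.
    precision = true_pos / flagged if flagged else 1.0
    recall = true_pos / actual_spam if actual_spam else 1.0
    return precision, recall


# Hypothetical filter on a 252-message test set containing 210 spam:
# it catches 200 spam, misses 10, and flags 1 of the 42 non-spam messages.
precision, recall = precision_recall(true_pos=200, false_pos=1, false_neg=10)
print(f"precision = {precision:.2%}, recall = {recall:.2%}")
```

A single false positive barely moves the precision figure when there is this much spam, which is one reason the later experiments weight false positives explicitly.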
Experiment Four

For the fourth experiment, the 200 non-spam messages from July's mail were combined with 200 spam messages randomly selected from July's mail to make the training set. The testing set was the same as in the previous experiment. SpamAssassin and SPASTIC were not retested: since they don't use the training data, they would have the same results as in Table 4.
The results in Table 5 show that all the Bayesian filters do worse than they did in Experiment Three, so a training set with a large amount of spam is better than a smaller, balanced training set. This conflicts with the documentation for sa-learn, SpamAssassin's Bayesian classifier (not used in these tests), which says, "You should aim to train with at least the same amount (or more if possible!) of ham data [as] spam."

Experiment Five

None of the previous experiments has been very scientific; they have merely indicated how the various filters performed on various data sets. In order to produce some numbers with which it may be possible to objectively compare the filters, we will follow the methodology used in a technical report by Androutsopoulos, et al.: Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach.

The data set used was all my email from the month of July. This was partitioned randomly into ten equally-sized sets, each containing 107 spam messages and 20 non-spam messages. Three spam messages were left over and were discarded. Each of the ten sets was used in turn as the test set, with the other nine sets combined to make up the training set. Hence, each filter was run ten times. The average precision and recall of the filters over those ten tests is shown in Table 6:
For our objective analysis, we will use the metrics defined in the technical report linked above. Some measure of the relative cost of false positives to false negatives is needed in order to do this. Androutsopoulos, et al. suggest using a measure in which each non-spam is treated as equivalent to a number of spam messages. That number can be tweaked to represent just how bad false positives are to the user. We'll call this weight FPW (false positive weight). The variables we will define are:
The Weighted Accuracy of the filter is then defined as:
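As a sketch, assuming the definition matches the one in the cited technical report (with FPW playing the role of the report's weight), Weighted Accuracy can be written as below. The symbols are labels chosen here: N_L and N_S are the numbers of legitimate (non-spam) and spam messages tested, and n_{L→L} and n_{S→S} are the numbers of each that were classified correctly.

```latex
\[
\mathrm{WAcc} =
  \frac{\mathrm{FPW} \cdot n_{L \to L} + n_{S \to S}}
       {\mathrm{FPW} \cdot N_{L} + N_{S}}
\]
```

With FPW = 1 this reduces to ordinary accuracy; larger values make each non-spam message count as FPW messages, so a single false positive costs as much as FPW missed spam messages.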
The Total Cost Ratio (please see the technical report for the justification of this metric) is then defined as:

Tables 7, 8, 9, and 10 show the results for four values of FPW. The Weighted Accuracy and Total Cost Ratio were calculated by summing all the variables across all ten runs, and not by calculating them ten times, then averaging. Doing this avoids infinite Total Cost Ratio scores, which would otherwise occur whenever a filter made no mistakes on a run.
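To make the pooling concrete, here is a minimal Python sketch. It assumes the definitions from the cited technical report: Weighted Accuracy is (FPW × correctly classified non-spam + correctly classified spam) / (FPW × all non-spam + all spam), and Total Cost Ratio is (all spam) / (FPW × false positives + false negatives). The `Run` record, the function name, and the example error counts are hypothetical, not figures from these experiments.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Counts from one test run (field names are illustrative)."""
    n_ham: int       # non-spam messages tested
    n_spam: int      # spam messages tested
    false_pos: int   # non-spam classified as spam
    false_neg: int   # spam classified as non-spam


def pooled_metrics(runs, fpw):
    """Sum the counts over all runs, then compute WAcc and TCR once."""
    n_ham = sum(r.n_ham for r in runs)
    n_spam = sum(r.n_spam for r in runs)
    false_pos = sum(r.false_pos for r in runs)
    false_neg = sum(r.false_neg for r in runs)

    correct_ham = n_ham - false_pos
    correct_spam = n_spam - false_neg
    wacc = (fpw * correct_ham + correct_spam) / (fpw * n_ham + n_spam)

    weighted_errors = fpw * false_pos + false_neg
    # Infinite only if the filter made no mistakes in any of the runs.
    tcr = n_spam / weighted_errors if weighted_errors else float("inf")
    return wacc, tcr


# Ten folds of 20 non-spam and 107 spam each, as in Experiment Five.
# The error counts are made up. The first run is error-free, which would
# give an infinite TCR for that run if the runs were scored separately.
runs = [Run(20, 107, 0, 0)] + [Run(20, 107, 0, 3) for _ in range(9)]
for fpw in (1, 9, 99, 999):
    wacc, tcr = pooled_metrics(runs, fpw)
    print(f"FPW={fpw}: WAcc={wacc:.4f}, TCR={tcr:.2f}")
```

Because this hypothetical filter never produces a false positive, its Total Cost Ratio is the same at every weight; a filter with even one false positive would see its ratio fall quickly as FPW grows.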
If the Total Cost Ratio is greater than 1, the filter is worth using if the False Positive Weight is an accurate representation of the relative costs of errors.

A False Positive Weight of 1 is only realistic for the case in which email is being marked by the filter, but still placed in your inbox for manual removal. If that is how you plan to use a filter, SpamProbe or Bayesian Mail Filter are the best options, according to Table 7.

A False Positive Weight of 9 might be appropriate if you are filtering spam messages to a folder which you check every day. In that case, Bogofilter, SpamProbe, and Bayesian Mail Filter all look reasonable, according to Table 8.

A False Positive Weight of 99 might be an accurate representation for someone who checks the spam folder each week for false positives. In this case, Bogofilter and SpamAssassin are the most worthwhile filters.

A False Positive Weight of 999 would represent a set-and-forget spam filter which sends spam to the bit bucket. In this case, Bogofilter is the only option, and it isn't any better than no filter.

Personally, I check my spam folder a few times each day. It only takes a second to glance at the new subjects and check the sender for the subjects that look like they might not be spam. So, for me, a False Positive Weight of 9 is appropriate.

The graph below gives an indication of how the filters compare at a range of False Positive Weights:
[Graph: comparison of the filters across a range of False Positive Weights]

It's important to note that the Total Cost Ratio isn't a perfect metric. It scores classifying a forwarded joke from an annoying coworker as spam exactly the same as classifying an urgent message from your boss or partner as spam.

Conclusion

The Bayesian filters, after training, offer better recall than the two heuristic filters. Catching a higher proportion of spam is clearly good, since that is the reason people use them. With insufficient training, however, the Bayesian filters perform poorly in comparison with SpamAssassin in terms of recall.

Based upon the results for my email, SpamProbe and Bayesian Mail Filter have usable recall percentages and acceptable precision. Four spam messages a week is much more bearable than 210, and well worth the minor effort involved in setting up one of these filters. If false positives are especially bad to you, Bogofilter is the best choice, according to my email.

SPASTIC is useless for my email, since it lets through far too much spam and marks some legitimate messages as spam messages. SpamAssassin is better; it lets through more spam than the Bayesian filters, but has enough precision to at least not hide wanted email.

Quick Spam Filter performs poorly when compared with the other Bayesian filters. I suspect it will improve in future versions, since clearly the underlying mechanism (Bayesian filtering) isn't the problem.

dbacl is similar to SpamAssassin in performance. However, it should be noted that dbacl can classify into multiple folders, not just spam and non-spam. This extra functionality may cause its performance to be less than that of the other Bayesian filters, but if you use that functionality, the tradeoff might be worthwhile.

Recommendations

If you want to filter spam out of your email, I strongly suggest not automatically deleting messages. File the spam away, just in case you get false positives. Any spam which isn't picked up by your filters should be manually moved to the spam folder, not deleted.
The same is true for your real email; instead of deleting it, move it to another folder. That way, you'll build a collection of spam and non-spam messages, which will come in handy for training filters.

Start by filtering with SpamAssassin. The Bayesian filters don't work well if you don't train them, and you can't train them without having a collection of your past email (both spam messages and non-spam messages). A non-learning filter makes it easy to build this collection.

Watch for false positives. You really do need to scan the spam folder every so often to check for items that shouldn't have been flagged as spam, especially if you ever move to a learning filter. Otherwise, it will learn that some valid messages are spam messages.

If your filter supports whitelists (if not, you can always add a whitelist to a chain of filters), use them. If friends' email gets flagged as spam, add them to the whitelist. It will save you time and lost messages in the end. If you can find the inclination, adding people to your whitelist preemptively should help avoid false positives.

Once you have enough spam messages and non-spam messages correctly classified, you can think about using a Bayesian filter. You really want a few hundred of each type, preferably more. You also want to make sure there isn't an unintended identifying feature of the spam messages or non-spam messages. For example, don't use non-spam messages from the past 6 months and only the last month of spam messages; the learning algorithm might decide that messages with old dates are non-spam messages and messages with new dates are spam messages. Don't try to pad the numbers with duplicates; it will overtrain the filter on the features in those messages.

Moving to a learning filter is a good thing, since keeping up-to-date with the latest rules isn't necessary. The learning algorithm won't get worse with time, since it will learn the ever-changing look of spam.
(At least until spammers make their spam look very much like non-spam messages.)

Once you are using a learning filter, you must remember to train it every so often. If you don't, the performance will deteriorate as your email usage changes. Of course, deteriorating performance is a great reminder to do some training. Training will be easy, since you will have a nice collection of classified spam messages and non-spam messages, and you will have corrected by hand any misclassifications the filter makes. Don't just blindly feed the filter's own classifications back in as training data; it will reinforce any mistakes. Another option is to train it only on the messages it misclassified (its false positives and false negatives), to correct the mistakes.

Try spam filtering. It puts the joy back into email.

Author's bio: Sam Holden is a seemingly eternal student who is expecting his first child in mere weeks, and hence will actually be finishing university and getting a "real job" Real Soon Now.
[»]
Excellent Article I know I'm very late in here, but just wanted to say well done on an excellent article Sam! Even though your article is now very old it still has a lot of relevance today! --
[»]
Banning foreign sites - No SPAM now At my little community-oriented Internet site, I have banned almost all (the ones I could find) foreign IP addresses from Europe, Asia, South America, etc. I was getting 3 SPAMs per day to my system accounts, and now I get fewer than 3 SPAMs per month. I also have far fewer assaults via the network on my web server, email server, etc. This solution won't fit many of your situations, but it would fit some. Do you really care if someone in China cannot send you e-mail or browse your website? I'm sure you wouldn't care if someone in Europe could not remotely admin your Internet site. You may send email directly to me if you would like my list of foreign IPs or a detailed description of how I do this. No marketing, just my free tech info. Kevin --
[»]
Antispam relay! Thanks for this article, it helped me too. I prefer SpamAssassin+ClamAV+mailfilter. I configured an SMTP relay server, which kills 99% of spam. Other companies sell analog hardware spam filters for $9K, but we can get it almost for free!
[»]
Checked and marked very useful. Thanks for that great article. I work at a small webhosting company; your text helped answer most of the major questions about implementing a spam filter for our mail servers. I hope you will write more articles like this one in the future. --
[»]
Bookmarked Thank you so much for this much needed article! Spam :Grr: --
[»]
Why not just Stop Spam on the servers - period. Hi All, --
[»]
Re: Not that easy It ain't that easy; there are many messages and addresses that are NOT SPAM that could be trapped by those rules. I have, for example, an old e-mail address (since 1997) that has been used by some infected servers to send spam, but it wasn't in fact me or my PC at all. A blacklist server would classify my e-mail as spam, and the address would get banned, even though I never sent one of those e-mails. This spam crap is getting very annoying. Every e-mail should be signed by default to track spammers.
[»]
Re: Why not just Stop Spam on the servers - period. Syed, Interesting point and this would be much easier to add with a Virtual Private Server --
[»]
Re: Why not just Stop Spam on the servers - period. I just now stumbled across this great article. Syed, I wanted to respond to your post about RSS feed. I'm using an outsourced spam/virus filter: Sentinare PostGuard, which uses both Bayesian filtering and SpamAssassin as well as 'greylisting' and 'tarpitting', a multi-pronged approach that really works great. But the main feature is the web-based quarantine for training the filter. Really quite impressive, and as you mentioned, they have an RSS feed of your quarantined items so you don't have to log in to the quarantine to check for any false positives. Even though the accuracy is like 99.87% for me, it's nice to have the RSS feed anyway.. and add in IMAP and TLS support, geeez. The best. Sentinare knows email!

> Hi All,
>
> With all the Spam as well as security scare around emails and the need for an anti-virus software to protect from those Spam as well as worms, trojans, and viruses that spread thru email, why not simply add a simple XML based "RSS feed" like feature on top of the email server.
>
> A user then simply selects their server as an RSS Feed Source that they subscribe to.
>
> A slightly modified "email client" with XML parser could then be built on top of an RSS Feed Reader / Aggregator that would stop emails and Spams from being downloaded and pushed down to clients like in POP3 but users will get only the email subject and headers like "news-feeds" so they could decide and select which ones they need to read (download from the server) and the rest can be nuked straight at the server ?
>
> I think - once most spams can be stopped this way in their tracks on the servers - and end-users (many now-a-days PC Neophytes) stop helping in its propagation over the net - the tidal wave of spam can be greatly reduced and managed.
>
> Sounds like quite a simple solution isn't it ? Can any one see anything wrong with this ?
>
> - Syed Ahmad
[»]
Spamassassin While this is an excellent article about spam filtering software, it should be noted that SpamAssassin is not just a spam filter. It is a frontend to a number of different filters, RBL checks, and other tests for spam. For example, SpamAssassin can be configured to check the Razor, Pyzor, and DCC content-based blacklists for the message; it can also run other spam filters as part of the filtering process. It is very easy to extend using Perl scripting. There are a number of additional checks being created by users for SpamAssassin which keep it up to date with modern spamming techniques, e.g. Rules Du Jour and Rules Emporium. Also, it is possible to configure SpamAssassin to use any of the filters mentioned in the article (and most others) as part of the filtering process. I personally recommend dspam or crm114. SpamAssassin also integrates very well with MTAs, and can run in a high performance daemon mode. --
[»]
Re: Spamassassin Agreed, I also recommend dspam.
[»]
Re: Spamassassin
Yes, this was an excellent article about spam filtering software. Along
with SpamAssassin and a module plugin sid
you can have the ultimate spam protection. --
[»]
Re: Spamassassin
--
[»]
Bayesian observations and applications 1. I noticed a steady drop-off in the performance of my bogofilter after a
period of excellent performance.
[»]
Wow, great article. That is just an amazing amount of information. I've been reading up a lot about spam lately; it is becoming more and more of an issue for everyone. Thanks again for a great article. --
[»]
Re: Wow, great article. Has anyone tried Lockspam from www.polesoft.com? It will work with all POP3 mail clients and is free to use. You could choose to order a Pro version, but you could still choose to use the Free version for ever.
[»]
Re: Wow, great article. Thanks for your message! We've tested your Lockspam Free and it's really
impressive!
We've recommended it at our www.free-anti-spam.com. Please
have a look there.
Have a good day!
Jerry
[»]
Another way to stop Spam The system I have sends back a challenge to the first email from any source and then if it is replied to correctly adds the source to a whitelist. It seems to work pretty well BozMo --
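The challenge-and-whitelist scheme described in this comment could be sketched roughly as below. This is an illustration in Python, not the commenter's actual system: the class name, the token format, and the in-memory storage are all assumptions, and actually sending the challenge email is left out.

```python
import secrets


class ChallengeResponseGate:
    """Hold mail from unknown senders until they answer a challenge."""

    def __init__(self):
        self.whitelist = set()   # senders who answered correctly
        self.pending = {}        # sender -> (challenge token, held messages)

    def receive(self, sender, message):
        """Return 'deliver' for known senders, else hold and challenge."""
        if sender in self.whitelist:
            return "deliver"
        if sender not in self.pending:
            # First contact: issue a fresh random token to send back.
            self.pending[sender] = (secrets.token_hex(8), [])
        self.pending[sender][1].append(message)
        return "challenge"

    def challenge_for(self, sender):
        """Token that the challenge email to this sender should carry."""
        return self.pending[sender][0]

    def answer(self, sender, token):
        """Whitelist the sender and release held mail if the token matches."""
        entry = self.pending.get(sender)
        if entry is None or entry[0] != token:
            return []
        self.whitelist.add(sender)
        del self.pending[sender]
        return entry[1]


gate = ChallengeResponseGate()
gate.receive("alice@example.org", "hello")            # held, challenge issued
token = gate.challenge_for("alice@example.org")
released = gate.answer("alice@example.org", token)    # releases ["hello"]
print(released, gate.receive("alice@example.org", "second message"))
```

One design point the sketch makes visible: mail from a sender who never answers stays in `pending` forever, so a real system would need some expiry policy.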
[»]
spamd When you find out that you get far too much spam from one or more addresses, you might use spamd. It uses fewer resources, and is more effective than spam filters. I'd call it a spammer filter :-) You can learn more about it from: --
[»]
Better Bayes I think we're overlooking a possibility that would combine the effectiveness of rule-based and Bayes filtering. Words and phrases need not be the only Bayes tokens scored. Any property of a message, i.e. the result of evaluating any rule, can be a Bayes token. POPFile implements this to some degree with "pseudowords", but I would really like to see what would happen if a large ruleset like SpamAssassin's were fully tokenized. Forget about manual scoring -- why not let Bayes figure it out?
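A rough Python sketch of this idea: rule results are emitted as extra tokens alongside the words, and a simplified Naive Bayes style score (add-one smoothing, looking only at tokens present in the message) treats them all the same way. The two rules, the `rule:` token prefix, the use of the first line as the subject, and the tiny training set are invented for illustration; this is neither POPFile's pseudoword mechanism nor SpamAssassin's ruleset, and real filters combine token probabilities in various other ways.

```python
import math
import re
from collections import Counter

# Hypothetical rules: each maps a name to a test on the raw message.
# The first line of the message is treated as its subject.
RULES = {
    "all_caps_subject": lambda msg: msg.split("\n", 1)[0].isupper(),
    "generic_greeting": lambda msg: "dear friend" in msg.lower(),
}


def tokenize(message):
    """Words plus one pseudo-token for every rule that fires."""
    tokens = set(re.findall(r"[a-z0-9$']+", message.lower()))
    tokens.update(f"rule:{name}" for name, test in RULES.items() if test(message))
    return tokens


class NaiveBayes:
    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = Counter()

    def train(self, message, label):
        self.message_counts[label] += 1
        self.token_counts[label].update(tokenize(message))

    def spam_log_odds(self, message):
        """Log-odds that the message is spam; above 0 means spam-like."""
        n_spam = self.message_counts["spam"]
        n_ham = self.message_counts["ham"]
        score = math.log((n_spam + 1) / (n_ham + 1))  # smoothed prior
        for token in tokenize(message):
            # Add-one smoothing over "token present / absent" per class.
            p_spam = (self.token_counts["spam"][token] + 1) / (n_spam + 2)
            p_ham = (self.token_counts["ham"][token] + 1) / (n_ham + 2)
            score += math.log(p_spam / p_ham)
        return score


nb = NaiveBayes()
nb.train("CHEAP PILLS NOW\nDear friend, buy cheap pills", "spam")
nb.train("FREE MONEY\nDear friend, claim your free money", "spam")
nb.train("Seminar on Thursday\nThe seminar is in room 12", "ham")
nb.train("Server move\nYour site will be down for an hour", "ham")
print(nb.spam_log_odds("LAST CHANCE\nDear friend, act today") > 0)
```

In the example, the new message shares almost no words with the training spam, yet both rule tokens fire and push the score toward spam, which is the effect the comment is hoping for.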
[»]
SpamBully - Bayesian spam filter I'm an experienced spam filter user. I used to use SpamInspector,
plus used SpamArrest (online server) to add a challenge email step to one
of my accounts...
[»]
Re: SpamBully - Bayesian spam filter I used SpamBayes, which is free and works well, but I also use Spam Arrest
and I must say it is worth its weight in gold. It stops spam at the source,
unlike some other services that say they do the same thing. I'm going to try
your Spam Bully; we'll see if it does better than the free one.
[»]
Re: SpamBully - Bayesian spam filter Yes, I also like Spam Bully; it works fine for me, and I have tested it on around 50000
messages.
[»]
Virus and virus bounces... I was planning on rerunning the tests with more data last month (when I
would have more mail). However, I've been flooded with what I think are
virus, virus bounce, "we detected a virus" notices, and fake
microsoft patches.
[»]
Quick Spam Filter's accuracy Just thought I'd point out that since this article was written, QSF has got a lot more accurate - version 0.9.0 now comes in about third in the final test above, according to the article's author.
[»]
Details(?) Thanks for the article -- you have taken considerable measures to get a
useful comparison, and except for the omission of SpamAssassin with
learning and for a few wrong conclusions (not important enough to go
into), it's really good.
[»]
Re: Details(?)
[»]
Use multiple filters (Evaluating SpamAssassin without its Bayes feature enabled is probably non-useful. Who would do that? The question is, what slips past SA with Bayes turned on?) I run four spam filters. The combination gets just about everything. First is a short blacklist of repeated spammers. Then SpamBouncer (www.spambouncer.org), SpamAssassin with Bayes enabled, and simple filtering on the last hop before the mail was delivered to my ISP. (90% of all spam is sent from servers that either have no reverse DNS or are identified as a dialup, cable, or DSL line.) Every day, I get some spam that is caught by only one of these filters. Since I got the last three set up, I get one or two false negatives a month: i.e. spam seen as legitimate. I see a few pieces of mail a month that get false positives until whitelisted. Last month I got over 17,000 mail messages, 95% spam, and the percentage is gradually rising every month. The spammers and the filters are in a continual arms race. Whoever commits first loses: you can filter any fixed kind of spam, and you can design messages to evade any fixed filter. But the delay in designing filters will leave some incentive for spammers, and the rising cost of filtering may eventually cause some people to give up. As one of the first creators of an email program I am saddened by its misuse. See www.multicians.org/thvv/mail-history.html for what I remember.
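Running several filters in sequence, as described here, amounts to flagging a message as soon as any stage objects. A minimal Python sketch of such a chain follows, with a whitelist checked first. The stage functions are toy stand-ins (not SpamBouncer or SpamAssassin), and the addresses, field names, and checks are invented for illustration.

```python
def classify(message, whitelist, stages):
    """Return (is_spam, name of the stage that decided).

    message:   dict with at least a 'from' key
    whitelist: set of sender addresses that are never treated as spam
    stages:    ordered list of (name, predicate) pairs; a predicate
               returns True when it considers the message spam
    """
    if message["from"] in whitelist:
        return False, "whitelist"
    for name, looks_like_spam in stages:
        if looks_like_spam(message):
            return True, name
    return False, None


# Toy stages standing in for real filters.
blacklist = {"bulk@spammer.example"}
stages = [
    ("blacklist", lambda m: m["from"] in blacklist),
    ("no_reverse_dns", lambda m: not m.get("last_hop_has_rdns", True)),
    ("keyword", lambda m: "free money" in m.get("body", "").lower()),
]

mail = {"from": "stranger@isp.example", "body": "FREE MONEY inside",
        "last_hop_has_rdns": True}
print(classify(mail, whitelist={"boss@work.example"}, stages=stages))
# (True, 'keyword')
```

Recording which stage caught each message makes it easy to see, as the comment notes, how much spam is caught by only one filter in the chain.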
[»]
Pre-server spam filtering to be reviewed in Network World As an interesting corollary to this article, I have submitted a review of 16 enterprise-sized spam filters to Network World, which will be published in two weeks. It covers their spam filtering performance and speed, as well as a host of other features. I'll come back and post a URL for reference when it is available on the web. I can't share the results before publication, but I can say that we looked at a very different set of products, specifically those which take SMTP in and feed out SMTP, so these would be considered prefilters before an enterprise mail server. Because Network World is aimed at corporate network managers, I didn't directly review any open source products, but several of the commercial products, of course, have open source cores. I also considered a very different set of requirements. For example, in a network with 10,000 users, individual training of filters wouldn't be practical except in a whitelist/blacklist sense. Anyway, if you found this interesting but think that you need a more commercial answer to spam's problems for large numbers of users, then I'd recommend you take a look at that article when it comes out.
[»]
Re: Pre-server spam filtering to be reviewed in Network World The review is now published and can be read at: http://www.nwfusion.com/reviews/2003/0915spam.html Joel Snyder
[»]
Re: Pre-server spam filtering to be reviewed in Network World
[»]
Re: Pre-server spam filtering to be reviewed in Network World
[»]
Re: Pre-server spam filtering to be reviewed in Network World
[»]
suggestion Thank you for an interesting article. I had hoped
[»]
Re: suggestion
[»]
You're asking way too much out of a spam filter comparison I think a lot of you are asking far too much out of a spam filter
comparison article. Yes, I think a feature matrix would be nice
(client-side, server-side, trainable, etc.), but outside of a feature
matrix, any test run on spam filters is going to be specific to the user's
email behavior. There's no tried-and-true way to get effective tests for
any spam tool unless you try it yourself. There are just way too many
variables:
[»]
Training Questions I'm curious to know if the initial training is all that was performed; as
you know, Bayesian filters learn from their mistakes, so I would like to
know if false positives were also put back into the system to be retrained
by any of the tools that supported it. I also would be interested in
different reports based on different training thresholds...while the
minimum threshold for a particular filter might be x, if you train to x2,
what difference it makes. Since most of these tools are in it for the long
haul, it would be very interesting to see how much more effective they
became over different periods of time. Some tools may be very ineffective
at 1000 emails and 100% accurate at 2000 emails. Measuring the ramp-up
cycle would be nice; your graphs do something of the sort, but I don't see
any hard data though.
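One way the ramp-up could be measured is sketched below: classify each message before training on it, and record the running accuracy at chosen message counts. `TinyBayes` is a toy word-count classifier standing in for a real filter (it has no class priors and is not the algorithm of any tool in the article), and `ramp_up` and its `only_on_error` flag are hypothetical names for illustration.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy word-count naive Bayes classifier; a stand-in for a real filter."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, text, label):
        for word in text.lower().split():
            self.counts[label][word] += 1
            self.totals[label] += 1

    def classify(self, text):
        # Vocabulary size (plus one for unseen words) for Laplace smoothing.
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) + 1
        scores = {}
        for label in ("spam", "ham"):
            score = 0.0
            for word in text.lower().split():
                seen = self.counts[label][word]
                score += math.log((seen + 1) / (self.totals[label] + vocab))
            scores[label] = score
        # Ties go to "ham", the safer mistake for a mail filter.
        return "spam" if scores["spam"] > scores["ham"] else "ham"


def ramp_up(stream, checkpoints, only_on_error=False):
    """Classify, then train, on each (text, label) pair in order.

    Returns {messages_seen: running_accuracy} for each checkpoint.
    With only_on_error=True the classifier is trained only on its mistakes.
    """
    clf = TinyBayes()
    correct = 0
    results = {}
    for seen, (text, label) in enumerate(stream, start=1):
        was_right = clf.classify(text) == label
        if was_right:
            correct += 1
        if not (only_on_error and was_right):
            clf.train(text, label)
        if seen in checkpoints:
            results[seen] = correct / seen
    return results
```

Run over the same mail stream with checkpoints at, say, 1000 and 2000 messages, this would give the kind of hard numbers asked for, and the `only_on_error` flag lets the two retraining policies be compared directly.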
[»]
IMAP support? It seems that every spam filter review misses a very important detail -- does it support IMAP and/or SSL? Every spam filter I've tried requires plaintext (as in insecure) POP3 support on the server -- something I'm not willing to use. A spam filter that supports IMAP and can file e-mail into different IMAP folders would be greatly appreciated.
[»]
Re: IMAP support?
[»]
Re: IMAP support? The filters don't do this because it's not the job of the filter. It is the
job of the SMTP/LMTP server, procmail, etc. Since these are server-side filters,
the email server is what makes or breaks the security.
[»]
All spam filters fail in comparison... ... to TMDA. It's not a filter, rather a whitelist/blacklist based
challenge/response system. I installed it on my ISP's mailserver and
customers are bombarding our phones asking us to install it for them
(especially w/ the SoBig.F virus making its rounds).
[»]
Re: All spam filters fail in comparison...
[»]
Re: All spam filters fail in comparison...
[»]
Re: All spam filters fail in comparison...
[»]
Re: All spam filters fail in comparison...
[»]
Re: All spam filters fail in comparison...
[»]
Re: All spam filters fail in comparison...
[»]
Re: All spam filters fail in comparison...
[»]
Re: All spam filters fail in comparison...
[»]
Re: All spam filters fail in comparison...
[»]
Re: All spam filters fail in comparison...
[»]
Re: All spam filters fail in comparison... Yes, but in a world where new customers are contacting you to purchase your
product or service they don't want to be greeted by a challenge and
response. You lose business that way.
[»]
Individual vs. systemwide use? That is a VERY important characteristic of your installation that is not reflected in these tests. Which of these engines is most suitable for an 'all users' installation, as opposed to an individual user having their own Bayes DBs and their own finely tuned filtering?
[»]
Re: Individual vs. systemwide use?
I totally agree. Making a distinction between the two types of installation is CRITICAL to real-world usage. The main reason for this is that in most organisations, site-wide rulesets must be so general as to be almost meaningless: the administration penalty for false positives is so heavy that any attempt to really clamp down is just not worth it. If you don't believe me, try setting up something like SpamAssassin site-wide for 100 users with a default score of 5.0. The result is complete chaos. One man's spam is another man's legitimate marketing message, I'm afraid. Added to this is that most users have at best a limited interest in training Bayesian filters. In about 99% of cases the user's only access to the mail server is via Outlook, which severely limits the options when it comes to training. These tests all assume the user is a command-line-toting geek.
[»]
Re: Individual vs. systemwide use? At present there are a lot of individual spam filters based on Bayesian filtering for Outlook clients. The most popular are Spam Bully, Inboxer, and Outlook Spam Filter. My favorite Outlook spam filter is Spam Reader.
[»]
Re: Individual vs. systemwide use?
[»]
Re: Individual vs. systemwide use?
[»]
Re: Individual vs. systemwide use? Hmmm . . . very confusing. Just kidding. I don't think the Bayesian filters are that bad, though.
[»]
spambayes? May I ask why this fine Python filter wasn't tested?
[»]
popfile Hi, it's a great article with a good analysis, but what about POPFile?
It's one of the best spam filters available for a user.
[»]
Re: popfile I agree POPFile is a great tool, as is its Outlook counterpart, but I see
it more as a mail filing tool than a spam tool. Yes, you can use it
to filter spam, but it doesn't have the real filters that make up an
anti-spam program.
[»]
Thanks Not too long ago, I submitted a question via Ask
[»]
DSPAM Stats If this helps any for your report...
[»]
bogofilter is good enough :) In my experience... The first (and only) filter I have tried so far is
bogofilter.
[»]
Re: bogofilter is good enough :)