Spam is a huge problem, and it’s only getting bigger. Statistical content filtering is capable of impressive rates of correct classification (often above 99%). However, as spam volumes continue to grow, even a 1% error rate can let an unacceptable amount of spam through: at 1,000 spam messages a day, for example, that is still ten in the inbox.
One of the ways to overcome this problem is to combine several techniques. Here I discuss the approach I’ve taken. For me, it’s been very successful.
I run this system on FreeBSD, with postfix and fetchmail doing a lot of the heavy lifting.
Most email is assumed to be from one of two sources. It could be from a white-listed sender, or it could be spam. Those two paths through the filter are meant to be fast. Less common paths are slower (sometimes quite a bit slower), but it’s worth it to avoid incorrectly classifying mail.
I maintain four lists of addresses:
If you’re not familiar with TMDA, there are two things you need to know about it.
First, it can operate as a challenge-response system. It will ask unknown senders to "please confirm your message" as a way of verifying that they are in fact human and do in fact read their email. Unconfirmed messages go into a pending queue while they wait for confirmation or eventual expiration.
Second, it can provide various kinds of tagged addresses. Most important for me are the dated addresses, which carry an HMAC code and a time-stamp. If the code is valid and the address has not expired, mail sent to a dated address goes straight to the inbox.
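To make the dated-address idea concrete, here is a minimal Python sketch. It is not TMDA's code: the `user-dated-<expiry>.<mac>` layout, the secret key, the six-character truncated MAC, and the function names are all assumptions made for illustration.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"example-secret"  # hypothetical key; a real setup would use a generated one
MAC_HEX_CHARS = 6               # assumed truncation length, chosen for illustration


def _mac(expires: int) -> str:
    """HMAC over the expiry time-stamp, truncated to a short hex string."""
    digest = hmac.new(SECRET_KEY, str(expires).encode("ascii"), hashlib.sha1)
    return digest.hexdigest()[:MAC_HEX_CHARS]


def make_dated_address(user, domain, lifetime_seconds, now=None):
    """Build an address that is only valid until now + lifetime_seconds."""
    now = int(time.time()) if now is None else int(now)
    expires = now + lifetime_seconds
    return f"{user}-dated-{expires}.{_mac(expires)}@{domain}"


def is_valid_dated_address(address, now=None):
    """Accept the address only if the MAC matches and it has not expired."""
    now = int(time.time()) if now is None else int(now)
    local, _, _domain = address.partition("@")
    _user, sep, tag = local.rpartition("-dated-")
    if not sep:
        return False
    stamp, dot, mac = tag.partition(".")
    if not dot:
        return False
    try:
        expires = int(stamp)
    except ValueError:
        return False
    # Constant-time comparison avoids leaking how much of the MAC matched.
    mac_ok = hmac.compare_digest(mac.encode("utf-8"), _mac(expires).encode("utf-8"))
    return mac_ok and now <= expires
```

Because the time-stamp is covered by the MAC, a sender cannot extend the life of an address by editing the expiry time; doing so invalidates the code.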
New message processing is as follows:
If a message sits in the TMDA pending queue for more than 15 days, it is assumed to be spam. CRM114/Mailreaver is trained on this message, and the message is removed from the pending queue.
If we receive a bounced confirm request, the message which generated that request is assumed to be spam. CRM114/Mailreaver is trained on the message, and the message is removed from the pending queue.
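The two pending-queue rules above can be sketched as a single selection step. This is only an illustration in Python: the `PendingMessage` record and `select_presumed_spam` function are hypothetical names, and the actual training and removal would be done by CRM114/Mailreaver and TMDA.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

PENDING_LIFETIME = timedelta(days=15)  # the 15-day limit described above


@dataclass
class PendingMessage:
    msg_id: str
    received: datetime
    confirm_bounced: bool = False  # True if our confirm request bounced


def select_presumed_spam(pending, now):
    """Return the pending messages that should be trained as spam and removed.

    A message qualifies if its confirm request bounced, or if it has waited
    in the queue for more than PENDING_LIFETIME without being confirmed.
    """
    return [
        msg
        for msg in pending
        if msg.confirm_bounced or now - msg.received > PENDING_LIFETIME
    ]
```

A nightly job could call `select_presumed_spam(queue, datetime.now())`, feed each result to the spam trainer, and then delete it from the queue.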
All outgoing mail is sent with a hashcash stamp. The other parameters depend on the recipient, as the following table summarizes.
Recipient | dated envelope-sender | dated reply-to | add recipient to white |
Unknown | x | x | x |
Mailing List | x | x | |
white-domain or grey-domain | x | x | |
white or grey | x | | |
The envelope-sender is the address to which software should reply. Any message with a dated envelope-sender will be reply-able for mailing list software, MTAs, and so forth.
The reply-to is the address to which people should reply. Any message with a dated reply-to will be reply-able for anyone with a sane Mail User Agent.
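For readers who have not met hashcash, the stamp attached to outgoing mail is a proof-of-work token: the sender searches for a counter that makes the SHA-1 hash of the stamp begin with a given number of zero bits. The Python sketch below follows the version 1 stamp layout (`1:bits:date:resource:ext:rand:counter`), but it is a simplified illustration, not the hashcash tool itself. The fixed `date` and `rand` defaults are placeholders, and a real verifier would also check the date and reject stamps it has already seen.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(resource, bits, date="070101", rand="c2FsdA"):
    """Search for a counter that gives the stamp `bits` leading zero bits.

    `date` and `rand` are fixed placeholders here; a real minter would use
    the current date and fresh random data.
    """
    prefix = f"1:{bits}:{date}:{resource}::{rand}:"
    for counter in count():
        stamp = f"{prefix}{counter:x}"
        digest = hashlib.sha1(stamp.encode("ascii")).digest()
        if leading_zero_bits(digest) >= bits:
            return stamp


def verify(stamp, resource, min_bits):
    """Check the stamp's format, recipient, claimed value, and actual work."""
    fields = stamp.split(":")
    if len(fields) != 7 or fields[0] != "1" or fields[3] != resource:
        return False
    if not fields[1].isdigit() or int(fields[1]) < min_bits:
        return False
    digest = hashlib.sha1(stamp.encode("utf-8")).digest()
    return leading_zero_bits(digest) >= int(fields[1])
```

Minting costs the sender roughly 2^bits hash attempts on average, while verification costs the recipient a single hash. That asymmetry is the whole point: cheap for one message, expensive for a million.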
There are two magic folders, train/ham and train/spam. If CRM ever gets a message wrong and none of the other filters in the chain slap it on the knuckles, put a copy of the message into the appropriate folder. The cronjob will train it for you.
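As a rough idea of what such a training job does, here is a Python sketch. It is not the script shipped in the tarball. It assumes each folder holds one message per file, that messages are removed once trained, and that the caller supplies a `train` callback (a hypothetical hook that would hand the message to CRM114/Mailreaver).

```python
from pathlib import Path


def train_from_folders(mail_root, train):
    """Train on every message found in train/ham and train/spam.

    `train(label, raw_message)` is a caller-supplied hook; `label` is
    "ham" or "spam". Returns the number of messages processed.
    """
    trained = 0
    for label in ("ham", "spam"):
        folder = Path(mail_root) / "train" / label
        if not folder.is_dir():
            continue
        # Sort for a predictable order; skip anything that is not a file.
        for path in sorted(p for p in folder.iterdir() if p.is_file()):
            train(label, path.read_bytes())
            path.unlink()  # assumption: the copy is discarded after training
            trained += 1
    return trained
```

Since the folders only ever hold copies, deleting after training loses nothing, and an empty folder means the job has nothing to do on the next run.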
mail.tar.gz contains a skeleton set of procmail scripts that should be fairly drop-and-go. You’ll need to find instances of USER and DOMAIN and replace them with your actual email address. There is also one instance of LOCALUSER that needs to be replaced.
You'll need to run tmda-keygen as described in ServerConfiguration in the TMDA docs.
To make outgoing mail processing work, you'll need to configure either tmda-ofmipd or tmda-sendmail.
You'll also need to get CRM114 installed and running; the CRM114 & Mailfilter HOWTO is helpful for this. I've also included a GNU Makefile in mail/crm, which has some useful targets defined.
And lastly, you'll need to configure the tmda-cronjob script to run nightly.
For questions/comments/concerns please contact me. I'd particularly like a note if you're using anything I'm presenting here. I like to know that I'm not just talking to myself. :-)
There's nothing particularly special about CRM114, aside from the fact that it is my current favorite. I have in the past also used bogofilter and dspam in this framework.
For my mail, CRM114/Mailreaver does a better job. I think part of the reason is the excellent training framework used by CRM114/Mailreaver. Old trained messages are cached, and there is a bulk training tool which makes sure all messages that had been previously trained as 'spam' or 'ham' will still be classified as such.
If you are in the filter market, it may be worth reading The Grumpy Editor's Guide to Bayesian Spam Filters and A Grumpy Editor's Bayesian Followup.
There are a considerable number of people who do not like the challenge-response approach to spam filtering.
Before adopting a CR-style system, I'd strongly suggest reading the objections linked above and thinking about them.
In my opinion, all of the objections in the linked articles boil down to three things. First, poorly designed software could do a large number of horrible things with/to your email. Second, you will annoy people. Third, you could send confirm requests based on forged headers (the "joe-job attack").
I do not think it is valid to indict an entire class of software based on assumptions about the types of bugs that would be possible if it were incorrectly implemented.
If people have been annoyed by my confirm requests, I have yet to meet someone who was annoyed enough not to reply. Paradoxically, I have had positive comments about my challenges ("All you have to do is reply, rather than going to some website and typing text from a picture").
The objection about forged headers holds water. It is bad to send email based on forged headers. However, mail transfer agents do that all the time when they bounce a message. Mailing list software does it with subscription confirmations. This is a problem common to all software that sends email.
It's up to the individual user to decide if this risk is worth taking "just" to reduce spam.
Brad Templeton has a list of principles for Challenge/Response anti-spam systems. The system I’ve outlined here complies with his suggestions.
I’m definitely not the first person to adopt a hybrid approach to spam filtering. The CAMRAM system is quite similar to what I am doing (though I was unaware of their work until late December of 2006).
Two Penny Blue combines statistical content analysis (via CRM114) with a reputation-based system and proof-of-work tokens.