Spam Filtering

The Problem

Spam is a huge problem, and it’s only getting bigger. Statistical content filtering is capable of impressive rates of correct classification (quite frequently > 99%). However, as spam volumes continue to grow, even a 1% error rate can let an unacceptable amount of spam through.

One of the ways to overcome this problem is to combine several techniques. Here I discuss the approach I’ve taken. For me, it’s been very successful.

The Tools

Procmail: The Swiss Army chainsaw of email processing.
CRM114: A statistical content classifier. Given a corpus of training messages sorted into ‘ham’ and ‘spam’ buckets, it can classify a new message.
SpamAssassin: A very popular rule-based email filter.
TMDA: A challenge/response and HMAC system.
GnuPG: A system for digitally signing and encrypting email. Think of it like a digital envelope.
hashcash: A system for providing "proof of work". Think of it like a digital stamp.

I run this system on FreeBSD, with postfix and fetchmail doing a lot of the heavy lifting.
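
To give a feel for what "proof of work" means here, the following is a minimal Python sketch of minting and checking a stamp in the style of hashcash version 1, where the SHA-1 hash of the stamp must begin with a given number of zero bits. The function names, the hexadecimal counter, and the caller-supplied date and random fields are assumptions made for illustration. The real hashcash tool also checks expiry and double spending, which this sketch leaves out.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def mint(resource: str, bits: int, date: str, rand: str) -> str:
    """Search for a counter that gives the stamp `bits` leading zero bits.

    Field layout assumed here: ver:bits:date:resource:ext:rand:counter
    """
    prefix = f"1:{bits}:{date}:{resource}::{rand}:"
    for counter in count():
        stamp = f"{prefix}{counter:x}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp


def verify(stamp: str, resource: str, min_bits: int) -> bool:
    """Check the resource, the claimed value, and the actual hash."""
    fields = stamp.split(":")
    if len(fields) != 7 or fields[0] != "1" or not fields[1].isdigit():
        return False
    claimed = int(fields[1])
    if fields[3] != resource or claimed < min_bits:
        return False
    digest = hashlib.sha1(stamp.encode()).digest()
    return leading_zero_bits(digest) >= claimed


if __name__ == "__main__":
    # 12 bits keeps the demo fast; real stamps ask for more work.
    stamp = mint("me@example.org", 12, "070101", "abc")
    print(stamp, verify(stamp, "me@example.org", 12))
```

Minting takes on average 2^bits hash attempts, while verifying takes one. That asymmetry is what makes the stamp cheap for me to check and expensive to forge in bulk.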

The Method

Most email is assumed to be from one of two sources. It could be from a white-listed sender, or it could be spam. Those two paths through the filter are meant to be fast. Less common paths are slower (sometimes quite a bit slower), but it’s worth it to avoid incorrectly classifying mail.

I maintain four lists of addresses:

white: known senders who always send legitimate mail.
grey: known senders who usually send legitimate mail. People who send me forwarded jokes end up here, and my filters are trained to recognize forwarded jokes as spam.
white-domain: domains from which I only receive legitimate mail. (I’m thinking of eliminating this one.)
grey-domain: domains from which I usually receive legitimate mail, but which are sometimes spoofed by spammers.
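
As an illustration, the lookup against these four lists could be sketched in Python as follows. The function name, the in-memory sets, and the order in which the lists are checked (addresses before domains) are assumptions of mine for this sketch; the actual setup does this inside procmail.

```python
from typing import Optional

# Assumed order: address lists are checked before domain lists.
LIST_ORDER = ("white", "grey", "white-domain", "grey-domain")


def match_list(sender: str, lists: dict[str, set[str]]) -> Optional[str]:
    """Return the name of the first list that matches the sender, if any."""
    address = sender.strip().lower()
    domain = address.rpartition("@")[2]
    for name in LIST_ORDER:
        key = domain if name.endswith("-domain") else address
        if key in lists.get(name, set()):
            return name
    return None


if __name__ == "__main__":
    lists = {
        "white": {"friend@example.org"},
        "grey": {"joker@example.org"},
        "grey-domain": {"example.com"},
    }
    print(match_list("Friend@Example.org", lists))
    print(match_list("someone@example.com", lists))
    print(match_list("stranger@example.net", lists))
```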

If you’re not familiar with TMDA, there are two things you need to know about it.

First, it can operate as a challenge-response system. It will ask unknown senders to "please confirm your message" as a way of verifying that they are in fact human and do in fact read their email. Unconfirmed messages go into a pending queue while they wait for confirmation or eventual expiration.

Second, it can provide various kinds of tagged addresses. Most importantly for me are the dated addresses, which carry an HMAC code and a time-stamp. If the code is valid and the address has not expired, mail sent to a dated address goes straight to the inbox.
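
The idea behind a dated address can be sketched in Python as below. This is not TMDA's code: the address layout, the six-character truncated SHA-1 HMAC, and the function names are assumptions chosen to show the principle that only someone holding the secret key can produce an address that validates, and that the address stops working after its time-stamp.

```python
import hashlib
import hmac


def _tag(key: bytes, stamp: str, tag_len: int) -> str:
    """Truncated hex HMAC over the expiry time-stamp."""
    return hmac.new(key, stamp.encode(), hashlib.sha1).hexdigest()[:tag_len]


def make_dated_address(user: str, domain: str, key: bytes,
                       expires_at: int, tag_len: int = 6) -> str:
    """Build user-dated-<expiry>.<hmac>@domain (layout assumed)."""
    stamp = str(expires_at)
    return f"{user}-dated-{stamp}.{_tag(key, stamp, tag_len)}@{domain}"


def check_dated_address(address: str, key: bytes, now: int,
                        tag_len: int = 6) -> bool:
    """Accept only if the HMAC matches and the address has not expired."""
    local = address.rpartition("@")[0]
    _user, sep, tagged = local.rpartition("-dated-")
    if not sep:
        return False
    stamp, dot, mac = tagged.partition(".")
    if not dot or not stamp.isdigit():
        return False
    expected = _tag(key, stamp, tag_len)
    if not hmac.compare_digest(mac.encode(), expected.encode()):
        return False
    return now <= int(stamp)


if __name__ == "__main__":
    key = b"not-a-real-key"
    addr = make_dated_address("me", "example.org", key, expires_at=1_000_000)
    print(addr, check_dated_address(addr, key, now=999_999))
```

Because the time-stamp is covered by the HMAC, a sender cannot extend the life of an address by editing the date.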

Incoming Mail

New message processing is as follows:

1. Classify the message with CRM114/Mailreaver.
2. Is the sender on the white, grey, white-domain, or grey-domain list?
   If so, deliver the message to the inbox.
   If the sender is on white or white-domain and CRM114 thought the message was not ham, train CRM114 to correct that.
3. Was the message sent to a valid TMDA tagged address?
   If so, deliver the message to the inbox.
4. Does CRM114 think that the message is spam?
   If so, deliver it to the spambox.
5. Does SpamAssassin think that the message is spam?
   If so, train CRM114 on this message, then deliver it to the spambox.
6. Does CRM114 think that the message is good?
   If so, deliver it to the inbox.
7. Does the message carry a hashcash stamp?
   If so, deliver it to the inbox.
8. Does the message carry a valid PGP signature?
   If so, deliver it to the inbox.
9. Out of options.
   Send a TMDA confirm request. None of the previous filters can determine whether this is spam or not. All of the steps after SpamAssassin were designed to avoid reaching this point. However, when I do get messages that defy automatic classification, I would much rather send a confirmation request than anything else. I don't want to put spam into my inbox and I don't want legitimate mail to be spamboxed either.
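
The order of the checks above can be summarized as a single decision function. This Python sketch is only a model of the flow: the `Checks` record, its field names, and the three-way CRM114 verdict ("spam", "good", "unsure") are assumptions for illustration, and the real work is done by procmail recipes calling the tools themselves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checks:
    """Results of the individual tests for one message (names hypothetical)."""
    crm: str                       # "spam", "good", or "unsure"
    sender_list: Optional[str]     # "white", "grey", "white-domain", ...
    valid_tagged_address: bool
    spamassassin_spam: bool
    has_hashcash_stamp: bool
    valid_pgp_signature: bool


def route(c: Checks) -> tuple[str, list[str]]:
    """Return (destination, training actions) following the list order."""
    if c.sender_list is not None:
        train = []
        # Only the white lists are trusted enough to correct CRM114.
        if c.sender_list in ("white", "white-domain") and c.crm != "good":
            train.append("ham")
        return "inbox", train
    if c.valid_tagged_address:
        return "inbox", []
    if c.crm == "spam":
        return "spambox", []
    if c.spamassassin_spam:
        return "spambox", ["spam"]
    if c.crm == "good":
        return "inbox", []
    if c.has_hashcash_stamp:
        return "inbox", []
    if c.valid_pgp_signature:
        return "inbox", []
    # Nothing could classify the message: ask the sender to confirm.
    return "confirm", []


if __name__ == "__main__":
    unknown = Checks("unsure", None, False, False, False, False)
    print(route(unknown))
```

Note that the two common cases, a listed sender and a message CRM114 is sure is spam, return after at most three comparisons, which matches the goal of keeping those paths fast.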

If a message sits in the TMDA pending queue for more than 15 days, it is assumed to be spam. CRM114/Mailreaver is trained on this message, and the message is removed from the pending queue.

If we receive a bounced confirm request, the message which generated that request is assumed to be spam. CRM114/Mailreaver is trained on the message, and the message is removed from the pending queue.
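
Both clean-up rules for the pending queue can be modeled in a few lines of Python. The queue representation (a dict from message id to queue time) and the function name are hypothetical; this only illustrates the rule, not how TMDA stores its queue.

```python
from datetime import datetime, timedelta

MAX_PENDING = timedelta(days=15)


def sweep_pending(pending: dict[str, datetime], bounced: set[str],
                  now: datetime) -> list[str]:
    """Remove expired or bounced entries; return ids to train as spam."""
    to_train = [
        msg_id for msg_id, queued_at in pending.items()
        if msg_id in bounced or now - queued_at > MAX_PENDING
    ]
    for msg_id in to_train:
        del pending[msg_id]
    return to_train


if __name__ == "__main__":
    now = datetime(2007, 1, 31)
    pending = {
        "old": datetime(2007, 1, 1),      # older than 15 days
        "bounced": datetime(2007, 1, 30),  # confirm request bounced
        "fresh": datetime(2007, 1, 30),    # still waiting
    }
    print(sweep_pending(pending, {"bounced"}, now), sorted(pending))
```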

Outgoing Mail

All outgoing mail is sent with a hashcash stamp. The following table summarizes the other various parameters.

Recipient	dated envelope-sender	dated reply-to	add recipient to white
Unknown	x	x	x
Mailing List	x	x	-
(white|grey)-domain	x	-	x
white|grey	x	-	-

The envelope-sender is the address to which software should reply. Any message with a dated envelope-sender can be replied to by mailing list software, MTAs, and so forth.

The reply-to is the address to which people should reply. Any message with a dated reply-to can be replied to by anyone with a sane Mail User Agent.
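
Read as data, the table above is a small lookup from recipient class to three flags. The Python sketch below encodes my reading of the table's columns; the class names and the `OutgoingPolicy` type are hypothetical, and in practice the TMDA outgoing filter carries these rules.

```python
from typing import NamedTuple


class OutgoingPolicy(NamedTuple):
    dated_envelope_sender: bool
    dated_reply_to: bool
    add_recipient_to_white: bool


# One entry per table row; every outgoing message also gets a hashcash stamp.
POLICY = {
    "unknown":        OutgoingPolicy(True, True, True),
    "mailing-list":   OutgoingPolicy(True, True, False),
    "listed-domain":  OutgoingPolicy(True, False, True),   # (white|grey)-domain
    "listed-address": OutgoingPolicy(True, False, False),  # white|grey
}


def policy_for(recipient_class: str) -> OutgoingPolicy:
    """Unrecognized classes fall back to the most cautious row, 'unknown'."""
    return POLICY.get(recipient_class, POLICY["unknown"])


if __name__ == "__main__":
    for name in POLICY:
        print(name, policy_for(name))
```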

Training

There are two magic folders, train/ham and train/spam. If CRM ever gets a message wrong and none of the other filters in the chain slap it on the knuckles, put a copy of the message into the appropriate folder. The cronjob will train it for you.
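
What the training job does with those folders can be sketched roughly as follows. This Python version is an illustration only: it assumes each message is a single file directly inside train/ham or train/spam, that messages are deleted once trained, and that the actual call into the classifier is supplied by the caller as a `train` function (hypothetical here).

```python
from pathlib import Path
from typing import Callable


def train_from_folders(root: Path,
                       train: Callable[[str, bytes], None]) -> int:
    """Feed every message in train/ham and train/spam to `train`.

    Returns the number of messages processed. Each file is removed
    after training so it is not trained twice on the next run.
    """
    processed = 0
    for label in ("ham", "spam"):
        folder = root / "train" / label
        if not folder.is_dir():
            continue
        for path in sorted(p for p in folder.iterdir() if p.is_file()):
            train(label, path.read_bytes())
            path.unlink()
            processed += 1
    return processed


if __name__ == "__main__":
    # Stand-in trainer that only reports what it would train.
    def report(label: str, raw: bytes) -> None:
        print(f"would train {len(raw)} bytes as {label}")

    print(train_from_folders(Path("."), report))
```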

The Goods

mail.tar.gz contains a skeleton set of procmail scripts that should be fairly drop-and-go. You’ll need to find instances of USER and DOMAIN and replace them with your actual email address. There is also one instance of LOCALUSER that needs to be replaced.
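
If you would rather script the substitution than do it by hand, something like the following Python sketch may help. It assumes the placeholders appear as whole words, and it matches LOCALUSER as its own token so that a plain search for USER does not clobber it. The function name is hypothetical, and a pattern this simple would also rewrite things like a `$USER` variable reference, so review the result before using it.

```python
import re

PLACEHOLDER = re.compile(r"\b(LOCALUSER|USER|DOMAIN)\b")


def fill_placeholders(text: str, user: str, domain: str,
                      local_user: str) -> str:
    """Replace whole-word USER, DOMAIN, and LOCALUSER placeholders."""
    values = {"LOCALUSER": local_user, "USER": user, "DOMAIN": domain}
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], text)


if __name__ == "__main__":
    sample = "deliver USER@DOMAIN to the mailbox of LOCALUSER"
    print(fill_placeholders(sample, "me", "example.org", "bob"))
```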

You'll need to run tmda-keygen as described in ServerConfiguration in the TMDA docs.

To make outgoing mail processing work, you'll need to configure either tmda-ofmipd or tmda-sendmail.

You'll also need to get CRM114 installed and running; the CRM114 & Mailfilter HOWTO is helpful for this. I've also included a GNU Makefile in mail/crm, which has some useful targets defined.

And lastly, you'll need to configure the tmda-cronjob script to run nightly.

For questions/comments/concerns please contact me. I'd particularly like a note if you're using anything I'm presenting here. I like to know that I'm not just talking to myself. :-)

Notes

On Statistical Content Filters

There's nothing particularly special about CRM114, aside from the fact that it is my current favorite. I have in the past also used bogofilter and dspam in this framework.

For my mail, CRM114/Mailreaver does a better job. I think that part of the reason is the excellent training framework used by CRM114/Mailreaver. Old trained messages are cached, and there is a bulk training tool which makes sure that all messages previously trained as 'spam' or 'ham' will still be classified as such.

If you are in the filter market, it may be worth reading The Grumpy Editor's Guide to Bayesian Spam Filters and A Grumpy Editor's Bayesian Followup.

On Challenge-Response Email

There are a considerable number of people who do not like the challenge-response approach to spam filtering.

Before adopting a CR-style system, I'd strongly suggest reading the objections linked above and thinking about them.

In my opinion, all of the objections in the linked articles boil down to three things. First, poorly designed software could do a large number of horrible things with/to your email. Second, you will annoy people. Third, you could send confirm requests based on forged headers (the "joe-job attack").

I do not think it is valid to indict an entire class of software based on assumptions about the types of bugs that would be possible if it were incorrectly implemented.

If people have been annoyed by my confirm requests, I have yet to meet someone who was annoyed enough not to reply. Paradoxically, I have had positive comments about my challenges ("All you have to do is reply, rather than going to some website and typing text from a picture").

The objection about forged headers holds water. It is bad to send email based on forged headers. However, mail transfer agents do that all the time when they bounce a message. Mailing list software does it with subscription confirmations. This is a problem common to all software that sends email.

It's up to the individual user to decide if this risk is worth taking "just" to reduce spam.

Brad Templeton has a list of principles for Challenge/Response anti-spam systems. The system I’ve outlined here complies with his suggestions.

Other Systems

I’m definitely not the first person to adopt a hybrid approach to spam filtering. The CAMRAM system is quite similar to what I am doing (though I was unaware of their work until late December of 2006).

Two Penny Blue combines statistical content analysis (via CRM114) with a reputation-based system and proof-of-work tokens.