Statistical Spam Filtering in the UW CS Department

I don't get a huge amount of spam compared to some people, but I do get more than enough to be annoying. So, with some tips from Paramjit, and some info on the net, especially here, I've started using Bogofilter in conjunction with procmail to filter spam for my CS email. I don't have any numbers, but it's been quite effective for me, only missing a few spam messages, and never giving a false positive.

It takes a few minutes to set up, wastes a few megs of disk space, and requires a few more milliseconds to deliver your mail. But, you'll still have plenty of time to get those 6th floor donuts if you hurry, disk space is cheap especially if you use /scratch, and satisfaction is worth a little time.

Following is some info on how to do this in our department, but first the caveat: This works for me, but I make no claims about how it might work for you, and am not responsible if you lose email, other files, blow your computer up, or anything else you might try to blame me for.

Local Delivery

In order to invoke a program to run on incoming email in the CSL (such as a spam filter), you must have your mail delivered to your local workstation. This obviously makes sense: the spam filter takes a heck of a lot more resources than just dumping the incoming mail onto the right spool, so if everyone ran one on the mail server, it would likely grind to a halt. Accessing your mail file locally is much faster anyway, but if you are constantly being shuffled between offices and computers, it may not be the best choice for you. However, I don't think that there is much of a risk of losing messages, since you can always change it back to IMAP if you know your machine will disappear, and if your machine is unreachable for a while, mail will just be queued until it comes back up.

Whether it's intended or just a configuration quirk I don't know, but you can still access your mail via the IMAP server if you wish, provided that your inbox lives at "~/mail/mbox". So basically, there are few disadvantages of local delivery that I know of.

To do this, go to http://www-auth.cs.wisc.edu, click on Web Forms, then Change e-mail delivery server. The change takes effect overnight.

Setting up Bogofilter

First download and install bogofilter from http://bogofilter.sourceforge.net or copy the binaries and/or source from ~pwells/public/bogofilter. Oh, and read the documentation too. Different versions of bogofilter have different options, so make sure to verify them for your copy. The options I describe are those of version 0.11.2.

Train the filter

If you don't have a huge archive of saved spam, you can grab a couple of archives from SpamArchive, though the filter seemed to work better for me with only my own small collection of spam. Unzip the archives and tell bogofilter they're spam:

bogofilter -v -s < spam_arch

where spam_arch is the name of the spam archive. If you have some of your own spam, use that instead of, or in addition to, the archives: just run bogofilter with the -s option again on the additional spam.

You also want to give it some ham, and your current INBOX and any folders with saved messages are good to use (they are found in a variety of places depending on your mail delivery and mail client options - check ~/mail/, ~/Mail/, and /var/spool/mail/username). Run with the -n option to tell it it's non-spam:

bogofilter -v -n < INBOX

If you use Netscape, or some other program that stores mail in individual files, not one big file, you can do something like:

cat Inbox/* | bogofilter -v -n
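
If you have several saved-mail folders, a small loop saves some typing. The following is only a sketch: train_ham is a name I made up, and it assumes every argument is a mail file that bogofilter can read on standard input, exactly as in the commands above.

```shell
# Hypothetical helper (not part of bogofilter): register every file given
# on the command line as non-spam. Set BOGOFILTER to use a specific binary.
BOGOFILTER=${BOGOFILTER:-bogofilter}

train_ham() {
    for folder in "$@"; do
        # Skip anything that is not a readable regular file (e.g. subdirectories).
        [ -f "$folder" ] && [ -r "$folder" ] || continue
        "$BOGOFILTER" -v -n < "$folder"
    done
}
```

For example, train_ham ~/mail/mbox ~/mail/saved-* would feed the inbox and any folders matching saved-* (a made-up naming pattern) to the filter one at a time.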

Bogofilter creates a ~/.bogofilter directory and puts goodlist.db and spamlist.db in there. They can take a couple of megs, so if you want, put them somewhere in /scratch and symlink 'em.
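
Moving the directory and leaving a link behind could look something like this. It is a sketch, not a tested recipe for our machines: relocate_with_symlink is a made-up name, and the destination should be an absolute path so the link resolves from anywhere.

```shell
# Hypothetical helper: move a directory to a new location and leave a
# symlink at the old path. Both paths are arguments, so nothing here
# assumes a particular /scratch layout.
relocate_with_symlink() {
    src=$1
    dest=$2
    # Refuse to act if the source is missing or is already a symlink.
    [ -d "$src" ] && [ ! -L "$src" ] || { echo "not a real directory: $src" >&2; return 1; }
    # Refuse to overwrite anything at the destination.
    [ ! -e "$dest" ] || { echo "already exists: $dest" >&2; return 1; }
    mv "$src" "$dest" && ln -s "$dest" "$src"
}
```

You would call it as relocate_with_symlink "$HOME/.bogofilter" /scratch/yourdir/bogofilter, where /scratch/yourdir is a placeholder for wherever you keep things in /scratch.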

Test the filter

Find a piece of spam, put it in a file (with the full headers), and run bogofilter:

bogofilter -v < spam

It will say something like:

X-Bogosity: Yes, tests=bogofilter, spamicity=0.999863, version=0.11.2

Sweeet...
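
If you only want the number, you can pull it out of that header with sed. This sketch assumes the X-Bogosity format shown above (version 0.11.2); other versions may print the header differently. The function name spamicity is my own invention.

```shell
# Hypothetical helper: print the spamicity value from an X-Bogosity
# header read on standard input. Prints nothing if no such header is seen.
spamicity() {
    sed -n 's/^X-Bogosity:.*spamicity=\([0-9.]*\).*/\1/p'
}
```

Used as bogofilter -v < spam | spamicity, it should print just the score.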

Running Bogofilter with Procmail

To filter your mail, use procmail to sort it into folders based on the results of the bogofilter test. Below, I've shamelessly copied pieces of other people's ~/.procmailrc files into mine. Based on the return value of bogofilter, it either updates your goodlist and delivers to your inbox, or updates your spamlist and delivers to the spam folder.

PATH=/bin:/usr/bin:/s/procmail/bin:/u/p/w/pwells/bin
LD_LIBRARY_PATH=/s/db-4.1.25/lib
LOGFILE=/u/p/w/pwells/local/log/procmail.log #recommended
MAILDIR=/u/p/w/pwells/mail #you'd better make sure it exists
DEFAULT=mbox

# :0c
# maillog

:0HB                     # run bogofilter on the Header and Body
* ? bogofilter
{                        # if it says it's spam
    :0c
    | bogofilter -s      # update your spam word db

    :0
    spam                 # and deliver to 'spam' folder
}

:0EHBc
| bogofilter -n          # otherwise, it's not spam, update good list

#Default: mail delivered to $DEFAULT mailbox

This setup is nice because it automatically adjusts to new spam words, but bad because in the cases where bogofilter misses spam, it puts the words in the good list, and in the (very rare but possible) cases where it has a false positive, registers good email as spam. Thus, in these cases you need to manually update the word lists. If you use mutt, there is info here on how to do that very easily.

For the rest of us, the manual method will have to suffice until someone spends the time to find something better: Save the missed spam (with full headers) to a file such as /tmp/spam. Then tell bogofilter to unregister it as non-spam, and re-register it as spam:

bogofilter -v -N -s < /tmp/spam

If you manage somehow to get a false positive, save the ham to a file and run bogofilter with the -S -n (or just -Sn) options to do the opposite.
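
Since I can never remember which capital letter goes with which correction, a tiny wrapper helps. This is only a sketch: reclassify is a made-up name, and the flags are the version 0.11.2 ones described above, so check them against your copy.

```shell
# Hypothetical wrapper around the two corrections described above.
# Usage: reclassify spam|ham FILE
BOGOFILTER=${BOGOFILTER:-bogofilter}

reclassify() {
    case $1 in
        # Missed spam: unregister as non-spam (-N), register as spam (-s).
        spam) "$BOGOFILTER" -v -N -s < "$2" ;;
        # False positive: unregister as spam (-S), register as non-spam (-n).
        ham)  "$BOGOFILTER" -v -S -n < "$2" ;;
        *)    echo "usage: reclassify spam|ham FILE" >&2; return 2 ;;
    esac
}
```

So reclassify spam /tmp/spam fixes a missed spam message, and reclassify ham followed by the saved file fixes a false positive.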

Running Procmail from .forward

Now, of course, you need to run procmail on all incoming mail. You should probably wait to do this until your mail is being delivered locally, since you can't otherwise run programs from your .forward, and it's possible you might lose some mail if you try. To run it, put the following line in your ~/.forward file (with the quotes!):

"| /s/std/bin/runauth /s/std/bin/procmail"

runauth is used to run the command with the proper AFS authentication. In order to use it, you must first run the command:

/s/std/bin/stashticket

It's recommended that you put stashticket in your .cshrc so that it's run every time you log in. If the stashed ticket expires - i.e. when you don't log in for a month or more - procmail won't be able to run with the correct authentication, and your mail will be delivered by default to /var/spool/mail/username on the local machine. The same happens if delivery fails for some other reason, such as an AFS outage.
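
If you would rather not open an editor, something like the following adds the line to your .cshrc without duplicating it on a second run. It is a sketch under the assumption that a bare /s/std/bin/stashticket line in .cshrc is all you need; ensure_line is a made-up helper name.

```shell
# Hypothetical helper: append LINE to FILE unless an identical line
# is already present, so running it twice is harmless.
ensure_line() {
    file=$1
    line=$2
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fqx -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

For example: ensure_line "$HOME/.cshrc" /s/std/bin/stashticket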

Tuning Bogofilter

After you have run bogofilter for a while and built up a collection of ham and spam, you may find that bogofilter still misses a fair amount of spam - especially those messages that attempt to defeat filters such as these. I have found, however, that the bogosity of such messages is still high (e.g. 0.5 or 0.7), but not high enough for the default spam cutoff of 0.9, while non-spam messages are almost always less than 0.1. You can lower the spam cutoff (and increase the probability of seeing a false positive) by running it with the -o option:

bogofilter -o 0.5
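
To get a feel for what a lower cutoff does, you can compare a few scores against it by hand. This sketch does not call bogofilter at all; it just does the comparison, and it assumes a message counts as spam when its score is at or above the cutoff (I have not checked how bogofilter treats an exact tie). The name is_spam_at is made up.

```shell
# Hypothetical helper: exit 0 if SCORE is at or above CUTOFF, 1 otherwise.
# Usage: is_spam_at SCORE CUTOFF
is_spam_at() {
    awk -v s="$1" -v c="$2" 'BEGIN { exit !(s + 0 >= c + 0) }'
}
```

With the scores mentioned above, is_spam_at 0.7 0.9 fails (the message slips through at the default cutoff), while is_spam_at 0.7 0.5 succeeds, and a typical ham score such as 0.1 is still below either cutoff.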

That's It!

Hope it works for you! If you have comments or additions for this tutorial, please email me at pwells@cs.wisc.edu. (Notice that I'm not afraid to put my un-obfuscated address on my web page anymore!)