This is great news! I have been running SpamAssassin on my box for quite a while, just to filter my own mail. I recently installed it on my mother's Windows 98 box to filter her mail when she checks it with Outlook Express, and she hasn't complained about spam since. With a bit of tweaking, it's been catching 95% with no false positives. Hopefully the SpamAssassin project will keep on getting better :)
A success rate of 95% really sucks when (like me) you get just over 2,500 spams a day. That'd still mean around 125 spams a day would be getting through. (I've had the same email address since the early 1990s, back when there was no reason to keep your email address "secret.")
Personally I do use SpamAssassin, but as an intermediate step.
First step: Check a whitelist of known senders. Deliver if the sender is on the list, AND the message originated from an IP subnet that I allow for them personally.
Second step: Scan with SpamAssassin. If the score is really high (above 20) throw it the hell out.
Third step: If the score is less than 20, and the person wasn't whitelisted, run the message through TMDA [tmda.net] and politely tell the sender I'm not sure who they are, and I get a lot of spam, and could you please click this link to prove that you're a real person.
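For anyone who wants to see the three steps as code, here is a minimal sketch in Python. It is not the poster's PHP. The whitelist contents, the function name `triage`, and the handling of a score of exactly 20 (treated as "challenge" here, since the post only says "above 20" and "less than 20") are all assumptions made for illustration.

```python
import ipaddress

DISCARD_THRESHOLD = 20.0  # "above 20" gets thrown out, per the post

# Hypothetical whitelist: sender address -> networks that sender may mail from.
WHITELIST = {
    "friend@example.org": [ipaddress.ip_network("192.0.2.0/24")],
}

def triage(sender: str, origin_ip: str, sa_score: float) -> str:
    """Return 'deliver', 'discard', or 'challenge' for one message."""
    ip = ipaddress.ip_address(origin_ip)
    # Step 1: sender must be whitelisted AND come from an allowed subnet.
    if any(ip in net for net in WHITELIST.get(sender.lower(), [])):
        return "deliver"
    # Step 2: very high SpamAssassin scores are dropped outright.
    if sa_score > DISCARD_THRESHOLD:
        return "discard"
    # Step 3: everything else gets a confirmation request (TMDA in the post).
    return "challenge"
```

A whitelisted sender mailing from an unexpected subnet falls through to steps two and three, which is the point of pairing the address with a subnet: a forged From line alone does not get delivered.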
I've been using this three-step system for eighteen months now, and out of over one million messages that have come into my mailbox (really), exactly FOUR spam messages have made it all the way through. Apparently the spammers decided to go ahead and click on the little link, or they used a real person's return address, and when that person got the autoreply, they were too stupid to understand what was going on.
Even better, I have not received ANY indication that I've lost any messages; at least, no one has ever mentioned anything about an email that I didn't get.
I've got five other people at my domain using the same system, although for not quite as long (one for fifteen months, three for about a year, and one for just a month now); they have all had similar success.
So based on those numbers I'd estimate a success rate of 99.9997% for eliminating spam (which is, admittedly, COMPLETELY INSANE), and a false-positive (or at least "lost message") rate of 0% so far (fingers crossed). A few people have had to confirm their messages, of course, but I've whitelisted them as that happens.
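The percentage is easy to sanity-check. The sketch below assumes, as the post does, that the whole message count can be treated as spam; the function names are made up for illustration. Four misses out of exactly one million works out to about 99.9996%, and the quoted 99.9997% corresponds to roughly 1.33 million messages, which fits "over one million".

```python
def catch_rate(total_messages: int, spam_delivered: int) -> float:
    """Percentage of messages that did not get through as spam."""
    return 100.0 * (1 - spam_delivered / total_messages)

def messages_for_rate(rate_percent: float, spam_delivered: int) -> float:
    """How many messages a given catch rate implies for a fixed miss count."""
    return spam_delivered / (1 - rate_percent / 100.0)

at_one_million = catch_rate(1_000_000, 4)
implied_volume = messages_for_rate(99.9997, 4)
```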
I actually wrote all the connecting code in PHP, believe it or not, with a MySQL database as a backend. It's invoked using .qmail files. PHP is indeed good for things other than web pages, and it was a little bit easier for me to maintain and deal with than Perl. The whole thing is less than 25KB of code. There is also a web backend which I use to configure it; that adds another 40KB.
The whole system took about twelve hours of programming to set up, on one Saturday.
Now, for correspondence with companies (such as Microsoft, or Amazon.com), I use a different scheme (although it's handled by the same PHP code). I create a unique email address for each of them, which ONLY allows mail to or from that domain (for example "rptamazon@mydomain.com" only allows messages from amazon.com). Those addresses are also easily cancellable, individually, if the company starts to annoy me with spam. Basically, each email address can be assigned its own unique whitelist, and can be cancelled individually at any time, through the little web interface.
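A rough sketch of the per-address whitelist idea, in Python rather than the poster's PHP and MySQL. The `ADDRESS_RULES` table, the `active` flag, and the choice to accept subdomains of an allowed domain are assumptions for illustration. Note that the From header is trivially forgeable, so a real setup would presumably combine this with the other checks described above.

```python
from email.utils import parseaddr

# Hypothetical table: one entry per disposable address, each with its own
# allowed sender domains and a flag that "cancels" the address when False.
ADDRESS_RULES = {
    "rptamazon@mydomain.com": {"domains": {"amazon.com"}, "active": True},
}

def accepts(recipient: str, from_header: str) -> bool:
    """True if this recipient address is active and the sender's domain is allowed."""
    rule = ADDRESS_RULES.get(recipient.lower())
    if rule is None or not rule["active"]:
        return False
    _, addr = parseaddr(from_header)
    domain = addr.rpartition("@")[2].lower()
    # Assumption: subdomains of an allowed domain are accepted too.
    return any(domain == d or domain.endswith("." + d) for d in rule["domains"])
```

Cancelling an address is then just a matter of flipping `active` to `False` for that one entry, leaving every other address untouched.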
I also have a number of email addresses for things such as customer support for our company (I write computer software). I'm using the same system for those, also, but instead of checking whitelists based on the sender, I've found a simple way to do it is to check for ANY of our product names anywhere in the message body or subject. If the message doesn't mention any of them, it sends a simple autoreply back similar to that in the third step above, but mentioning that the message didn't seem to be about any of our products, but if it was, please click here, blah blah. We don't have a high volume of support messages (about one or two a day; we're a small company) but in the last year only three or four people have had to click through like that, and, honestly, their support requests were so f*cked up anyways that I'd rather it just dropped them on the floor. ;-)
Then, as a very last step in all this, I also catch all email sent to invalid addresses in my various domains (which comes to over 5,000 messages a day), and report those as spam to Vipul's Razor [sourceforge.net]. That helps out the community, and me indirectly, because my SpamAssassin installation also uses Razor.
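The product-name check could look something like the following sketch. The product names are invented, and case-insensitive substring matching is an assumption; the post only says it looks for any product name anywhere in the subject or body.

```python
import re

# Hypothetical product names; the real list is not given in the post.
PRODUCT_NAMES = ["WidgetPro", "DataSync"]

# One case-insensitive pattern that matches any of the names.
_PRODUCT_PATTERN = re.compile(
    "|".join(re.escape(name) for name in PRODUCT_NAMES), re.IGNORECASE
)

def mentions_product(subject: str, body: str) -> bool:
    """True if any product name appears in the subject or the body."""
    return bool(_PRODUCT_PATTERN.search(subject) or _PRODUCT_PATTERN.search(body))

def support_action(subject: str, body: str) -> str:
    """Deliver on-topic mail; otherwise send the click-to-confirm autoreply."""
    return "deliver" if mentions_product(subject, body) else "challenge"
```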
Interesting. I've personally found that SA doesn't do well on "word salad" spams, base64 encoded spams, spams with numbers / special characters / intentional misspellings ("V!agr0"), random word HTML ("<frank><moon>") etc. Nigerian scam spam seems to get through waaayyy too frequently.
What I have found very useful is the DNSBLs that block known spamming IPs (spamhaus.org) and all email from dynamic addresses. This cuts 95% out before SA even sees it. With a whitelist system in front and SA
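For readers who haven't used a DNSBL: the lookup is just a DNS query for the client IP with its octets reversed, under the blocklist's zone. A minimal Python sketch follows. The zone name `zen.spamhaus.org` is an assumption (the post only says spamhaus.org), and treating any successful answer as "listed" is a simplification; real lists use specific return codes, and some answer with error codes when queried through public resolvers.

```python
import ipaddress
import socket

def dnsbl_query_name(ip: str, zone: str = "zen.spamhaus.org") -> str:
    """Build the DNSBL lookup name: reversed IPv4 octets followed by the zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on bad input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"

def is_listed(ip: str, zone: str = "zen.spamhaus.org") -> bool:
    """Simplified check: any A record for the query name counts as listed."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False  # NXDOMAIN (or lookup failure) means not listed here
    return True
```

In practice this check is normally done by the MTA at SMTP time rather than by a script, which is how it cuts mail out before SA even sees it.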
Do you use sa-learn to teach SA about new spam? I have spam-tagged email dumped to a Spam folder on my IMAP server so I can go through it and make sure there aren't any false positives. I then move all the spam to a shared folder and run an sa-learn script on it nightly.
Currently I have amassed 3681 spams totalling 76 megs. I should probably empty that directory sometime :P
sa-learn makes a big difference though. Helps with the misspellings and random junk. Haven't seen a Nigerian scam come through eith
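A nightly job along those lines can be very small. This Python sketch just wraps the `sa-learn --spam` and `sa-learn --ham` invocations; the folder path and function names are hypothetical, and it assumes `sa-learn` is on the PATH and that the shared folder is stored on disk in a format sa-learn can read (for example a Maildir directory).

```python
import subprocess
from pathlib import Path

def sa_learn_command(folder: Path, kind: str = "spam") -> list:
    """Build the sa-learn command line for a folder of spam or ham."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", f"--{kind}", str(folder)]

def train(folder: Path, kind: str = "spam") -> None:
    """Run sa-learn on the folder; raises CalledProcessError on failure."""
    subprocess.run(sa_learn_command(folder, kind), check=True)

# Hypothetical location of the shared, hand-checked spam folder.
SHARED_SPAM = Path("/srv/mail/shared-spam")
```

Calling `train(SHARED_SPAM)` from cron once a night would match the workflow described above. Feeding it some known-good mail with `kind="ham"` as well helps keep the Bayes database balanced.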
Or... you could strip out all your personal information and either make those 76 megs available for others to train their SpamAssassin installs, or make the SA database available...
Word salad I can understand (if your Bayes isn't aggressively trained, at least). I don't have problems with it, but my Bayes is very heavily trained (100-300 spams a day of manual training).
What I don't understand is the base64 problem. One of the first things SA does is decode base64. Even "rawbody" rules get base64 decoding, so really base64 encoding shouldn't make a difference at all, as SA never examines the encoded text.
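To illustrate why the encoding shouldn't matter, here is a small Python sketch using the standard `email` package. It is not SpamAssassin's code, only the same idea: undo the Content-Transfer-Encoding first, then run the rule against the decoded text. The sample message and the rule are made up.

```python
import base64
import re
from email import message_from_string

# A made-up message whose body is base64 encoded.
RAW = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.b64encode(b"Cheap pills, click now").decode("ascii")
    + "\n"
)

def decoded_body(raw_message: str) -> str:
    """Return the body text after undoing the transfer encoding."""
    msg = message_from_string(raw_message)
    payload = msg.get_payload(decode=True)  # bytes, base64 already removed
    charset = msg.get_content_charset() or "utf-8"
    return payload.decode(charset, errors="replace")

# A toy body rule; it never sees the encoded form.
RULE = re.compile(r"cheap pills", re.IGNORECASE)
rule_hits_decoded = RULE.search(decoded_body(RAW)) is not None
rule_hits_encoded = RULE.search(RAW) is not None
```

The rule matches the decoded body and misses the raw encoded text, so a filter that decodes first is unaffected by the spammer's choice of encoding.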
As for the intentional mis-spellings of V!agr0, check out antidrug.cf (use google) or wait for SA 3.0 which includes this set of rules as a part of the standard distribution.
Disclaimer: I am the author of antidrug, and thus do have a bias here.
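For a feel of how obfuscation rules work in general, here is a toy Python regex that tolerates character swaps and inserted punctuation. To be clear, this is not taken from antidrug.cf; the pattern and the name `OBFUSCATED` are invented for illustration, and a production rule set is far more careful about false positives.

```python
import re

# Each letter may be followed by junk (non-word characters or underscores),
# and the vowels may be swapped for common look-alikes.
OBFUSCATED = re.compile(
    r"(?<!\w)v[\W_]*[i!1l|][\W_]*a[\W_]*g[\W_]*r[\W_]*[a@0](?!\w)",
    re.IGNORECASE,
)

def looks_obfuscated(text: str) -> bool:
    """True if the text contains a plain or disguised form of the word."""
    return OBFUSCATED.search(text) is not None
```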
Any chance that you'd be willing to provide a download spot for your scripts? I know that I can't be the only one interested in seeing how you did this :)
PLEASE tell us where we can get these PHP scripts and the accompanying MySQL schema. And any glue or config files explaining how this runs.
I didn't see any references to ClamAV in here, but since its integration with SA is documented in other places, that can be an afterthought.
When you run your own mail server, it's easy to trump Google, Yahoo and MSN's recent multi-GB offerings. Wonder if they can top my 100GB mail account. Not that I've ever gotten more than 1GB mail worth reading in my almost 20 ye
Yes, I would definitely like to make this stuff publicly available; I know a lot of people would be interested. I need to find a good way to do it. I'm a bit worried about drawing needless attention to myself by releasing such a thing--for example, the system is NOT foolproof, so I could certainly see myself becoming a target for attacks and such.
Hopefully I'll find some free time later this summer (two big big programming projects I'm working on now are ending next month) and I'll see if I can take a we
Third step: If the score is less than 20, and the person wasn't whitelisted, run the message through TMDA and politely tell the sender I'm not sure who they are, and I get a lot of spam, and could you please click this link to prove that you're a real person....
So based on those numbers I'd estimate a success rate of 99.9997% for eliminating spam (which is, admittedly, COMPLETELY INSANE), and a false-positive (or at least "lost message") rate of 0% so far (fingers crossed).
You have no idea how many legitimate messages you fail to get because the sender couldn't be bothered, or quite simply can't (i.e. automatic sender, but non-spam) click that link.
Yes, I'm pretty sure I do. Like I said, I've been using this email address for almost 15 years now, and have a pretty good idea of who I correspond with. Very rarely do I get messages on my primary address from completely random people who I've never met before. It's more for personal correspondence. The idea is that since I
Apparently the spammers decided to go ahead and click on the little link, or they used a real person's return address, and when that person got the autoreply, they were too stupid to understand what was going on.
I respond to those all the time. I politely send a "please don't auto-reply to forged spam" message. It's not my fault that your anti-spam solution is stupid enough to re-define an email reply to mean that you should accept forged mail.
I respond to those all the time. I politely send a "please don't auto-reply to forged spam" message. It's not my fault that your anti-spam solution is stupid enough to re-define an email reply to mean that you should accept forged mail.
I'm sure I'm not the only one who has just one word to say in response to that... Huh?
Seriously, if you would "reply" to the confirmation autoreply, you'd just get another email back saying, again, I don't know who you are, and I get a lot of spam, so please click this l
Ah, I didn't realize you were talking about a different system. Most of them are reply-based. Since I read my home mail using mutt, "clicking" is a meaningless term, and I would have ignored your mail. Out of curiosity, does your "click here" blurb trip SpamAssassin's click-through tests? It's quite possible you'll still get through, but raising the odds of your "I'll never see your mail unless you see this" message getting trashed by spam filters seems like a bad plan.
Out of curiosity, does your "click here" blurb trip SpamAssassin's click-through tests?
Indeed, the message ends up with a negative score in SpamAssassin, because it has proper "In-Reply-To" and "References" headers and such. And since it quotes the text of the original message, at the bottom, it gets through any Bayesian filters and such they have as well, unless their message was very spammy in the first place. (In which case it's their own fault, not mine.)
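A confirmation message built that way might look like the following Python sketch: it threads onto the original via In-Reply-To and References and quotes the original text at the bottom. The wording, the function name `build_challenge`, and the confirmation URL are placeholders, not what the poster's system or TMDA actually sends, and the sketch assumes the original has a From header and a plain-text body.

```python
from email.message import EmailMessage

def build_challenge(original: EmailMessage, confirm_url: str, my_address: str) -> EmailMessage:
    """Build a confirmation request that threads as a reply to the original."""
    reply = EmailMessage()
    reply["From"] = my_address
    reply["To"] = str(original["From"])
    reply["Subject"] = "Re: " + str(original["Subject"] or "")
    msg_id = original["Message-ID"]
    if msg_id:
        # Proper threading headers, as a normal mail client would set them.
        reply["In-Reply-To"] = str(msg_id)
        refs = original["References"]
        reply["References"] = f"{refs} {msg_id}" if refs else str(msg_id)
    # Quote the sender's own text at the bottom of the message.
    quoted = "\n".join("> " + line for line in original.get_content().splitlines())
    reply.set_content(
        "I'm not sure who you are, and I get a lot of spam.\n"
        "Please open this link to confirm you're a real person:\n"
        f"{confirm_url}\n\n{quoted}\n"
    )
    return reply
```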
"Well hello there Charlie Brown, you blockhead."
-- Lucy Van Pelt
Yeah that is COMPLETELY INSANE