Ian Landsman is Starting From Scratch, June 1, 2006:

Form SPAM

If you're in the market for a powerful and user friendly Help Desk solution, please take a look at my company's flagship product HelpSpot.
It appears I'm not the only one who has come under attack from generic form spam over the past few days. I caught this post by Sam Ruby the other day. The culprit seems to be a totally generic form submission bot, but it's pretty good. It's been hitting the HelpSpot forums daily and making lots of work for me :-(

My solution will be to add Bayesian filtering for the forums into HelpSpot. It's already used for email and should work pretty well for the forums. I'll need a new table to keep the forum word lists separate, though, since forum spam is a different type of spam than email spam and I've noticed a different set of words/links being used.

I'm curious why more weblogs don't use Bayesian filters. They work really well because, at the end of the day, the spammers must link to their bogus sites, and the filter uses those links to weed them out. A filter like this is pretty easy to code. The hardest part is figuring out the odd Lisp language Graham used to prototype it.
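For readers who want to see the idea in code, here is a minimal sketch in Python, loosely following Paul Graham's "A Plan for Spam". It is not HelpSpot's implementation. The class name `BayesFilter`, the `min_count` parameter, and the sample messages are all made up for illustration. The constants (doubling the good counts, clamping to 0.01/0.99, 0.4 for unseen tokens, the 15 most interesting tokens) follow the essay. Each instance keeps its own word list, which mirrors the plan to store forum and email word lists separately.

```python
import math
import re
from collections import Counter

# Letters, digits, dashes, apostrophes and dollar signs count as token characters.
TOKEN_RE = re.compile(r"[a-z0-9$'-]+")


def tokenize(text):
    """Split text into lowercase tokens; URLs break into pieces such as 'cheap-pills'."""
    return TOKEN_RE.findall(text.lower())


class BayesFilter:
    """One independent word list (hypothetical class; use one instance per input type)."""

    def __init__(self, min_count=5):
        self.good = Counter()  # token -> occurrences in legitimate messages
        self.bad = Counter()   # token -> occurrences in spam messages
        self.ngood = 0         # number of legitimate messages seen
        self.nbad = 0          # number of spam messages seen
        self.min_count = min_count  # tokens seen fewer times than this are "unknown"

    def train(self, text, is_spam):
        tokens = tokenize(text)
        if is_spam:
            self.bad.update(tokens)
            self.nbad += 1
        else:
            self.good.update(tokens)
            self.ngood += 1

    def token_probability(self, token):
        g = 2 * self.good[token]  # good counts are doubled to bias against false positives
        b = self.bad[token]
        if g + b < self.min_count:
            return 0.4  # unseen or rare tokens lean slightly innocent
        good_ratio = min(1.0, g / max(self.ngood, 1))
        bad_ratio = min(1.0, b / max(self.nbad, 1))
        p = bad_ratio / (good_ratio + bad_ratio)
        return max(0.01, min(0.99, p))  # never fully certain either way

    def spam_probability(self, text):
        probs = [self.token_probability(t) for t in set(tokenize(text))]
        # Keep the 15 tokens whose probability is farthest from the neutral 0.5.
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        top = probs[:15]
        prod = math.prod(top)
        inverse = math.prod(1.0 - p for p in top)
        return prod / (prod + inverse)


# Tiny demo corpus, so the occurrence cutoff is lowered from 5 to 1.
forum_filter = BayesFilter(min_count=1)
email_filter = BayesFilter(min_count=1)  # separate word list, left untrained here

forum_filter.train("Buy cheap pills at http://cheap-pills.example now", is_spam=True)
forum_filter.train("cheap pills online http://cheap-pills.example", is_spam=True)
forum_filter.train("How do I reset my HelpSpot password", is_spam=False)
forum_filter.train("The forum search does not find my post", is_spam=False)

spam_score = forum_filter.spam_probability("cheap pills http://cheap-pills.example")
ham_score = forum_filter.spam_probability("I cannot find the password reset page")
print(spam_score, ham_score)
```

In the essay a message is treated as spam when the combined probability exceeds 0.9. A real filter needs far more training data than this demo, and the token counts would live in a database table rather than in memory.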
Created on 06.01.2006 12:06 pm · Comments (11)


Discussion

I use Drupal for all of my sites, and it has a Bayesian spam filter module. Once it's tuned, it's fantastic. Before it was tuned, I was getting upwards of 300 spam comments/minute on CodeSnipers.com, but now we get about 300/day and only a few make it through each month.

Created by Keith Casey on 06.01.2006 1:06 pm

There are many available, but most are GPL so you can't use them in your commercial software.

I thought about it, saber, but it's not really worth the time for me. Perhaps I could charge a hundred bucks or so? Not really worth it. And to Keith's point, it's something where there are many easily used or copied alternatives.

Created by Ian on 06.01.2006 1:06 pm

Come on... there must be 20 different implementations of it to choose from.

If you're running a stock forum, CMS, blog, etc., there's probably one for you already. If you've written your own, then you probably have the skills to tweak one of the existing ones or write your own.

Created by Keith Casey on 06.01.2006 1:06 pm

Ian, ever considered selling a PHP filter component? I would be interested, assuming the license allowed me to include it in another commercial product.

Created by saberworks on 06.01.2006 1:06 pm

Yeah, those little hacks work well. For this blog I have it set to close comments on posts over 21 days old. I pretty much never get spam, since the spammers seem to like to hit the old posts.

Created by Ian on 06.01.2006 1:06 pm
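The 21-day cutoff Ian mentions can be sketched in a few lines of Python. The function name `comments_open` and its parameters are hypothetical; a real blog engine would read the post date from its own storage.

```python
from datetime import datetime, timedelta


def comments_open(posted_at, now, max_age_days=21):
    """Return True while the post is still young enough to accept comments."""
    return now - posted_at <= timedelta(days=max_age_days)


post_date = datetime(2006, 6, 1)
print(comments_open(post_date, datetime(2006, 6, 10)))  # 9 days old
print(comments_open(post_date, datetime(2006, 7, 1)))   # 30 days old
```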

I just implemented the simplest possible solution on my "email me" form: I added a radio button that says "Spam" or "Not Spam". A human being will click over to the not spam category. A bot that doesn't know I did that won't. It was a 5 minute hack. For public forums, such a thing may not be too onerous.

Of course, if the Bayesian thing "just works", then it's a win... but you can cut off a remarkable amount of spam by adding a simple question.

Created by Kevin Dangoor on 06.01.2006 1:06 pm
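The radio-button check Kevin describes might look like this on the server side. This is a sketch under assumptions: the field name `spam_check` and its values are invented, and the form is assumed to arrive as a plain dictionary with "Spam" preselected in the HTML.

```python
def looks_human(form):
    """Accept a submission only if the 'Not Spam' radio option was chosen."""
    # A bot that leaves the default selected, or omits the field, is rejected.
    return form.get("spam_check", "spam") == "not_spam"


print(looks_human({"message": "Hi Kevin", "spam_check": "not_spam"}))
print(looks_human({"message": "Buy pills", "spam_check": "spam"}))
print(looks_human({"message": "Buy pills"}))
```

A bot written specifically for the site could learn to flip the button, so this only stops generic submission bots.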

I coded it up over a year ago, so I don't remember the exact resources I used, but I do remember that most of them were bad :-D

In the end I did the implementation by following the essays in which Paul Graham lays out the idea. The two most relevant ones are:

Original:
http://www.paulgraham.com/spam.html

Improved:
http://www.paulgraham.com/better.html

Created by Ian on 06.01.2006 1:06 pm

Hey Ian,
If you have any good links about how to code this Bayesian filter thingy in PHP, do you mind sharing them? I've heard about this before, but I'm not really sure how it works. It might make a nice feature in a product I'm currently procrastinating at :-D . (Note: If it takes you more than 1.5 minutes to find that link, don't bother!)

Cheers,
Ali.

Created by Ali on 06.01.2006 1:06 pm

Those Drupal guys have a module for everything, don't they!

Yeah, there are implementations in all the languages now, Philipp. I had to write my own because I needed to do a few HelpSpot-specific parts. Also, it isn't always clear what license applies to some of that code out there, so better safe than sorry on that front.

Created by Ian on 06.01.2006 1:06 pm

"The hardest part is figuring out the odd Lisp language Graham used to prototype it."

I guess implementations in all kinds of languages abound on the web by now. I remember reading about it with a Perl implementation in Dr. Dobb's a year or two ago...

Created by Philipp Schumann on 06.01.2006 1:06 pm

I guess I overestimated the complexity of the task. I should read the pages you guys linked to, I guess.

Created by saberworks on 06.01.2006 1:06 pm

 


Copyright © by Ian Landsman
