Best language choice for a spam detection service


I have around 20 active blogs that get quite a bit of spam. Since I hate CAPTCHAs, the alternative is very smart spam filtering. I want to build a simple REST-style spam-checking service that I would use across all my blogs. That way I can consolidate IP blocks and offload spam detection to third parties such as Akismet, Mollom, and Defensio, and at some point in the future write my own spam detection to really get my head into some very interesting spam detection algorithms.

My language of choice is PHP; I consider myself quite proficient, and I can really dig in deep and come out with a solution. This project, I feel, would be a good exercise for learning another language. The big two that come to mind are Python and Ruby on Rails, as everyone talks about them like it's the second coming of our savior. Since this is mostly just an API with no admin or public-facing interface, basic Python running a simple HTTP server seems like the way to go. Am I missing anything? What would you, the great community, recommend? I would love to hear your language, book, and best-practice recommendations.

This has to scale, and I want to write it with that in mind. Right now I'd probably be able to use the third parties' free plans, but soon enough I'd have to expand the whole thing so it can actually think on its own. For now I think I'll just store everything in a MySQL database until I can do some real analysis on it. Thanks!

By : smazurov


I'd have to recommend Akismet for its ease of use and high accuracy. With only an API key and an API call, you can determine if a given blob of text from a user is spammy. I've been using the Akismet plugin for WordPress, which uses the same API, and have had stellar results with it for the last year or so.

Zend Framework has a great Akismet PHP class you can use independent of the rest of the framework, which should make integration pretty straightforward. Documentation is quite thorough, as well.
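For readers trying this from Python rather than PHP, the "API key plus one API call" idea might look roughly like the sketch below. It assumes Akismet's `comment-check` REST endpoint in its key-as-subdomain form and the parameter names `blog`, `user_ip`, `user_agent`, and `comment_content`; check the current Akismet documentation before relying on any of them. The function names `build_comment_check` and `is_spam` are made up for this illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_comment_check(api_key, blog_url, user_ip, user_agent, content):
    """Build (but do not send) a POST request for Akismet's comment-check call.

    Assumption: the endpoint and field names below match Akismet's REST docs.
    """
    params = {
        "blog": blog_url,            # front page URL of the blog being protected
        "user_ip": user_ip,          # IP address of the commenter
        "user_agent": user_agent,    # commenter's browser user agent
        "comment_content": content,  # the text to classify
    }
    url = f"https://{api_key}.rest.akismet.com/1.1/comment-check"
    return Request(url, data=urlencode(params).encode("utf-8"), method="POST")


def is_spam(request, opener=urlopen):
    """Send the request and interpret the plain-text answer.

    The service is assumed to reply with the literal body "true" (spam)
    or "false" (not spam); anything else is treated as an error.
    """
    with opener(request, timeout=10) as response:
        body = response.read().decode("utf-8").strip()
    if body == "true":
        return True
    if body == "false":
        return False
    raise ValueError(f"unexpected comment-check response: {body!r}")
```

A call would then be something like `is_spam(build_comment_check(key, "https://blog.example/", ip, agent, text))`. The `opener` parameter exists only so the network call can be swapped out in tests.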

Python has some advantages.

  1. There are several HTTP server frameworks in Python. Look at the WSGI reference implementation, and learn how to use the WSGI standard to handle web requests. It's very clean and extensible. It takes a little bit of study to see that WSGI is all about adding details to the request until you reach a stage in the processing where it's time to formulate a reply.

  2. MIME email parsing is pretty straightforward.

  3. After that, you'll be using site blacklisting and content filtering for your spam detection.

    • A site blacklist can be a big, fancy RDBMS. Or it can be a simple pickled Python set of domain names and IP addresses. I recommend a simple pickled set object that lives in memory. It's fast. You can have your RESTful service reload this set from a source file on receipt of some GET request that forces a refresh.

    • Text filtering is just hard. I'd start with SpamBayes.
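As a rough illustration of the WSGI and in-memory blacklist points above, here is a minimal sketch using only the standard library. The file name `blacklist.pickle`, the `/check` and `/reload` paths, and the `host` query parameter are all assumptions made for this example, not part of any standard.

```python
import json
import pickle
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

BLACKLIST_PATH = "blacklist.pickle"  # hypothetical file holding a pickled set
_blacklist = set()                   # in-memory set of domains and IP addresses


def load_blacklist(path=None):
    """Replace the in-memory set with the pickled one; return its size.

    Only unpickle files you produced yourself: pickle can run arbitrary code.
    """
    global _blacklist
    with open(path or BLACKLIST_PATH, "rb") as fh:
        _blacklist = set(pickle.load(fh))
    return len(_blacklist)


def app(environ, start_response):
    """WSGI application: GET /check?host=... and GET /reload."""
    path = environ.get("PATH_INFO", "")
    query = parse_qs(environ.get("QUERY_STRING", ""))
    headers = [("Content-Type", "application/json")]

    if path == "/reload":
        # Force a refresh of the set from the source file.
        payload = {"entries": load_blacklist()}
    elif path == "/check":
        host = query.get("host", [""])[0].lower()
        payload = {"host": host, "blocked": host in _blacklist}
    else:
        start_response("404 Not Found", headers)
        return [b'{"error": "not found"}']

    start_response("200 OK", headers)
    return [json.dumps(payload).encode("utf-8")]


def serve(port=8000):
    """Run the app on the WSGI reference server (fine for experiments only)."""
    with make_server("", port, app) as httpd:
        httpd.serve_forever()
```

Calling `serve()` starts the reference server; a request such as `GET /check?host=spam.example` then answers with a small JSON object. Membership tests against a set are constant time on average, which is why the in-memory approach is fast. For production traffic, the same `app` callable could be handed to any other WSGI server.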

By : S.Lott

I humbly recommend Lua, not only because it's a great, fast language, already integrated with web servers, but also because you can then exploit OSBF-Lua, an existing spam filter that has won spam-filtering competitions for several years in a row. Fidelis Assis and I have put in a lot of work trying to generalize the model beyond email, and we'd be delighted to work with you on integrating it with your app, which is what Lua was designed for.

As for scaling, in training mode we process hundreds of emails per second on a 2006 machine, so that should work out pretty well even for a busy web site.

We'd need to work with you on classifying stuff without mail headers, but I've been pushing in that direction already. For more info please write [email protected]u. (Yes, I want people to send me spam. It's for research!)
