Tackling Splogs

I’ve been troubled by spam blogs (splogs) for a long time, and in recent months have been looking for methods to deal with them. If you have a blog, then no doubt you will have been bombarded with comment spam since about day 1. If you run a blog service (like BritBlog or Technoranki), then you will also come across spam blogs trying to abuse your service for their own evil gains.

Thankfully, spam comments are now quite easily kept under control if you use the fabulous Akismet. Over time this service has improved a great deal, and it is a rare day now when a spam comment gets past it.

While this is good, it’s still annoying that the spammers think it’s OK for them to bombard you with HTTP requests, sucking up your bandwidth and impacting on server resources. In some extremes this can even bring down a web server, becoming nothing less than a denial of service attack.

So that’s one gripe: while it is easier now than ever before to detect and remove comment spam, it’s not so easy to stop it reaching your web server in the first place and costing you money.

My real issue at the moment though is dealing with splogs that are listed in blog directories (such as BritBlog) and other blog services designed to promote their members. Some blogs are set up with the best of intentions, but it seems some evil types are scouring the Internet looking for dead blogs that they can now hijack.

This has happened on more than one occasion with BritBlog, and as I inspect some of the sites we have listed it is clear that the content on some has changed since they were first approved.

So this makes me wonder: how can I keep these in check?

Possible Solutions

Two (simple) options came to mind:

  1. Use the Akismet database to tell me if a hostname or URI is that of a known splog. After contacting the people behind Akismet I got a message from Matt Mullenweg telling me that this isn’t possible. This is a real shame as they have potentially a great resource for identifying splogs, and I also have a lot of faith in applications and services rolled out by Automattic.
  2. Use SplogSpot. SplogSpot is basically a website that allows you to query a database of URLs identified by Pingoat as splogs. They provide an API which means you can integrate it with your own applications, and have a web form that allows other people to manually submit splogs. On the face of it this sounds great - I can simply chuck all the blogs in BritBlog at it one at a time, and see which ones come back as splogs.

While the SplogSpot approach works in theory, it doesn’t really work as well as it could. This is because the SplogSpot database contains spam post URLs and not spam host names.

This means that if I check a URL like http://**saver.blogspot.com/2005_11_01_qtsaver_archive.html against SplogSpot, it may tell me that it is a splog. However, if I check just the hostname (http://**saver.blogspot.com), it tells me that it’s OK (i.e. not a splog).

I can see why this may be the case: you can’t go ruling out all blogs at journals.aol.com just because journals.aol.com/nite**owler147/buddyshack-free-dating is a splog. However, I do see a bit of room for some additional logic here: if the blog is a blogspot one, then you can flag the host name as being a splog. Similarly, if the blog is at AOL then you could flag any URL beneath the user’s main page (journals.aol.com/nite**owler147 in the above example) as a splog too. There aren’t too many mass blog hosting services out there, so coming up with some rules to cover the main ones could give massive benefits.
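As a rough sketch of that extra logic, the Python below maps a post URL to the part that identifies the whole blog. The rule table, the function name blog_root and the example URLs in the usage note are all assumptions made up for illustration; nothing here is part of SplogSpot’s API, and the rules cover only the two hosts mentioned above.

```python
from urllib.parse import urlsplit

# Hypothetical rule table, for illustration only.
SUBDOMAIN_HOSTS = ("blogspot.com",)    # one blog per subdomain
PATH_HOSTS = {"journals.aol.com": 1}   # one blog per leading path segment


def blog_root(url):
    """Return the part of `url` that identifies the whole blog, or None
    when no rule applies and only the exact URL can be judged."""
    if "://" not in url:
        url = "http://" + url          # urlsplit needs a scheme to find the host
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # Subdomain-per-blog services: the hostname alone names the blog.
    for suffix in SUBDOMAIN_HOSTS:
        if host.endswith("." + suffix):
            return host

    # Path-per-blog services: keep the host plus the user's path segment(s).
    if host in PATH_HOSTS:
        depth = PATH_HOSTS[host]
        segments = [s for s in parts.path.split("/") if s]
        if len(segments) >= depth:
            return host + "/" + "/".join(segments[:depth])

    return None
```

With this, a flagged post such as http://example-splog.blogspot.com/2005_11_01_archive.html would reduce to example-splog.blogspot.com, and journals.aol.com/someuser/some-post would reduce to journals.aol.com/someuser, while a URL on any host without a rule is left alone.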

I mean no disrespect to the people at SplogSpot — they have a great idea — but it’s a shame they’re not doing more with it. (And I know it’s not really fair of me to comment given the number of unfinished projects I have on the go at the moment…)

Proof of the Pudding

As all I’m really left with is SplogSpot, I’ve been running all the blogs on Technoranki through it. (Technoranki is made up almost entirely of blogs from BritBlog at the moment.) I have so far run about 1000 blogs through it, and it’s thrown up about 20 alleged splogs. I’ve not manually checked these so-called splogs, but the list currently includes the BritBlog Blog, Ducking for Apples, and MC Rebbe, so I don’t hold out much hope for the rest. Maybe it’s unfair to comment just yet, so I’ll come back to it in a short while.

** Several hours pass **

OK, back now and I’ve queried SplogSpot with 3931 blogs. (They now seem to be blocking my requests, so I guess I’ve annoyed them by using their API too much. Ho hum.) So what are the results? SplogSpot has identified 175 potential splogs out of the 3931 submitted, or about 4.5%. As I mentioned earlier, I spotted three straight away that are most definitely not splogs. Randomly sampling some of the blogs flagged up, I have yet to discover one that is actually a splog. Quite a disappointment!

I won’t give up here, but my dinner is nearly ready! I’ll slow down the SplogSpot poll interval and leave it running overnight (if they let me back in), and at some point manually review all the splogs flagged up.
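Slowing the poll down could look something like the Python sketch below. It is only an illustration: poll_slowly, the check callback and the five second default are assumptions of mine, and check stands in for whatever code actually queries the splog service, which is not shown here.

```python
import time


def poll_slowly(urls, check, delay_seconds=5.0, sleep=time.sleep):
    """Call check(url) for each URL, pausing between requests.

    `check` is a placeholder for the function that queries the splog
    service; it should return True when the URL is flagged.
    """
    flagged = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request after the first
        if check(url):
            flagged.append(url)
    return flagged
```

The sleep argument is there so the delay can be swapped out when trying the function without actually waiting; in a real overnight run the default time.sleep would be used.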

To be continued…

2 Responses to “Tackling Splogs”

  • Brian Robinson
    December 24th, 2006 16:55
    1

    Pingoat is not a reliable source for determining what is a splog or not. Akismet works remarkably well for a free service.

  • Mark
    December 24th, 2006 18:11
    2

    I’m about to write up my findings, and they weren’t good! Looks like the pingoat/splogspot route is a waste of time!

    As far as Akismet is concerned, it’s great for identifying spam comments, but as there is no way to query just a list of known spam URLs it’s no good for me either :-(
