When I’m tailing logfiles…

[to the tune of “When I’m Cleaning Windows”]

Well Happy New Year to both of my readers; I hope you’re having a thrilling year so far!

Over the Christmas break I’ve been working on a tool to help me flag dead or suspect blogs listed in BritBlog. It’s been quite interesting, and I’ve already removed quite a few non-blogs from the database.

Anyway, if you use UN*X then you’ve probably come across the ‘tail’ command:

tail - output the last part of files
tail [OPTION]… [FILE]…
Print the last 10 lines of each FILE to standard output. With more
than one FILE, precede each with a header giving the file name. With
no FILE, or when FILE is -, read standard input.


It has a great option (-f) which allows you to follow a log file as it is being written to. In other words, you can have a log file scroll past your eyes while another process is writing to it (like Apache, for instance):

tail -f access.log
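
For the curious, the idea behind following a file can be sketched in a few lines of Python. This is only an illustration of the technique (jump to the end of the file, then keep polling for newly appended lines), not how tail itself is implemented. The function names and the half-second polling interval are my own assumptions.

```python
import time
from typing import Iterator, TextIO


def read_new_lines(handle: TextIO) -> list[str]:
    """Return every complete line appended since the last call."""
    lines = []
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line.endswith("\n"):
            # Nothing new, or only a half-written line: rewind so the
            # partial text is read again once the writer finishes it.
            handle.seek(position)
            break
        lines.append(line.rstrip("\n"))
    return lines


def follow(path: str, poll_seconds: float = 0.5) -> Iterator[str]:
    """Yield lines as they are appended to path (runs until interrupted)."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        handle.seek(0, 2)  # start at the end of the file, skipping old lines
        while True:
            for line in read_new_lines(handle):
                yield line
            time.sleep(poll_seconds)
```

Something like `for line in follow("access.log"): print(line)` would then scroll new entries past as they arrive. Unlike the real thing, this sketch makes no attempt to cope with the log being rotated or truncated.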

I don’t know what it is about log files, but they seem to have a mesmerising effect. I have been known to follow my Apache log files for quite a long while. I won’t say how long, but it’s great to see traffic on your web site in real time.

So, in my attempt to purge BritBlog of all the spam blogs, dead blogs and other problem blogs that we currently list, I’ve written a little tool that visits the sites listed in the directory and performs various tests against each one. As this runs, it writes out some details to a log file (as well as storing test results in a database). So as you can probably guess: I’ve found myself staring at this for hours over Christmas (it beats Christmas TV anyway…).

While doing this you see all sorts of weird web addresses scroll past you. For some unlucky bloggers/spammers they are so odd that I stop and investigate them. So far I’ve removed about 25 like this, and have another 40 or so to investigate.

The tool has also identified about 400 blogs that have technical problems one way or another (missing files, server errors, DNS errors etc.). I’ll be emailing all these people in due course to tell them of their pending removal from the directory, so that will be a step in the right direction!
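
To give a flavour of how problems like these can be told apart, here is a minimal Python sketch. It is not the actual BritBlog tool. The category names, the function names and the ten-second timeout are all assumptions made for illustration, and it only fetches each address once using the standard library.

```python
import socket
import urllib.error
import urllib.request


def classify_failure(error: Exception) -> str:
    """Map an exception raised while fetching a blog to a rough category."""
    # HTTPError is a subclass of URLError, so it has to be checked first.
    if isinstance(error, urllib.error.HTTPError):
        if error.code == 404:
            return "missing file"
        if 500 <= error.code < 600:
            return "server error"
        return f"http {error.code}"
    if isinstance(error, urllib.error.URLError):
        # A failed hostname lookup arrives wrapped as the URLError's reason.
        if isinstance(error.reason, socket.gaierror):
            return "dns error"
        return "connection error"
    if isinstance(error, TimeoutError):
        return "timeout"
    return "unknown error"


def check_blog(url: str, timeout: float = 10.0) -> str:
    """Fetch url once and return "ok" or a failure category."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "ok"
    except Exception as error:  # deliberately broad: every failure gets a label
        return classify_failure(error)
```

Each result could then be written to the log file alongside the address, which is what makes the scrolling output worth watching.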

There are also a few blogs hosted on *.blog.co.uk (a German-owned blogging platform), and they’re blocking our robot with a robots.txt rule. Not nice. We’re going to either have to ignore their rules or remove these blogs too… Bit of a dilemma!
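
For anyone wondering what honouring those rules looks like in practice, here is a small Python sketch using the standard library’s urllib.robotparser. The robots.txt contents, the robot name “ExampleBot” and the example.invalid addresses are invented for illustration. They are not the real rules served by blog.co.uk, nor the name of our robot.

```python
from urllib.robotparser import RobotFileParser

# Invented rules: one named robot is shut out entirely, and everyone
# else is only kept away from /private/.
EXAMPLE_RULES = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these made-up rules, `is_allowed(EXAMPLE_RULES, "ExampleBot", "http://example.invalid/")` comes back False, while a robot with any other name is still allowed to fetch the front page. A well-behaved crawler would make this check before every request and skip anything disallowed.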

The harder task is that of detecting spam blogs (or splogs). As I mentioned last month, Splogspot has totally failed me, flagging up loads of real blogs as splogs and totally missing all my known splogs! I think the only real solution here is to get members flagging them up when they see them, which will require a bit more work.

Once this is all done though, I’ll be able to start on technoranki! Should really have put something like this in place a long time ago…

Anyway, it’s home time. I’ve not recovered from the Christmas break yet — really need sleep!

