Blocking SiteMorse & other unwelcome robots

Like just about every other site owner on the planet, you probably crave more traffic to your corner of the web. But not every visitor to your website should be welcomed with open arms. A good deal of the hits listed in your logs are likely to come from the many programs - commonly known as robots, bots, crawlers and spiders - that automatically trawl the web for a variety of purposes, from indexing pages for search engines and checking links to harvesting email addresses and scraping content.

Most of these robots are harmless and positively beneficial - there are few reasons to stop Googlebot from accessing your site. But not every robot visits your site with your best interests at heart.

Here are four techniques for blocking unwelcome robots from accessing your site. Before using any of them you will need to know either the IP address the bot is originating from, or the user-agent string the bot uses.

1. robots.txt

The robots exclusion standard, or robots.txt protocol, is the most straightforward way of excluding robots from your site. It's platform independent, flexible and easy to set up, but it does require the co-operation of the robot in question.

To implement robots.txt create a text file called robots.txt in the root directory of your site. Here's a simple example that allows Googlebot full access to your site, but blocks all other bots:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

robots.txt is flexible in that it allows you to block access to particular areas of your site, while leaving others open.
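
For example, a robots.txt along these lines (the /admin/ and /search/ paths are just placeholders for areas of your own site) would keep all co-operating robots out of those two areas while leaving the rest of the site open:

User-agent: *
Disallow: /admin/
Disallow: /search/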

For full details of the robots exclusion standard, including a list of known bots and their purposes, visit the excellent Web Robots Pages (external link).

Most legitimate bots will honour robots.txt, but there are some that don't, including SiteMorse. So, for the purposes of illustration I'll use SiteMorse to demonstrate alternative techniques which can be used for bots that don't offer site owners the courtesy of observing the robots exclusion standard.

2. Blocking IP addresses using <Limit> (Apache only)

If you know the IP address from which the robot is accessing your site, and your site runs on the Apache web server, the <Limit> directive provides a convenient and effective method of blocking access. There are uncertainties involved in this method - IP addresses can change, so you need to check regularly that the bot you're blocking is still using the same address.

At the time of writing SiteMorse's spider operates from the IP address 212.100.249.69. The easiest way to implement <Limit> is via a .htaccess file. To block the SiteMorse bot, create a text file called .htaccess in the root directory of your site (or edit the file if it already exists) and include this content:

<Limit GET POST>
Order Deny,Allow
Deny from 212.100.249.69
</Limit>

Alternatively, if you have access to the Apache httpd.conf file the directive can be included there, in any context - i.e. for the whole server, for a virtual host or for a single directory.
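
As a rough sketch, the same block scoped to a single directory in httpd.conf might look something like this (the directory path is purely illustrative):

<Directory "/var/www/mysite/reports">
    <Limit GET POST>
        Order Deny,Allow
        Deny from 212.100.249.69
    </Limit>
</Directory>

Placed inside a <VirtualHost> block instead, the same <Limit> section would apply the block to that whole virtual host.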

For full details of the <Limit> directive see the official documentation for your version of Apache - version 1.3 (external link) / version 2.0 (external link).

3. Blocking user agents using SetEnvIfNoCase and <Limit> (Apache only)

A more reliable but also more complex method than using IP addresses is to restrict access based on the user-agent string. The user-agent string for the SiteMorse bot will always contain the characters 'b2w' - in fact it begins with them, which is what the pattern below matches.

This time we use the SetEnvIfNoCase directive from Apache's mod_setenvif module, which lets us set an environment variable when a request header matches a regular expression. Note that we're using the case-insensitive SetEnvIfNoCase rather than SetEnvIf, so the match works whatever case the bot's user-agent string uses. Here's how it looks for our SiteMorse block:

SetEnvIfNoCase User-Agent "^b2w" bad_bot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

The first line tests the user-agent string to see if it starts with 'b2w', and if it does it defines the environment variable bad_bot. We then use the <Limit> directive as before to deny access to our site if the bad_bot environment variable is defined.

As before this can be used in any context. However, note that using the SetEnvIfNoCase directive in a .htaccess file requires the FileInfo override to be allowed for that directory.
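
If the directive seems to be ignored when placed in .htaccess, check the AllowOverride setting for the relevant directory in httpd.conf - something along these lines (again, the path is illustrative):

<Directory "/var/www/mysite">
    # FileInfo permits mod_setenvif directives such as SetEnvIfNoCase in .htaccess;
    # Limit permits the Order, Allow and Deny directives
    AllowOverride FileInfo Limit
</Directory>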

Again please check the official documentation for your version of Apache for full details: version 1.3 (external link) / version 2.0 (external link).

4. Blocking IP addresses or user agents using PHP

The final technique I'm going to cover involves blocking access from within your web content. While I wouldn't recommend this approach for blocking a bot from your entire site (the above techniques are far more efficient), it can be useful when you want to block a bot from very specific content, or to serve alternative content based on the user agent or IP address of the bot. For example I used this in the past to serve SiteMorse a different version of the A to Z pages on ClacksWeb, to prevent it from spidering the page for more than one letter.

Here's a quick example of blocking access based on IP address or user-agent string:

<?php
// Block the request if it comes from the SiteMorse IP address
// or if the user-agent string contains 'b2w'
$userAgent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
if ($_SERVER["REMOTE_ADDR"] == "212.100.249.69" || stripos($userAgent, "b2w") !== false) {
    header("HTTP/1.0 403 Forbidden");
    exit;
} else {
    // serve content normally
}
?>

This code is most effective if you use a single controller script, but it can be used in individual scripts as required. Note, though, that because it calls the header() function it must run before the script sends any output.
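
And for the 'alternative content' approach mentioned above, a minimal sketch might branch on the user-agent string rather than sending a 403 - the simple.php and full.php includes here are hypothetical stand-ins for your own scripts:

<?php
// Serve a cut-down page to the SiteMorse bot and the normal page to everyone else
$userAgent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
if (stripos($userAgent, "b2w") !== false) {
    include "simple.php";   // hypothetical stripped-down version of the page
} else {
    include "full.php";     // hypothetical normal version of the page
}
?>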


Other platforms

Comments

I always wondered privately about how ethical it would be to block Sitemorse from parsing our site, given the nonsense that is placed on their publicised reports. Can you say whether you would/do do this?
I've sometimes noticed DoS-like crawling from indexing providers for various Government projects (why can't there be just one?) and at times I've blocked them by IP range until the situation is resolved. I can certainly justify such action when, in one instance, 4GB of data was transferred in one calendar month in direct contravention of a specific robots.txt entry, but how would it be perceived in the case of Sitemorse?

Posted by: Doug at May 31, 2006 9:51 AM

SiteMorse is blocked from accessing ClacksWeb. I'm not going to go through the reasons here though - there's been enough discussion at Accessify Forum in particular to allow people to draw their own conclusions.

The one thing I will say is that in my experience it was a badly behaved bot, making many requests per second at times and having a perceivable effect on performance on our old server. As with your 4GB/month bot, I can think of no good reason why we should put up with that when there's no benefit to us or our users.

Posted by: Dan at May 31, 2006 3:31 PM

Yes some good blocking methods there. I don't think SiteMorse are winning friends.

Posted by: Robert Wellock at June 1, 2006 6:16 PM

Excellent article Dan!

This sort of thing is big on my priority list lately, not so much for 'bot exclusion, but rather to prevent spammers from posting bogus emails via a contact form. My in-use forms now have sanitization employed so they don't pose a risk to me or others by being used for BCC sending, but I still have to get the crap email so I'm trying to make a couple of scripts to handle this.

Many 'bots it seems fake the HTTP_USER_AGENT string, use random IPs, etc., so techniques like this are only quasi-effective (for what I'm trying to do). But for a known, static 'bot... good stuff indeed.

Posted by: Mike Cherim at August 1, 2006 7:44 PM
