May 2006 Archive


May 26, 2006

Blocking SiteMorse & other unwelcome robots

@ 11:03 AM

Like just about every other site owner on the planet, you probably crave more traffic to your corner of the web. But not every visitor to your website should be welcomed with open arms. A good deal of the hits listed in your logs are likely to come from the many programs - commonly known as robots, bots, crawlers and spiders - that automatically trawl the web for a variety of purposes, from indexing pages for search engines to running automated tests.

Most of these robots are harmless and positively beneficial - there are few reasons to stop the Googlebot from having access to your site. But not every robot visits your site with your best interests at heart.

Here are four techniques for blocking unwelcome robots from accessing your site. Before using any of these you will need to know either the IP address the bot originates from, or the user-agent string it uses.

1. robots.txt

The robots exclusion standard, or robots.txt protocol, is the most straightforward way of excluding robots from your site. It's platform independent, flexible and easy to set up, but it does require the co-operation of the robot in question.

To implement robots.txt, create a text file called robots.txt in the root directory of your site. Here's a simple example that allows Googlebot full access to your site, but blocks all other bots:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

robots.txt is flexible in that it allows you to block access to particular areas of your site, while leaving others open.
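
As a rough sketch, assuming your site has a couple of areas you'd rather keep bots out of (the /search/ and /admin/ directories below are just placeholders), a robots.txt along these lines blocks all robots from those two directories while leaving the rest of the site open:

# Hypothetical example: keep all robots out of two directories,
# but allow access to everything else
User-agent: *
Disallow: /search/
Disallow: /admin/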

For full details of the robots exclusion standard, including a list of known bots and their purposes, visit the excellent Web Robots Pages (external link).

Most legitimate bots will honour robots.txt, but there are some that don't, including SiteMorse. So, for the purposes of illustration, I'll use SiteMorse to demonstrate alternative techniques that can be used against bots that don't offer site owners the courtesy of observing the robots exclusion standard.

2. Blocking IP addresses using <Limit> (Apache only)

If you know the IP address from which the robot is accessing your site, and your site runs on the Apache web server, the <Limit> directive provides a convenient and effective method of blocking access. There are uncertainties involved in using this method - IP addresses can change, so you need to check regularly that the bot you're blocking is still using the same address.

At the time of writing SiteMorse's spider operates from the IP address 212.100.249.69. The easiest way to implement <Limit> is via a .htaccess file. To block the SiteMorse bot, create a text file called .htaccess in the root directory of your site (or edit it if it already exists) and include this content:

<Limit GET POST>
Order Deny,Allow
Deny from 212.100.249.69
</Limit>

Alternatively, if you have access to the Apache httpd.conf file, the directive can be included in that file in any context - i.e. for the whole server, for a virtual host or for a single directory.
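
As a sketch of the httpd.conf approach, assuming your content lives somewhere like /var/www/html (a placeholder path), the same block scoped to a single directory might look like this:

# Hypothetical httpd.conf snippet: block the SiteMorse IP address for one directory
<Directory "/var/www/html">
    <Limit GET POST>
        Order Deny,Allow
        Deny from 212.100.249.69
    </Limit>
</Directory>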

For full details of the <Limit> directive see the official documentation for your version of Apache - version 1.3 (external link) / version 2.0 (external link).

3. Blocking user agents using SetEnvIfNoCase and <Limit> (Apache only)

A more reliable, but also more complex, method than using IP addresses is to use the user-agent string to restrict access. The user-agent string for the SiteMorse bot always starts with the characters 'b2w'.

This time we use the SetEnvIfNoCase directive from Apache's mod_setenvif module, which allows us to set an environment variable based on a regular expression match against a request header. Note that we're using the case-insensitive SetEnvIfNoCase rather than SetEnvIf. Here's how it looks for our SiteMorse block:

SetEnvIfNoCase User-Agent "^b2w" bad_bot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

The first line tests the user-agent string to see if it starts with 'b2w', and if it does it sets the environment variable bad_bot. We then use the <Limit> directive as before to deny access to our site if the bad_bot environment variable is set.

As before this can be used in any context. Note, however, that using the SetEnvIfNoCase directive in a .htaccess file requires the FileInfo override to be allowed (via AllowOverride) for the directory concerned.
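
As a rough sketch, assuming the .htaccess file lives under /var/www/html (again a placeholder path), the relevant httpd.conf entry might look like this - the Limit keyword is only needed if the Order, Allow and Deny directives are also in the .htaccess file:

# Hypothetical httpd.conf snippet: allow .htaccess files in this directory to use
# SetEnvIfNoCase (FileInfo) and Order/Allow/Deny (Limit)
<Directory "/var/www/html">
    AllowOverride FileInfo Limit
</Directory>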

Again please check the official documentation for your version of Apache for full details: version 1.3 (external link) / version 2.0 (external link).

4. Blocking IP addresses or user agents using PHP

The final technique I'm going to cover involves blocking access from within your web content. While I wouldn't recommend this approach for blocking a bot from your entire site (the above techniques are far more efficient), it can be useful when you want to block a bot from very specific content, or to serve alternative content based on the user agent or IP address of the bot. For example, I've used this in the past to serve SiteMorse a different version of the A to Z pages on ClacksWeb, to prevent it from spidering the page for more than one letter.

Here's a quick example of blocking access based on IP address or user-agent:

<?php
// Return 403 Forbidden to the SiteMorse IP address, or to any
// user-agent string containing 'b2w'
if (($_SERVER["REMOTE_ADDR"] == "212.100.249.69")
    || (stristr($_SERVER["HTTP_USER_AGENT"], "b2w") !== false)) {
    header("HTTP/1.0 403 Forbidden");
    exit;
} else {
    // serve content normally
}
?>

This code is most effective if you use a single controller script, but it can be used in individual scripts as required. Note though that the header function must be called before the script sends any output.
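
As a variation on the same idea, here's a sketch of serving alternative content rather than a 403 - the two include files are purely hypothetical stand-ins for your own scripts:

<?php
// Hypothetical sketch: serve a cut-down page to the bot,
// and the normal page to everyone else
$is_bot = ($_SERVER["REMOTE_ADDR"] == "212.100.249.69")
    || (stristr($_SERVER["HTTP_USER_AGENT"], "b2w") !== false);

if ($is_bot) {
    include "simple-version.php"; // hypothetical cut-down page
} else {
    include "full-version.php";   // hypothetical normal page
}
?>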


Other platforms

The techniques above are specific to Apache and PHP. If your site runs on a different platform, such as IIS, check your server's documentation for its equivalent mechanisms for filtering requests by IP address or user agent.

May 19, 2006

My top 3 public sector sites

@ 9:37 PM

William Heath from the Ideal Government Project (external link) turned the tables on me a bit after my rant about the new DTI website, and asked:

What are your three top-rated public-service web sites? And is it just accessibility you focus on, or content and effectiveness?

To answer the second question first, I don't just focus on accessibility, but I do believe that a high degree of accessibility is a fundamental characteristic of any quality website. It's a good general indicator - to be truly accessible a site must have a number of other inherent qualities, including valid, semantic mark-up, good information architecture and a usable interface.

I also like to see apparently minor features like clean, technology-neutral URLs, good 404 pages, robust error recovery, user-friendly functions like RSS feeds and mailing lists, and effective search. It's this attention to detail that separates a great site from an average site for me.

My three top-rated public service sites? It's a great question, and the answer is naturally very subjective. All of these sites have issues (as does every site I've ever worked on, I hasten to add), but I'll go for:

  1. Royal Borough of Kensington and Chelsea (external link)
  2. London Borough of Lambeth (external link)
  3. Lincolnshire County Council (external link)

Try as I might I couldn't come up with an exemplar from central government. I did consider the National Crime Squad (external link) since it's been standards-based for a long time now, but it's a defunct body and the site won't be there much longer.

So what are your top three public sector sites? What gems have I missed? They needn't be UK or English-language sites - it would be really interesting to hear of some top-notch sites from elsewhere.

May 17, 2006

DTI achieves new low

@ 10:56 PM

Layout tables galore on the DTI website

Coming across a shoddy new government website is nothing new; usually it's accompanied by a feeling of disappointment, resignation and perhaps mild surprise. This week though I'm truly shocked by the mind-numbing, soul-crushing, bile-inducing awfulness of a new UK central government website. I've checked the date on this news release (external link) at least half a dozen times in the hope that it says May 2000 and not May 2006, or will reveal itself to be a sick joke. But no luck: it's a fact, the DTI's newly revamped website (external link) is about as shit as it's possible for a large, corporate website to be.

To make matters worse it's clear that they either don't know how shit it is, or don't care. Take their accessibility page (external link), for example, which boldly claims "AA-level standard" (sic), and provides a mine of useful information such as how to change the size and colour of text in Netscape. The entire site (thousands of pages at a guess) appears to be devoid of a single heading. It uses a JavaScript pop-up to provide a printable version of pages.

This time though I'm not just going to whinge about it here; I've been galvanised into action. I'm determined to do some digging to find out just what process was followed to produce this monstrosity, how much it cost, and why the eGovernment Unit (external link), whose mission according to the PM is "ensuring that IT supports the business transformation of Government itself so that we can provide better, more efficient, public services", are failing so miserably in their responsibility to promote best practice across government.

May 15, 2006

OPSI daisy

@ 1:51 PM

Apparently our favourite automated accessibility testing company, SiteMorse, has been working with the Office of Public Sector Information (external link) (OPSI) to make the OPSI website accessible. The impressive press release at e-consultancy (external link) tells us that the site is "accessible to all", that OPSI are aiming for AAA compliance, and that SiteMorse is part of the ideal solution for them to achieve it.

John Sheridan, who heads up the OPSI, goes so far as to say:

Automated testing was the obvious answer as it can check thousands of pages and site journey permutations in minutes, saving time and resources compared to manual testing. Of course there is still a need for manual testing for areas that cannot be checked automatically, e.g. images matching alternative text tags.

After reading the press release you'd be forgiven for thinking that the OPSI site must be a paragon of accessibility, representing the very best current practice and thinking around web accessibility. Sadly you'd be wrong. Sure it's better than many central government websites, but as I've documented here in the past, that's not a very difficult thing to achieve.

I only spent 10 minutes on the OPSI site, but even in that short time I identified a number of serious accessibility problems, none of which had obviously been picked up by SiteMorse.

These are all basic errors which any developer with an understanding of accessibility and web standards issues would have avoided during the design and build phases of site development.

The intention here isn't to pillory the OPSI. They've got a vast range of information across thousands of pages which they are trying to make as accessible as possible. The problem is that, like many local authorities (external link), they appear to have been seduced into thinking that the way to achieve accessibility is to run automated tests, then pick up the pieces. This approach is fundamentally flawed. Fixing the things found by automated software does not make an inaccessible site accessible.

Accessibility must be built in from the start, and that obviously requires an understanding of what makes an accessible site. The answer is to invest in your own knowledge of accessibility (buy some books, visit some forums, subscribe to some mailing lists) and to apply that knowledge and understanding to the design and build of your website. Then use the W3C validator (external link) and a free tool like TAW3 (external link) (both extremely helpful, though better at finding the equivalent of typos than fundamental grammatical errors), and finally get some users to test it. Just don't believe the SiteMorse hype.

Thanks to Isolani (external link) for the OPSI/SiteMorse link.

May 4, 2006

Website Accessibility 2006

@ 5:44 PM

From the blurb (external link):

Website Accessibility 2006 is a major UK conference on public sector website accessibility, giving the latest guidance and best-practice including the new PAS 78 guidelines.

Apart from the fact that I'm speaking at the event, I'm excited on three counts:

  1. It's being held at the wonderful Scotsman Hotel (external link) in Edinburgh, without doubt the best hotel I've stayed in in the UK. We were there two years ago for my birthday (external link), and again last year before Christmas. It'll be strange being there without my wife, but I'll survive.
  2. It's a damned good cast (external link), and I'm delighted to have been invited to share the stage with them all.
  3. It's a major UK public sector event, and it's being held in Scotland. This happens all too infrequently IMHO, and given the venue there's the real possibility that this event will set a trend.

The only downside is that it's the day before I travel down to London for @media - I could really have done with the chance to observe the top quality speakers there first to get a few tips.