May 2006 Archive
May 26, 2006
Blocking SiteMorse & other unwelcome robots
Like just about every other site owner on the planet, you probably crave more traffic to your corner of the web. But not every visitor to your website should be welcomed with open arms. A good deal of the hits listed in your logs are likely to come from the many programs - commonly known as robots, bots, crawlers and spiders - that automatically trawl the web for a variety of purposes, including:
- Indexing your site (e.g. Googlebot, Inktomi Slurp)
- Gathering statistics (e.g. WebWatch)
- Site maintenance and validation (e.g. LinkWalker, W3CLinkChecker)
Most of these robots are harmless or positively beneficial - there are few reasons for trying to stop the Googlebot having access to your site. But not every robot visits your site with your best interests at heart.
Here are four techniques for blocking unwelcome robots from accessing your site. Before using any of these you will need to know either the IP address the bot is originating from, or the user-agent string the bot uses.
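One way to find those details is to tally the client addresses and user-agent strings in your access log. The sketch below assumes Apache's "combined" log format. The `tally_clients` name, the sample lines and the user-agent values in them are made up for illustration and are not taken from real logs.

```python
import re
from collections import Counter

# Assumes Apache's "combined" log format: the client IP is the first field
# and the user-agent is the last quoted field on the line.
LOG_LINE = re.compile(r'^(\S+) .*"([^"]*)"\s*$')


def tally_clients(lines):
    """Count requests per (ip, user_agent) pair, skipping lines that don't parse."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    '212.100.249.69 - - [26/May/2006:10:00:00 +0100] "GET / HTTP/1.0" 200 512 "-" "b2w/0.1"',
    '212.100.249.69 - - [26/May/2006:10:00:01 +0100] "GET /a HTTP/1.0" 200 512 "-" "b2w/0.1"',
    '192.0.2.10 - - [26/May/2006:10:00:02 +0100] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

# Busiest clients first.
for (ip, agent), hits in tally_clients(sample).most_common():
    print(hits, ip, agent)
```

A client that requests hundreds of pages in a few minutes under a single user-agent string is usually a robot, and the pair printed for it gives you both values needed for the techniques below.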
1. robots.txt
The robots exclusion standard or robots.txt protocol is the most straightforward way of excluding robots from your site. It's platform independent, flexible and easy to set up, but it does require the co-operation of the robot in question.
To implement robots.txt create a text file called robots.txt in the root directory of your site. Here's a simple example that allows Googlebot full access to your site, but blocks all other bots:
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
robots.txt is flexible in that it allows you to block access to particular areas of your site, while leaving others open.
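If you want to check how a well-behaved robot would read a set of rules before you publish them, Python's standard `urllib.robotparser` module can evaluate them. This is a minimal sketch: the `allowed` helper, the example.com URLs and the `/private/` path in the second rule set are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above: Googlebot may go anywhere, others nowhere.
FULL_BLOCK = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

# A partial block: every robot is kept out of one (hypothetical) directory only.
PARTIAL_BLOCK = """\
User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt):
    """Return True if a compliant robot with this name may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("Googlebot", "http://www.example.com/index.html", FULL_BLOCK))
print(allowed("SomeOtherBot", "http://www.example.com/index.html", FULL_BLOCK))
print(allowed("SomeOtherBot", "http://www.example.com/private/a.html", PARTIAL_BLOCK))
print(allowed("SomeOtherBot", "http://www.example.com/index.html", PARTIAL_BLOCK))
```

Bear in mind that this only shows what a robot that honours the file would do. It says nothing about robots that ignore robots.txt altogether.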
For full details of the robots exclusion standard, including a list of known bots and their purposes, visit the excellent Web Robots Pages.
Most legitimate bots will honour robots.txt, but there are some that don't, including SiteMorse. So, for the purposes of illustration I'll use SiteMorse to demonstrate alternative techniques which can be used for bots that don't offer site owners the courtesy of observing the robots exclusion standard.
2. Blocking IP addresses using <Limit> (Apache only)
If you know the IP address from which the robot is accessing your site, and your site runs on the Apache web server, the <Limit> directive provides a convenient and effective method of blocking access. There are uncertainties involved in using this method - IP addresses can change, so you need to check regularly that the bot you're blocking is still using the same address.
At the time of writing SiteMorse's spider operates from the IP address 212.100.249.69. The easiest way to implement <Limit> is via a .htaccess file. To block access to the SiteMorse bot create a text file called .htaccess in the root directory of your site, or if the file already exists edit it, and include this content in the file:
<Limit GET POST>
order deny,allow
deny from 212.100.249.69
</Limit>
Alternatively if you have access to the Apache httpd.conf file the directive can be included in that file, in any context - i.e. for the whole server, for a virtual host or for a single directory.
For full details of the <Limit> directive see the official documentation for your version of Apache - version 1.3 / version 2.0.
3. Blocking user agents using SetEnvIfNoCase and <Limit> (Apache only)
A more reliable but also more complex method than using IP addresses is to use the user-agent string to restrict access. The user-agent string for the SiteMorse bot will always contain the characters 'b2w'.
This time we use the Apache mod_setenvif module directive SetEnvIfNoCase. This allows us to set an environment variable based on a regular expression. Note that we're using the case insensitive SetEnvIfNoCase instead of SetEnvIf. Here's how it looks for our SiteMorse block:
SetEnvIfNoCase User-Agent "^b2w" bad_bot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
The first line tests the user-agent string to see if it starts with 'b2w', and if it does it defines the environment variable bad_bot. We then use the <Limit> directive as before to deny access to our site if the bad_bot environment variable is defined.
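The two <Limit> examples use different Order settings, and the setting decides what happens to requests that match no rule at all. The sketch below models how Apache 1.3 and 2.0 combine Allow and Deny rules for a single request, as described in their documentation. It is an illustrative model only, not Apache's code, and the function names are my own.

```python
def is_allowed(order, matches_allow, matches_deny):
    """Model how the Order directive combines Allow and Deny for one request.

    order          -- "allow,deny" or "deny,allow"
    matches_allow  -- True if any Allow rule matches the request
    matches_deny   -- True if any Deny rule matches the request
    """
    order = order.replace(" ", "").lower()
    if order == "allow,deny":
        # Default is deny: the request must match an Allow rule,
        # and any matching Deny rule overrides that.
        return matches_allow and not matches_deny
    if order == "deny,allow":
        # Default is allow: a matching Deny rule blocks the request
        # unless an Allow rule also matches.
        return matches_allow or not matches_deny
    raise ValueError("unknown Order value: %r" % order)


def ip_block(remote_addr):
    """The IP example: Order deny,allow with a single Deny rule."""
    return is_allowed("deny,allow", matches_allow=False,
                      matches_deny=(remote_addr == "212.100.249.69"))


def user_agent_block(user_agent):
    """The user-agent example: Allow from all, Deny from env=bad_bot."""
    # Mirrors SetEnvIfNoCase User-Agent "^b2w" bad_bot
    bad_bot = user_agent.lower().startswith("b2w")
    return is_allowed("Allow,Deny", matches_allow=True, matches_deny=bad_bot)


print(ip_block("212.100.249.69"), ip_block("192.0.2.10"))
print(user_agent_block("B2W/0.1"), user_agent_block("Mozilla/5.0"))
```

This is why the user-agent example needs "Allow from all": with Order Allow,Deny a request that matches no Allow rule is refused, so leaving that line out would block every visitor.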
As before this can be used in any context. However, note that using the SetEnvIfNoCase directive in a .htaccess file requires the FileInfo override to be enabled (AllowOverride FileInfo) for that directory.
Again please check the official documentation for your version of Apache for full details: version 1.3 / version 2.0.
4. Blocking IP addresses or user agents using PHP
The final technique I'm going to cover involves blocking access from within your web content. While I wouldn't recommend this approach for blocking a bot from your entire site (the above techniques are far more efficient) it can be useful when you want to block a bot from very specific content, or serve alternative content based on the user agent or IP address of the bot. For example I used this in the past to serve SiteMorse a different version of the A to Z pages on ClacksWeb, to prevent it from spidering the page for more than one letter.
Here's a quick example of blocking access based on IP address or user-agent:
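<?php
// The user-agent header is optional, so fall back to an empty string.
$agent = isset($_SERVER["HTTP_USER_AGENT"]) ? $_SERVER["HTTP_USER_AGENT"] : "";
// Block the known IP address, or any user-agent starting with "b2w" (case-insensitive).
if (($_SERVER["REMOTE_ADDR"] == "212.100.249.69") || (stripos($agent, "b2w") === 0)) {
    header("HTTP/1.0 403 Forbidden");
    exit; // stop here so no content is sent after the 403
} else {
    // serve content normally
}
?>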
This code is most effective if you use a single controller script, but can be used in individual scripts as required. Note though that header() only works if it is called before the script sends any output.
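The single-controller idea carries over to other platforms. As a rough sketch, here is the same check written as Python WSGI middleware that wraps a whole application. The `block_bots` and `hello_app` names are hypothetical, and the defaults simply reuse the SiteMorse IP address and 'b2w' prefix from above.

```python
def block_bots(app, blocked_ips=("212.100.249.69",), blocked_agent_prefixes=("b2w",)):
    """Wrap a WSGI app so that requests from listed bots get a 403 response."""
    prefixes = tuple(p.lower() for p in blocked_agent_prefixes)

    def middleware(environ, start_response):
        ip = environ.get("REMOTE_ADDR", "")
        # The user-agent header is optional; compare case-insensitively.
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if ip in blocked_ips or agent.startswith(prefixes):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else is passed through to the real application.
        return app(environ, start_response)

    return middleware


def hello_app(environ, start_response):
    """Stand-in for the real site (hypothetical)."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


application = block_bots(hello_app)
```

Because the check sits in one wrapper, every request passes through it before any page code runs, which is the same benefit a single controller script gives you in PHP.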
Other platforms
- Jack Pickard covers blocking by IP address and user-agent string using .NET

- Noggle provides a couple of links to blocking resources for IIS

May 19, 2006
My top 3 public sector sites
William Heath from the Ideal Government Project turned the tables on me a bit after my rant about the new DTI website, and asked:
What are your three top-rated public-service web sites? And is it just accessibility you focus on, or content and effectiveness?
To answer the second question first, I don't just focus on accessibility, but I do believe that a high degree of accessibility is a fundamental characteristic of any quality website. It's a good general indicator - to be truly accessible a site must have a number of other inherent qualities, including valid, semantic mark-up, good information architecture and a usable interface.
I also like to see apparently minor features like clean, technology-neutral URLs, good 404 pages, robust error recovery, user-friendly functions like RSS feeds and mailing lists, and effective search. It's this attention to detail that separates a great site from an average site for me.
My three top-rated public service sites? It's a great question, and the answer is naturally very subjective. All of these sites have issues (as does every site I've ever worked on I hasten to add), but I'll go for:
Try as I might I couldn't come up with an exemplar from central government. I did consider the National Crime Squad since it's been standards-based for a long time now, but it's a defunct body and the site won't be there much longer.
So what are your top three public sector sites? What gems have I missed? They needn't be UK or English-language sites - it would be really interesting to hear of some top-notch sites from elsewhere.
May 17, 2006
DTI achieves new low
Usually it's accompanied by a feeling of disappointment, resignation and perhaps mild surprise. This week though I'm truly shocked by the mind-numbing, soul-crushing, bile-inducing awfulness of a new UK central government website. I've checked the date on this news release at least half a dozen times in the hope that it says May 2000 and not May 2006, or will reveal itself to be a sick joke. But no luck, it's a fact, the DTI's newly revamped website is about as shit as it's possible for a large, corporate website to be.
To make matters worse it's clear that they either don't know how shit it is, or don't care. Take their accessibility page for example, which boldly claims AA-level standard (sic), and provides a mine of useful information such as how to change the size and colour of text in Netscape. The entire site (thousands of pages at a guess) appears to be devoid of a single heading. It uses a javascript pop-up to provide a printable version of pages.
This time though I'm not just going to whinge about it here, I've been galvanised into action. I'm determined to do some digging to find out just what process was followed to produce this monstrosity, how much it cost and why the eGovernment Unit, whose mission according to the PM is ensuring that IT supports the business transformation of Government itself so that we can provide better, more efficient, public services, are failing so miserably in their responsibility to promote best practice across government.
May 15, 2006
OPSI daisy
Apparently our favourite automated accessibility testing company, SiteMorse, has been working with the Office of Public Sector Information (OPSI) to make the OPSI website accessible. The impressive press release at e-consultancy tells us that the site is "accessible to all", that OPSI are aiming for AAA compliance, and that SiteMorse is part of the ideal solution for them to achieve it.
John Sheridan, who heads up the OPSI, goes so far as to say:
Automated testing was the obvious answer as it can check thousands of pages and site journey permutations in minutes, saving time and resources compared to manual testing. Of course there is still a need for manual testing for areas that cannot be checked automatically, e.g. images matching alternative text tags.
After reading the press release you'd be forgiven for thinking that the OPSI site must be a paragon of accessibility, representing the very best current practice and thinking around web accessibility. Sadly you'd be wrong. Sure it's better than many central government websites, but as I've documented here in the past that's not a very difficult thing to achieve.
I only spent 10 minutes on the OPSI site, but here is a list of the serious accessibility problems I identified in that short time (and which were obviously not picked up by SiteMorse):
- Links are distinguished by colour alone
- Text contrast is very low in some areas
- Placeholder text is used in form fields
- Forms accept submission of placeholder values
- Form controls have no related labels
- Search results are horribly unstructured
- Skip to content link is hidden using display:none
- The target for the skip to content link doesn't have IE "layout"
- There's no focus or active styling on links
- The discussion forums are riddled with HTML errors
- Non-semantic markup is frequently used, for example the list of years for UK statutory instruments is presented as a series of paragraphs
- The site breaks when the text size is increased 2 steps at 800x600 and 1024x768
These are all basic errors which any developer with an understanding of accessibility and web standards issues would have avoided during the design and build phases of site development.
The intention here isn't to pillory the OPSI. They've got a vast range of information across thousands of pages which they are trying to make as accessible as possible. The problem is that, like many local authorities, they appear to have been seduced into thinking that the way to achieve accessibility is to run automated tests, then pick up the pieces. This approach is fundamentally flawed. Fixing the things found by automated software does not make an inaccessible site accessible.
Accessibility must be built in from the start, and that obviously requires an understanding of what makes an accessible site. The answer is to invest in your own knowledge of accessibility (buy some books, visit some forums, subscribe to some mailing lists) and to apply that knowledge and understanding to the design and build of your website. Then use the W3C validator and a free tool like TAW3 (which are extremely helpful for finding typos rather than fundamental grammatical errors), and finally get some users to test it. Just don't believe the SiteMorse hype.
Thanks to Isolani for the OPSI/SiteMorse link.
May 4, 2006
Website Accessibility 2006
From the blurb:
Website Accessibility 2006 is a major UK conference on public sector website accessibility, giving the latest guidance and best-practice including the new PAS 78 guidelines.
Apart from the fact that I'm speaking at the event, I'm excited on three counts:
- It's being held at the wonderful Scotsman Hotel in Edinburgh, without doubt the best hotel I've stayed in in the UK. We were there two years ago for my birthday, and again last year before Christmas. It'll be strange being there without my wife, but I'll survive.
- It's a damned good cast, and I'm delighted to have been invited to share the stage with them all.
- It's a major UK public sector event, and it's being held in Scotland. This happens all too infrequently IMHO, and given the venue there's the real possibility that this event will set a trend.
The only downside is that it's the day before I travel down to London for @media - I could really have done with the chance to observe the top quality speakers there first to get a few tips.