Gaylord Bots Worth Blocking
or What User-agents to Use if You Don't Want to Waste Your Precious Bandwidth on Mikey's Crappy Site
Every so often I come across some rogue bot hammering away at my site and messing up my logs with no regard for bandwidth. Or human life.
There are a number of ways of blocking these bandwidth-eating freaks. It's pretty simple to add a couple of entries to your .htaccess files, like so:
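RewriteEngine On
# The pattern is a regex matched as a substring, so no trailing * is needed; [NC] ignores case
RewriteCond %{HTTP_USER_AGENT} useragent_to_block [NC]
# Only apply to URLs that aren't already under folder /baddies/ to prevent a redirect loop
RewriteCond %{REQUEST_URI} !^/baddies/
RewriteRule ^(.*)$ http://mikeybeck.com/baddies/goaway.txt [R=302,L]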
This is effectively a blacklist for nasty bots. Some people take it a step (read: giant leap) further and use a whitelist instead: a list of the only allowed user-agents, with access forbidden to everyone else. While this is a very effective method of blocking spam-bots and the like, it is also likely to reduce the number of genuine visitors to your site, since any legitimate browser or crawler that isn't on the list gets locked out too.
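For the curious, here's roughly what the whitelist approach might look like in .htaccess. This is only a sketch: the three allowed names (Mozilla, Opera, Googlebot) are placeholders I've picked for illustration, not a recommended list, and you'd need to decide for yourself who gets in.

```apache
RewriteEngine On
# Hypothetical whitelist: these names are examples only, swap in your own
# The leading ! negates the match, so the rule fires for everyone NOT on the list
RewriteCond %{HTTP_USER_AGENT} !(Mozilla|Opera|Googlebot) [NC]
# "-" means no substitution; [F] answers with 403 Forbidden
RewriteRule ^ - [F]
```

Because [F] returns a 403 rather than redirecting, there's no loop to guard against here. Note that a blank user-agent fails the whitelist as well, so it gets the 403 along with everyone else.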
Anyway, on to The List...
These are the user agents I block because they hammer my site, appear to be scraping content or harvesting email addresses, or 'cos I just don't like 'em.
adsl-dynamic-pool-xxx.hcm.fpt.vn
Purebot
Sitebot
Exabot
DigExt
SiteBot
Anything with 'Java' in its name. I have nothing against the language, but those Java bots are freakishly annoying.
And anyone without a user-agent. Or with a blank user-agent string. 'Cos that's just dumb.
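Putting The List together, the whole thing might be sketched as a single rule along these lines. I'm assuming case-insensitive substring matching is what you want, and I've left out the fpt.vn entry because it's a hostname rather than a user-agent string, so it would need a different kind of condition.

```apache
RewriteEngine On
# User-agents from the list, matched anywhere in the string; [NC] ignores case,
# so Sitebot and SiteBot are covered by one entry
RewriteCond %{HTTP_USER_AGENT} (Purebot|Sitebot|Exabot|DigExt|Java) [NC,OR]
# Missing or blank User-Agent header
RewriteCond %{HTTP_USER_AGENT} ^$
# Only apply to URLs that aren't already under /baddies/ to prevent a redirect loop
RewriteCond %{REQUEST_URI} !^/baddies/
RewriteRule ^ http://mikeybeck.com/baddies/goaway.txt [R=302,L]
```

The [OR] ties the first two conditions together, and the /baddies/ check is then ANDed with the result, so the rule reads as "(bad bot OR blank user-agent) AND not already in /baddies/". Be aware that matching 'Java' without regard to case will also catch any user-agent that happens to contain 'JavaScript'; drop [NC] from that line if that bothers you.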
