Stupid Bots Worth Blocking

or What User-agents to Use if You Don’t Want to Waste Your Precious Bandwidth on Mikey’s Crappy Site

Every so often I come across some rogue bot hammering away at my site and messing up my logs with no regard for my precious precious bandwidth.

There are a number of ways of blocking these bandwidth-eating freaks. It’s pretty simple to add a couple of entries to your .htaccess files, like so:

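RewriteEngine On
# Match any request whose User-Agent contains this string
RewriteCond %{HTTP_USER_AGENT} useragent_to_block
# Only apply to URLs that aren’t already under folder /baddies/ to prevent a redirect loop
RewriteCond %{REQUEST_URI} !^/baddies/
# Redirect everything else to /baddies/ (target assumed from the condition above)
RewriteRule ^(.*)$ /baddies/ [R,L]
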
This is effectively a blacklist for nasty bots. Some people take it a step (read: giant leap) further and use a whitelist instead: a list of the only allowed user-agents, with access forbidden to everyone else. While this is a very effective method of blocking spam-bots and the like, it is also likely to turn away genuine visitors whose browsers happen not to be on the list.
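
For the curious, a whitelist might look something like the sketch below. The three allowed patterns are made up for illustration (they are not a list I use), and the rule answers with a 403 Forbidden rather than a redirect.

```apache
RewriteEngine On
# Hypothetical whitelist: only user-agents containing one of these strings get in
RewriteCond %{HTTP_USER_AGENT} !(Mozilla|Opera|Googlebot) [NC]
# Everyone else gets a 403 Forbidden
RewriteRule ^ - [F]
```

Bear in mind that plenty of bots also put ‘Mozilla’ in their user-agent string, so a substring whitelist this crude lets more through than you might expect.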

Anyway, on to The List…

These are the user agents I block because they hammer my site, appear to be scraping content or harvesting email addresses, or ‘cos I just don’t like ‘em.

  • Purebot
  • Sitebot
  • Exabot
  • DigExt
  • SiteBot
  • Anything with ‘Java’ in its name. I have nothing against the language, but those Java bots are freakishly annoying.
  • And anyone without a user-agent. Or with a blank user-agent string. ‘Cos that’s just dumb.
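
Put together, The List could be turned into a single rule along these lines. This is a rough sketch rather than my exact config: it assumes a case-insensitive substring match is good enough, and it sends a 403 Forbidden instead of redirecting to a folder.

```apache
RewriteEngine On
# Block the listed agents (substring match, case-insensitive)
RewriteCond %{HTTP_USER_AGENT} (Purebot|Sitebot|Exabot|DigExt|Java) [NC,OR]
# Also block requests with a missing or blank User-Agent
RewriteCond %{HTTP_USER_AGENT} ^$
# Refuse the request with a 403 Forbidden
RewriteRule ^ - [F]
```

The [NC] flag means one ‘Sitebot’ entry covers both spellings, and the ‘Java’ pattern will catch any user-agent containing that string anywhere, so test it against your own logs before trusting it.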

