17 December 2021
A common nuisance for website owners is useless bots and crawlers. So-called SEO bots in particular can be a pain. They have a habit of eating up all your website’s resources by making an excessive number of hits, and you get nothing in return. Some of these bots look for a robots.txt file before they start hitting your website, but that is of little help if your website is attacked by a bot you didn’t know about.
You can quickly stop a bot in its tracks via your website’s .htaccess file. For instance, earlier today I found a bot called DataForSeoBot that was grinding a website to a halt. The bot used this user agent:
"Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)"
The following rule returns an error 403 (“forbidden”) if the user agent contains the (case-insensitive) string “dataforseobot”:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "dataforseobot" [NC]
RewriteRule "^.*$" - [F,L]
The rule is similar to the rules I used in the article about denying access to URLs. It again uses Apache’s mod_rewrite module. The main difference is that the rule matches a user agent (%{HTTP_USER_AGENT}) rather than a URL (%{REQUEST_URI}).

So, the rewrite condition checks if the user agent includes the string dataforseobot, and the NC flag makes the match case-insensitive. It is worth noting that the double quotes around the string are redundant in this example – you only need them if the string you want to match contains one or more spaces.

Next, the rewrite rule matches any string ("^.*$") and the F flag returns an error 403. The L flag tells Apache not to process any other rules in the .htaccess file.
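As an aside, if your server runs Apache 2.4 or later you can get the same result without mod_rewrite, using an <If> block and mod_authz_core. The snippet below is only a sketch of that alternative – whether it works from a .htaccess file depends on what your host’s AllowOverride settings permit:

<If "tolower(%{HTTP_USER_AGENT}) =~ /dataforseobot/">
    # Deny the request outright, which makes Apache return an error 403.
    Require all denied
</If>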
You can use a simple regular expression to match multiple user agents. For instance, another naughty bot I encountered recently identified itself as “trendkite-akashic-crawler”. To match both the DataForSeoBot and the Trendkite crawler you can use this rule:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "(dataforseobot|trendkite-akashic)" [NC]
RewriteRule "^.*$" - [F,L]
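If the list keeps growing, a long alternation can become hard to read. mod_rewrite also lets you split the pattern over several conditions joined with the OR flag; the sketch below should be equivalent to the rule above:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "dataforseobot" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} "trendkite-akashic" [NC]
RewriteRule "^.*$" - [F,L]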
As mentioned, the above rules return an error 403 if the user agent is matched. To check that your rules are working, you can therefore look for the bot’s user agent and the 403 status code in your website’s access log. In the entry below, you can see that the bot tried to access /foo.html and that Apache returned an error 403:
1.2.3.4 - - [13/Dec/2021:13:59:06 +0000] "GET /foo.html HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)"
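If you want a rough count of how often a blocked bot is still knocking on the door, you can combine the user agent with the 403 status code in a quick grep. The log path below is an assumption – it differs per distribution and hosting setup:

$ grep "DataForSeoBot" /var/log/apache2/access.log | grep -c '" 403 '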
If you have access to cURL, then you can also check your rules by spoofing the user agent:
$ curl -IL -A "Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot)" http://example.com/
HTTP/1.1 403 Forbidden
...
The curl command uses three options:

-I returns just the server’s response headers, which include the status code. It doesn’t download the web page.
-L makes cURL follow any redirects, such as a redirect from HTTP to HTTPS.
-A specifies the user agent that is sent to the server. This is what allows you to spoof the user agent.

Ideally, you would only need to tell bots what they are and aren’t allowed to crawl via a robots.txt file. In practice, this approach doesn’t really work. There are too many bots, and new bots are let loose all the time. Keeping track of them quickly becomes a full-time job.
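For reference, asking a single bot to stay away via robots.txt would look something like the sketch below. The user agent token is an assumption based on the bot’s name, and nothing forces a crawler to honour it:

User-agent: DataForSeoBot
Disallow: /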
There is the option to only allow specific bots. However, many bots check whether they are explicitly denied or allowed, and if they are not listed in the robots.txt file they follow whatever rule applies to the Googlebot. This effectively gives them carte blanche, as very few websites deny the almighty Googlebot. Plus, there are also lots of bots that simply ignore robots.txt files.
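For illustration, such an allow-list robots.txt would look something like this sketch, with an empty Disallow for the bots you trust and a blanket Disallow for everyone else – which, as just noted, badly behaved bots treat as an invitation rather than a barrier:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /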
In short, blocking naughty bots is a sensible approach. The bots will still try to crawl your website, but they are always denied access. Returning an error 403 uses hardly any resources, and the bots can therefore no longer cause your website to slow down.