Part 2: The Crawlers, Spiders, and Bots of Web Analytics 2.0
In Part 1, I blogged about a huge challenge in web analytics: detecting, analyzing, and filtering bots, spiders, and crawlers. According to robotstxt.org, “robots” are:
“…programs that automatically traverse the web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced… the specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time… it is still a robot.”
In some regards it is “taboo” to speak about evil bots, but I think it’s a topic that needs to be brought out of the dark. If we have to pay for page views, I require and expect accurate page view metrics.
Understanding the characteristics of bots helps you detect them and purify your web metrics, so you can deliver more accurate analysis. Let’s take a deeper look at other bot behaviors in Web Analytics 2.0:
- JavaScript-executing bots. Yes, they exist. Big time. They will invoke your beacon code and generate large numbers of false page views, which your vendor should be filtering out. Maybe your vendor filters traffic where large numbers of page views are generated in a single visit, or maybe you are paying for them. Check your pathing reports or your ratios, like view:visit. Do you see visits that have hundreds or thousands of page views? Chances are those are bots (or your configuration may be technically challenged).
A large search engine company’s client service folks even hinted that JavaScript-executing bots are being programmed to execute queries on internal search engines, thus invoking an embedded advertising module and inflating the number of requests for the ad module. While that doesn’t affect clickthrough or revenue and has nothing at all to do with the ad technology or the engine, it affects metrics. It confirms what I see.
Another favorite JavaScript-executing bot I know likes to exploit HTML commenting left on in blogging architectures. The bot spams comments with nefarious link farms or other unsavory content. It executes JavaScript to do it.
Take, for example, the “rofovnifo” bot. It fills out blog comments with the following content, leaving links to the nefarious goldwarez.org (don’t go there!), coming from the Comcast IP address 67.160.110.199:
Hi. Looks good! Very useful, good stuff. Good resources here. Thanks much! G’night
A quick Google search reveals “rofovnifo” really likes Cabela’s, sailing, Friendster, XBOX 360, car audio, card sharing, and world affairs, among many other things. If only “rofovnifo” were a legitimate human, then I’d believe the traffic he generated and continues to generate across the 94,500 links he’s visited across the world over the last several days! (UPDATE: 127,400 links… wait 128,000)
- Enter your site from various referrers using various methods. One bot that I find fascinating executes a query on a particular country’s Google site (let’s say Google.au), then crawls the site pages returned in the SERP (search engine results page). Next it indexes the content on the returned page. In the case I observed, this bot crawls a directory of subject matter related to a certain discipline. The directory has thousands of pages. So what I see is the bot coming from Google.au from a certain keyword, entering the site, and crawling 4,000+ pages during one visit that lasts a few hours. Not human.
- Come from different IP addresses or subnets. I’ll find bots that jump around IP addresses, but always hit the same page or pages for the same amount of time. Or the bots will come from different subnets and hit the same page at a given periodicity (like every 15 minutes).
- Repeatedly hit one page, repeatedly! In a former job, I noticed that one educational institution referred a bot to the site only on the first of the month. The bot would crawl 77,000 pages then stop. Are these clickfraud bots? What are they? Were they testing a prototype?
- Spoof their user agent. Nasty bots don’t want to say “hey, I am a nasty bot” so they try to hide themselves by looking like normal user agents.
Mozilla compatible user agents look something like this:
xx.xxx.xx.xxx:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.307; .NET CLR 1.1.4321; Media Center PC 4.0; .NET)
Bots will spoof user agents to get by your filters. Sometimes they are sloppy in doing so, which makes them instantly detectable. Notice the missing MSIE version, the upper-case WINDOWS, and the shortened NET reference:
xx.xxx.xx.xxx:Mozilla/4.0 (compatible; MSIE; WINDOWS 5.1; SV1; NET 1.0; Media Center PC 4.0; .NET)
- Use cookies. Weird user agents that use cookies set off a huge red flag. You need a web analytics tool that allows you to cross the cookie dimension with user agents.
- Pass cookies. Even worse than bots that simply use cookies are bots that pass one cookie across multiple IP addresses, visiting your site consecutively during a short time period and then disappearing after crawling huge numbers of pages in that short duration… very odd indeed.
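Several of the signals in the list above can be turned into simple checks. What follows is only a minimal Python sketch, assuming you can export hit-level data (one row per page view) from your analytics tool. The `Hit` record, the function names, and both thresholds are hypothetical and chosen purely for illustration; they do not come from any vendor’s product. The cookie check also ignores timestamps, so slice your data into short time windows before calling it.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Hit(NamedTuple):
    """One page view from a hypothetical hit-level export."""
    visit_id: str
    cookie_id: str
    ip: str
    user_agent: str


# Assumed thresholds; tune them against your own traffic.
MAX_VIEWS_PER_VISIT = 500
MAX_IPS_PER_COOKIE = 3

# A browser that claims MSIE normally sends a version, e.g. "MSIE 6.0".
MSIE_VERSION = re.compile(r"MSIE \d+\.\d+")


def heavy_visits(hits: Iterable[Hit]) -> Dict[str, int]:
    """Return visits whose page view count exceeds MAX_VIEWS_PER_VISIT."""
    views = Counter(hit.visit_id for hit in hits)
    return {visit: n for visit, n in views.items() if n > MAX_VIEWS_PER_VISIT}


def roaming_cookies(hits: Iterable[Hit]) -> Dict[str, Set[str]]:
    """Return cookies seen on more than MAX_IPS_PER_COOKIE distinct IPs."""
    ips_by_cookie: Dict[str, Set[str]] = defaultdict(set)
    for hit in hits:
        ips_by_cookie[hit.cookie_id].add(hit.ip)
    return {
        cookie: ips
        for cookie, ips in ips_by_cookie.items()
        if len(ips) > MAX_IPS_PER_COOKIE
    }


def looks_like_sloppy_msie(user_agent: str) -> bool:
    """Flag MSIE user agents with a missing version or an upper-case WINDOWS."""
    if "MSIE" not in user_agent:
        return False
    missing_version = MSIE_VERSION.search(user_agent) is None
    return missing_version or "WINDOWS" in user_agent
```

Run against the two user agent strings shown earlier, `looks_like_sloppy_msie` leaves the normal one alone and flags the spoofed one. Treat anything these checks return as a candidate for review rather than proof of a bot: a carefully spoofed user agent will pass the string check, and the thresholds are guesses.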
What do you see on your sites? How are you dealing with spiders and bots?
Check out Part 3, where I will provide recommendations for keeping your analytics up-to-date with Web 2.0 bots and spiders. Thanks for visiting!
Wendi Malley added the following ...
Great post Judah! I am learning a lot from this. Look forward to part III.
Wendi
Ken Pendergast added the following ...
I was just looking at a bot and spider building book the other day. It’s a shame people only talk about them in a negative sense.
If you have ever used a link checker on your site or a Google sitemap generator, you know those tools scan every resource on your site in a matter of minutes. So you really have to think about who else is doing this, or using similar technology.
When you check the reports and see 1 visit from a small country in South America, you really have to wonder whether it was a real visitor or a crawler.
Judah added the following ...
Steve: That’s a remarkable observation. Thanks for sharing! The Russian bots I’ve seen tend to be very sophisticated indeed.
Eric: Thank you!
Wendi: Good to hear and thanks for mentioning it.
Ken: Yes, exactly… What are those bots doing? I agree that spiders and bots aren’t all bad, which is why you can simply filter them from your reporting. You don’t have to block them at the firewall, stop collecting the data, or delete it, unless you want to. Imagine the implications of blocking the Googlebot…
Judah Phillips at Web Analytics Demystified » Blog Archive » Part 1: Spiders, Bots, Page Views, and Web Analytics, Oh My! added the following ...
[…] Let’s take a closer look at the nuances of new-wave bots in Part 2. […]


Steve added the following ...
Another gem Judah!
It almost pains me to have to point out a way below trivial correction:
google.com.au, not google.au.
But that alone is a nice trick that some folk use to hide themselves. Those who have the ability to set their own Reverse DNS sometimes will fake coming from well known sites.
eg. An IP Address in Russia resolves to “googlebot.com.au” type of thing.
Which is another clue of an “Evil Bot”.
Cheers!