Part 3: Web Analytics Quality through Spider and Bot Recognition and Filtering
Welcome to Part 3: Web Analytics Quality through Spider and Bot Recognition and Filtering. My goal with this series of posts was to open discussion about what Alex Langshur of PublicInsite called, in reference to these posts, the “quality of web analytics.” I’m talking about doing the most and the best we can with current technology to ensure quality and integrity in web analytics data collection and reporting - like I know we all do now, right?
It’s about doing the best we can at our jobs because we’re lucky to have them - at least that’s how I feel about my “real” job in web analytics.
To begin detecting bots and hunting spiders, here’s what I recommend:
- Meet industry standards. The Interactive Advertising Bureau publishes a list of spiders, bots and user agents that member companies have confirmed are robotic. You can add these patterns to your list of IPs and agents to exclude. Ask whether your vendor subscribes to this list and thus makes your analytics implementation compliant with the IAB.
- Use public resources. Peter Adams pointed out an excellent resource in the comments on Part 1: Gary Keith’s Browscap project, a must-have component of any web analytics / spider detection effort. A simple Google search reveals more resources to leverage.
- Use hybrid data collection. You can exploit the bot-busting power of page tags (bots that don’t execute JavaScript never fire the tag) and process log files to detect how spiders and bots crawl your site. That’s hybrid. You’ll need a tool that enables you to carve out spider and bot traffic from human traffic (certain tools do this). If you run the logs of a site you’ve page tagged, you can compare the traffic from both sources and look for traffic that appears in both and has unusual patterns, too.
- Identify and then filter unusual traffic patterns. Query your database and/or look at your reporting to identify and filter IP addresses, user agents, or combinations of the two that view a total number of pages you find suspicious. Look for visits that repeat continuously! Check out paths that have over N page views, where N is an arbitrary number larger than the “average pages viewed per visitor” on your site. What’s requesting your robots.txt? What do the attributes of the traffic coming from Africa, the Russian Federation, India, and China look like?
- Talk to your engineers. They are really smart and know lots of things about the web. They are your allies in web analysis. As web analytics practitioner, programmer, and all around smart dude, Steve McInerney mentioned: “…is the biggest giveaway I’ve found. Bots don’t request images/CSS etc as much as people do. Some tricky HTTP headers to force browsers to reload, say certain images or CSS, can go a long way towards helping identify bots - who won’t 304 the image or css.” Your CMO may not understand what Steve is saying, but your CTO will. And Steve may only understand when I say: Aussie! Aussie! Aussie!
- Keep your filter lists up to date. Obviously, doh! You’ll need to maintain your lists of exclusions. Someone needs to be dedicated to doing this process-based activity.
- Identify the “spiders and bots” metrics on your site and compare to suspicious traffic. I know the average number of pages viewed by a spider, the average visit duration per bot, and how other key metrics relate to spiders. I also know this for humans. When I am suspicious about traffic, I can compare attributes of visit or visitor to my bot and human profiles.
- Examine user agents. Look for irregular, unusual, and just plain weird user agents.
- Set up a tarpit. A tarpit is a server to which you redirect bot traffic and serve those bots random binary objects, very slowly, until they stop crawling. When the bot makers check on the results, they find they have nothing and need to recrawl. The idea here is slowing bot evolution. If you block the IPs, the bot makers will know you’ve found them and will adjust their tactics to make their bots less detectable.
- Remember that not all bots are bad, so don’t prevent the good bots from crawling your site! The Googlebot and Yahoo!Slurp index your site’s content so that it is retrievable when people query their search engine. Don’t exclude bots using robots.txt or a tarpit unless you know that bot is nefarious or should be excluded. In practice, I rarely block bots from visiting a site. Rather I segment robotic traffic out of the numbers I report about site behavior and performance.
- Manage expectations. The truth is you can’t filter what you haven’t seen before. You can’t predict the future of new bot behavior. No matter what, bots will get through your data collection methods. As an analyst you need to actively manage expectations by educating your user community, clients, and colleagues. Soon enough you’ll have associates who are savvy human filters letting you know when bots slip through.
- Create a business process for spider and bot reconciliation. The savvy web analytics team has sustainable, manageable, and repeatable processes for scheduled detection, filtering, removal, quality assurance, and exclusion list maintenance.
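Several of the checks in the list above (who requests robots.txt, who views more than N pages, who skips images and CSS) lend themselves to a quick script. Here is a minimal Python sketch, not a finished tool. It assumes you have already parsed your logs into (ip, user_agent, path) tuples; the function names `summarize` and `flag_suspects`, the asset suffix list, and every threshold are hypothetical choices for illustration, so tune them against your own site’s averages.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumption: these suffixes stand in for "images/CSS etc." on your site.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico")


@dataclass
class VisitorStats:
    pages: int = 0            # requests that are not assets or robots.txt
    assets: int = 0           # image/CSS/script requests
    robots_txt: bool = False  # did this visitor ask for robots.txt?


def summarize(hits):
    """Tally requests per (ip, user_agent) from (ip, user_agent, path) tuples."""
    stats = defaultdict(VisitorStats)
    for ip, agent, path in hits:
        visitor = stats[(ip, agent)]
        clean_path = path.split("?", 1)[0].lower()  # ignore query strings
        if clean_path == "/robots.txt":
            visitor.robots_txt = True
        elif clean_path.endswith(ASSET_SUFFIXES):
            visitor.assets += 1
        else:
            visitor.pages += 1
    return stats


def flag_suspects(stats, max_pages=50, min_assets_per_page=0.5, min_pages=3):
    """Return {(ip, user_agent): [reasons]} for visitors worth a closer look.

    max_pages is the arbitrary N from the list above; pick a value larger
    than your site's average pages viewed per visitor.
    """
    suspects = {}
    for visitor, s in stats.items():
        reasons = []
        if s.robots_txt:
            reasons.append("requested robots.txt")
        if s.pages > max_pages:
            reasons.append("page views above N")
        # Skip short visits: one or two pages with cached assets look the same.
        if s.pages >= min_pages and s.assets / s.pages < min_assets_per_page:
            reasons.append("few image/CSS requests")
        if reasons:
            suspects[visitor] = reasons
    return suspects
```

Treat the output as a list of candidates to review and segment, not a verdict. A good bot like the Googlebot will trip the robots.txt check too, which is exactly why I segment this traffic rather than block it.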
Happy Hunting! Click here for Part 4, my “Spider and Bot Capability Maturity Framework.” Thanks for visiting!
Image from link2city.com.
Phillip Crum added the following ...
Excellent article. Love those checklists!
Steve added the following ...
Highly Apropos:
http://www.informationweek.com/news/showArticle.jhtml?articleID=201001092
In essence, bots being used to win prizes from Microsoft for using the Live Search Engine.
I found the comments from a Panel CMO that his panel was (a) Immune to bots; but (b) not immune to Macro abuse; highly entertaining.
i.e., it is vulnerable to bots. He’s just using weasel wording. Sigh.
Cheers!
Steve added the following ...
Oi! Oi! Oi!
Cheers!
(A proper response will be forthcoming…)