
Judah Phillips is an experienced web analytics practitioner and Internet expert currently working as a Senior Director at a large, global Internet company. His blog is full of useful, unbiased, actionable insights learned from the real-world practice of a process-oriented, integrated approach to strategic Web Analytics for improving business performance.


Part 4: Spider and Bot Capability Maturity Framework for Web Analytics 2.0

Let’s conclude this series on the spiders and bots of web analytics 2.0 with a framework for ranking the spider and bot capability maturity of a web analytics practice!

Before I do so, let me mention a few things:

  • How, and whether, you detect spiders and bots depends on your data collection method. Page tags are commonly thought to exclude spiders and bots because most bots don’t execute JavaScript or request images. That holds only until they do. Log file data must be examined and filtered (programmatically, manually, or both), and those filters need to be maintained by a web analytics professional.
  • Bots are used for many purposes: inflating metrics, indexing text for search, chatting, spamming, data mining, site performance monitoring, click fraud, and more.
  • Bots are constantly evolving. Just like web analysis. :)
  • Not all bots are bad news. You want bots crawling your site. Some bots are good and helpful for your online business. Imagine if you blocked Googlebot. Segment the robotic traffic into separate reporting (in the best case) and make sure it’s filtered from your externally reported numbers.
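
To make the log file filtering above a little more concrete, here is a minimal Python sketch that splits log lines into bot and human buckets by looking for known signatures in the user agent. It assumes the logs are in Combined Log Format, where the user agent is the last quoted field. The signature list, the function names, and the sample lines are all made up for illustration; a real exclusion list is much longer and needs regular upkeep.

```python
import re

# Illustrative signatures only (an assumption, not a complete or maintained list).
BOT_SIGNATURES = ("googlebot", "slurp", "crawler", "spider")

# In Combined Log Format the user agent is the last quoted field on the line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

# Made-up sample lines using documentation-reserved IP addresses.
SAMPLE_LINES = [
    '192.0.2.1 - - [10/Oct/2008:13:55:36 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [10/Oct/2008:13:55:40 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.0"',
    '203.0.113.9 - - [10/Oct/2008:13:56:02 -0700] "GET /robots.txt HTTP/1.0" '
    '200 120 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"',
]


def user_agent(log_line):
    """Return the user agent from a log line, or an empty string if none is found."""
    match = UA_PATTERN.search(log_line)
    return match.group(1) if match else ""


def is_bot(log_line):
    """True when the user agent contains one of the known bot signatures."""
    ua = user_agent(log_line).lower()
    return any(signature in ua for signature in BOT_SIGNATURES)


def segment(log_lines):
    """Split log lines into (bot_lines, human_lines) without discarding either."""
    bots, humans = [], []
    for line in log_lines:
        (bots if is_bot(line) else humans).append(line)
    return bots, humans
```

Calling `segment(SAMPLE_LINES)` keeps the robotic traffic in its own bucket rather than throwing it away, so it can still be reported separately. Signature matching only catches bots that announce themselves in the user agent, which is one reason the filters need ongoing maintenance.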

Now, without further ado, here’s a web analytics 2.0 framework for ranking your spider and bot capability maturity:

  • Red.  You are in the red if the following applies:
    • I don’t understand the impact of spider and bot traffic nor do I measure, filter, or segment it at all. 
    • If I run an in-house solution, I haven’t updated my list of filters and exclusions since I deployed it. 
    • If I run a hosted solution, I rely solely on my vendor to control all aspects of bot filtering and data purification.
    • I don’t know when or how the Googlebot hits my site, but it sure does! 
    • I’ve never thought of reporting spider traffic for SEO.
    • I think the words “yahoo” and “slurp” refer to delicious soup.
    • Spiders scare me.
    • I think all bots are bad (no, they aren’t!).
  • Yellow. You are moving out of the red zone and into the yellow if:
    • I occasionally look for suspicious traffic. 
    • If I run an in-house solution, I have occasionally updated my list of filters and exclusions since I deployed it. 
    • If I run a hosted solution, I rely solely on my vendor to control all aspects of bot filtering and data purification, but I ask for verifications of compliance with industry standards.
    • I’ve heard of the Googlebot and maybe I use or I am thinking about using Site Maps (do it!). 
    • I know about Yahoo! Slurp.
    • It would be cool and useful to report on spider and bot traffic and learn which bots are good and bad. 
  • Blue. You are in the blue if at least every month or on an ad hoc basis:
    • I work with my vendor or in-house team to recognize and remove spider and bot traffic, conforming to industry standards.
    • I do a monthly update of my filter and exclusions list, or I know my vendor does.
    • I capture bot traffic in my server log files, but I may or may not report on it.
    • I know all about the Googlebot, and I use Site Maps. 
    • I realize the Googlebot and Yahoo! Slurp behave differently on my site, but I don’t know exactly how.
    • I know which bots are good (like search bots) and bad (like content scrapers).
  • Green.  You are in the green if:
    • I have established a process for regularly removing spider and bot traffic and for keeping my lists of exclusions and filters up-to-date, or I am *absolutely certain* my vendor has such processes.
    • I am in compliance with industry standards.
    • I measure spider and bot traffic and segment it into distinct reporting separate from my human traffic reporting.
    • My SEOers love me, the web analyst, and I am involved in educating people across my company, from IT to managers to consumers of reporting and analysis, about bot traffic.
    • I know that there’s more than one Googlebot, and I know how its crawl differs from Yahoo! Slurp (hint: Slurp visits more).
    • I know detailed metrics like the “total time online” of all my bots. ;)
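
As a rough sketch of the "segment it into distinct reporting" idea from the green level, the Python below counts hits per segment so that each bot family gets its own line next to human traffic. The family table, the function names, and the sample user agents are hypothetical and chosen only for illustration. More specific signatures are checked first so that Googlebot-Image is not lumped in with the main Googlebot.

```python
from collections import Counter

# Hypothetical signature-to-segment table (an assumption, not an official list).
# Order matters: more specific signatures come before more general ones.
BOT_FAMILIES = [
    ("googlebot-image", "Googlebot-Image"),
    ("googlebot", "Googlebot"),
    ("slurp", "Yahoo! Slurp"),
]

# Made-up user agent strings for illustration.
SAMPLE_AGENTS = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Googlebot-Image/1.0",
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.0",
    "Mozilla/5.0 (Macintosh; U; Intel Mac OS X) Safari/525",
]


def segment_of(user_agent):
    """Return the reporting segment for a user agent; unmatched agents count as human."""
    ua = user_agent.lower()
    for signature, family in BOT_FAMILIES:
        if signature in ua:
            return family
    return "human"


def traffic_report(user_agents):
    """Count hits per segment so bot and human traffic can be reported side by side."""
    return Counter(segment_of(ua) for ua in user_agents)
```

Here `traffic_report(SAMPLE_AGENTS)` returns a `Counter` with one entry per segment. The same grouping could feed a separate SEO report on crawler activity, while the "human" line feeds the externally reported numbers.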

This four-part series covers only a little of what I know, and of what there is to know, about spiders, bots, and crawlers and how they affect web analytics. I’d enjoy hearing other experiences or opinions, so please share your comments if you feel like it. Until next time, fine readers. Thanks for visiting!


Steve added the following ...

And a nice wrap up Judah!
I’d add one more to your “Green”:

Words to the effect that: I share my data/findings with the IT and/or Security Team(s) so they can pro & re-actively block et al ‘bots that cause more harm than good.

eg. Folk who run security scanners against sites. Folk who abuse your search engine, by proxying their search queries onto yours. (Not speaking from experience at all. Oh no no! ;-) )

Sometimes you *must* be able to react in near real time to block what are effectively DoS attacks against a site.

And most importantly! Pre-position your Security Policies to give an appropriate level of independent power to those who can fix things. ie Outside change control.
eg. In the event of Fan Impaction, Steve will do whatever he thinks is best to keep the site running for most people.

This is usually balanced by being 100% responsible for any action taken. And obviously requires a fair bit of trust in those in whose hands such power is placed.

Yes we do have just such a policy. :-) I’ve enacted it twice in 5 years. Been *very* useful to have.

Cheers!

Judah added the following ...

Steve: Fantastic information. You bring up excellent points about bots and smurfing and the need for the web analyst to stay connected or at least have an organizational connection to IT. Thanks for sharing, and I’m glad to know you agree with me! :)

Marianina Chaplin added the following ...

Hi Judah, Thanks for a very interesting and useful post. As you correctly identify, the problem with going with a hosted analytics solution is a frequent reliance on your vendor to identify and exclude spider traffic correctly and report on it separately. Freebies such as GA supposedly filter out all the spiders for you automatically, but I have still heard of bot traffic being included in GA when log files are analysed and compared to GA.

Judah added the following ...

Hi Marianina,

I agree that’s an issue, and it’s interesting to hear you as a practitioner confirm that you’ve noticed bot traffic slip through GA. Did you alert them? Did they do anything about it? Thanks for sharing your insights! :)

Judah Phillips at Web Analytics Demystified » Blog Archive » Part 3: Web Analytics Quality through Spider and Bot Recognition and Filtering added the following ...

[…] Hunting!  Click here for Part 4, my “Spider and Bot Capability Maturity Framework.” Thanks for […]

