Web Analytics Blogs

Judah Phillips is an experienced web analytics practitioner and Internet expert currently working as a Director at a large multichannel media company. His blog is full of useful, unbiased, actionable insights learned from the real-world practice of a process-oriented, integrated approach to strategic Web Analytics for improving business performance.

Part 1: Spiders, Bots, Page Views, and Web Analytics, Oh My!

One fascinating area of web analytics is spider and bot detection and analysis.  I find it amazing how many bots and spiders exist out there on the “world wild web.”

According to robotstxt.org, “robots” are:

“…programs that automatically traverse the web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced…the specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.”

Detecting non-human traffic is essential for “doing” solid web analytics.  After all, if you can’t adequately detect and filter robotic traffic, then your basic web metrics numbers are dramatically overinflated, which compromises the predictive power of your analysis.

The industry seems to believe that most vendors do a decent job of excluding robotic traffic.  After all, bots don’t execute JavaScript, so they don’t fire the page tag, and thus that nasty robotic traffic doesn’t get included in the metrics reported by the web analytics service provider.   That’s true as long as the bot doesn’t execute JavaScript… what if it does? Page tagging is certainly not perfect, and in my experience bots do slip through.

When you are doing log file analysis or using a hybrid data collection model, you’ll end up creating lists of IPs or user agents for bots to filter and exclude.  If you’re new to the game, your list may remain manageable.  If you record IPs and agents to filter for many years, you end up with a rather unwieldy list of exclusions, which probably should be in a database.   Does anyone know of a web analytics vendor that filters traffic based on data existing in a client’s bot database?
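A minimal sketch of what such an exclusion filter might look like, assuming each hit has already been parsed from the log into an (ip, user_agent, path) tuple. The networks and agent substrings below are hypothetical placeholders, not real bot signatures; in practice they would be loaded from your own exclusion database.

```python
import ipaddress

# Hypothetical exclusion data; a real list would come from your own logs or database.
EXCLUDED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
EXCLUDED_AGENT_SUBSTRINGS = ["examplebot", "samplecrawler"]


def is_excluded(ip, user_agent):
    """Return True if the hit matches an excluded network or user-agent substring."""
    address = ipaddress.ip_address(ip)
    if any(address in network for network in EXCLUDED_NETWORKS):
        return True
    agent = user_agent.lower()
    return any(token in agent for token in EXCLUDED_AGENT_SUBSTRINGS)


def filter_hits(hits):
    """Keep only the (ip, user_agent, path) hits that are not excluded."""
    return [hit for hit in hits if not is_excluded(hit[0], hit[1])]
```

Matching whole networks rather than single addresses keeps the list shorter, but it also raises the risk of excluding real visitors who share a subnet with a bot.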

Since I think log-file analysis is a neglected art, I practice it (and I’ve used tags from a few vendors in my day-to-day).  In general, I notice that bot traffic:

  • Crawls inordinate numbers of pages per visit when compared to human visitors.  Bots tend to view a large number of pages during each visit, and they tend to spend a long time on site too.
  • Enters the site at various intervals for various durations, often on a fixed schedule.  I find bots that request pages at the same periodicity.  One bot I’ve filtered in the past hits the same page every hour at 10 minutes past the hour, every hour, every day, every week, every month, every year.  It comes from somewhere in India.  That’s a bigger problem than worrying about whether my visitors from India huddle around the same computer. ;)
  • Crawls a site in unusual patterns. A common signature of a bot is the way in which it crawls a site.  Often a bot will walk the document object model, crawling the structure of the site.  I see these bots request every link, in order, on a page, including file downloads.  The logic behind these crawls is easy to identify, and thus filter.
  • Repeatedly requests pages that human visitors don’t access.  I see bots that must be programmed to crawl only XML files or to request only rich media files.
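Two of these signatures, inflated pages per visit and fixed request periodicity, can be sketched roughly as follows. This assumes visits have already been sessionized and timestamps are epoch seconds; the function names and thresholds (`multiple`, `min_hits`, `tolerance`) are illustrative assumptions to tune against your own data, not established cutoffs.

```python
from statistics import mean, median, pstdev


def pages_per_visit_outliers(page_counts, multiple=10):
    """Return visit IDs whose page count is at least `multiple` times the median.

    `page_counts` maps a visit ID to the number of pages viewed in that visit.
    """
    if not page_counts:
        return []
    typical = median(page_counts.values())
    return sorted(
        visit for visit, pages in page_counts.items() if pages >= multiple * typical
    )


def looks_periodic(timestamps, min_hits=5, tolerance=0.05):
    """Return True if request times are spaced at nearly constant intervals.

    Compares the standard deviation of the gaps between requests to the
    average gap; a very small ratio suggests a scheduled job.
    """
    if len(timestamps) < min_hits:
        return False
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    average_gap = mean(gaps)
    if average_gap == 0:
        return False
    return pstdev(gaps) / average_gap <= tolerance
```

A bot that hits a page at 10 minutes past every hour would produce gaps of exactly 3600 seconds and be flagged by `looks_periodic`, whereas human return visits tend to be far more irregular.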

Adding to the complexity of bot detection, I’ve noticed that bots evolving in the modern age of Web 2.0 also:

  • Execute javascript
  • Enter the site from various referrers using various methods
  • Come from different IP addresses and subnets
  • Repeatedly hit one page, repeatedly!
  • Spoof their user agents, thus not identifying themselves
  • Take cookies (they are cookie monsters).

Remember, not all bots are bad!  You want the Googlebot and other useful spiders to crawl your site. 

Let’s take a closer look at the nuances of new-wave bots in Part 2.

Jacques Warren added the following ...

Very interesting. Also, a quick trick for identifying the bots is to look at a high frequency of visits from certain IPs, say, an IP with 14 visits in a day. Could be human, but most of the time it’s not.

Peter Adams added the following ...

This is actually more of a problem than people care to admit. One way to avoid these newfangled web 2.0 spiders is to trigger tracking and cookie setting off of a special image request. I’ve found that while these bots execute javascript, they almost never request images…

Also, Gary Keith’s Browscap project (http://browsers.garykeith.com/) is a must have component of any web analytics / spider detection effort.

Steve added the following ...

An excellent posting Judah! Speaking as a practitioner of the fine art of log analysis. ;-)

Several thoughts:
* I find robots inflate my numbers but rarely change the … pattern or vibe if you will. Do get unusual spikes, but generally they just increase numbers.
* Bots that hit with regularity are usually “is the Internet Alive?” monitoring type tools. If they didn’t ask upfront, I block them at the firewall. :-)
* I have a funky AWK script/regular expression, that does a fair job. Been meaning to wrap it in a proper perl script with a “bot per line” config/include file. Time… need more time…

Jacques: We have 4500 people that come from the same IP address at work. I’ve worked in places where that number gets into the high 10’s of thousands. All from a single IP address. Be careful of your assumptions. I usually cross check off cookies stored in the logs - and if the cookies match on images and pages.

Which incidentally Judah - is the biggest giveaway I’ve found. Bots don’t request images/CSS etc as much as people do. Some tricky HTTP headers to force browsers to reload, say certain images or CSS, can go a long way towards helping identify bots - who won’t 304 the image or css.
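A rough sketch of this “who never asks for images or CSS” check, assuming requests have been parsed into (visitor, url) pairs where the visitor key is whatever identifier you trust (cookie, IP plus user agent, and so on). The extension list and the `min_pages` and `max_ratio` thresholds are illustrative assumptions.

```python
from collections import defaultdict
from os.path import splitext
from urllib.parse import urlparse

# Assumed set of supporting-file extensions; adjust to match your own site.
ASSET_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico"}


def is_asset(url):
    """Return True if the URL path ends in a known image/CSS/script extension."""
    path = urlparse(url).path
    return splitext(path)[1].lower() in ASSET_EXTENSIONS


def asset_starved_visitors(requests, min_pages=5, max_ratio=0.1):
    """Return visitors with many page requests but few or no asset requests."""
    counts = defaultdict(lambda: [0, 0])  # visitor -> [page count, asset count]
    for visitor, url in requests:
        counts[visitor][1 if is_asset(url) else 0] += 1
    return sorted(
        visitor
        for visitor, (pages, assets) in counts.items()
        if pages >= min_pages and assets <= max_ratio * pages
    )
```

Browsers with warm caches still tend to send conditional requests that show up in the logs as 304 responses, so counting requests rather than bytes transferred keeps returning humans from looking like bots.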

Fortunately it’s minor, but “Intent” of the ‘bot can matter. If I use MSIE to spider and archive a copy of your site, is that a bot? If I use wget to achieve the same thing? Tricky… :-)

Cheers!

Mat added the following ...

Also important is the differentiation of what you do with your “bot” traffic. As Steve pointed out, MSIE will archive your site - that technically may behave as a bot - but what value do you get from identifying and reporting on it as a bot?

It has been my experience that bot traffic is fairly useful stuff when it comes to search-bid analysis, identifying which companies to spend your SEO dollars with, etc… so I would want to separate out ACTUAL bot traffic to know who is crawling my site the most (and ergo has the best, latest, deepest knowledge of my content) where I’d rather just forget about MSIE altogether and use an exclusionary rule to keep a few records from being written to my database.

Judah added the following ...

Jacques: When looking at visit metrics for bot identification, I’d look for a large, equal number of page views and visits during a brief time period. But one needs to be careful because one IP address can be shared by multiple agents!

Peter: I agree this is a huge problem for accuracy and analysis. Thank you for sharing your opinion, insights, and learnings. Your tip is right on regarding images, and that resource is very useful. Keep up the excellent work on OWA! As a major proponent of open source, I think what you’re doing is fantastic and gaining gravity. :)

Steve: I hear you about the “vibe,” but I think that would depend on the type of user needs that the site satisfies. For example, content sites that are aggregated and scraped because they are juicy with good content can see bot traffic in excess of 70% in personal observations - that affects the pattern, if you will… but only if you don’t filter… :)

Good advice on the headers too. It echoes Peter’s recommendation from above too… Hard earned learning from your work on AWFFULL (http://www.stedee.id.au/awffull). ;)

The “bot’s intent” as a counterpoint to the “searcher’s intent” for the modern web, ay?

Mat: Good points. One of the reasons why I use the tools I do for web analytics is that SEOers need to know how relevant searchbots crawl sites. It’s amazingly helpful information.

I also agree that value may be generated simply from excluding the traffic and not reporting on it! Exclusionary rules are powerful. Indeed.

Thanks to all of you for commenting and reading my blog!

Steve added the following ...

Nooooooooooooo! My website! It’s been *Judah/.ed*!!! The poor server. It’s melting!!!!
;-)

*cough*. Sorry. Couldn’t resist really. Probably should have. Too many flashbacks to Wizard of Oz.

Actually wasn’t awffull, was one of the others: “visitors”. The problem I had was logs that had Apache mod_usertrack cookies embedded to betterer identify uniques. But no tools existed to actually make use of that - well that I had access to. I’d written a pretty nasty perl script. Slow. Overly complicated.
So I took that base and re-wrote in C and that became “visitors”.

Most of the code is wrapper. The core logic is quite small.
Basically check to see if we get more than one bite at a given cookie. See it twice or more, then it’s probably good.
I can take you through the detail if you’re really bored. Or: Read the Source. ;-)
I use comments, it’s not … too scary.
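The “more than one bite at a given cookie” idea could be sketched along these lines. This is an illustration only, not the actual source of “visitors”; the function name and the `min_sightings` parameter are assumptions, and the input is taken to be the cookie value from each log line (empty or None when no cookie was sent).

```python
from collections import Counter


def confirmed_cookies(cookie_values, min_sightings=2):
    """Return the set of cookie values seen at least `min_sightings` times.

    A cookie that comes back on a later request suggests a client that
    actually stores cookies, which most simple bots do not.
    """
    counts = Counter(value for value in cookie_values if value)
    return {value for value, seen in counts.items() if seen >= min_sightings}
```

Hits carrying a cookie that never reappears, or no cookie at all, would then be left out of the unique-visitor count.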

But part of making “visitors” work, was *really* understanding how people and browsers use a/the site. It’s not perfect, but I am seeing a remarkable similarity between its numbers and GA’s (we switched on a few weeks ago). Which did wonders for my ego. ;-)

I haven’t, yet, pulled this logic or code into awffull. Intend to. Looking more like being sooner than later, but too much crufty old code that needs trimming and cleaning first.

Cheers!

Judah Phillips at Web Analytics Demystified » Blog Archive » Part 2: The Crawlers, Spiders, and Bots of Web Analytics 2.0 added the following ...

[…] Part 1, I blogged about a huge challenge in web analytics: bot, spider, crawler detection, analysis, and filteration.  In some regards it is […]

Judah Phillips at Web Analytics Demystified » Blog Archive » Web Analytics and Data Collection: The Page Tag added the following ...

[…] maintain your filtering, but I don’t really think page tags are infallible at all.  Bots and spiders get […]

Judah Phillips at Web Analytics Demystified » Blog Archive » Part 1: Your Web Analytics Data Quality Stinks and Here’s Why! added the following ...

[…] and Bots.  If you haven’t read my series on spiders and bots, check it out.  Non-human traffic can inflate your metrics and diminish the predictive power of […]

Bad Bots Confound Web Analytics By Executing Javascript Tags » Slicksurface - Tech, Design & SEO Blog added the following ...

[…] Judah Phillips wrote a post last July where he talked about seeing a similar phenomena. He said he sees bots doing the following: […]

