Web Analytics Blogs

Judah Phillips is an experienced web analytics practitioner and Internet expert currently working as a Director at a large multichannel media company. His blog is full of useful, unbiased, actionable insights learned from the real-world practice of a process-oriented, integrated approach to strategic Web Analytics for improving business performance.

Archive for July, 2007

Web Analytics and Data Collection: Log File Analysis

Web Analytics and log file analysis is misunderstood!  With so many options for data collection in web analytics, the log file has been somewhat disdained for many years.  In fact, I’d venture to guess that the majority of web analytics practitioners working today don’t “do log file analysis.” That’s unfortunate.  One can learn a lot about web analytics from examining log files - for example, “total time online” (with all its inherent issues :).

Why is server-side data collection (log file analysis) less popular than client-side page tagging?  It’s not about bots, accuracy, or ease of deployment… that’s marketing spin, somewhat true and false depending on context.  It may be because most web analytics practitioners currently use a hosted model.

Page tagging enables ASP models for web analytics.  It would be very difficult (i.e. resource intensive and costly) for a web analytics ASP to maintain a competitive service level if the ASP relied on its clients and their IT teams to maintain server-side data collection: maintaining, transporting, and synching log files.

Imagine the ASP model with log files!  It’s nightmarish and makes little sense:  hundreds and thousands of customers with just as many (or more) different servers behind firewalls generating many different log file formats.  Then, at various intervals all customers sending uniquely formatted logs for the ASP to parse, index, and make available for querying and retrieval in close to real time.  Yikes!! 

Page tagging creates a standard data structure for the ASP and a manageable data transport mechanism while minimizing complexity and reducing infrastructure cost in a way that log files just can’t.  Page tags enable ASPs to meet rigorous service level agreements and contractual obligations for data processing in a way they probably couldn’t with log files.

So who does log file analysis for web analytics?  The answer is companies that have:

  • No desire to give up control of a web analytics infrastructure to young companies.  Web data is valuable and log files enable you to keep it all for however long and at whatever cost your business justifies.  You can process your data using whatever tool you want, and reprocess at your discretion.
  • Existing business dimensions and structured data already existing in log files. Simple web analytics concepts like page views, visits, and “uniques” become more meaningful when sliced by relevant business objects from your company’s existing log data.  For example, if you’ve passed a unique parameter in your query strings for years, you can program your log file tool to recognize the parameter.   Then reprocess your existing data to make use of it in your reporting.
  • Deep data mining experience.  Data miners want data to mine, not static GUIs or “my data” delivered in summary extracts at an incremental cost.  Log files can be parsed and indexed into the database that you, the company or the client, actually owns at the budget you actually set.
  • Knowledge of how to employ hybrid data collection.  Log files are complemented by page tags for instrumenting pages to track events and Web 2.0.
  • Complex integration requirements.  All the data exists on machines in your network.  Log files can be extended to incorporate new business dimensions.
  • A need to understand how spiders and bots behave on their site and the performance of their server(s).  Log files capture everything that happens on the web site.  The “kitchen_sink.html” is in there.  It’s up to you to customize the tool to meet the quality level you demand for filtered data.  As Mat and Steve mentioned in the comments, logs also enable you to view server errors, useful for site QA and debugging.
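To make the query string point concrete, here is a minimal sketch of how a log line can be parsed and a long-standing parameter pulled out as a reporting dimension.  It assumes your servers write the common NCSA Combined Log Format; the parameter name `cid`, the sample line, and the function name are all made up for illustration, so swap in whatever your own logs actually contain.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Pattern for the NCSA Combined Log Format (an assumption about your servers;
# adjust it if your logs carry custom fields).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line, param="cid"):
    """Return the log fields plus the value of `param`, or None if the line does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    query = parse_qs(urlsplit(record["url"]).query)
    # parse_qs returns a list per key; keep the first value, or None when absent.
    record[param] = query.get(param, [None])[0]
    return record


# Hypothetical log line using documentation-only IP and host names.
sample = (
    '203.0.113.7 - - [12/Jul/2007:10:15:32 -0400] '
    '"GET /articles/index.html?cid=newsletter HTTP/1.1" 200 5120 '
    '"http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
)
record = parse_line(sample)
print(record["url"], record["cid"])
```

Because the raw logs are still on your machines, the same function can be pointed at years of history once you decide a parameter is worth reporting on.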

I’m sure there are other reasons too….  I’ve heard people tell me that:

  • It’s hard to move log files.  That depends on how many log files you have, what technology you are using to move them, and how often.
  • Log files are difficult to parse. Not really, especially if they are standardized.  You just need a flexible tool.  If things get non-standard, you need a tool that enables you to define your own formatting.
  • Log files take up a lot of space.  Yes, they do. How much space do you have?  What are your company’s existing standards for parsing log files?  How long will you store them?  You may find out you have plenty of “space.”  You will never have enough time though. ;)
  • Log files aren’t as accurate as page tags.  Huh? That all depends on how well you filter and exclude pages.  Accurate compared to what? To unvalidated data existing on servers you don’t see in some data center thousands of kilometers/miles away?  Logs capture visits from browsers that don’t execute javascript and thus will not appear in the metrics from tag-based systems…

So, you might be thinking I am a big fan of log files.  I am a web analytics realist.  Challenges with log files include:

  • Standardizing on a common format.  If you have a web server, you are at risk for someone doing something that creates a custom log file format.  If you are the new web analyst coming into an existing company, you may find custom fields in the logs.  Develop and champion a company standard.  It will make your life as a web analyst somewhat easier.
  • Creating a single “core” configuration that can be extended to deal with one-off formats.  If you live in a world of custom log files, you can abstract a set of core fields, describe them, then extend the description when you deal with custom craziness.
  • Synchronizing/transporting log files.  The more servers you have, the more chances you have for failures in synchronization.  Application monitoring and recovery procedures are critical to preventing downtime.  Develop them.
  • Reprocessing log files.  Depending on your system, reprocessing existing log file data can have a myriad of effects - from data not being available, to data being partially available, to errors, to long waiting periods before the data is current and ready to query.
  • Receiving log files that are missing key fields or with key fields that have changed.   If you depend on systems, companies, and people that you do not control, fields can be stripped and formats changed in the blink of an eye.  The web team needs to detect and manage these events.
  • Monitoring the operational workflow.  Someone on the web analytics team needs to manage the system and maintain the overall availability and integrity of log file processing and any issues that could result in the data not being synched or processed correctly. 
  • Managing change.  Since all your “hits” are captured in the log file, you’ll need to tell your web analytics tool how to correctly process the logs and remove data you don’t want to report.  You’ll need to identify and maintain lists of filetype exclusions and the user agents/IPs of robotic traffic.
  • Finding staff that “gets it.” The rule is that success in web analytics results from having competent people who have experience understanding the risks and rewards of web analytics data collection and executing on a web analytics strategy that generates value. 
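Two of the challenges above, detecting stripped fields and maintaining exclusion lists, lend themselves to a small sketch.  Everything named here is hypothetical: the required fields, the extensions, the agent substrings, and the IP are placeholders for whatever standard your company settles on, and a real tool would load them from configuration rather than hard-code them.

```python
import os
from urllib.parse import urlsplit

# Hypothetical lists for illustration; maintain your own and keep them current.
REQUIRED_FIELDS = ("ip", "url", "agent")
EXCLUDED_EXTENSIONS = {".gif", ".jpg", ".png", ".css", ".js", ".ico"}
EXCLUDED_AGENT_SUBSTRINGS = ("googlebot", "slurp")
EXCLUDED_IPS = {"192.0.2.10"}


def missing_fields(hit):
    """Return the required fields that are absent or empty in a parsed hit."""
    return [field for field in REQUIRED_FIELDS if not hit.get(field)]


def classify_hit(hit):
    """Label a parsed hit as 'asset', 'robot', or 'page_view'."""
    path = urlsplit(hit["url"]).path
    extension = os.path.splitext(path)[1].lower()
    if extension in EXCLUDED_EXTENSIONS:
        return "asset"
    agent = hit["agent"].lower()
    if hit["ip"] in EXCLUDED_IPS or any(s in agent for s in EXCLUDED_AGENT_SUBSTRINGS):
        return "robot"
    return "page_view"


hits = [
    {"ip": "198.51.100.4", "url": "/logo.gif", "agent": "Mozilla/4.0"},
    {"ip": "198.51.100.4", "url": "/index.html?x=1", "agent": "Mozilla/4.0"},
    {"ip": "203.0.113.9", "url": "/index.html",
     "agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
]
labels = [classify_hit(hit) for hit in hits if not missing_fields(hit)]
print(labels)
```

Labeling robots instead of silently dropping them keeps the door open for reporting on bot traffic separately.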

As a web analyst you need to use the tools at your disposal to build your web analytics practice.  Log files are part of your toolkit.  Be wary of people who try to sway you one way or the other when it comes to web analytics data collection.  The method you ultimately use should be based on the data you want to collect in the context of the company’s goals, budget, and expertise.

log_files.gif

Web Analytics and Data Collection: The Page Tag

Many methods exist for collecting different types of web analytics data - some much more accurate and useful than others: page tags, log file analyzers, packet sniffers, audience panels, and toolbars.  At this point in time, the page tag certainly seems to be the preferred method adopted by the web analytics industry for data collection.  It’s also core to other technologies, like multivariate testing.

Various sources will tell you a multitude of “things” over time about page tags, depending on whether they think you are a noob or advanced practitioner.  There’s lots of useful information from many sources about page tags.  Still I find a lot of the average discussion about the utility of page tags to be somewhat true and false, depending on context.  For example, if you’ve never tagged pages, but you have processed log files, does it make sense to entirely throw away legacy processes?  Or if you’ve mastered change management across all pages on your site but you’ve never parsed log files, do you want to have to deal with synching, moving, parsing, filtering log files?   These questions yield fantastic answers for fruitful discussion as you plan or extend your web analytics implementation.

I’ve heard the following “things” about page tags:

  • It’s easy to page tag your sites.  That really depends on the technologies used to build the site.  A static site with few simple pages is a different animal than a beast of a database-driven site with millions of complicated pages glued together using different technologies. 
  • Page tags are more reliable.  Reliable in what way I ask?  Page tags in a hosted environment are processed by a data center many miles away.  I have no idea of the availability of the servers to support the load they are receiving from all the customers, nor do I have any insight into the raw data collected by the page tags, errors, and so on.
  • Robots and spiders are always removed from the metrics.  Page tags do a better job than log file analyzers out-of-the-gate and especially if you don’t maintain your filtering, but I don’t really think page tags are infallible at all.  Bots and spiders get through.
  • You don’t have to deal with IT when page tagging.  It’s just a global include right?  You just give IT the code, and they include it.  For the most part that’s true, but IT has processes and procedures and your tagging needs to be QA’ed and perhaps even put through a “versioning” process.   Regardless, you’ll need to prep the corporation and your friends in IT for the big page tagging effort.
  • Change management is easy with page tags.  If you only use the vendor’s out-of-the-box page tag and have a global include, it’s somewhat easy to manage change.  But in my experience, a web analytics implementation requires using the tag in different ways across the site.  You’ll need to add attributes to the tag or script values into the tag.   In order to manage change, you’ll need to follow corporate processes.  In the case of web analytics page tagging, you may need to create those processes before you manage change… and process creation in corporate environments is sometimes not easy.  You may even need to follow a process for doing so.  :)

The biggest challenges I’ve had with page tagging include:

  • Ensuring complete code coverage across all pages.  The number of pages in your site, the way the site is built, and the technology used to build your site all need to be carefully assessed prior to beginning the page tagging effort.  The larger the portfolio of sites, the more difficult it will be to tag all of your pages.  Other challenges include the number of domains and subdomains, the technology standards used across your portfolio of sites, and whether, in more controlled environments, the web analytics team actually has the positional power to influence the change management process.
  • Determining a method for centralized tag management.  If you are running an internal solution, you’ll need page tag server(s) to collect the page tag data.  If you’re depending on your vendor’s servers, you may need to modify or update tags for various reasons (new campaigns and so on).  You may find new sites that need to be tagged.  New stakeholders may want special tags added to the site to support other purposes (such as multivariate testing).   How are you going to centrally manage all that?   
  • Orchestrating changes to tags across different site sections.  You may want to pass a new value in the tag or change something in the script on page X, but not page Y, or subdomain A, but not subdomain B.  A new campaign may need to be enabled.  You may have to update your tag to take advantage of new vendor functionality.  You’ll need a technology solution and process for centrally orchestrating and controlling tag changes.  Wielding a mighty CMS helps.
  • Reconciling tag metrics with log metrics and determining correct filtering.  It’s great fun to collect tag data then compare it to your log files to determine if all pages are being counted and how effectively you are filtering bots.  While looking at different data sources is a time consuming activity, certain business cases may demand it. 
  • Integrating tagging with a Content Management System.  Major sites use expensive CMSs to create web sites.  The web analyst should work with the CMS team to build page tags into site pages.  Integration will help you with two points raised above: centralizing tag management and change management/orchestration.
  • Challenges with decodes and lookups using tags.  I’ve learned that it is difficult and in many cases impossible to decode a value or use lookup tables with a page tag.  For example, if I had a page that had a URI “/er45rw/e42f45erfwrq3r.html,” I can’t decode it to read “Web Analytics Blog” in my reports.  Or if it is possible to do simple decodes, I have to hardcode the decode on each page.   Hard coding on an evolving web site is never manageable over time.
  • Latency.  If the page partially loads without executing the javascript, or if the user clicks through the page before the javascript has fired, the page view won’t be counted.  You must test to make sure your tag is firing properly.
  • Javascript turned off.  If the browser doesn’t execute javascript, the tag won’t fire and the visitor won’t be measured.  What effect will that have on your numbers? 
  • Cookie issues.  Tags may set third-party cookies.  Privacy policies don’t like the third party cookie.  As Justin Cutroni points out in the comments his tool sets first party cookies by default.  That’s good.
  • DNS changes.  To prevent third party cookie issues, you can trick out the DNS with a CNAME entry.  Say hello to someone in IT called the DNS admin!

In discussing the page tag in such a manner, you may think I don’t like the page tag.  Not true at all!  The page tag is a very useful data collection method in context.  I like it very much, especially in hybrid data collection.  Like any technology, just make sure you understand the implications of your data collection method on your implementation.   

page_tags.gif
Courtesy of Zeus.com.

Part 4: Spider and Bot Capability Maturity Framework for Web Analytics 2.0

Let’s conclude this series on the spiders and bots of web analytics 2.0 with a framework for ranking the spider and bot capability maturity of a web analytics practice!

Before I do so, let me mention a few things:

  • How/if you detect spiders and bots depends on your data collection methods.  It is thought that page tags exclude spiders and bots because they don’t execute javascript or request images. Until they do. Log file data must be examined and filtered (either programmatically, manually, or both), and those filters need to be maintained by a web analytics professional.
  • Bots are used for many purposes.  From inflating metrics, to indexing text for search, to chatting, to spamming, to data mining, to site performance monitoring, to click fraud, and more.
  • Bots are constantly evolving. Just like web analysis. :)
  • Not all bots are bad news.  You want bots crawling your site.  Some bots are good and helpful for your online business. Imagine if you blocked Googlebot.  Segment the robotic traffic into separate reporting (in the best case) and make sure it’s filtered from your externally reported numbers.

Now without further ado, here’s a web analytics 2.0 framework for ranking your spider and bot capability maturity:

  • Red.  You are in the red if the following applies:
    • I don’t understand the impact of spider and bot traffic nor do I measure, filter, or segment it at all. 
    • If I run an in-house solution, I haven’t updated my list of filters and exclusions since I deployed it. 
    • If I run a hosted solution, I rely solely on my vendor to control all aspects of bot filtering and data purification.
    • I don’t know when or how the Googlebot hits my site, but it sure does! 
    • I’ve never thought of reporting spider traffic for SEO.
    • I think the words “yahoo” and “slurp” refer to delicious soup.
    • Spiders scare me.
    • All bots are bad (no, they aren’t!)
  • Yellow. You are moving out of the red zone, into the yellow if:
    • I occasionally look for suspicious traffic. 
    • If I run an in-house solution, I have occasionally updated my list of filters and exclusions since I deployed it. 
    • If I run a hosted solution, I rely solely on my vendor to control all aspects of bot filtering and data purification, but I ask for verifications of compliance with industry standards.
    • I’ve heard of the Googlebot and maybe I use or I am thinking about using Site Maps (do it!). 
    • I know about Yahoo! Slurp.
    • It would be cool and useful to report on spider and bot traffic and learn which bots are good and bad. 
  • Blue.  You are in the blue if at least every month or on an ad hoc basis:
    • I work with my vendor or in-house team to recognize and remove spider and bot traffic, conforming to industry standards.
    • I do a monthly update of my filter and exclusions list, or I know my vendor does.
    • I capture bot traffic in my server log files, but I may or may not report on it.
    • I know all about the Googlebot, and I use Site Maps. 
    • I realize the Googlebot and Yahoo! Slurp behave differently on my site, but I don’t know exactly how.
    • I know which bots are good (like search bots) and bad (like content scrapers).
  • Green.  You are in the green if:
    • I have established a process for regularly removing spider and bot traffic and for keeping my lists of exclusions and filters up-to-date, or I am *absolutely certain* my vendor has such processes.
    • I am in compliance with industry standards.
    • I measure spider and bot traffic and segment it into distinct reporting separate from my human traffic reporting.
    • My SEOers love me, the web analyst, and I am involved in educating people in my company about bot traffic from IT to managers to consumers of reporting and analysis.
    • I know that there’s more than one Googlebot, and I know how its crawl differs from Yahoo! Slurp (hint: Slurp visits more).
    • I know detailed metrics like the “total time online” of all my bots. ;)

What I’ve covered in this four-part series only covers a little bit of what I know and what there is to know about spiders, bots, and crawlers and how they affect web analytics.   I’d enjoy hearing other experiences or opinions, so please share your comments if you feel like it.  Until next time, fine readers.  Thanks for visiting!

sengine1.gif

Part 3: Web Analytics Quality through Spider and Bot Recognition and Filtering

Welcome to Part 3: Web Analytics Quality through Spider and Bot Reconciliation. My goal with this series of posts was to open discussion about what Alex Langshur of PublicInsite called, in reference to these posts: the “quality of web analytics.”  I’m talking about doing the most and the best we can with current technology to ensure quality and integrity in web analytics data collection and reporting - like I know we all do now, right! :)

It’s about doing the best we can at our jobs because we’re lucky to have them - at least that’s how I feel about my “real” job in web analytics.

To begin detecting bots and hunting spiders, here’s what I recommend:

  • Meet industry standards.  The Interactive Advertising Bureau publishes a list of spiders, bots and user agents that member companies have confirmed are robotic.  You can add these patterns to your list of IP’s and agents to exclude.  A question to be asked is whether your vendor subscribes to this list, and thus makes your analytics implementation compliant with the IAB.
  • Use public resources.  Peter Adam’s comment in Part 1 pointed to an excellent resource: Gary Keith’s Browscap project is a must have component of any web analytics / spider detection effort. A simple Google search reveals a list of resources to leverage.
  • Use hybrid data collection.  You can exploit the non-javascript executing bot busting power of page tags, and process log files to detect how spiders and bots crawl your site.  That’s hybrid.  You’ll need a tool that enables you to carve out spider and bot traffic from human traffic (certain tools do this).  If you run the logs of a site you’ve page tagged, you can compare the traffic from both sources and look for identical traffic that has unusual patterns too.
  • Identify and then filter unusual traffic patterns. Query your database and/or look at your reporting to identify and filter IP addresses and user agents or combinations thereof that view a total number of pages that you find suspicious.   Look for continuous visits repeating continuously!  Check out paths that have over N page views, where N is an arbitrary number larger than the “average pages viewed per visitor” on your site.  What’s requesting your robots.txt?  What do the attributes of the traffic coming from Africa, the Russian Federation, India, and China look like?
  • Talk to your engineers.  They are really smart and know lots of things about the web.  They are your allies in web analysis.  As web analytics practitioner, programmer, and all around smart dude, Steve McInerney mentioned: “…is the biggest giveaway I’ve found. Bots don’t request images/CSS etc as much as people do. Some tricky HTTP headers to force browsers to reload, say certain images or CSS, can go a long way towards helping identify bots - who won’t 304 the image or css.”  Your CMO may not understand what Steve is saying, but your CTO will.  And Steve may only understand when I say: Aussie! Aussie! Aussie!
  • Keep your filter lists up to date. Obviously, doh! You’ll need to maintain your lists of exclusions.  Someone needs to be dedicated to doing this process-based activity.
  • Identify the “spiders and bots” metrics on your site and compare to suspicious traffic.  I know the average number of pages viewed by a spider, the average visit duration per bot, and how other key metrics relate to spiders.  I also know this for humans.  When I am suspicious about traffic, I can compare attributes of visit or visitor to my bot and human profiles.
  • Examine user agents.  Look for irregular, unusual, and just plain weird user agents.
  • Set up a tarpit.  A tarpit is a server to which you redirect bot traffic and serve those bots random binary objects, very slowly, until they stop crawling.  When the bot maker checks on the results, they find they have nothing and need to recrawl.  The idea here is slowing bot evolution.  If you block the IPs then the bot makers will know you’ve found them and adjust their tactics accordingly to make them less detectable.
  • Remember that not all bots are bad, so don’t prevent the good bots from crawling your site!  The Googlebot and Yahoo!Slurp index your site’s content so that it is retrievable when people query their search engine.  Don’t exclude bots using robots.txt or a tarpit unless you know that bot is nefarious or should be excluded.  In practice, I rarely block bots from visiting a site.  Rather I segment robotic traffic out of the numbers I report about site behavior and performance.
  • Manage expectations.  The truth is you can’t filter what you haven’t seen before.  You can’t predict the future of new bot behavior.  No matter what, bots will get through your data collection methods.  As an analyst you need to actively manage expectations by educating your user community, clients, and colleagues.  Soon enough you’ll have associates who are savvy human filters letting you know when bots slip through.
  • Create a business process for spider and bot reconciliation.  The savvy web analytics team has sustainable, manageable, and repeatable processes for scheduled detection, filtering, removal, quality assurance, and exclusion list maintenance.
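As a rough illustration of the “unusual traffic patterns” recommendation, the sketch below flags IP and user agent combinations that view more than N pages and notes which ones asked for robots.txt.  It is a simplification under stated assumptions: it groups by (IP, user agent) rather than by true visits, the threshold of 100 is arbitrary, and the sample hits are invented.

```python
from collections import defaultdict


def suspicious_visitors(hits, max_pages=100):
    """hits: iterable of (ip, agent, path) tuples.

    Returns two sets of (ip, agent) pairs: those with more than `max_pages`
    requests, and those that requested /robots.txt.
    """
    page_counts = defaultdict(int)
    robots_txt_requesters = set()
    for ip, agent, path in hits:
        key = (ip, agent)
        page_counts[key] += 1
        if path == "/robots.txt":
            robots_txt_requesters.add(key)
    heavy = {key for key, count in page_counts.items() if count > max_pages}
    return heavy, robots_txt_requesters


# Hypothetical traffic: one human-looking visitor and one aggressive crawler.
hits = [("198.51.100.1", "Mozilla/4.0", "/p%d" % i) for i in range(5)]
hits.append(("203.0.113.9", "crawler-x", "/robots.txt"))
hits += [("203.0.113.9", "crawler-x", "/p%d" % i) for i in range(150)]

heavy, robots_txt_requesters = suspicious_visitors(hits)
print(heavy, robots_txt_requesters)
```

Neither set is proof of anything on its own; a polite search spider will land in the robots.txt set, which is exactly why the output is a list to review and segment rather than a list to block.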

Happy Hunting!  Click here for Part 4, my “Spider and Bot Capability Maturity Framework.” Thanks for visiting!

animated-search-engine_spider.gif

Image from link2city.com.

Part 2: The Crawlers, Spiders, and Bots of Web Analytics 2.0

In Part 1, I blogged about a huge challenge in web analytics: bot, spider, and crawler detection, analysis, and filtering.  According to robotstxt.org “robots” are:

“…programs that automatically traverse the web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced… the specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time… it is still a robot.”

In some regards it is “taboo” to speak about evil bots, but I think it’s a topic that needs to be brought out of the dark.  If we have to pay for page views, I require and expect accurate page view metrics. 

By understanding the characteristics of bots to help detection and purification of your web metrics, you will deliver more accurate analysis. Let’s take a deeper look at other bot behavior of Web Analytics 2.0:

  • Javascript executing bots.  Yes, they exist.  Big time.  They will invoke your beacon code, and generate large numbers of false page views, which your vendor should be filtering out.   Maybe your vendor filters traffic where large numbers of page views are generated in a single visit, or maybe you are paying for them.  Check your pathing reports or your ratios, like view:visit. Do you see visits that have hundreds or thousands of page views?  Chances are those are bots (or your configuration may be technically challenged).

A large search engine company’s client service folks even hinted at javascript executing bots being programmed to execute queries on internal search engines, thus invoking an embedded advertising module, inflating the number of requests for the ad module.  While that doesn’t affect clickthrough or revenue and has nothing at all to do with the ad technology or the engine, it affects metrics. It confirms what I see.

Another favorite javascript executing bot I know likes to exploit HTML commenting left on in blogging architectures.  The bot spams comments with nefarious link farms or other unsavory content.  It executes javascript to do it.

Take, for example, the “rofovnifo” bot.  It fills out blog comments with the following content, leaving links to the nefarious goldwarez.org (don’t go there!) coming from Comcast 67.160.110.199:

Hi. Looks good! Very useful, good stuff. Good resources here. Thanks much!  G’night

A quick Google search reveals “rofovnifo” really likes Cabela’s, sailing, Friendster, XBOX 360, car audio, card sharing, and world affairs–among many other things.  If only “rofovnifo” were a legitimate human, then I’d believe the traffic he generated and continues to generate across the 94,500 links he’s visited across the world over the last several days! (UPDATE: 127,400 links… wait 128,000)

  • Enter your site from various referrers using various methods. One bot that I find fascinating executes a query on a particular country’s Google site (let’s say Google.au), then the bot crawls site pages returned in the SERP (search results page).  Next it indexes the content on the returned page. In the case I observed, this bot crawls a directory of subject matter related to a certain discipline.  The directory has thousands of pages.  So what I see is the bot coming from Google.au from a certain keyword, entering the site, and crawling 4,000+ pages during one visit that lasts a few hours.  Not human.
  • Come from different IP addresses or subnets. I’ll find bots that jump around IP addresses, but always hit the same page or pages for the same amount of time.  Or the bots will come from different subnets and hit the same page at a given periodicity (like every 15 minutes).
  • Repeatedly hit one page, repeatedly!  In a former job, I noticed that one educational institution referred a bot to the site on only the first of the month. The bot would crawl 77,000 pages then stop.  Are these clickfraud bots?  What are they?  Were they testing a prototype?
  • Spoof their user agent.  Nasty bots don’t want to say “hey, I am a nasty bot” so they try to hide themselves by looking like normal user agents.

Mozilla compatible user agents look something like this:

xx.xxx.xx.xxx:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.307; .NET CLR 1.1.4321; Media Center PC 4.0; .NET) 

Bots will spoof user agents to get by your filters.  Sometimes they are sloppy in doing so, which makes them instantly detectable.  Notice the upper-case WINDOWS and the shortened NET reference:

xx.xxx.xx.xxx:Mozilla/4.0 (compatible; MSIE; WINDOWS 5.1; SV1; NET 1.0; Media Center PC 4.0; .NET)
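A check for this kind of sloppiness can be sketched in a few lines.  The two patterns below are my own rough assumptions about what well-formed MSIE and Windows tokens look like, and they only cover agents claiming to be Internet Explorer; treat them as a starting point to tune against your own traffic, not as a complete rule set.

```python
import re

# Rough, hypothetical shapes for well-formed tokens in an MSIE user agent.
MSIE_TOKEN = re.compile(r"^MSIE \d+\.\d+\w*$")
WINDOWS_TOKEN = re.compile(r"^Windows (NT \d+\.\d+|98|95|CE)$")


def malformed_tokens(user_agent):
    """Return the MSIE/Windows tokens in a user agent that look malformed."""
    inside = re.search(r"\((.*)\)", user_agent)
    if inside is None:
        return []
    problems = []
    for token in (part.strip() for part in inside.group(1).split(";")):
        upper = token.upper()
        if upper.startswith("MSIE") and not MSIE_TOKEN.match(token):
            problems.append(token)
        elif upper.startswith("WINDOWS") and not WINDOWS_TOKEN.match(token):
            problems.append(token)
    return problems


normal = ("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.307; "
          ".NET CLR 1.1.4321; Media Center PC 4.0; .NET)")
sloppy = "Mozilla/4.0 (compatible; MSIE; WINDOWS 5.1; SV1; NET 1.0; Media Center PC 4.0; .NET)"

print(malformed_tokens(normal))
print(malformed_tokens(sloppy))
```

A careful spoofer will pass this test, of course, so it catches only the lazy ones.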

  • Use cookies.  Weird user agents that use cookies set off a huge red flag.   You need a web analytics tool that allows you to cross the cookie dimension with user agents.
  • Pass cookies.  Even worse than bots that simply use cookies is when you see one cookie passed across multiple IP addresses visiting your site consecutively during a short-time period, then disappearing after crawling huge numbers of pages in that short duration… very odd indeed. 
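The passed-cookie pattern can be looked for directly if your data carries a timestamp, a cookie id, and an IP for each hit.  The sketch below is one possible approach under those assumptions; the thresholds (three IPs within 30 minutes) and the sample data are invented, and legitimate visitors behind rotating proxies can trip it too.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def shared_cookies(hits, min_ips=3, window=timedelta(minutes=30)):
    """hits: iterable of (timestamp, cookie_id, ip).

    Returns the cookie ids seen from at least `min_ips` distinct IP addresses
    within any span of time no longer than `window`.
    """
    by_cookie = defaultdict(list)
    for timestamp, cookie_id, ip in hits:
        by_cookie[cookie_id].append((timestamp, ip))

    flagged = set()
    for cookie_id, events in by_cookie.items():
        events.sort()
        start = 0
        for end in range(len(events)):
            # Slide the left edge forward until the span fits inside the window.
            while events[end][0] - events[start][0] > window:
                start += 1
            distinct_ips = {ip for _, ip in events[start:end + 1]}
            if len(distinct_ips) >= min_ips:
                flagged.add(cookie_id)
                break
    return flagged


base = datetime(2007, 7, 1, 12, 0)
hits = [
    (base, "c1", "198.51.100.1"),
    (base + timedelta(minutes=5), "c1", "198.51.100.2"),
    (base + timedelta(minutes=10), "c1", "198.51.100.3"),
    (base, "c2", "203.0.113.5"),
    (base + timedelta(hours=5), "c2", "203.0.113.6"),
    (base + timedelta(hours=10), "c2", "203.0.113.7"),
]
flagged = shared_cookies(hits)
print(flagged)
```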

What do you see on your sites?  How are you dealing with spiders and bots?

Check out Part 3 where I will provide recommendations for keeping your analytics up-to-date with Web 2.0 bots and spiders.  Thanks for visiting!

web-analytics-robots-spiders-crawlers.jpg 

Part 1: Spiders, Bots, Page Views, and Web Analytics, Oh My!

One fascinating area of web analytics is spider and bot detection and analysis.  I find it really amazing how many bots and spiders really exist out there on the “world wild web.” 

According to robotstxt.org “robots” are:

“…programs that automatically traverse the web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced…the specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.”

Detecting non-human traffic is essential for “doing” solid web analytics.  After all, if you can’t adequately detect and filter robotic traffic, then your basic web metrics numbers are dramatically overinflated, which compromises the predictive power of your analysis.

The industry seems to believe that most vendors do a decent job of excluding robotic traffic.  After all, bots don’t execute javascript, so they don’t fire the page tag, thus that nasty robotic traffic doesn’t get included in the metrics reported by the web analytics service provider.   That’s true as long as the bot doesn’t execute javascript… what if it does? Page tagging is certainly not perfect, and in my experience bots do slip through. 

When you are doing log file analysis or using a hybrid data collection model, you’ll end up creating lists of IPs or user agents for bots to filter and exclude.  If you’re new to the game, your list may remain manageable.  If you record IPs and agents to filter for many years, you end up with a rather unwieldy list of exclusions, which probably should be in a database.   Does anyone know of a web analytics vendor that filters traffic based on data existing in a client’s bot database?
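For what it’s worth, a bot database does not have to be elaborate.  Here is a minimal sketch using SQLite from Python’s standard library; the table name, columns, and sample rows are all hypothetical, and it assumes a SQLite build recent enough to include the `instr` function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute(
    """
    CREATE TABLE bot_exclusions (
        kind     TEXT NOT NULL CHECK (kind IN ('ip', 'agent')),
        pattern  TEXT NOT NULL,
        added_on TEXT NOT NULL,
        note     TEXT,
        PRIMARY KEY (kind, pattern)
    )
    """
)
# Hypothetical entries; 'agent' patterns are matched as case-insensitive substrings.
conn.executemany(
    "INSERT INTO bot_exclusions (kind, pattern, added_on, note) VALUES (?, ?, ?, ?)",
    [
        ("ip", "192.0.2.10", "2007-07-01", "hits one page hourly"),
        ("agent", "slurp", "2007-07-01", "search spider, report separately"),
    ],
)


def is_excluded(conn, ip, agent):
    """True if the IP matches exactly or the agent contains an excluded pattern."""
    row = conn.execute(
        "SELECT 1 FROM bot_exclusions "
        "WHERE (kind = 'ip' AND pattern = ?) "
        "   OR (kind = 'agent' AND instr(lower(?), lower(pattern)) > 0) "
        "LIMIT 1",
        (ip, agent),
    ).fetchone()
    return row is not None


print(is_excluded(conn, "192.0.2.10", "Mozilla/4.0"))
print(is_excluded(conn, "198.51.100.1", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"))
```

Keeping the date and a note with each entry makes it much easier to review old exclusions years later.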

Since I think log-file analysis is a neglected art, I practice it (and I’ve used tags from a few vendors in my day-to-day).  In general, I notice that bot traffic:

  • Crawls inordinate numbers of pages per visit when compared to human visitors.  The basic observation you’ll notice immediately is that bots tend to view a large number of pages during each visit.  They tend to spend a long time on site too.
  • Enters the site at various intervals for various durations.  I find bots that request pages at the same periodicity.  One bot I’ve filtered in the past hits the same page every hour at 10 minutes past the hour, every hour, every day, every week, every month, every year.  It comes from somewhere in India.  A bigger problem than worrying about whether my visitors from India huddle around the same computer. ;)
  • Crawls a site in unusual patterns. A common signature of a bot is the way in which it crawls a site.  Often a bot will walk the document object model, crawling the structure of the site.  I see these bots request every link, in order, on a page, including file downloads.  The logic behind these crawls is easy to identify, and thus filter.
  • Repeatedly requests pages that human visitors don’t access.  I see bots that must be programmed to crawl only XML files or request only rich media files.
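The “same periodicity” signature from the list above is one of the easier ones to test for.  The sketch below is one simple way to do it, assuming you have already collected the request timestamps for a single IP or user agent; the function name and the thresholds (at least five hits, gaps varying by no more than five seconds) are arbitrary choices for illustration.

```python
from datetime import datetime
from statistics import pstdev


def is_periodic(timestamps, min_hits=5, max_jitter_seconds=5.0):
    """True if consecutive requests are separated by nearly constant intervals."""
    if len(timestamps) < min_hits:
        return False
    ordered = sorted(timestamps)
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    # A low standard deviation of the gaps means a clockwork-like schedule.
    return pstdev(gaps) <= max_jitter_seconds


# Hypothetical visitor that requests a page at 10 minutes past every hour.
hourly_bot = [datetime(2007, 7, 1, hour, 10) for hour in range(8)]
# Hypothetical human with irregular gaps between requests.
human = [
    datetime(2007, 7, 1, 9, 0), datetime(2007, 7, 1, 9, 2),
    datetime(2007, 7, 1, 9, 30), datetime(2007, 7, 1, 13, 45),
    datetime(2007, 7, 1, 13, 46), datetime(2007, 7, 1, 20, 5),
]

print(is_periodic(hourly_bot), is_periodic(human))
```

Site monitoring services behave this way too, so a periodic visitor is a candidate for review and segmentation, not automatically a villain.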

Adding to the complexity of bot detection, I’ve noticed that bots evolving in the modern age of Web 2.0 also:

  • Execute javascript
  • Enter the site from various referrers using various methods
  • Come from different IP addresses and subnets
  • Repeatedly hit one page, repeatedly!
  • Spoof their user agents, thus not identifying themselves
  • Take cookies (they are cookie monsters).

Remember, not all bots are bad!  You want the Googlebot and other useful spiders to crawl your site. 

Let’s take a closer look at the nuances of new-wave bots in Part 2.

googlebot_searchengine_spider.gif