Web Analytics Blogs

Judah Phillips is an experienced web analytics practitioner and Internet expert currently working as a Director at a large multichannel media company. His blog is full of useful, unbiased, actionable insights learned from the real-world practice of a process-oriented, integrated approach to strategic Web Analytics for improving business performance.

Archive for 'Spiders and Bots'

AVG LinkScanner Obfuscates User Agent!

AVG has obfuscated their user agent.  One of the current agents for customers of their free and paid tool now cloaks itself as IE6:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

In addition to the easily detectable user agents:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)
User Agent:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)  
User Agent:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

This news is not good.  If you filter the SV1 agent, you risk filtering legitimate traffic from the IE6 browser.  A few folks have commented to me that one should filter the user agent anyway, because 1) IE6 is in decline and 2) most IE6 users have .NET installed, which will show in the user agent.  Still, filtering it makes me a little uneasy.

Is this the death knell for log file analysis and services provided by ABCe (since they can’t filter this user agent either)?  Maybe it is.  AVG is touting that the agent lacks an HTTP Accept-Encoding header, which is just dandy, but that information isn’t normally captured in logs.
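
If you can get your server to log the Accept-Encoding request header (on Apache, for example, by adding `%{Accept-Encoding}i` to your LogFormat), the AVG claim above could be turned into a rough filter.  Here is a minimal sketch in Python.  The function name and the labels are my own invention, and the rule "SV1 agent with no Accept-Encoding is probably the scanner" rests entirely on AVG's statement, so treat it as an assumption and not a guarantee:

```python
from typing import Optional

# The two user agent strings quoted in this post.
AVG_DISTINCT_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"
AVG_CLOAKED_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"


def classify_request(user_agent: str, accept_encoding: Optional[str]) -> str:
    """Label one logged request as 'avg', 'likely_avg', 'likely_ie6', or 'other'.

    The labels and this function are hypothetical, not part of any tool.
    """
    ua = user_agent.strip()
    if ua == AVG_DISTINCT_UA:
        return "avg"
    if ua == AVG_CLOAKED_UA:
        # Treat None, an empty value, or the log placeholder "-" as "header absent".
        header_missing = accept_encoding is None or accept_encoding.strip() in ("", "-")
        return "likely_avg" if header_missing else "likely_ie6"
    return "other"
```

An exact string match is deliberately conservative: an IE6 user agent that also lists .NET versions will not match the cloaked string and falls through to "other."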

So the current situation is this:

  1. AVG has two user agents.  Both are filterable, but the SV1 agent is problematic to filter because you risk filtering legitimate traffic.
  2. Both agents in the current version request gifs in noscript tags, inflating counts in page tag implementations with noscript configurations.  AVG claims they will fix this issue.
  3. The bot uses “mad” bandwidth.  I’ve heard stories of bandwidth increasing to 100x normal levels.  Some webmasters are serving dummy files to the recognizable user agents, some aren’t serving content to IE 6 browsers (crazy), and some are redirecting the bot back to AVG (thus inflating AVG’s bandwidth, LOL!).
  4. Evidence points to this bot NOT inflating clicks from paid search (i.e. PPC) and thus NOT committing click fraud.   But it doesn’t remain out of the realm of possibility that the scanner may be accessing an ad vendor click redirector and causing a click.  Not trying to spread FUD here, just making a point. 
  5. AVG is looking at the option of checking either an external db (hosted by AVG) or a local cache to verify that sites in SERPs have been “scanned by AVG,” instead of repeatedly scanning sites every time they are listed in a SERP, to reduce the bandwidth issue and minimize fraudulent entries in log files.
  6. AVG is thinking about enabling white listing of sites, so they are skipped by the scanner.
  7. AVG is thinking about exposing a meta-tag that instructs the scanner to ignore the site.

Good luck with this nasty bot!  Interestingly, here’s how you smurf a site with the AVG LinkScanner. 

AVG LinkScanner Bot Executes JavaScript?!?

The well-researched answer is “no.”  The AVG LinkScanner Bot appears to prefetch the js and the gif (and pretty much everything else on the page), which for certain tools and their tag configurations generates false page views and visits (and the derivatives thereof), just as if it were “legitimate” traffic.

If your tag configuration is set up with noscript tags, AVG will fetch the content in the tags, including the gif, which means that:

  • The bot may be infesting the data of customers of web analytics vendors who configure page tag-based data collection in this way.
  • The bot may be inflating the data in such products/services offered by various web analytics companies.
  • Customers may be paying for server calls generated by this bot.

Vendors, of course, could easily filter the user agent to protect their customers:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813) 

But I haven’t heard a peep from any SaaS vendors about excluding the user agent, filtering already collected data, or refunding customers the cost of robotically generated server calls (regardless of AVG). Have you?

Think about this: many SaaS page tag vendors don’t provide detailed visitor-level data and user agent reporting.  That means that their customers have no ability to investigate this bot or detect it by filtering their reported data by the true user agent.

I’ve been talking about JS executing bots screwing with web data for about a year now.  SEOMoz and the folks at SlickSurface confirmed it quite recently (quoting me no less in their fantastic analysis).  So they do exist…

Now let me tell you a little story.  Once upon a time I was at a conference called eMetrics when the CEO of a company came up to me and said “hey I read your blog about bot detection, and I looked in my web metrics tool for traffic with high page view to visit ratios.”  Then he narrated a story to me about how he found a bunch of traffic that had page view to visit ratios of 5,000 to 1.  I said “do you use page tags?”  He said “that’s all my vendor provides, so yeah.”  And I said “you’ve found a javascript executing bot in your data.”  “I know,” he said.  “Well, did you call your vendor and let them know?” I said.  Now for the punch line:  he told me that the vendor (who shall remain nameless) told him “well, the traffic executed server calls.”  And they wouldn’t give him a refund!
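
The check that CEO ran can be sketched in a few lines of Python.  This assumes you can export rows of (segment, page views, visits) from your tool, where the segment might be a user agent or an IP address.  The function name and the default threshold of 100 are arbitrary choices for illustration, not a standard:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def suspicious_ratios(
    rows: Iterable[Tuple[str, int, int]], threshold: float = 100.0
) -> List[Tuple[str, float]]:
    """Return (segment, page views per visit) for segments at or over the threshold.

    rows are (segment, page_views, visits) tuples; repeated segments are summed.
    """
    totals: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for segment, page_views, visits in rows:
        totals[segment][0] += page_views
        totals[segment][1] += visits

    flagged = []
    for segment, (page_views, visits) in totals.items():
        # Skip segments with no visits to avoid dividing by zero.
        if visits > 0 and page_views / visits >= threshold:
            flagged.append((segment, page_views / visits))
    # Highest ratio first, so the worst offenders are at the top.
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A high ratio is only a hint that something robotic is going on; you still need to look at the traffic before you filter it.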

It’s worth mentioning that this bot definitely affects log file tools and packet sniffer tools.  Both must be configured to filter the AVG LinkScanner user agent.

Now here’s the rub for me.  I use AVG!!!  But I now find it increasingly difficult to support the company or continue using their products.  Why?  Because they are wearing a “bad hat” here:

  • First, they are fully aware of the effect of this bot on web analytics systems. They just don’t seem to care (yet).  UPDATE:  They have set up a Google Group to discuss this issue.  They must understand how companies of all types in all sectors use web analytics data to optimize their sites, set their marketing budgets, determine expected server load, and much more.  What do their Internet Marketers think?
  • Second, the Link Scanner tool may have a short shelf life and may offer limited protection.  Malware creators will easily adjust. Check out what my friend Steve McInerney, a very smart security expert, said on the Web Analytics Association’s Yahoo Forum:
What strikes me about this particular solution by AVG is how
incredibly … stupid it is on several fronts.
1. Noticeably impacting a users bandwidth is, technically, a security
breach in the first place, aka Denial of Service Attack.
2. Some of us live in countries that have rather severe bandwidth
charges/limits and the like, whom shall I send my excess bandwidth
bill to?
…(this) method is fundamentally
flawed. ie malware ignores any first request and only infects on a
second request - alternate cloaking. Whatever. This type of “solution”
only provides weak protection for a strictly limited period of time.
…not just “no security” but bad
security. Because folk feel they are being protected when they are
not, and hence will take greater risks and hence inflict greater harm
on themselves. :-( 
Ignoring the balance of positive to harm that this problem inflicts on
the users who use this product.
  • Third, AVG just doesn’t seem to “get it” yet.  They are potentially messing with the ability to drive commerce via data driven decision making, e-commerce analytics, site optimization, and online media measurement!  To quote The Register: “chief of research Roger Thompson - who designed the AVG LinkScanner - indicated he may do away with that unique user agent. His chief concern is security, and he doesn’t want webmasters or malware writers gaming his scanner. ‘In order to detect the really tricky - and by association, the most important - malicious content, we need to look just like a browser driven by a human being,’ he argues.”

WebMasterWorld has some good stuff to say here.  Read the Register’s first article here.  And check out the blog of the dude who broke the news first, and responses from AVG here and here.

Interesting stuff. So what do you all think? Have you seen evidence of this bot in user agent data from your page tag solutions that use the noscript tag for the image? 

A Note on Web Analytics and Ad Server Metrics…

In the wild world of online metrics, it’s a well known fact that metrics from web analytics tools and ad servers never match. Variances can be substantial.

What I mean is that, given no “refresh rate,” the total impressions for a single ad unit, which should be served on every page request, never matches the number of total page views on the site during the same period of time.  Sigh.
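
To put a number on that mismatch, here is a minimal Python sketch.  It assumes the simple case described above: one ad unit on every page and no refresh rate, so page views are the baseline the impressions "should" equal.  The function name is made up for illustration:

```python
def percent_variance(page_views: int, impressions: int) -> float:
    """Signed difference of ad impressions relative to page views, in percent.

    Negative means the ad server counted fewer impressions than the
    analytics tool counted page views.
    """
    if page_views <= 0:
        raise ValueError("page_views must be positive")
    return (impressions - page_views) * 100.0 / page_views
```

Tracking this figure per ad unit over time tells you whether the gap is stable (probably structural, like the reasons listed below) or drifting (probably something broke).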

Reasons why identically-named metrics from these two tools (like page views and unique visitors) don’t add up are numerous:

  • Different data collection methods.  Ad servers use page tags.  Many web analytics tools use page tags, but it’s not uncommon in web analytics to use additional methods, such as logs or packet sniffers.  The methods have no shared standards for collection or storage of the same data (like visit-level data).  Thus you get apples to strawberries comparisons when attempting to correlate the dimensions from different systems.
  • Unique data models.  Ad servers aren’t focused on counting page views and the other dimensions of web analytics (visits, time, and so on).  Rather, ad servers focus on serving and counting impressions served (and loads of related derivative calculations, like CTR, CPC, and the coolness of view-thru).   Metrics are based on an ad request and an ad code.  Ads aren’t targeted to a page (though that’s possible), but rather to a “zone” or “keyword.” What that means is that a “page” dimension may not even exist in your ad server’s schema.  In other words, you aren’t looking at impressions measured on a page, but rather at the number of impressions served in a different conceptual construct.  That’s one of the reasons why people say metrics and ad-serving systems “don’t measure the same thing.”
  • Untagged pages.  Just like analytics implementations suffer from challenges related to complete code coverage of page tags, so do ad serving implementations.  Companies need to determine how to centrally manage the deployment and orchestration of page tags *of all types* and verify all the pages have tags!  Don’t just expect it to work because tagging sounds so easy!  Suspect it won’t work, and determine what you’re going to do *before* you deploy.  Too late?  Time to reengineer. 
  • Non-JS executing clients.  Ad servers use page tags.  Not everyone and not all user agents execute javascript.  Everyone needs to realize that page tagging misses traffic as efficiently as it excludes it.  Period.  What percentage of the traffic you miss, you’ll never know… running and filtering your logs may provide an indication…
  • Ad blocking software.  Firefox’s Adblock Plus software is a big problem for sites that have a big techie audience, and it affects all sites.  Check your browser reporting and realize a good majority of those Mozilla users may be blocking your ads.  Look at the attitudinal data you have about visitors to gauge whether that’s a big issue for your online audience.
  • Cookie issues.  Third-party cookies get blocked (often by privacy software).  Many ad servers still serve third party cookies, and many corporations have not tricked their DNS to accommodate this issue (ahem, CNAME).  We all know how cookie deletion affects unique visitor counts.
  • Refresh rates. One page rendered in the browser and many banner “refreshes” makes it really hard to correlate page views and impressions served.
  • No rich media installed, and no fallback.  If the client doesn’t have certain plug-ins, and you have no fallback, you miss ad revenue.  Meanwhile the tag executes and you count the traffic.
  • Robots, spiders, and crawlers, oh my.  The web is so robotic.  The problem is amazingly understated, especially by companies who want to bill you on page views.  Different data collection methods allow some level of bots to dirty the data.  Logs are harder to efficiently filter.  When the ad server uses tags, and the analytics tool uses logs, you may get some wildly different numbers. 
  • Mobile, Mobile, Mobile, Mobile.  Not all Internet-connected mobile devices will display ads, but web analytics tools will track the behavior of mobile visitors.
  • Latency.  Visitors who move through the site too quickly may not execute the tag, thus no data is sent back to the server(s).  Ever wonder why vendors tell you to put the tag “high” on the page?

The influence these issues have on your site varies depending on audience.  Investigate factors causing variance and deviation between metrics systems, and educate your audience on why the numbers differ.

Part 2: Your Web Analytics Data Quality May Stink and Here’s Why!

In Part 1, I began a long list of reasons why your web analytics data quality may stink.  I’m continuing the list below (make sure you read Part 1 for context and to view the entire list)

  • Storing only visit level data.  Many tools don’t have schemas that store raw data at the visitor level.  Instead they provide access to only visit level data.  For example, you may not be able to see all the page views during a single visit per ip address or cookied visitor.   Assess the impact of the vendor’s schema on your goals.  Companies that use analytics data to feed other systems or that want to use visitor attributes for content targeting, segmentation, optimization, or analysis may not be well-served by some vendor schemas.
  • Little to no decodes or lookups.  If you use numeric codes and non-human readable naming conventions in your data, they can pass through to your reporting and prevent your colleagues from understanding the reporting.  Strange codes look like hieroglyphics!  Decoding and looking up data can eliminate the problem of non-readability and strange numerical names in your reporting… While some would say this is a reporting issue, not a data issue,  I chose to include it because it’s at the surface… it’s the data your customers see.  Not all tools decode or lookup.  Some tools allow rewriting of data in the database.
  • Failure of key services supporting the application.  If you are dependent on page tags, synchronization software, web servers, databases, or any of the wondrous technology that makes it all work, failures are a real bummer.  Make sure you have monitoring and recovery processes in place so you don’t miss data!  When page tag collection fails (perhaps the page tag server went down, eh?), the data is gone forever.  If the web server fails, then no logs are written, but no pages are served either - so is traffic missed?  But if the processes supporting log file analysis fail (e.g. data synch), watch out!
  • Inadequate or incorrect implementation.  If you can’t cross dimensions (like finding out what keywords referred traffic to a page), filter all of your data (for example, filtering pages to see only those viewed by the iPhone), easily create new metrics, or if the numbers aren’t adding up, you may have not adequately or correctly implemented your software or communicated your requirements to your vendor’s professional services team. 
  • Limited, hard-to-extend data model. Powerful, actionable insights from web analytics are enabled by extending a data model to incorporate business specific dimensions.  For example, if every page has a category and an author, you may want to see a list of all the page views in that category or ranking of pages by most popular author.  To do that you may need to join data at the database level or take advantage of variables you pass in a page tag.  Various tools have different limits on if, how, and to what extent you can extend the data model.
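
The category and author example from the last bullet can be sketched in Python as a simple join between page view records and a lookup of page attributes.  Everything here is hypothetical: the function name, the shape of the metadata dictionary, and the "(unknown)" label for pages with no attributes.  A real tool would do this in its database or through variables passed in the page tag:

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def page_views_by_attribute(
    viewed_urls: Iterable[str],
    page_metadata: Dict[str, Dict[str, str]],
    attribute: str,
) -> List[Tuple[str, int]]:
    """Count page views per value of a business attribute, most viewed first.

    viewed_urls holds one entry per page view; page_metadata maps a URL to
    its attributes, for example {"author": "kim", "category": "news"}.
    """
    counts: Counter = Counter()
    for url in viewed_urls:
        # Pages missing from the lookup are kept, not silently dropped.
        value = page_metadata.get(url, {}).get(attribute, "(unknown)")
        counts[value] += 1
    return counts.most_common()
```

Keeping the "(unknown)" bucket visible is a deliberate choice: it shows you how much of your traffic the extended data model does not yet cover.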

So what do you do when you know your data quality is less than stellar?  Here’s some guidance:

  • Don’t worry, be happy. :-) Just by collecting the data you are collecting, you are doing better than a great majority of companies that do business on the Internet.  By asking questions about data and investigating the issues, you have a leg up on your competition.  Work on optimizing the data, expose flaws in site design or architecture that impede data collection, work with your vendor and seek help in the web analytics community if you run into real problems.  The Web Analytics Association’s Forum on Yahoo is a useful place for posting questions.  But whatever you do, stay positive and focused on solving your problems and making your web analytics practice more optimized.  Don’t get frustrated.
  • Recognize the limitations in the data and do not go gently into the night.  Ask the hard questions about sampling, schemas, data retention, processing, querying and reporting to understand where the holes and noise could be in your data.  Demand answers from your vendors and quick response times to your questions about data quality.  If your vendor is frustrating you by not being responsive, talk to the boss and the vendor’s bosses, escalate, escalate, escalate until you get resolution.
  • Understand the underlying elements of data collection and what can go wrong.  Learn about sessionization and why different tools and data collection methods have limitations.  Explore the more technical components of the backend, like the database and your web analytics schema - all your data is in one (or more)! Talk to your engineers.  Have them explain the technology in terms you understand.
  • Evaluate your tools.  Some tools are just better suited for particular business problems than other tools.  Log files tools enable you to constantly change assumptions and reprocess data.  Page tags provide a standard data collection and transport mechanism.
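
If sessionization is new to you, here is a minimal Python sketch of one common approach: group each visitor’s hits and start a new visit whenever the gap between two hits exceeds an inactivity timeout.  The 30-minute default is an assumption for illustration; tools differ in their rules, which is exactly why their visit counts differ:

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple


def sessionize(
    hits: Iterable[Tuple[str, datetime]],
    timeout: timedelta = timedelta(minutes=30),
) -> Dict[str, List[List[datetime]]]:
    """Group (visitor_id, timestamp) hits into visits split on inactivity."""
    by_visitor: Dict[str, List[datetime]] = {}
    for visitor_id, timestamp in hits:
        by_visitor.setdefault(visitor_id, []).append(timestamp)

    visits: Dict[str, List[List[datetime]]] = {}
    for visitor_id, stamps in by_visitor.items():
        stamps.sort()
        sessions = [[stamps[0]]]
        for timestamp in stamps[1:]:
            # A gap longer than the timeout closes the visit and opens a new one.
            if timestamp - sessions[-1][-1] > timeout:
                sessions.append([timestamp])
            else:
                sessions[-1].append(timestamp)
        visits[visitor_id] = sessions
    return visits
```

Run the same hits through it with a 30-minute and a 60-minute timeout and you can get different visit counts from identical data.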

With hard work on your part, you can make your web analytics data smell like roses!  I know you can! :)

Part 1: Your Web Analytics Data Quality May Stink and Here’s Why!

Web analytics data quality and the accuracy of “the numbers” are always questioned.  With so many sources of data from different systems and vendors - both free and paid - you must be able to reconcile deviations in data from different sources, and speak intelligently about data quality and accuracy to promote adoption of web analytics at your organization.

There are so so so many reasons why web analytics data quality can stink.   I thought it would be fun to list some of the major reasons (over two posts):

  • Spiders and Bots.  If you haven’t read my series on spiders and bots, check it out.  Non-human traffic can inflate your metrics and diminish the predictive power of your analysis.  Regularly look for bots and update your filtering!
  • Untagged pages.  If the bulk of your page views are being generated by a single page, and you’ve failed to tag that page, you’ve lost data.  That’s always a bummer.  Tagging must be endemic to the web development process. Ask your team how they know that every page is tagged.  Really how do they know?  Verify and reverify.  Use a tool like WASP.   Obviously this isn’t a problem with log files.  
  • JS turned off.  If the browser doesn’t execute javascript, the page tag won’t fire and traffic will be missed.  If you’ve missed it, you don’t know it occurred, so you are constantly in the dark unless you compare your page tagged data to log file data, which isn’t easy at all!  How you account for missing this traffic, whether it is immaterial or not, is a business decision. Again, not a problem with log files.
  • Latency.  If the page tag doesn’t fire because it failed to load, the traffic is missed.  Vendors provide recommendations about the best place for a page tag.  Your development team may not believe them, or the “global include” that may or may not exist may or may not insert the code in the suggested spot.  It’s a good idea to listen to vendors when they provide configuration recommendations.  Once again, how you account for this data discrepancy is a business decision. Not a problem with log files.
  • Differences in sessionization.  Data divergence gets hairy when you are running two tools on the same site, or are replacing one tool with another.  That seems to be very common these days with Google Analytics.  Eric Enge over at Stone Temple Consulting in cooperation with my pal Jim Sterne has some data from the 2007 Web Analytics Shootout that you should read to help you understand how different vendors sessionize.
  • Sampling, sampling, sampling.  My friend Avinash Kaushik does a good job covering issues around sampling in web analytics.  I recommend reading his post!  I’ll add that statistical methods applied to web analytics data are completely valid; however, sampling at the site, page, or database level opens the possibility that you miss key data.  Sites looking for the “long tail” of visitors and using data sampling or other data trimming methods may not find it.
  • No referrer passed.  Referrer analysis tells you what site people came from before they visited your site.  Bookmarks, typed pages, email campaigns, and bots don’t pass referrers.  Sometimes referrer information just isn’t passed by the browser.  If you overuse redirects on your sites, you may lose referrer data.  Not having this important information impacts SEO/SEM, linking campaigns, and affiliate partnering.  
  • Cookie configuration and deletion.  We all know about cookie deletion thanks to Eric’s research while at Jupiter.  In addition, if your server isn’t set-up right, you may not be setting the cookie on the first request. Talk to the web server guys and gals at your company to make sure your cookie handling is optimized for web analytics.
  • Proxies.  Proxies can filter out your referrers and make it look like everyone’s coming from the same ip address.  Most tools enable cookie-based visitor identification to work around the proxy issue, but if cookies aren’t set up right on your web server or configured correctly in your web analytics tool, assessing uniqueness when the bulk of traffic comes from the same IP will be problematic.     
  • Time spent metrics have severe limitations.  Single-page visits and the time spent on the last page in the visit aren’t measured in “total time online.”  Time-based metrics are schoolboy metrics, so don’t overemphasize them as stand-alone indicators of “engagement.”  Assess the impact of time spent in the context of goals.  Use time as a variable for segmentation and as input into a larger engagement metric. Use them in context, especially if you are selling advertising (and who isn’t)?  My friend Jim Novo has one of the best takes on time spent metrics I’ve read.
  • Failure to maintain exclusions and filters.  Data can be polluted if you are introducing new filetypes  and not excluding them from your top content reporting or if you are not maintaining your bot filters.  New bots are crawling your site right now.  What are those web developers doing now?   Do you know?  Are you checking?  What’s the process for doing so?
  • Little to no historic data.  If you’ve just tagged your pages, you may have no basis for historical comparisons for quite some time (like a year!).  Data has realized value from understanding current behavior, and potential value from predicting future behavior.  And you need data to do it.    
  • You’re not storing all your dynamic URLs.  Your URLs may have many parameters in the query string (i.e. the name/value pairs after the “?”).  Some tools cut out this data and don’t make it available for querying or reporting in their applications.  If your site is database-driven and dynamic, not having access to every URL request will limit your ability to do ad-hoc analysis, filtering, and segmentation.
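
To make the last bullet concrete, here is a small Python sketch of what those name/value pairs look like once they are pulled out of a URL, using the standard library’s `urllib.parse`.  The function names and the sample URLs are made up for illustration:

```python
from collections import Counter
from typing import Dict, Iterable, List
from urllib.parse import parse_qs, urlsplit


def query_parameters(url: str) -> Dict[str, List[str]]:
    """Return the name/value pairs after the '?' as a dict of lists."""
    # keep_blank_values=True keeps parameters such as "ref=" with no value.
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


def requests_by_parameter(urls: Iterable[str], name: str) -> Counter:
    """Count requests per value of one query string parameter."""
    counts: Counter = Counter()
    for url in urls:
        for value in query_parameters(url).get(name, []):
            counts[value] += 1
    return counts
```

For example, `query_parameters("/article.php?id=42&cat=news")` gives you `id` and `cat` to segment on.  If your tool throws the query string away, that segmentation is gone.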

Let’s continue this long list in Part 2!

Web Analytics and Data Collection: Log File Analysis

Web Analytics and log file analysis is misunderstood!  With so many options for data collection in web analytics, the log file has been somewhat disdained for many years.  In fact, I’d venture to guess that the majority of web analytics practitioners working today don’t “do log file analysis.” That’s unfortunate.  One can learn a lot about web analytics from examining log files - for example, “total time online” (with all its inherent issues :).
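
As a rough illustration of those inherent issues, here is a minimal Python sketch of how “total time online” falls out of the request timestamps for one visit.  It assumes you have already grouped the log entries into a visit; the function name is hypothetical:

```python
from datetime import datetime
from typing import List


def time_online_seconds(page_times: List[datetime]) -> float:
    """Seconds between the first and last page request in one visit.

    The log only records when a page was requested, so time spent on the
    final page is unknown and a single-page visit measures as zero.
    """
    if len(page_times) < 2:
        return 0.0
    ordered = sorted(page_times)
    return (ordered[-1] - ordered[0]).total_seconds()
```

A visitor who reads one page for twenty minutes and leaves contributes nothing to this metric, which is worth remembering before you report it as “engagement.”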

Why is server-side data collection (log file analysis) less popular than client-side page tagging?  It’s not about bots, accuracy, or ease of deployment… that’s marketing spin, somewhat true and false depending on context.  It may be because most web analytics practitioners currently use a hosted model.

Page tagging enables ASP models  for web analytics.  It would be very difficult (i.e. resource intensive and costly) for a web analytics ASP to maintain a competitive service level if the ASP relied on its clients and their IT teams to maintain server side data collection- maintaining, transporting, and synching log files.   

Imagine the ASP model with log files!  It’s nightmarish and makes little sense:  hundreds and thousands of customers with just as many (or more) different servers behind firewalls generating many different log file formats.  Then, at various intervals all customers sending uniquely formatted logs for the ASP to parse, index, and make available for querying and retrieval in close to real time.  Yikes!! 

Page tagging creates a standard data structure for the ASP and a manageable data transport mechanism while minimizing complexity and reducing infrastructure cost in a way that log files just can’t.  Page tags enable ASPs to meet rigorous service level agreements and contractual obligations for data processing in a way they probably couldn’t with log files.

So who does log file analysis for web analytics?  The answer is companies that have:

  • No desire to give up control of a web analytics infrastructure to young companies.  Web data is valuable and log files enable you to keep it all for however long and at whatever cost your business justifies.  You can process your data using whatever tool you want, and reprocess at your discretion.
  • Existing business dimensions and structured data already existing in log files. Simple web analytics concepts like page views, visits, and “uniques” become more meaningful when sliced by relevant business objects from your company’s existing log data.  For example, if you’ve passed a unique parameter in your query strings for years, you can program your log file tool to recognize the parameter.   Then reprocess your existing data to make use of it in your reporting.
  • Deep data mining experience.  Data miners want data to mine, not static GUI’s or “my data” delivered in summary extracts at an incremental cost.  Log files can be parsed and indexed into the database that you, the company or the client, actually owns at the budget you actually set.
  • Know how to employ hybrid data collection.  Log files are complemented by page tags for instrumenting pages to track events and Web 2.0.
  • Complex integration requirements.  All the data exists on machines in your network.  Log files can be extended to incorporate new business dimensions.
  • A need to understand how spiders and bots behave on their site and the performance of their server(s).  Log files capture everything that happens on the web site.  The “kitchen_sink.html” is in there.  It’s up to you to customize the tool to meet the quality level you demand for filtered data.  As Mat and Steve mentioned in the comments, logs also enable you to view server errors, useful for site QA and debugging.
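
The point about server errors in the last bullet is easy to sketch in Python.  This assumes log lines in the Apache common or combined format, where the quoted request is followed by a three-digit status code.  The regular expression and function name are my own and would need adjusting for other formats:

```python
import re
from collections import Counter
from typing import Iterable

# Matches the quoted request and the status code that follows it, e.g.
# "GET /cart.php?id=1 HTTP/1.1" 500
REQUEST_STATUS = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def error_paths(lines: Iterable[str], min_status: int = 500) -> Counter:
    """Count (path, status) pairs whose status code is at or above min_status."""
    errors: Counter = Counter()
    for line in lines:
        match = REQUEST_STATUS.search(line)
        if match and int(match.group("status")) >= min_status:
            errors[(match.group("path"), match.group("status"))] += 1
    return errors
```

Set `min_status=400` to include broken links (404s) along with server errors.  A page tag never sees either, because the page that would have carried the tag never rendered.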

I’m sure there are other reasons too….  I’ve heard people tell me that:

  • It’s hard to move log files.  That depends on how many log files you have, what technology you are using to move them, and how often.
  • Log files are difficult to parse. Not really, especially if they are standardized.  You just need a flexible tool.  If things get non-standard, you need a tool that enables you to define your own formatting.
  • Log files take up a lot of space.  Yes, they do. How much space do you have?  What are your company’s existing standards for parsing log files?  How long will you store them?  You may find out you have plenty of “space.”  You will never have enough time though. ;)
  • Log files aren’t as accurate as page tags.  Huh? That all depends on how well you filter and exclude pages.  Accurate compared to what? To unvalidated data existing on servers you don’t see in some data center thousands of kilometers/miles away?  Logs capture visits from browsers that don’t execute javascript and thus will not appear in the metrics from tag-based systems…

So, you might be thinking I am a big fan of log files.  I am a web analytics realist.  Challenges with log files include:

  • Standardizing on a common format.  If you have a web server, you are at risk for someone doing something that creates a custom log file format.  If you are the new web analyst coming into an existing company, you may find custom fields in the logs.  Develop and champion a company standard.  It will make your life as a web analyst somewhat easier.
  • Creating a single “core” configuration that can be extended to deal with one-off formats.  If you live in world of custom log files, you can abstract a set of core fields, describe them, then extend the description when you deal with custom craziness.
  • Synchronizing/transporting log files.  The more servers you have, the more chances you have for failures in synchronization.  Application monitoring and recovery procedures are critical to preventing downtime.  Develop them.
  • Reprocessing log files.  Depending on your system, reprocessing existing log file data can have a myriad of effects - from data not being available, to data being partially available, to errors, to long waiting periods before the data is current and ready to query.
  • Receiving log files that are missing key fields or with key fields that have changed.   If you depend on systems, companies, and people that you do not control, fields can be stripped and formats changed in the blink of an eye.  The web team needs to detect and manage these events.
  • Monitoring the operational workflow.  Someone on the web analytics team needs to manage the system and maintain the overall availability and integrity of log file processing and any issues that could result in the data not being synched or processed correctly. 
  • Managing change.  Since all your “hits” are captured in the log file, you’ll need to tell your web analytics tool how to correctly process the logs and remove data you don’t want to report.  You’ll need to identify and maintain lists of filetype exclusions and the user agents/IPs of robotic traffic.
  • Finding staff that “gets it.” The rule is that success in web analytics results from having competent people who have experience understanding the risks and rewards of web analytics data collection and executing on a web analytics strategy that generates value. 
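
The single “core” configuration idea from the list above, along with detecting logs whose fields have changed, can be sketched in Python like this.  The core fields follow the Apache common log format; the names `CORE_FIELDS`, `build_parser`, and `parse_line`, and the `session_id` custom field, are all hypothetical:

```python
import re
from typing import Dict, Optional, Pattern, Sequence, Tuple

# Core description: the space-separated fields of the common log format.
CORE_FIELDS = [
    ("host", r"\S+"),
    ("ident", r"\S+"),
    ("user", r"\S+"),
    ("time", r"\[[^\]]+\]"),
    ("request", r'"[^"]*"'),
    ("status", r"\d{3}"),
    ("bytes", r"\S+"),
]


def build_parser(extra_fields: Sequence[Tuple[str, str]] = ()) -> Pattern[str]:
    """Compile the core fields plus any custom fields appended after them."""
    fields = list(CORE_FIELDS) + list(extra_fields)
    pattern = " ".join(f"(?P<{name}>{regex})" for name, regex in fields)
    return re.compile("^" + pattern + "$")


def parse_line(parser: Pattern[str], line: str) -> Optional[Dict[str, str]]:
    """Return the named fields, or None when the line does not fit the format."""
    match = parser.match(line.strip())
    return match.groupdict() if match else None
```

Because the pattern is anchored at both ends, a line with an extra or missing field returns `None` instead of parsing incorrectly.  Counting those `None` results per file is a cheap way to notice that somebody changed the format on you.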

As a web analyst you need to use the tools at your disposal to build your web analytics practice.  Log files are part of your toolkit.  Be wary of people who try to sway you one way or the other when it comes to web analytics data collection.  The method you ultimately use should be based on the data you want to collect in the context of the company’s goals, budget, and expertise.

Part 4: Spider and Bot Capability Maturity Framework for Web Analytics 2.0

Let’s conclude this series on the spiders and bots of web analytics 2.0 with a framework for ranking the spider and bot capability maturity of a web analytics practice!

Before I do so, let me mention a few things:

  • How/if you detect spiders and bots depends on your data collection methods.  It is thought that page tags exclude spiders and bots because they don’t execute javascript or request images. Until they do. Log file data must be examined and filtered (either programmatically, manually, or both), and those filters need to be maintained by a web analytics professional.
  • Bots are used for many purposes.  From inflating metrics, to indexing text for search, to chatting, to spamming, to data mining, to site performance monitoring, to click fraud, and more.
  • Bots are constantly evolving. Just like web analysis. :)
  • Not all bots are bad news.  You want bots crawling your site.  Some bots are good and helpful for your online business. Imagine if you blocked Googlebot.  Segment the robotic traffic into separate reporting (in the best case) and make sure it’s filtered from your externally reported numbers.
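
To illustrate the "segment rather than block" idea, here is a small sketch that assigns each hit to a reporting segment by user agent. The pattern lists are assumptions for illustration only: "Googlebot" and "Slurp" are tokens those crawlers send, while the "bad" entries are made-up placeholders. A real list would come from a maintained source.

```python
import re
from collections import Counter

# Illustrative patterns only; the bad-bot names are hypothetical placeholders.
GOOD_BOT_PATTERNS = [r"Googlebot", r"Slurp"]
BAD_BOT_PATTERNS = [r"ExampleScraper", r"ExampleHarvester"]


def segment(user_agent):
    """Assign a hit to 'good_bot', 'bad_bot', or 'presumed_human'."""
    for pattern in GOOD_BOT_PATTERNS:
        if re.search(pattern, user_agent, re.IGNORECASE):
            return "good_bot"
    for pattern in BAD_BOT_PATTERNS:
        if re.search(pattern, user_agent, re.IGNORECASE):
            return "bad_bot"
    # Unmatched agents are only presumed human: spoofing bots land here too.
    return "presumed_human"


agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "ExampleScraper/0.1",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
]
report = Counter(segment(agent) for agent in agents)
print(report)
```

Keeping the bot segments as their own counts, instead of discarding them, is what lets you report spider traffic separately while leaving it out of the externally reported numbers.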

Now without further ado, here’s a web analytics 2.0 framework for ranking your spider and bot capability maturity:

  • Red.  You are in the red if the following applies:
    • I don’t understand the impact of spider and bot traffic nor do I measure, filter, or segment it at all. 
    • If I run an in-house solution, I haven’t updated my list of filters and exclusions since I deployed it. 
    • If I run a hosted solution, I rely solely on my vendor to control all aspects of bot filtering and data purification.
    • I don’t know when or how the Googlebot hits my site, but it sure does! 
    • I’ve never thought of reporting spider traffic for SEO.
    • I think the words “yahoo” and “slurp” refer to delicious soup.
    • Spiders scare me.
    • All bots are bad (no, they aren’t!)
  • Yellow. You are moving out of the red zone, into the yellow if:
    • I occasionally look for suspicious traffic. 
    • If I run an in-house solution, I have occasionally updated my list of filters and exclusions since I deployed it. 
    • If I run a hosted solution, I rely solely on my vendor to control all aspects of bot filtering and data purification, but I ask for verifications of compliance with industry standards.
    • I’ve heard of the Googlebot and maybe I use or I am thinking about using Site Maps (do it!). 
    • I know about Yahoo! Slurp.
    • It would be cool and useful to report on spider and bot traffic and learn which bots are good and bad. 
  • Blue.  You are in the blue if at least every month or on an ad hoc basis:
    • I work with my vendor or in-house team to recognize and remove spider and bot traffic, conforming to industry standards
    • I do a monthly update of my filter and exclusions list, or I know my vendor does.
    • I capture bot traffic in my server log files, but I may or may not report on it.
    • I know all about the Googlebot, and I use Site Maps. 
    • I realize the Googlebot and Yahoo! Slurp behave differently on my site, but I don’t know exactly how.
    • I know which bots are good (like search bots) and bad (like content scrapers).
  • Green.  You are in the green if:
    • I have established a process for regularly removing spider and bot traffic and for keeping my lists of exclusions and filters up-to-date, or I am *absolutely certain* my vendor has such processes.
    • I am in compliance with industry standards.
    • I measure spider and bot traffic and segment it into distinct reporting separate from my human traffic reporting.
    • My SEOers love me, the web analyst, and I am involved in educating people in my company about bot traffic from IT to managers to consumers of reporting and analysis.
    • I know that there’s more than one Googlebot, and I know how its crawl differs from Yahoo! Slurp (hint: Slurp visits more).
    • I know detailed metrics like the “total time online” of all my bots. ;)

What I’ve covered in this four-part series is only a little bit of what I know and what there is to know about spiders, bots, and crawlers and how they affect web analytics.   I’d enjoy hearing other experiences or opinions, so please share your comments if you feel like it.  Until next time, fine readers.  Thanks for visiting!

sengine1.gif

Part 3: Web Analytics Quality through Spider and Bot Recognition and Filtering

Welcome to Part 3: Web Analytics Quality through Spider and Bot Reconciliation. My goal with this series of posts was to open discussion about what Alex Langshur of PublicInsite called, in reference to these posts: the “quality of web analytics.”  I’m talking about doing the most and the best we can with current technology to ensure quality and integrity in web analytics data collection and reporting - like I know we all do now, right! :)

It’s about doing the best we can at our jobs because we’re lucky to have them - at least that’s how I feel about my “real” job in web analytics.

To begin detecting bots and hunting spiders, here’s what I recommend:

  • Meet industry standards.  The Interactive Advertising Bureau publishes a list of spiders, bots and user agents that member companies have confirmed are robotic.  You can add these patterns to your list of IPs and agents to exclude.  A question to be asked is whether your vendor subscribes to this list, and thus makes your analytics implementation compliant with the IAB.
  • Use public resources.  Peter Adams commented about an excellent resource in Part 1: Gary Keith’s Browscap project is a must-have component of any web analytics / spider detection effort. A simple Google search reveals a list of resources to leverage.
  • Use hybrid data collection.  You can exploit the non-javascript executing bot busting power of page tags, and process log files to detect how spiders and bots crawl your site.  That’s hybrid.  You’ll need a tool that enables you to carve out spider and bot traffic from human traffic (certain tools do this).  If you run the logs of a site you’ve page tagged, you can compare the traffic from both sources and look for identical traffic that has unusual patterns too.
  • Identify and then filter unusual traffic patterns. Query your database and/or look at your reporting to identify and filter IP addresses and user agents or combinations thereof that view a total number of pages that you find suspicious.   Look for continuous visits repeating continuously!  Check out paths that have over N page views, where N is an arbitrary number larger than the “average pages viewed per visitor” on your site.  What’s requesting your robots.txt?  What do the attributes of the traffic coming from Africa, the Russian Federation, India, and China look like?
  • Talk to your engineers.  They are really smart and know lots of things about the web.  They are your allies in web analysis.  As web analytics practitioner, programmer, and all around smart dude, Steve McInerney mentioned: “…is the biggest giveaway I’ve found. Bots don’t request images/CSS etc as much as people do. Some tricky HTTP headers to force browsers to reload, say certain images or CSS, can go a long way towards helping identify bots - who won’t 304 the image or css.”  Your CMO may not understand what Steve is saying, but your CTO will.  And Steve may only understand when I say: Aussie! Aussie! Aussie!
  • Keep your filter lists up to date. Obviously, doh! You’ll need to maintain your lists of exclusions.  Someone needs to be dedicated to doing this process-based activity.
  • Identify the “spiders and bots” metrics on your site and compare to suspicious traffic.  I know the average number of pages viewed by a spider, the average visit duration per bot, and how other key metrics relate to spiders.  I also know this for humans.  When I am suspicious about traffic, I can compare attributes of visit or visitor to my bot and human profiles.
  • Examine user agents.  Look for irregular, unusual, and just plain weird user agents.
  • Set up a tarpit.  A tarpit is a server to which you redirect bot traffic and serve those bots random binary objects, very slowly, until they stop crawling.  When the bot maker checks on the results, they find they have nothing and need to recrawl.  The idea here is slowing bot evolution.  If you block the IPs then the bot makers will know you’ve found them and adjust their tactics accordingly to make them less detectable.
  • Remember that not all bots are bad, so don’t prevent the good bots from crawling your site!  The Googlebot and Yahoo! Slurp index your site’s content so that it is retrievable when people query their search engine.  Don’t exclude bots using robots.txt or a tarpit unless you know that bot is nefarious or should be excluded.  In practice, I rarely block bots from visiting a site.  Rather I segment robotic traffic out of the numbers I report about site behavior and performance.
  • Manage expectations.  The truth is you can’t filter what you haven’t seen before.  You can’t predict the future of new bot behavior.  No matter what, bots will get through your data collection methods.  As an analyst you need to actively manage expectations by educating your user community, clients, and colleagues.  Soon enough you’ll have associates who are savvy human filters letting you know when bots slip through.
  • Create a business process for spider and bot reconciliation.  The savvy web analytics team has sustainable, manageable, and repeatable processes for scheduled detection, filtering, removal, quality assurance, and exclusion list maintenance.
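
Several of the checks in this list (visits with over N page views, requests for robots.txt, and Steve’s point about bots skipping images and CSS) can be combined into one pass over your hits. The following is a rough sketch rather than a production filter. The thresholds, the asset extension list, and the idea of keying a visit by IP plus user agent are all assumptions you would tune against your own data.

```python
from collections import defaultdict

# Assumed list of "asset" extensions; extend it to match your site.
ASSET_EXTENSIONS = (".gif", ".jpg", ".png", ".css", ".js")


def suspicious_visits(hits, max_pages=100, min_pages_for_asset_check=5):
    """Return {visit_key: [reasons]} for visits that look robotic.

    `hits` is an iterable of (visit_key, path) pairs, where visit_key is
    whatever identifies a visit in your data (for example IP plus user agent).
    """
    pages = defaultdict(int)
    assets = defaultdict(int)
    robots = set()
    for key, path in hits:
        lower = path.lower()
        if lower == "/robots.txt":
            robots.add(key)
        elif lower.endswith(ASSET_EXTENSIONS):
            assets[key] += 1
        else:
            pages[key] += 1

    flagged = {}
    for key in set(pages) | set(assets) | robots:
        reasons = []
        if pages[key] > max_pages:
            reasons.append("page views over threshold")
        if key in robots:
            reasons.append("requested robots.txt")
        # Only judge the asset ratio once there are enough pages to judge by.
        if pages[key] >= min_pages_for_asset_check and assets[key] == 0:
            reasons.append("no image/CSS requests")
        if reasons:
            flagged[key] = reasons
    return flagged


# Made-up hits: one crawler-like visit and one ordinary visit.
hits = [("bot", "/robots.txt")]
hits += [("bot", "/page%d.html" % i) for i in range(150)]
hits += [("human", "/index.html"), ("human", "/logo.gif")]
flagged = suspicious_visits(hits)
print(flagged)
```

Treat the output as a list of candidates to review, not as a verdict: a flagged visit that requested robots.txt may well be a good bot that belongs in its own segment.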

Happy Hunting!  Click here for Part 4, my “Spider and Bot Capability Maturity Framework.” Thanks for visiting!

animated-search-engine_spider.gif

Image from link2city.com.

Part 2: The Crawlers, Spiders, and Bots of Web Analytics 2.0

In Part 1, I blogged about a huge challenge in web analytics: bot, spider, crawler detection, analysis, and filtering.  According to robotstxt.org, “robots” are:

“…programs that automatically traverse the web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced… the specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time… it is still a robot.”

In some regards it is “taboo” to speak about evil bots, but I think it’s a topic that needs to be brought out of the dark.  If we have to pay for page views, I require and expect accurate page view metrics. 

By understanding the characteristics of bots to help detection and purification of your web metrics, you will deliver more accurate analysis. Let’s take a deeper look at other bot behavior of Web Analytics 2.0:

  • Javascript executing bots.  Yes, they exist.  Big time.  They will invoke your beacon code, and generate large numbers of false page views, which your vendor should be filtering out.   Maybe your vendor filters traffic where large numbers of page views are generated in a single visit, or maybe you are paying for them.  Check your pathing reports or your ratios, like view:visit. Do you see visits that have hundreds or thousands of page views?  Chances are those are bots (or your configuration may be technically challenged).

A large search engine company’s client service folks even hinted at javascript executing bots being programmed to execute queries on internal search engines, thus invoking an embedded advertising module, inflating the number of requests for the ad module.  While that doesn’t affect clickthrough or revenue and has nothing at all to do with the ad technology or the engine, it affects metrics. It confirms what I see.

Another favorite javascript executing bot I know likes to exploit HTML commenting left on in blogging architectures.  The bot spams comments with nefarious link farms or other unsavory content.  It executes javascript to do it.

Take, for example, the “rofovnifo” bot.  It fills out blog comments with the following content, leaving links to the nefarious goldwarez.org (don’t go there!) coming from Comcast 67.160.110.199:

Hi. Looks good! Very useful, good stuff. Good resources here. Thanks much!  G’night

A quick Google search reveals that “rofovnifo” really likes Cabela’s, sailing, Friendster, XBOX 360, car audio, card sharing, and world affairs–among many other things.  If only “rofovnifo” were a legitimate human, then I’d believe the traffic he generated and continues to generate across the 94,500 links he’s visited across the world over the last several days! (UPDATE: 127,400 links… wait 128,000)

  • Enter your site from various referrers using various methods. One bot that I find fascinating executes a query on a particular country’s Google site (let’s say Google.au), then the bot crawls site pages returned in the SERP (search results page).  Next it indexes the content on the returned page. In the case I observed, this bot crawls a directory of subject matter related to a certain discipline.  The directory has thousands of pages.  So what I see is the bot coming from Google.au from a certain keyword, entering the site, and crawling 4,000+ pages during one visit that lasts a few hours.  Not human.
  • Come from different IP addresses or subnets. I’ll find bots that jump around IP addresses, but always hit the same page or pages for the same amount of time.  Or the bots will come from different subnets and hit the same page at a given periodicity (like every 15 minutes).
  • Repeatedly hit one page, repeatedly!  In a former job, I noticed that one educational institution referred a bot to the site on only the first of the month. The bot would crawl 77,000 pages then stop.  Are these clickfraud bots?  What are they?  Were they testing a prototype?
  • Spoof their user agent.  Nasty bots don’t want to say “hey, I am nasty bot” so they try to hide themselves by looking like normal user agents.

Mozilla compatible user agents look something like this:

xx.xxx.xx.xxx:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.307; .NET CLR 1.1.4321; Media Center PC 4.0; .NET) 

Bots will spoof user agents to get by your filters.  Sometimes they are sloppy in doing so, which makes them instantly detectable.  Notice the upper-case WINDOWS and the shortened NET reference:

xx.xxx.xx.xxx:Mozilla/4.0 (compatible; MSIE; WINDOWS 5.1; SV1; NET 1.0; Media Center PC 4.0; .NET)
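
Tells like these can be checked mechanically. The sketch below is a hypothetical helper that looks only for the sloppiness visible in the sample above: the upper-case WINDOWS, the NET reference missing its leading dot, and (also visible in that string) an MSIE token with no version number. It is nowhere near a full user agent validator, and the function name is my own.

```python
import re


def sloppy_msie_tells(user_agent):
    """Return a list of sloppy-spoof tells found in an IE-style user agent."""
    tells = []
    # Genuine IE agents carry a version, e.g. "MSIE 6.0".
    if "MSIE" in user_agent and not re.search(r"MSIE \d+\.\d+", user_agent):
        tells.append("MSIE token has no version number")
    # Genuine agents write "Windows NT", not upper-case "WINDOWS".
    if "WINDOWS" in user_agent:
        tells.append("upper-case WINDOWS")
    # Genuine agents write ".NET CLR 1.0...", not a bare "NET 1.0".
    if re.search(r"(?<![.\w])NET \d", user_agent):
        tells.append("NET reference without leading dot")
    return tells


genuine = ("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; "
           ".NET CLR 1.0.307; .NET CLR 1.1.4321; Media Center PC 4.0; .NET)")
spoofed = ("Mozilla/4.0 (compatible; MSIE; WINDOWS 5.1; SV1; NET 1.0; "
           "Media Center PC 4.0; .NET)")

print(sloppy_msie_tells(genuine))
print(sloppy_msie_tells(spoofed))
```

An empty list does not prove an agent is genuine. It only means the agent avoided these particular mistakes.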

  • Use cookies.  Weird user agents that use cookies set off a huge red flag.   You need a web analytics tool that allows you to cross the cookie dimension with user agents.
  • Pass cookies.  Even worse than bots that simply use cookies is when you see one cookie passed across multiple IP addresses visiting your site consecutively during a short-time period, then disappearing after crawling huge numbers of pages in that short duration… very odd indeed. 
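
If your tool lets you export hits with a cookie ID and IP address, the "one cookie passed across multiple IP addresses" pattern can be sketched as a sliding-window check. The 30-minute window and the three-IP threshold are arbitrary assumptions, and a real person can legitimately change IP address (think proxies), so treat matches as leads rather than proof.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def cookies_shared_across_ips(hits, window=timedelta(minutes=30), min_ips=3):
    """Return cookie IDs seen from at least `min_ips` distinct IPs within `window`.

    `hits` is an iterable of (timestamp, cookie_id, ip) tuples.
    """
    by_cookie = defaultdict(list)
    for timestamp, cookie_id, ip in hits:
        by_cookie[cookie_id].append((timestamp, ip))

    flagged = set()
    for cookie_id, events in by_cookie.items():
        events.sort()
        start = 0
        for end in range(len(events)):
            # Slide the left edge forward until the span fits inside `window`.
            while events[end][0] - events[start][0] > window:
                start += 1
            ips = {ip for _, ip in events[start:end + 1]}
            if len(ips) >= min_ips:
                flagged.add(cookie_id)
                break
    return flagged


# Made-up data using documentation-range IP addresses.
t0 = datetime(2008, 1, 1, 12, 0)
hits = [
    (t0, "c1", "192.0.2.1"),
    (t0 + timedelta(minutes=5), "c1", "192.0.2.2"),
    (t0 + timedelta(minutes=10), "c1", "192.0.2.3"),
    (t0, "c2", "192.0.2.4"),
    (t0 + timedelta(days=1), "c2", "192.0.2.5"),
    (t0 + timedelta(days=2), "c2", "192.0.2.6"),
]
shared = cookies_shared_across_ips(hits)
print(shared)
```

Here cookie c1 hops across three addresses in ten minutes and is flagged, while c2 changes address over several days and is not.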

What do you see on your sites?  How are you dealing with spiders and bots?

Check out Part 3 where I will provide recommendations for keeping your analytics up-to-date with Web 2.0 bots and spiders.  Thanks for visiting!

web-analytics-robots-spiders-crawlers.jpg