Web Analytics Blogs

Judah Phillips is an experienced web analytics practitioner and Internet expert currently working as a Director at a large multichannel media company. His blog is full of useful, unbiased, actionable insights learned from the real-world practice of a process-oriented, integrated approach to strategic Web Analytics for improving business performance.


Archive for 'Log File'

AVG LinkScanner Bot Executes JavaScript?!?

The  well-researched answer is “no.”  The AVG LinkScanner Bot appears to prefetch the js and the gif (and pretty much everything else on the page), which for certain tools and their tag configurations generates false page views and visits (and the derivatives thereof), just like it’s “legitimate” traffic. 

If your tag configuration is set up with noscript tags, AVG will fetch the content in the tags, including the gif, which means that:

  • The bot may be infesting the data of customers of web analytics vendors who configure page tag-based data collection in this way. 
  • The bot may be inflating the data in such products/services offered by various web analytics companies.
  • Customers may be paying for server calls generated by this bot.

Vendors, of course, could easily filter the user agent to protect their customers:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813) 

But I haven’t heard a peep from any SaaS vendors about excluding the user agent, filtering already collected data, or refunding customers the cost of robotically generated server calls (regardless of AVG). Have you?

Think about this: many SaaS page tag vendors don’t provide detailed visitor-level data and user agent reporting.  That means that their customers have no ability to investigate this bot or detect it by filtering their reported data by the true user agent.

I’ve been talking about JS executing bots screwing with web data for about a year now.  SEOMoz and the folks at SlickSurface confirmed it quite recently (quoting me no less in their fantastic analysis).  So they do exist…

Now let me tell you a little story.  Once upon a time I was at a conference called eMetrics when the CEO of a company came up to me and said “hey, I read your blog about bot detection, and I looked in my web metrics tool for traffic with high page view to visit ratios.”  Then he narrated a story to me about how he found a bunch of traffic that had page view to visit ratios of 5,000 to 1.  I said “do you use page tags?”  He said “that’s all my vendor provides, so yeah.”  And I said “you’ve found a javascript executing bot in your data.”  “I know,” he said.  “Well, did you call your vendor and let them know?” I said.  Now for the punch line:  he told me that the vendor (who shall remain nameless) told him “well, the traffic executed server calls.”  And they wouldn’t give him a refund!
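The ratio check the CEO described is easy to sketch if you can get visitor-level data out of your tool.  What follows is a minimal sketch, not any vendor’s method: the record shape (one `(key, visit_id)` pair per page view, where the key might be a user agent or an IP address), the function name, and the threshold of 100 are all assumptions I made up for illustration.

```python
from collections import defaultdict


def flag_suspect_keys(records, ratio_threshold=100.0):
    """Return {key: page views per visit} for keys above the threshold.

    records: iterable of (key, visit_id) pairs, one pair per page view.
    The key is whatever you segment by (user agent, IP, ...); this shape
    and the default threshold are illustrative assumptions.
    """
    page_views = defaultdict(int)
    visits = defaultdict(set)
    for key, visit_id in records:
        page_views[key] += 1
        visits[key].add(visit_id)

    suspects = {}
    for key, views in page_views.items():
        ratio = views / len(visits[key])
        if ratio > ratio_threshold:
            suspects[key] = ratio
    return suspects
```

Anything this flags still needs a human look; a high ratio is a hint that something robotic is going on, not proof.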

It’s worth mentioning that this bot definitely affects log file tools and packet sniffer tools.  Both must be configured to filter the AVG LinkScanner user agent.
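For a log file tool, the filter can be as simple as dropping every line that carries the user agent quoted above.  Here is a minimal sketch, assuming your logs are in the combined log format (where the user agent is the last double-quoted field on the line); the function names are hypothetical.  It matches only the exact string from this post, so other variants won’t be caught, and it stops working the day AVG drops the unique user agent.

```python
import re

# The user agent string quoted earlier in this post.
AVG_LINKSCANNER_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"

# Assumption: combined log format, user agent is the last quoted field.
_LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def user_agent_of(line):
    """Return the last double-quoted field of a log line, or None."""
    match = _LAST_QUOTED_FIELD.search(line)
    return match.group(1) if match else None


def drop_linkscanner(lines):
    """Yield only the log lines whose user agent is not the LinkScanner string."""
    for line in lines:
        if user_agent_of(line) != AVG_LINKSCANNER_UA:
            yield line
```

Run your raw log lines through `drop_linkscanner` before they reach the reporting step, and compare the before and after counts to see how much of your traffic the bot accounted for.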

Now here’s the rub for me.  I use AVG!!!  But I now find it increasingly difficult to support the company or continue using their products.  Why?  Because they are wearing a “bad hat” here:

  • First, they are fully aware of the effect of this bot on web analytics systems. They just don’t seem to care (yet).  UPDATE:  They have set up a Google Group to discuss this issue.  They must understand how companies of all types in all sectors use web analytics data to optimize their sites, set their marketing budgets, determine expected server load, and much more.  What do their Internet Marketers think? 
  • Second, the Link Scanner tool may have a short shelf life and may offer limited protection.  Malware creators will easily adjust. Check out what my friend Steve McInerney, a very smart security expert, said on the Web Analytics Association’s Yahoo Forum:
What strikes me about this particular solution by AVG is how
incredibly … stupid it is on several fronts.
1. Noticeably impacting a users bandwidth is, technically, a security
breach in the first place, aka Denial of Service Attack.
2. Some of us live in countries that have rather severe bandwidth
charges/limits and the like, whom shall I send my excess bandwidth
bill to?
…(this) method is fundamentally
flawed. ie malware ignores any first request and only infects on a
second request - alternate cloaking. Whatever. This type of “solution”
only provides weak protection for a strictly limited period of time.
…not just “no security” but bad
security. Because folk feel they are being protected when they are
not, and hence will take greater risks and hence inflict greater harm
on themselves. :-( 
Ignoring the balance of positive to harm that this problem inflicts on
the users who use this product.
  • Third, AVG just doesn’t seem to “get it” yet.  They are potentially messing with the ability to drive commerce via data driven decision making, e-commerce analytics, site optimization, and online media measurement!  To quote The Register: “chief of research Roger Thompson - who designed the AVG LinkScanner - indicated he may do away with that unique user agent. His chief concern is security, and he doesn’t want webmasters or malware writers gaming his scanner. “In order to detect the really tricky - and by association, the most important - malicious content, we need to look just like a browser driven by a human being,” he argues.”

WebMasterWorld has some good stuff to say here.  Read the Register’s first article here.  And check out the blog of the dude who broke the news first and the responses from AVG here and here.

Interesting stuff. So what do you all think? Have you seen evidence of this bot in user agent data from your page tag solutions that use the noscript tag for the image? 

Web Analytics Prognostications for 2008

What’s the future hold for Web Analytics in 2008?  Here are a few predictions:

  • Google Analytics releases a real API for getting (and perhaps setting) data.  As you know, I think GA is a fine tool for web analytics, but has severe limitations when you want control over your data or want to feed data into other systems.  Thus, I predict Google Analytics will go beyond the “Tracking API” and release a real API that allows you to at least get data out of the tool (if not set data as well).  Think of what Feedburner does with their REST-based Awareness API.  Wouldn’t that be nice to have with GA?!
  • HBX Analytics goes away.  I’d be more than a bit nervous if I were an HBX customer because Omniture is going to sunset HBX and migrate everyone to SiteCatalyst, then try to aggressively sell them the rest of the suite. 
  • Long live Visual Sciences.  VS is a powerful tool quite superior in some regards and very different than anything else Omniture offers.  It’s also real in-house software, not some blackbox.  VS’ extensible schema, flexibility in reporting, scalability, and performance are quite unparalleled in the industry.  I can’t envision Omniture killing it (unless they peel it apart in order to create Discover 3), like they will HBX. 
  • WebTrends rebrands.  I’m not sure if you agree, but imho WebTrends Marketing Lab was an attempt to rebrand WebTrends.  I expect that interim management will continue attempting to differentiate WebTrends by rebranding products and perhaps the entire company.
  • New and updated standards are released.  As a member of the IAB’s Measurement Council I can tell you that the IAB is getting ready to release the IAB Audience Measurement Reach Guidelines, which attempt to clarify and take a stand on various aspects of server/client-side analytics and audience measurement.  I also envision the WAA increasing the number of terms they define.  But standards are just dandy and quite meaningless unless they are adopted… thus…
  • Standards enforcement is attempted in order to propel adoption. Existing and forthcoming standards will be enforced in 2008.  Enforcement from the WAA will probably come in the form of a publication of a matrix or documentation citing which vendors adhere to the standards and to what degree, what’s missing, what’s different, and so on.  If decision-makers who control budgets believe in standards, this type of document will cause the question ”do you adhere?” to be asked.  If vendors start losing deals because the answer is “no, not at all,” vendors will adopt the standards. 
  • Internal data integration becomes more important for companies and problematic for ASPs.  When we talk about “integration” I often think people can be a bit shortsighted.  They want to integrate data from other third-party services and tools (like Salesforce.com and their ad server).  While there is certainly real value in integrating external data with web analytics data, significant value comes from integrating web analytics with internal data, such as data residing in internally-hosted CRM systems, finance, subscription, and lead generation databases. Most vendors have barely figured out how to deal with detail-level external data integration in 2007, even though many customers are demanding it.  I expect that in 2008, internal data integration will be more commonly demanded and even more problematic for ASPs. 
  • BI tools provide better support for and integration with Web Analytics tools.  The current allotment of “enterprise” level web analytics tools is inferior in capability to the business intelligence tools from companies like Business Objects or Cognos.  Expect these BI vendors to create features for dealing with web analytics data in 2008.  Either that, or these web analytics tools need to grow up and learn a few things from BI. 
  • Web Analytics as performance management.  KPI-based site optimization means using data to guide the modification of user experience to deliver on goals.   Since goals are measurable and can be plotted against performance, it’s totally logical to use web analytics as a performance management tool.  Expect to see that gestalt in tool usage come into vogue and be discussed more in 2008. 
  • Web Analytics as part of business process automation.  Having the marketing department fielding page tags with campaign codes may work for some (small) companies, but when you work for an enterprise with thousands of clients and simultaneous campaigns across multiple channels, endemic tagging and subsequent tool configuration becomes challenging.  As part of the web analytics process, I expect to see tools support some level of business process automation enabling web analytics.
  • Features for measuring the Mobile Web.  Right now, with a log file based tool, I can segment out Mobile traffic based on user agent.  If I want to use a page tag, I have to consider js limitations.  The mobile web is the next frontier, and I only know of one web analytics vendor who is doing a decent job measuring it right now, so I expect to see more features released this year for measuring Mobile.  
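On the mobile point above: segmenting mobile traffic out of log data by user agent can be as crude as a substring match.  Here is a minimal sketch, assuming you have already pulled the user agent strings out of your logs.  The token list is a small illustrative sample I picked, nowhere near exhaustive, and the function names are hypothetical.

```python
# Illustrative, non-exhaustive tokens; a real list needs ongoing upkeep.
MOBILE_TOKENS = ("iphone", "blackberry", "symbian", "windows ce", "opera mini")


def segment(user_agent):
    """Label a user agent string as 'mobile' or 'other' by substring match."""
    ua = user_agent.lower()
    return "mobile" if any(token in ua for token in MOBILE_TOKENS) else "other"


def count_segments(user_agents):
    """Count how many user agent strings fall into each segment."""
    counts = {"mobile": 0, "other": 0}
    for ua in user_agents:
        counts[segment(ua)] += 1
    return counts
```

This only works where the user agent actually reaches your data, which is the advantage of logs here; a page tag never fires on a device that doesn’t run the js.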

So that’s that.  Like a band named PIL once said in the song called Rise “I could be wrong, could be right!”  Am I off-base, misguided, or accurate?  Whether you agree or disagree, let me know… I’d love to hear your thoughts and your predictions for Web Analytics 2008…


A Note on Web Analytics and Ad Server Metrics…

In the wild world of online metrics, it’s a well-known fact that metrics from web analytics tools and ad servers never match. Variances can be substantial. 

What I mean is that, given no “refresh rate,” the total impressions for a single ad unit, which should be served on every page request, never matches the number of total page views on the site during the same period of time.  Sigh.

Reasons why identically-named metrics from these two tools (like page views and unique visitors) don’t add up are numerous:

  • Different data collection methods.  Ad servers use page tags.  Many web analytics tools use page tags, but it’s not uncommon in web analytics to use additional methods, such as logs or packet sniffers.  The methods have no shared standards for collection or storage of the same data (like visit-level data).  Thus you get apples to strawberries comparisons when attempting to correlate the dimensions from different systems.
  • Unique data models.  Ad servers aren’t focused on counting page views and the other dimensions of web analytics (visits, time, and so on).  Rather, ad servers focus on serving and counting impressions served (and loads of related derivative calculations, like CTR, CPC, and the coolness of view–thru).   Metrics are based on an ad request and an ad code.  Ads aren’t targeted to a page (though that’s possible), but rather to a “zone” or “keyword.” What that means is that the “page” dimension may not even exist in your ad server’s schema.  In other words, you aren’t looking at impressions measured on a page, but rather at the number of impressions served in a different conceptual construct.  That’s one of the reasons why people say metrics and ad-serving systems “don’t measure the same thing.” 
  • Untagged pages.  Just like analytics implementations suffer from challenges related to complete code coverage of page tags, so do ad serving implementations.  Companies need to determine how to centrally manage the deployment and orchestration of page tags *of all types* and verify all the pages have tags!  Don’t just expect it to work because tagging sounds so easy!  Suspect it won’t work, and determine what you’re going to do *before* you deploy.  Too late?  Time to reengineer. 
  • Non-JS executing clients.  Ad servers use page tags.  Not everyone and not all user agents execute javascript.  Everyone needs to realize that page tagging misses traffic as efficiently as it excludes it.  Period.  What percentage of the traffic you miss, you’ll never know… running and filtering your logs may provide an indication…
  • Ad blocking software.  Firefox’s Adblock Plus software is a big problem for sites that have a big techie audience, and it affects all sites.  Check your browser reporting and realize a good majority of those Mozilla users may be blocking your ads.  Look at the attitudinal data you have about visitors to gauge whether that’s a big issue for your online audience. 
  • Cookie issues.  Third-party cookies get blocked (often by privacy software).  Many ad servers still serve third party cookies, and many corporations have not tricked their DNS to accommodate this issue (ahem, CNAME).  We all know how cookie deletion affects unique visitor counts.
  • Refresh rates. One page rendered in the browser and many banner “refreshes” makes it really hard to correlate page views and impressions served.
  • No rich media installed, and no fallback.  If the client doesn’t have certain plug-ins, and you have no fallback, you miss ad revenue.  Meanwhile the tag executes and you count the traffic.
  • Robots, spiders, and crawlers, oh my.  The web is so robotic.  The problem is amazingly understated, especially by companies who want to bill you on page views.  Different data collection methods allow some level of bots to dirty the data.  Logs are harder to efficiently filter.  When the ad server uses tags, and the analytics tool uses logs, you may get some wildly different numbers. 
  • Mobile, Mobile, Mobile, Mobile.  Not all Internet-connected mobile devices will display ads, but web analytics tools will track the behavior of mobile visitors.
  • Latency.  Visitors who move through the site too quickly may not execute the tag, thus no data is sent back to the server(s).  Ever wonder why vendors tell you to put the tag “high” on the page?

The influence these issues have on your site varies depending on audience.  Investigate factors causing variance and deviation between metrics systems, and educate your audience on why the numbers differ.
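A first step in that investigation is simply putting a number on the gap, day by day.  Here is a minimal sketch, assuming you can export daily page views from your analytics tool and daily impressions for a single ad unit (with no refresh) from your ad server; the function names and the dict-per-report shape are my own assumptions.

```python
def percent_variance(page_views, impressions):
    """Signed gap between impressions and page views, as a % of page views."""
    if page_views == 0:
        raise ValueError("page_views must be non-zero")
    return (impressions - page_views) * 100.0 / page_views


def daily_variance(page_views_by_day, impressions_by_day):
    """Return {day: percent variance} for days present in both reports."""
    shared_days = sorted(set(page_views_by_day) & set(impressions_by_day))
    return {
        day: percent_variance(page_views_by_day[day], impressions_by_day[day])
        for day in shared_days
    }
```

A steady gap points at something structural, like untagged pages or ad blocking; a gap that spikes on particular days is more likely bots or a refresh rate you didn’t know about.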
