Archive for 'Page Tag'

AVG LinkScanner Bot Executes JavaScript?!?

The well-researched answer is “no.”  The AVG LinkScanner Bot appears to prefetch the js and the gif (and pretty much everything else on the page), which for certain tools and their tag configurations generates false page views and visits (and the derivatives thereof), just as if it were “legitimate” traffic.

If your tag configuration is set up with noscript tags, AVG will fetch the content in the tags, including the gif, which means that:

  • The bot may be infesting the data of customers of web analytics vendors who configure page tag-based data collection in this way.
  • The bot may be inflating the data in such products/services offered by various web analytics companies.
  • Customers may be paying for server calls generated by this bot.

Vendors, of course, could easily filter the user agent to protect their customers:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813) 

But I haven’t heard a peep from any SaaS vendors about excluding the user agent, filtering already collected data, or refunding customers the cost of robotically generated server calls (regardless of AVG). Have you?

Think about this: many SaaS page tag vendors don’t provide detailed visitor-level data and user agent reporting.  That means that their customers have no ability to investigate this bot or detect it by filtering their reported data by the true user agent.

I’ve been talking about JS executing bots screwing with web data for about a year now.  SEOMoz and the folks at SlickSurface confirmed it quite recently (quoting me no less in their fantastic analysis).  So they do exist…

Now let me tell you a little story.  Once upon a time I was at a conference called eMetrics when the CEO of a company came up to me and said “hey, I read your blog about bot detection, and I looked in my web metrics tool for traffic with high page view to visit ratios.”  Then he narrated a story to me about how he found a bunch of traffic that had page view to visit ratios of 5,000 to 1.  I said “do you use page tags?”  He said “that’s all my vendor provides, so yeah.”  And I said “you’ve found a javascript executing bot in your data.”  “I know,” he said.  “Well, did you call your vendor and let them know?” I said.  Now for the punch line:  he told me that the vendor (who shall remain nameless) told him “well, the traffic executed server calls.”  And they wouldn’t give him a refund!
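If you want to run the same check that CEO did, here is a minimal sketch of it. It assumes you can export per-source totals of page views and visits; the field names (source, pageViews, visits) and the threshold of 100 are my own placeholders, not anything a particular vendor provides.

```javascript
// Hypothetical per-source summary rows; the field names are assumptions.
function flagSuspectSources(rows, maxRatio) {
  return rows
    .filter((row) => row.visits > 0 && row.pageViews / row.visits > maxRatio)
    .map((row) => ({ source: row.source, ratio: row.pageViews / row.visits }));
}

const sampleRows = [
  { source: "agent-a", pageViews: 1200, visits: 400 },  // 3 page views per visit
  { source: "agent-b", pageViews: 50000, visits: 10 },  // 5,000 page views per visit
];

// Flag anything above an assumed threshold of 100 page views per visit.
const suspects = flagSuspectSources(sampleRows, 100);
console.log(suspects);
```

A high ratio is only a hint that something robotic is going on, so treat the flagged sources as a list to investigate rather than as proof.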

It’s worth mentioning that this bot definitely affects log file tools and packet sniffer tools.  Both must be configured to filter the AVG LinkScanner user agent.
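For those who do have access to raw hits, a filter along these lines is one way to do it. This is a rough sketch, assuming hit records that carry a userAgent field (a hypothetical shape, since every tool stores hits differently) and using the exact user agent string quoted above.

```javascript
// The user agent string quoted earlier in this post.
const AVG_LINKSCANNER_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)";

// Hypothetical hit records of the form { url, userAgent }.
function excludeLinkScanner(hits) {
  return hits.filter((hit) => hit.userAgent.trim() !== AVG_LINKSCANNER_UA);
}

const rawHits = [
  { url: "/home", userAgent: "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)" },
  { url: "/home", userAgent: "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.0" },
];

const cleanHits = excludeLinkScanner(rawHits);
console.log(cleanHits.length + " of " + rawHits.length + " hits kept");
```

An exact-match filter like this only works for as long as the bot keeps sending this particular string.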

Now here’s the rub for me.  I use AVG!!!  But I now find it increasingly difficult to support the company or continue using their products.  Why?  Because they are wearing a “bad hat” here:

  • First, they are fully aware of the effect of this bot on web analytics systems. They just don’t seem to care (yet).  UPDATE:  They have set up a Google Group to discuss this issue.  They must understand how companies of all types in all sectors use web analytics data to optimize their sites, set their marketing budgets, determine expected server load, and much more.  What do their Internet Marketers think?
  • Second, the Link Scanner tool may have a short shelf life and may offer limited protection.  Malware creators will easily adjust. Check out what my friend Steve McInerney, a very smart security expert, said on the Web Analytics Association’s Yahoo Forum:
What strikes me about this particular solution by AVG is how
incredibly ... stupid it is on several fronts.
1. Noticeably impacting a users bandwidth is, technically, a security
breach in the first place, aka Denial of Service Attack.
2. Some of us live in countries that have rather severe bandwidth
charges/limits and the like, whom shall I send my excess bandwidth
bill to?
...(this) method is fundamentally
flawed. ie malware ignores any first request and only infects on a
second request - alternate cloaking. Whatever. This type of "solution"
only provides weak protection for a strictly limited period of time.
...not just "no security" but bad
security. Because folk feel they are being protected when they are
not, and hence will take greater risks and hence inflict greater harm
on themselves. :-( 
Ignoring the balance of positive to harm that this problem inflicts on
the users who use this product.
  • Third, AVG just doesn’t seem to “get it” yet.  They are potentially messing with the ability to drive commerce via data-driven decision making, e-commerce analytics, site optimization, and online media measurement!  To quote The Register: “chief of research Roger Thompson - who designed the AVG LinkScanner - indicated he may do away with that unique user agent. His chief concern is security, and he doesn’t want webmasters or malware writers gaming his scanner. ‘In order to detect the really tricky - and by association, the most important - malicious content, we need to look just like a browser driven by a human being,’ he argues.”

WebMasterWorld has some good stuff to say about this here.  Read the Register’s first article here.  And check out the blog of the dude who broke the news first, and the responses from AVG here and here.

Interesting stuff. So what do you all think? Have you seen evidence of this bot in user agent data from your page tag solutions that use the noscript tag for the image? 

Tracking Rich Internet Applications with Google Analytics

About a year ago, I wrote a guest blog post over on Robbin Steif’s blog about using Google Analytics for tracking Javascript and Flash events.  This weekend Jeremy Geelan, SVP over at Sys-Con Media, asked if he could republish the work.  Of course I said “yes.”  Then I noticed that a lot has happened to GA in a year (and more to come, ahem, API’s!).  What I had written was now incomplete, so what you’ll find below is my attempt to sum up “event tracking” using ga.js and the Great Google’s Event Tracking Data Model.  Let me know how I did covering it, and if you think I should clarify or expand on anything.

Since we all know about page tags, let’s get down to business with “the Google” and how it tracks “the Rich Media.”  Google Analytics currently has two different javascript page tags:

  • urchin.js.  The legacy version of the Google Analytics page tag.
  • ga.js.  The current, rebranded version of the Google Analytics page tag.

How you track rich media depends on which page tag you are using.  I’ll discuss using urchin.js first, then ga.js.  I’ll also provide some information about Google’s Event Tracking function for capturing specific “events” within their event architecture.

Tracking Rich Media using Urchin.js

In the legacy version of Google Analytics, the smarties at Google created a little JavaScript function called urchinTracker() that enables event tracking.  Use the JavaScript function with an argument specifying a name for the event. For example, the function:

javascript:urchinTracker('/mysite/flashrichmedia/playbutton');

logs each occurrence of that Flash event as a page view of:

/mysite/flashrichmedia/playbutton

Some caveats:

  1. Always use a forward slash to begin the argument.
  2. Actual pages with these filenames do not need to exist.
  3. You can organize your events into any structure or hierarchy you want.
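To keep those virtual page names consistent, you could build them with a small helper. The sketch below is only an illustration: virtualPagePath is a name I made up, not part of urchin.js, and all it does is enforce the leading slash and join the levels of whatever hierarchy you choose.

```javascript
// Hypothetical helper (not part of urchin.js): joins name parts into a
// virtual page path that always starts with a single forward slash.
function virtualPagePath(...parts) {
  const cleaned = parts
    .map((part) => String(part).trim().replace(/^\/+|\/+$/g, ""))
    .filter((part) => part.length > 0);
  return "/" + cleaned.join("/");
}

const playPath = virtualPagePath("mysite", "flashrichmedia", "playbutton");
console.log(playPath); // "/mysite/flashrichmedia/playbutton"
```

You would then pass the resulting string to the tracking call instead of typing the path by hand in every Flash button.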

Important: Google says to place your tracking code “between the opening <body> tag and the JavaScript call” if your pages include a call to urchinTracker(), utmLinker(), utmSetTrans(), or utmLinkPost().

For example, if the page view is the major event and the “play” event a minor event, then your hierarchy would be Page View > Event, where the page contains an event, such that:

/mysite/ria_buttons/playbutton
/mysite/ria_buttons/pausebutton
/mysite/ria_buttons/playbutton
/mysite/ria_clips/clip

Some examples of the code (from Google Help):

on (release) {
// Track with no action
getURL("javascript:urchinTracker('/folder/file');");
}

This one above tracks when you click and release (although technically, it just notices the release) of a flash button (and records the file you specify as a page view).

on (release) {
// Track with action
getURL("javascript:urchinTracker('/folder/file');");
_root.gotoAndPlay(3);
myVar = "Flash Track Test";
}

The second one is the same, but by using a function, passing it a parameter, and identifying the instance you want to track, you can measure when your file was used in a specific scene in a little flash movie. So it is a more specific method for handling event tracking in Flash.

onClipEvent (enterFrame) {
getURL("javascript:urchinTracker('/folder/file');");
}

And the third one repeats the action throughout the movie so that each time the file is loaded, it gets tracked as an event. If you were to pass a unique file at the end of the movie, you could recognize it using this method (or the other methods) to know that the whole movie was watched (as long as your session doesn’t time out). Next, wait until Google updates your analytics, then check the Top Content report to see if it all worked. Now let’s discuss how to do the exact same thing using the new _trackPageview() function released with ga.js.

Tracking Rich Media using ga.js

In the current version of Google Analytics, the brainiacs at Google created a little JavaScript function called _trackPageview() that enables event tracking.  Use the JavaScript function with an argument specifying a name for the event. For example, the function:

javascript:pageTracker._trackPageview("/mysite/flashrichmedia/playbutton");

logs each occurrence of that Flash event as a page view of:

/mysite/flashrichmedia/playbutton

Some caveats:

  1. Always use a forward slash to begin the argument and use quotes around the argument.
  2. Actual pages with these filenames do not need to exist.
  3. You can organize your events into any structure or hierarchy you want.

You must put calls to _gat._getTracker and _initData above the call to _trackPageview.  For example, you would insert the following code:

<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-xxxxxx-x");
pageTracker._initData();
pageTracker._trackPageview();
</script>

Here are some examples of the ga.js code (from Google Help) that replicate what I described above using the most recent code:

on (release) {
// Track with no action
getURL("javascript:pageTracker._trackPageview('/folder/file.html');");
}

This one above tracks when you click and release (although technically, it just notices the release) of a flash button (and records the file you specify as a page view).

on (release) {
// Track with action
getURL("javascript:pageTracker._trackPageview('/folder/file.html');");
_root.gotoAndPlay(3);
myVar = "Flash Track Test";
}

The second one is the same, but by using a function, passing it a parameter, and identifying the instance you want to track, you can measure when your file was used in a specific scene in a little flash movie. So it is a more specific method for handling event tracking in Flash.

onClipEvent (enterFrame) {
getURL("javascript:pageTracker._trackPageview('/folder/file.html');");
}

And the third one repeats the action throughout the movie so that each time the file is loaded, it gets tracked as an event. If you were to pass a unique file at the end of the movie, you could recognize it using this method (or the other methods) to know that the whole movie was watched (as long as your session doesn’t time out).

Tracking Rich Media using Google Analytics Event Tracking

When Google released ga.js in the fourth quarter of 2007, it also released a data model for tracking events.  It provides more flexibility and ease of customization than the methods I described above.  The data model makes use of:

  • Objects. These are named instances of the eventTracker class and appear within the reporting interface.

var videoTracker = pageTracker._createEventTracker("Movies");

  • Actions. A string you pass to an event tracker class instance as a parameter.

videoTracker._trackEvent("Stop");

  • Labels. An optional parameter you can supply for a named object.

downloadTracker._trackEvent("Movies", "/mymovies/movie1.mpg");

  • Values. A numerical value assigned to a tracked object.

To set up event tracking you should:

1. Identify the events you want to track.
2. Create an event tracker instance for each set of events.
3. Call the _trackEvent() method on your page.
4. Enable “event tracking” in your profile.

To instantiate an event tracker object, you might do something like this:

var myEventObject = pageTracker._createEventTracker("Object Name");
myEventObject._trackEvent("Required Action Name", "Optional Label", optionalValue);

_createEventTracker() is order dependent and must be called after the main tracking code (ga.js) has been loaded.  Next you would call the _trackEvent() method in your source code either on every page that contains the event or as part of the tracking code for every page:

_trackEvent(action, optional_label, optional_value)
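To make the object / action / label / value idea concrete without loading ga.js, here is a tiny stand-in you can run anywhere. To be clear, createMockEventTracker and trackEvent are names I invented for this sketch; it does not reproduce Google's implementation, it just records what each call would carry.

```javascript
// Stand-in for illustration only; this is NOT the ga.js implementation.
// Each tracker is bound to one object name and records its events in `sink`.
function createMockEventTracker(objectName, sink) {
  return {
    trackEvent(action, label, value) {
      if (typeof action !== "string" || action.length === 0) {
        throw new Error("action is required");
      }
      const event = { object: objectName, action: action };
      if (label !== undefined) event.label = label; // optional
      if (value !== undefined) event.value = value; // optional
      sink.push(event);
      return event;
    },
  };
}

const recorded = [];
const mockVideoTracker = createMockEventTracker("Video", recorded);
mockVideoTracker.trackEvent("Play", "MyVideo");
mockVideoTracker.trackEvent("Stop", "MyVideo", 42);
console.log(recorded);
```

The point of the sketch is the shape of the data: one named object, a required action, and an optional label and value hanging off each event.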

If you wanted to track interaction with the Flash UI, such as the button on a Flash Video Player, you would create a videoTracker object with name “Video”:

var videoTracker = pageTracker._createEventTracker('Video');

Then, in your Flash code for the video player, you would call the videoTracker object and pass a value for the action and label for the event:

onRelease (button) {
   ExternalInterface ("javascript:videoTracker._trackEvent('Play', 'MyVideo');")
}

You could also use the ExternalInterface ActionScript function as an eval() function to parse FlashVars and attach them to every Flash UI element that needs a tracking action.  For example, the code below associates a Stop action for the Video object and retrieves the provided label and value from the FlashVars:

onRelease (button) {
   // The label is a string, so it needs its own quotes and a comma after the action
   ExternalInterface ("javascript:videoTracker._trackEvent('Stop', '" + label + "', " + value + ");")
}
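Gluing that call string together by hand is easy to get wrong (a missing comma or quote breaks the whole call), so here is a hedged sketch of a helper that builds it. buildTrackEventCall is a hypothetical name of mine, and using JSON.stringify for quoting is my own choice; it should be fine for simple labels, but test it with your own data.

```javascript
// Hypothetical helper: builds the "javascript:" call string for an event.
// JSON.stringify adds the quotes and escapes any quotes inside the text.
function buildTrackEventCall(trackerName, action, label, value) {
  const args = [JSON.stringify(String(action))];
  if (value !== undefined && label === undefined) {
    throw new Error("a value needs a label so the argument order stays intact");
  }
  if (label !== undefined) args.push(JSON.stringify(String(label)));
  if (value !== undefined) {
    const numeric = Number(value);
    if (!Number.isFinite(numeric)) throw new Error("value must be numeric");
    args.push(String(numeric));
  }
  return "javascript:" + trackerName + "._trackEvent(" + args.join(", ") + ");";
}

const stopCall = buildTrackEventCall("videoTracker", "Stop", "MyVideo", 30);
console.log(stopCall);
```

The same idea carries over to ActionScript: build the argument list first, then wrap it in the call, rather than concatenating quotes and commas inline.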

Adding event tracking code would generate event reports in the Content section of the Google Analytics Interface.  Pretty cool stuff, Google!


Video Analytics? Thoughts on Web Analytics for Internet Video…

Measuring video content with web analytics isn’t super difficult, but it has its nuances and challenges.  I’ve been thinking a bit about it lately, and have had some good conversations with a few people.  Folks I know are playing around with the likes of Joost, Vuze, Hulu, and TVUNetworks, as well as using BrightCove and Videoegg.  And, man, the popularity of BitTorrent and other swarm structure 4th gen P2P networks is larger than ever.

Simply speaking, video measurement can be divided into the following types:

  • Instream measurement.  Refers to measuring the video itself and the various abstract elements of the video experience, such as duration metrics (average viewing time) and interaction metrics (number of stops, plays, pauses, rewinds, fast forwards, and clicks on video content).
  • Outstream measurement.  Refers to measuring the content environment and user experience surrounding the video, such as the conversion metrics (percentage of visits downloading or viewing a video), behavioral metrics (referrers to the video page, players used), and content metrics (percentage videos per channel, percentage videos viewed by topic, percent videos viewed by file type). 

By categorizing the web video analytics into these two buckets, you are better able to answer meaningfully the following questions, which must be considered prior to any rollout:

  1. What are the business objectives for rolling out video features on the site?
  2. What format are the videos in?
  3. Are the videos downloads or streams?
  4. Am I using a content distribution network or streaming video network?
  5. Does my web analytics tool have the features necessary for video measurement? Or should I look for a third party, niche vendor?
  6. What data collection method should I use?
  7. Do I understand event models?
  8. What KPI’s are relevant and important based on my business goals?

To help you formulate answers to those questions, here’s some thinking:

  • Business objectives.  You, the analyst, must understand why your company is rolling out video.  In other words, what’s the goal and what strategy underpins the goal?  While video is “the rage” right now, simply rolling out video because “everyone is doing it” is no strategy (though doing so may yield a strategy ;).  A goal for video deployment could be “to generate leads,” thus you measure the scenario conversion rate for the funnel resulting in the lead generation and video download (outstream video analysis).  The objective might be “to keep visitors on the site longer,” then you would measure duration and interaction (instream video analysis).  As you all know, I firmly believe that it is the business goal that allows you to contextualize what you’re measuring so that you may build KPI’s.
  • Video format. Lots of different video file types exist: mpegs, qt, mov, swf, flv, avi, wma, ra, wmv, mp4 and more.  You’ll need to identify the video types you want to track so you can configure your web analytics tool to measure them.  Removing or adding filters or changing your tag’s javascript might be necessary.
  • Download or streams.  Videos can be downloaded (by right clicking) or spawned in a media player.  They can also exist embedded on the page or in another object for on-page streaming.  Thus, the way you instrument your pages will differ based on the way you present the video content. For example, if you are streaming videos, you may want to use javascript (or a vendor provided scripting language) to instrument your pages to track the video.  If you are just hosting downloads, you may simply want to run your logs to detect the number of times videos were downloaded.
  • Content distribution network or video network. If your video content is distributed by a CDN or a video network, you will have to apply page tags on all the pages rendered by combining your server’s content with the content served by the CDN. Some video networks provide basic reporting that you can extend with a client-side page tagging solution.  Alternatively, you can process the logs provided by a CDN. The challenge with CDN log file processing is that you will most likely not be able to merge the data with your log files for the same site, resulting in two “profiles” of analytics data related to one site: one profile with the site analytics data and one with the CDN analytics data.
  • Data collection method.  If you’ve read this far in my blogivation, you probably picked up that the data collection method you have at your disposal will constrain or enable the way you measure video.  Page tags will enable you to instrument your pages with onclick functions that pass values to the javascript and in turn to the analytics server.  Packet sniffers and log files enable you to measure downloads without modifying code.  If you need to modify your web analytics tool or tag configuration to track video filetypes, you can reprocess logs to access the data.  With tags, any data related to downloads or interactions with the video object prior to the config change will be lost.
  • Web analytics tool features. Many web analytics tools will allow you to track a video play or download in your page view reports, but only two tools support true event models: Unica NetInsight and Google Analytics.  At Emetrics San Fran in May 2007, Ian Houston and I gave a preso on “from page views to events.”  It looks like the vendors agreed, ay? ;)
  • Third party tools.  With the convergence of internet and television, we’re not many years away from having a single screen for viewing the internet, tv, and movies.  Many of us already connect our TV’s to our computers (Windows Media Server), use Slingbox, have had Tivo for years, use BitTorrent and perhaps even consume content from the sites I listed at the beginning of this post.  Companies like Visible Measures, Zango, VidMetrix, and Maven Networks already provide some flavor of a video measurement solution too.
  • Event models provide the conceptual and logical framework for measuring interactions that are subordinate to, equal to, or replacements for the page view.  Without getting into much detail, “events” are interactions such as the play, stop, and pause in a video stream, or the pan and zoom events in an online mapping experience.  In order to articulate the instream video experience, you should understand what an event model is and how it applies in Web Analytics 2.0.
  • KPI’s.  Based on business goals resulting from site strategy, you can build KPI’s related to instream and outstream video measurement.  For example:

Instream:

  • Percentage high duration streams
  • Percentage medium duration streams
  • Percentage low duration streams
  • Average viewing time per stream/overall across all streams
  • Percentage visits who complete stream
  • Percentage visits that stop stream within 10 seconds
  • Percentage visits when this stream was the last video viewed
  • Percentage visits when this stream was the first video viewed

Outstream:

  • Conversion rates by video filetype, video topic, channel, taxonomy node, referrer, geography, keyword, and so on
  • Average streams per visit
  • Percent visits/views from different channels (such as email, organic search, paid search, direct, offline)
  • Average time since last stream/video downloads
  • Average time between stream/video downloads
  • Repeat visit rate for visits involving a stream/video download
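A few of the instream KPI’s above can be sketched in a handful of lines. The record shape here (secondsViewed and totalSeconds per stream) is an assumption of mine, and for simplicity the sketch counts streams rather than visits, which is only the same thing if each visit plays one stream.

```javascript
// Hypothetical stream records: how long each stream was watched and how long it runs.
function instreamKpis(streams) {
  const count = streams.length;
  if (count === 0) return null;
  const totalViewed = streams.reduce((sum, s) => sum + s.secondsViewed, 0);
  const completed = streams.filter((s) => s.secondsViewed >= s.totalSeconds).length;
  const earlyStops = streams.filter((s) => s.secondsViewed < 10).length;
  return {
    averageViewingSeconds: totalViewed / count,
    percentCompleted: (completed * 100) / count,
    percentStoppedWithin10s: (earlyStops * 100) / count,
  };
}

const kpis = instreamKpis([
  { secondsViewed: 120, totalSeconds: 120 }, // watched to the end
  { secondsViewed: 5, totalSeconds: 120 },   // bailed within 10 seconds
  { secondsViewed: 60, totalSeconds: 120 },
  { secondsViewed: 95, totalSeconds: 95 },   // watched to the end
]);
console.log(kpis);
```

With that sample data, the average viewing time works out to 70 seconds, half the streams are completed, and a quarter are stopped within 10 seconds.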

The Internet has come a long way since I saw my first streaming video over 9 years ago (VIVO for those old timers out there).  The options for consuming video content over the web are growing every day (and not at all limited to YouTube, ay?).  I firmly believe video on the Internet is still in its infancy, and video measurement technologies both inside and outside of “web analytics” are quite embryonic.  What a huge space for growth!

As internet-originated video becomes even more pervasive for home entertainment and for business communication, companies will need to employ analysts who know how to create frameworks for measuring video content.  Do you?


A Note on Web Analytics and Ad Server Metrics…

In the wild world of online metrics, it’s a well known fact that metrics from web analytics tools and ad servers never match. Variances can be substantial.

What I mean is that, given no “refresh rate,” the total impressions for a single ad unit, which should be served on every page request, never matches the number of total page views on the site during the same period of time.  Sigh.

Reasons why identically-named metrics from these two tools (like page views and unique visitors) don’t add up are numerous:

  • Different data collection methods.  Ad servers use page tags.  Many web analytics tools use page tags, but it’s not uncommon in web analytics to use additional methods, such as logs or packet sniffers.  The methods have no shared standards for collection or storage of the same data (like visit-level data).  Thus you get apples to strawberries comparisons when attempting to correlate the dimensions from different systems.
  • Unique data models.  Ad servers aren’t focused on counting page views and the other dimensions of web analytics (visits, time, and so on).  Rather, ad servers focus on serving and counting impressions served (and loads of related derivative calculations, like CTR, CPC, and the coolness of view–thru).  Metrics are based on an ad request and an ad code.  Ads aren’t targeted to a page (though that’s possible), but rather to a “zone” or “keyword.”  What that means is that the “page” dimension may not even exist in your ad server’s schema.  In other words, you aren’t looking at impressions measured on a page, but rather at the number of impressions served in a different conceptual construct.  That’s one of the reasons why people say metrics and ad-serving systems “don’t measure the same thing.”
  • Untagged pages.  Just like analytics implementations suffer from challenges related to complete code coverage of page tags, so do ad serving implementations.  Companies need to determine how to centrally manage the deployment and orchestration of page tags *of all types* and verify all the pages have tags!  Don’t just expect it to work because tagging sounds so easy!  Suspect it won’t work, and determine what you’re going to do *before* you deploy.  Too late?  Time to reengineer. 
  • Non-JS executing clients.  Ad servers use page tags.  Not everyone and not all user agents execute javascript.  Everyone needs to realize that page tagging misses traffic as efficiently as it excludes it.  Period.  What percentage of the traffic you miss, you’ll never know… running and filtering your logs may provide an indication…
  • Ad blocking software.  Firefox’s Adblock Plus software is a big problem for sites that have a big techie audience, and it affects all sites.  Check your browser reporting and realize a good majority of those Mozilla users may be blocking your ads.  Look at the attitudinal data you have about visitors to gauge whether that’s a big issue for your online audience.
  • Cookie issues.  Third-party cookies get blocked (often by privacy software).  Many ad servers still serve third party cookies, and many corporations have not tricked their DNS to accommodate this issue (ahem, CNAME).  We all know how cookie deletion affects unique visitor counts.
  • Refresh rates. One page rendered in the browser and many banner “refreshes” makes it really hard to correlate page views and impressions served.
  • No rich media installed, and no fallback.  If the client doesn’t have certain plug-ins, and you have no fallback, you miss ad revenue.  Meanwhile the tag executes and you count the traffic.
  • Robots, spiders, and crawlers, oh my.  The web is so robotic.  The problem is amazingly understated, especially by companies who want to bill you on page views.  Different data collection methods allow some level of bots to dirty the data.  Logs are harder to efficiently filter.  When the ad server uses tags, and the analytics tool uses logs, you may get some wildly different numbers. 
  • Mobile, Mobile, Mobile, Mobile.  Not all Internet-connected mobile devices will display ads, but web analytics tools will track the behavior of mobile visitors.
  • Latency.  Visitors who move through the site too quickly may not execute the tag, thus no data is sent back to the server(s).  Ever wonder why vendors tell you to put the tag “high” on the page?

The influence these issues have on your site varies depending on audience.  Investigate factors causing variance and deviation between metrics systems, and educate your audience on why the numbers differ.
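When you start that investigation, it helps to put a number on the gap. Here is a simple sketch, assuming one ad unit that should be served once per page view with no refreshes; the function name and the sample counts are made up for illustration.

```javascript
// Percent difference of ad impressions relative to page views.
// Negative means the ad server counted fewer impressions than the
// analytics tool counted page views.
function percentVariance(pageViews, impressions) {
  if (pageViews === 0) throw new Error("pageViews must be non-zero");
  return ((impressions - pageViews) * 100) / pageViews;
}

// Made-up counts for the same ad unit and the same period.
const variance = percentVariance(200000, 170000);
console.log(variance + "%"); // -15%
```

Tracking that percentage over time, per ad unit, tells you whether the gap is stable (probably structural, like the reasons above) or suddenly changing (probably something broke).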


Part 2: Your Web Analytics Data Quality May Stink and Here’s Why!

In Part 1, I began a long list of reasons why your web analytics data quality may stink.  I’m continuing the list below (make sure you read Part 1 for context and to view the entire list).

  • Storing only visit level data.  Many tools don’t have schemas that store raw data at the visitor level.  Instead they provide access to only visit level data.  For example, you may not be able to see all the page views during a single visit per ip address or cookied visitor.  Assess the impact of the vendor’s schema on your goals.  Companies that use analytics data to feed other systems or that want to use visitor attributes for content targeting, segmentation, optimization, or analysis may not be well-served by some vendor schemas.
  • Little to no decodes or lookups.  If you use numeric codes and non-human readable naming conventions in your data, they can pass through to your reporting and prevent your colleagues from understanding the reporting.  Strange codes look like hieroglyphics!  Decoding and looking up data can eliminate the problem of non-readability and strange numerical names in your reporting… While some would say this is a reporting issue, not a data issue,  I chose to include it because it’s at the surface… it’s the data your customers see.  Not all tools decode or lookup.  Some tools allow rewriting of data in the database.
  • Failure of key services supporting the application.  If you are dependent on page tags, synchronization software, web servers, databases, or any of the wondrous technology that makes it all work, failures are a real bummer.  Make sure you have monitoring and recovery processes in place so you don’t miss data!  When page tag collection fails (perhaps the page tag server went down ay?), the data is gone forever.  If the web server fails, then no logs are written, but no pages are served either - so is traffic missed?  But if the processes supporting log file analysis fail (i.e. data synch), watch out! 
  • Inadequate or incorrect implementation.  If you can’t cross dimensions (like finding out what keywords referred traffic to a page), filter all of your data (for example, filtering pages to see only those viewed by the iPhone), or easily create new metrics, or if the numbers aren’t adding up, you may not have adequately or correctly implemented your software or communicated your requirements to your vendor’s professional services team.
  • Limited, hard-to-extend data model. Powerful, actionable insights from web analytics are enabled by extending a data model to incorporate business specific dimensions.  For example, if every page has a category and an author, you may want to see a list of all the page views in that category or ranking of pages by most popular author.  To do that you may need to join data at the database level or take advantage of variables you pass in a page tag.  Various tools have different limits on if, how, and to what extent you can extend the data model.
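The decode/lookup idea from the list above is easy to picture in code. This is a minimal sketch with a made-up lookup table and made-up report rows; real tools do this inside the database or the reporting layer, so take it as an illustration of the concept only.

```javascript
// Hypothetical lookup table mapping internal codes to readable names.
const campaignNames = new Map([
  ["c101", "Spring Email"],
  ["c102", "Paid Search - Brand"],
]);

// Replace each code with its readable name; mark codes that have no
// entry so the gaps in the lookup table stay visible in the report.
function decodeRows(rows, lookup) {
  return rows.map((row) => ({
    ...row,
    campaign: lookup.has(row.campaign)
      ? lookup.get(row.campaign)
      : row.campaign + " (undecoded)",
  }));
}

const decoded = decodeRows(
  [
    { campaign: "c101", visits: 40 },
    { campaign: "c999", visits: 3 },
  ],
  campaignNames
);
console.log(decoded);
```

Flagging the undecoded rows instead of dropping them means nobody mistakes a missing lookup entry for missing traffic.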

So what do you do when you know your data quality is less than stellar?  Here’s some guidance:

  • Don’t worry, be happy. :-) Just by collecting the data you are collecting, you are doing better than a great majority of companies that do business on the Internet.  By asking questions about data and investigating the issues, you have a leg up on your competition.  Work on optimizing the data, expose flaws in site design or architecture that impede data collection, work with your vendor and seek help in the web analytics community if you run into real problems.  The Web Analytics Association’s Forum on Yahoo is a useful place for posting questions.  But whatever you do, stay positive and focused on solving your problems and making your web analytics practice more optimized.  Don’t get frustrated.
  • Recognize the limitations in the data and do not go gently into the night.  Ask the hard questions about sampling, schemas, data retention, processing, querying and reporting to understand where the holes and noise could be in your data.  Demand answers from your vendors and quick response times to your questions about data quality.  If your vendor is frustrating you by not being responsive, talk to the boss and the vendor’s bosses, escalate, escalate, escalate until you get resolution.
  • Understand the underlying elements of data collection and what can go wrong.  Learn about sessionization and why different tools and data collection methods have limitations.  Explore the more technical components of the backend, like the database and your web analytics schema - all your data is in one (or more)! Talk to your engineers.  Have them explain the technology in terms you understand.
  • Evaluate your tools.  Some tools are just better suited for particular business problems than other tools.  Log file tools enable you to constantly change assumptions and reprocess data.  Page tags provide a standard data collection and transport mechanism.
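Since sessionization came up in the list above, here is a bare-bones sketch of what it means. It assumes hit records with a visitorId and a millisecond timestamp (my own hypothetical shape) and a 30-minute inactivity timeout, which is a common convention but still an assumption; your tool may use a different rule.

```javascript
const SESSION_TIMEOUT_MS = 30 * 60 * 1000; // assumed 30-minute inactivity timeout

// Hypothetical hit records: { visitorId, timestamp } with timestamp in milliseconds.
// A new visit starts when a visitor is seen for the first time, or after
// a gap longer than the timeout since that visitor's previous hit.
function countVisits(hits, timeoutMs = SESSION_TIMEOUT_MS) {
  const sorted = [...hits].sort((a, b) => a.timestamp - b.timestamp);
  const lastSeen = new Map();
  let visits = 0;
  for (const hit of sorted) {
    const previous = lastSeen.get(hit.visitorId);
    if (previous === undefined || hit.timestamp - previous > timeoutMs) {
      visits += 1;
    }
    lastSeen.set(hit.visitorId, hit.timestamp);
  }
  return visits;
}

const minute = 60 * 1000;
const sampleHits = [
  { visitorId: "a", timestamp: 0 },
  { visitorId: "a", timestamp: 10 * minute },
  { visitorId: "a", timestamp: 50 * minute }, // 40-minute gap: new visit
  { visitorId: "b", timestamp: 5 * minute },
];
const visitCount = countVisits(sampleHits);
console.log(visitCount); // 3
```

Run the same hits with a 60-minute timeout and you get 2 visits instead of 3, which is exactly why two tools looking at the same traffic can report different visit counts.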

With hard work on your part, you can make your web analytics data smell like roses!  I know you can! :)


Part 1: Your Web Analytics Data Quality May Stink and Here’s Why!

Web analytics data quality and the accuracy of “the numbers” are always questioned.  With so many sources of data from different systems and vendors - both free and paid - you must be able to reconcile deviations in data from different sources, and speak intelligently about data quality and accuracy to promote adoption of web analytics at your organization. 

There are so so so many reasons why web analytics data quality can stink.   I thought it would be fun to list some of the major reasons (over two posts):

  • Spiders and Bots.  If you haven’t read my series on spiders and bots, check it out.  Non-human traffic can inflate your metrics and diminish the predictive power of your analysis.  Regularly look for bots and update your filtering!
  • Untagged pages.  If the bulk of your page views are being generated by a single page, and you’ve failed to tag that page, you’ve lost data.  That’s always a bummer.  Tagging must be endemic to the web development process. Ask your team how they know that every page is tagged.  Really how do they know?  Verify and reverify.  Use a tool like WASP.   Obviously this isn’t a problem with log files.  
  • JS turned off.  If the browser doesn’t execute javascript, the page tag won’t fire and traffic will be missed.  If you’ve missed it, you don’t know it occurred, so you are constantly in the dark unless you compare your page tagged data to log file data, which isn’t easy at all!  How you account for missing this traffic, whether it is immaterial or not, is a business decision. Again, not a problem with log files.
  • Latency.  If the page tag doesn’t fire because it failed to load, the traffic is missed.  Vendors provide recommendations about the best place for a page tag.  Your development team may not believe them, or the “global include” that may or may not exist may or may not insert the code in the suggested spot.  It’s a good idea to listen to vendors when they provide configuration recommendations.  Once again, how you account for this data discrepancy is a business decision. Not a problem with log files.
  • Differences in sessionization.  Data divergence gets hairy when you are running two tools on the same site, or are replacing one tool with another.  That seems to be very common these days with Google Analytics.  Eric Enge over at Stone Temple Consulting, in cooperation with my pal Jim Sterne, has some data from the 2007 Web Analytics Shootout that you should read to help you understand how different vendors sessionize.
  • Sampling, sampling, sampling.  My friend Avinash Kaushik does a good job covering issues around sampling in web analytics.  I recommend reading his post!  I’ll add that statistical methods applied to web analytics data are completely valid; however, sampling at the site, page, or database level opens the possibility that you miss key data.  Sites looking for the “long tail” of visitors and using data sampling or other data trimming methods may not find it. 
  • No referrer passed.  Referrer analysis tells you what site people came from before they visited your site.  Bookmarks, typed pages, email campaigns, and bots don’t pass referrers.  Sometimes referrer information just isn’t passed by the browser.  If you overuse redirects on your sites, you may lose referrer data.  Not having this important information impacts SEO/SEM, linking campaigns, and affiliate partnering.  
  • Cookie configuration and deletion.  We all know about cookie deletion thanks to Eric’s research while at Jupiter.  In addition, if your server isn’t set up right, you may not be setting the cookie on the first request. Talk to the web server guys and gals at your company to make sure your cookie handling is optimized for web analytics.
  • Proxies.  Proxies can filter out your referrers and make it look like everyone’s coming from the same ip address.  Most tools enable cookie-based visitor identification to work around the proxy issue, but if cookies aren’t set up right on your web server or configured correctly in your web analytics tool, assessing uniqueness when the bulk of traffic comes from the same IP will be problematic.     
  • Time spent metrics have severe limitations.  Single-page visits and the time spent on the last page in the visit aren’t measured in “total time online.”  Time-based metrics are schoolboy metrics, so don’t overemphasize them as stand-alone indicators of “engagement.”  Assess the impact of time spent in the context of goals.  Use time as a variable for segmentation and as input into a larger engagement metric. Use them in context, especially if you are selling advertising (and who isn’t)?  My friend Jim Novo has one of the best takes on time spent metrics I’ve read.
  • Failure to maintain exclusions and filters.  Data can be polluted if you are introducing new filetypes  and not excluding them from your top content reporting or if you are not maintaining your bot filters.  New bots are crawling your site right now.  What are those web developers doing now?   Do you know?  Are you checking?  What’s the process for doing so?
  • Little to no historic data.  If you’ve just tagged your pages, you may have no basis for historical comparisons for quite some time (like a year!).  Data has realized value from understanding current behavior, and potential value from predicting future behavior.  And you need data to do it.    
  • You’re not storing all your dynamic URLs.  Your URLs may have many parameters in the query string (i.e. the name/value pairs after the “?”).  Some tools cut out this data and don’t make it available for querying or reporting in their applications.  If your site is database-driven and dynamic, not having access to every URL request will limit your ability to do ad-hoc analysis, filtering, and segmentation.
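
To make the sessionization and time spent points in the list above a bit more concrete, here is a minimal Python sketch.  It assumes visits are split on an inactivity timeout (30 minutes is a common default, but it is not what every tool uses), and the function names and sample timestamps are made up for illustration.  It is not any vendor’s actual algorithm.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, timeout_minutes=30):
    """Group one visitor's page view timestamps into visits by inactivity timeout."""
    timeout = timedelta(minutes=timeout_minutes)
    visits = []
    for ts in sorted(timestamps):
        # Same visit if the gap since the previous page view is within the timeout.
        if visits and ts - visits[-1][-1] <= timeout:
            visits[-1].append(ts)
        else:
            visits.append([ts])
    return visits


def time_spent(visit):
    """Measured time is last page view minus first; time on the final page is invisible."""
    return visit[-1] - visit[0]


# Hypothetical page views from a single visitor.
views = [
    datetime(2008, 7, 1, 9, 0),
    datetime(2008, 7, 1, 9, 10),
    datetime(2008, 7, 1, 9, 30),
    datetime(2008, 7, 1, 11, 0),
]

visits_30 = sessionize(views, timeout_minutes=30)
visits_15 = sessionize(views, timeout_minutes=15)

print(len(visits_30), "visits with a 30-minute timeout")
print(len(visits_15), "visits with a 15-minute timeout")
print([str(time_spent(v)) for v in visits_30])
```

The same four page views become a different number of visits depending on the timeout, which is one reason two tools on the same site rarely agree.  The single-page visit at the end reports zero time spent, no matter how long the visitor actually stayed.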

Let’s continue this long list in Part 2!

dataquality_renamed1.jpg


Web Analytics and Data Collection: Log File Analysis

Log file analysis in web analytics is misunderstood!  With so many options for data collection in web analytics, the log file has been somewhat disdained for many years.  In fact, I’d venture to guess that the majority of web analytics practitioners working today don’t “do log file analysis.” That’s unfortunate.  One can learn a lot about web analytics from examining log files - for example, “total time online” (with all its inherent issues :).

Why is server-side data collection (log file analysis) less popular than client-side page tagging?  It’s not about bots, accuracy, or ease of deployment… that’s marketing spin, somewhat true and false depending on context.  It may be because most web analytics practitioners currently use a hosted model. 

Page tagging enables ASP models  for web analytics.  It would be very difficult (i.e. resource intensive and costly) for a web analytics ASP to maintain a competitive service level if the ASP relied on its clients and their IT teams to maintain server side data collection- maintaining, transporting, and synching log files.   

Imagine the ASP model with log files!  It’s nightmarish and makes little sense:  hundreds and thousands of customers with just as many (or more) different servers behind firewalls generating many different log file formats.  Then, at various intervals all customers sending uniquely formatted logs for the ASP to parse, index, and make available for querying and retrieval in close to real time.  Yikes!! 

Page tagging creates a standard data structure for the ASP and a manageable data transport mechanism while minimizing complexity and reducing infrastructure cost in a way that log files just can’t.  Page tags enable ASPs to meet rigorous service level agreements and contractual obligations for data processing in a way they probably couldn’t with log files.  

So who does log file analysis for web analytics?  The answer is companies that have:

  • No desire to give up control of a web analytics infrastructure to young companies.  Web data is valuable and log files enable you to keep it all for however long and at whatever cost your business justifies.  You can process your data using whatever tool you want, and reprocess at your discretion.
  • Existing business dimensions and structured data already existing in log files. Simple web analytics concepts like page views, visits, and “uniques” become more meaningful when sliced by relevant business objects from your company’s existing log data.  For example, if you’ve passed a unique parameter in your query strings for years, you can program your log file tool to recognize the parameter.   Then reprocess your existing data to make use of it in your reporting. 
  • Deep data mining experience.  Data miners want data to mine, not static GUI’s or “my data” delivered in summary extracts at an incremental cost.  Log files can be parsed and indexed into the database that you, the company or the client, actually owns at the budget you actually set.
  • The know-how to employ hybrid data collection.  Log files are complemented by page tags for instrumenting pages to track events and Web 2.0.
  • Complex integration requirements.  All the data exists on machines in your network.  Log files can be extended to incorporate new business dimensions.
  • A need to understand how spiders and bots behave on their site and the performance of their server(s).  Log files capture everything that happens on the web site.  The “kitchen_sink.html” is in there.  It’s up to you to customize the tool to meet the quality level you demand for filtered data.  As Mat and Steve mentioned in the comments, logs also enable you to view server errors, useful for site QA and debugging.
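
As a rough illustration of the query string point in the list above, here is a small Python sketch that slices page views by a parameter already sitting in logged URLs.  The parameter name “dept” and the sample requests are hypothetical; your logs will have their own parameters.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs


def views_by_parameter(urls, name):
    """Count page views by the value of one query string parameter."""
    counts = Counter()
    for url in urls:
        params = parse_qs(urlsplit(url).query)
        # Requests without the parameter are kept, not silently dropped.
        value = params.get(name, ["(not set)"])[0]
        counts[value] += 1
    return counts


# Hypothetical request URLs as they might appear in a log file.
requests = [
    "/catalog/item.html?dept=shoes&sku=1001",
    "/catalog/item.html?dept=shoes&sku=1002",
    "/catalog/item.html?dept=hats&sku=2001",
    "/about.html",
]

by_dept = views_by_parameter(requests, "dept")
for dept, count in sorted(by_dept.items()):
    print(dept, count)
```

Because the raw requests are still in the logs, the same function can be rerun over years of history the day someone decides “dept” matters.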

I’m sure there are other reasons too….  I’ve heard people tell me that:

  • It’s hard to move log files.  That depends on how many log files you have, what technology you are using to move them, and how often.
  • Log files are difficult to parse. Not really, especially if they are standardized.  You just need a flexible tool.  If things get non-standard, you need a tool that enables you to define your own formatting.
  • Log files take up a lot of space.  Yes, they do. How much space do you have?  What are your company’s existing standards for parsing log files?  How long will you store them?  You may find out you have plenty of “space.”  You will never have enough time though. ;)
  • Log files aren’t as accurate as page tags.  Huh? That all depends on how well you filter and exclude pages.  Accurate compared to what? To unvalidated data existing on servers you don’t see in some data center thousands of kilometers/miles away?  Logs capture visits from browsers that don’t execute javascript and thus will not appear in the metrics from tag-based systems…
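
On the “difficult to parse” point, here is a minimal Python sketch of parsing one line in the widely used NCSA combined log format with a regular expression.  The sample line is invented (the IP, URL, and referrer are placeholders), and a real tool would need to handle malformed lines and custom fields more carefully than this does.

```python
import re

# host ident user [time] "request" status size "referrer" "user agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for a combined-format line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


# Invented sample line for illustration only.
sample = (
    '192.0.2.10 - - [01/Jul/2008:09:00:00 -0400] '
    '"GET /blog/index.html?campaign=summer HTTP/1.1" 200 5120 '
    '"http://www.example.com/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"'
)

hit = parse_line(sample)
print(hit["host"], hit["status"], hit["request"])
print(hit["agent"])
```

Lines that do not fit the pattern come back as None, which is a cheap way to notice when someone has changed the log format on you.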

So, you might be thinking I am a big fan of log files.  I am a web analytics realist.  Challenges with log files include:

  • Standardizing on a common format.  If you have a web server, you are at risk for someone doing something that creates a custom log file format.  If you are the new web analyst coming into an existing company, you may find custom fields in the logs.  Develop and champion a company standard.  It will make your life as a web analyst somewhat easier.
  • Creating a single “core” configuration that can be extended to deal with one-off formats.  If you live in a world of custom log files, you can abstract a set of core fields, describe them, then extend the description when you deal with custom craziness.
  • Synchronizing/transporting log files.  The more servers you have, the more chances you have for failures in synchronization.  Application monitoring and recovery procedures are critical to preventing downtime.  Develop them.
  • Reprocessing log files.  Depending on your system, reprocessing existing log file data can have a myriad of effects - from data not being available, to data being partially available, to errors, to long waiting periods before the data is current and ready to query.
  • Receiving log files that are missing key fields or with key fields that have changed.   If you depend on systems, companies, and people that you do not control, fields can be stripped and formats changed in the blink of an eye.  The web team needs to detect and manage these events.
  • Monitoring the operational workflow.  Someone on the web analytics team needs to manage the system and maintain the overall availability and integrity of log file processing and any issues that could result in the data not being synched or processed correctly. 
  • Managing change.  Since all your “hits” are captured in the log file, you’ll need to tell your web analytics tool how to correctly process the logs and remove data you don’t want to report.  You’ll need to identify and maintain lists of filetype exclusions and the user agents/IPs of robotic traffic.
  • Finding staff that “gets it.” The rule is that success in web analytics results from having competent people who have experience understanding the risks and rewards of web analytics data collection and executing on a web analytics strategy that generates value. 
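
The filetype exclusions and robot lists mentioned under managing change can be sketched in a few lines of Python.  The extension set and user agent fragments below are illustrative assumptions, not a complete or recommended list; the last fragment is the AVG LinkScanner user agent string I posted about earlier.

```python
import os
from urllib.parse import urlsplit

# Illustrative lists only; every site needs to maintain its own.
EXCLUDED_EXTENSIONS = {".gif", ".jpg", ".png", ".css", ".js", ".ico"}
BOT_AGENT_FRAGMENTS = (
    "googlebot",
    "slurp",
    "msie 6.0; windows nt 5.1;1813",
)


def is_reportable(path, user_agent):
    """Return True if a hit should count as a human page view under these lists."""
    # Look at the extension of the path only, ignoring any query string.
    extension = os.path.splitext(urlsplit(path).path)[1].lower()
    if extension in EXCLUDED_EXTENSIONS:
        return False
    agent = user_agent.lower()
    return not any(fragment in agent for fragment in BOT_AGENT_FRAGMENTS)


human = "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.0"
scanner = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;1813)"

print(is_reportable("/blog/index.html", human))
print(is_reportable("/images/logo.gif?v=2", human))
print(is_reportable("/blog/index.html", scanner))
```

The code is the easy part.  The hard part is the process that keeps those two lists current as new filetypes and new bots show up.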

As a web analyst you need to use the tools at your disposal to build your web analytics practice.  Log files are part of your toolkit.  Be wary of people who try to sway you one way or the other when it comes to web analytics data collection.  The method you ultimately use should be based on the data you want to collect in the context of the company’s goals, budget, and expertise.

log_files.gif

Web Analytics and Data Collection: The Page Tag

Many methods exist for collecting different types of web analytics data - some much more accurate and useful than others: page tags, log file analyzers, packet sniffers, audience panels, and toolbars.  At this point in time, the page tag certainly seems to be the preferred method adopted by the web analytics industry for data collection.  It’s also core to other technologies, like multivariate testing. 

Various sources will tell you a multitude of “things” over time about page tags, depending on whether they think you are a noob or advanced practitioner.  There’s lots of useful information from many sources about page tags.  Still I find a lot of the average discussion about the utility of page tags to be somewhat true and false, depending on context.  For example, if you’ve never tagged pages, but you have processed log files, does it make sense to entirely throw away legacy processes?  Or if you’ve mastered change management across all pages on your site but you’ve never parsed log files, do you want to have to deal with synching, moving, parsing, filtering log files?   These questions yield fantastic answers for fruitful discussion as you plan or extend your web analytics implementation.

I’ve heard the following “things” about page tags:

  • It’s easy to page tag your sites.  That really depends on the technologies used to build the site.  A static site with few simple pages is a different animal than a beast of a database-driven site with millions of complicated pages glued together using different technologies. 
  • Page tags are more reliable.  Reliable in what way, I ask?  Page tags in a hosted environment are processed by a data center many miles away.  I have no idea of the availability of the servers to support the load they are receiving from all the customers, nor do I have any insight into the raw data collected by the page tags, errors, and so on. 
  • Robots and spiders are always removed from the metrics.  Page tags do a better job than log file analyzers out-of-the-gate and especially if you don’t maintain your filtering, but I don’t really think page tags are infallible at all.  Bots and spiders get through.
  • You don’t have to deal with IT when page tagging.  It’s just a global include right?  You just give IT the code, and they include it.  For the most part that’s true, but IT has processes and procedures and your tagging needs to be QA’ed and perhaps even put through a “versioning” process.   Regardless, you’ll need to prep the corporation and your friends in IT for the big page tagging effort.
  • Change management is easy with page tags.  If you only use the vendor’s out-of-the-box page tag and have a global include, it’s somewhat easy to manage change.  But in my experience, a web analytics implementation requires using the tag in different ways across the site.  You’ll need to add attributes to the tag or script values into the tag.   In order to manage change, you’ll need to follow corporate processes.  In the case of web analytics page tagging, you may need to create those processes before you manage change… and process creation in corporate environments is sometimes not easy.  You may even need to follow a process for doing so.  :)

The biggest challenges I’ve had with page tagging include:

  • Ensuring complete code coverage across all pages.  The number of pages in your site, the way the site is built, and the technology used to build your site all need to be carefully assessed prior to beginning the page tagging effort.  The larger the portfolio of sites, the more difficult it will be to tag all of your pages.  Other challenges include the number of domains and subdomains, the technology standards used across your portfolio of sites, and whether, in more controlled environments, the web analytics team actually has the positional power to influence the change management process.
  • Determining a method for centralized tag management.  If you are running an internal solution, you’ll need page tag server(s) to collect the page tag data.  If you’re depending on your vendor’s servers, you may need to modify or update tags for various reasons (new campaigns and so on).  You may find new sites that need to be tagged.  New stakeholders may want special tags added to the site to support other purposes (such as multivariate testing).   How are you going to centrally manage all that?   
  • Orchestrating changes to tags across different site sections.  You may want to pass a new value in the tag or change something in the script on page X, but not page Y, or subdomain A, but not subdomain B.  A new campaign may need to be enabled.  You may have to update your tag to take advantage of new vendor functionality.  You’ll need a technology solution and process for centrally orchestrating and controlling tag changes.  Wielding a mighty CMS helps.
  • Reconciling tag metrics with log metrics and determining correct filtering.  It’s great fun to collect tag data then compare it to your log files to determine if all pages are being counted and how effectively you are filtering bots.  While looking at different data sources is a time consuming activity, certain business cases may demand it. 
  • Integrating tagging with a Content Management System.  Major sites use expensive CMS’ to create web sites.  The web analyst should work with the CMS team to build page tags into site pages.  Integration will help you with two points raised above: centralizing tag management and change management/orchestration.
  • Challenges with decodes and lookups using tags.  I’ve learned that it is difficult and in many cases impossible to decode a value or use lookup tables with a page tag.  For example, if I had a page that had a URI “/er45rw/e42f45erfwrq3r.html,” I can’t decode it to read “Web Analytics Blog” in my reports.  Or if it is possible to do simple decodes, I have to hardcode the decode on each page.   Hard coding on an evolving web site is never manageable over time.
  • Latency.  If the page partially loads without executing the javascript, or if the user clicks through the page before the javascript has fired, the page view won’t be counted.  You must test to make sure your tag is firing properly.
  • Javascript turned off.  If the browser doesn’t execute javascript, the tag won’t fire and the visitor won’t be measured.  What effect will that have on your numbers? 
  • Cookie issues.  Tags may set third-party cookies.  Privacy policies don’t like the third-party cookie.  As Justin Cutroni points out in the comments, his tool sets first-party cookies by default.  That’s good.
  • DNS changes.  To prevent third party cookie issues, you can trick out the DNS with a CNAME entry.  Say hello to someone in IT called the DNS admin!
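
For the reconciliation challenge in the list above, here is a rough Python sketch of comparing page view counts from log files against counts from page tags.  It assumes both sources have already been filtered and summarized into per-page totals; the function names and numbers are hypothetical.

```python
def reconcile(log_views, tag_views):
    """Build a per-page comparison of log-based and tag-based page view counts."""
    report = {}
    for page in sorted(set(log_views) | set(tag_views)):
        logged = log_views.get(page, 0)
        tagged = tag_views.get(page, 0)
        report[page] = {"log": logged, "tag": tagged, "gap": logged - tagged}
    return report


def untagged_pages(log_views, tag_views):
    """Pages with traffic in the logs but no tag data at all: likely missing a tag."""
    return sorted(
        page for page, count in log_views.items()
        if count > 0 and tag_views.get(page, 0) == 0
    )


# Hypothetical daily totals after bot and filetype filtering.
log_views = {"/index.html": 120, "/products.html": 80, "/checkout.html": 40}
tag_views = {"/index.html": 110, "/products.html": 78}

report = reconcile(log_views, tag_views)
missing = untagged_pages(log_views, tag_views)

for page, row in report.items():
    print(page, row)
print("Possibly untagged:", missing)
```

Small gaps are expected from javascript being off, latency, and bots that slipped past the log filters.  A page with log traffic and zero tag traffic is the one to go look at first.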

In discussing the page tag in such a manner, you may think I don’t like the page tag.  Not true at all!  The page tag is a very useful data collection method in context.  I like it very much, especially in hybrid data collection.  Like any technology, just make sure you understand the implications of your data collection method on your implementation.   

page_tags.gif
Courtesy of Zeus.com.