Part 1: Your Web Analytics Data Quality May Stink and Here’s Why!
Web analytics data quality and the accuracy of “the numbers” are always questioned. With so many sources of data from different systems and vendors, both free and paid, you must be able to reconcile deviations in data from different sources and speak intelligently about data quality and accuracy to promote adoption of web analytics at your organization.
There are so so so many reasons why web analytics data quality can stink. I thought it would be fun to list some of the major reasons (over two posts):
- Spiders and Bots. If you haven’t read my series on spiders and bots, check it out. Non-human traffic can inflate your metrics and diminish the predictive power of your analysis. Regularly look for bots and update your filtering!
- Untagged pages. If the bulk of your page views is being generated by a single page, and you’ve failed to tag that page, you’ve lost data. That’s always a bummer. Tagging must be endemic to the web development process. Ask your team how they know that every page is tagged. Really, how do they know? Verify and reverify. Use a tool like WASP. Obviously this isn’t a problem with log files.
- JS turned off. If the browser doesn’t execute JavaScript, the page tag won’t fire and traffic will be missed. If you’ve missed it, you don’t know it occurred, so you are constantly in the dark unless you compare your page-tagged data to log file data, which isn’t easy at all! How you account for this missing traffic, whether it is immaterial or not, is a business decision. Again, not a problem with log files.
- Latency. If the page tag doesn’t fire because it failed to load, the traffic is missed. Vendors provide recommendations about the best place for a page tag. Your development team may not believe them, or the “global include” that may or may not exist may or may not insert the code in the suggested spot. It’s a good idea to listen to vendors when they provide configuration recommendations. Once again, how you account for this data discrepancy is a business decision. Not a problem with log files.
- Differences in sessionization. Data divergence gets hairy when you are running two tools on the same site, or are replacing one tool with another. That seems to be very common these days with Google Analytics. Eric Enge over at Stone Temple Consulting, in cooperation with my pal Jim Sterne, has some data from the 2007 Web Analytics Shootout that you should read to help you understand how different vendors sessionize.
- Sampling, sampling, sampling. My friend Avinash Kaushik does a good job covering issues around sampling in web analytics. I recommend reading his post! I’ll add that statistical methods applied to web analytics data are completely valid; however, sampling at the site, page, or database level opens the possibility that you miss key data. Sites looking for the “long tail” of visitors and using data sampling or other data trimming methods may not find it.
- No referrer passed. Referrer analysis tells you what site people came from before they visited your site. Bookmarks, typed pages, email campaigns, and bots don’t pass referrers. Sometimes referrer information just isn’t passed by the browser. If you overuse redirects on your sites, you may lose referrer data. Not having this important information impacts SEO/SEM, linking campaigns, and affiliate partnering.
- Cookie configuration and deletion. We all know about cookie deletion thanks to Eric’s research while at Jupiter. In addition, if your server isn’t set up right, you may not be setting the cookie on the first request. Talk to the web server guys and gals at your company to make sure your cookie handling is optimized for web analytics.
- Proxies. Proxies can filter out your referrers and make it look like everyone’s coming from the same IP address. Most tools enable cookie-based visitor identification to work around the proxy issue, but if cookies aren’t set up right on your web server or configured correctly in your web analytics tool, assessing uniqueness when the bulk of traffic comes from the same IP will be problematic.
- Time spent metrics have severe limitations. Single-page visits and the time spent on the last page in the visit aren’t measured in “total time online.” Time-based metrics are schoolboy metrics, so don’t overemphasize them as stand-alone indicators of “engagement.” Assess the impact of time spent in the context of goals. Use time as a variable for segmentation and as input into a larger engagement metric. Use these metrics in context, especially if you are selling advertising (and who isn’t?). My friend Jim Novo has one of the best takes on time spent metrics I’ve read.
- Failure to maintain exclusions and filters. Data can be polluted if you are introducing new file types and not excluding them from your top content reporting, or if you are not maintaining your bot filters. New bots are crawling your site right now. What are those web developers doing now? Do you know? Are you checking? What’s the process for doing so?
- Little to no historic data. If you’ve just tagged your pages, you may have no basis for historical comparisons for quite some time (like a year!). Data has realized value from understanding current behavior, and potential value from predicting future behavior. And you need data to do it.
- You’re not storing all your dynamic URLs. Your URLs may have many parameters in the query string (i.e., the name/value pairs after the “?”). Some tools cut out this data and don’t make it available for querying or reporting in their applications. If your site is database-driven and dynamic, not having access to every URL request will limit your ability to do ad-hoc analysis, filtering, and segmentation.
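To make two of the points above concrete (differences in sessionization and the limits of time spent), here is a minimal Python sketch. It is an illustration under stated assumptions, not any vendor’s actual algorithm: the `sessionize` and `measured_seconds` helpers, the 15- and 30-minute inactivity timeouts, and the page-view timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, timeout_minutes):
    """Group one visitor's page-view timestamps into visits.

    A new visit starts whenever the gap since the previous page view
    exceeds the inactivity timeout (an assumed, tool-specific setting).
    """
    visits = []
    for ts in sorted(timestamps):
        if visits and ts - visits[-1][-1] <= timedelta(minutes=timeout_minutes):
            visits[-1].append(ts)   # still inside the current visit
        else:
            visits.append([ts])     # gap too long (or first hit): new visit
    return visits


def measured_seconds(visit):
    """Time spent as a tool can observe it: last hit minus first hit.

    Time on the final page is unknown, so a single-page visit yields 0.
    """
    return (visit[-1] - visit[0]).total_seconds()


# Hypothetical page views from one visitor (made-up data).
day = datetime(2008, 1, 15)
hits = [day.replace(hour=9, minute=m) for m in (0, 5, 30, 32)]
hits.append(day.replace(hour=10, minute=15))

# Compare two assumed timeout settings on the same raw hits.
for timeout in (15, 30):
    visits = sessionize(hits, timeout)
    print(timeout, len(visits), [measured_seconds(v) for v in visits])
```

With these made-up hits, the 15-minute rule reports three visits and the 30-minute rule reports two, even though the underlying page views are identical. In both cases the final single-page visit shows zero seconds, no matter how long the visitor actually stayed on that page.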
Let’s continue this long list in Part 2!

