Web Analytics Blogs

Judah Phillips is an experienced web analytics practitioner and Internet expert currently working as a Senior Director at a large, global Internet company. His blog is full of useful, unbiased, actionable insights learned from the real-world practice of a process-oriented, integrated approach to strategic Web Analytics for improving business performance.


Archive for 'Reporting'


Web Analytics Standards: 26 Terms and Definitions from the Web Analytics Association

Web analytics standards are few and far between, which is why I’m glad to blogivate about the Web Analytics Association’s recently released standard definitions for 26 web analytics metrics.  I’m curious to see how the world will respond to these basic definitions.   Standard vocabulary and definitions educate new practitioners, enable consistency in discussions, and lead to shared understandings that foster and promote innovation.  IMHO, the web analytics industry can only benefit from standards.  I certainly think they help to:

  • Clarify misunderstanding and prevent confusion.  As the Internet continues to “go mainstream” and more money is invested in the “online channel,” the capital markets will continue to scrutinize and demand consistency in measurement.  The WAA standards set a new baseline for discussing internet measurement. 
  • Align other companies and bodies and people expressing standards and using non-standard vocabulary.  If the WAA definitions reach a tipping point through broad industry adoption, other standards-setting bodies and industry organizations will adopt and follow suit.  However
  • Create a shared vocabulary.  It is not uncommon to hear references to objects in web analytics that are archaic (pages served), industry-specific (page impressions), or conceptually obsolete for certain goals (the number of “hits” as an indicator of site success).  The “names of things” are different across competing technologies.  I hope this document furthers discussion and leads to a common, shared global web analytics vocabulary.

So what are these new standards, you ask?  Here is the standard vocabulary (thanks to my friend Avinash Kaushik whose digitization of the document I have cut and pasted here :) :

  • Building Block Terms: Page, Page Views, Visits, Unique Visitors, New Visitor, Repeat Visitor, Return Visitor
  • Visit Characterization: Entry Page, Landing Page, Exit Page, Visit Duration, Referrer, Internal Referrer, External Referrer, Search Referrer, Visit Referrer, Original Referrer, Click-through, Click-through Rate/Ratio, Page Views per Visit
  • Content Characterization: Page Exit Ratio, Single-Page Visits, Single Page View Visits (Bounces), Bounce Rate
  • Conversion Metrics: Event, Conversion

Brief definitions for all these web metrics are listed below.  Make sure you download and read the full document.  There’s a lot more to it than listed below:

  • Page: A page is an analyst-definable unit of content.
  • Page Views: The number of times a page (an analyst-definable unit of content) was viewed.
  • Visits/Sessions: A visit is an interaction, by an individual, with a website consisting of one or more requests for an analyst-definable unit of content (i.e. “page view”). If an individual has not taken another action (typically additional page views) on the site within a specified time period, the visit session will terminate.
  • Unique Visitors: The number of inferred individual people (filtered for spiders and robots), within a designated reporting timeframe, with activity consisting of one or more visits to a site. Each individual is counted only once in the unique visitor measure for the reporting period.
  • New Visitor: The number of Unique Visitors with activity including a first-ever Visit to a site during a reporting period.
  • Repeat Visitor: The number of Unique Visitors with activity consisting of two or more Visits to a site during a reporting period.
  • Return Visitor: The number of Unique Visitors with activity consisting of a Visit to a site during a reporting period and where the Unique Visitor also Visited the site prior to the reporting period.
  • Entry Page: The first page of a visit.
  • Landing Page: A page intended to identify the beginning of the user experience resulting from a defined marketing effort.
  • Exit Page: The last page on a site accessed during a visit, signifying the end of a visit/session.
  • Visit Duration: The length of time in a session. Calculation is typically the timestamp of the last activity in the session minus the timestamp of the first activity of the session.
  • Referrer: The referrer is the page URL that originally generated the request for the current page view or object.
  • Internal Referrer: The internal referrer is a page URL that is internal to the website or a web-property within the website as defined by the user.
  • External Referrer: The external referrer is a page URL where the traffic is external or outside of the website or a web-property defined by the user.
  • Search Referrer: The search referrer is an internal or external referrer for which the URL has been generated by a search function.
  • Visit Referrer: The visit referrer is the first referrer in a session, whether internal, external or null. 
  • Original Referrer: The original referrer is the first referrer in a visitor’s first session, whether internal, external or null.
  • Click-through: Number of times a link was clicked by a visitor.
  • Click-through Rate/Ratio: The number of click-throughs for a specific link divided by the number of times that link was viewed.
  • Page Views per Visit: The number of page views in a reporting period divided by number of visits in the same reporting period.
  • Page Exit Ratio: Number of exits from a page divided by total number of page views of that page.
  • Single-Page Visits: Visits that consist of one page regardless of the number of times the page was viewed.
  • Single Page View Visits (Bounces): Visits that consist of one page-view.
  • Bounce Rate: Single page view visits divided by entry pages.
  • Event: Any logged or recorded action that has a specific date and time assigned to it by either the browser or server.
  • Conversion: A visitor completing a target action.
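
To make a few of these definitions concrete, here is a minimal Python sketch that groups raw page views into visits and derives Page Views per Visit and Bounce Rate.  It is my own illustration, not part of the WAA document: the 30-minute timeout, the record layout, and the function names are all assumptions (the standard only says "a specified time period").

```python
from datetime import datetime, timedelta

# Assumed inactivity timeout; the WAA text only says "a specified time period".
SESSION_TIMEOUT = timedelta(minutes=30)


def sessionize(page_views, timeout=SESSION_TIMEOUT):
    """Group (visitor_id, timestamp, page) records into visits.

    A new visit starts when the gap since the same visitor's previous
    page view exceeds the timeout.
    """
    visits = []
    last_seen = {}  # visitor_id -> (last timestamp, index of that visitor's open visit)
    for visitor_id, ts, page in sorted(page_views, key=lambda r: (r[0], r[1])):
        previous = last_seen.get(visitor_id)
        if previous is None or ts - previous[0] > timeout:
            visits.append({"visitor": visitor_id, "pages": [], "start": ts, "end": ts})
            index = len(visits) - 1
        else:
            index = previous[1]
        visits[index]["pages"].append(page)
        visits[index]["end"] = ts
        last_seen[visitor_id] = (ts, index)
    return visits


def summarize(visits):
    """Site-wide metrics for one reporting period."""
    if not visits:
        return {"visits": 0, "unique_visitors": 0,
                "page_views_per_visit": 0.0, "bounce_rate": 0.0}
    total_page_views = sum(len(v["pages"]) for v in visits)
    bounces = sum(1 for v in visits if len(v["pages"]) == 1)
    return {
        "visits": len(visits),
        "unique_visitors": len({v["visitor"] for v in visits}),
        "page_views_per_visit": total_page_views / len(visits),
        # Site-wide, every visit has exactly one entry page,
        # so "entry pages" equals the number of visits.
        "bounce_rate": bounces / len(visits),
    }


t0 = datetime(2008, 1, 1, 9, 0)
records = [
    ("a", t0, "/home"),
    ("a", t0 + timedelta(minutes=5), "/products"),
    ("a", t0 + timedelta(hours=2), "/home"),    # gap > 30 minutes: a new visit
    ("b", t0 + timedelta(minutes=1), "/blog"),
]
visits = sessionize(records)
stats = summarize(visits)
```

With this sample data, visitor "a" produces two visits and visitor "b" one, so there are three visits from two unique visitors, and two of the three visits are bounces.  Change the timeout and the numbers change, which is exactly why the definitions matter.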

In order for broad-based adoption and continued relevancy of these standards, I encourage the Web Analytics Association to: 

  • Create broad consensus and agreement.  I was surprised the Web Analytics Association didn’t release these standards for comment to the larger membership and the public before releasing these standard definitions.  While I support the standards, I fear the perception of “dropping” standards on practitioners and vendors without providing a period for public commentary may slow adoption as people grumble about the nuances of the language.  After all, not all vendors’ tools or reporting comply exactly with the subtleties in these standards.
  • Necessitate adoption by vendors and practitioners.  The old American expression says you say “po-tay-toe” I say “po-tah-toe;” I say “to-may-toe” you say “to-mah-toe.”   For broad adoption and usage of these standards, vendors need to integrate this vocabulary into graphical interfaces, reporting, documentation, training programs, and marketing messaging.  Consultants and practitioners need to “talk the talk.”  The Web Analytics Association should think about creating a “standards certification” program to verify adherence by certain companies and consultants.
  • Identify compliance by vendors.  Current vendor vocabulary doesn’t conform to the standards, and there is currently no persuasive argument for vendors to adopt the definitions and modify their offerings.  The WAA needs to let the public know which vendors comply and which don’t and to what degree!
  • Go beyond definitions to focus on interoperability.  Systems integration requires more than just definitions.  I’m looking forward to when these standards are described in XML.

Excellent job, Web Analytics Association!  If you haven’t joined, you should! 

Web Analytics and the Normal Distribution: More on Statistics and Web Data

Is web analytics data normally distributed?  That question calls for another question: what web analytics variables are you measuring?  That matters.  Numeric random variables (let’s call them data) are classified into the following types:

  • Discrete.  That means you count it.  The data arrives from a counting process.  In web analytics discrete random variables are counts of things like page views, visits, and unique visitors.
  • Continuous.  That means you measure it.  The data arrives from a measurement process.  In web analytics continuous random variables are time-based metrics.

We do both in web analytics, don’t we?  We count some things.  We measure some stuff.  And if we’re smart and have the autonomy and positional power to do so, we apply process to counting and measuring web analytics data. 

We often talk about “counting” and “measuring” like they are the same activities.  In general day-to-day online business, that’s no big deal for conceptual conversations.  But in statistics, “counting” is different than “measuring.” 

Both discrete and continuous variables may be represented by probability distributions to assess the likelihood of an outcome.  To identify probability for discrete variables, use a discrete distribution such as the “binomial distribution,” which models the number of successes in a fixed number of trials.  Binomial distributions take into account the probability that an outcome will occur, so you may see some skewing when plotting the data that may make it look a bit “long tail.” 
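
As a rough illustration of a binomial calculation, the sketch below computes the probability of seeing exactly k conversions in n visits.  The 3% conversion rate and the 100-visit sample are made-up numbers, and treating each visit as an independent trial with the same conversion probability is an assumption that real traffic only approximates.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binomial_at_most(k, n, p):
    """Probability of k or fewer successes in n trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))


# Hypothetical scenario: 100 visits, assumed 3% conversion rate.
visits = 100
conversion_rate = 0.03
p_exactly_three = binomial_pmf(3, visits, conversion_rate)
p_none = binomial_pmf(0, visits, conversion_rate)
```

Plot `binomial_pmf(k, 100, 0.03)` for k from 0 to 15 and you will see the lopsided shape described above: the mass piles up near zero with a tail to the right.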

For continuous random variables use the “normal distribution.” Realize your data won’t always look exactly like a bell curve.  If it looks really different and “long tail” you may be looking at a discrete variable better suited for a binomial distribution.

So is web analytics data “normally distributed?”  The answer is that it depends on the type of data.  Even then, the answer is “probably not.” In fact, most business data doesn’t follow a perfectly normal distribution.  Yet every day in halls of academia, very book smart people teach statistics and tell you to apply it to business data.  Are they wrong?  Insane?  Misguided?

No they aren’t (well maybe you have to be slightly insane to teach stats).  Academics realize that most distributions are not normal and do not have equal measures of central tendency (i.e. mode, median, mean).  Skewness abounds!  The normal distribution, however, can be used to approximate “real-world” distributions that have different measures of central tendency. 

A theorem called the “central limit theorem” states that “if the sum of the variables has a finite variance, then it will be approximately normally distributed (i.e., following a normal or Gaussian distribution).” In other words, as the sample gets larger the distribution of the sample mean can be approximated by the normal distribution.  And if I remember correctly, a common rule of thumb among statisticians is that with a sample size of at least thirty, the sampling distribution of the mean will be approximately normal (heavily skewed data may need more).  Fortunately, we web analysts often have millions of data points to use…
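
You can watch the central limit theorem at work with a small simulation.  The sketch below is only an illustration under stated assumptions: the "visit durations" are simulated from an exponential distribution with a 120-second mean (a stand-in for skewed data, not real web data), and the sample size of thirty just mirrors the rule of thumb above.

```python
import random
from statistics import fmean


def skewness(values):
    """Moment-based skewness: 0 for a perfectly symmetric data set."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulated, heavily right-skewed "visit durations" in seconds (assumed mean: 120).
durations = [rng.expovariate(1 / 120) for _ in range(60000)]

# Means of consecutive samples of thirty visits each.
SAMPLE_SIZE = 30
sample_means = [fmean(durations[i:i + SAMPLE_SIZE])
                for i in range(0, len(durations), SAMPLE_SIZE)]

skew_raw = skewness(durations)
skew_means = skewness(sample_means)
```

The raw durations are strongly skewed, while the skewness of the sample means comes out much closer to zero.  The means still are not perfectly symmetric at a sample size of thirty, which is why the rule of thumb is only a rule of thumb.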

Some time ago I actually took average visit duration for one site for which I have real data for thousands of visits and did a Lilliefors Test of Normality.  The test found no evidence that the data wasn’t normally distributed even though it looked a bit odd and the skewness was 0.741426 and the kurtosis was 4.1525665. 

If you’re thinking about applying statistics to web data, make sure you identify whether the data you are looking at is discrete or continuous.  Don’t abandon the normal distribution for certain types of web analytics data just because it doesn’t exactly look like the Liberty Bell.  Test it for normality before applying the Gaussian statistics.  If the data is highly skewed, determine whether the level of error is in acceptable limits.  Look at using other distributions for discrete variables.  
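
If you want a quick first check before reaching for a stats package, skewness and kurtosis can be computed by hand and combined into the Jarque-Bera statistic.  Note this is a different test from the Lilliefors test I used above; I'm swapping it in here only because it needs nothing beyond the standard library.  The two samples are simulated, and 5.991 is the 5% critical value of the chi-square distribution with two degrees of freedom.

```python
import random


def central_moment(values, order):
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)


def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    # Plain (non-excess) kurtosis: a normal distribution scores 3.
    return central_moment(values, 4) / central_moment(values, 2) ** 2


def jarque_bera(values):
    """Jarque-Bera statistic: large values are evidence against normality."""
    n = len(values)
    s = skewness(values)
    k = kurtosis(values)
    return n / 6 * (s ** 2 + (k - 3) ** 2 / 4)


# 5% critical value of the chi-square distribution with 2 degrees of freedom.
CHI2_CRITICAL_5PCT_2DF = 5.991

rng = random.Random(7)
bell_shaped = [rng.gauss(180, 40) for _ in range(2000)]          # simulated
long_tailed = [rng.expovariate(1 / 180) for _ in range(2000)]    # simulated

looks_non_normal = jarque_bera(long_tailed) > CHI2_CRITICAL_5PCT_2DF
```

A statistic above the critical value suggests rejecting normality at the 5% level.  With millions of data points almost any small departure will trip the test, so look at the size of the skew as well as the verdict.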

normaldist.gif

Image from http://www.weibull.com/

Part 2: Your Web Analytics Data Quality May Stink and Here’s Why!

In Part 1, I began a long list of reasons why your web analytics data quality may stink.  I’m continuing the list below (make sure you read Part 1 for context and to view the entire list)

  • Storing only visit level data.  Many tools don’t have schemas that store raw data at the visitor level.  Instead they provide access to only visit level data.  For example, you may not be able to see all the page views during a single visit per IP address or cookied visitor.   Assess the impact of the vendor’s schema on your goals.  Companies that use analytics data to feed other systems or that want to use visitor attributes for content targeting, segmentation, optimization, or analysis may not be well-served by some vendor schemas.
  • Little to no decodes or lookups.  If you use numeric codes and non-human readable naming conventions in your data, they can pass through to your reporting and prevent your colleagues from understanding the reporting.  Strange codes look like hieroglyphics!  Decoding and looking up data can eliminate the problem of non-readability and strange numerical names in your reporting… While some would say this is a reporting issue, not a data issue,  I chose to include it because it’s at the surface… it’s the data your customers see.  Not all tools decode or lookup.  Some tools allow rewriting of data in the database.
  • Failure of key services supporting the application.  If you are dependent on page tags, synchronization software, web servers, databases, or any of the wondrous technology that makes it all work, failures are a real bummer.  Make sure you have monitoring and recovery processes in place so you don’t miss data!  When page tag collection fails (perhaps the page tag server went down ay?), the data is gone forever.  If the web server fails, then no logs are written, but no pages are served either - so is traffic missed?  But if the processes supporting log file analysis fail (i.e. data synch), watch out! 
  • Inadequate or incorrect implementation.  If you can’t cross dimensions (like finding out what keywords referred traffic to a page), filter all of your data (for example, filtering pages to see only those viewed by the iPhone), easily create new metrics, or if the numbers aren’t adding up, you may have not adequately or correctly implemented your software or communicated your requirements to your vendor’s professional services team. 
  • Limited, hard-to-extend data model. Powerful, actionable insights from web analytics are enabled by extending a data model to incorporate business specific dimensions.  For example, if every page has a category and an author, you may want to see a list of all the page views in that category or ranking of pages by most popular author.  To do that you may need to join data at the database level or take advantage of variables you pass in a page tag.  Various tools have different limits on if, how, and to what extent you can extend the data model.
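
To illustrate the decode/lookup and data model points above, here is a small sketch that joins page view counts to a lookup table and rolls them up by a business dimension.  The page IDs, titles, categories, and author names are all invented for the example, and a real tool would do this join in its database or reporting layer rather than in a script.

```python
from collections import Counter

# Hypothetical lookup table: cryptic page ID -> human-readable attributes.
PAGE_METADATA = {
    "p1001": {"title": "Spring Training Preview", "category": "Sports", "author": "Kim"},
    "p1002": {"title": "City Budget Vote", "category": "News", "author": "Lee"},
    "p1003": {"title": "Playoff Recap", "category": "Sports", "author": "Kim"},
}


def page_views_by(dimension, page_view_counts, metadata, unknown="(unknown)"):
    """Roll up page views by a dimension found in the lookup table.

    Page IDs missing from the lookup are grouped under a visible
    "(unknown)" label instead of being silently dropped.
    """
    totals = Counter()
    for page_id, views in page_view_counts.items():
        label = metadata.get(page_id, {}).get(dimension, unknown)
        totals[label] += views
    return totals.most_common()


raw_counts = {"p1001": 50, "p1002": 30, "p1003": 40, "p9999": 5}
by_author = page_views_by("author", raw_counts, PAGE_METADATA)
by_category = page_views_by("category", raw_counts, PAGE_METADATA)
```

Keeping the "(unknown)" bucket visible is a deliberate choice: a growing unknown bucket is an early warning that the lookup table has fallen behind the site.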

So what do you do when you know your data quality is less than stellar?  Here’s some guidance:

  • Don’t worry, be happy. :-) Just by collecting the data you are collecting, you are doing better than a great majority of companies that do business on the Internet.  By asking questions about data and investigating the issues, you have a leg up on your competition.  Work on optimizing the data, expose flaws in site design or architecture that impede data collection, work with your vendor and seek help in the web analytics community if you run into real problems.  The Web Analytics Association’s Forum on Yahoo is a useful place for posting questions.  But whatever you do, stay positive and focused on solving your problems and making your web analytics practice more optimized.  Don’t get frustrated.
  • Recognize the limitations in the data and do not go gently into the night.  Ask the hard questions about sampling, schemas, data retention, processing, querying and reporting to understand where the holes and noise could be in your data.  Demand answers from your vendors and quick response times to your questions about data quality.  If your vendor is frustrating you by not being responsive, talk to the boss and the vendor’s bosses, escalate, escalate, escalate until you get resolution.
  • Understand the underlying elements of data collection and what can go wrong.  Learn about sessionization and why different tools and data collection methods have limitations.  Explore the more technical components of the backend, like the database and your web analytics schema - all your data is in one (or more)! Talk to your engineers.  Have them explain the technology in terms you understand.
  • Evaluate your tools.  Some tools are just better suited for particular business problems than other tools.  Log file tools enable you to constantly change assumptions and reprocess data.  Page tags provide a standard data collection and transport mechanism.

With hard work on your part, you can make your web analytics data smell like roses!  I know you can! :)


Part 1: Your Web Analytics Data Quality May Stink and Here’s Why!

Web analytics data quality and the accuracy of “the numbers” are always questioned.  With so many sources of data from different systems and vendors, both free and paid, you must be able to reconcile deviations in data from different sources and speak intelligently about data quality and accuracy to promote adoption of web analytics at your organization. 

There are so so so many reasons why web analytics data quality can stink.   I thought it would be fun to list some of the major reasons (over two posts):

  • Spiders and Bots.  If you haven’t read my series on spiders and bots, check it out.  Non-human traffic can inflate your metrics and diminish the predictive power of your analysis.  Regularly look for bots and update your filtering!
  • Untagged pages.  If the bulk of your page views are being generated by a single page, and you’ve failed to tag that page, you’ve lost data.  That’s always a bummer.  Tagging must be endemic to the web development process. Ask your team how they know that every page is tagged.  Really how do they know?  Verify and reverify.  Use a tool like WASP.   Obviously this isn’t a problem with log files.  
  • JS turned off.  If the browser doesn’t execute javascript, the page tag won’t fire and traffic will be missed.  If you’ve missed it, you don’t know it occurred, so you are constantly in the dark unless you compare your page tagged data to log file data, which isn’t easy at all!  How you account for missing this traffic, whether it is immaterial or not, is a business decision. Again, not a problem with log files.
  • Latency.  If the page tag doesn’t fire because it failed to load, the traffic is missed.  Vendors provide recommendations about the best place for a page tag.  Your development team may not believe them, or the “global include” that may or may not exist may or may not insert the code in the suggested spot.  It’s a good idea to listen to vendors when they provide configuration recommendations.  Once again, how you account for this data discrepancy is a business decision. Not a problem with log files.
  • Differences in sessionization.  Data divergence gets hairy when you are running two tools on the same site, or are replacing one tool with another.  That seems to be very common these days with Google Analytics.  Eric Enge over at Stone Temple Consulting in cooperation with my pal Jim Sterne has some data from the 2007 Web Analytics Shootout that you should read to help you understand how different vendors sessionize.
  • Sampling, sampling, sampling.  My friend Avinash Kaushik does a good job covering issues around sampling in web analytics.  I recommend reading his post!  I’ll add that statistical methods applied to web analytics data are completely valid; however, sampling at the site, page, or database level opens the possibility that you miss key data.  Sites looking for the “long tail” of visitors and using data sampling or other data trimming methods may not find it. 
  • No referrer passed.  Referrer analysis tells you what site people came from before they visited your site.  Bookmarks, typed pages, email campaigns, and bots don’t pass referrers.  Sometimes referrer information just isn’t passed by the browser.  If you overuse redirects on your sites, you may lose referrer data.  Not having this important information impacts SEO/SEM, linking campaigns, and affiliate partnering.  
  • Cookie configuration and deletion.  We all know about cookie deletion thanks to Eric’s research while at Jupiter.  In addition, if your server isn’t set-up right, you may not be setting the cookie on the first request. Talk to the web server guys and gals at your company to make sure your cookie handling is optimized for web analytics.
  • Proxies.  Proxies can filter out your referrers and make it look like everyone’s coming from the same ip address.  Most tools enable cookie-based visitor identification to work around the proxy issue, but if cookies aren’t set up right on your web server or configured correctly in your web analytics tool, assessing uniqueness when the bulk of traffic comes from the same IP will be problematic.     
  • Time spent metrics have severe limitations.  Single-page visits and the time spent on the last page in the visit aren’t measured in “total time online.”  Time-based metrics are schoolboy metrics, so don’t overemphasize them as stand-alone indicators of “engagement.”  Assess the impact of time spent in the context of goals.  Use time as a variable for segmentation and as input into a larger engagement metric. Use them in context, especially if you are selling advertising (and who isn’t)?  My friend Jim Novo has one of the best takes on time spent metrics I’ve read.
  • Failure to maintain exclusions and filters.  Data can be polluted if you are introducing new filetypes  and not excluding them from your top content reporting or if you are not maintaining your bot filters.  New bots are crawling your site right now.  What are those web developers doing now?   Do you know?  Are you checking?  What’s the process for doing so?
  • Little to no historic data.  If you’ve just tagged your pages, you may have no basis for historical comparisons for quite some time (like a year!).  Data has realized value from understanding current behavior, and potential value from predicting future behavior.  And you need data to do it.    
  • You’re not storing all your dynamic URLs.  Your URLs may have many parameters in the query string (i.e. the name/value pairs after the “?”).  Some tools cut out this data and don’t make it available for querying or reporting in their applications.  If your site is database-driven and dynamic, not having access to every URL request will limit your ability to do ad-hoc analysis, filtering, and segmentation.
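
The time spent limitation in the list above is easier to see in code.  This sketch assumes the common timestamp-difference approach (time on a page is the gap until the next request in the same visit); the page names and timestamps are invented, and your tool may calculate things differently.

```python
from datetime import datetime, timedelta


def time_on_pages(visit):
    """visit: list of (timestamp, page) tuples in time order.

    Returns (page, seconds) pairs.  The last page gets None because
    there is no later request to measure against.
    """
    result = []
    for (ts, page), (next_ts, _next_page) in zip(visit, visit[1:]):
        result.append((page, (next_ts - ts).total_seconds()))
    if visit:
        result.append((visit[-1][1], None))
    return result


def measured_visit_duration(visit):
    """Last timestamp minus first timestamp; 0 for a single-page visit."""
    if not visit:
        return 0.0
    return (visit[-1][0] - visit[0][0]).total_seconds()


start = datetime(2008, 1, 1, 12, 0, 0)
three_page_visit = [
    (start, "/home"),
    (start + timedelta(seconds=60), "/article"),
    (start + timedelta(seconds=200), "/contact"),
]
single_page_visit = [(start, "/article")]
```

However long the visitor actually spent reading "/contact", the three-page visit measures 200 seconds, and the single-page visit measures zero even if the visitor read the whole article.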

Let’s continue this long list in Part 2!



Web Analytics ROI and Value Generation? The Three Core Actions

Why do we do web analytics?  What value does it generate?  What is the ROI from web analytics?  Companies trying to justify or in the process of allocating capital to “do web analytics” are wondering…  Recent research has shown that companies serious about web analytics need to invest money in people and technology.  But what do companies get in return for doing so? 

I believe that web analytics helps online business identify potential opportunities for taking action to:

  • Increase revenue.  Web analytics helps you make more money.
  • Decrease cost.  Web analytics helps you spend less money to make more money.
  • Improve operations.  Web analytics helps you work smarter and more efficiently.

Let’s explore some of the ways web analytics can help online business across these three core actions:

Increased revenue through:                                                                                    

  • More targeted advertising sales.  Content is monetized in many ways: CPM, CPA, CPL, PPC, and more.  Web analytics can tell you which of these methods for generating revenue and which advertising campaigns using those channels are performing most effectively.  External campaign effectiveness may be tracked using referrer data and related dimensional reporting.  Metrics related to internal campaigning, like microsites or special advertorial offerings, can be easily provided to advertisers and agencies to identify audience consistency and quality.
  • Better insights into audience segments to realize incremental revenue.  Segmentation refers to dividing a total population into groups based on one or more characteristics.  A good web analytics tool easily enables you to segment on dimensions and attributes relevant to your business.  Segmenting web data enables you to answer questions about which visitors visit when and with what frequency, depth, and duration, and more, which provides otherwise unknowable insights.  New incremental revenue streams may be realized by mapping newly discovered behavioral or demographic characteristics to existing advertiser or agency demand.
  • Creating effective online-marketing and editorial offerings.  Reports showing visit frequency, depth, recency, and the time periods when the online audience visits the site assist product managers, editors, and producers in optimizing, crafting, and targeting content and advertising, increasing reach and exposure time of advertiser messaging to key audience segments.
  • Ensuring pages effectively lead to conversion funnels. Metrics like bounce rate, conversion rate, clickstream pathing, and conversion metrics provide indications about how to modify or tailor pages to generate value.  Funnels can provide insights about which calls to action, content, pages, sections, and campaigns yield the best conversions.

Reduced costs by:

  • Increasing the effectiveness of online work products. By identifying, monitoring, and evaluating important KPI’s (key performance indicators and KKPI’s!), the business learns what works and what doesn’t work online.  Web performance data has amazing utility when evaluating, planning, and monitoring current and future trends when assessing how to reduce cost in a portfolio of online products.
  • Maximizing site operations, content, and opportunities for organic and paid search.  The performance of pay-per-click, paid inclusion, and contextual advertising and linking campaigns may be audited to eliminate projects that fail to meet goals based on conversion, revenue, or KPI’s.  By tying conversion to capital budgeting, online projects that fail to meet site hurdle rates may be tailored or eliminated.  The business can then better focus on driving profitable revenue without misallocating resources.
  • Optimizing user experience and information architecture.  Overhead reducing tools like Google Website Optimizer and offerings from other companies providing site optimization services use web analytics data to programmatically alter a site to increase conversion and lift.  CMS automation can be driven off of web analytics data.
  • Pinpointing the performance of online marketing campaigns.  By creating custom KPI’s, metrics, and segmented conversion rate and slicing data via custom filters and business relevant dimensions, deep insights into online performance can be attained.  Misappropriated resources and efforts can be easily recognized and eliminated.

Improved operations via:

  • Deep understanding of site traffic, visitor activity, conversions, and online value-generation.  You can’t manage it if you don’t know about it. Companies most successful with web analytics dedicate a full-time staff to analyzing and contextualizing data and performance metrics from channels like organic and paid search, affiliate partnerships, and offline.  The best staff understands the impact of the web channel across the value chain.
  • Contextualizing strategic decisions with accurate data.  The ability for a corporation to gain insight and intelligence into its online activities provides management with transparency into performance.  Performance must be monitored to be improved, and there’s no other way to gain true insight into online performance than using web analytics to guide web strategy.
  • Identifying site operational effectiveness in a timely manner.  When using log files, server errors and other impediments to online customer satisfaction can be quickly discovered and remediated, which reduces negative impact and minimizes risk. 
  • Predicting the impact of business decisions on performance.  By applying statistical methods to web analytics data, businesses increase their ability to predict the impact of site changes on performance. 

Every Internet business can benefit from technology that positively impacts these three important business actions.  I recommend that you consider how your projects are framed across these three actions, whether you’re just thinking about getting involved with web analytics, growing your web analytics practice, or already running an established web analytics practice at your company.

 

Web Analytics and Data Collection: Log File Analysis

Web Analytics and log file analysis is misunderstood!  With so many options for data collection in web analytics, the log file has been somewhat disdained for many years.  In fact, I’d venture to guess that the majority of web analytics practitioners working today don’t “do log file analysis.” That’s unfortunate.  One can learn a lot about web analytics from examining log files - for example, “total time online” (with all its inherent issues :).

Why is server-side data collection (log file analysis) less popular than client-side page tagging?  It’s not about bots, accuracy, or ease of deployment… that’s marketing spin, somewhat true and false depending on context.  It may be because most web analytics practitioners currently use a hosted model. 

Page tagging enables ASP models for web analytics.  It would be very difficult (i.e. resource intensive and costly) for a web analytics ASP to maintain a competitive service level if the ASP relied on its clients and their IT teams to maintain server-side data collection: maintaining, transporting, and synching log files.

Imagine the ASP model with log files!  It’s nightmarish and makes little sense:  hundreds and thousands of customers with just as many (or more) different servers behind firewalls generating many different log file formats.  Then, at various intervals all customers sending uniquely formatted logs for the ASP to parse, index, and make available for querying and retrieval in close to real time.  Yikes!! 

Page tagging creates a standard data structure for the ASP and a manageable data transport mechanism while minimizing complexity and reducing infrastructure cost in a way that log files just can’t.  Page tags enable ASP’s to meet rigorous service level agreements and contractual obligations for data processing in a way they probably couldn’t with log files.  

So who does log file analysis for web analytics?  The answer is companies that have:

  • No desire to give up control of a web analytics infrastructure to young companies.  Web data is valuable and log files enable you to keep it all for however long and at whatever cost your business justifies.  You can process your data using whatever tool you want, and reprocess at your discretion.
  • Existing business dimensions and structured data already existing in log files. Simple web analytics concepts like page views, visits, and “uniques” become more meaningful when sliced by relevant business objects from your company’s existing log data.  For example, if you’ve passed a unique parameter in your query strings for years, you can program your log file tool to recognize the parameter.   Then reprocess your existing data to make use of it in your reporting. 
  • Deep data mining experience.  Data miners want data to mine, not static GUI’s or “my data” delivered in summary extracts at an incremental cost.  Log files can be parsed and indexed into the database that you, the company or the client, actually owns at the budget you actually set.
  • Know how to employ hybrid data collection.  Log files are complemented by page tags for instrumenting pages to track events and Web 2.0.
  • Complex integration requirements.  All the data exists on machines in your network.  Log files can be extended to incorporate new business dimensions.
  • A need to understand how spiders and bots behave on their site and the performance of their server(s).  Log files capture everything that happens on the web site.  The “kitchen_sink.html” is in there.  It’s up to you to customize the tool to meet the quality level you demand for filtered data.  As Mat and Steve mentioned in the comments, logs also enable you to view server errors, useful for site QA and debugging.
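
To make the query string example concrete, here is a minimal Python sketch of slicing page views by a parameter that already lives in your logs.  It assumes your servers write the common “combined” log format, and the parameter name `cid` and the sample lines are made up for illustration.  Your log file tool will have its own way of doing this, so treat it as a sketch of the idea rather than anyone’s actual implementation.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Combined log format (assumed here; adjust the pattern for custom formats).
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def page_views_by_param(lines, param):
    """Count successful requests per value of one query-string parameter."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or match.group("status") != "200":
            continue  # skip unparseable lines and non-200 responses
        query = parse_qs(urlsplit(match.group("url")).query)
        for value in query.get(param, ["(not set)"]):
            counts[value] += 1
    return counts

# Hypothetical sample lines; "cid" is an invented parameter name.
sample = [
    '10.0.0.1 - - [10/Oct/2007:13:55:36 -0700] "GET /article.html?cid=news HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '10.0.0.2 - - [10/Oct/2007:13:55:40 -0700] "GET /article.html?cid=sports HTTP/1.1" 200 1822 "-" "Mozilla/5.0"',
    '10.0.0.3 - - [10/Oct/2007:13:56:01 -0700] "GET /article.html?cid=news HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '10.0.0.4 - - [10/Oct/2007:13:56:09 -0700] "GET /about.html HTTP/1.1" 200 911 "-" "Mozilla/5.0"',
    '10.0.0.5 - - [10/Oct/2007:13:56:12 -0700] "GET /missing.html?cid=news HTTP/1.1" 404 0 "-" "Mozilla/5.0"',
]

views = page_views_by_param(sample, "cid")
```

Because the raw logs are still on disk, you can rerun the same function over years of history the day someone asks for a new dimension.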

I’m sure there are other reasons too….  I’ve heard people tell me that:

  • It’s hard to move log files.  That depends on how many log files you have, what technology you are using to move them, and how often.
  • Log files are difficult to parse. Not really, especially if they are standardized.  You just need a flexible tool.  If things get non-standard, you need a tool that enables you to define your own formatting.
  • Log files take up a lot of space.  Yes, they do. How much space do you have?  What are your company’s existing standards for parsing log files?  How long will you store them?  You may find out you have plenty of “space.”  You will never have enough time though. ;)
  • Log files aren’t as accurate as page tags.  Huh? That all depends on how well you filter and exclude pages.  Accurate compared to what? To unvalidated data existing on servers you don’t see in some data center thousands of kilometers/miles away?  Logs capture visits from browsers that don’t execute javascript and thus will not appear in the metrics from tag-based systems…

So, you might be thinking I am a big fan of log files.  I am a web analytics realist.  Challenges with log files include:

  • Standardizing on a common format.  If you have a web server, you are at risk for someone doing something that creates a custom log file format.  If you are the new web analyst coming into an existing company, you may find custom fields in the logs.  Develop and champion a company standard.  It will make your life as a web analyst somewhat easier.
  • Creating a single “core” configuration that can be extended to deal with one-off formats.  If you live in a world of custom log files, you can abstract a set of core fields, describe them, then extend the description when you deal with custom craziness.
  • Synchronizing/transporting log files.  The more servers you have, the more chances you have for failures in synchronization.  Application monitoring and recovery procedures are critical to preventing downtime.  Develop them.
  • Reprocessing log files.  Depending on your system, reprocessing existing log file data can have a myriad of effects - from data not being available, to data being partially available, to errors, to long waiting periods before the data is current and ready to query.
  • Receiving log files that are missing key fields or with key fields that have changed.   If you depend on systems, companies, and people that you do not control, fields can be stripped and formats changed in the blink of an eye.  The web team needs to detect and manage these events.
  • Monitoring the operational workflow.  Someone on the web analytics team needs to manage the system and maintain the overall availability and integrity of log file processing and any issues that could result in the data not being synched or processed correctly. 
  • Managing change.  Since all your “hits” are captured in the log file, you’ll need to tell your web analytics tool how to correctly process the logs and remove data you don’t want to report.  You’ll need to identify and maintain lists of filetype exclusions and the user agents/IPs of robotic traffic.
  • Finding staff that “gets it.” The rule is that success in web analytics results from having competent people who have experience understanding the risks and rewards of web analytics data collection and executing on a web analytics strategy that generates value. 
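
A few of these challenges (a core configuration you can extend, catching stripped fields, and filetype exclusions) are easier to picture in code.  The Python below is only a sketch under my own assumptions: the field names, the `campaign_id` extension, and the list of extensions are all hypothetical, and a real log file tool would express the same ideas in its own configuration.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical "core" field set shared by every server in the company.
CORE_FIELDS = ("ip", "time", "url", "status", "agent")

# Hypothetical filetype exclusions: hits you do not want counted as page views.
EXCLUDED_EXTENSIONS = {".gif", ".jpg", ".png", ".css", ".js", ".ico"}

def extend_fields(core, extra):
    """Return the core field list plus any one-off fields, without duplicates."""
    return core + tuple(name for name in extra if name not in core)

def missing_fields(record, required):
    """List required fields that are absent or empty in a parsed log record."""
    return [name for name in required if not record.get(name)]

def is_page_view(url):
    """True when the requested path does not end in an excluded file type."""
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension not in EXCLUDED_EXTENSIONS

# One server also logs an invented "campaign_id" field.
campaign_server_fields = extend_fields(CORE_FIELDS, ("campaign_id",))

# A record whose user agent was stripped somewhere upstream.
record = {"ip": "10.0.0.1", "time": "10/Oct/2007:13:55:36 -0700",
          "url": "/blog/index.html", "status": "200"}
problems = missing_fields(record, CORE_FIELDS)
```

Running a check like `missing_fields` on every incoming batch, and alerting when it returns anything, is one cheap way to notice a format change before it reaches your reports.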

As a web analyst you need to use the tools at your disposal to build your web analytics practice.  Log files are part of your toolkit.  Be wary of people who try to sway you one way or the other when it comes to web analytics data collection.  The method you ultimately use should be based on the data you want to collect in the context of the company’s goals, budget, and expertise.

log_files.gif

Web Analytics and Data Collection: The Page Tag

Many methods exist for collecting different types of web analytics data - some much more accurate and useful than others: page tags, log file analyzers, packet sniffers, audience panels, and toolbars.  At this point in time, the page tag certainly seems to be the preferred method adopted by the web analytics industry for data collection.  It’s also core to other technologies, like multivariate testing.

Various sources will tell you a multitude of “things” over time about page tags, depending on whether they think you are a noob or advanced practitioner.  There’s lots of useful information from many sources about page tags.  Still I find a lot of the average discussion about the utility of page tags to be somewhat true and false, depending on context.  For example, if you’ve never tagged pages, but you have processed log files, does it make sense to entirely throw away legacy processes?  Or if you’ve mastered change management across all pages on your site but you’ve never parsed log files, do you want to have to deal with synching, moving, parsing, and filtering log files?   These questions yield fantastic answers for fruitful discussion as you plan or extend your web analytics implementation.

I’ve heard the following “things” about page tags:

  • It’s easy to page tag your sites.  That really depends on the technologies used to build the site.  A static site with a few simple pages is a different animal than a beast of a database-driven site with millions of complicated pages glued together using different technologies. 
  • Page tags are more reliable.  Reliable in what way, I ask?  Page tags in a hosted environment are processed by a data center many miles away.  I have no idea of the availability of the servers to support the load they are receiving from all the customers, nor do I have any insight into the raw data collected by the page tags, errors, and so on. 
  • Robots and spiders are always removed from the metrics.  Page tags do a better job than log file analyzers out-of-the-gate and especially if you don’t maintain your filtering, but I don’t really think page tags are infallible at all.  Bots and spiders get through.
  • You don’t have to deal with IT when page tagging.  It’s just a global include, right?  You just give IT the code, and they include it.  For the most part that’s true, but IT has processes and procedures and your tagging needs to be QA’ed and perhaps even put through a “versioning” process.   Regardless, you’ll need to prep the corporation and your friends in IT for the big page tagging effort.
  • Change management is easy with page tags.  If you only use the vendor’s out-of-the-box page tag and have a global include, it’s somewhat easy to manage change.  But in my experience, a web analytics implementation requires using the tag in different ways across the site.  You’ll need to add attributes to the tag or script values into the tag.   In order to manage change, you’ll need to follow corporate processes.  In the case of web analytics page tagging, you may need to create those processes before you manage change… and process creation in corporate environments is sometimes not easy.  You may even need to follow a process for doing so.  :)

The biggest challenges I’ve had with page tagging include:

  • Ensuring complete code coverage across all pages.  The number of pages in your site, the way the site is built, and the technology used to build your site all need to be carefully assessed prior to beginning the page tagging effort.  The larger the portfolio of sites, the more difficult it will be to tag all of your pages.  Other challenges include the number of domains and subdomains, the technology standards used across your portfolio of sites, and whether, in more controlled environments, the web analytics team actually has the positional power to influence the change management process.
  • Determining a method for centralized tag management.  If you are running an internal solution, you’ll need page tag server(s) to collect the page tag data.  If you’re depending on your vendor’s servers, you may need to modify or update tags for various reasons (new campaigns and so on).  You may find new sites that need to be tagged.  New stakeholders may want special tags added to the site to support other purposes (such as multivariate testing).   How are you going to centrally manage all that?   
  • Orchestrating changes to tags across different site sections.  You may want to pass a new value in the tag or change something in the script on page X, but not page Y, or subdomain A, but not subdomain B.  A new campaign may need to be enabled.  You may have to update your tag to take advantage of new vendor functionality.  You’ll need a technology solution and process for centrally orchestrating and controlling tag changes.  Wielding a mighty CMS helps.
  • Reconciling tag metrics with log metrics and determining correct filtering.  It’s great fun to collect tag data then compare it to your log files to determine if all pages are being counted and how effectively you are filtering bots.  While looking at different data sources is a time consuming activity, certain business cases may demand it. 
  • Integrating tagging with a Content Management System.  Major sites use expensive CMS’ to create web sites.  The web analyst should work with the CMS team to build page tags into site pages.  Integration will help you with two points raised above: centralizing tag management and change management/orchestration.
  • Challenges with decodes and lookups using tags.  I’ve learned that it is difficult and in many cases impossible to decode a value or use lookup tables with a page tag.  For example, if I had a page that had a URI “/er45rw/e42f45erfwrq3r.html,” I can’t decode it to read “Web Analytics Blog” in my reports.  Or if it is possible to do simple decodes, I have to hardcode the decode on each page.   Hard coding on an evolving web site is never manageable over time.
  • Latency.  If the page partially loads without executing the javascript, or if the user clicks through the page before the javascript has fired, the page view won’t be counted.  You must test to make sure your tag is firing properly.
  • Javascript turned off.  If the browser doesn’t execute javascript, the tag won’t fire and the visitor won’t be measured.  What effect will that have on your numbers? 
  • Cookie issues.  Tags may set third-party cookies.  Privacy policies don’t like the third party cookie.  As Justin Cutroni points out in the comments his tool sets first party cookies by default.  That’s good.
  • DNS changes.  To prevent third party cookie issues, you can trick out the DNS with a CNAME entry.  Say hello to someone in IT called the DNS admin!
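
The coverage and reconciliation challenges above boil down to comparing two sets of page counts.  Here is a small Python sketch of that comparison.  It assumes you can export page views per URL from both your tag-based tool and your log file tool; the function name, the 10% tolerance, and the numbers are all invented for illustration.

```python
def reconcile(tag_views, log_views, tolerance=0.10):
    """Compare page views per URL from page tags against log files.

    Returns the pages seen in logs but never in tag data (likely untagged),
    and the pages whose counts differ by more than the tolerance, expressed
    as a fraction of the log count.
    """
    untagged = sorted(page for page in log_views if page not in tag_views)
    mismatched = {}
    for page, log_count in log_views.items():
        tag_count = tag_views.get(page)
        if tag_count is None or log_count == 0:
            continue
        gap = (log_count - tag_count) / log_count
        if abs(gap) > tolerance:
            mismatched[page] = round(gap, 2)
    return untagged, mismatched

# Hypothetical exports from the two data sources.
tag_views = {"/": 950, "/blog": 400}
log_views = {"/": 1000, "/blog": 800, "/old-page.html": 120}

untagged, mismatched = reconcile(tag_views, log_views)
```

In this made-up data, “/old-page.html” shows up as untagged, and “/blog” shows a gap large enough to suggest either unfiltered bots in the logs or a tag that is not firing on every view.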

In discussing the page tag in such a manner, you may think I don’t like the page tag.  Not true at all!  The page tag is a very useful data collection method in context.  I like it very much, especially in hybrid data collection.  Like any technology, just make sure you understand the implications of your data collection method on your implementation.   

page_tags.gif
Courtesy of Zeus.com.

Part 4: Spider and Bot Capability Maturity Framework for Web Analytics 2.0

Let’s conclude this series on the spiders and bots of web analytics 2.0 with a framework for ranking the spider and bot capability maturity of a web analytics practice!

Before I do so, let me mention a few things:

  • How/if you detect spiders and bots depends on your data collection methods.  It is thought that page tags exclude spiders and bots because they don’t execute javascript or request images. Until they do. Log file data must be examined and filtered (either programmatically, manually, or both), and those filters need to be maintained by a web analytics professional.
  • Bots are used for many purposes.  From inflating metrics, to indexing text for search, to chatting, to spamming, to data mining, to site performance monitoring, to click fraud, and more.
  • Bots are constantly evolving. Just like web analysis. :)
  • Not all bots are bad news.  You want bots crawling your site.  Some bots are good and helpful for your online business. Imagine if you blocked Googlebot.  Segment the robotic traffic into separate reporting (in the best case) and make sure it’s filtered from your externally reported numbers.
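
As a rough picture of what “segment the robotic traffic” can mean, here is a Python sketch that buckets requests by user agent.  The token lists are my own simplification, not an industry list, and a bot that sends a browser-like user agent will land in the “human” bucket, which is exactly why the lists need maintenance.

```python
from collections import Counter

# Simplified, hypothetical token lists; a real list needs regular upkeep.
KNOWN_SEARCH_BOTS = ("googlebot", "slurp")
GENERIC_BOT_TOKENS = ("bot", "crawler", "spider")

def segment(user_agent):
    """Place a user agent string into a reporting segment."""
    agent = user_agent.lower()
    if any(token in agent for token in KNOWN_SEARCH_BOTS):
        return "search bot"
    if any(token in agent for token in GENERIC_BOT_TOKENS):
        return "other bot"
    return "human"

def requests_by_segment(user_agents):
    """Count requests per segment so bots get their own report."""
    return Counter(segment(agent) for agent in user_agents)

# Example user agent strings; the scraper name is invented.
agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "ExampleScraperBot/0.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Firefox/2.0",
]

segments = requests_by_segment(agents)
```

The “human” count feeds your externally reported numbers, while the two bot counts go to a separate report for SEO and operations.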

Now without further ado, here’s a web analytics 2.0 framework for ranking your spider and bot capability maturity:

  • Red.  You are in the red if the following applies:
    • I don’t understand the impact of spider and bot traffic nor do I measure, filter, or segment it at all. 
    • If I run an in-house solution, I haven’t updated my list of filters and exclusions since I deployed it. 
    • If I run a hosted solution, I rely solely on my vendor to control all aspects of bot filtering and data purification.
    • I don’t know when or how the Googlebot hits my site, but it sure does! 
    • I’ve never thought of reporting spider traffic for SEO.
    • I think the words “yahoo” and “slurp” refer to delicious soup.
    • Spiders scare me.
    • All bots are bad (no, they aren’t!)
  • Yellow. You are moving out of the red zone, into the yellow if:
    • I occasionally look for suspicious traffic. 
    • If I run an in-house solution, I have occasionally updated my list of filters and exclusions since I deployed it. 
    • If I run a hosted solution, I rely solely on my vendor to control all aspects of bot filtering and data purification, but I ask for verifications of compliance with industry standards.
    • I’ve heard of the Googlebot and maybe I use or I am thinking about using Site Maps (do it!). 
    • I know about Yahoo! Slurp.
    • It would be cool and useful to report on spider and bot traffic and learn which bots are good and bad. 
  • Blue.  You are in the blue if at least every month or on an ad hoc basis:
    • I work with my vendor or in-house team to recognize and remove spider and bot traffic, conforming to industry standards
    • I do a monthly update of my filter and exclusions list, or I know my vendor does.
    • I capture bot traffic in my server log files, but I may or may not report on it.
    • I know all about the Googlebot, and I use Site Maps. 
    • I realize the Googlebot and Yahoo! Slurp behave differently on my site, but I don’t know exactly how.
    • I know which bots are good (like search bots) and bad (like content scrapers).
  • Green.  You are in the green if:
    • I have established a process for regularly removing spider and bot traffic and for keeping my lists of exclusions and filters up-to-date, or I am *absolutely certain* my vendor has such processes.
    • I am in compliance with industry standards.
    • I measure spider and bot traffic and segment it into distinct reporting separate from my human traffic reporting.
    • My SEOers love me, the web analyst, and I am involved in educating people in my company about bot traffic from IT to managers to consumers of reporting and analysis.
    • I know that there’s more than one Googlebot, and I know how its crawl differs from Yahoo! Slurp (hint: Slurp visits more).
    • I know detailed metrics like the “total time online” of all my bots. ;)

What I’ve covered in this four-part series only covers a little bit of what I know and what there is to know about spiders, bots, and crawlers and how they affect web analytics.   I’d enjoy hearing other experiences or opinions, so please share your comments if you feel like it.  Until next time, fine readers.  Thanks for visiting!

sengine1.gif

Part 3: Web Analytics Quality through Spider and Bot Recognition and Filtering

Welcome to Part 3: Web Analytics Quality through Spider and Bot Reconciliation. My goal with this series of posts was to open discussion about what Alex Langshur of PublicInsite called, in reference to these posts: the “quality of web analytics.”  I’m talking about doing the most and the best we can with current technology to ensure quality and integrity in web analytics data collection and reporting - like I know we all do now, right! :)

It’s about doing the best we can at our jobs because we’re lucky to have them - at least that’s how I feel about my “real” job in web analytics.

To begin detecting bots and hunting spiders, here’s what I recommend:

  • Meet industry standards.  The Interactive Advertising Bureau publishes a list of spiders, bots and user agents that member companies have confirmed are robotic.  You can add these patterns to your list of IPs and agents to exclude.  A question to be asked is whether your vendor subscribes to this list, and thus makes your analytics implementation compliant with the IAB.
  • Use public resources.  Peter Adams commented on an excellent resource in Part 1: Gary Keith’s Browscap project is a must-have component of any web analytics / spider detection effort. A simple Google search reveals a list of resources to leverage.
  • Use hybrid data collection.  You can exploit the non-javascript executing bot busting power of page tags, and process log files to detect how spiders and bots crawl your site.  That’s hybrid.  You’ll need a tool that enables you to carve out spider and bot traffic from human traffic (certain tools do this).  If you run the logs of a site you’ve page tagged, you can compare the traffic from both sources and look for identical traffic that has unusual patterns too.
  • Identify and then filter unusual traffic patterns. Query your database and/or look at your reporting to identify and filter IP addresses and user agents or combinations thereof that view a total number of pages that you find suspicious.   Look for continuous visits repeating continuously!  Check out paths that have over N page views, where N is an arbitrary number larger than the “average pages viewed per visitor” on your site.  What’s requesting your robots.txt?  What do the attributes of the traffic coming from Africa, the Russian Federation, India, and China look like?
  • Talk to your engineers.  They are really smart and know lots of things about the web.  They are your allies in web analysis.  As web analytics practitioner, programmer, and all around smart dude, Steve McInerney mentioned: “…is the biggest giveaway I’ve found. Bots don’t request images/CSS etc as much as people do. Some tricky HTTP headers to force browsers to reload, say certain images or CSS, can go a long way towards helping identify bots - who won’t 304 the image or css.”  Your CMO may not understand what Steve is saying, but your CTO will.  And Steve may only understand when I say: Aussie! Aussie! Aussie!
  • Keep your filter lists up to date. Obviously, doh! You’ll need to maintain your lists of exclusions.  Someone needs to be dedicated to doing this process-based activity.
  • Identify the “spiders and bots” metrics on your site and compare to suspicious traffic.  I know the average number of pages viewed by a spider, the average visit duration per bot, and how other key metrics relate to spiders.  I also know this for humans.  When I am suspicious about traffic, I can compare attributes of visit or visitor to my bot and human profiles.
  • Examine user agents.  Look for irregular, unusual, and just plain weird user agents.
  • Set up a tarpit.  A tarpit is a server to which you redirect bot traffic and serve those bots random binary objects, very slowly, until they stop crawling.  When the bot makers check on the results, they find they have nothing and need to recrawl.  The idea here is slowing bot evolution.  If you block the IPs, the bot makers will know you’ve found them and adjust their tactics to make their bots less detectable.
  • Remember that not all bots are bad, so don’t prevent the good bots from crawling your site!  The Googlebot and Yahoo!Slurp index your site’s content so that it is retrievable when people query their search engine.  Don’t exclude bots using robots.txt or a tarpit unless you know that bot is nefarious or should be excluded.  In practice, I rarely block bots from visiting a site.  Rather I segment robotic traffic out of the numbers I report about site behavior and performance.
  • Manage expectations.  The truth is you can’t filter what you haven’t seen before.  You can’t predict the future of new bot behavior.  No matter what, bots will get through your data collection methods.  As an analyst you need to actively manage expectations by educating your user community, clients, and colleagues.  Soon enough you’ll have associates who are savvy human filters letting you know when bots slip through.
  • Create a business process for spider and bot reconciliation.  The savvy web analytics team has sustainable, manageable, and repeatable processes for scheduled detection, filtering, removal, quality assurance, and exclusion list maintenance.
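
Several of the checks above (page views over N, who is requesting robots.txt, and Steve’s point about bots not fetching images or CSS) can be combined into one pass over parsed log records.  The Python below is a sketch under my own assumptions: the record layout, the threshold, and the crawler name are hypothetical, and each rule is only a heuristic, since a human with a warm browser cache may also request no images.

```python
from collections import defaultdict

ASSET_EXTENSIONS = (".gif", ".jpg", ".png", ".css", ".js")

def suspicious_clients(records, max_pages):
    """Flag (ip, agent) pairs whose behavior looks robotic.

    Each record is a dict with "ip", "agent", and "path" keys.
    Returns a dict mapping each flagged pair to the list of reasons.
    """
    stats = defaultdict(lambda: {"pages": 0, "assets": 0, "robots_txt": False})
    for record in records:
        client = stats[(record["ip"], record["agent"])]
        path = record["path"].lower()
        if path == "/robots.txt":
            client["robots_txt"] = True
        elif path.endswith(ASSET_EXTENSIONS):
            client["assets"] += 1
        else:
            client["pages"] += 1

    flagged = {}
    for key, client in stats.items():
        reasons = []
        if client["robots_txt"]:
            reasons.append("requested robots.txt")
        if client["pages"] > max_pages:
            reasons.append("page views above threshold")
        if client["pages"] > 0 and client["assets"] == 0:
            reasons.append("no images or CSS requested")
        if reasons:
            flagged[key] = reasons
    return flagged

# Hypothetical parsed records: one browser-like client, one crawler-like client.
human = {"ip": "10.0.0.1", "agent": "Mozilla/5.0 Firefox/2.0"}
crawler = {"ip": "10.0.0.2", "agent": "ExampleCrawler/1.0"}
records = (
    [dict(human, path=p) for p in ("/", "/logo.gif", "/site.css", "/blog", "/photo.jpg")]
    + [dict(crawler, path=p) for p in ("/robots.txt", "/", "/a", "/b", "/c", "/d")]
)

flagged = suspicious_clients(records, max_pages=3)
```

Anything this flags is a candidate for review, not an automatic exclusion; compare it against your bot and human profiles before adding it to a filter list.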

Happy Hunting!  Click here for Part 4, my “Spider and Bot Capability Maturity Framework.” Thanks for visiting!

animated-search-engine_spider.gif

Image from link2city.com.