Web Analytics Blogs

Judah Phillips is an experienced web analytics practitioner and Internet expert currently working as a Senior Director at a large, global Internet company. His blog is full of useful, unbiased, actionable insights learned from the real-world practice of a process-oriented, integrated approach to strategic Web Analytics for improving business performance.


Archive for 'Methodology'


Part 2: Why Companies Switch Web Analytics Tools

As I explained in Part 1 of this lengthy blog post, companies switch web analytics tools due to a combination of issues related to:

Let’s tackle some thoughts on the last three (Customer Service and Support, Fulfillment, The Company Itself):

  • Customer Service (and the post-hoc judgement of it)
    • Post sales:
      • Unresponsive or inadequate customer support and professional services.  So many items fall into this category: missed phone calls and unreturned emails, boilerplate or unfocused proposals, generalized answers from support and other forms of poor communication, poor follow-up, failures to deliver on service level agreements, and vendors not spending as much time talking to clients after the first year as they do during it (as Benry pointed out in his comment to Part 1).  All of this makes a company begin thinking its money and time may be better spent elsewhere…
    • Pre-sales: 
      • Sell them what they want, make them buy what they need later.  In this bucket fall items that cause upsetting realizations that lead to talk about “the switch” well after the sales process ends, from inadequate professional services implementations to the need to buy another one of the vendor’s products to store all of your data or segment it…
      • Smoke and mirrors.  Pilots and sales demos look slick, but the results of your initial implementation may differ greatly.  After all, your team needs to learn a new technology.  When your team realizes that it will take X months and X hours of dedicated resources to create the same type of functionality you saw in a simple, canned pilot, then the seed for switching is sown.  Keep in mind, vendors tune their pilots for performance and to exhibit the best features they have.  They sell you an executable, documentation, and professional services, not the pilot (no matter what they say… otherwise, get it in writing).
  • Lack of fulfillment:
    • Implementation problems.  Tagging all of your pages or processing your log files is easier to talk about in meetings than to do in real life.  If you’ve spent two years trying to tag your pages across all of your company’s domains, you may just get frustrated and move on to another solution.
    • Complexity.  xys_435, this=12r, pass this variable, make this request, do this, do that.  Companies without sufficient in-house expertise and dedicated resources may get sick of the esoteric, proprietary nature of a tool and not want to hire vendor professional services, so they look for technology that is easier to wield using internal skill sets.  Couple that with an acute inability to extend the data model, to use APIs, to access the data in an open database, to decode, and to use lookup tables, and voilà, the “switcheroo” gets brought up.
    • Not customizable enough to solve business challenges.  Can’t save reports, add metrics on the fly, use whatever filter you want, extend the data model to include custom business dimensions?  Or you can do it, but it costs too much?  Dang.  Maybe it’s time to take a look at what current technology offers?   
    • Limited ability to integrate cross-channel data.  Isolated silos of data are not good.  Integration is necessary for realizing insights across online and offline channels.  Resistance to this fact is futile.  Products that resist this fact won’t last for long in companies that realize it.
  • The Company Itself:
    • New or changing goals.  As a company learns how to “do” web analytics, they may realize the tool is incapable of meeting new goals and move on.
    • Current tool doesn’t fit into business process. As companies engage in Business Process Management (BPM) and Master Data Management (MDM) initiatives and develop service-oriented architectures (SOA), web analytics tools need to fit in.  A company may judge their current tool doesn’t support integration with X or enablement with Y, then begin looking for a new vendor.
    • Organizational alignment. If a company isn’t organized in a way that promotes the usage of the analytics tool to achieve business goals, or if the deployment of the tool is not managed by a business person who ensures technical development achieves business goals, then the true value of the analytics technology may never be achieved. Business owners will move on and demand a new web analytics tool.
    • Inability to meet internal service level agreements. If the system doesn’t enable the team to quickly adapt to evolving business demands related to internet measurement, SLAs may slip, which leads to thinking about technology change.
    • New management.  Change comes with new management.  Always.  Inevitably.  Sometimes that change comes from standardizing on a tool that management has had success with in the past. 
    • Limited resources to support tool.  Given a hard to identify ROI, the company may limit resources.  As a result the tool withers on the vine and gets replaced.
    • No EXPERTISE.  With a dearth of qualified, experienced web analysts, high salaries, and high demand, a company may just not be able to find the expertise needed to maintain the current tool, and so it moves to something simpler.

The good news is 38% of “best in class” companies haven’t switched (yet), so some vendors are doing it right for some clients.  And 570 Fortune 1000 companies have no idea what I’m talking about yet at all, so that bodes well for you, the analyst or consultant, and you, the vendor, and us, the industry.

If you have switched and feel like sharing why, please do!  Thanks for reading!

whyswitchgreatestsoftwareonearth.gif

Part 1: Why Companies Switch Web Analytics Tools

Switching web analytics tools is fairly common (see Part 2).  A recent survey by analyst John Lovett (now at Jupiter) suggests that 62% of “best-in-class” companies are currently on their second or third web analytics vendor platform.  An update of Eric Peterson’s Vendor Discovery Tool, due out any day now, indicates that only 43% of the Fortune 1000 have even deployed web analytics technology.  That means only 430 Fortune 1000 companies have deployed web analytics tools.  If Fortune 1000 companies are considered “best-in-class,” then, ouch, web analytics vendors could really improve customer satisfaction and retention!  My friend Robbin Steif over at LunaMetrics recently asked “why?”  Why do companies switch web analytics tools so often?

The answer is that every company reaches a threshold moment of judgment about a particular technology - the instant the decision is made to either keep the technology or “move on” and begin searching for new technology to replace the old.  This day of reckoning results from a combination of issues related to:

The same goes for web analytics tools.  Challenges related to these five items fester until they cause a company to begin thinking about switching.  For example (the list continues in Part 2):

  • Price:
    • Licensing or Subscription Costs.  Whether you’re running an in-house or hosted web analytics environment, it costs money.  Companies care more about things like capital budgeting, debt service, and cash flow than they do about web analytics tools.  In times of limited resources or when the tool isn’t producing the desired return for a given cost, change may occur.
    • Per seat costs.  Every person who accesses the web analytics application may be thought of as having “a seat.”  How many seats do you need for your team?  Your company?  The answer really depends on the scale of your implementation.  Seats aren’t inexpensive, and if you haven’t considered how many people in your organization need access to the tool, you may soon find you can’t scale your implementation without significant revenue expenditures.  Per seat costs are generally factored into overall costs (sometimes as a line item).
    • Maintenance or upgrading costs.  Version upgrades or closely-related new products may not be “free.”  And that may make companies that already feel limited with their current implementation very “unhappy” and cause “the switch.”
    • Costs for specific features.  Vendors’ basic pricing models often limit access to data filtering and drill-down features.  Slicing data across more than one dimension may require additional components or entirely new products, which cost money.  Some companies don’t dig deep enough during due diligence, only to buy a tool that they know can show “keywords per page” but that hasn’t been configured to do so or costs additional money to do so.
  • Features:
    • Limited features for custom reporting, segmentation, and ad/post hoc analysis by the business user.  Similar to the bullet above, vendors may offer a simple reporting tool with a limited number of ways to filter, segment, or correlate data.  To do more, the company may need to purchase additional features or completely new products not in the budget.  To lower cost, the company starts looking elsewhere for a comparable service. For example, time spent metrics are easy to calculate.  They’ve been in log files for years.  But don’t assume every page tag vendor has the ability to calculate these metrics out-of-the-box.
    • Inadequate data storage. Companies may not have asked “how long is my raw data stored?” or whether the data is discarded after a certain period of time.  It’s a real bummer when you want to backtest and realize you don’t have the data you need. Take URL length, for example: some vendors may trim your URLs unless you ask them not to.  You won’t know that until you realize it’s a problem.  At that point, not having historical data to test may cause managers to question the tool’s capabilities and to start thinking about the switch.
    • Limited data model, lookups, and decodes.  If the company wants deeper insights across more dimensions, the current product may have a limited data model that isn’t extensible to your business.  Decoding parameters or using lookup tables might not be possible.  The model may not enable you to track and report on things like events, spiders/bots, IP addresses, and file downloads.  If you can’t extend the schema, you limit your ability to use custom business dimensions already existing in your data warehouse or in other applications.
    • Closed systems with no APIs or methods for rewriting data.  Having no API or methods beyond simple file exports for getting data out of the system keeps companies from achieving a 360 view of online customer experience.  If you want to do cross-channel marketing merging print and online, a simple reporting tool will not suffice (no matter how pretty the reports and trendlines). Companies that wake up to this fact switch.
    • Data not separate from presentation.  To integrate data across systems you need to have access to it.  For such needs, data is best kept in an open database, not available only via a web interface or on a “pay-to-use” basis.  Companies that wake up to this fact switch.

This list continues in Part 2.

whyswitchgreatestsoftwareonearth.gif

Web Analytics and the Normal Distribution: More on Statistics and Web Data

Is web analytics data normally distributed?  That question calls for another question: what web analytics variables are you measuring?  That matters.  Numeric random variables (let’s call them data) are classified into the following types:

  • Discrete.  That means you count it.  The data arrives from a counting process.  In web analytics discrete random variables are counts of things like page views, visits, and unique visitors
  • Continuous.  That means you measure it.  The data arrives from a measurement process.  In web analytics continuous random variables are time-based metrics.

We do both in web analytics, don’t we?  We count some things.  We measure some stuff.  And if we’re smart and have the autonomy and positional power to do so, we apply process to counting and measuring web analytics data. 

We often talk about “counting” and “measuring” like they are the same activities.  In general day-to-day online business, that’s no big deal for conceptual conversations.  But in statistics, “counting” is different than “measuring.” 

Both discrete and continuous variables may be represented by probability distributions to assess the likelihood of an outcome.  To identify probability for discrete variables, you can use a discrete distribution such as the “binomial distribution.”  Binomial distributions take into account the probability that an outcome will occur, so you may see some skewing when plotting the data that may make it look a bit “long tail.”

For continuous random variables, use the “normal distribution.”  Realize your data won’t always look exactly like a bell curve.  If it looks really different and “long tail,” you may be looking at a discrete variable better suited for a binomial distribution.

So is web analytics data “normally distributed?”  The answer is that it depends on the type of data.  Even then, the answer is “probably not.”  In fact, most business data doesn’t follow a perfectly normal distribution.  Yet every day in the halls of academia, very book smart people teach statistics and tell you to apply it to business data.  Are they wrong?  Insane?  Misguided?

No, they aren’t (well, maybe you have to be slightly insane to teach stats).  Academics realize that most distributions are not normal and do not have equal measures of central tendency (i.e., mode, median, mean).  Skewness abounds!  The normal distribution, however, can be used to approximate “real-world” distributions that have different measures of central tendency.

A theorem called the “central limit theorem” states, roughly, that the sum (or mean) of a large number of independent, identically distributed random variables with finite variance will be approximately normally distributed (i.e., following a normal or Gaussian distribution).  In other words, as the sample gets larger, the distribution of the sample mean can be approximated by the normal distribution, even when the underlying data is skewed.  And if I remember correctly, a common rule of thumb is that with a sample size of at least thirty, the sampling distribution of the mean will be approximately normal (heavily skewed data may need more).  Fortunately, we web analysts often have millions of data points to use…
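The central limit theorem is easy to see in a small simulation.  The sketch below is only an illustration: the “visit durations” are synthetic values drawn from an exponential distribution (an assumption chosen because it is clearly right-skewed, not real web data), and the function names are hypothetical.  It compares the skewness of the raw values with the skewness of the means of samples of thirty.

```python
import random
import statistics


def skewness(values):
    """Population skewness: the third standardized moment."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def draw_duration(rng):
    # Synthetic visit duration in seconds with mean 120 (assumed, right-skewed).
    return rng.expovariate(1 / 120)


def sample_means(draw, sample_size, n_samples, rng):
    """Draw n_samples samples of sample_size values and return each sample's mean."""
    return [
        statistics.fmean([draw(rng) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable
raw = [draw_duration(rng) for _ in range(20_000)]
means_30 = sample_means(draw_duration, 30, 5_000, rng)

raw_skew = skewness(raw)
means_skew = skewness(means_30)
print(f"skewness of raw durations:      {raw_skew:.2f}")
print(f"skewness of sample means (n=30): {means_skew:.2f}")
```

In theory an exponential distribution has a skewness of 2, while the mean of thirty such values has a skewness of about 0.37, so the second number should come out much closer to zero than the first.  The means are closer to normal, not perfectly normal, which is why the rule of thirty is a rule of thumb.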

Some time ago I actually took average visit duration for one site for which I have real data for thousands of visits and did a Lilliefors Test of Normality.  The test found no evidence that the data wasn’t normally distributed even though it looked a bit odd and the skewness was 0.741426 and the kurtosis was 4.1525665. 
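For readers who want to compute skewness and kurtosis themselves, here is a minimal sketch.  The durations are made up for illustration, the function name is hypothetical, and the formulas are the plain population moments; the post does not say which convention produced the figures above, so treat the choice as an assumption.  Under this convention a normal distribution has a skewness of 0 and a kurtosis of 3.

```python
import statistics


def moments(values):
    """Return (skewness, kurtosis) using population (biased) moment formulas.

    Kurtosis here is the plain fourth standardized moment, so a normal
    distribution scores 3 (the "excess" convention would subtract 3).
    """
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    z = [(v - mean) / sd for v in values]
    skew = sum(x ** 3 for x in z) / n
    kurt = sum(x ** 4 for x in z) / n
    return skew, kurt


# Hypothetical visit durations in seconds (made up for illustration).
durations = [12, 35, 48, 60, 75, 90, 110, 140, 200, 420]
skew, kurt = moments(durations)
print(f"skewness = {skew:.3f}, kurtosis = {kurt:.3f}")
```

A formal test such as Lilliefors goes further than these two summary numbers; statistics libraries such as statsmodels provide one, so there is no need to write it by hand.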

If you’re thinking about applying statistics to web data, make sure you identify whether the data you are looking at is discrete or continuous.  Don’t abandon the normal distribution for certain types of web analytics data just because it doesn’t exactly look like the Liberty Bell.  Test it for normality before applying Gaussian statistics.  If the data is highly skewed, determine whether the level of error is within acceptable limits.  Look at using other distributions for discrete variables.

normaldist.gif

Image from http://www.weibull.com/

Web Analytics, Keywords, and a Question Someone Asked Me…

Web analytics and keyword metrics came up in a conversation I had last evening with a friend of mine from my days in “information retrieval” - when Googol was a really, really large number, and we called keywords “queries…”  Over a Belgian beer (a Cantillon), I was asked to name “the top couple of metrics I’d want to know about a set of existing keywords if I were selecting a few to continue to optimize or buy.”

I told him that any keyword-related metric should be analyzed within the context of campaign objectives, which in order to be measured and reported need to be defined before the campaign begins.   Macro level campaign goals should be identified before performing micro-level keyword analysis.  Once campaign goals are known, analysis can focus on achieving the optimal keyword mix to fulfill them.  A single, keyword-related metric should rarely be taken as a stand-alone indicator of performance. 

Here’s a synopsis of what metrics I told him I think are useful to examine when performing keyword analysis:

  • Referrers.  At a basic level, identifying the sites that sent keyword traffic is common sense (like not excluding the Googlebot ;).  You may uncover keywords for which your site’s content “accidentally ranks” on a particular engine.  These rankings may not be immediately obvious from a straight list of top-performing keywords.  By digging deeper into keyword referrers, you may find sites like these: forex-cash-fast.info, gambling1×2.com, nhadep.net, nghenhac.com, and xn--q2yr34f.com.  Clickfraud?  Poor targeting by an engine?  Lost money?  So many questions can be asked from keyword referrers!
  • Geography. Show me my keywords segmented by dimensions like Continent, Country, City, Zip Code to assist in planning geo-targeted campaigns and identifying the broad content themes that appeal to the geographic long tail.
  • Number of Visits and Percentage of Total Site Visits.  Raw visit and percentage totals indicate the “reach” of the keyword - the degree to which a keyword has penetrated a target audience.  I could compare the number of visits to the number of searches for that keyword using Overture’s Keyword Selector Tool to assess reach and to check whether the cost to buy or the effort to optimize the keyword is achieving the desired effect.
  • Average Visit Duration.  It’s not an engagement metric, but average visit duration does tell you whether or not the visitor remained on your site and if so for how long.  It can be useful when taken into context with the page-view to visit ratio and segmented by other dimensions, such as conversion rate.  
  • Page View to Visit Ratio.  One of my favorite metrics on a per keyword basis is the view:visit ratio.  This ratio identifies the average number of pages viewed per visit for that keyword.  If your keyword should convert the visitor from the landing page, and you are seeing a page view to visit ratio greater than one, what’s up?  If you’re trying to persuade visitors to enter some sort of non-linear or linear, multistep funnel leading to a conversion, and your page-view to visit ratio is one, what’s up?
  • Bounce rate.  A key metric that identifies what percentage of visitors enter the site on the keyword’s landing page and immediately leave.   If your bounce rate for a keyword is over 35% and you are targeting that keyword, you should think about landing page optimization.
  • Conversion rate. Conversion rate is the percentage of visitors referred by the keyword who succeeded in completing a pre-identified, value generating event on the site, such as a purchase or registration.  Conversion rates measure how well the keyword acted as a trigger for driving on-site revenue.  By segmenting your keywords based on conversion rate or other dimensions, you may notice broad content themes that drive on-site success events.  These themes could be used in persuasive messaging that includes hyperlinked points of resolution moving visitors into the non-linear conversion funnel.
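
Several of the metrics above can be derived from the same visit-level rows.  The sketch below is a hypothetical illustration rather than any vendor’s calculation: the records are made up, a bounce is assumed to be a single-page visit, and rates are computed per visit rather than per unique visitor.

```python
from collections import defaultdict

# Hypothetical visit records: (referring keyword, pages viewed, converted?).
visits = [
    ("belgian beer", 1, False),
    ("belgian beer", 4, True),
    ("belgian beer", 1, False),
    ("web analytics", 3, False),
    ("web analytics", 6, True),
    ("log file analysis", 1, False),
]


def keyword_metrics(visits):
    """Aggregate visit-level rows into per-keyword metrics."""
    totals = defaultdict(
        lambda: {"visits": 0, "page_views": 0, "bounces": 0, "conversions": 0}
    )
    for keyword, pages, converted in visits:
        t = totals[keyword]
        t["visits"] += 1
        t["page_views"] += pages
        t["bounces"] += pages == 1  # assumption: one-page visit = bounce
        t["conversions"] += converted
    all_visits = len(visits)
    return {
        kw: {
            "visits": t["visits"],
            "share_of_visits": t["visits"] / all_visits,
            "views_per_visit": t["page_views"] / t["visits"],
            "bounce_rate": t["bounces"] / t["visits"],
            "conversion_rate": t["conversions"] / t["visits"],
        }
        for kw, t in totals.items()
    }


metrics = keyword_metrics(visits)
# List keywords from highest to lowest conversion rate.
for kw, m in sorted(metrics.items(), key=lambda item: -item[1]["conversion_rate"]):
    print(
        f"{kw:20s} visits={m['visits']} views/visit={m['views_per_visit']:.2f} "
        f"bounce={m['bounce_rate']:.0%} conv={m['conversion_rate']:.0%}"
    )
```

Adding a field such as country or referrer to each row and grouping on it alongside the keyword is one simple way to segment these numbers further.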

Then I told him to “segment, segment, segment.” :-)

Many metrics and dimensions can be applied to the analysis of keywords beyond the few I listed above.  What metrics do you look at on a per keyword basis when planning search engine optimization efforts or when planning paid search campaigning?

organicsearch_keywords.jpg

unica_keywords2.bmp

Part 2: Your Web Analytics Data Quality May Stink and Here’s Why!

In Part 1, I began a long list of reasons why your web analytics data quality may stink.  I’m continuing the list below (make sure you read Part 1 for context and to view the entire list)

  • Storing only visit level data.  Many tools don’t have schemas that store raw data at the visitor level.  Instead they provide access to only visit level data.  For example, you may not be able to see all the page views during a single visit per IP address or cookied visitor.  Assess the impact of the vendor’s schema on your goals.  Companies that use analytics data to feed other systems or that want to use visitor attributes for content targeting, segmentation, optimization, or analysis may not be well-served by some vendor schemas.
  • Little to no decodes or lookups.  If you use numeric codes and non-human readable naming conventions in your data, they can pass through to your reporting and prevent your colleagues from understanding the reporting.  Strange codes look like hieroglyphics!  Decoding and looking up data can eliminate the problem of non-readability and strange numerical names in your reporting… While some would say this is a reporting issue, not a data issue,  I chose to include it because it’s at the surface… it’s the data your customers see.  Not all tools decode or lookup.  Some tools allow rewriting of data in the database.
  • Failure of key services supporting the application.  If you are dependent on page tags, synchronization software, web servers, databases, or any of the wondrous technology that makes it all work, failures are a real bummer.  Make sure you have monitoring and recovery processes in place so you don’t miss data!  When page tag collection fails (perhaps the page tag server went down, eh?), the data is gone forever.  If the web server fails, then no logs are written, but no pages are served either - so is traffic missed?  But if the processes supporting log file analysis fail (e.g., data sync), watch out!
  • Inadequate or incorrect implementation.  If you can’t cross dimensions (like finding out what keywords referred traffic to a page), filter all of your data (for example, filtering pages to see only those viewed by the iPhone), easily create new metrics, or if the numbers aren’t adding up, you may have not adequately or correctly implemented your software or communicated your requirements to your vendor’s professional services team. 
  • Limited, hard-to-extend data model. Powerful, actionable insights from web analytics are enabled by extending a data model to incorporate business specific dimensions.  For example, if every page has a category and an author, you may want to see a list of all the page views in that category or ranking of pages by most popular author.  To do that you may need to join data at the database level or take advantage of variables you pass in a page tag.  Various tools have different limits on if, how, and to what extent you can extend the data model.
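
To make the decode/lookup and data model points concrete, here is a minimal sketch.  The page codes, the lookup table, and the function names are all hypothetical; in practice the lookup would more likely live in a database table joined to the page view data.

```python
from collections import Counter

# Hypothetical lookup table: raw page codes mapped to readable names
# and to business dimensions (category, author).
PAGE_LOOKUP = {
    "p_101": {"title": "Why Companies Switch Tools", "category": "Methodology", "author": "Author A"},
    "p_102": {"title": "The Normal Distribution", "category": "Statistics", "author": "Author B"},
    "p_103": {"title": "Data Quality, Part 2", "category": "Methodology", "author": "Author B"},
}

# Raw page views as they might arrive from a tag or a log (made-up data).
raw_page_views = ["p_101", "p_102", "p_101", "p_103", "p_999", "p_102", "p_101"]


def decode(page_code):
    """Translate a raw code into readable attributes, flagging unknown codes."""
    unknown = {"title": page_code, "category": "(unknown)", "author": "(unknown)"}
    return PAGE_LOOKUP.get(page_code, unknown)


def page_views_by(dimension, page_codes):
    """Count page views grouped by a business dimension from the lookup."""
    return Counter(decode(code)[dimension] for code in page_codes)


by_category = page_views_by("category", raw_page_views)
by_author = page_views_by("author", raw_page_views)
print(by_category.most_common())
print(by_author.most_common())
```

Codes that are missing from the lookup are reported as “(unknown)” rather than dropped, which keeps gaps in the lookup visible in the reporting.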

So what do you do when you know your data quality is less than stellar?  Here’s some guidance:

  • Don’t worry, be happy. :-) Just by collecting the data you are collecting, you are doing better than a great majority of companies that do business on the Internet.  By asking questions about data and investigating the issues, you have a leg up on your competition.  Work on optimizing the data, expose flaws in site design or architecture that impede data collection, work with your vendor and seek help in the web analytics community if you run into real problems.  The Web Analytics Association’s Forum on Yahoo is a useful place for posting questions.  But whatever you do, stay positive and focused on solving your problems and making your web analytics practice more optimized.  Don’t get frustrated.
  • Recognize the limitations in the data and do not go gently into the night.  Ask the hard questions about sampling, schemas, data retention, processing, querying and reporting to understand where the holes and noise could be in your data.  Demand answers from your vendors and quick response times to your questions about data quality.  If your vendor is frustrating you by not being responsive, talk to the boss and the vendor’s bosses, escalate, escalate, escalate until you get resolution.
  • Understand the underlying elements of data collection and what can go wrong.  Learn about sessionization and why different tools and data collection methods have limitations.  Explore the more technical components of the backend, like the database and your web analytics schema - all your data is in one (or more)! Talk to your engineers.  Have them explain the technology in terms you understand.
  • Evaluate your tools.  Some tools are just better suited for particular business problems than other tools.  Log file tools enable you to constantly change assumptions and reprocess data.  Page tags provide a standard data collection and transport mechanism.
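
Sessionization is a good example of an assumption that changes your numbers.  The sketch below groups hits into visits using an inactivity timeout; the 30-minute default is a common convention rather than a rule, and the hits, visitor IDs, and function name are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed default; tools differ


def sessionize(hits, timeout=SESSION_TIMEOUT):
    """Group (visitor_id, timestamp) hits into visits for each visitor.

    A new visit starts when the gap since the visitor's previous hit
    exceeds the timeout.  Returns {visitor_id: [[timestamps], ...]}.
    """
    visits = {}
    for visitor_id, ts in sorted(hits, key=lambda hit: (hit[0], hit[1])):
        sessions = visits.setdefault(visitor_id, [])
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)  # still within the current visit
        else:
            sessions.append([ts])  # first hit, or the gap was too long
    return visits


# Made-up hits for two visitors.
hits = [
    ("visitor_a", datetime(2007, 10, 10, 9, 0)),
    ("visitor_b", datetime(2007, 10, 10, 9, 5)),
    ("visitor_a", datetime(2007, 10, 10, 9, 10)),
    ("visitor_a", datetime(2007, 10, 10, 10, 0)),
]

visits_30 = sessionize(hits)
visits_60 = sessionize(hits, timeout=timedelta(minutes=60))
print({visitor: len(s) for visitor, s in visits_30.items()})
print({visitor: len(s) for visitor, s in visits_60.items()})
```

Reprocessing the same hits with a 60-minute timeout merges visitor_a’s two visits into one, which is the kind of changed assumption that log file tools let you rerun against historical data.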

With hard work on your part, you can make your web analytics data smell like roses!  I know you can! :)

dataquality_renamed.jpg

Web Analytics ROI and Value Generation? The Three Core Actions

Why do we do web analytics?  What value does it generate?  What is the ROI from web analytics?  Companies trying to justify or in the process of allocating capital to “do web analytics” are wondering…  Recent research has shown that companies serious about web analytics need to invest money in people and technology.  But what do companies get in return for doing so?

I believe that web analytics helps online businesses identify potential opportunities for taking action to:

  • Increase revenue.  Web analytics helps you make more money.
  • Decrease cost.  Web analytics helps you spend less money to make more money.
  • Improve operations.  Web analytics helps you work smarter and more efficiently.

Let’s explore some of the ways web analytics can help online business across these three core actions:

Increased revenue through:                                                                                    

  • More targeted advertising sales.  Content is monetized in many ways, including CPM, CPA, CPL, PPC, and more.  Web analytics can tell you which of these methods for generating revenue, and which advertising campaigns using those channels, are performing most effectively.  External campaign effectiveness may be tracked using referrer data and related dimensional reporting.  Metrics related to internal campaigning, like microsites or special advertorial offerings, can be easily provided to advertisers and agencies to identify audience consistency and quality.
  • Better insights into audience segments to realize incremental revenue.  Segmentation refers to dividing a total population into groups based on one or more characteristics.  A good web analytics tool easily enables you to segment on dimensions and attributes relevant to your business.  Segmenting web data enables you to answer questions about which visitors visit when and with what frequency, depth, and duration, and more, which provides otherwise unknowable insights.  New incremental revenue streams may be realized by mapping newly discovered behavioral or demographic characteristics to existing advertiser or agency demand.
  • Creating effective online-marketing and editorial offerings.  Reports showing visit frequency, depth, recency, and the time periods when the online audience visits the site assist product managers, editors, and producers in optimizing, crafting, and targeting content and advertising, increasing reach and exposure time of advertiser messaging to key audience segments.
  • Ensuring pages effectively lead to conversion funnels. Metrics like bounce rate, conversion rate, clickstream pathing, and conversion metrics provide indications about how to modify or tailor pages to generate value.  Funnels can provide insights about which calls to action, content, pages, sections, and campaigns yield the best conversions.

Reduced costs by:

  • Increasing the effectiveness of online work products. By identifying, monitoring, and evaluating important KPIs (key performance indicators, and KKPIs!), the business learns what works and what doesn’t work online.  Web performance data has amazing utility for evaluating, planning, and monitoring current and future trends when assessing how to reduce cost in a portfolio of online products.
  • Maximizing site operations, content, and opportunities for organic and paid search.  The performance of pay-per-click, paid inclusion, and contextual advertising and linking campaigns may be audited to eliminate projects that fail to meet goals based on conversion, revenue, or KPIs.  By tying conversion to capital budgeting, online projects that fail to meet site hurdle rates may be tailored or eliminated.  The business can then better focus on driving profitable revenue without misallocating resources.
  • Optimizing user experience and information architecture.  Overhead-reducing tools like Google Website Optimizer and offerings from other companies providing site optimization services use web analytics data to programmatically alter a site to increase conversion and lift.  CMS automation can be driven off of web analytics data.
  • Pinpointing the performance of online marketing campaigns.  By creating custom KPIs, metrics, and segmented conversion rates, and by slicing data via custom filters and business-relevant dimensions, deep insights into online performance can be attained.  Misappropriated resources and efforts can be easily recognized and eliminated.
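
As a rough illustration of tying campaign results to a hurdle rate, the sketch below flags campaigns whose simple return falls short.  The campaign names and figures are made up, the 15% hurdle rate is an assumption, and simple ROI stands in for the fuller discounted cash flow analysis a real capital budgeting process would use.

```python
HURDLE_RATE = 0.15  # assumed minimum acceptable return, for illustration only

# Hypothetical campaigns with cost and attributed revenue.
campaigns = [
    {"name": "paid search A", "cost": 10_000.0, "revenue": 13_000.0},
    {"name": "paid inclusion B", "cost": 5_000.0, "revenue": 5_200.0},
    {"name": "contextual ads C", "cost": 8_000.0, "revenue": 7_000.0},
]


def roi(cost, revenue):
    """Simple return on investment: net gain divided by cost."""
    return (revenue - cost) / cost


def below_hurdle(campaigns, hurdle_rate):
    """Return the names of campaigns whose ROI is under the hurdle rate."""
    return [
        c["name"]
        for c in campaigns
        if roi(c["cost"], c["revenue"]) < hurdle_rate
    ]


for c in campaigns:
    print(f"{c['name']:18s} ROI = {roi(c['cost'], c['revenue']):+.1%}")
print("Review or eliminate:", below_hurdle(campaigns, HURDLE_RATE))
```

The attributed revenue is the part that web analytics supplies; the hurdle rate itself comes from finance.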

Improved operations via:

  • Deep understanding of site traffic, visitor activity, conversions, and online value-generation.  You can’t manage it if you don’t know about it. Companies most successful with web analytics dedicate a full-time staff to analyzing and contextualizing data and performance metrics from channels like organic and paid search, affiliate partnerships, and offline.  The best staff understands the impact of the web channel across the value chain.
  • Contextualizing strategic decisions with accurate data.  The ability for a corporation to gain insight and intelligence into its online activities provides management with transparency into performance.  Performance must be monitored to be improved, and there’s no other way to gain true insight into online performance than by using web analytics to guide web strategy.
  • Identifying site operational effectiveness in a timely manner.  When using log files, server errors and other impediments to online customer satisfaction can be quickly discovered and remediated, which reduces negative impact and minimizes risk.
  • Predicting the impact of business decisions on performance.  By applying statistical methods to web analytics data, businesses increase their ability to predict the impact of site changes on performance.

Every Internet business can benefit from technology that positively impacts these three important business actions.  I recommend that you consider how your projects are framed across these three actions, whether you’re just thinking about getting involved with web analytics, growing your web analytics practice, or already running an established web analytics practice at your company.

 phillipsroi.jpg
 

Web Analytics and Data Collection: Log File Analysis

Log file analysis in web analytics is misunderstood!  With so many options for data collection in web analytics, the log file has been somewhat disdained for many years.  In fact, I’d venture to guess that the majority of web analytics practitioners working today don’t “do log file analysis.”  That’s unfortunate.  One can learn a lot about web analytics from examining log files - for example, “total time online” (with all its inherent issues :).

Why is server-side data collection (log file analysis) less popular than client-side page tagging?  It’s not about bots, accuracy, or ease of deployment… that’s marketing spin, somewhat true and false depending on context.  It may be because most web analytics practitioners currently use a hosted model.

Page tagging enables ASP models for web analytics.  It would be very difficult (i.e. resource intensive and costly) for a web analytics ASP to maintain a competitive service level if the ASP relied on its clients and their IT teams to maintain server-side data collection: maintaining, transporting, and synching log files.

Imagine the ASP model with log files!  It’s nightmarish and makes little sense:  hundreds and thousands of customers with just as many (or more) different servers behind firewalls generating many different log file formats.  Then, at various intervals all customers sending uniquely formatted logs for the ASP to parse, index, and make available for querying and retrieval in close to real time.  Yikes!! 

Page tagging creates a standard data structure for the ASP and a manageable data transport mechanism while minimizing complexity and reducing infrastructure cost in a way that log files just can’t.  Page tags enable ASPs to meet rigorous service level agreements and contractual obligations for data processing in a way they probably couldn’t with log files.

So who does log file analysis for web analytics?  The answer is companies that have:

  • No desire to give up control of a web analytics infrastructure to young companies.  Web data is valuable and log files enable you to keep it all for however long and at whatever cost your business justifies.  You can process your data using whatever tool you want, and reprocess at your discretion.
  • Existing business dimensions and structured data already in their log files. Simple web analytics concepts like page views, visits, and “uniques” become more meaningful when sliced by relevant business objects from your company’s existing log data.  For example, if you’ve passed a unique parameter in your query strings for years, you can program your log file tool to recognize the parameter, then reprocess your existing data to make use of it in your reporting.
  • Deep data mining experience.  Data miners want data to mine, not static GUIs or “my data” delivered in summary extracts at an incremental cost.  Log files can be parsed and indexed into a database that you, the company or the client, actually own at the budget you actually set.
  • Knowledge of how to employ hybrid data collection.  Log files are complemented by page tags for instrumenting pages to track events and Web 2.0.
  • Complex integration requirements.  All the data exists on machines in your network.  Log files can be extended to incorporate new business dimensions.
  • A need to understand how spiders and bots behave on their site and the performance of their server(s).  Log files capture everything that happens on the web site.  The “kitchen_sink.html” is in there.  It’s up to you to customize the tool to meet the quality level you demand for filtered data.  As Mat and Steve mentioned in the comments, logs also enable you to view server errors, useful for site QA and debugging.
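
To make the query string example above concrete, here is a minimal Python sketch, not a description of any particular log file tool.  It assumes logs in the Apache/NCSA Combined Log Format; the parameter name "channel" and the sample lines are invented for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Pattern for the Apache/NCSA Combined Log Format (an assumption about your logs).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def page_views_by_dimension(lines, param):
    """Count successful (200) requests grouped by one query-string parameter."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None or match.group("status") != "200":
            continue  # skip unparseable lines and non-200 responses
        query = parse_qs(urlsplit(match.group("url")).query)
        value = query.get(param, ["(not set)"])[0]
        counts[value] += 1
    return counts

# Invented sample data; "channel" is a hypothetical business dimension.
SAMPLE_LINES = [
    '192.0.2.1 - - [10/Oct/2007:13:55:36 -0400] "GET /article.html?channel=sports HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '192.0.2.2 - - [10/Oct/2007:13:55:40 -0400] "GET /article.html?channel=news HTTP/1.1" 200 1500 "-" "Mozilla/5.0"',
    '192.0.2.1 - - [10/Oct/2007:13:56:01 -0400] "GET /story.html?channel=sports HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
    '192.0.2.3 - - [10/Oct/2007:13:56:09 -0400] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.4 - - [10/Oct/2007:13:56:15 -0400] "GET /missing.html?channel=news HTTP/1.1" 404 0 "-" "Mozilla/5.0"',
]

counts = page_views_by_dimension(SAMPLE_LINES, "channel")
for value, views in counts.most_common():
    print(value, views)
```

Because the raw logs are still on disk, the same function can be pointed at years of history the day someone decides the parameter matters.  Note that this sketch counts every 200 response as a page view; real reporting would also filter images, bots, and so on.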

I’m sure there are other reasons too….  I’ve heard people tell me that:

  • It’s hard to move log files.  That depends on how many log files you have, what technology you are using to move them, and how often.
  • Log files are difficult to parse. Not really, especially if they are standardized.  You just need a flexible tool.  If things get non-standard, you need a tool that enables you to define your own formatting.
  • Log files take up a lot of space.  Yes, they do. How much space do you have?  What are your company’s existing standards for parsing log files?  How long will you store them?  You may find out you have plenty of “space.”  You will never have enough time though. ;)
  • Log files aren’t as accurate as page tags.  Huh? That all depends on how well you filter and exclude pages.  Accurate compared to what? To unvalidated data existing on servers you don’t see in some data center thousands of kilometers/miles away?  Logs capture visits from browsers that don’t execute javascript and thus will not appear in the metrics from tag-based systems…
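
On the “difficult to parse” point, here is a rough Python sketch of what “define your own formatting” can look like.  It assumes logs in the W3C Extended Log File Format, where a "#Fields:" directive names the columns; the list of required fields and the sample lines are assumptions made up for illustration.

```python
# Fields the reporting is assumed to depend on (hypothetical choice).
REQUIRED_FIELDS = {"date", "time", "c-ip", "cs-uri-stem", "sc-status"}

def parse_w3c_log(lines, required=REQUIRED_FIELDS):
    """Yield one dict per entry, keyed by the most recent '#Fields:' directive."""
    fields = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The log describes its own columns, so the parser adapts to them.
            fields = line[len("#Fields:"):].split()
            missing = set(required) - set(fields)
            if missing:
                raise ValueError(f"log is missing required fields: {sorted(missing)}")
            continue
        if line.startswith("#"):
            continue  # other directives such as #Version or #Date
        if fields is None:
            raise ValueError("data line seen before any #Fields directive")
        values = line.split()
        if len(values) != len(fields):
            continue  # malformed line; a real tool would log and count these
        yield dict(zip(fields, values))

# Invented sample: the format changes partway through the file.
SAMPLE = """\
#Version: 1.0
#Fields: date time c-ip cs-method cs-uri-stem sc-status
2007-10-10 13:55:36 192.0.2.1 GET /index.html 200
2007-10-10 13:55:40 192.0.2.2 GET /about.html 404
#Fields: date time c-ip cs-uri-stem sc-status time-taken
2007-10-11 09:00:00 192.0.2.3 /index.html 200 15
""".splitlines()

records = list(parse_w3c_log(SAMPLE))
for record in records:
    print(record["date"], record["cs-uri-stem"], record["sc-status"])
```

The parser never hardcodes column positions, so a server that adds or reorders fields keeps working, while one that drops a field you depend on fails loudly instead of silently producing bad numbers.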

So, you might be thinking I am a big fan of log files.  I am a web analytics realist.  Challenges with log files include:

  • Standardizing on a common format.  If you have a web server, you are at risk for someone doing something that creates a custom log file format.  If you are the new web analyst coming into an existing company, you may find custom fields in the logs.  Develop and champion a company standard.  It will make your life as a web analyst somewhat easier.
  • Creating a single “core” configuration that can be extended to deal with one-off formats.  If you live in a world of custom log files, you can abstract a set of core fields, describe them, then extend the description when you deal with custom craziness.
  • Synchronizing/transporting log files.  The more servers you have, the more chances you have for failures in synchronization.  Application monitoring and recovery procedures are critical to preventing downtime.  Develop them.
  • Reprocessing log files.  Depending on your system, reprocessing existing log file data can have a myriad of effects - from data not being available, to data being partially available, to errors, to long waiting periods before the data is current and ready to query.
  • Receiving log files that are missing key fields or with key fields that have changed.   If you depend on systems, companies, and people that you do not control, fields can be stripped and formats changed in the blink of an eye.  The web team needs to detect and manage these events.
  • Monitoring the operational workflow.  Someone on the web analytics team needs to manage the system, maintain the overall availability and integrity of log file processing, and handle any issues that could result in the data not being synched or processed correctly.
  • Managing change.  Since all your “hits” are captured in the log file, you’ll need to tell your web analytics tool how to correctly process the logs and remove data you don’t want to report.  You’ll need to identify and maintain lists of filetype exclusions and the user agents/IPs of robotic traffic.
  • Finding staff that “gets it.” The rule is that success in web analytics results from having competent people who have experience understanding the risks and rewards of web analytics data collection and executing on a web analytics strategy that generates value. 
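
The “managing change” item above boils down to maintaining exclusion lists and applying them to every hit.  Here is a hedged Python sketch; the extensions, user agent tokens, IP address, and sample hits are all illustrative assumptions, not a recommended or complete list.

```python
import posixpath
from urllib.parse import urlsplit

# Illustrative exclusion lists; these are the lists you would maintain over time.
EXCLUDED_EXTENSIONS = {".gif", ".jpg", ".png", ".css", ".js", ".ico"}
ROBOT_AGENT_TOKENS = ("bot", "crawler", "spider", "slurp")
ROBOT_IPS = {"192.0.2.99"}  # address from a documentation range, not a real robot

def is_reportable(hit):
    """Return True if a parsed hit should count as a human page view."""
    path = urlsplit(hit["url"]).path
    extension = posixpath.splitext(path)[1].lower()
    if extension in EXCLUDED_EXTENSIONS:
        return False  # supporting file, not a page
    if hit["ip"] in ROBOT_IPS:
        return False  # known robotic IP
    agent = hit["agent"].lower()
    return not any(token in agent for token in ROBOT_AGENT_TOKENS)

# Invented hits, already parsed out of a log file into dicts.
hits = [
    {"ip": "192.0.2.1", "url": "/index.html", "agent": "Mozilla/5.0 (Windows)"},
    {"ip": "192.0.2.1", "url": "/images/logo.gif", "agent": "Mozilla/5.0 (Windows)"},
    {"ip": "192.0.2.7", "url": "/index.html",
     "agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"ip": "192.0.2.99", "url": "/products.html?id=3", "agent": "Mozilla/5.0 (Windows)"},
    {"ip": "192.0.2.5", "url": "/products.html?id=3", "agent": "Mozilla/5.0 (Macintosh)"},
]

page_views = [hit for hit in hits if is_reportable(hit)]
print(len(hits), "hits,", len(page_views), "reportable page views")
```

Simple substring matching on user agents is crude and will miss bots that pretend to be browsers, which is exactly why the lists need an owner.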

As a web analyst you need to use the tools at your disposal to build your web analytics practice.  Log files are part of your toolkit.  Be wary of people who try to sway you one way or the other when it comes to web analytics data collection.  The method you ultimately use should be based on the data you want to collect in the context of the company’s goals, budget, and expertise.

log_files.gif

Web Analytics and Data Collection: The Page Tag

Many methods exist for collecting different types of web analytics data, some much more accurate and useful than others: page tags, log file analyzers, packet sniffers, audience panels, and toolbars.  At this point in time, the page tag certainly seems to be the preferred method adopted by the web analytics industry for data collection.  It’s also core to other technologies, like multivariate testing.

Various sources will tell you a multitude of “things” over time about page tags, depending on whether they think you are a noob or an advanced practitioner.  There’s lots of useful information from many sources about page tags.  Still, I find a lot of the average discussion about the utility of page tags to be somewhat true and false, depending on context.  For example, if you’ve never tagged pages, but you have processed log files, does it make sense to entirely throw away legacy processes?  Or if you’ve mastered change management across all pages on your site but you’ve never parsed log files, do you want to have to deal with synching, moving, parsing, and filtering log files?   These questions make for fruitful discussion as you plan or extend your web analytics implementation.

I’ve heard the following “things” about page tags:

  • It’s easy to page tag your sites.  That really depends on the technologies used to build the site.  A static site with few simple pages is a different animal than a beast of a database-driven site with millions of complicated pages glued together using different technologies. 
  • Page tags are more reliable.  Reliable in what way, I ask?  Page tags in a hosted environment are processed by a data center many miles away.  I have no idea of the availability of the servers to support the load they are receiving from all the customers, nor do I have any insight into the raw data collected by the page tags, errors, and so on.
  • Robots and spiders are always removed from the metrics.  Page tags do a better job than log file analyzers out-of-the-gate and especially if you don’t maintain your filtering, but I don’t really think page tags are infallible at all.  Bots and spiders get through.
  • You don’t have to deal with IT when page tagging.  It’s just a global include right?  You just give IT the code, and they include it.  For the most part that’s true, but IT has processes and procedures and your tagging needs to be QA’ed and perhaps even put through a “versioning” process.   Regardless, you’ll need to prep the corporation and your friends in IT for the big page tagging effort.
  • Change management is easy with page tags.  If you only use the vendor’s out-of-the-box page tag and have a global include, it’s somewhat easy to manage change.  But in my experience, a web analytics implementation requires using the tag in different ways across the site.  You’ll need to add attributes to the tag or script values into the tag.   In order to manage change, you’ll need to follow corporate processes.  In the case of web analytics page tagging, you may need to create those processes before you manage change… and process creation in corporate environments is sometimes not easy.  You may even need to follow a process for doing so.  :)

The biggest challenges I’ve had with page tagging include:

  • Ensuring complete code coverage across all pages.  The number of pages in your site, the way the site is built, and the technology used to build your site all need to be carefully assessed prior to beginning the page tagging effort.  The larger the portfolio of sites, the more difficult it will be to tag all of your pages.  Other challenges include the number of domains and subdomains, the technology standards used across your portfolio of sites, and whether, in more controlled environments, the web analytics team actually has the positional power to influence the change management process.
  • Determining a method for centralized tag management.  If you are running an internal solution, you’ll need page tag server(s) to collect the page tag data.  If you’re depending on your vendor’s servers, you may need to modify or update tags for various reasons (new campaigns and so on).  You may find new sites that need to be tagged.  New stakeholders may want special tags added to the site to support other purposes (such as multivariate testing).   How are you going to centrally manage all that?   
  • Orchestrating changes to tags across different site sections.  You may want to pass a new value in the tag or change something in the script on page X, but not page Y, or subdomain A, but not subdomain B.  A new campaign may need to be enabled.  You may have to update your tag to take advantage of new vendor functionality.  You’ll need a technology solution and process for centrally orchestrating and controlling tag changes.  Wielding a mighty CMS helps.
  • Reconciling tag metrics with log metrics and determining correct filtering.  It’s great fun to collect tag data then compare it to your log files to determine if all pages are being counted and how effectively you are filtering bots.  While looking at different data sources is a time consuming activity, certain business cases may demand it. 
  • Integrating tagging with a Content Management System.  Major sites use expensive CMS’ to create web sites.  The web analyst should work with the CMS team to build page tags into site pages.  Integration will help you with two points raised above: centralizing tag management and change management/orchestration.
  • Challenges with decodes and lookups using tags.  I’ve learned that it is difficult and in many cases impossible to decode a value or use lookup tables with a page tag.  For example, if I had a page that had a URI “/er45rw/e42f45erfwrq3r.html,” I can’t decode it to read “Web Analytics Blog” in my reports.  Or if it is possible to do simple decodes, I have to hardcode the decode on each page.   Hard coding on an evolving web site is never manageable over time.
  • Latency.  If the page partially loads without executing the javascript, or if the user clicks through the page before the javascript has fired, the page view won’t be counted.  You must test to make sure your tag is firing properly.
  • Javascript turned off.  If the browser doesn’t execute javascript, the tag won’t fire and the visitor won’t be measured.  What effect will that have on your numbers? 
  • Cookie issues.  Tags may set third-party cookies.  Privacy policies don’t like the third-party cookie.  As Justin Cutroni points out in the comments, his tool sets first-party cookies by default.  That’s good.
  • DNS changes.  To prevent third party cookie issues, you can trick out the DNS with a CNAME entry.  Say hello to someone in IT called the DNS admin!
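
The reconciliation challenge in the list above (comparing tag metrics with log metrics) can be sketched in a few lines of Python.  This is only an illustration: the page view numbers are invented, the 10% tolerance is an arbitrary assumption, and the log counts are assumed to be already filtered.

```python
def reconcile(tag_counts, log_counts, tolerance=0.10):
    """Return pages whose tag and log page views differ by more than tolerance."""
    report = []
    for page in sorted(set(tag_counts) | set(log_counts)):
        tags = tag_counts.get(page, 0)
        logs = log_counts.get(page, 0)
        baseline = max(tags, logs)
        # Relative gap against the larger of the two counts.
        gap = abs(tags - logs) / baseline if baseline else 0.0
        if gap > tolerance:
            report.append((page, tags, logs, round(gap, 2)))
    return report

# Invented daily page views per URL from each collection method.
tag_counts = {"/index.html": 950, "/about.html": 400, "/checkout.html": 0}
log_counts = {"/index.html": 1000, "/about.html": 410,
              "/checkout.html": 120, "/robots.txt": 35}

discrepancies = reconcile(tag_counts, log_counts)
for page, tags, logs, gap in discrepancies:
    print(f"{page}: tags={tags} logs={logs} gap={gap:.0%}")
```

In this made-up data, "/checkout.html" shows up in the logs but never in the tag data, which usually points at a page that was never tagged, while "/robots.txt" is the kind of bot-only request you would expect the tags to miss.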

In discussing the page tag in such a manner, you may think I don’t like the page tag.  Not true at all!  The page tag is a very useful data collection method in context.  I like it very much, especially in hybrid data collection.  Like any technology, just make sure you understand the implications of your data collection method on your implementation.   

page_tags.gif
Courtesy of Zeus.com.

Part 4: Spider and Bot Capability Maturity Framework for Web Analytics 2.0

Let’s conclude this series on the spiders and bots of web analytics 2.0 with a framework for ranking the spider and bot capability maturity of a web analytics practice!

Before I do so, let me mention a few things:

  • How/if you detect spiders and bots depends on your data collection methods.  It is thought that page tags exclude spiders and bots because they don’t execute javascript or request images. Until they do. Log file data must be examined and filtered (either programmatically, manually, or both), and those filters need to be maintained by a web analytics professional.
  • Bots are used for many purposes.  From inflating metrics, to indexing text for search, to chatting, to spamming, to data mining, to site performance monitoring, to click fraud, and more.
  • Bots are constantly evolving. Just like web analysis. :)
  • Not all bots are bad news.  You want bots crawling your site.  Some bots are good and helpful for your online business. Imagine if you blocked Googlebot.  Segment the robotic traffic into separate reporting (in the best case) and make sure it’s filtered from your externally reported numbers.
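
Segmenting robotic traffic into its own reporting, as suggested above, might look something like this Python sketch.  The signature table is deliberately tiny and is an assumption for illustration; anything it doesn’t recognize is only assumed to be human, which is exactly the gap a maintained filter list has to close.

```python
from collections import Counter

# Partial, illustrative signature table: bot name -> lowercase user agent token.
BOT_SIGNATURES = {"Googlebot": "googlebot", "Yahoo! Slurp": "slurp"}

def classify_agent(user_agent):
    """Return the bot name for a user agent, or a catch-all human segment."""
    agent = user_agent.lower()
    for name, token in BOT_SIGNATURES.items():
        if token in agent:
            return name
    return "human (assumed)"

def segment_hits(user_agents):
    """Count hits per segment so bots can be reported separately."""
    return Counter(classify_agent(agent) for agent in user_agents)

# Sample user agent strings as they might appear in a log file.
agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/2.0",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
]

segments = segment_hits(agents)
for segment, hit_count in segments.most_common():
    print(segment, hit_count)
```

The bot segments feed your SEO reporting, and everything outside them is the starting point for your externally reported numbers.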

Now without further ado, here’s a web analytics 2.0 framework for ranking your spider and bot capability maturity:

  • Red.  You are in the red if the following applies:
    • I don’t understand the impact of spider and bot traffic nor do I measure, filter, or segment it at all. 
    • If I run an in-house solution, I haven’t updated my list of filters and exclusions since I deployed it. 
    • If I run a hosted solution, I rely solely on my vendor to control all aspects of bot filtering and data purification.
    • I don’t know when or how the Googlebot hits my site, but it sure does! 
    • I’ve never thought of reporting spider traffic for SEO.
    • I think the words “yahoo” and “slurp” refer to delicious soup.
    • Spiders scare me.
    • All bots are bad (no, they aren’t!)
  • Yellow. You are moving out of the red zone, into the yellow if:
    • I occasionally look for suspicious traffic. 
    • If I run an in-house solution, I have occasionally updated my list of filters and exclusions since I deployed it. 
    • If I run a hosted solution, I rely solely on my vendor to control all aspects of bot filtering and data purification, but I ask for verifications of compliance with industry standards.
    • I’ve heard of the Googlebot and maybe I use or I am thinking about using Site Maps (do it!). 
    • I know about Yahoo! Slurp.
    • It would be cool and useful to report on spider and bot traffic and learn which bots are good and bad. 
  • Blue.  You are in the blue if at least every month or on an ad hoc basis:
    • I work with my vendor or in-house team to recognize and remove spider and bot traffic, conforming to industry standards.
    • I do a monthly update of my filter and exclusions list, or I know my vendor does.
    • I capture bot traffic in my server log files, but I may or may not report on it.
    • I know all about the Googlebot, and I use Site Maps. 
    • I realize the Googlebot and Yahoo! Slurp behave differently on my site, but I don’t know exactly how.
    • I know which bots are good (like search bots) and bad (like content scrapers).
  • Green.  You are in the green if:
    • I have established a process for regularly removing spider and bot traffic and for keeping my lists of exclusions and filters up-to-date, or I am *absolutely certain* my vendor has such processes.
    • I am in compliance with industry standards.
    • I measure spider and bot traffic and segment it into distinct reporting separate from my human traffic reporting.
    • My SEOers love me, the web analyst, and I am involved in educating people in my company about bot traffic from IT to managers to consumers of reporting and analysis.
    • I know that there’s more than one Googlebot, and I know how its crawl differs from Yahoo! Slurp (hint: Slurp visits more).
    • I know detailed metrics like the “total time online” of all my bots. ;)

This four-part series covers only a little of what I know and what there is to know about spiders, bots, and crawlers and how they affect web analytics.   I’d enjoy hearing other experiences or opinions, so please share your comments if you feel like it.  Until next time, fine readers.  Thanks for visiting!

sengine1.gif