Web Analytics Blogs

Judah Phillips is an experienced web analytics practitioner and Internet expert currently working as a Director at a large multichannel media company. His blog is full of useful, unbiased, actionable insights learned from the real-world practice of a process-oriented, integrated approach to strategic Web Analytics for improving business performance.


Archive for 'Web 2.0'


Web Analytics, Keywords, and a Question Someone Asked Me…

Web analytics and keyword metrics came up in a conversation I had last evening with a friend of mine from my days in “information retrieval” - when Googol was a really, really large number, and we called keywords “queries…”  Over a Belgian beer (a Cantillon), I was asked to “name the top couple of metrics I’d want to know about a set of existing keywords if I were selecting a few to continue to optimize or buy?”

I told him that any keyword-related metric should be analyzed within the context of campaign objectives, which in order to be measured and reported need to be defined before the campaign begins.   Macro level campaign goals should be identified before performing micro-level keyword analysis.  Once campaign goals are known, analysis can focus on achieving the optimal keyword mix to fulfill them.  A single, keyword-related metric should rarely be taken as a stand-alone indicator of performance. 

Here’s a synopsis of what metrics I told him I think are useful to examine when performing keyword analysis:

  • Referrers.  At a basic level, identifying the sites that sent keyword traffic is common sense (like not excluding the Googlebot ;).  You may uncover keywords for which your site’s content “accidentally ranks” on a particular engine.  These rankings may not be immediately obvious from a straight list of top-performing keywords.  By digging deeper into keyword referrers, you may find sites like these: forex-cash-fast.info, gambling1×2.com, nhadep.net, nghenhac.com, and xn--q2yr34f.com.  Clickfraud?  Poor targeting by an engine?  Lost money?   So many questions can be asked from keyword referrers!
  • Geography. Show me my keywords segmented by dimensions like Continent, Country, City, Zip Code to assist in planning geo-targeted campaigns and identifying the broad content themes that appeal to the geographic long tail.
  • Number of Visits and Percentage of Total Site Visits.  Raw visit and percentage totals indicate the “reach” of the keyword: the degree to which a keyword has penetrated a target audience.  I could compare the number of visits to the number of searches for that keyword using Overture’s Keyword Selector Tool to assess reach and correlate whether the cost to buy or the effort to optimize the keyword is achieving the desired effect.
  • Average Visit Duration.  It’s not an engagement metric, but average visit duration does tell you whether or not the visitor remained on your site and if so for how long.  It can be useful when taken into context with the page-view to visit ratio and segmented by other dimensions, such as conversion rate.  
  • Page View to Visit Ratio.  One of my favorite metrics on a per keyword basis is the view:visit ratio.  This ratio identifies the average number of pages viewed per visit for that keyword.  If your keyword should convert the visitor from the landing page, and you are seeing a page view to visit ratio greater than one, what’s up?  If you’re trying to persuade visitors to enter some sort of non-linear or linear, multistep funnel leading to a conversion, and your page-view to visit ratio is one, what’s up?
  • Bounce rate.  A key metric that identifies what percentage of visitors enter the site on the keyword’s landing page and immediately leave.   If your bounce rate for a keyword is over 35% and you are targeting that keyword, you should think about landing page optimization.
  • Conversion rate. Conversion rate is the percentage of visitors referred by the keyword who succeeded in completing a pre-identified, value generating event on the site, such as a purchase or registration.  Conversion rates measure how well the keyword acted as a trigger for driving on-site revenue.  By segmenting your keywords based on conversion rate or other dimensions, you may notice broad content themes that drive on-site success events.  These themes could be used in persuasive messaging that includes hyperlinked points of resolution moving visitors into the non-linear conversion funnel.
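
As a rough illustration of how several of the metrics above could be computed side by side, here is a minimal Python sketch. The visit records, the field layout, and the function name `keyword_metrics` are all hypothetical; a bounce is assumed to be a single-page visit, and conversion rate is simplified to a per-visit rate rather than the per-visitor rate described above.

```python
from collections import defaultdict

# Hypothetical visit records: (referring keyword, pages viewed, converted?)
visits = [
    ("web analytics", 1, False),
    ("web analytics", 4, True),
    ("web analytics", 1, False),
    ("log file analysis", 3, False),
    ("log file analysis", 5, True),
]

def keyword_metrics(visits):
    """Aggregate visit-level records into per-keyword metrics."""
    totals = defaultdict(
        lambda: {"visits": 0, "page_views": 0, "bounces": 0, "conversions": 0}
    )
    for keyword, pages_viewed, converted in visits:
        t = totals[keyword]
        t["visits"] += 1
        t["page_views"] += pages_viewed
        # Assumption: a bounce is a visit with exactly one page view.
        t["bounces"] += 1 if pages_viewed == 1 else 0
        t["conversions"] += 1 if converted else 0

    all_visits = sum(t["visits"] for t in totals.values())
    report = {}
    for keyword, t in totals.items():
        n = t["visits"]
        report[keyword] = {
            "visits": n,
            "share_of_visits": n / all_visits,
            "views_per_visit": t["page_views"] / n,
            "bounce_rate": t["bounces"] / n,
            # Simplification: conversions per visit, not per unique visitor.
            "conversion_rate": t["conversions"] / n,
        }
    return report

report = keyword_metrics(visits)
for keyword, metrics in sorted(report.items()):
    print(keyword, metrics)
```

Keeping the metrics in one record per keyword makes it easier to read them together instead of treating any single number as a stand-alone indicator.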

Then I told him to “segment, segment, segment.” :-)

Many metrics and dimensions can be applied to the analysis of keywords beyond the few I listed above.  What metrics do you look at on a per keyword basis when planning search engine optimization efforts or when planning paid search campaigning?


Part 1: Your Web Analytics Data Quality May Stink and Here’s Why!

Web analytics data quality and the accuracy of “the numbers” are always questioned.  With so many sources of data from different systems and vendors - both free and paid - you must be able to reconcile deviations in data from different sources, and speak intelligently about data quality and accuracy to promote adoption of web analytics at your organization.

There are so so so many reasons why web analytics data quality can stink.   I thought it would be fun to list some of the major reasons (over two posts):

  • Spiders and Bots.  If you haven’t read my series on spiders and bots, check it out.  Non-human traffic can inflate your metrics and diminish the predictive power of your analysis.  Regularly look for bots and update your filtering!
  • Untagged pages.  If the bulk of your page views are being generated by a single page, and you’ve failed to tag that page, you’ve lost data.  That’s always a bummer.  Tagging must be endemic to the web development process. Ask your team how they know that every page is tagged.  Really how do they know?  Verify and reverify.  Use a tool like WASP.   Obviously this isn’t a problem with log files.  
  • JS turned off.  If the browser doesn’t execute javascript, the page tag won’t fire and traffic will be missed.  If you’ve missed it, you don’t know it occurred, so you are constantly in the dark unless you compare your page tagged data to log file data, which isn’t easy at all!  How you account for missing this traffic, whether it is immaterial or not, is a business decision. Again, not a problem with log files.
  • Latency.  If the page tag doesn’t fire because it failed to load, the traffic is missed.  Vendors provide recommendations about the best place for a page tag.  Your development team may not believe them, or the “global include” that may or may not exist may or may not insert the code in the suggested spot.  It’s a good idea to listen to vendors when they provide configuration recommendations.  Once again, how you account for this data discrepancy is a business decision. Not a problem with log files.
  • Differences in sessionization.  Data divergence gets hairy when you are running two tools on the same site, or are replacing one tool with another.  That seems to be very common these days with Google Analytics.  Eric Enge over at Stone Temple Consulting in cooperation with my pal Jim Sterne has some data from the 2007 Web Analytics Shootout that you should read to help you understand how different vendors sessionize.
  • Sampling, sampling, sampling.  My friend Avinash Kaushik does a good job covering issues around sampling in web analytics.  I recommend reading his post!  I’ll add that statistical methods applied to web analytics data are completely valid; however, sampling at the site, page, or database level opens the possibility that you miss key data.  Sites looking for the “long tail” of visitors and using data sampling or other data trimming methods may not find it.
  • No referrer passed.  Referrer analysis tells you what site people came from before they visited your site.  Bookmarks, typed pages, email campaigns, and bots don’t pass referrers.  Sometimes referrer information just isn’t passed by the browser.  If you overuse redirects on your sites, you may lose referrer data.  Not having this important information impacts SEO/SEM, linking campaigns, and affiliate partnering.  
  • Cookie configuration and deletion.  We all know about cookie deletion thanks to Eric’s research while at Jupiter.  In addition, if your server isn’t set up right, you may not be setting the cookie on the first request. Talk to the web server guys and gals at your company to make sure your cookie handling is optimized for web analytics.
  • Proxies.  Proxies can filter out your referrers and make it look like everyone’s coming from the same IP address.  Most tools enable cookie-based visitor identification to work around the proxy issue, but if cookies aren’t set up right on your web server or configured correctly in your web analytics tool, assessing uniqueness when the bulk of traffic comes from the same IP will be problematic.
  • Time spent metrics have severe limitations.  Single-page visits and the time spent on the last page in the visit aren’t measured in “total time online.”  Time-based metrics are schoolboy metrics, so don’t overemphasize them as stand-alone indicators of “engagement.”  Assess the impact of time spent in the context of goals.  Use time as a variable for segmentation and as input into a larger engagement metric. Use them in context, especially if you are selling advertising (and who isn’t)?  My friend Jim Novo has one of the best takes on time spent metrics I’ve read.
  • Failure to maintain exclusions and filters.  Data can be polluted if you are introducing new filetypes  and not excluding them from your top content reporting or if you are not maintaining your bot filters.  New bots are crawling your site right now.  What are those web developers doing now?   Do you know?  Are you checking?  What’s the process for doing so?
  • Little to no historic data.  If you’ve just tagged your pages, you may have no basis for historical comparisons for quite some time (like a year!).  Data has realized value from understanding current behavior, and potential value from predicting future behavior.  And you need data to do it.    
  • You’re not storing all your dynamic URLs.  Your URLs may have many parameters in the query string (i.e. the name/value pairs after the “?”).  Some tools cut out this data and don’t make it available for querying or reporting in their applications.  If your site is database-driven and dynamic, not having access to every URL request will limit your ability to do ad-hoc analysis, filtering, and segmentation.
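
The limitation of time spent metrics noted in the list above can be seen in a small Python sketch. It assumes, hypothetically, that a tool only knows the timestamp of each page request in a visit, so the measured duration is the gap between the first and last request; the function name and sample timestamps are invented for illustration.

```python
from datetime import datetime

def measured_duration_seconds(timestamps):
    """Time between the first and last page request of one visit.

    With only request timestamps available, the time spent reading the
    final page is unknown, and a single-page visit measures as zero.
    """
    if len(timestamps) < 2:
        return 0.0
    ordered = sorted(timestamps)
    return (ordered[-1] - ordered[0]).total_seconds()

# Hypothetical visits: one single-page visit and one three-page visit.
single_page_visit = [datetime(2008, 1, 15, 9, 0, 0)]
three_page_visit = [
    datetime(2008, 1, 15, 9, 0, 0),
    datetime(2008, 1, 15, 9, 2, 0),
    datetime(2008, 1, 15, 9, 5, 30),
]

print(measured_duration_seconds(single_page_visit))  # 0.0
print(measured_duration_seconds(three_page_visit))   # 330.0
```

However long the visitor actually stayed on the single page, or on the third page of the longer visit, none of that time shows up in the total.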

Let’s continue this long list in Part 2!


Web Analytics ROI and Value Generation? The Three Core Actions

“Why do we do web analytics?  What value does it generate?  What is the ROI from web analytics?”  Companies trying to justify or in the process of allocating capital to “do web analytics” are wondering…  Recent research has shown that companies serious about web analytics need to invest money in people and technology.  But what do companies get in return for doing so?

I believe that web analytics helps online business identify potential opportunities for taking action to:

  • Increase revenue.  Web analytics helps you make more money.
  • Decrease cost.  Web analytics helps you spend less money to make more money.
  • Improve operations.  Web analytics helps you work smarter and more efficiently.

Let’s explore some of the ways web analytics can help online business across these three core actions:

Increased revenue through:                                                                                    

  • More targeted advertising sales.  Content is monetized in many ways, including CPM, CPA, CPL, PPC, and more.  Web analytics can tell you which of these methods for generating revenue and which advertising campaigns using those channels are performing most effectively.  External campaign effectiveness may be tracked using referrer data and related dimensional reporting.  Metrics related to internal campaigning, like microsites or special advertorial offerings, can be easily provided to advertisers and agencies to identify audience consistency and quality.
  • Better insights into audience segments to realize incremental revenue.  Segmentation refers to dividing a total population into groups based on one or more characteristics.  A good web analytics tool easily enables you to segment on dimensions and attributes relevant to your business.  Segmenting web data enables you to answer questions about which visitors visit when and with what frequency, depth, and duration, and more, which provides otherwise unknowable insights.  New incremental revenue streams may be realized by mapping newly discovered behavioral or demographic characteristics to existing advertiser or agency demand.
  • Creating effective online-marketing and editorial offerings.  Reports showing visit frequency, depth, recency, and the time periods when the online audience visits the site assist product managers, editors, and producers in optimizing, crafting, and targeting content and advertising, increasing reach and exposure time of advertiser messaging to key audience segments.
  • Ensuring pages effectively lead to conversion funnels. Metrics like bounce rate, conversion rate, clickstream pathing, and conversion metrics provide indications about how to modify or tailor pages to generate value.  Funnels can provide insights about which calls to action, content, pages, sections, and campaigns yield the best conversions.

Reduced costs by:

  • Increasing the effectiveness of online work products. By identifying, monitoring, and evaluating important KPI’s (key performance indicators and KKPI’s!), the business learns what works and what doesn’t work online.  Web performance data has amazing utility when evaluating, planning, and monitoring current and future trends when assessing how to reduce cost in a portfolio of online products.
  • Maximizing site operations, content, and opportunities for organic and paid search.  The performance of pay-per-click, paid inclusion, and contextual advertising and linking campaigns may be audited to eliminate projects that fail to meet goals based on conversion, revenue, or KPI’s.  By tying conversion to capital budgeting, online projects that fail to meet site hurdle rates may be tailored or eliminated.  The business can then better focus on driving profitable revenue without misallocating resources.
  • Optimizing user experience and information architecture.  Overhead reducing tools like Google Site Optimizer and offerings from other companies providing site optimization services use web analytics data to programmatically alter a site to increase conversion and lift.  CMS automation can be driven off of web analytics data.
  • Pinpointing the performance of online marketing campaigns.  By creating custom KPI’s, metrics, and segmented conversion rates, and by slicing data via custom filters and business relevant dimensions, deep insights into online performance can be attained.  Misappropriated resources and efforts can be easily recognized and eliminated.

Improved operations via:

  • Deep understanding of site traffic, visitor activity, conversions, and online value-generation.  You can’t manage it if you don’t know about it. Companies most successful with web analytics dedicate a full-time staff to analyzing and contextualizing data and performance metrics from channels like organic and paid search, affiliate partnerships, and offline.  The best staff understands the impact of the web channel across the value chain.
  • Contextualizing strategic decisions with accurate data.  The ability for a corporation to gain insight and intelligence into its online activities provides management with transparency into performance.  Performance must be monitored to be improved, and there’s no other way to gain true insight into online performance than using web analytics to guide web strategy.
  • Identifying site operational effectiveness in a timely manner.  When using log files, server errors and other impediments to online customer satisfaction can be quickly discovered and remediated, which reduces negative impact and minimizes risk. 
  • Predicting the impact of business decisions on performance.  By applying statistical methods to web analytics data, businesses increase their ability to predict the impact of site changes on performance.

Every Internet business can benefit from technology that positively impacts these three important business actions.  I recommend that you consider how your projects are framed across these three actions whether you’re just thinking about getting involved with web analytics, growing your web analytics practice, or have already established web analytics at your company.


Web Analytics and Data Collection: Log File Analysis

Web Analytics and log file analysis is misunderstood!  With so many options for data collection in web analytics, the log file has been somewhat disdained for many years.  In fact, I’d venture to guess that the majority of web analytics practitioners working today don’t “do log file analysis.” That’s unfortunate.  One can learn a lot about web analytics from examining log files - for example, “total time online” (with all its inherent issues :).

Why is server-side data collection (log file analysis) less popular than client-side page tagging?  It’s not about bots, accuracy, or ease of deployment… that’s marketing spin, somewhat true and false depending on context.  It may be because most web analytics practitioners currently use a hosted model.

Page tagging enables ASP models for web analytics.  It would be very difficult (i.e. resource intensive and costly) for a web analytics ASP to maintain a competitive service level if the ASP relied on its clients and their IT teams to maintain server side data collection: maintaining, transporting, and synching log files.

Imagine the ASP model with log files!  It’s nightmarish and makes little sense:  hundreds and thousands of customers with just as many (or more) different servers behind firewalls generating many different log file formats.  Then, at various intervals all customers sending uniquely formatted logs for the ASP to parse, index, and make available for querying and retrieval in close to real time.  Yikes!! 

Page tagging creates a standard data structure for the ASP and a manageable data transport mechanism while minimizing complexity and reducing infrastructure cost in a way that log files just can’t.  Page tags enable ASP’s to meet rigorous service level agreements and contractual obligations for data processing in a way they probably couldn’t with log files.

So who does log file analysis for web analytics?  The answer is companies that have:

  • No desire to give up control of a web analytics infrastructure to young companies.  Web data is valuable and log files enable you to keep it all for however long and at whatever cost your business justifies.  You can process your data using whatever tool you want, and reprocess at your discretion.
  • Existing business dimensions and structured data already in log files. Simple web analytics concepts like page views, visits, and “uniques” become more meaningful when sliced by relevant business objects from your company’s existing log data.  For example, if you’ve passed a unique parameter in your query strings for years, you can program your log file tool to recognize the parameter.   Then reprocess your existing data to make use of it in your reporting.
  • Deep data mining experience.  Data miners want data to mine, not static GUI’s or “my data” delivered in summary extracts at an incremental cost.  Log files can be parsed and indexed into the database that you, the company or the client, actually owns at the budget you actually set.
  • The know-how to employ hybrid data collection.  Log files are complemented by page tags for instrumenting pages to track events and Web 2.0.
  • Complex integration requirements.  All the data exists on machines in your network.  Log files can be extended to incorporate new business dimensions.
  • A need to understand how spiders and bots behave on their site and the performance of their server(s).  Log files capture everything that happens on the web site.  The “kitchen_sink.html” is in there.  It’s up to you to customize the tool to meet the quality level you demand for filtered data.  As Mat and Steve mentioned in the comments, logs also enable you to view server errors, useful for site QA and debugging.
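
To make the query string example from the list above concrete, here is a minimal Python sketch of slicing page views by a parameter that already exists in logged URLs. The parameter name `cid`, the sample URLs, and the function name are hypothetical; a real log file tool would have its own way of declaring such a parameter.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Hypothetical request URLs as they might appear in a log file.
# "cid" is an invented parameter name used only for illustration.
requested_urls = [
    "/article.html?cid=newsletter&page=2",
    "/article.html?cid=partner",
    "/index.html",
    "/video.html?page=1&cid=newsletter",
]

def page_views_by_parameter(urls, parameter):
    """Count page views grouped by the value of one query string parameter."""
    counts = Counter()
    for url in urls:
        query = parse_qs(urlsplit(url).query)
        # Requests without the parameter are grouped under "(not set)".
        values = query.get(parameter, ["(not set)"])
        counts[values[0]] += 1
    return counts

counts = page_views_by_parameter(requested_urls, "cid")
print(counts.most_common())
```

Because the raw URLs are still in the logs, the same historical data can be reprocessed later with a different parameter.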

I’m sure there are other reasons too….  I’ve heard people tell me that:

  • It’s hard to move log files.  That depends on how many log files you have, what technology you are using to move them, and how often.
  • Log files are difficult to parse. Not really, especially if they are standardized.  You just need a flexible tool.  If things get non-standard, you need a tool that enables you to define your own formatting.
  • Log files take up a lot of space.  Yes, they do. How much space do you have?  What are your company’s existing standards for parsing log files?  How long will you store them?  You may find out you have plenty of “space.”  You will never have enough time though. ;)
  • Log files aren’t as accurate as page tags.  Huh? That all depends on how well you filter and exclude pages.  Accurate compared to what? To unvalidated data existing on servers you don’t see in some data center thousands of kilometers/miles away?  Logs capture visits from browsers that don’t execute javascript and thus will not appear in the metrics from tag-based systems…
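
On the point about parsing standardized logs, a short Python sketch may help. It assumes the logs follow the common Apache "combined" layout (host, identity, user, timestamp, request line, status, bytes, referrer, user agent); the sample line, the pattern name, and the function name are made up for illustration, and a custom format would need its own pattern.

```python
import re

# Pattern for the widely used "combined" log layout. This is a sketch:
# it does not handle escaped quotes inside fields.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical log line using documentation-only addresses and domains.
line = (
    '192.0.2.10 - - [15/Jan/2008:09:00:00 -0500] '
    '"GET /blog/index.html HTTP/1.1" 200 5120 '
    '"http://www.example.com/search?q=web+analytics" '
    '"Mozilla/5.0 (Windows NT 5.1)"'
)

def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

fields = parse_line(line)
print(fields["request"], fields["status"], fields["referrer"])
```

Lines that return `None` are worth counting rather than silently dropping, since a sudden rise usually means the format changed upstream.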

So, you might be thinking I am a big fan of log files.  I am a web analytics realist.  Challenges with log files include:

  • Standardizing on a common format.  If you have a web server, you are at risk for someone doing something that creates a custom log file format.  If you are the new web analyst coming into an existing company, you may find custom fields in the logs.  Develop and champion a company standard.  It will make your life as a web analyst somewhat easier.
  • Creating a single “core” configuration that can be extended to deal with one-off formats.  If you live in a world of custom log files, you can abstract a set of core fields, describe them, then extend the description when you deal with custom craziness.
  • Synchronizing/transporting log files.  The more servers you have, the more chances you have for failures in synchronization.  Application monitoring and recovery procedures are critical to preventing downtime.  Develop them.
  • Reprocessing log files.  Depending on your system, reprocessing existing log file data can have a myriad of effects - from data not being available, to data being partially available, to errors, to long waiting periods before the data is current and ready to query.
  • Receiving log files that are missing key fields or with key fields that have changed.   If you depend on systems, companies, and people that you do not control, fields can be stripped and formats changed in the blink of an eye.  The web team needs to detect and manage these events.
  • Monitoring the operational workflow.  Someone on the web analytics team needs to manage the system and maintain the overall availability and integrity of log file processing and any issues that could result in the data not being synched or processed correctly. 
  • Managing change.  Since all your “hits” are captured in the log file, you’ll need to tell your web analytics tool how to correctly process the logs and remove data you don’t want to report.  You’ll need to identify and maintain lists of filetype exclusions and the user agents/IPs of robotic traffic.
  • Finding staff that “gets it.” The rule is that success in web analytics results from having competent people who have experience understanding the risks and rewards of web analytics data collection and executing on a web analytics strategy that generates value. 
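
The exclusion lists mentioned under “Managing change” could look something like the following Python sketch. The extension list, user agent fragments, IP address, and sample hits are all illustrative assumptions, not a recommended or complete set; the point is only that the lists live in one place where they can be maintained.

```python
import os
from urllib.parse import urlsplit

# Illustrative exclusion lists; a real deployment would maintain its own.
EXCLUDED_EXTENSIONS = {".gif", ".jpg", ".png", ".css", ".js", ".ico"}
ROBOT_AGENT_FRAGMENTS = ("googlebot", "slurp", "crawler", "spider")
ROBOT_IPS = {"192.0.2.99"}  # hypothetical address of a known monitoring bot

def is_reportable(hit):
    """Return True if a logged hit should count as a human page view."""
    path = urlsplit(hit["url"]).path
    extension = os.path.splitext(path)[1].lower()
    if extension in EXCLUDED_EXTENSIONS:
        return False
    if hit["ip"] in ROBOT_IPS:
        return False
    agent = hit["agent"].lower()
    return not any(fragment in agent for fragment in ROBOT_AGENT_FRAGMENTS)

# Hypothetical hits already parsed out of a log file.
hits = [
    {"url": "/index.html", "ip": "198.51.100.7",
     "agent": "Mozilla/5.0 (Windows NT 5.1)"},
    {"url": "/images/logo.gif", "ip": "198.51.100.7",
     "agent": "Mozilla/5.0 (Windows NT 5.1)"},
    {"url": "/index.html", "ip": "203.0.113.5",
     "agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"url": "/about.html?ref=nav", "ip": "192.0.2.99", "agent": "Mozilla/5.0"},
    {"url": "/about.html?ref=nav", "ip": "198.51.100.8", "agent": "Mozilla/5.0"},
]

page_views = [hit for hit in hits if is_reportable(hit)]
print(len(page_views), "of", len(hits), "hits are reportable page views")
```

New file types and new bots will appear over time, so the lists need an owner and a review process, not a one-time setup.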

As a web analyst you need to use the tools at your disposal to build your web analytics practice.  Log files are part of your toolkit.  Be wary of people who try to sway you one way or the other when it comes to web analytics data collection.  The method you ultimately use should be based on the data you want to collect in the context of the company’s goals, budget, and expertise.


Web Analytics and Data Collection: The Page Tag

Many methods exist for collecting different types of web analytics data - some much more accurate and useful than others: page tags, log file analyzers, packet sniffers, audience panels, and toolbars.  At this point in time, the page tag certainly seems to be the preferred method adopted by the web analytics industry for data collection.  It’s also core to other technologies, like multivariate testing.

Various sources will tell you a multitude of “things” over time about page tags, depending on whether they think you are a noob or advanced practitioner.  There’s lots of useful information from many sources about page tags.  Still I find a lot of the average discussion about the utility of page tags to be somewhat true and false, depending on context.  For example, if you’ve never tagged pages, but you have processed log files, does it make sense to entirely throw away legacy processes?  Or if you’ve mastered change management across all pages on your site but you’ve never parsed log files, do you want to have to deal with synching, moving, parsing, filtering log files?   These questions yield fantastic answers for fruitful discussion as you plan or extend your web analytics implementation.

I’ve heard the following “things” about page tags:

  • It’s easy to page tag your sites.  That really depends on the technologies used to build the site.  A static site with few simple pages is a different animal than a beast of a database-driven site with millions of complicated pages glued together using different technologies. 
  • Page tags are more reliable.  Reliable in what way I ask?  Page tags in a hosted environment are processed by a data center many miles away.  I have no idea of the availability of the servers to support the load they are receiving from all the customers, nor do I have any insight into the raw data collected by the page tags, errors, and so on.
  • Robots and spiders are always removed from the metrics.  Page tags do a better job than log file analyzers out-of-the-gate and especially if you don’t maintain your filtering, but I don’t really think page tags are infallible at all.  Bots and spiders get through.
  • You don’t have to deal with IT when page tagging.  It’s just a global include right?  You just give IT the code, and they include it.  For the most part that’s true, but IT has processes and procedures and your tagging needs to be QA’ed and perhaps even put through a “versioning” process.   Regardless, you’ll need to prep the corporation and your friends in IT for the big page tagging effort.
  • Change management is easy with page tags.  If you only use the vendor’s out-of-the-box page tag and have a global include, it’s somewhat easy to manage change.  But in my experience, a web analytics implementation requires using the tag in different ways across the site.  You’ll need to add attributes to the tag or script values into the tag.   In order to manage change, you’ll need to follow corporate processes.  In the case of web analytics page tagging, you may need to create those processes before you manage change… and process creation in corporate environments is sometimes not easy.  You may even need to follow a process for doing so.  :)

The biggest challenges I’ve had with page tagging include:

  • Ensuring complete code coverage across all pages.  The number of pages in your site, the way the site is built, and the technology used to build your site all need to be carefully assessed prior to beginning the page tagging effort.  The larger the portfolio of sites, the more difficult it will be to tag all of your pages.  Other challenges include the number of domains and subdomains, the technology standards used across your portfolio of sites, and whether, in more controlled environments, the web analytics team actually has the positional power to influence the change management process.
  • Determining a method for centralized tag management.  If you are running an internal solution, you’ll need page tag server(s) to collect the page tag data.  If you’re depending on your vendor’s servers, you may need to modify or update tags for various reasons (new campaigns and so on).  You may find new sites that need to be tagged.  New stakeholders may want special tags added to the site to support other purposes (such as multivariate testing).   How are you going to centrally manage all that?   
  • Orchestrating changes to tags across different site sections.  You may want to pass a new value in the tag or change something in the script on page X, but not page Y, or subdomain A, but not subdomain B.  A new campaign may need to be enabled.  You may have to update your tag to take advantage of new vendor functionality.  You’ll need a technology solution and process for centrally orchestrating and controlling tag changes.  Wielding a mighty CMS helps.
  • Reconciling tag metrics with log metrics and determining correct filtering.  It’s great fun to collect tag data then compare it to your log files to determine if all pages are being counted and how effectively you are filtering bots.  While looking at different data sources is a time consuming activity, certain business cases may demand it. 
  • Integrating tagging with a Content Management System.  Major sites use expensive CMS’ to create web sites.  The web analyst should work with the CMS team to build page tags into site pages.  Integration will help you with two points raised above: centralizing tag management and change management/orchestration.
  • Challenges with decodes and lookups using tags.  I’ve learned that it is difficult and in many cases impossible to decode a value or use lookup tables with a page tag.  For example, if I had a page that had a URI “/er45rw/e42f45erfwrq3r.html,” I can’t decode it to read “Web Analytics Blog” in my reports.  Or if it is possible to do simple decodes, I have to hardcode the decode on each page.   Hard coding on an evolving web site is never manageable over time.
  • Latency.  If the page partially loads without executing the javascript, or if the user clicks through the page before the javascript has fired, the page view won’t be counted.  You must test to make sure your tag is firing properly.
  • Javascript turned off.  If the browser doesn’t execute javascript, the tag won’t fire and the visitor won’t be measured.  What effect will that have on your numbers? 
  • Cookie issues.  Tags may set third-party cookies.  Privacy policies don’t like the third party cookie.  As Justin Cutroni points out in the comments his tool sets first party cookies by default.  That’s good.
  • DNS changes.  To prevent third party cookie issues, you can trick out the DNS with a CNAME entry.  Say hello to someone in IT called the DNS admin!
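
The reconciliation challenge from the list above (comparing tag metrics with log metrics) can be sketched in a few lines of Python. The page view counts and the function name `reconcile` are invented for illustration, and the sketch assumes both sources have already been filtered and aggregated to page views per URL.

```python
# Hypothetical page view counts per URL from two data sources.
log_page_views = {"/index.html": 1000, "/about.html": 200, "/promo.html": 450}
tag_page_views = {"/index.html": 930, "/about.html": 188}

def reconcile(log_counts, tag_counts):
    """Compare log-based and tag-based page views for the same period.

    Returns the pages seen in the logs but absent from tag data (possibly
    untagged pages) and, for pages present in both, the share of logged
    page views that the tag also counted.
    """
    missing_from_tags = sorted(set(log_counts) - set(tag_counts))
    coverage = {
        page: tag_counts[page] / log_counts[page]
        for page in log_counts
        if page in tag_counts and log_counts[page] > 0
    }
    return missing_from_tags, coverage

missing, coverage = reconcile(log_page_views, tag_page_views)
print("Possibly untagged:", missing)
for page, ratio in sorted(coverage.items()):
    print(page, round(ratio, 2))
```

A page that appears only in the logs is a candidate for a missing tag, while a coverage ratio well below one may point to latency, disabled javascript, or bot traffic that the log filters have not yet caught.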

In discussing the page tag in such a manner, you may think I don’t like the page tag.  Not true at all!  The page tag is a very useful data collection method in context.  I like it very much, especially in hybrid data collection.  Like any technology, just make sure you understand the implications of your data collection method on your implementation.   

[Image: page_tags.gif, courtesy of Zeus.com.]

Part 4: Spider and Bot Capability Maturity Framework for Web Analytics 2.0

Let’s conclude this series on the spiders and bots of web analytics 2.0 with a framework for ranking the spider and bot capability maturity of a web analytics practice!

Before I do so, let me mention a few things:

  • How/if you detect spiders and bots depends on your data collection methods.  It is thought that page tags exclude spiders and bots because they don’t execute javascript or request images. Until they do. Log file data must be examined and filtered (either programmatically, manually, or both), and those filters need to be maintained by a web analytics professional.
  • Bots are used for many purposes: inflating metrics, indexing text for search, chatting, spamming, data mining, site performance monitoring, click fraud, and more.
  • Bots are constantly evolving. Just like web analysis. :)
  • Not all bots are bad news.  You want bots crawling your site.  Some bots are good and helpful for your online business. Imagine if you blocked Googlebot.  Segment the robotic traffic into separate reporting (in the best case) and make sure it’s filtered from your externally reported numbers.
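
To make the segmentation idea concrete, here is a minimal Python sketch that splits hits into human and bot segments by user agent.  The token list and the hit dictionaries are assumptions invented for this example, not a complete or authoritative bot list.  It only catches bots that identify themselves, so spoofed agents will still slip through.

```python
# Hypothetical substrings that self-identifying crawlers put in their user agent.
KNOWN_BOT_TOKENS = ("googlebot", "slurp", "bot", "crawler", "spider")

def segment_hits(hits):
    """Split hits into (human, bot) lists based on the user agent string."""
    human, bot = [], []
    for hit in hits:
        agent = hit.get("user_agent", "").lower()
        if any(token in agent for token in KNOWN_BOT_TOKENS):
            bot.append(hit)
        else:
            human.append(hit)
    return human, bot

# Made-up sample hits; real ones would come from parsed log lines.
hits = [
    {"url": "/", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"url": "/blog", "user_agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"},
    {"url": "/feed", "user_agent": "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"},
]
human_hits, bot_hits = segment_hits(hits)
print(len(human_hits), "human hits,", len(bot_hits), "bot hits")
```

Keeping the bot hits in their own list, rather than throwing them away, is what lets you report on them separately for SEO.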

Now, without further ado, here’s a web analytics 2.0 framework for ranking your spider and bot capability maturity:

  • Red.  You are in the red if the following applies:
    • I don’t understand the impact of spider and bot traffic nor do I measure, filter, or segment it at all. 
    • If I run an in-house solution, I haven’t updated my list of filters and exclusions since I deployed it. 
    • If I run a hosted solution, I rely solely on my vendor to control all aspects of bot filtering and data purification.
    • I don’t know when or how the Googlebot hits my site, but it sure does! 
    • I’ve never thought of reporting spider traffic for SEO.
    • I think the words “yahoo” and “slurp” refer to delicious soup.
    • Spiders scare me.
    • All bots are bad (no, they aren’t!)
  • Yellow. You are moving out of the red zone, into the yellow if:
    • I occasionally look for suspicious traffic. 
    • If I run an in-house solution, I have occasionally updated my list of filters and exclusions since I deployed it. 
    • If I run a hosted solution, I rely solely on my vendor to control all aspects of bot filtering and data purification, but I ask for verifications of compliance with industry standards.
    • I’ve heard of the Googlebot and maybe I use or I am thinking about using Site Maps (do it!). 
    • I know about Yahoo! Slurp.
    • It would be cool and useful to report on spider and bot traffic and learn which bots are good and bad. 
  • Blue.  You are in the blue if at least every month or on an ad hoc basis:
    • I work with my vendor or in-house team to recognize and remove spider and bot traffic, conforming to industry standards
    • I do a monthly update of my filter and exclusions list, or I know my vendor does.
    • I capture bot traffic in my server log files, but I may or may not report on it.
    • I know all about the Googlebot, and I use Site Maps. 
    • I realize the Googlebot and Yahoo! Slurp behave differently on my site, but I don’t know exactly how.
    • I know which bots are good (like search bots) and bad (like content scrapers).
  • Green.  You are in the green if:
    • I have established a process for regularly removing spider and bot traffic and for keeping my lists of exclusions and filters up-to-date, or I am *absolutely certain* my vendor has such processes.
    • I am in compliance with industry standards.
    • I measure spider and bot traffic and segment it into distinct reporting separate from my human traffic reporting.
    • My SEOers love me, the web analyst, and I am involved in educating people in my company about bot traffic from IT to managers to consumers of reporting and analysis.
    • I know that there’s more than one Googlebot, and I know how its crawl differs from Yahoo! Slurp (hint: Slurp visits more).
    • I know detailed metrics like the “total time online” of all my bots. ;)

What I’ve covered in this four-part series is only a little bit of what I know and what there is to know about spiders, bots, and crawlers and how they affect web analytics.  I’d enjoy hearing other experiences or opinions, so please share your comments if you feel like it.  Until next time, fine readers.  Thanks for visiting!

Part 2: The Crawlers, Spiders, and Bots of Web Analytics 2.0

In Part 1, I blogged about a huge challenge in web analytics: bot, spider, and crawler detection, analysis, and filtering.  According to robotstxt.org, “robots” are:

“…programs that automatically traverse the web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced… the specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time… it is still a robot.”

In some regards it is “taboo” to speak about evil bots, but I think it’s a topic that needs to be brought out of the dark.  If we have to pay for page views, I require and expect accurate page view metrics. 

By understanding the characteristics of bots to help detection and purification of your web metrics, you will deliver more accurate analysis. Let’s take a deeper look at other bot behavior of Web Analytics 2.0:

  • Javascript executing bots.  Yes, they exist.  Big time.  They will invoke your beacon code, and generate large numbers of false page views, which your vendor should be filtering out.   Maybe your vendor filters traffic where large numbers of page views are generated in a single visit, or maybe you are paying for them.  Check your pathing reports or your ratios, like view:visit. Do you see visits that have hundreds or thousands of page views?  Chances are those are bots (or your configuration may be technically challenged).

A large search engine company’s client service folks even hinted at javascript executing bots being programmed to execute queries on internal search engines, thus invoking an embedded advertising module, inflating the number of requests for the ad module.  While that doesn’t affect clickthrough or revenue and has nothing at all to do with the ad technology or the engine, it affects metrics. It confirms what I see.

Another favorite javascript executing bot I know likes to exploit HTML commenting left on in blogging architectures.  The bot spams comments with nefarious link farms or other unsavory content.  It executes javascript to do it.

Take, for example, the “rofovnifo” bot.  It fills out blog comments with the following content, leaving links to the nefarious goldwarez.org (don’t go there!) coming from Comcast 67.160.110.199:

Hi. Looks good! Very useful, good stuff. Good resources here. Thanks much!  G’night

A quick Google search reveals that “rofovnifo” really likes Cabela’s, sailing, Friendster, XBOX 360, car audio, card sharing, and world affairs, among many other things.  If only “rofovnifo” were a legitimate human, then I’d believe the traffic he generated and continues to generate across the 94,500 links he’s visited across the world over the last several days! (UPDATE: 127,400 links… wait 128,000)

  • Enter your site from various referrers using various methods. One bot that I find fascinating executes a query on a particular country’s Google site (let’s say Google.au), then the bot crawls site pages returned in the SERP (search engine results page).  Next it indexes the content on the returned page. In the case I observed, this bot crawls a directory of subject matter related to a certain discipline.  The directory has thousands of pages.  So what I see is the bot coming from Google.au from a certain keyword, entering the site, and crawling 4,000+ pages during one visit that lasts a few hours.  Not human.
  • Come from different IP addresses or subnets. I’ll find bots that jump around IP addresses, but always hit the same page or pages for the same amount of time.  Or the bots will come from different subnets and hit the same page at a given periodicity (like every 15 minutes).
  • Repeatedly hit one page, repeatedly!  In a former job, I noticed that one educational institution referred a bot to the site on only the first of the month. The bot would crawl 77,000 pages then stop.  Are these clickfraud bots?  What are they?  Were they testing a prototype?
  • Spoof their user agent.  Nasty bots don’t want to say “hey, I am a nasty bot” so they try to hide themselves by looking like normal user agents.

Mozilla compatible user agents look something like this:

xx.xxx.xx.xxx:Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.307; .NET CLR 1.1.4321; Media Center PC 4.0; .NET) 

Bots will spoof user agents to get by your filters.  Sometimes they are sloppy in doing so, which makes them instantly detectable.  Notice the bare MSIE with no version number, the upper-case WINDOWS, and the shortened NET reference:

xx.xxx.xx.xxx:Mozilla/4.0 (compatible; MSIE; WINDOWS 5.1; SV1; NET 1.0; Media Center PC 4.0; .NET)
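
Here is a rough Python sketch of catching that kind of sloppiness programmatically.  The three checks are assumptions drawn only from the two example agents above; they are not a general-purpose user agent validator, and a careful spoofer will pass all of them.

```python
import re

def sloppy_signs(user_agent):
    """Return a list of reasons an MSIE-style user agent looks hand-made."""
    match = re.search(r"\(([^)]*)\)", user_agent)
    if not match:
        return ["no parenthesized detail section"]
    tokens = [token.strip() for token in match.group(1).split(";")]
    reasons = []
    for token in tokens:
        # Real MSIE tokens carry a version, such as "MSIE 6.0".
        if token.startswith("MSIE") and not re.fullmatch(r"MSIE \d+\.\d+\w*", token):
            reasons.append("MSIE token without a version")
        # Browsers write "Windows NT 5.1", not "WINDOWS 5.1".
        if token.upper().startswith("WINDOWS") and not token.startswith("Windows"):
            reasons.append("unusual capitalization of Windows")
        # The genuine token is ".NET CLR x.y.z", with the dot and "CLR".
        if re.fullmatch(r"NET \d+(\.\d+)*", token):
            reasons.append("NET token missing the leading dot and CLR")
    return reasons

GOOD = ("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.307; "
        ".NET CLR 1.1.4321; Media Center PC 4.0; .NET)")
BAD = "Mozilla/4.0 (compatible; MSIE; WINDOWS 5.1; SV1; NET 1.0; Media Center PC 4.0; .NET)"

print(sloppy_signs(GOOD))
print(sloppy_signs(BAD))
```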

  • Use cookies.  Weird user agents that use cookies set off a huge red flag.   You need a web analytics tool that allows you to cross the cookie dimension with user agents.
  • Pass cookies.  Even worse than bots that simply use cookies is when you see one cookie passed across multiple IP addresses visiting your site consecutively during a short-time period, then disappearing after crawling huge numbers of pages in that short duration… very odd indeed. 
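
The cookie-passing pattern can be sketched in a few lines of Python.  This is a minimal illustration, assuming each hit has already been parsed into a cookie ID, an IP address, and a timestamp; the field names, the three-IP threshold, and the 30-minute window are all hypothetical choices you would tune against your own traffic.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def shared_cookies(hits, min_ips=3, window=timedelta(minutes=30)):
    """Return cookie IDs seen from at least min_ips distinct IPs within one window."""
    by_cookie = defaultdict(list)
    for hit in hits:
        by_cookie[hit["cookie"]].append((hit["time"], hit["ip"]))
    flagged = set()
    for cookie, events in by_cookie.items():
        events.sort()
        start = 0
        for end in range(len(events)):
            # Slide the left edge forward until the span fits inside the window.
            while events[end][0] - events[start][0] > window:
                start += 1
            ips = {ip for _, ip in events[start:end + 1]}
            if len(ips) >= min_ips:
                flagged.add(cookie)
                break
    return flagged

t0 = datetime(2007, 6, 1, 12, 0)  # arbitrary sample date
hits = [
    {"cookie": "abc", "ip": "192.0.2.1", "time": t0},
    {"cookie": "abc", "ip": "198.51.100.7", "time": t0 + timedelta(minutes=5)},
    {"cookie": "abc", "ip": "203.0.113.4", "time": t0 + timedelta(minutes=9)},
    {"cookie": "xyz", "ip": "192.0.2.50", "time": t0},
    {"cookie": "xyz", "ip": "192.0.2.50", "time": t0 + timedelta(minutes=20)},
]
print(shared_cookies(hits))
```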

What do you see on your sites?  How are you dealing with spiders and bots?

Check out Part 3 where I will provide recommendations of keeping your analytics up-to-date with Web 2.0 bots and spiders.  Thanks for visiting! 

Part 1: Spiders, Bots, Page Views, and Web Analytics, Oh My!

One fascinating area of web analytics is spider and bot detection and analysis.  I find it really amazing how many bots and spiders really exist out there on the “world wild web.” 

According to robotstxt.org, “robots” are:

“…programs that automatically traverse the web’s hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced…the specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot.”

Detecting non-human traffic is essential for “doing” solid web analytics.  After all, if you can’t adequately detect and filter robotic traffic, then your basic web metrics numbers are dramatically overinflated, which compromises the predictive power of your analysis.

The industry seems to believe that most vendors do a decent job of excluding robotic traffic.  After all, bots don’t execute javascript, so they don’t fire the page tag, thus that nasty robotic traffic doesn’t get included in the metrics reported by the web analytics service provider.   That’s true as long as the bot doesn’t execute javascript… what if it does? Page tagging is certainly not perfect, and in my experience bots do slip through. 

When you are doing log file analysis or using a hybrid data collection model, you’ll end up creating lists of IPs or user agents for bots to filter and exclude.  If you’re new to the game, your list may remain manageable.  If you record IPs and agents to filter for many years, you end up with a rather unwieldy list of exclusions, which probably should be in a database.  Does anyone know of a web analytics vendor that filters traffic based on data existing in a client’s bot database?
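
As a minimal sketch of what “in a database” could look like, here is a Python example using the standard library’s sqlite3 module.  The table name, columns, and sample entries are assumptions made up for illustration; an in-memory database stands in for whatever store you actually use.

```python
import sqlite3

# In-memory database for the example; a real list would live in a file or server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exclusions ("
    " kind TEXT NOT NULL,"   # 'ip' or 'agent'
    " value TEXT NOT NULL,"
    " PRIMARY KEY (kind, value))"
)
conn.executemany(
    "INSERT INTO exclusions (kind, value) VALUES (?, ?)",
    [("ip", "192.0.2.10"), ("agent", "ExampleScraper/1.0")],  # hypothetical entries
)

def is_excluded(conn, ip, user_agent):
    """Return True when either the IP or the exact user agent is on the list."""
    row = conn.execute(
        "SELECT 1 FROM exclusions"
        " WHERE (kind = 'ip' AND value = ?) OR (kind = 'agent' AND value = ?)"
        " LIMIT 1",
        (ip, user_agent),
    ).fetchone()
    return row is not None

print(is_excluded(conn, "192.0.2.10", "Mozilla/4.0"))
print(is_excluded(conn, "198.51.100.1", "Mozilla/4.0"))
```

The primary key keeps duplicate entries out as the list grows over the years, and the lookup stays a single query no matter how long the list gets.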

Since I think log-file analysis is a neglected art, I practice it (and I’ve used tags from a few vendors in my day-to-day).  In general, I notice that bot traffic:

  • Crawls inordinate numbers of pages per visit when compared to human visitors.  The basic observation you’ll notice immediately is that bots tend to view a large number of pages during each visit.  They tend to spend a long time on site too.
  • Enters the site at various intervals for various durations.  I find bots that request pages at the same periodicity.  One bot I’ve filtered in the past hits the same page every hour at 10 minutes past the hour, every hour, every day, every week, every month, every year.  It comes from somewhere in India.  A bigger problem than worrying about whether my visitors from India huddle around the same computer. ;)
  • Crawls a site in unusual patterns. A common signature of a bot is the way in which it crawls a site.  Often a bot will walk the document object model, crawling the structure of the site.  I see these bots request every link, in order, on a page, including file downloads.  The logic behind these crawls is easy to identify, and thus filter.
  • Repeatedly requests pages that human visitors don’t access.  I see bots that must be programmed to crawl only XML files or request only rich media files.
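
Two of these signatures, inordinate pages per visit and clockwork periodicity, are easy to sketch in Python.  This is a rough illustration under stated assumptions: visits are already sessionized into page counts and timestamps, and the 10x-median multiplier, 60-second tolerance, and five-hit minimum are hypothetical thresholds, not industry standards.

```python
from datetime import datetime, timedelta
from statistics import median, pstdev

def heavy_visits(pages_per_visit, multiplier=10):
    """Return visit IDs whose page count exceeds multiplier times the median."""
    threshold = multiplier * median(pages_per_visit.values())
    return sorted(v for v, pages in pages_per_visit.items() if pages > threshold)

def is_periodic(timestamps, tolerance_seconds=60, min_hits=5):
    """Return True when consecutive hits are spaced at nearly constant intervals."""
    if len(timestamps) < min_hits:
        return False
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return pstdev(gaps) <= tolerance_seconds

# Made-up page counts per visit; one visit is clearly out of line.
visits = {"v1": 4, "v2": 6, "v3": 3, "v4": 4200}
print(heavy_visits(visits))

# A hit at 10 minutes past the hour, every hour, versus irregular human hits.
start = datetime(2007, 6, 1, 0, 10)
hourly = [start + timedelta(hours=h) for h in range(6)]
human = [start + timedelta(minutes=m) for m in (0, 2, 135, 136, 660)]
print(is_periodic(hourly), is_periodic(human))
```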

Adding to the complexity of bot detection, I’ve noticed that bots evolving in the modern age of Web 2.0 also:

  • Execute javascript
  • Enter the site from various referrers using various methods
  • Come from different IP addresses and subnets
  • Repeatedly hit one page, repeatedly!
  • Spoof their user agents, thus not identifying themselves
  • Take cookies (they are cookie monsters).

Remember, not all bots are bad!  You want the Googlebot and other useful spiders to crawl your site. 

Let’s take a closer look at the nuances of new-wave bots in Part 2.

More Thoughts on Web Analytics, Social Networking, and Social Networks….

I’ve been taking a deeper look at the trends in social networking and the analysis of social networks using nodes (such as taxonomy) and ties (such as clickstream data).   A few concepts from networking theory are intriguing me, and I figured I’d bring them up here to see if anyone has any thoughts:

  • Betweenness. Identifies the degree to which a node in a social network is interrelated to another node.  Identifying degrees of betweenness in taxonomy nodes and combining with “normal” analytics data could enable the analyst to:
    • Detect nodes with the most betweenness to identify content that should be *automatically* served when a visitor interacts with a related taxonomy node (extending site optimization technologies)
    • Determine misappropriated editorial agenda and withering products by contrasting the “popularity” of nodes with the most or least betweenness.
  • Clustering.   A concept used to express how visits relate to core taxonomy nodes could:
    • Provide a means for visualizing how visitor segments cluster around particular pages or nodes in a taxonomy
    • Enable the analyst to visualize the broad content themes that drive the most visits
  • Density.  Certain bloggers and site pages tend to see larger numbers of repeat visitors, comments, or maximized time-based metrics when compared to other pages.  Can a metric for “content density” of a site be calculated?  Perhaps by crafting an equation from counting objects in a taxonomy node, value-scoring each object, and seeing which objects were interacted with most frequently?
  • Influence.  The guideline is 99% lurk and 1% influence.  Can we gauge visit “influence” and visualize it from:
    • Pathing where visitors who have performed the most/least interactions and contributions “go next” off-site.
    • Value scoring an “influence metric” for Interactions, Contributions, posts and comments, and off-site exit links in each visit, then adding up the values to calculate a new influence-based KPI measurement per visit.  Finally comparing the “influence metric” across all visits.
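
To show what value scoring an “influence metric” might look like, here is a minimal Python sketch.  The weights and the per-visit counts are invented for illustration; choosing real weights is exactly the open question I’m raising.

```python
# Hypothetical weights: how much each kind of action counts toward influence.
WEIGHTS = {
    "interactions": 1,
    "contributions": 3,
    "posts_and_comments": 5,
    "exit_links": 2,
}

def influence_score(visit):
    """Add up the weighted action counts for a single visit."""
    return sum(weight * visit.get(name, 0) for name, weight in WEIGHTS.items())

# Made-up visits: one lurker and one active contributor.
visits = {
    "lurker": {"interactions": 2},
    "contributor": {
        "interactions": 5,
        "contributions": 1,
        "posts_and_comments": 2,
        "exit_links": 1,
    },
}
scores = {visit_id: influence_score(visit) for visit_id, visit in visits.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranked)
```

Comparing the scores across all visits is then just a matter of sorting or bucketing them.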

If you are still following me ( :-) ), what I’m working at understanding and reconciling is whether social network analysis theory when combined with web analytics can illuminate the analyst with new ways for thinking about a web site. 

By combining a rules-based approach to processing this type of data, the possibility for automatic content targeting and the idea of a “living site” self-optimizing based on visitor interactions with taxonomy nodes or site objects becomes closer to reality.  The potential to use analytics data and social networking theory for building and realizing new combinations of product, content, and design becomes possible.  For example, I could create rules and logic commanding my CMS to fill a “related topics” module or widget on a particular page with content from nodes that have the smallest amount of betweenness and the greatest density.
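
A rule like that could be sketched in Python as follows.  This assumes the betweenness and density scores have already been computed for each taxonomy node; the node names and numbers are made up, and ranking by betweenness first with density as the tie-breaker is just one possible way to combine the two.

```python
def related_topics(nodes, slots=2):
    """Pick nodes with the smallest betweenness, breaking ties by greatest density."""
    ranked = sorted(nodes, key=lambda node: (node["betweenness"], -node["density"]))
    return [node["name"] for node in ranked[:slots]]

# Hypothetical taxonomy nodes with precomputed scores.
nodes = [
    {"name": "politics", "betweenness": 0.40, "density": 0.9},
    {"name": "gardening", "betweenness": 0.05, "density": 0.7},
    {"name": "chess", "betweenness": 0.05, "density": 0.2},
    {"name": "sports", "betweenness": 0.10, "density": 0.8},
]
print(related_topics(nodes))
```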

It’s clear that social networking impacts web analytics. Most major analytics vendors don’t seem to be thinking about applying (or how to apply) concepts from social networking.  I’m looking forward to vendors bringing social network theory into their technologies, perhaps by combining rules-based algorithms for site optimization with existing analytics data and new, open APIs (for example, Facebook’s new API or LinkedIn’s forthcoming API) to drive profitable revenue from new and existing channels.
