<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.3.2" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>
<channel>
	<title>Comments on: Part 1: Spiders, Bots, Page Views, and Web Analytics, Oh My!</title>
	<link>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html</link>
	<description>Judah Phillips, Web Analytics Practitioner at Web Analytics Demystified</description>
	<pubDate>Thu,  7 Aug 2008 20:09:41 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.3.2</generator>
		<item>
		<title>By: Judah Phillips at Web Analytics Demystified &#187; Blog Archive &#187; AVG Link Scanner Bot Executes JavaScript!!!!!!</title>
		<link>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-2446</link>
		<dc:creator>Judah Phillips at Web Analytics Demystified &#187; Blog Archive &#187; AVG Link Scanner Bot Executes JavaScript!!!!!!</dc:creator>
		<pubDate>Sat, 21 Jun 2008 07:11:33 +0000</pubDate>
		<guid>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-2446</guid>
		<description>[...] been talking about JS executing bots screwing with web data for about a year now.  SEOMoz and the folks at SlickSurface confirmed it quite recently (quoting me no less in their [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] been talking about JS executing bots screwing with web data for about a year now.  SEOMoz and the folks at SlickSurface confirmed it quite recently (quoting me no less in their [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bad Bots Confound Web Analytics By Executing Javascript Tags &#187; Slicksurface - Tech, Design &#38; SEO Blog</title>
		<link>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-2310</link>
		<dc:creator>Bad Bots Confound Web Analytics By Executing Javascript Tags &#187; Slicksurface - Tech, Design &#38; SEO Blog</dc:creator>
		<pubDate>Thu, 15 May 2008 21:43:15 +0000</pubDate>
		<guid>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-2310</guid>
		<description>[...] Judah Phillips wrote a post last July where he talked about seeing a similar phenomenon. He said he sees bots doing the following: [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] Judah Phillips wrote a post last July where he talked about seeing a similar phenomenon. He said he sees bots doing the following: [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Judah Phillips at Web Analytics Demystified &#187; Blog Archive &#187; Part 1: Your Web Analytics Data Quality Stinks and Here&#8217;s Why!</title>
		<link>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-328</link>
		<dc:creator>Judah Phillips at Web Analytics Demystified &#187; Blog Archive &#187; Part 1: Your Web Analytics Data Quality Stinks and Here&#8217;s Why!</dc:creator>
		<pubDate>Sat, 11 Aug 2007 16:55:57 +0000</pubDate>
		<guid>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-328</guid>
		<description>[...] and Bots.  If you haven&#8217;t read my series on spiders and bots, check it out.  Non-human traffic can inflate your metrics and diminish the predictive power of [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] and Bots.  If you haven&#8217;t read my series on spiders and bots, check it out.  Non-human traffic can inflate your metrics and diminish the predictive power of [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Judah Phillips at Web Analytics Demystified &#187; Blog Archive &#187; Web Analytics and Data Collection: The Page Tag</title>
		<link>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-263</link>
		<dc:creator>Judah Phillips at Web Analytics Demystified &#187; Blog Archive &#187; Web Analytics and Data Collection: The Page Tag</dc:creator>
		<pubDate>Mon, 23 Jul 2007 15:46:33 +0000</pubDate>
		<guid>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-263</guid>
		<description>[...] maintain your filtering, but I don&#8217;t really think page tags are infallible at all.  Bots and spiders get [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] maintain your filtering, but I don&#8217;t really think page tags are infallible at all.  Bots and spiders get [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Judah Phillips at Web Analytics Demystified &#187; Blog Archive &#187; Part 2: The Crawlers, Spiders, and Bots of Web Analytics 2.0</title>
		<link>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-160</link>
		<dc:creator>Judah Phillips at Web Analytics Demystified &#187; Blog Archive &#187; Part 2: The Crawlers, Spiders, and Bots of Web Analytics 2.0</dc:creator>
		<pubDate>Fri, 06 Jul 2007 19:14:28 +0000</pubDate>
		<guid>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-160</guid>
		<description>[...] Part 1, I blogged about a huge challenge in web analytics: bot, spider, and crawler detection, analysis, and filtering.  In some regards it is [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] Part 1, I blogged about a huge challenge in web analytics: bot, spider, and crawler detection, analysis, and filtering.  In some regards it is [&#8230;]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Steve</title>
		<link>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-155</link>
		<dc:creator>Steve</dc:creator>
		<pubDate>Thu, 05 Jul 2007 11:04:08 +0000</pubDate>
		<guid>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-155</guid>
		<description>Nooooooooooooo! My website! It's been *Judah/.ed*!!! The poor server. It's melting!!!!
;-)

*cough*. Sorry. Couldn't resist really. Probably should have. Too many flashbacks to Wizard of Oz.


Actually wasn't awffull, was one of the others: "visitors". The problem I had was logs that had Apache mod_usertrack cookies embedded to better identify uniques. But no tools existed to actually make use of that - well, none that I had access to. I'd written a pretty nasty Perl script. Slow. Overly complicated.
So I took that base and rewrote it in C, and that became "visitors".

Most of the code is wrapper. The core logic is quite small.
Basically check to see if we get more than one bite at a given cookie. See it twice or more, then it's probably good. 
I can take you through the detail if you're really bored. Or: Read the Source. ;-)
I use comments, it's not ... too scary.

But part of making "visitors" work was *really* understanding how people and browsers use a/the site. It's not perfect, but I am seeing a remarkable similarity between its numbers and GA's (we switched it on a few weeks ago). Which did wonders for my ego. ;-)

I haven't, yet, pulled this logic or code into awffull. Intend to. Looking more like being sooner than later, but too much crufty old code that needs trimming and cleaning first.

Cheers!</description>
		<content:encoded><![CDATA[<p>Nooooooooooooo! My website! It&#8217;s been *Judah/.ed*!!! The poor server. It&#8217;s melting!!!!<br />
 <img src='http://judah.webanalyticsdemystified.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>*cough*. Sorry. Couldn&#8217;t resist really. Probably should have. Too many flashbacks to Wizard of Oz.</p>
<p>Actually wasn&#8217;t awffull, was one of the others: &#8220;visitors&#8221;. The problem I had was logs that had Apache mod_usertrack cookies embedded to better identify uniques. But no tools existed to actually make use of that - well, none that I had access to. I&#8217;d written a pretty nasty Perl script. Slow. Overly complicated.<br />
So I took that base and rewrote it in C, and that became &#8220;visitors&#8221;.</p>
<p>Most of the code is wrapper. The core logic is quite small.<br />
Basically check to see if we get more than one bite at a given cookie. See it twice or more, then it&#8217;s probably good.<br />
I can take you through the detail if you&#8217;re really bored. Or: Read the Source. <img src='http://judah.webanalyticsdemystified.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /><br />
I use comments, it&#8217;s not &#8230; too scary.</p>
<p>But part of making &#8220;visitors&#8221; work was *really* understanding how people and browsers use a/the site. It&#8217;s not perfect, but I am seeing a remarkable similarity between its numbers and GA&#8217;s (we switched it on a few weeks ago). Which did wonders for my ego. <img src='http://judah.webanalyticsdemystified.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>I haven&#8217;t, yet, pulled this logic or code into awffull. Intend to. Looking more like being sooner than later, but too much crufty old code that needs trimming and cleaning first.</p>
<p>Cheers!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Judah</title>
		<link>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-151</link>
		<dc:creator>Judah</dc:creator>
		<pubDate>Wed, 04 Jul 2007 18:08:52 +0000</pubDate>
		<guid>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-151</guid>
		<description>Jacques:  When looking at visit metrics for bot identification, I'd look for a large, equal number of page views and visits during a brief time period.  But one needs to be careful because one IP address can be shared by multiple agents! 

Peter: I agree this is a huge problem for accuracy and analysis.  Thank you for sharing your opinion, insights, and learnings. Your tip is right on regarding images, and that resource is very useful.  Keep up the excellent work on OWA!  As a major proponent of open source, I think what you're doing is fantastic and gaining gravity. :)

Steve:  I hear you about the "vibe," but I think that would depend on the type of user needs that the site satisfies.  For example, content sites that are aggregated and scraped because they are juicy with good content can see bot traffic in excess of 70%, in my own observations - that affects the pattern, if you will... but only if you don't filter... :)

Good advice on the headers too.  It also echoes Peter's recommendation above... Hard-earned learning from your work on AWFFULL (http://www.stedee.id.au/awffull). ;) 
  
The "bot's intent" as a counterpoint to the "searcher's intent" for the modern web, eh?

Mat:  Good points. One of the reasons I use the tools I do for web analytics is that SEOers need to know how relevant searchbots crawl sites. It's amazingly helpful information.  

I also agree that value may be generated simply from excluding the traffic and not reporting on it!  Exclusionary rules are powerful.  Indeed.

Thanks to all of you for commenting and reading my blog! </description>
		<content:encoded><![CDATA[<p>Jacques:  When looking at visit metrics for bot identification, I&#8217;d look for a large, equal number of page views and visits during a brief time period.  But one needs to be careful because one IP address can be shared by multiple agents! </p>
<p>Peter: I agree this is a huge problem for accuracy and analysis.  Thank you for sharing your opinion, insights, and learnings. Your tip is right on regarding images, and that resource is very useful.  Keep up the excellent work on OWA!  As a major proponent of open source, I think what you&#8217;re doing is fantastic and gaining gravity. <img src='http://judah.webanalyticsdemystified.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Steve:  I hear you about the &#8220;vibe,&#8221; but I think that would depend on the type of user needs that the site satisfies.  For example, content sites that are aggregated and scraped because they are juicy with good content can see bot traffic in excess of 70%, in my own observations - that affects the pattern, if you will&#8230; but only if you don&#8217;t filter&#8230; <img src='http://judah.webanalyticsdemystified.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>Good advice on the headers too.  It also echoes Peter&#8217;s recommendation above&#8230; Hard-earned learning from your work on AWFFULL (http://www.stedee.id.au/awffull). <img src='http://judah.webanalyticsdemystified.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>The &#8220;bot&#8217;s intent&#8221; as a counterpoint to the &#8220;searcher&#8217;s intent&#8221; for the modern web, eh?</p>
<p>Mat:  Good points. One of the reasons I use the tools I do for web analytics is that SEOers need to know how relevant searchbots crawl sites. It&#8217;s amazingly helpful information.  </p>
<p>I also agree that value may be generated simply from excluding the traffic and not reporting on it!  Exclusionary rules are powerful.  Indeed.</p>
<p>Thanks to all of you for commenting and reading my blog!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mat</title>
		<link>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-148</link>
		<dc:creator>Mat</dc:creator>
		<pubDate>Tue, 03 Jul 2007 15:21:11 +0000</pubDate>
		<guid>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-148</guid>
		<description>Also important is the differentiation of what you do with your "bot" traffic.  As Steve pointed out, MSIE will archive your site - that technically may behave as a bot - but what value do you get from identifying and reporting on it as a bot?

It has been my experience that bot traffic is fairly useful stuff when it comes to search-bid analysis, identifying which companies to spend your SEO dollars with, etc... so I would want to separate out ACTUAL bot traffic to know who is crawling my site the most (and ergo has the best, latest, deepest knowledge of my content), whereas I'd rather just forget about MSIE altogether and use an exclusionary rule to keep a few records from being written to my database.</description>
		<content:encoded><![CDATA[<p>Also important is the differentiation of what you do with your &#8220;bot&#8221; traffic.  As Steve pointed out, MSIE will archive your site - that technically may behave as a bot - but what value do you get from identifying and reporting on it as a bot?</p>
<p>It has been my experience that bot traffic is fairly useful stuff when it comes to search-bid analysis, identifying which companies to spend your SEO dollars with, etc&#8230; so I would want to separate out ACTUAL bot traffic to know who is crawling my site the most (and ergo has the best, latest, deepest knowledge of my content), whereas I&#8217;d rather just forget about MSIE altogether and use an exclusionary rule to keep a few records from being written to my database.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Steve</title>
		<link>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-147</link>
		<dc:creator>Steve</dc:creator>
		<pubDate>Mon, 02 Jul 2007 20:40:24 +0000</pubDate>
		<guid>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-147</guid>
		<description>An excellent posting Judah! Speaking as a practitioner of the fine art of log analysis. ;-)

Several thoughts:
* I find robots inflate my numbers but rarely change the ... pattern or vibe if you will. Do get unusual spikes, but generally they just increase numbers.
* Bots that hit with regularity are usually "is the Internet Alive?" monitoring type tools. If they didn't ask upfront, I block them at the firewall. :-)
* I have a funky AWK script/regular expression that does a fair job. Been meaning to wrap it in a proper Perl script with a "bot per line" config/include file. Time... need more time...


Jacques: We have 4500 people that come from the same IP address at work. I've worked in places where that number gets into the high tens of thousands. All from a single IP address. Be careful of your assumptions. I usually cross-check against cookies stored in the logs, and whether the cookies match on images and pages.

Which, incidentally, Judah, is the biggest giveaway I've found. Bots don't request images, CSS, etc. as much as people do. Some tricky HTTP headers that force browsers to reload, say, certain images or CSS can go a long way towards helping identify bots, which won't produce a 304 for the image or CSS.

Fortunately it's minor, but "Intent" of the 'bot can matter. If I use MSIE to spider and archive a copy of your site, is that a bot? If I use wget to achieve the same thing? Tricky... :-)


Cheers!</description>
		<content:encoded><![CDATA[<p>An excellent posting Judah! Speaking as a practitioner of the fine art of log analysis. <img src='http://judah.webanalyticsdemystified.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>Several thoughts:<br />
* I find robots inflate my numbers but rarely change the &#8230; pattern or vibe if you will. Do get unusual spikes, but generally they just increase numbers.<br />
* Bots that hit with regularity are usually &#8220;is the Internet Alive?&#8221; monitoring type tools. If they didn&#8217;t ask upfront, I block them at the firewall. <img src='http://judah.webanalyticsdemystified.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /><br />
* I have a funky AWK script/regular expression that does a fair job. Been meaning to wrap it in a proper Perl script with a &#8220;bot per line&#8221; config/include file. Time&#8230; need more time&#8230;</p>
<p>Jacques: We have 4500 people that come from the same IP address at work. I&#8217;ve worked in places where that number gets into the high tens of thousands. All from a single IP address. Be careful of your assumptions. I usually cross-check against cookies stored in the logs, and whether the cookies match on images and pages.</p>
<p>Which, incidentally, Judah, is the biggest giveaway I&#8217;ve found. Bots don&#8217;t request images, CSS, etc. as much as people do. Some tricky HTTP headers that force browsers to reload, say, certain images or CSS can go a long way towards helping identify bots, which won&#8217;t produce a 304 for the image or CSS.</p>
<p>Fortunately it&#8217;s minor, but &#8220;Intent&#8221; of the &#8216;bot can matter. If I use MSIE to spider and archive a copy of your site, is that a bot? If I use wget to achieve the same thing? Tricky&#8230; <img src='http://judah.webanalyticsdemystified.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>Cheers!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Peter Adams</title>
		<link>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-146</link>
		<dc:creator>Peter Adams</dc:creator>
		<pubDate>Mon, 02 Jul 2007 20:34:57 +0000</pubDate>
		<guid>http://judah.webanalyticsdemystified.com/2007/07/part-1-spiders-bots-page-views-and-web-analytics-oh-my.html#comment-146</guid>
		<description>This is actually more of a problem than people care to admit. One way to avoid these newfangled web 2.0 spiders is to trigger tracking and cookie setting off of a special image request. I've found that while these bugs execute JavaScript, they almost never request images...

Also, Gary Keith's Browscap project (http://browsers.garykeith.com/) is a must-have component of any web analytics / spider detection effort.</description>
		<content:encoded><![CDATA[<p>This is actually more of a problem than people care to admit. One way to avoid these newfangled web 2.0 spiders is to trigger tracking and cookie setting off of a special image request. I&#8217;ve found that while these bugs execute JavaScript, they almost never request images&#8230;</p>
<p>Also, Gary Keith&#8217;s Browscap project (http://browsers.garykeith.com/) is a must-have component of any web analytics / spider detection effort.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
