Web Analytics and Data Collection: Log File Analysis
Log file analysis in web analytics is misunderstood! With so many options for data collection in web analytics, the log file has been somewhat disdained for many years. In fact, I’d venture to guess that the majority of web analytics practitioners working today don’t “do log file analysis.” That’s unfortunate. One can learn a lot about web analytics from examining log files - for example, “total time online” (with all its inherent issues :).
Why is server-side data collection (log file analysis) less popular than client-side page tagging? It’s not about bots, accuracy, or ease of deployment… that’s marketing spin, somewhat true and false depending on context. It may be because most web analytics practitioners currently use a hosted model.
Page tagging enables ASP models for web analytics. It would be very difficult (i.e. resource intensive and costly) for a web analytics ASP to maintain a competitive service level if the ASP relied on its clients and their IT teams to maintain server-side data collection: maintaining, transporting, and synching log files.
Imagine the ASP model with log files! It’s nightmarish and makes little sense: hundreds and thousands of customers with just as many (or more) different servers behind firewalls generating many different log file formats. Then, at various intervals all customers sending uniquely formatted logs for the ASP to parse, index, and make available for querying and retrieval in close to real time. Yikes!!
Page tagging creates a standard data structure for the ASP and a manageable data transport mechanism while minimizing complexity and reducing infrastructure cost in a way that log files just can’t. Page tags enable ASPs to meet rigorous service level agreements and contractual obligations for data processing in a way they probably couldn’t with log files.
So who does log file analysis for web analytics? The answer is companies that have:
- No desire to give up control of a web analytics infrastructure to young companies. Web data is valuable and log files enable you to keep it all for however long and at whatever cost your business justifies. You can process your data using whatever tool you want, and reprocess at your discretion.
- Existing business dimensions and structured data already present in log files. Simple web analytics concepts like page views, visits, and “uniques” become more meaningful when sliced by relevant business objects from your company’s existing log data. For example, if you’ve passed a unique parameter in your query strings for years, you can program your log file tool to recognize the parameter, then reprocess your existing data to make use of it in your reporting.
- Deep data mining experience. Data miners want data to mine, not static GUIs or “my data” delivered in summary extracts at an incremental cost. Log files can be parsed and indexed into the database that you, the company or the client, actually own at the budget you actually set.
- The know-how to employ hybrid data collection. Log files are complemented by page tags for instrumenting pages to track events and Web 2.0.
- Complex integration requirements. All the data exists on machines in your network. Log files can be extended to incorporate new business dimensions.
- A need to understand how spiders and bots behave on their site and the performance of their server(s). Log files capture everything that happens on the web site. The “kitchen_sink.html” is in there. It’s up to you to customize the tool to meet the quality level you demand for filtered data. As Mat and Steve mentioned in the comments, logs also enable you to view server errors, useful for site QA and debugging.
I’m sure there are other reasons too…. I’ve heard people tell me that:
- It’s hard to move log files. That depends on how many log files you have, what technology you are using to move them, and how often.
- Log files are difficult to parse. Not really, especially if they are standardized. You just need a flexible tool. If things get non-standard, you need a tool that enables you to define your own formatting.
- Log files take up a lot of space. Yes, they do. How much space do you have? What are your company’s existing standards for parsing log files? How long will you store them? You may find out you have plenty of “space.” You will never have enough time though.
- Log files aren’t as accurate as page tags. Huh? That all depends on how well you filter and exclude pages. Accurate compared to what? To unvalidated data existing on servers you don’t see in some data center thousands of kilometers/miles away? Logs capture visits from browsers that don’t execute javascript and thus will not appear in the metrics from tag-based systems…
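On the “difficult to parse” point, here is a minimal sketch in Python of what parsing a standardized log looks like. It assumes your servers write the common Combined Log Format; the function name `parse_line` and the sample line are made up for illustration, and a real tool would also need to handle escaped quotes and custom fields.

```python
import re
from datetime import datetime

# Combined Log Format:
# host ident user [time] "request" status bytes "referrer" "user-agent"
# The referrer and user-agent fields are optional so that plain
# Common Log Format lines still match.
COMBINED_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    hit = match.groupdict()
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    hit["status"] = int(hit["status"])
    hit["bytes"] = 0 if hit["bytes"] == "-" else int(hit["bytes"])
    # The request field normally looks like: METHOD path PROTOCOL
    parts = hit["request"].split()
    if len(parts) >= 2:
        hit["method"], hit["path"] = parts[0], parts[1]
    else:
        hit["method"], hit["path"] = None, None
    return hit

# Illustrative line only; the addresses and URLs are placeholders.
sample = ('203.0.113.7 - - [10/Oct/2007:13:55:36 -0700] '
          '"GET /products/index.html?cid=42 HTTP/1.1" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/5.0"')
hit = parse_line(sample)
print(hit["path"], hit["status"], hit["referrer"])
```

Lines that do not match come back as `None`, which gives you a simple hook for counting and inspecting non-standard entries instead of silently dropping them.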
So, you might be thinking I am a big fan of log files. I am a web analytics realist. Challenges with log files include:
- Standardizing on a common format. If you have a web server, you are at risk for someone doing something that creates a custom log file format. If you are the new web analyst coming into an existing company, you may find custom fields in the logs. Develop and champion a company standard. It will make your life as a web analyst somewhat easier.
- Creating a single “core” configuration that can be extended to deal with one-off formats. If you live in a world of custom log files, you can abstract a set of core fields, describe them, then extend the description when you deal with custom craziness.
- Synchronizing/transporting log files. The more servers you have, the more chances you have for failures in synchronization. Application monitoring and recovery procedures are critical to preventing downtime. Develop them.
- Reprocessing log files. Depending on your system, reprocessing existing log file data can have a myriad of effects - from data not being available, to data being partially available, to errors, to long waiting periods before the data is current and ready to query.
- Receiving log files that are missing key fields or with key fields that have changed. If you depend on systems, companies, and people that you do not control, fields can be stripped and formats changed in the blink of an eye. The web team needs to detect and manage these events.
- Monitoring the operational workflow. Someone on the web analytics team needs to manage the system and maintain the overall availability and integrity of log file processing and any issues that could result in the data not being synched or processed correctly.
- Managing change. Since all your “hits” are captured in the log file, you’ll need to tell your web analytics tool how to correctly process the logs and remove data you don’t want to report. You’ll need to identify and maintain lists of filetype exclusions and the user agents/IPs of robotic traffic.
- Finding staff that “gets it.” The rule is that success in web analytics results from having competent people who have experience understanding the risks and rewards of web analytics data collection and executing on a web analytics strategy that generates value.
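The filetype and robot exclusions mentioned under “Managing change” can be sketched roughly like this in Python. The lists below are illustrative placeholders, not a recommended set, and `is_reportable` is a hypothetical helper that expects each hit as a dict with `path`, `agent`, and `host` keys.

```python
import posixpath
from urllib.parse import urlsplit

# Placeholder lists: every site maintains its own, and they change over time.
EXCLUDED_EXTENSIONS = {".css", ".js", ".gif", ".jpg", ".png", ".ico"}
BOT_AGENT_TOKENS = ("bot", "crawler", "spider", "slurp")
BOT_IPS = {"192.0.2.10"}

def is_reportable(hit):
    """Return True if a hit should count toward page-level reporting."""
    # Ignore the query string and letter case when checking the file type.
    path = urlsplit(hit["path"]).path.lower()
    extension = posixpath.splitext(path)[1]
    if extension in EXCLUDED_EXTENSIONS:
        return False
    # The user agent may be missing from the log entirely.
    agent = (hit.get("agent") or "").lower()
    if any(token in agent for token in BOT_AGENT_TOKENS):
        return False
    return hit.get("host") not in BOT_IPS

hits = [
    {"path": "/index.html", "agent": "Mozilla/5.0", "host": "203.0.113.7"},
    {"path": "/img/logo.GIF?v=2", "agent": "Mozilla/5.0", "host": "203.0.113.7"},
    {"path": "/index.html", "agent": "Googlebot/2.1", "host": "198.51.100.4"},
]
page_hits = [h for h in hits if is_reportable(h)]
print(len(page_hits))
```

Keeping the lists as plain data rather than burying them in logic makes it easier to review and update them as new bots and file types show up.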
As a web analyst you need to use the tools at your disposal to build your web analytics practice. Log files are part of your toolkit. Be wary of people who try to sway you one way or the other when it comes to web analytics data collection. The method you ultimately use should be based on the data you want to collect in the context of the company’s goals, budget, and expertise.
Mat added the following ...
Judah - excellent post. This is something that I see becoming more and more prevalent. Three other items that I think are important to note when thinking of web server log file analysis:
1) Log files give you things that page tags don’t necessarily include, for instance error codes. With page tags, your web application needs an error handling page that is tagged and returns some form of error to your WA tool; log files include this by default in the status code field.
2) Log files are something that your organization already has. If you’re just getting started in web analytics, or starting to measure a new site, this is a vendor-neutral method of finding out what is going on. There’s no overhead from the standpoint of finding someone to proactively go in and tag a site (or worse, retroactively tag a site).
3) Log files are terrific for folks that want to optimize their overall deployments through the use of a hybrid collection method. With proper requirements gathering and due diligence on the part of the team(s) involved, all of the necessary data points required for the metrics of interest to the company can be identified. If those items exist in the log file, you would simply parse the log to retrieve those. If they do not, then it goes on the list of items that could be measured through either a code change in the site or through the implementation of a page tag. This sort of forethought and analysis lowers the overall cost of ownership and the level of impact necessary from an organizational standpoint.
Diane A added the following ...
As Steve says: “The one I find very useful is to be able to easily update assumptions. Go back to year dot and regenerate and extract even more insights.”
This is one of the main reasons we use log file analysis as our primary analytics solution.
Judah added the following ...
Steve: Good points! And excellent comments as usual. It’s obvious you are a log file master. An AWFFull one at that.
Thanks for sharing!
Mat: Excellent stuff. These bullet points are right on, and your comments about TCO are very smart. Do you work in Web Analytics? Heh.
Diane: I do hybrid data collection, as Mat alluded to, which gives me the ability to rerun data and test assumptions. I hear you loud and clear and totally agree! Thanks so much for commenting and reading my blog!
Steve added the following ...
Heh.
I’m sure Tony, he who came up with the name - my suggestions were very pedestrian and dull - gets a warm inner glow every time someone uses the pun as intended.
“What a junk program!”
No not junk, just awful.
BWHAHAHahahahahahah.
Cheers!
Kevin added the following ...
Logs are also an important data source in your analytics mix for spider & bot understanding and wireless/WAP analytics.
Judah added the following ...
Steve: That’s awfful-ly funny stuff! Hahaha
Kevin: You are absolutely right-on about wireless and WAP analytics (and bots). The Event model for web analytics also fits nicely with the card/deck metaphors in the mobile experience. Thanks for commenting!
angie added the following ...
I’m a bit behind on my reading, but wanted to chime in on this one anyway, since I spent several years working for an ASP-model analytics vendor before page tags were even invented (yes I am a dinosaur). ASP with logs can be a bit of a pain, but it’s very doable.
Part of the bad reputation for logfile “inaccuracy” stems from two things:
1) The caching issue is real, and where it’s most likely to bite you is when assessing navigational issues on the site. I’ve had numerous chances to directly compare logfile analysis with page tag analysis for the same site in the same date range as we transitioned customers from logs to tags (and even on my own site now). As an example, if you rely on logfiles to tell you how many times someone is clicking through a search results page before finding the “right” result, your server logs will under-report views of the results page since subsequent clicks to that page are served from the browser’s cache. So while you think people are getting a good result on the first or second click, the reality might be that they’re clicking through 6 or 7 results before finding what they want (or giving up).
2) Sessionization can be problematic. When a new visitor comes to your site, a cookie doesn’t get set until AFTER the request is made (and logged). This means that first request doesn’t have a cookie associated with it. This could result in those pages being orphaned into their own visits (increasing single-page visit counts, skewing entry pages, etc.). Or your analysis can try to figure out to which visit that page should belong, which opens up a whole other can of worms. I would guess that this issue isn’t even on the radar of most people who do server log analysis.
Other than that, there is a lot of value to logs, especially since tagging for non-page entities can be such a pain, and the logs get written whether or not you knew ahead of time you were going to analyze them.
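To make the sessionization issue concrete, here is a rough Python sketch of one way a tool might try to attach a cookie-less first request to the visit that follows it. Everything here is an assumption for illustration: the 30-minute visit timeout, the 60-second adoption window, and matching on IP address plus user agent, which can be wrong behind proxies and shared addresses (the “can of worms” mentioned above).

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)   # assumed inactivity cutoff
ADOPT_WINDOW = timedelta(seconds=60)    # assumed gap for adopting an orphan hit

def sessionize(hits):
    """Group hits (dicts with time, host, agent, cookie) into visits."""
    visits = []        # each visit is a list of hits
    by_cookie = {}     # cookie value -> index of that visitor's latest visit
    orphans = {}       # (host, agent) -> index of a visit with no cookie yet
    for hit in sorted(hits, key=lambda h: h["time"]):
        client = (hit["host"], hit["agent"])
        cookie = hit.get("cookie")
        if cookie is None:
            # No cookie logged: keep these hits together by IP + user agent.
            idx = orphans.get(client)
            if idx is None or hit["time"] - visits[idx][-1]["time"] > VISIT_TIMEOUT:
                visits.append([])
                idx = len(visits) - 1
                orphans[client] = idx
            visits[idx].append(hit)
            continue
        idx = by_cookie.get(cookie)
        if idx is not None and hit["time"] - visits[idx][-1]["time"] <= VISIT_TIMEOUT:
            visits[idx].append(hit)
            continue
        # New visit for this cookie: adopt a very recent orphan from the
        # same client instead of leaving it as a single-page visit.
        orphan_idx = orphans.get(client)
        if (orphan_idx is not None
                and hit["time"] - visits[orphan_idx][-1]["time"] <= ADOPT_WINDOW):
            idx = orphans.pop(client)
        else:
            visits.append([])
            idx = len(visits) - 1
        by_cookie[cookie] = idx
        visits[idx].append(hit)
    return visits

t0 = datetime(2007, 10, 10, 13, 0, 0)
example = [
    {"time": t0, "host": "203.0.113.7", "agent": "UA", "cookie": None, "path": "/"},
    {"time": t0 + timedelta(seconds=5), "host": "203.0.113.7", "agent": "UA",
     "cookie": "abc", "path": "/search"},
    {"time": t0 + timedelta(minutes=45), "host": "203.0.113.7", "agent": "UA",
     "cookie": "abc", "path": "/contact"},
]
for visit in sessionize(example):
    print([h["path"] for h in visit])
```

In this example the cookie-less entry page is folded into the visit that starts five seconds later, and the hit 45 minutes on becomes a separate visit.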
Steve added the following ...
Angie, 1 minor discussion; and 1 minor correction.
#1 can be a pain. I do agree. Flip side - Does it actually matter? Yes there are cases where it does, but there are also plenty of cases where it doesn’t. You can change the expires headers on the webserver to help overcome *some* of these issues, but that isn’t perfect either.
The counter argument - for me I stress, ymmv, - is that I’ve found that page tagging isn’t telling me anything that I couldn’t/can’t get from simple referral field analysis out of the logs. The numbers are so close as to not matter. And they surely follow the same pattern. ie The end actions I take haven’t changed. So does it matter that one is “not accurate” if I can draw the same conclusions anyway?
Obviously YMMV.
Flip side, I personally block all page tagging in my browsing. So if you rely on Page tagging? I’m invisible - you don’t even know I’m there to see how much trouble I’m having with your navigation. At least with logs I have a better chance of seeing *something*.
#2 is not correct. Your assumption depends on where you set the cookie. I have 4+ years’ worth of logs that are full of cookies sent as part of the initial response from the webserver. See Apache’s mod_usertrack:
http://httpd.apache.org/docs/2.0/mod/mod_usertrack.html
You’ll note what they call that in the docco.
This is another way of helping to identify robots too - mainly as *most* ‘bots will reject cookies. So you end up with logs full of totally unique cookies. Given most people download an image/css as well as a page hit, you can thus help identify “not people”. So to speak.
Not a perfect recogniser alone, but is another data point to help cross verify.
I’d go so far as to suggest that anyone who does log analysis *MUST* enable this style of tracking.
HTH?
Cheers!
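The cookie-based robot hint described in this comment could be sketched as follows in Python. It assumes the tracking cookie is already parsed out of each log entry into a `cookie` field, and the `min_hits` cutoff is an arbitrary illustration. The function name is hypothetical, and as noted it is only one data point for cross verification, not a complete recogniser.

```python
from collections import defaultdict

def suspected_cookie_rejecters(hits, min_hits=5):
    """Flag clients whose every request carries a different tracking cookie.

    A client that keeps cookies reuses one value across page, image and
    CSS requests; a client that rejects them is handed a fresh value on
    each request, so all of its logged cookies are unique.
    """
    cookies_by_client = defaultdict(list)
    for hit in hits:
        cookies_by_client[(hit["host"], hit["agent"])].append(hit["cookie"])
    flagged = set()
    for client, cookies in cookies_by_client.items():
        if len(cookies) >= min_hits and len(set(cookies)) == len(cookies):
            flagged.add(client)
    return flagged

# Made-up data: one client never returns its cookie, the other always does.
hits = (
    [{"host": "198.51.100.9", "agent": "ExampleBot/1.0", "cookie": f"c{i}"}
     for i in range(5)]
    + [{"host": "203.0.113.7", "agent": "Mozilla/5.0", "cookie": "abc"}
       for _ in range(5)]
)
print(suspected_cookie_rejecters(hits))
```

People who block cookies in their browser would be flagged too, so the output is better treated as a list to review than as a filter to apply blindly.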
angie added the following ...
Steve, you’re probably more knowledgeable about analytics than many of the webmasters I’ve dealt with in the past… If they are set up to log a cookie on the first hit then #2 isn’t an issue, but in my experience working with a variety of customers, that’s a big big “if.” I’m all for making that a law.
Unfortunately, a lot of customers I worked with didn’t log their cookies at all, so part of getting a new customer set up used to be asking the business owner to talk to their IT guys about getting them logged. These customers weren’t stupid, this just wasn’t something they dealt with on a regular basis.
I’d like to think that knowledge of this stuff has come a long way since then, but I know there are plenty of companies out there that understand they need to measure their online activity, but this entire conversation would be way over their heads.
If you are fortunate enough to be analyzing a site without a bunch of redirects then you can indeed get some info out of the referrer field and skirt some of the caching issue. I’m not certain which tools (I assume the average analyst does not grep and cut raw logfiles) base their click-stream information on the referrer field and which base it on the request fields sorted in chronological order… if the latter then you’ll get some really wacky and impossible paths because of caching. I currently work on a site that uses a bunch of redirects so my referrer fields are practically useless.
Judah added the following ...
Angie and Steve:
Your comments are incredible and really add value to the blog and to the people who read it. Thanks for taking the time to respond thoughtfully.
The caching issue can be real, but certain tools can overcome the challenges by using web server caching directives that instruct the browser to check to see if the page has been modified, then log the 304 response and use that for pathing…
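As a small illustration of the 304 approach in Python: if the server’s caching directives make the browser revalidate each page, a revisit shows up in the log as a 304 rather than vanishing into the cache, and the analysis can count it as a view. The status set and the `page_views` helper below are assumptions for the sketch, not how any particular tool works.

```python
from collections import Counter

# Treat both a full response and a "Not Modified" revalidation as a view.
VIEW_STATUSES = {200, 304}

def page_views(hits):
    """Count views per path, including revalidated (304) requests."""
    return Counter(h["path"] for h in hits if h["status"] in VIEW_STATUSES)

hits = [
    {"path": "/results", "status": 200},
    {"path": "/results", "status": 304},   # back button, revalidated
    {"path": "/results", "status": 304},
    {"path": "/missing", "status": 404},
]
print(page_views(hits)["/results"])
```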
The cookie issue can be real too, but that really depends on your web server configuration. Some web analytics tools even offer server plugins (a time tested method) to prevent the sessionization issue.
Steve: Regarding bots and cookies. Some bots will take them. I see bots from different subnets using the same cookie.
Angie: I hear you about the redirect issue. I’m investigating some methods for doing that right now. There is hope!
Also, I realize you can run an ASP using log files, but I think it would be harder to maintain SLAs and provide the near real-time reporting. You aren’t a dinosaur either! Keep in mind I was doing “search” when they called it “information retrieval.” Google didn’t exist! Wait a minute… We’re ALL dinosaurs in Internet years…
Thanks for commenting!
Five Thousand Times Two Translates Into Goodness » Occam’s Razor by Avinash Kaushik added the following ...
[...] Gócza: Daddy I am not sure I agree with Avinash’s choice of Page Tagging over Log Files. (The making of a future…!) [...]
Maxwell added the following ...
hi i enjoyed the read



Steve added the following ...
Ooo! Ooo! Pick Me! Pick Me! (Sorry. Too many repeat viewings of Shrek by our 4yo lately…)
Other positives with logs. Typing from personal experience:
* Performance Data. Can be logged inline directly with “analytics” data, i.e. I can segment (and have segmented) on people from ISP X and appreciate how we perform for them, e.g. time to send a page, or image, or…
The big plus here is we avoid going anywhere near very $$$ third party monitoring services.
* Easy integration with underlying system performance. All the data is essentially on the same server anyway. Just pull it together. Easy! Good for all sorts of troubleshooting. Which in turn helps drive a more positive end user experience.
* The one I find very useful is to be able to easily update assumptions. Go back to year dot and regenerate and extract even more insights. This is great for us, may be less useful for others.
* It is much more difficult to deliberately hide from logs vs simply disabling Javascript. eg A bunch of folk who visit my personal website use the browsers: wget, lynx or links. I do it myself. They will forever be hidden from page tagging. It’s much less likely they will be hidden from logs. Flip side, I actually found it incredibly difficult to verify that the page tagging I was building for GA on our Dev server actually worked. All the browsers I was testing with defaulted to simply not “there” as far as GA was concerned. Ended up having to fire up WinXP in a VM and run MSIE!
* “Unusual Events”. I have set our system to alert on excessive numbers of X’s in a given period. It could be argued (accurately) this is more monitoring vs analytics. Either way, the whole point, again, is to deliver a positive experience to end users. Fix problems before they become problems.
* Errors. I’m unaware of accurate Page Tagging tracking 4xx and 5xx server errors? Happy to be corrected. Again, can segment and analyse against errors. This also ties in nicely with QA.
* One nice trick is to use page tagging *as well* and drop the results into the self same logs. Aka your hybrid, Judah!
Awstats has a plugin for this type of thing, as does an Open Source click/heatmap tool I’ve recently discovered.
* All that extremely important back end stuff: Monitoring, Capacity Planning, Disaster Recovery. You *need* logs to do this.
* Easily extensible. This can also be a downside as you point out Judah.
It is not at all hard to add extra data to be logged - well, in Apache anyway.
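The “Errors” and “Unusual Events” points above can be combined in a rough Python sketch: count 4xx and 5xx responses per hour and report the buckets that reach a threshold. The hourly bucket, the threshold of 3, and the `error_spikes` name are all assumptions for illustration; the hits are expected as dicts with a `time` (datetime) and a `status` (int).

```python
from collections import Counter
from datetime import datetime

def error_spikes(hits, threshold=3):
    """Return {(hour, status_class): count} for error counts at or above threshold."""
    per_hour = Counter()
    for hit in hits:
        if hit["status"] >= 400:
            hour = hit["time"].replace(minute=0, second=0, microsecond=0)
            # status_class is 4 for 4xx errors and 5 for 5xx errors.
            per_hour[(hour, hit["status"] // 100)] += 1
    return {key: count for key, count in per_hour.items() if count >= threshold}

# Made-up sample: three server errors in one hour, one 404 in the next.
hits = [
    {"time": datetime(2007, 10, 10, 13, 5), "status": 500},
    {"time": datetime(2007, 10, 10, 13, 20), "status": 503},
    {"time": datetime(2007, 10, 10, 13, 45), "status": 500},
    {"time": datetime(2007, 10, 10, 13, 50), "status": 200},
    {"time": datetime(2007, 10, 10, 14, 10), "status": 404},
]
for (hour, status_class), count in error_spikes(hits).items():
    print(hour, f"{status_class}xx", count)
```

Because the same parsed hits feed both reporting and alerting, the errors can be segmented by any other logged field, such as path or user agent.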
Against Logging?
* Tagging purports to be immune to caching - which is one of the bigger issues with logs. I am … unconvinced. It would be trivial for me to cache a select group of ASP tag calls for all users on my network. And if I was an ISP that thought I could shave a few dollars off my bandwidth charges? That’d be a great way to do it.
* We had to set some pretty funky HTTP headers to stop most caching effects with our pages. Not perfect by any stretch. This has an obvious downside in requiring either re-downloads or 304’s.
* 2GB Limit.
Actually hit this a few years back. Completely smashed our pipeline log rotation program. Lost several days’ worth of logs on that.
We now rotate much sooner and keep an eye out for repeat adventures.
Hmm. Better stop there. Don’t want a comment to be bigger than the original. That may negatively affect your entrance into the Top 10.
Cheers!
PS This comment was written in 100% vim. *Notepad*!?!?!?! Sigh.