Web Analytics and Data Collection: Log File Analysis
Log file analysis in web analytics is misunderstood! With so many options for data collection in web analytics, the log file has been somewhat disdained for many years. In fact, I’d venture to guess that the majority of web analytics practitioners working today don’t “do log file analysis.” That’s unfortunate. One can learn a lot about web analytics from examining log files - for example, “total time online” (with all its inherent issues :).
Why is server-side data collection (log file analysis) less popular than client-side page tagging? It’s not about bots, accuracy, or ease of deployment… that’s marketing spin, somewhat true and false depending on context. It may be because most web analytics practitioners currently use a hosted model.
Page tagging enables ASP models for web analytics. It would be very difficult (i.e. resource intensive and costly) for a web analytics ASP to maintain a competitive service level if the ASP relied on its clients and their IT teams to maintain server-side data collection: maintaining, transporting, and synching log files.
Imagine the ASP model with log files! It’s nightmarish and makes little sense: hundreds or thousands of customers with just as many (or more) different servers behind firewalls generating many different log file formats. Then, at various intervals, all of those customers send uniquely formatted logs for the ASP to parse, index, and make available for querying and retrieval in close to real time. Yikes!!
Page tagging creates a standard data structure for the ASP and a manageable data transport mechanism while minimizing complexity and reducing infrastructure cost in a way that log files just can’t. Page tags enable ASPs to meet rigorous service level agreements and contractual obligations for data processing in a way they probably couldn’t with log files.
So who does log file analysis for web analytics? The answer is companies that have:
- No desire to give up control of a web analytics infrastructure to young companies. Web data is valuable and log files enable you to keep it all for however long and at whatever cost your business justifies. You can process your data using whatever tool you want, and reprocess at your discretion.
- Business dimensions and structured data already existing in log files. Simple web analytics concepts like page views, visits, and “uniques” become more meaningful when sliced by relevant business objects from your company’s existing log data. For example, if you’ve passed a unique parameter in your query strings for years, you can program your log file tool to recognize the parameter. Then reprocess your existing data to make use of it in your reporting.
- Deep data mining experience. Data miners want data to mine, not static GUIs or “my data” delivered in summary extracts at an incremental cost. Log files can be parsed and indexed into the database that you, the company or the client, actually own at the budget you actually set.
- The know-how to employ hybrid data collection. Log files are complemented by page tags for instrumenting pages to track events and Web 2.0.
- Complex integration requirements. All the data exists on machines in your network. Log files can be extended to incorporate new business dimensions.
- A need to understand how spiders and bots behave on their site and the performance of their server(s). Log files capture everything that happens on the web site. The “kitchen_sink.html” is in there. It’s up to you to customize the tool to meet the quality level you demand for filtered data. As Mat and Steve mentioned in the comments, logs also enable you to view server errors, useful for site QA and debugging.
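To make the query-string point above concrete, here is a minimal sketch in Python of slicing page views by a parameter that already lives in your logged URLs. The parameter name "campaign", the sample URLs, and the function name are all made up for illustration; your own logs and tool will differ.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

def page_views_by_param(request_urls, param="campaign"):
    """Count page views grouped by the value of one query-string parameter.

    "campaign" is a hypothetical parameter name used only for this example.
    """
    counts = Counter()
    for url in request_urls:
        query = urlsplit(url).query
        values = parse_qs(query).get(param)
        # Requests that do not carry the parameter are grouped under "(none)".
        counts[values[0] if values else "(none)"] += 1
    return counts

# Made-up request URLs, as they might appear in the request field of a log.
urls = [
    "/products/widget.html?campaign=spring&ref=email",
    "/products/widget.html?campaign=spring",
    "/index.html",
    "/products/gadget.html?campaign=fall",
]
views = page_views_by_param(urls)
```

Because the parameter has been sitting in the raw logs all along, the same function can be run over years of archived data, which is the reprocessing advantage described above.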
I’m sure there are other reasons too…. I’ve heard people tell me that:
- It’s hard to move log files. That depends on how many log files you have, what technology you are using to move them, and how often.
- Log files are difficult to parse. Not really, especially if they are standardized. You just need a flexible tool. If things get non-standard, you need a tool that enables you to define your own formatting.
- Log files take up a lot of space. Yes, they do. How much space do you have? What are your company’s existing standards for parsing log files? How long will you store them? You may find out you have plenty of “space.” You will never have enough time though.
- Log files aren’t as accurate as page tags. Huh? That all depends on how well you filter and exclude pages. Accurate compared to what? To unvalidated data existing on servers you don’t see in some data center thousands of kilometers/miles away? Logs capture visits from browsers that don’t execute JavaScript and thus will not appear in the metrics from tag-based systems…
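On the “difficult to parse” objection: here is a rough sketch of parsing one line in the widely used NCSA Combined Log Format with a regular expression. The sample line and the group names are my own illustrative choices, and a real tool would need to handle more edge cases (escaped quotes, for instance). For a non-standard format, you would pass in your own pattern.

```python
import re

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line, pattern=COMBINED):
    """Return a dict of named fields, or None if the line does not match."""
    match = pattern.match(line)
    return match.groupdict() if match else None

# A made-up log line for illustration.
sample = (
    '203.0.113.7 - - [10/Oct/2008:13:55:36 -0700] '
    '"GET /index.html HTTP/1.1" 200 2326 '
    '"http://www.example.com/start.html" '
    '"Mozilla/5.0 (compatible; ExampleBrowser)"'
)
hit = parse_line(sample)
```

Lines that do not match come back as None rather than raising an error, so you can count and inspect them instead of silently dropping data.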
So, you might be thinking I am a big fan of log files. I am a web analytics realist. Challenges with log files include:
- Standardizing on a common format. If you have a web server, you are at risk of someone doing something that creates a custom log file format. If you are the new web analyst coming into an existing company, you may find custom fields in the logs. Develop and champion a company standard. It will make your life as a web analyst somewhat easier.
- Creating a single “core” configuration that can be extended to deal with one-off formats. If you live in a world of custom log files, you can abstract a set of core fields, describe them, then extend the description when you deal with custom craziness.
- Synchronizing/transporting log files. The more servers you have, the more chances you have for failures in synchronization. Application monitoring and recovery procedures are critical to preventing downtime. Develop them.
- Reprocessing log files. Depending on your system, reprocessing existing log file data can have a myriad of effects - from data not being available, to data being partially available, to errors, to long waiting periods before the data is current and ready to query.
- Receiving log files that are missing key fields or with key fields that have changed. If you depend on systems, companies, and people that you do not control, fields can be stripped and formats changed in the blink of an eye. The web team needs to detect and manage these events.
- Monitoring the operational workflow. Someone on the web analytics team needs to manage the system, maintain the overall availability and integrity of log file processing, and handle any issues that could result in the data not being synched or processed correctly.
- Managing change. Since all your “hits” are captured in the log file, you’ll need to tell your web analytics tool how to correctly process the logs and remove data you don’t want to report. You’ll need to identify and maintain lists of filetype exclusions and the user agents/IPs of robotic traffic.
- Finding staff that “gets it.” The rule is that success in web analytics results from having competent people who have experience understanding the risks and rewards of web analytics data collection and executing on a web analytics strategy that generates value.
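A couple of these challenges can be sketched in a few lines of Python: a “core” field set extended for a custom format, a check for hits that arrive with key fields missing, and a filter for file types and robotic user agents. Every name and list here (CORE_FIELDS, session_id, the extensions, the bot markers) is a hypothetical placeholder, not a recommended standard; real exclusion lists need ongoing maintenance.

```python
import os
from urllib.parse import urlsplit

# Hypothetical core field set, extended for one custom log format.
CORE_FIELDS = {"host", "time", "request", "status", "user_agent"}
CUSTOM_FIELDS = CORE_FIELDS | {"session_id"}

# Illustrative exclusion lists only.
EXCLUDED_EXTENSIONS = {".gif", ".jpg", ".png", ".css", ".js", ".ico"}
BOT_MARKERS = ("bot", "crawler", "spider")

def missing_fields(hit, required=CORE_FIELDS):
    """Return the required fields that are absent or empty in a parsed hit."""
    return sorted(f for f in required if not hit.get(f))

def is_reportable(hit):
    """True if the hit is a page request from a non-robotic user agent."""
    parts = hit.get("request", "").split()
    path = urlsplit(parts[1]).path if len(parts) >= 2 else ""
    extension = os.path.splitext(path)[1].lower()
    agent = hit.get("user_agent", "").lower()
    is_bot = any(marker in agent for marker in BOT_MARKERS)
    return extension not in EXCLUDED_EXTENSIONS and not is_bot

# Made-up parsed hits for illustration.
base = {"host": "203.0.113.7", "time": "10/Oct/2008:13:55:36 -0700", "status": "200"}
hits = [
    dict(base, request="GET /index.html HTTP/1.1", user_agent="Mozilla/5.0"),
    dict(base, request="GET /images/logo.gif HTTP/1.1", user_agent="Mozilla/5.0"),
    dict(base, request="GET /index.html HTTP/1.1", user_agent="ExampleBot/1.0"),
]
reportable = [h for h in hits if is_reportable(h)]
gaps = missing_fields(hits[0], CUSTOM_FIELDS)
```

Here only the first hit survives the filter, and checking it against the extended field set flags session_id as missing. Running a check like this on every incoming batch is one way to notice a stripped or renamed field before it quietly corrupts your reports.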
As a web analyst you need to use the tools at your disposal to build your web analytics practice. Log files are part of your toolkit. Be wary of people who try to sway you one way or the other when it comes to web analytics data collection. The method you ultimately use should be based on the data you want to collect in the context of the company’s goals, budget, and expertise.






