Web Analytics and the Normal Distribution: More on Statistics and Web Data
Is web analytics data normally distributed? That question calls for another question: what web analytics variables are you measuring? That matters. Numeric random variables (let’s call them data) are classified into the following types:
- Discrete. That means you count it. The data arrives from a counting process. In web analytics, discrete random variables are counts of things like page views, visits, and unique visitors.
- Continuous. That means you measure it. The data arrives from a measurement process. In web analytics continuous random variables are time-based metrics.
We do both in web analytics, don’t we? We count some things. We measure some stuff. And if we’re smart and have the autonomy and positional power to do so, we apply process to counting and measuring web analytics data.
We often talk about “counting” and “measuring” as if they were the same activity. In general day-to-day online business, that’s no big deal for conceptual conversations. But in statistics, “counting” is different from “measuring.”
Both discrete and continuous variables may be represented by probability distributions to assess the likelihood of an outcome. To model probability for discrete variables, a common starting point is the “binomial distribution,” which describes how many times an outcome occurs in a fixed number of trials. Binomial distributions take into account the probability that an outcome will occur, so when that probability is far from 50% you may see some skewing when plotting the data that may make it look a bit “long tail.”
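To make the skewing concrete, here is a minimal sketch of the binomial probabilities for a made-up scenario: 50 visits, each converting with some probability. The visit count, the two rates, and the function names are assumptions chosen for illustration, not figures from real data.

```python
from math import comb, sqrt


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_skewness(n: int, p: float) -> float:
    """Theoretical skewness of a binomial(n, p) distribution."""
    return (1 - 2 * p) / sqrt(n * p * (1 - p))


# Hypothetical scenario: 50 visits, each converting with probability `rate`.
n_visits = 50
for rate in (0.02, 0.5):
    pmf = [binomial_pmf(k, n_visits, rate) for k in range(n_visits + 1)]
    most_likely = max(range(n_visits + 1), key=lambda k: pmf[k])
    print(
        f"rate={rate}: most likely count={most_likely}, "
        f"skewness={binomial_skewness(n_visits, rate):.2f}"
    )
```

With a rate of 0.5 the skewness is zero and the plot is symmetric. With a low rate, the probability piles up near zero and trails off to the right, which is the “long tail” look.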
For continuous random variables, use the “normal distribution.” Realize your data won’t always look exactly like a bell curve. If it looks really different and “long tail,” you may be looking at a discrete variable better suited for a binomial distribution.
So is web analytics data “normally distributed?” The answer is that it depends on the type of data. Even then, the answer is “probably not.” In fact, most business data doesn’t follow a perfectly normal distribution. Yet every day in halls of academia, very book smart people teach statistics and tell you to apply it to business data. Are they wrong? Insane? Misguided?
No, they aren’t (well, maybe you have to be slightly insane to teach stats). Academics realize that most distributions are not normal and do not have equal measures of central tendency (i.e., mode, median, and mean). Skewness abounds! The normal distribution, however, can be used to approximate “real-world” distributions whose measures of central tendency differ.
A theorem called the “central limit theorem” states that “if the sum of the variables has a finite variance, then it will be approximately normally distributed (i.e., following a normal or Gaussian distribution).” In other words, as the sample gets larger, the distribution of the sample mean can be approximated by the normal distribution, even when the underlying data is skewed. And if I remember correctly, the usual rule of thumb is that with a sample size of at least thirty, the sampling distribution of the mean will be approximately normal, though heavily skewed data may need more. Fortunately, we web analysts often have millions of data points to use…
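A quick simulation shows the central limit theorem at work. This is only a sketch under stated assumptions: the visit durations are invented (exponentially distributed with a mean of 120 seconds), and the helper names and sample counts are arbitrary choices rather than anything from a real site.

```python
import random
import statistics


def skewness(values):
    """Population skewness: the third standardized moment."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(42)  # fixed seed so the run is repeatable


def draw_visit_duration():
    """Hypothetical visit duration in seconds (exponential, mean 120)."""
    return rng.expovariate(1 / 120)


def sample_means(sample_size, n_samples=4000):
    """Means of n_samples independent samples of the given size."""
    return [
        statistics.fmean([draw_visit_duration() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


raw = [draw_visit_duration() for _ in range(20000)]
means_30 = sample_means(30)
means_300 = sample_means(300)

print(f"skewness of raw durations:     {skewness(raw):.2f}")
print(f"skewness of means (n = 30):    {skewness(means_30):.2f}")
print(f"skewness of means (n = 300):   {skewness(means_300):.2f}")
```

The raw durations are strongly skewed, while the sample means are much closer to symmetric, and larger samples tighten the approximation further. In this made-up case, thirty is a starting point rather than a guarantee.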
Some time ago I took average visit duration for one site, for which I have real data covering thousands of visits, and ran a Lilliefors test of normality. The test found no evidence that the data wasn’t normally distributed, even though it looked a bit odd: the skewness was 0.741426 and the kurtosis was 4.1525665.
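For readers who want to try something similar, here is a rough sketch of the Lilliefors statistic using only the Python standard library. It is not the tool used for the result above. The data is simulated, the function names are made up for illustration, and the 0.886/sqrt(n) cutoff is the commonly tabulated large-sample approximation for a 5% significance level, so treat it as a guide rather than an exact threshold. A statistics package (for example, the `lilliefors` function in `statsmodels.stats.diagnostic`) would report a p-value directly.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def lilliefors_statistic(values):
    """Largest gap between the empirical CDF and a normal CDF whose
    mean and standard deviation are estimated from the same data."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(fmean(data), stdev(data))
    d = 0.0
    for i, x in enumerate(data):
        cdf = fitted.cdf(x)
        # Check the gap just after and just before the step at x.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def approx_critical_value(n):
    """Approximate 5% critical value for large samples (assumption)."""
    return 0.886 / math.sqrt(n)


rng = random.Random(7)
# Simulated stand-ins for visit durations in seconds (not real data).
bell_shaped = [rng.gauss(180, 40) for _ in range(2000)]
long_tailed = [rng.expovariate(1 / 180) for _ in range(2000)]

for label, sample in (("bell-shaped", bell_shaped), ("long-tailed", long_tailed)):
    d = lilliefors_statistic(sample)
    cutoff = approx_critical_value(len(sample))
    print(f"{label}: D={d:.4f}, approx. 5% cutoff={cutoff:.4f}")
```

A statistic above the cutoff is evidence against normality. A statistic below it means the test found no such evidence, which is not the same as proving the data is normal.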
If you’re thinking about applying statistics to web data, make sure you identify whether the data you are looking at is discrete or continuous. Don’t abandon the normal distribution for certain types of web analytics data just because it doesn’t exactly look like the Liberty Bell. Test it for normality before applying Gaussian statistics. If the data is highly skewed, determine whether the level of error is within acceptable limits. Look at using other distributions for discrete variables.
Image from http://www.weibull.com/



