Judah Phillips is an experienced web analytics practitioner and Internet expert currently working as a Senior Director at a large, global Internet company. His blog is full of useful, unbiased, actionable insights learned from the real-world practice of a process-oriented, integrated approach to strategic Web Analytics for improving business performance.

Web Analytics and the Normal Distribution: More on Statistics and Web Data

Is web analytics data normally distributed?  That question calls for another question: what web analytics variables are you measuring?  That matters.  Numeric random variables (let’s call them data) are classified into the following types:

  • Discrete.  That means you count it.  The data arrives from a counting process.  In web analytics, discrete random variables are counts of things like page views, visits, and unique visitors.
  • Continuous.  That means you measure it.  The data arrives from a measurement process.  In web analytics, continuous random variables are time-based metrics such as visit duration.

We do both in web analytics, don’t we?  We count some things.  We measure some stuff.  And if we’re smart and have the autonomy and positional power to do so, we apply process to counting and measuring web analytics data. 

We often talk about “counting” and “measuring” as if they were the same activity.  In general day-to-day online business, that’s no big deal for conceptual conversations.  But in statistics, “counting” is different from “measuring.”

Both discrete and continuous variables may be represented by probability distributions to assess the likelihood of an outcome.  To model probability for discrete variables, use a discrete distribution such as the “binomial distribution,” which gives the probability of seeing a given number of outcomes (conversions, say) in a fixed number of trials.  When the probability of the outcome on each trial is small, the plotted data is skewed and may look a bit “long tail.”
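To make the binomial distribution concrete, here is a minimal sketch in Python. The scenario (20 visits, each converting independently with probability 0.05) and the name binomial_pmf are assumptions for illustration only, not figures from a real site.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_visits = 20      # hypothetical number of visits
p_convert = 0.05   # hypothetical chance that any one visit converts

# Probability of every possible conversion count from 0 to n_visits.
pmf = [binomial_pmf(k, n_visits, p_convert) for k in range(n_visits + 1)]

for k in range(5):
    print(f"P({k} conversions) = {pmf[k]:.4f}")
```

With a small per-visit probability, most of the mass sits on the lowest counts and the rest trails off to the right, which is the skew described above.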

For continuous random variables use the “normal distribution.”  Realize your data won’t always look exactly like a bell curve.  If it looks really different and “long tail,” you may be looking at a discrete variable better suited to a binomial distribution.

So is web analytics data “normally distributed?”  The answer is that it depends on the type of data.  Even then, the answer is “probably not.” In fact, most business data doesn’t follow a perfectly normal distribution.  Yet every day in halls of academia, very book smart people teach statistics and tell you to apply it to business data.  Are they wrong?  Insane?  Misguided?

No, they aren’t (well, maybe you have to be slightly insane to teach stats).  Academics realize that most distributions are not normal and do not have equal measures of central tendency (i.e., mode, median, mean).  Skewness abounds!  The normal distribution, however, can be used to approximate “real-world” distributions that have different measures of central tendency.

A theorem called the “central limit theorem” states that “if the sum of the variables has a finite variance, then it will be approximately normally distributed (i.e., following a normal or Gaussian distribution).”  In other words, as a sample of independent observations gets larger, the distribution of the sample mean can be approximated by the normal distribution.  A common rule of thumb among statisticians is that with a sample size of at least thirty, the sampling distribution of the mean will be approximately normal, although heavily skewed data can need more.  Fortunately, we web analysts often have millions of data points to use…
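The central limit theorem can be seen in a small simulation. The sketch below is an illustration under stated assumptions: the “visit durations” are simulated from an exponential distribution with a mean of 60 seconds (a strongly right-skewed shape chosen for the example, not real site data), and the helper names are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd**3)


rng = random.Random(42)  # fixed seed so the run is repeatable


def draw_duration():
    """One simulated visit duration in seconds (assumed exponential, mean 60)."""
    return rng.expovariate(1 / 60)


# 5,000 individual values versus 5,000 means of samples of size 30.
raw = [draw_duration() for _ in range(5000)]
means_of_30 = [
    statistics.fmean([draw_duration() for _ in range(30)]) for _ in range(5000)
]

print(f"skewness of raw values:     {skewness(raw):.2f}")
print(f"skewness of means (n = 30): {skewness(means_of_30):.2f}")
```

The individual values stay heavily skewed, while the means of samples of thirty should come out far closer to symmetric, though not perfectly so.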

Some time ago I took average visit duration for one site, for which I have real data covering thousands of visits, and ran a Lilliefors test of normality.  The test found no evidence that the data wasn’t normally distributed, even though it looked a bit odd: the skewness was 0.741426 and the kurtosis was 4.1525665.
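For readers who want to try something similar, here is a rough sketch of the same kind of check using only the Python standard library. It is not the procedure or the data behind the numbers above: the durations are simulated, the kurtosis is the plain (non-excess) version for which a normal curve gives 3, and the cutoff 0.886/√n is the commonly tabulated large-sample approximation to the 5% Lilliefors critical value. All names are hypothetical.

```python
import random
import statistics
from math import sqrt


def moments(values):
    """Return (skewness, kurtosis); a normal curve gives roughly (0, 3)."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    skew = sum((v - mean) ** 3 for v in values) / (n * sd**3)
    kurt = sum((v - mean) ** 4 for v in values) / (n * sd**4)
    return skew, kurt


def lilliefors_statistic(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    xs = sorted(values)
    n = len(xs)
    fitted = statistics.NormalDist(statistics.fmean(xs), statistics.stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i + 1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


rng = random.Random(7)
durations = [rng.gauss(180, 40) for _ in range(2000)]  # simulated, not real data

skew, kurt = moments(durations)
d_stat = lilliefors_statistic(durations)
critical = 0.886 / sqrt(len(durations))  # approximate 5% cutoff, large n (assumption)

print(f"skewness = {skew:.3f}, kurtosis = {kurt:.3f}")
print(f"Lilliefors D = {d_stat:.4f}, approximate 5% critical value = {critical:.4f}")
```

A D statistic below the cutoff means the test finds no evidence against normality at that level; it does not prove the data is normal.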

If you’re thinking about applying statistics to web data, make sure you identify whether the data you are looking at is discrete or continuous.  Don’t abandon the normal distribution for certain types of web analytics data just because it doesn’t exactly look like the Liberty Bell.  Test it for normality before applying Gaussian statistics.  If the data is highly skewed, determine whether the level of error is within acceptable limits.  Look at using other distributions for discrete variables.

[Figure: normaldist.gif. Image from http://www.weibull.com/]

Akin Arikan added the following ...

Hi Judah,

Very educational post, thank you! If I read this right, we would prefer that data conform to a normal distribution because only then do the methods of determining statistical significance with confidence intervals work? Is that really right? I was wondering, so I am glad to have come across your post.

Akin

Judah added the following ...

Hi Akin:

Thanks!

It would be preferable that data conform to the normal distribution when applying Gaussian statistics. But in reality it’s rare for business data to be “normally distributed.” As long as sample sizes aren’t too small and there isn’t too much skewing in the population, Gaussian methods (confidence intervals, t, z, r, hypothesis testing, ANOVA and so on) have validity. Simply looking at a histogram or a box-and-whisker plot will provide a good indicator of normality. More complex tests exist too.

It’s very important to make sure you understand the assumptions of whatever test you use. If you apply “normal” stats to “non-normal” data, then errors will result. It’s up to the analyst to judge whether those errors are within acceptable limits. But remember, most data isn’t normal, so the Gaussian methods only approximate anyway (some level of error always exists).

It’s worth mentioning that a branch of statistics called “non-parametric” statistics provides methods for analyzing data that isn’t “normal.” In fact many of the common “normal” methods have “non-parametric” equivalents. The t-test is comparable to the Wilcoxon tests and the Mann-Whitney U test. The ANOVA is analogous to the Kruskal-Wallis test.
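As a rough illustration of one of these, here is a minimal sketch of the Mann-Whitney U test in plain Python. The two samples are made-up visit durations for two hypothetical page variants, and the p-value uses the large-sample normal approximation without a tie correction, so treat it as a teaching aid rather than a substitute for a statistics package.

```python
import statistics
from math import sqrt


def mann_whitney_u(a, b):
    """Return (U for sample a, two-sided p-value from the normal approximation)."""
    # Pool both samples, remembering which group each value came from.
    combined = sorted((v, g) for g, sample in enumerate((a, b)) for v in sample)
    rank_sum_a = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # tied values share the mean of ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction (simplification)
    z = (u_a - mu) / sigma
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return u_a, p_value


# Hypothetical visit durations in seconds for two page variants.
variant_a = [12, 35, 48, 51, 60, 75, 90, 240]
variant_b = [30, 66, 95, 110, 130, 180, 300, 620]

u_stat, p = mann_whitney_u(variant_a, variant_b)
print(f"U = {u_stat}, two-sided p (normal approximation) = {p:.3f}")
```

Because the test works on ranks rather than raw values, a single very long visit has far less influence than it would on a t-test.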

Hope that helps! Thanks so much for commenting on my blog! :)

