Web Analytics Blogs

Judah Phillips is an experienced web analytics practitioner and Internet expert currently working as a Director at a large multichannel media company. His blog is full of useful, unbiased, actionable insights learned from the real-world practice of a process-oriented, integrated approach to strategic Web Analytics for improving business performance.

Web Analytics and the Normal Distribution: More on Statistics and Web Data

Is web analytics data normally distributed?  That question calls for another question: what web analytics variables are you measuring?  That matters.  Numeric random variables (let’s call them data) are classified into the following types:

  • Discrete.  That means you count it.  The data arrives from a counting process.  In web analytics discrete random variables are counts of things like page views, visits, and unique visitors.
  • Continuous.  That means you measure it.  The data arrives from a measurement process.  In web analytics continuous random variables are time-based metrics.

We do both in web analytics, don’t we?  We count some things.  We measure some stuff.  And if we’re smart and have the autonomy and positional power to do so, we apply process to counting and measuring web analytics data. 

We often talk about “counting” and “measuring” like they are the same activities.  In general day-to-day online business, that’s no big deal for conceptual conversations.  But in statistics, “counting” is different than “measuring.” 

Both discrete and continuous variables may be represented by probability distributions to assess the likelihood of an outcome.  To model probabilities for discrete variables, one option is a “binomial distribution,” which describes how many times an outcome occurs in a fixed number of trials (for example, conversions out of visits).  Because a binomial distribution depends on the probability that the outcome will occur, you may see some skewing when plotting the data that may make it look a bit “long tail.” 
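To make the binomial idea concrete, here is a minimal Python sketch.  The numbers (200 visits, a 3% chance that any one visit converts) are hypothetical and chosen only for illustration, and the function names are mine, not from any analytics tool.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_cdf(k: int, n: int, p: float) -> float:
    """Probability of at most k successes in n independent trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

# Hypothetical inputs: 200 visits, each converting with probability 0.03
n_visits, p_convert = 200, 0.03

prob_exactly_6 = binomial_pmf(6, n_visits, p_convert)
prob_at_most_6 = binomial_cdf(6, n_visits, p_convert)

print(f"P(exactly 6 conversions) = {prob_exactly_6:.3f}")
print(f"P(at most 6 conversions) = {prob_at_most_6:.3f}")
```

With a small success probability the counts pile up near zero and trail off to the right, which is the skew described above.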

For continuous random variables use the “normal distribution.” Realize your data won’t always look exactly like a bell curve.  If it looks really different and “long tail” you may be looking at a discrete variable better suited for a binomial distribution.    

So is web analytics data “normally distributed?”  The answer is that it depends on the type of data.  Even then, the answer is “probably not.” In fact, most business data doesn’t follow a perfectly normal distribution.  Yet every day in halls of academia, very book smart people teach statistics and tell you to apply it to business data.  Are they wrong?  Insane?  Misguided?

No, they aren’t (well, maybe you have to be slightly insane to teach stats).  Academics realize that most distributions are not normal and do not have equal measures of central tendency (i.e. mode, median, mean).  Skewness abounds!  The normal distribution, however, can be used to approximate “real-world” distributions that have different measures of central tendency. 

A theorem called the “central limit theorem” states that “if the sum of the variables has a finite variance, then it will be approximately normally distributed (i.e., following a normal or Gaussian distribution).” In other words, as the sample gets larger, the distribution of the sample mean can be approximated by the normal distribution.  And if I remember correctly, a common rule of thumb among statisticians is that with a sample size of at least thirty, the sampling distribution of the mean will be approximately normal (heavily skewed data may need more).  Fortunately, we web analysts often have millions of data points to use…
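You can watch the central limit theorem at work with a small simulation.  This is only a sketch under stated assumptions: the “visit durations” are made-up draws from an exponential distribution with a mean of 120 seconds (a deliberately skewed shape), not real site data, and the `skewness` helper is my own.

```python
import random
import statistics

def skewness(values):
    """Population skewness: the average cubed standardized deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

random.seed(42)  # fixed seed so reruns give the same numbers

# Hypothetical skewed data: 30,000 exponential draws, mean 120 seconds
raw = [random.expovariate(1 / 120) for _ in range(30_000)]

# Split into 1,000 samples of size 30 and take the mean of each one
sample_size = 30
sample_means = [
    statistics.fmean(raw[i:i + sample_size])
    for i in range(0, len(raw), sample_size)
]

print(f"skewness of raw values:   {skewness(raw):.2f}")
print(f"skewness of sample means: {skewness(sample_means):.2f}")
```

The raw values are strongly skewed, while the means of samples of thirty are much closer to symmetric.  They are not perfectly symmetric, though, which is why thirty is a rule of thumb rather than a guarantee.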

Some time ago I actually took average visit duration for one site for which I have real data for thousands of visits and did a Lilliefors Test of Normality.  The test found no evidence that the data wasn’t normally distributed even though it looked a bit odd and the skewness was 0.741426 and the kurtosis was 4.1525665. 
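If you want to see what a Lilliefors-style check measures, here is a rough Python sketch.  It computes only the test statistic, the largest gap between the data’s empirical distribution and a normal curve fitted with the data’s own mean and standard deviation.  The two data sets are simulated, the function name is mine, and the 5% cutoff of 0.886/√n is a large-sample approximation I am assuming here; a statistics package would give you a proper p-value.

```python
import math
import random
import statistics

def lilliefors_statistic(values):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    data = sorted(values)
    n = len(data)
    fitted = statistics.NormalDist(statistics.fmean(data), statistics.stdev(data))
    d = 0.0
    for i, x in enumerate(data):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

random.seed(7)  # fixed seed so reruns give the same numbers
n = 2000

# Simulated data (not real visits): one bell-shaped set, one heavily skewed set
normal_like = [random.gauss(120, 30) for _ in range(n)]
skewed = [random.expovariate(1 / 120) for _ in range(n)]

# Assumed large-sample approximation to the 5% critical value
critical_5pct = 0.886 / math.sqrt(n)

for label, data in (("normal-like", normal_like), ("skewed", skewed)):
    d = lilliefors_statistic(data)
    print(f"{label:12s} D = {d:.4f}  (approx. 5% cutoff = {critical_5pct:.4f})")
```

A statistic above the cutoff is evidence against normality.  A statistic below it means only that the test found no evidence, which is how the result above should be read.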

If you’re thinking about applying statistics to web data, make sure you identify whether the data you are looking at is discrete or continuous.  Don’t abandon the normal distribution for certain types of web analytics data just because it doesn’t exactly look like the Liberty Bell.  Test it for normality before applying Gaussian statistics.  If the data is highly skewed, determine whether the level of error is within acceptable limits.  Look at using other distributions for discrete variables.  

normaldist.gif

Image from http://www.weibull.com/

Web Analytics and Statistics, the Normal Distribution, the Transformation Formula, and Cumulative Probabilities… Huh, dude?

Web analytics and statistics can make the head spin (just ask this guy)!  In this blog entry, I’m going to demonstrate applying some light statistical methods to web analytics data using Excel.   After you read this blog entry, I am hoping that you can answer the boss when asked “what’s the chance average time spent on site will be N next week?”  In honor of Avinash’s numbered posts, I’m calling this lesson #1 (not in order of importance).   Remember, 79.314159265% of all statistics are made up on the spot (indeed!), so when you can prove how you calculated your stats, you are a more powerful analyst.

Some who read this blog entry may criticize the Gaussian nature of the following content, but I think the normal probability distribution may be the most important probability distribution in stats, and it has its applicability to some things in web analytics, like time-based metrics (hey, they’re continuous!). Check out Jim Novo for Pareto stuff.

First, a few definitions:

  1. Continuous variable: A variable that can be measured, like Average Time Spent on Site (ATOS), as opposed to discrete variables, which are counts of things, like unique visitors.
  2. Normal distribution: The bell-shaped distribution in which the measures of central tendency (mode, mean, median) coincide, and which has other common characteristics.  It has tails (sometimes, ahem, the tails are long).  But that’s for another post.

Determining probabilities and expected values of ATOS involves calculating things like standard deviations and areas under the normal curve, which formally means integral calculus (a good place to start).  But, good news!  We can use Excel!  It’s really easy to use Excel’s built-in probability functions to solve complex equations like the one we’ll be solving below….

Equation

Let’s answer the Web 2.0-esque question I presented earlier: “what’s the probability the average time spent on site will be less/more than 201 seconds?”  Let’s jam!

Here’s my sample ATOS data in seconds.  Try to use as much data in your ATOS sample from your site as you can:

dataset.jpg 

First, use Excel to calculate the mean of your observations (the average, my good reader, as in “=average” function in Excel!):

Average (mean) = 126.8

Second, use Excel to calculate the standard deviation (“=stdev”, which treats your data as a sample; use “=stdevp” if your data is the whole population):

Standard Deviation = 141.3
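If you would rather do these first two steps outside Excel, the same calculations can be sketched in Python.  The seven values below are hypothetical and are not the data set pictured above, so the results will not match 126.8 and 141.3.  `statistics.stdev` uses the sample formula (like Excel’s STDEV) and `statistics.pstdev` uses the population formula (like STDEVP).

```python
import statistics

# Hypothetical ATOS observations in seconds (not the post's actual data)
atos_seconds = [30, 45, 60, 90, 120, 150, 400]

mean_atos = statistics.fmean(atos_seconds)        # like =AVERAGE
sample_sd = statistics.stdev(atos_seconds)        # like =STDEV (divides by n - 1)
population_sd = statistics.pstdev(atos_seconds)   # like =STDEVP (divides by n)

print(f"mean          = {mean_atos:.1f}")
print(f"sample SD     = {sample_sd:.1f}")
print(f"population SD = {population_sd:.1f}")
```

The two standard deviations differ noticeably on a tiny sample like this one and converge as the number of observations grows.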

Third, apply the transformation formula, which allows you to convert a normal random variable to a standardized normal random variable Z.  Yup, it’s called Z.  :)

The transformation formula:

Z = (N - Average) / Standard Deviation

Substitute your data for N (201 seconds):

Z = (201 - 126.8)/141.3

Z = .525

Now apply the NORMSDIST function to the Z value in Excel to calculate the cumulative probability of that value in the standard normal distribution, like this:

=NORMSDIST(.525)

The NORMSDIST function “returns the standard normal cumulative distribution function. The distribution has a mean of 0 (zero) and a standard deviation of one.”  That means it shows you the probability your data will be less than N in the distribution. 
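For the curious, the standard normal cumulative distribution function is usually written Φ (that symbol is textbook notation, not something Excel uses), and it is the area under the bell curve to the left of Z:

```latex
\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^{2}/2} \, dt
```

That integral has no tidy closed-form answer, which is why a lookup function like NORMSDIST is so handy.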

In this case (rounded):

=NORMSDIST(.525) = .70

Rounding up we find that an ATOS of:

  • 201 seconds has a probability of .70

The .70 tells you that the probability of ATOS being less than 201 seconds is 70%, shown visually below (the light grey).  To determine the probability of being more than 201 seconds, take the complement.  That is, subtract .70 from 1 (i.e. 1-.70=.30).

cumul_density.jpg

Thus, when the boss asks what’s the probability ATOS will be 201 seconds, you can honestly answer “a 70% chance it will be less than 201 seconds, and a 30% chance it will be greater.”  And you can back it up with statistics. 
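The whole calculation can also be checked outside Excel.  Here is a short Python sketch using the mean, standard deviation, and 201-second threshold from this post.  The variable names are mine, and `NormalDist().cdf` from the standard library plays the role of NORMSDIST.

```python
from statistics import NormalDist

mean_atos = 126.8   # seconds, the average from this post
sd_atos = 141.3     # seconds, the standard deviation from this post
threshold = 201     # seconds, the N in the boss's question

# Transformation formula: Z = (N - Average) / Standard Deviation
z = (threshold - mean_atos) / sd_atos

# Standard normal cumulative probability, the equivalent of =NORMSDIST(z)
p_less = NormalDist().cdf(z)
p_more = 1 - p_less  # the complement

print(f"Z = {z:.3f}")                              # Z = 0.525
print(f"P(ATOS < {threshold}s) = {p_less:.2f}")    # 0.70
print(f"P(ATOS > {threshold}s) = {p_more:.2f}")    # 0.30
```

As with the Excel version, the answer is only as good as the assumption that the metric is roughly normally distributed.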

That’s all for today’s lesson, fine readers, of how one can apply web analytics and statistics to measure the Web 2.0 world.  Comments?  Criticisms?