# How to use Standard Deviation as an analysis tool without worrying about the “hard” maths

When examining a dataset, it’s often useful to look at the data through the lens of “cohorts” or “clusters”— with either meaning a subset of the data that shares something in common. I go into detail on cohorts related to common features in my article on data cubes (e.g. performing cohort analysis all objects that are coloured blue), but standard deviation allows us to use the data values themselves to generate cohorts of relatedness.

If you do want to learn the “hard” maths behind Standard Deviation, I’ll direct you to Wikipedia (not point recreating the wheel.) Note — Standard Deviation is straightforward maths, but you don’t even need to think about the underlying mathematics to take advantage of its power to group related datapoints.

Standard Deviation relates to the “deviation” or “distance” of a given value within a dataset to the mean or average value in that dataset. The idea is that the further away from the average a value lies within a dataset, then the more that value deviates from the average and the “normal” amount of variance we can expect from a dataset is known as the standard.

Standard Deviation specifically states that about 2/3 of a given set of values are within standard deviation, i.e. they’re the expected range. We can subdivide into halves of standard deviations if we like or quarters or however precise we want to get if we want to subdivide further.

## Normal Distribution

Before we look at how to perform standard deviation, here’s a picture version of what standard deviation looks like. This is from a dataset of job timings for a maintenance engineer. Jobs take anywhere from 5 minutes to 2 1/2 hours. With a so-called “normal” distribution, we observe the “bell curve” where most of the timings for jobs lie between 35 minutes and 1 hour 35 minutes.

One standard deviation (plus or minus the mean) is 66.27% of the total amount of jobs.
Two standard deviations (plus or minus) covers a further 29.18% up to 95.45%
Three standard deviations (plus or minus) covers a further 4.28% up to 99.73%
Anything beyond these extremes occur rarely enough to be considered outliers

What this allows us to do is use knowledge of the past to predict the future. Let’s say we have, on a given day, 50 hours of engineer maintenance time available, where they are either working on a customer’s premises or travelling between customer premises. The above only informs on how long we can predict the engineer will be spending on a customer’s premises (estimating travelling duration is a different interesting problem to solve).

All other things being equal, we can state than on average, our technical team will be spending two-thirds their time working on jobs lasting between 30 mins and 1.5 hours and one-third of their time working on jobs from 5 mins to 30 mins and from 1.5 hours to 2 hours and a tiny amount of time where the job is less than 5 minutes or greater than about 2 hours.

Once you calculate the SD in which a given value lies, then you can use that to band those jobs together and begin to understand the spread and predict the average work a given technician will have on a given day.

As ever, the more factors you introduce, then the deeper the analysis you can perform. If you really want to dig deeper, then you’ll want to investigate regression trees, a trick for which I’d use R and not Excel, but that’s a subject for another day. Simple cohort analysis using standard deviation (or mathematical subsets thereof) is straightforward in Excel and you will benefit from using it with Normally Distributed data.

Note: Standard Deviation SD is the tool of choice for prediction and cohort analysis if the data has a normal distribution. If you’re dealing with a non-normal data distribution, then SD should be used in conjunction with some other tricks.

Calculating the standard deviation, assuming you’re using Excel is straightforward, I’ll leave it to you to discover how to use the STDEV functions, though I will point out a slight complication.

There are two versions of the Standard Deviation formula, namely STDEV.P and STDEV.S. The difference depends on the source of your data. STDEV.P is used when you have absolutely all the data points to calculate the SD from, e.g. a school class’ list of test scores and all the children were present. If children were off that day, then you have a Sample of the data and should use the STDEV.S version of the formula which subtly changes the calculation.

So STDEV.P for “population” when you have the scores from all the tests and STDEV.S for “sample” when you have the scores from a subset of tests.

--

--