# Running Standard Deviations

Update, 7/13/2013: I’m amazed at the continued staying power of this post, considering that I had originally worked the math out for this 14 years ago. People are still commenting on this and suggesting fixes. I’m also amazed that I’ve peppered enough errors in the math and code for people to still be finding errors 5 years after the fact.

My friend Dan at Invisible Blocks came up with a great way to compute a long-running mean from the count and mean:

count += 1
mean += (x - mean) / count


I remembered that I had come up with a similar thing for standard deviation back when I was developing clustering algorithms that could use that value. It uses a power sum average, where you track the power sum as an average (divide the power sum by n) in a similar way.

n=0
mean=0
pwrSumAvg=0
stdDev=0

def update(x):
n += 1
mean += (x - mean) / n
pwrSumAvg += ( x * x - pwrSumAvg) / n
stdDev = sqrt( (pwrSumAvg * n - n * mean * mean) / (n - 1) )


We get this using the running sum formula for standard deviation and essentially simplifying away the sum by removing n:

In this formula, the core values for the standard deviation is the sum of the squares minus the square of the sum, which we can recreate from the mean and power sum average, respectively:

p: power sum average
m: mean
n: elements

$s=\sqrt{\frac{pn^2 - (mn)^2}{n(n-1)}}$
$s=\sqrt{\frac{pn^2 - m^2n^2}{n(n-1)}}$
$s=\sqrt{\frac{n^2(p - m^2)}{n(n - 1)}}$
$s=\sqrt{\frac{pn - nm^2}{n - 1}}$

Sorry for all the math, but I wanted to make sure I got it right, and that I can prove this later. I’ve already had to reconstruct it once, so I don’t want to do it again. Also, I just figured out how to embed $\LaTeX$ equations in WordPress, which is pretty damn cool.

Another neat thing to do with this has to do with manipulating centroids of vectors. That is, dealing with lists of values that correspond to the same datum. For instance, a row of data can be considered an n-dimensional vector, where each value in the row is a value along that dimension of the vector. The traditional data mining definition of a centroid usually deals with only the mean, but I like to refer to centroid clouds when adding in standard deviations, because rather than a point in the n-dimensional space, we refer to a diffuse cloud of probability when adding in the standard deviation value.

To find a centroid of two value vectors, it is a simple matter to apply the formulas above to create the mean and standard deviations of the centroid by applying them to each value along the vector. Combining a centroid with a value works the same way, where the centroid is a vector of mean and power sum averages, and the count appies to the entire centroid.

A neat part comes in when you try to combine two centroids. The equations and code from before become  more symmetrical than when only adding in one value at a time. A given value, $x$, can be generalized into a mean of $m=x$ and a count of $c=1$. By doing this, we can change our code to be:

class value:
_init_(self, x=0):
self.n=1
self.mean=x
self.pwrSumAvg=x*x
self.stdDev=0

def merge(self, val):
self.mean = (val.mean*val.n + self.mean*self.n)
/ (self.n + val.n)
pwrSumAvg = ( val.pwrSumAvg * val.n + self.pwrSumAvg * self.n)
/ (self.n + val.n)
self.n += val.n
stdDev = sqrt( (pwrSumAvg * n - n * mean * mean) / (n - 1) )


We lose our nice calculation of the mean, but we gain the ability to merge these two centroids together. Note that the calculation for the standard deviation is the same as before, but we change the way the power sum average and the mean are calculated.

Why go through all this trouble? You can now use this for data clustering and also to potentially visualize the centroid clouds with their data scattered through them. In bioinformatics, for example, data is often summarized at many levels. Gene expression probes are grouped into probe sets, and probe sets are often clustered or combined into functional units, or gene sets. Preserving standard deviation at these levels allows for greater clarity into the data.

## 14 thoughts on “Running Standard Deviations”

1. Correct me if I’m wrong, but there seems to be an error in your code, the pwrSumAvg should really be the running sum of squares, not the average of the sum of squares. In the derivation you perform, ‘p’ corresponds to the former, but your code uses the latter.

• Good of you to notice that, sorry that it took this long to respond. There is a problem with using a running sum in code, in that you risk precision loss and/or overflow in floating point values. In the original standard deviation equation, it is shown as a sum of squares. I screwed up in the derivation and in the code, as I needed to expand the power square average back out to the sum of power squares by multiplying it by N. It is fixed in the code and the math.

Thanks!

2. Is not the sequence

n = 0
n += 1
s = … / (n -1)

a guaranteed attempt to divide by zero?
In my once off code I changed it to just /(n) – I guess it does not matter much for large n.

• Standard Deviation is only defined for n > 1, so it shouldn’t be computed when n = 0 or 1.

3. Do you know how your power sum average method of computing variance compares to Welford’s method in terms of accuracy on digital systems? I am currently looking into combining running variances using Welford’s method, but your approach seems very attractive. It could possibly be modified to calculate weighted average and variance as well.

• Two things jump to mind: the PSA keeps the numbers from overflowing, as in Welford’s algorithm, but Welford is a special case where X is a single value. My method has no such assumption, and allows entire centroids to be merged incrementally. This can be useful for certain types of datasets, and I developed it in order to support merges of centroids in hierarchical k-means clustering using full expectation maximization instead of mean-only centroids.

4. Pingback: Running Variance | TaylorTree

5. One of your math steps is incorrect. You factored the n^2 out, but a magic n appeared next to m^2. In the end you get it right though. So the third equation for the stdev should be sqrt((n^2(p-m^2))/n(n-1))

• Thanks for the fix!

6. In all my education and work, StdDev was a parametric statistic. How is it that you never test the distribution via standard tests before presenting StdDev as valid? Yes, I know most distributions are random and thereby qualify, but that begs the question anyway, doesn’t it?

• I actually don’t know of any StdDev implementations that perform distribution checks before performing the calculation. The math works whether or not StdDev is actually meaningful, and the user is generally expected to determine on their own if the statistic is useful to them for their dataset.

How normal does a distribution have to be before someone should be allowed to use this algorithm? The answer will always be situational.

7. This does not produce numerically stable results. The standard solution to this problem has been known since the late 70’s, take a look at the relevant article on Wikipedia (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance). To check the (lack of) stability of your solution, take e.g. a sequence like [0, 0, 0, 0, 0, 1, 1, 1, 1, 1], which has a variance of 0.25. If you offset the whole sequence by a large amount the variance should be unchanged. Good algorithms can handle offsets of up to 1e15 with gradual degradation of precision. Your approach here works only up to 1e7, and for 1e9 produces errors of the dreaded “variance is a negative number” kind.