Running Standard Deviations

Update, 7/13/2013: I’m amazed at the continued staying power of this post, considering that I had originally worked the math out for this 14 years ago. People are still commenting on this and suggesting fixes. I’m also amazed that I’ve peppered enough errors in the math and code for people to still be finding errors 5 years after the fact.

My friend Dan at Invisible Blocks came up with a great way to compute a long-running mean from the count and mean:

count += 1
mean += (x - mean) / count
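
As a quick check, here is that update wrapped in a tiny class and compared against a direct computation (the class name and sample values are just mine for illustration):

class RunningMean:
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        self.count += 1
        self.mean += (x - self.mean) / self.count

rm = RunningMean()
for x in [2.0, 4.0, 6.0, 8.0]:
    rm.update(x)
print(rm.mean)  # 5.0, same as sum([2, 4, 6, 8]) / 4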

I remembered that I had come up with a similar trick for standard deviation back when I was developing clustering algorithms that could use that value. It relies on a power sum average: instead of keeping the raw sum of squares, you track it as an average (the power sum divided by n), updated incrementally in the same way.

from math import sqrt

n = 0
mean = 0.0
pwrSumAvg = 0.0
stdDev = 0.0

def update(x):
    global n, mean, pwrSumAvg, stdDev
    n += 1
    mean += (x - mean) / n
    pwrSumAvg += (x * x - pwrSumAvg) / n
    if n > 1:  # the sample formula divides by n - 1
        stdDev = sqrt((pwrSumAvg * n - n * mean * mean) / (n - 1))
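
An easy way to convince yourself this works is to compare against a batch computation; statistics.stdev is from Python's standard library, and the data values here are arbitrary:

import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
for x in data:
    update(x)
print(stdDev)                  # ~2.138
print(statistics.stdev(data))  # same value, computed in one shot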

We get this by starting from the running-sum formula for the sample standard deviation and simplifying the sums away:

s=\sqrt{\frac{n\sum{x^2} - \left(\sum{x}\right)^2}{n(n-1)}}

In this formula, the core values are the sum of the squares and the square of the sum, which we can recreate from the power sum average and the mean, respectively:

p: power sum average
m: mean
n: number of elements

s=\sqrt{\frac{pn^2 - (mn)^2}{n(n-1)}}
s=\sqrt{\frac{pn^2 - m^2n^2}{n(n-1)}}
s=\sqrt{\frac{n^2(p - m^2)}{n(n - 1)}}
s=\sqrt{\frac{pn - nm^2}{n - 1}}

Sorry for all the math, but I wanted to make sure I got it right, and that I can prove this later. I’ve already had to reconstruct it once, so I don’t want to do it again. Also, I just figured out how to embed \LaTeX equations in WordPress, which is pretty damn cool.

Another neat thing to do with this has to do with manipulating centroids of vectors. That is, dealing with lists of values that correspond to the same datum. For instance, a row of data can be considered an n-dimensional vector, where each value in the row is a value along that dimension of the vector. The traditional data mining definition of a centroid usually deals with only the mean, but I like to refer to centroid clouds when adding in standard deviations, because rather than a point in the n-dimensional space, we refer to a diffuse cloud of probability when adding in the standard deviation value.

To find the centroid of two value vectors, simply apply the formulas above along each dimension of the vector, producing the centroid's mean and standard deviation for each value. Combining a centroid with a single value works the same way: the centroid is a vector of means and power sum averages, and one count applies to the entire centroid. A sketch of this per-dimension bookkeeping follows below.
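
Here is a minimal sketch of that bookkeeping, assuming plain Python lists for the vectors (the Centroid class and its method names are mine, not from any library):

from math import sqrt

class Centroid:
    # One shared count; a mean and power sum average per dimension.
    def __init__(self, vector):
        self.n = 1
        self.means = list(vector)
        self.pwrSumAvgs = [x * x for x in vector]

    def add(self, vector):
        # Apply the running update independently along each dimension.
        self.n += 1
        for i, x in enumerate(vector):
            self.means[i] += (x - self.means[i]) / self.n
            self.pwrSumAvgs[i] += (x * x - self.pwrSumAvgs[i]) / self.n

    def std_devs(self):
        # Sample formula; needs at least two points.
        if self.n < 2:
            return [0.0] * len(self.means)
        return [sqrt((p * self.n - self.n * m * m) / (self.n - 1))
                for p, m in zip(self.pwrSumAvgs, self.means)]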

A neat part comes in when you try to combine two centroids. The equations and code become more symmetrical than when adding in one value at a time: a given value x can be generalized into a centroid with mean m=x and count c=1. With that generalization, the code becomes:

from math import sqrt

class value:
    def __init__(self, x=0.0):
        self.n = 1
        self.mean = x
        self.pwrSumAvg = x * x
        self.stdDev = 0.0

    def merge(self, val):
        # Weight each mean and power sum average by its count.
        # total is always >= 2, so dividing by n - 1 below is safe.
        total = self.n + val.n
        self.mean = (self.mean * self.n + val.mean * val.n) / total
        self.pwrSumAvg = (self.pwrSumAvg * self.n
                          + val.pwrSumAvg * val.n) / total
        self.n = total
        self.stdDev = sqrt((self.pwrSumAvg * self.n
                            - self.n * self.mean * self.mean)
                           / (self.n - 1))

We lose the tidy incremental form of the mean calculation, but we gain the ability to merge two centroids. Note that the calculation for the standard deviation is the same as before; only the way the mean and the power sum average are computed changes.
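
To sanity-check the merge, compare a merged set of values against a batch computation over the pooled data (the numbers here are arbitrary):

import statistics

a = value(2.0)
a.merge(value(4.0))
a.merge(value(6.0))

b = value(8.0)
b.merge(value(10.0))

a.merge(b)
print(a.mean, a.stdDev)  # 6.0, ~3.1623
print(statistics.mean([2, 4, 6, 8, 10]),
      statistics.stdev([2, 4, 6, 8, 10]))  # same values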

Why go through all this trouble? You can now use this for data clustering, and also to potentially visualize the centroid clouds with their data scattered through them. In bioinformatics, for example, data is often summarized at many levels: gene expression probes are grouped into probe sets, and probe sets are often clustered or combined into functional units, or gene sets. Preserving the standard deviation at each of these levels gives greater insight into the data.


12 thoughts on “Running Standard Deviations”

  1. Correct me if I’m wrong, but there seems to be an error in your code: pwrSumAvg should really be the running sum of squares, not the average of the sum of squares. In the derivation you perform, ‘p’ corresponds to the former, but your code uses the latter.

    • Good of you to notice that; sorry it took this long to respond. There is a problem with using a running sum in code: you risk precision loss and/or overflow in floating point values. The original standard deviation equation is written with a sum of squares. I screwed up in the derivation and in the code, as I needed to expand the power sum average back out to the sum of squares by multiplying it by n. It is fixed in the code and the math now.

      Thanks!

  2. Isn’t the sequence

    n = 0
    n += 1
    s = … / (n - 1)

    a guaranteed attempt to divide by zero?
    In my once-off code I changed it to just / n – I guess it does not matter much for large n.

  3. Do you know how your power sum average method of computing variance compares to Welford’s method in terms of accuracy on digital systems? I am currently looking into combining running variances using Welford’s method, but your approach seems very attractive. It could possibly be modified to calculate weighted average and variance as well.

    • Two things jump to mind: the PSA keeps the numbers from overflowing, as Welford’s algorithm does, but Welford handles the special case where x is a single value. My method has no such assumption, and allows entire centroids to be merged incrementally. This can be useful for certain types of datasets, and I developed it in order to support merges of centroids in hierarchical k-means clustering using full expectation maximization instead of mean-only centroids.
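
      For reference, here is a minimal sketch of Welford’s single-value update to compare against the PSA update above (the function and variable names are mine):

      def welford_update(n, mean, M2, x):
          # Track the mean and the sum of squared deviations (M2) incrementally.
          n += 1
          delta = x - mean
          mean += delta / n
          M2 += delta * (x - mean)  # uses the freshly updated mean
          return n, mean, M2

      # After n >= 2 updates, the sample variance is M2 / (n - 1).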

  4. Pingback: Running Variance | TaylorTree

  5. One of your math steps is incorrect. You factored the n^2 out, but a magic n appeared next to m^2. In the end you get it right, though. So the third equation for the stdev should be s = sqrt(n^2(p - m^2) / (n(n - 1)))

  6. In all my education and work, StdDev was a parametric statistic. How is it that you never test the distribution via standard tests before presenting StdDev as valid? Yes, I know most distributions are random and thereby qualify, but that begs the question anyway, doesn’t it?

    • I actually don’t know of any StdDev implementations that perform distribution checks before performing the calculation. The math works whether or not StdDev is actually meaningful, and the user is generally expected to determine on their own if the statistic is useful to them for their dataset.

      How normal does a distribution have to be before someone should be allowed to use this algorithm? The answer will always be situational.
