Update, 7/13/2013: I’m amazed at the continued staying power of this post, considering that I had originally worked the math out for this 14 years ago. People are still commenting on this and suggesting fixes. I’m also amazed that I’ve peppered enough errors in the math and code for people to still be finding errors 5 years after the fact.
count += 1 mean += (x - mean) / count
I remembered that I had come up with a similar thing for standard deviation back when I was developing clustering algorithms that could use that value. It uses a power sum average, where you track the power sum as an average (divide the power sum by n) in a similar way.
n=0 mean=0 pwrSumAvg=0 stdDev=0 def update(x): n += 1 mean += (x - mean) / n pwrSumAvg += ( x * x - pwrSumAvg) / n stdDev = sqrt( (pwrSumAvg * n - n * mean * mean) / (n - 1) )
We get this using the running sum formula for standard deviation and essentially simplifying away the sum by removing n:
In this formula, the core values for the standard deviation is the sum of the squares minus the square of the sum, which we can recreate from the mean and power sum average, respectively:
p: power sum average
Sorry for all the math, but I wanted to make sure I got it right, and that I can prove this later. I’ve already had to reconstruct it once, so I don’t want to do it again. Also, I just figured out how to embed equations in WordPress, which is pretty damn cool.
Another neat thing to do with this has to do with manipulating centroids of vectors. That is, dealing with lists of values that correspond to the same datum. For instance, a row of data can be considered an n-dimensional vector, where each value in the row is a value along that dimension of the vector. The traditional data mining definition of a centroid usually deals with only the mean, but I like to refer to centroid clouds when adding in standard deviations, because rather than a point in the n-dimensional space, we refer to a diffuse cloud of probability when adding in the standard deviation value.
To find a centroid of two value vectors, it is a simple matter to apply the formulas above to create the mean and standard deviations of the centroid by applying them to each value along the vector. Combining a centroid with a value works the same way, where the centroid is a vector of mean and power sum averages, and the count appies to the entire centroid.
A neat part comes in when you try to combine two centroids. The equations and code from before become more symmetrical than when only adding in one value at a time. A given value, , can be generalized into a mean of and a count of . By doing this, we can change our code to be:
class value: _init_(self, x=0): self.n=1 self.mean=x self.pwrSumAvg=x*x self.stdDev=0 def merge(self, val): self.mean = (val.mean*val.n + self.mean*self.n) / (self.n + val.n) pwrSumAvg = ( val.pwrSumAvg * val.n + self.pwrSumAvg * self.n) / (self.n + val.n) self.n += val.n stdDev = sqrt( (pwrSumAvg * n - n * mean * mean) / (n - 1) )
We lose our nice calculation of the mean, but we gain the ability to merge these two centroids together. Note that the calculation for the standard deviation is the same as before, but we change the way the power sum average and the mean are calculated.
Why go through all this trouble? You can now use this for data clustering and also to potentially visualize the centroid clouds with their data scattered through them. In bioinformatics, for example, data is often summarized at many levels. Gene expression probes are grouped into probe sets, and probe sets are often clustered or combined into functional units, or gene sets. Preserving standard deviation at these levels allows for greater clarity into the data.