Update, 7/13/2013: I’m amazed at the staying power of this post, considering that I originally worked out the math 14 years ago. People are still commenting on it and suggesting fixes. I’m also amazed that I peppered the math and code with enough errors that people are still finding them 5 years after the fact.
My friend Dan at Invisible Blocks came up with a great way to compute a long-running mean from the count and mean:
count += 1
mean += (x - mean) / count
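As a minimal sketch, the two-line update above can be wrapped in a small Python class. The class name `RunningMean` and the method name `add` are made up for illustration and are not from Dan's post.

```python
class RunningMean:
    """Tracks a mean incrementally from only the count and the current mean."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def add(self, x):
        # Same update as above: nudge the mean toward x by 1/count of the gap.
        self.count += 1
        self.mean += (x - self.mean) / self.count
        return self.mean


rm = RunningMean()
for value in [2, 4, 6]:
    rm.add(value)
print(rm.mean)  # 4.0
```

No list of past values is kept, which is what makes this usable on a long-running stream.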
I remembered that I had come up with something similar for standard deviation back when I was developing clustering algorithms that could use that value. It uses a power sum average: you track the power sum as an average (the power sum divided by n) and update it incrementally, just as with the mean.
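Here is one way the power sum average idea might look in Python. This is a sketch under assumptions rather than the exact code from the original derivation: it assumes the power sum is the sum of squares, that the sample (n - 1) standard deviation is wanted, and the names `RunningStdDev`, `add`, and `stddev` are hypothetical.

```python
import math


class RunningStdDev:
    """Running standard deviation from count, mean, and power sum average."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.power_sum_avg = 0.0  # assumed to be (sum of x*x) / count

    def add(self, x):
        self.count += 1
        # Both averages use the same incremental update as the running mean.
        self.mean += (x - self.mean) / self.count
        self.power_sum_avg += (x * x - self.power_sum_avg) / self.count

    def stddev(self):
        # Sample standard deviation; defined here as 0.0 for fewer than 2 values.
        if self.count < 2:
            return 0.0
        n = self.count
        variance = (self.power_sum_avg - self.mean ** 2) * n / (n - 1)
        # Clamp tiny negative values caused by floating-point rounding.
        return math.sqrt(max(variance, 0.0))


rs = RunningStdDev()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    rs.add(value)
print(rs.stddev())
```

Subtracting the squared mean from the power sum average can lose precision when the values are large relative to their spread, so treat this form as illustrative.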
Data mining is, in the most general terms, an attempt to extract patterns and knowledge from data using various types of software and techniques. It is used to learn and to predict, with applications in biology, neuroscience, fraud detection, national security, and even sports.
Some of these are more successful than others. For instance, text mining has been very successful at extracting proper nouns (names, places, etc.) from text, and what might be considered the biggest success of data mining comes from text mining: internet search engines. But at the same time, text mining has been less successful at automated text summarization.