Data mining is, in the most general terms, an attempt to extract patterns and knowledge from data using various types of software and techniques. It is used both to learn from data and to make predictions, and has been applied to biology, neuroscience, fraud detection, national security, and even sports.
Some of these applications are more successful than others. For instance, text mining has been very successful at extracting proper nouns (names, places, etc.) from text, and what might be considered the biggest success of data mining comes from text mining: internet search engines. At the same time, text mining has been much less successful at automated text summarization.
Typically, data mining algorithms are broken into two groups: supervised and unsupervised. A supervised algorithm attempts to learn labels that have been applied to the data by some external means. Supervised algorithms can be used as either classifiers or predictors, depending on whether the output of the algorithm (the labels) is discrete (a classifier) or continuous (a predictor). An unsupervised algorithm tries to make some sense of the data without the use of labels. There are also other sorts of algorithms that don’t fit easily into these groups, such as optimizers. An optimizer attempts to tune the parameters of an algorithm to maximize some sort of score in a multidimensional search space. Evolutionary algorithms, such as genetic algorithms and genetic programming, along with Monte Carlo simulation and simulated annealing, are among the most common optimization algorithms.
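To make the optimizer idea concrete, here is a minimal sketch of simulated annealing in Python. The function names and the toy one-dimensional search space are my own illustration, not a standard API: the algorithm perturbs a candidate solution, always accepts improvements, and occasionally accepts worse candidates with a probability that shrinks as a “temperature” cools, which helps it escape local optima.

```python
import math
import random

def simulated_annealing(score, initial, neighbor, steps=10_000, t0=1.0):
    """Maximize `score` by random perturbation with a cooling schedule."""
    random.seed(0)  # deterministic for the sake of the example
    current, best = initial, initial
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling
        candidate = neighbor(current)
        delta = score(candidate) - score(current)
        # Always accept improvements; sometimes accept a worse candidate.
        if delta > 0 or random.random() < math.exp(delta / temp):
            current = candidate
        if score(current) > score(best):
            best = current
    return best

# Toy search space: maximize -(x - 3)^2, whose true optimum is x = 3.
result = simulated_annealing(
    score=lambda x: -(x - 3) ** 2,
    initial=0.0,
    neighbor=lambda x: x + random.uniform(-0.5, 0.5),
)
```

A real application would replace the toy score function with whatever quantity the model’s parameters are being tuned against.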
The inputs used in data mining are usually called features, and the outputs are called labels, even when they are continuous values and even when nothing is being trained against them. An example is a row of data in the data set; each example is a set of features and possibly labels. A model is the data produced by analyzing the training data, and it represents the algorithm’s “understanding” (as much as it can be called that) of that data. A typical processing pipeline used in data mining will look like this:
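To pin down that vocabulary, here is a tiny hypothetical data set sketched in Python. The feature names and the trivial mean-per-label “model” are illustrative only, a stand-in for whatever a real training step would produce:

```python
# Each example is a set of features plus a label; the data set is
# just a collection of examples.
examples = [
    # features: [height_cm, weight_kg]     label
    {"features": [170.0, 65.0], "label": "adult"},
    {"features": [120.0, 25.0], "label": "child"},
]

# A (deliberately trivial) "model" built from the training data:
# the mean height per label, which a nearest-mean classifier
# could later use to assign labels to new examples.
heights_by_label = {}
for ex in examples:
    heights_by_label.setdefault(ex["label"], []).append(ex["features"][0])
model = {label: sum(v) / len(v) for label, v in heights_by_label.items()}
```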
- Feature Selection: Often done by the person running the analysis, it is critical to pick features that have some chance of supporting the task you have set out to perform. You can’t use a coin toss to predict the winner of a basketball game, no matter how many times you try. Feature selection can also happen at a later stage as an automated process, called automated feature selection.
- Normalization: There are two kinds of normalization: corrective normalization, which is used to remove measurement artifacts (this is very common in bioinformatics), and algorithmic normalization, which is used to bring the values into a range that the data mining algorithm will understand. A common form of algorithmic normalization is to bring the mean of all values to 0 and the standard deviation to 1, or to bring the minimum to 0 and the maximum to 1. It really depends on the data and the algorithm. In fact, after feature selection, proper normalization is probably the most important factor to successful data mining.
- Training: Training is the creation of the data mining model from the data. In supervised cases, this is done by comparing the training data with the associated labels and attempting to predict those labels from the data. In unsupervised cases, a model is built up from some measure of similarity among the examples.
- Testing: This usually occurs only in supervised learning. A set of examples is reserved for testing the accuracy of the data mining model. The classifier or predictor is run over the reserved examples, and the outputs are compared to the already-existing labels. The scoring can take a number of forms, which I hope to go over later. The most rigorous form of testing is cross validation: randomly partition the data into N parts, train a model on N-1 parts, and then test the model on the remaining part. This is repeated for all N parts, and the scores are combined across the data. In the most extreme form, N is the number of examples in the data set, and the procedure is called leave-one-out cross validation; otherwise, it is called N-fold cross validation.
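The two forms of algorithmic normalization described in the pipeline above can be sketched in a few lines of Python. This is a pure standard-library sketch; a real pipeline would likely use a numerical library instead:

```python
import statistics

def z_score(values):
    """Algorithmic normalization: shift the mean to 0, scale the
    (population) standard deviation to 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def min_max(values):
    """Algorithmic normalization: rescale values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

values = [2.0, 4.0, 6.0, 8.0]
z = z_score(values)   # mean 0, standard deviation 1
m = min_max(values)   # minimum 0, maximum 1
```

Which of the two to use (if either) depends, as noted above, on the data and on what ranges the downstream algorithm expects.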
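N-fold cross validation itself is short enough to sketch directly. The one-feature data set and the nearest-mean classifier below are hypothetical stand-ins, chosen only so the example is self-contained; setting n equal to the number of examples would give leave-one-out cross validation:

```python
import random

def n_fold_cross_validation(examples, labels, train, predict, n=4):
    """Partition the data into n folds; train on n-1 folds and test on
    the held-out fold, repeating until every fold has been held out."""
    random.seed(0)  # deterministic partitioning for the example
    indices = list(range(len(examples)))
    random.shuffle(indices)
    folds = [indices[i::n] for i in range(n)]
    correct = 0
    for held_out in folds:
        train_idx = [i for i in indices if i not in held_out]
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        for i in held_out:
            if predict(model, examples[i]) == labels[i]:
                correct += 1
    return correct / len(examples)  # combined score across all folds

# Hypothetical well-separated data and a nearest-mean classifier.
xs = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
ys = ["low", "low", "low", "low", "high", "high", "high", "high"]

def train(examples, labels):
    means = {}
    for x, y in zip(examples, labels):
        means.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in means.items()}

def predict(model, x):
    return min(model, key=lambda y: abs(model[y] - x))

accuracy = n_fold_cross_validation(xs, ys, train, predict, n=4)
```

Because every example is scored exactly once, on a model that never saw it during training, the combined accuracy is a far more honest estimate than testing on the training data itself.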
A common criticism of data mining has been that it is hypothesis fishing. That is, rather than observing the data, making a hypothesis, and testing the hypothesis, we let the computer examine all possible hypotheses and then pick the highest scoring one. I see data mining more as automated hypothesis generation. It is very difficult to get valid results from data mining without knowing the domain you are applying it to, and ultimately the model that gets created will need to be vetted against unlabeled data in some sort of prospective study. Once that happens, the resulting model can go from being a hypothesis to a theory.