Great Discussion on LinkedIn on the Difference between Data Mining and Stats

I found myself entranced by this thread today: LinkedIn

I replied there, but with the promise of more detail here:

A sales rep I worked with faced a challenge some years ago. He had to define Statistics in one bullet on a PowerPoint slide to contrast it with Data Mining. The result was brilliant: “Stats is proving or disproving a hypothesis with a single data set”.

Here are my own two cents on the most striking differences. Disclaimer: By “Stats” I am really focusing on hypothesis testing using parametric techniques.

1) In Statistics your hypotheses predate the collection of data. In Data Mining, the data has already been collected during the normal course of doing business.

2) Data Mining assumes eventual deployment. You don’t always get there, but the whole process is built around identifying previously unknown patterns, proving that you are right, and deploying the results in the form of a transformed business process.

3) To be worth an effort of many person-weeks, the search for patterns has to be exhaustive. It takes a lot of human effort, not only machine effort, to know where and how to look. It cannot be fueled only by a priori hypotheses. To do so is to limit the search to the business experience of the data miner and their collaborators. No matter how extensive that experience is, it would be imprudent to let it impede the discovery of new patterns. Rather, that experience should be used during validation and deployment to ensure that the patterns can be made useful to the business.

4) Data Miners always have to validate the model using data, whereas statisticians use distributional assumptions as a proxy for a validation data set. It is always desirable to have a replication data set, but statisticians are not always that lucky. I am surprised how rarely the issue of money comes up when drawing the contrast between these two techniques. A competent statistician will often have to do a power analysis to determine a sufficient sample size. One does not double (or increase less modestly) the sample size in order to create a validation data set. It would cost too much money. Imagine a heart study, for instance. Instead, one uses hypothesis testing and distributional assumptions to obviate the need for a validation data set. In other words, the bell curve becomes a stand-in for the validation data set. Data Miners don’t have to worry about this. They usually have enough data to allow them to divide it randomly into a Train data set and a Test data set.
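
To make point 4 a little more concrete, here is a rough Python sketch of the two habits side by side. It is only an illustration under my own assumptions, not anything taken from the LinkedIn thread: the function names, the effect size of 0.5, and the 70/30 split are all hypothetical choices, and the sample size formula is the textbook normal approximation for a two-sided comparison of two group means.

```python
import math
import random
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group (hypothetical helper).

    Uses the normal approximation for a two-sided, two-sample
    comparison of means; effect_size is the standardized difference.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Randomly divide rows into Train and Test lists (hypothetical helper)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]


# The statistician's side: how many subjects can we justify paying for?
n = sample_size_per_group(effect_size=0.5)

# The data miner's side: the data already exists, so hold some of it back.
train, test = train_test_split(range(1000))
```

The first function is where the money question lives: every extra subject costs something, so the statistician recruits only what the power analysis calls for and leans on the bell curve for the rest. The second function only makes sense when data is cheap and plentiful enough to set part of it aside.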