Six Sigma, Stats, and Data Mining

I have been reading up on Six Sigma. I think there are real opportunities for more data mining on Six Sigma projects. A bit of confusion arises when experts in each of these areas meet, and I am starting to understand the source of the confusion.

The core principle of the Six Sigma process is the identification, measurement, and reduction of variance. The general assumption is that no variance is random. The idea is that variance is inevitable but bad, and that it should be reduced to within acceptable limits (i.e. six sigma, another use of the term, which refers to the acceptable limits themselves).

Statistics is all about variance as well, but philosophically the approach to variance is a bit different. Statistics is largely about the measurement and reporting of inevitable, random, and uncontrollable sampling error – that is, the error one experiences when one is not able to do a census. Because both Six Sigma and Stats use probability, and because Six Sigma openly embraces statistics, they are a natural fit, but I am not sure that even Six Sigma experts notice the subtle difference.
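
To make "sampling error" a little more concrete, here is a rough sketch of the usual standard error calculation for a sample proportion. The function name, the sample size of 1,000, and the 5% rate are all hypothetical numbers of my own choosing, and the 1.96 multiplier assumes the normal approximation holds for a 95% interval.

```python
import math


def proportion_standard_error(p_hat: float, n: int) -> float:
    """Standard error of a sample proportion (hypothetical helper).

    Assumes a simple random sample of size n.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be between 0 and 1")
    return math.sqrt(p_hat * (1.0 - p_hat) / n)


# Illustrative numbers only: 5% of a 1,000-record sample respond.
se = proportion_standard_error(0.05, 1000)
margin = 1.96 * se  # approximate 95% margin of error (normal approximation)
print(f"standard error = {se:.4f}, margin of error = {margin:.4f}")
```

The point is that this error shrinks as the sample grows, and it would vanish with a census. It is not something a process improvement effort can remove.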

Data Mining is more strikingly different because probability theory and its application to the study of variance is not a core principle of data mining. One could argue that it has very little to do with data mining, but most competent data miners would have an appreciation of the concept. If challenged to name the "core principle" of Data Mining, it would be Lift. Lift is the model's ability to predict an outcome better than what I like to call an "eyes closed model". In other words, if 5% of your marketing leads are future customers and you pick a name out of a hat, you have a 5% chance of finding a good lead. Perhaps with a good, and successfully deployed, data mining model, you might choose only the most promising leads and achieve 20% success. The lift would be 20%/5%, or simply 4. Any Six Sigma practitioner can relate to this quest to measure increased performance, but it makes no use of terms like defect or variance.
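
The lift arithmetic from the marketing leads example can be sketched in a few lines. The function and variable names are mine, purely for illustration. The 5% and 20% figures are the ones from the example.

```python
def lift(model_rate: float, baseline_rate: float) -> float:
    """Ratio of the model's success rate to the "eyes closed" rate.

    Hypothetical helper: both arguments are success rates between 0 and 1.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return model_rate / baseline_rate


baseline_rate = 0.05  # name out of a hat: 5% of leads become customers
model_rate = 0.20     # leads chosen by the model: 20% become customers

example_lift = lift(model_rate, baseline_rate)
print(f"lift = {example_lift:.1f}")
```

A lift of 1 would mean the model does no better than picking with your eyes closed.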

I am sure I will explore these connections more in the future. All three are compatible and can be usefully combined to produce what SPSS might call the "Predictive Enterprise", and Six Sigma fans would call the "Six Sigma Organization".