Data Mining Defined

As I write this in 2020, the phrase Data Mining is increasingly out of fashion. It has become associated with privacy concerns, as shown by this quote from a Computerworld article entitled "Big data blues: The dangers of data mining."

As companies become experts at slicing and dicing data to reveal details as personal as mortgage defaults and heart attack risks, the threat of egregious privacy violations grows.

Clearly, privacy concerns are both valid and important, but most of the projects that I’ve been involved in over the last 25 years have used data that is internal to the organization. For instance, it is common to look at mortgage defaults on a bank project, but it would be very unusual to try to acquire mortgage default information through a third party for a health insurance project. Well-publicized and sometimes strange case studies have made it common knowledge that attempts to acquire private data occur, but such attempts are not as common as the famous cases lead us to believe. Also, we have all been annoyed by ads that appear mysteriously after a Google search. It is so commonplace that we might reasonably, but wrongly, believe that all machine learning is done in this way.

If there is a negative connotation, then why use the phrase Data Mining at all?

When the Cross-Industry Standard Process for Data Mining (CRISP-DM) was written in the late 90s, the term was not yet associated with privacy concerns. The document is still influential today and is still considered the de facto standard. Attempts to modify it have never truly taken hold, and they rarely depart greatly from the original. I discuss some CRISP-DM alternatives in a video from my Data Assessment course. Therefore, I still embrace the phrase Data Mining and the recommended process in the CRISP-DM document. Perhaps the safer phrase to use today, albeit a bit wordy, is Traditional Supervised Machine Learning.

Predictive Analytics was the phrase of choice for a while, but it is now used less often. Artificial Intelligence is used so broadly today that it is increasingly unclear what it refers to, but it is safe to say that Data Mining is a subset of AI. One of my favorite attempts to define AI is by one of the co-creators of CRISP-DM, Colin Shearer. Data Science has the same problem: it is so broadly defined that the term creates confusion.

I first started trying to nail down a good definition many years ago, but I now favor this one, which I used in my LinkedIn Learning course, The Essential Elements of Predictive Analytics and Data Mining.

Data Mining is the selection and analysis of data, accumulated during the normal course of doing business, in order to find (and confirm) previously unknown relationships that can produce positive and verifiable outcomes through the deployment of predictive models when applied to new data.

This definition, while a bit longer than some, identifies several distinguishing features of Data Mining. We will discuss each in turn.

Historical Data:

Data Mining needs data where the outcome of interest has already been achieved. It is common for the data to be a year or more old. One wants the outcomes to have been achieved recently, but the data tends to be older because you want to understand the inputs as they were at the time the prediction would have been made. For example, if you want to predict whether an insurance claim is fraudulent, the inputs should be what was known early in the claim cycle, even if it took a considerable amount of time to determine that the claim was fraudulent.
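
The idea of using only what was known at prediction time can be sketched in a few lines of Python. This is a minimal illustration, not a method taken from the course: the event log, the field names (claim_amount, prior_claims, adjuster_notes_flag, fraud_confirmed), and the cutoff date are all hypothetical.

```python
from datetime import date

# Hypothetical event log for one insurance claim: (date recorded, field, value).
claim_events = [
    (date(2019, 1, 5), "claim_amount", 12000),
    (date(2019, 1, 6), "prior_claims", 3),
    (date(2019, 3, 20), "adjuster_notes_flag", 1),
    (date(2019, 9, 1), "fraud_confirmed", 1),  # the outcome, known much later
]

def inputs_as_of(events, cutoff, outcome_field="fraud_confirmed"):
    """Keep only fields recorded on or before the cutoff, excluding the outcome."""
    return {
        field: value
        for recorded, field, value in events
        if recorded <= cutoff and field != outcome_field
    }

def outcome(events, outcome_field="fraud_confirmed"):
    """Return the eventual outcome, or 0 if it was never recorded."""
    for _, field, value in events:
        if field == outcome_field:
            return value
    return 0

# Assumed prediction point: a few days into the claim cycle.
prediction_date = date(2019, 1, 10)
features = inputs_as_of(claim_events, prediction_date)
label = outcome(claim_events)
print(features, label)
```

The adjuster flag recorded in March is left out of the inputs because it would not have been available in early January, while the fraud outcome from September is still used as the target.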

Normal Course of Business:

In statistics, one often has a hypothesis and then creates an experimental design to capture data capable of testing the hypothesis. In Data Mining, the data already exists. It was captured to run the business, not to perform experiments. Now that it exists, we capitalize on it to create value for the business. This has implications for how we go about the process of Data Mining.

Selecting and Preparing:

It is often believed that Data Mining is conducted on all of the data in the data warehouse. This is not true. Data Mining is often conducted on a much smaller portion of the data. Also, Data Mining always involves data preparation.

Previously Unknown Patterns:

Data Mining is not about having a hunch and exploring the data to confirm one’s hunch. It is a systematic search for patterns that can provide value to the business. A search that turns up no valuable surprises would be a disappointment, but that is actually very rare.

Models:

A model, usually in the form of a formula or a set of rules, is a systematic way of recording the patterns so that they can be applied to new data. One is not looking merely for insight, but to systematically describe patterns that can be put to use.
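
As a rough sketch of what "a set of rules applied to new data" can look like, consider the following Python fragment. The rules, thresholds, and field names are invented for illustration; in a real project they would come out of the modeling work, not be written by hand.

```python
# Hypothetical rules, as they might be recorded after a modeling exercise.
# Each rule is a (condition, label) pair; the first matching rule wins.
RULES = [
    (lambda r: r["prior_claims"] >= 3 and r["claim_amount"] > 10000, "high"),
    (lambda r: r["prior_claims"] >= 1, "medium"),
]

def score(record, rules=RULES, default="low"):
    """Return the label of the first rule that matches, else the default."""
    for condition, label in rules:
        if condition(record):
            return label
    return default

# New data the rules have never seen.
new_claims = [
    {"prior_claims": 4, "claim_amount": 15000},
    {"prior_claims": 1, "claim_amount": 2000},
    {"prior_claims": 0, "claim_amount": 50000},
]
scores = [score(claim) for claim in new_claims]
print(scores)
```

Because the patterns are recorded as explicit rules rather than as an analyst's impression, the same scoring can be repeated on every new record.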

Deployment:

A model that is not deployed is incomplete. The model only provides value if it is inserted into the business process, driving better decisions, and shown to provide a measurable benefit to the business.

The Essential Elements course is in many ways a course about CRISP-DM, but with relatively few explicit references to the document as I describe the elements of Data Mining. In the first half, I describe the characteristics of a Data Mining project that you will always encounter and should therefore be prepared for. In the second half, I make explicit mention of each phase of CRISP-DM and also discuss the Nine Laws of Data Mining.

This video excerpt provides a high-level overview of the entire data mining process.