Some reflections on CRISP-DM

Here is a copy of the CRISP-DM original document.


And here is a version of the famous circular diagram. Please be careful to refer to the task level and not just the phase level. The model is rich and very useful if you use the entire model. There is a danger in simply mapping one’s expectations onto the circular diagram. CRISP-DM is much more than just a diagram. Read the full document and then use the diagram for reference.


Some reflections on CRISP-DM’s six phases

Thoughts on Predictive Analytics and CRISP-DM Video

Business Understanding is a group activity. It takes place in meeting rooms, not alone at one’s laptop. The goal is to formalize the business problem and transform it into a form that can be answered with data and with data mining techniques. Data Mining is fundamentally about addressing problems. Data Mining projects cannot ultimately succeed unless sufficient attention is given to this first phase. One possible outcome of Business Understanding, and a common one, is a carefully defined Target variable.

The Tasks of this phase are:

  • Determine Business Objectives
  • Assess Situation
  • Determine Data Mining Goals
  • Produce Project Plan

Data Understanding, like Business Understanding, is not a solitary activity.

A common form it takes is receiving data, exploring it, and then meeting with Subject Matter Experts (SMEs). It is not the SMEs’ job to pick winning or losing variables, but rather to ensure that all data is represented and that the data miner thoroughly understands the data. Once some exploration is done, the data is discussed in light of the business problem, and sometimes Business Understanding has to be revisited. Deployment should be discussed during both the Business Understanding phase and this phase. You need to understand how the data will flow into the model and back out to the business, where it can provide value. One has to be careful not to draw too many conclusions at this stage, because the data has not been fully integrated and data augmentation and cleaning have not yet occurred.

The tasks of this phase are:

  • Collect initial data
  • Describe data
  • Explore data
  • Verify data quality

It is frequently said that Data Preparation is 70-90% of the time spent on a Data Mining project.

When someone cites a number as high as 90%, they probably mean that Data Understanding and Data Preparation together take up that much time, and even so the high end of that range is too high. However, the notion that Data Preparation is the biggest threat to the schedule is true. It is the biggest variable in estimating project length. Every project differs, but if a project takes longer than expected, Data Preparation is virtually always to blame. Tom Khabaza claims that this phase is ALWAYS more than 50% of the effort, even when a data warehouse is in place and even when there is software support to facilitate the process. This is the case because, when time allows, more data preparation improves the effectiveness of the modeling phase. If ample time exists, it is often wiser to spend extra time on this phase than on any other phase. It is also true that many practitioners underestimate the time this phase needs, which contributes to its reputation as time-consuming. Corners cut on this phase tend to make the other phases fall behind schedule.

In short, there is no way around it: this is a time-consuming phase. CRISP-DM lists these five tasks:

  • Select data
  • Clean data
  • Construct data
  • Integrate data
  • Format data

Data construction, sometimes called data augmentation, is a critical aspect of Data Preparation. Models will work much better when the variables have been manipulated to make the patterns clear. For instance, raw dates are not useful in most models. What is useful are the differences between dates. Other commonplace examples include subtracting variables from each other to measure change. Did someone spend more on iTunes downloads in December than they did on average over the last year? Most data mining beginners vastly underestimate the importance of this aspect of Data Preparation.
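As a rough illustration of the kind of constructed variables described above, here is a minimal Python sketch. The function name, the field names, and the sample dates and spend figures are all hypothetical; they are assumptions made for the example, not part of CRISP-DM.

```python
from datetime import date
from statistics import mean


def construct_features(signup, last_purchase, as_of, monthly_spend):
    """Derive model-friendly fields from raw dates and monthly spend.

    monthly_spend is assumed to hold the last 12 months of spend,
    oldest first, so the final entry is the most recent month.
    """
    average_spend = mean(monthly_spend)
    return {
        # Differences between dates, rather than the raw dates themselves.
        "tenure_days": (as_of - signup).days,
        "days_since_last_purchase": (as_of - last_purchase).days,
        # Change measured by subtracting one variable from another.
        "latest_minus_average": monthly_spend[-1] - average_spend,
        "spent_more_than_average": monthly_spend[-1] > average_spend,
    }


# Hypothetical customer: steady spend all year, then a jump in December.
features = construct_features(
    signup=date(2020, 1, 15),
    last_purchase=date(2020, 12, 20),
    as_of=date(2020, 12, 31),
    monthly_spend=[10.0] * 11 + [22.0],
)
```

None of the four output fields exists in the raw data, yet each one states a pattern (recency, tenure, change) far more directly than the raw dates and amounts do.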

The Modeling Phase gets lots of attention during training, and in books about data mining, because there is so much to learn, but it is neither the longest nor the most labor-intensive phase.

Each algorithm rewards many, many hours of study. However, in most projects Modeling is done in 1-2 weeks, even when the project is much longer. The reason is that if you have the schedule and resources to support a third week you would be better off spending an extra week on data preparation or data understanding instead. During this phase, the software is working at its hardest, and the human data miner has the most difficult work behind them. Expect to attempt dozens of models and model settings. Once you narrow it down to the several that are best, then the final winning model will be the one that best addresses the business problem.

The Tasks in the Modeling Phase are:

  • Select modeling technique
  • Generate test design
  • Build model
  • Assess model

Another challenge in the Modeling phase of Data Mining is weeding out the very weakest of the variables – weak in terms of data quality or weak in terms of relationship with the target. This is a delicate process as you don’t want to discard anything that could be useful.
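One cautious way to approach this is to flag weak variables for review rather than delete them automatically. The sketch below is only an assumption about how such a screen might look: the thresholds, the column names, and the use of simple linear correlation as the measure of relationship with the target are all illustrative choices, and a variable that fails a linear test can still be useful to a non-linear model.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if undefined)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def screen_variables(columns, target, max_missing=0.5, min_abs_corr=0.05):
    """Flag, but do not drop, variables that look weak on quality or signal."""
    report = {}
    for name, values in columns.items():
        missing = sum(v is None for v in values) / len(values)
        # Correlate only the rows where the variable is present.
        pairs = [(v, t) for v, t in zip(values, target) if v is not None]
        corr = 0.0
        if len(pairs) > 1:
            corr = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        flags = []
        if missing > max_missing:
            flags.append("mostly missing")
        if abs(corr) < min_abs_corr:
            flags.append("weak linear relationship with target")
        report[name] = {"missing": missing, "corr": corr, "flags": flags}
    return report


# Hypothetical data: None marks a missing value.
target = [0, 0, 1, 1, 0, 1]
columns = {
    "recent_spend": [1.0, 2.0, 8.0, 9.0, 1.5, 7.0],
    "fax_number_present": [None, None, None, None, 1.0, 0.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}
report = screen_variables(columns, target)
```

The report gives the data miner a short list to discuss with the SMEs, which keeps the final decision about what to discard in human hands.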

The Assess Task, part of the Modeling Phase, is a screen of sorts. It is a process by which you can eliminate modeling approaches that are simply not working. Accuracy, while necessary, is insufficient. You need to ensure that the chosen model will solve the business problem. Model accuracy will not be the most important criterion once you have achieved this stage since all sufficiently capable models have graduated to the Evaluation Phase. Another way of phrasing this is that you need accuracy to be a semi-finalist, but the ultimate winner is not necessarily the model with the highest accuracy of all.

Data miners should be careful not to focus on predictive accuracy, model stability, or any other technical metric for predictive models at the expense of business insight and business fit. You may need to develop an evaluation approach that is unique to the problem at hand.

  • What are you trying to increase?
  • What are you trying to decrease?
  • How can you measure that?
  • Is there a particular target that has to be achieved to justify the project?
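One way to make these questions concrete is to score each semi-finalist on a business measure alongside accuracy. The sketch below is only an illustration: the mailing-campaign scenario, the holdout counts, the value per response, the cost per contact, and the minimum net value are all made-up assumptions, not figures from CRISP-DM or from any real project.

```python
from dataclasses import dataclass


@dataclass
class HoldoutResult:
    """Confusion-matrix counts for one candidate model on holdout data."""
    name: str
    tp: int  # contacted and responded
    fp: int  # contacted and did not respond
    fn: int  # not contacted, would have responded
    tn: int  # not contacted, would not have responded

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total

    def net_value(self, value_per_response: float, cost_per_contact: float) -> float:
        # What we want to increase (responses) minus what we want to
        # decrease (money spent contacting people).
        contacted = self.tp + self.fp
        return self.tp * value_per_response - contacted * cost_per_contact


# Hypothetical business figures agreed during Business Understanding.
VALUE_PER_RESPONSE = 50.0
COST_PER_CONTACT = 2.0
MINIMUM_NET_VALUE = 2000.0  # target needed to justify the project

candidates = [
    HoldoutResult("cautious_model", tp=20, fp=5, fn=80, tn=895),
    HoldoutResult("broad_model", tp=70, fp=200, fn=30, tn=700),
]


def value_of(model: HoldoutResult) -> float:
    return model.net_value(VALUE_PER_RESPONSE, COST_PER_CONTACT)


most_accurate = max(candidates, key=lambda m: m.accuracy())
most_valuable = max(candidates, key=value_of)
justifies_project = value_of(most_valuable) >= MINIMUM_NET_VALUE
```

With these invented numbers the more accurate model is not the more valuable one: it is right more often mainly because it contacts almost nobody, so it captures few of the responses the business is after.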

An extended quote from Tom Khabaza’s 9 Laws of Data Mining:

Accuracy and stability are useful measures of how well a predictive model makes its predictions. Accuracy means how often the predictions are correct (where they are truly predictions) and stability means how much (or rather how little) the predictions would change if the data used to create the model were a different sample from the same population. Given the central role of the concept of prediction in data mining, the accuracy and stability of a predictive model might be expected to determine its value, but this is not the case.

Khabaza also states the following.

The value of a predictive model arises in two ways:

  1. The model’s predictions drive improved (more effective) action, and
  2. The model delivers insight (new knowledge) which leads to improved strategy.

Once you reach the Evaluation Phase in the project, you have built a model (or models) that has survived the technical criteria of the Assess Model Task. According to CRISP-DM, “Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives.” Additional tasks address what might be called an after-action report. They create an opportunity to revisit organizational issues and to discuss how the team could improve the way it collaborates on future projects.

The Evaluation Phase Tasks are:

  • Evaluate Results
  • Review Process
  • Determine Next Steps

The oversimplified view of deployment is that you can easily score new data on an existing model as long as you have generated some code along the way. The reality of deployment is always a bit more complicated than this, and for very sophisticated solutions it can become a project in itself. For instance, it is important to keep in mind that data preparation must still be conducted during deployment.

If you had to merge two data sets to create the Training data, then you will likely have to do so at deployment. The only alternative is if the data is prepared in another way. For instance, once the model exists you may decide that you want to modify the source data. Ultimately, however, you have to ensure that the data flowing through the model is in the same form as it was when you built the model. This almost always involves data preparation in the deployment stream.
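A common way to keep the data in the same form is to route both the Training data and the deployment data through one shared preparation step. The following is a minimal sketch of that idea, assuming two hypothetical sources (customers and transactions) and a stand-in scoring rule in place of a real model; none of these names come from CRISP-DM.

```python
EXPECTED_FIELDS = {"customer_id", "age", "total_spend"}


def prepare(customers, transactions):
    """Merge the two sources and build the fields the model expects.

    The same function is meant to be called when building the model
    and again when scoring new data at deployment.
    """
    totals = {}
    for t in transactions:
        cid = t["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + t["amount"]
    return [
        {
            "customer_id": c["customer_id"],
            "age": c["age"],
            # Customers with no transactions get zero spend, not a gap.
            "total_spend": totals.get(c["customer_id"], 0.0),
        }
        for c in customers
    ]


def score(rows, model):
    """Refuse to score rows that are not in the form used at build time."""
    for row in rows:
        if set(row) != EXPECTED_FIELDS:
            raise ValueError(f"unexpected fields: {sorted(row)}")
    return [model(row) for row in rows]


def toy_model(row):
    # Stand-in for a real model: a made-up rule for illustration only.
    return row["total_spend"] > 50.0


# New data arriving at deployment goes through the same preparation.
new_customers = [{"customer_id": 1, "age": 34}, {"customer_id": 2, "age": 51}]
new_transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
]
prepared = prepare(new_customers, new_transactions)
scores = score(prepared, toy_model)
```

The field check in `score` is a cheap safeguard: if someone changes the source data after the model is built, the mismatch surfaces as an error instead of as quietly wrong predictions.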

The Deployment Phase Tasks are:

  • Plan Deployment
  • Plan Monitoring and Maintenance
  • Produce Final Report
  • Review Project