The CRISP-DM Methodology

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a methodology created to guide data mining projects.

Every data scientist should know what CRISP-DM is and what steps it involves. Would you be able to explain the methodology in your next interview?

Who uses CRISP-DM?

Any data professional working in the tech industry probably uses the CRISP-DM methodology in their day-to-day work. If you are not aware of it, you have come to the right place. The methodology focuses on achieving quality results with data by following six steps:

  1. Business understanding
  2. Data understanding
  3. Data preparation
  4. Modeling
  5. Evaluation
  6. Deployment
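The six phases above form an ordered cycle rather than a strict one-way pipeline; in practice, teams often loop back to earlier phases (for example, from Evaluation back to Business understanding). A minimal Python sketch of that ordering (the loop structure is illustrative, not part of the official standard):

```python
# The six CRISP-DM phases, in the order listed above.
CRISP_DM_PHASES = [
    "Business understanding",
    "Data understanding",
    "Data preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

def next_phase(current: str) -> str:
    """Return the phase that normally follows `current`.

    The wrap-around reflects that CRISP-DM is iterative: finishing
    Deployment often kicks off a new round of Business understanding.
    """
    i = CRISP_DM_PHASES.index(current)
    return CRISP_DM_PHASES[(i + 1) % len(CRISP_DM_PHASES)]

print(next_phase("Modeling"))    # Evaluation
print(next_phase("Deployment"))  # Business understanding
```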

Ensuring that a team follows a standard methodology will increase the team’s performance. Most importantly, it will build trust amongst team members and the organization they work for.

In this article, we will only cover the general structure of CRISP-DM. It’s strongly recommended that you spend some time going over the methodology documentation which provides granularity to each of the steps in the process.

Business understanding

The first step in the process is understanding the main goal. Why are we doing the project in the first place? It requires connecting the motivation behind starting the project with the business perspective.

In this step, it is very important to be clear about the project’s impact on the business and the requirements to achieve the project’s goals.

Here it’s recommended to do any or all of the following:

  • Read documentation about the use of data of interest
  • Ask about previous applications of the data
  • Talk with Subject Matter Experts (SMEs)
  • Be curious! Ask, Ask, Ask.
  • Get a general perspective of the data available

Deliverables: project goal, business impact, context, requirements

Data understanding

Next, inspect what kind of information can be gained from the available data, and evaluate its quality. Below are some suggestions to consider:

  • Identify issues with obtained data
  • Create data mining goals
  • Verify data availability
  • Define selection criteria (databases, tables, fields)
  • Data completeness

Why do we need to explore these points before moving on? The reason is “quality”. Understanding the data and what it represents helps to find potential data gaps early in the process. As a result, a plan is created to cover them and reduce risks of compromising your project. Take advantage of SMEs again to confirm your ideas.
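A data-completeness check is one concrete way to surface those gaps early. The sketch below uses a toy, hand-made set of records (the field names are invented for illustration) and plain Python, but the same idea applies to any real source:

```python
# Toy records with deliberate gaps; None marks a missing value.
records = [
    {"customer_id": 1, "age": 34,   "region": "north"},
    {"customer_id": 2, "age": None, "region": "south"},
    {"customer_id": 3, "age": 41,   "region": None},
    {"customer_id": 4, "age": None, "region": "east"},
]

def missing_rate(rows, field):
    """Fraction of rows where `field` is absent or None."""
    missing = sum(1 for r in rows if r.get(field) is None)
    return missing / len(rows)

for field in ("customer_id", "age", "region"):
    print(f"{field}: {missing_rate(records, field):.0%} missing")
# customer_id: 0% missing
# age: 50% missing
# region: 25% missing
```

Fields with a high missing rate are exactly the gaps worth raising with SMEs before committing to a plan.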

Deliverables: data completeness, quality issues and remedies, risk assessment

Data preparation

Ok, now you are sure of what data is out there. The next main objective is to focus on how to extract, transform, and load data into your project. This process is called ETL and includes all the steps necessary to create a dataset that can be used for analysis.

Usually, most of the time spent on a project goes into preparing data. This is because data comes from different sources and is not necessarily stored in a standard manner in each of them. You will spend time on:

  • data wrangling
  • data normalization
  • categorization
  • streamlining the data pipeline
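The steps above can be sketched as a tiny ETL pipeline. Everything here is illustrative: the CSV content, column names, and the min-max normalization are stand-ins for whatever your real sources and transforms look like:

```python
import csv
import io

# Stand-in for a real source: a small CSV with invented columns.
RAW_CSV = """user,spend
alice,10
bob,30
carol,50
"""

def extract(text):
    """Extract: parse raw CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert types and min-max normalize the spend column."""
    spends = [float(r["spend"]) for r in rows]
    lo, hi = min(spends), max(spends)
    return [
        {"user": r["user"], "spend_norm": (float(r["spend"]) - lo) / (hi - lo)}
        for r in rows
    ]

# Load: here we simply keep the result in memory as the analysis dataset.
dataset = transform(extract(RAW_CSV))
print(dataset[1])  # {'user': 'bob', 'spend_norm': 0.5}
```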

Deliverables: the dataset to work with

Modeling

Early in step one, the goal was set. At this point, data is available to explore the variables that impact the business. Many roads lead to Rome, so it’s important to select multiple models and evaluate their accuracy later. The model or models to work with will likely fall into the following categories:

  • Prediction
  • Forecasting
  • Anomaly detection
  • Recognition
  • Optimization
  • Segmentation
  • Recommendations

Remember, you can always take advantage of SMEs and pick their brains for additional input.

Deliverables: model category and candidate model selection

Evaluation

For each category there may be different models that will serve you well. Use more than one to understand whether the results are biased by the choice of model.

For example, if you want to apply sentiment analysis to recognize positive and negative experiences based on a survey, consider using more than one lexicon and comparing the results among them.
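To make the comparison concrete, here is a toy version of that idea. Both lexicons are tiny, hand-made word-to-score maps (not real sentiment libraries), and the survey answer is invented; the point is only that two lexicons can disagree on the same text:

```python
# Two toy sentiment lexicons: word -> score. Positive totals mean
# positive sentiment. Lexicon B weighs strong words more heavily.
LEXICON_A = {"great": 1, "good": 1, "bad": -1, "terrible": -1}
LEXICON_B = {"great": 2, "good": 1, "bad": -1, "terrible": -2}

def score(text, lexicon):
    """Sum the scores of the words the lexicon knows; unknown words count 0."""
    return sum(lexicon.get(word, 0) for word in text.lower().split())

answer = "The product is good but the delivery was terrible"
print(score(answer, LEXICON_A))  # 0  -> neutral under lexicon A
print(score(answer, LEXICON_B))  # -1 -> negative under lexicon B
```

When the lexicons disagree like this, that disagreement itself is a signal worth investigating before trusting either result.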

Based on the models’ accuracy and the analysis carried out, keep the champion model. Different techniques are available to help determine which model is best:

  • Confusion matrix
  • Inter-/intra-cluster distance
  • Error measurement
  • Checking for overfitting/underfitting
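As one concrete example of the first technique, a 2x2 confusion matrix and accuracy can be computed from scratch for a binary classifier. The labels below are invented for illustration:

```python
# True labels and one model's predictions for a binary task (1 = positive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the four cells of the confusion matrix.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} accuracy={accuracy:.2f}")
# TP=3 TN=3 FP=1 FN=1 accuracy=0.75
```

Running the same comparison for each candidate model makes it straightforward to pick the champion on equal footing.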

Deliverables: champion model, hypothesis testing results

Deployment

The very last step in the process is to put your model out there. Make sure it is documented properly and that all the data required to create the model is tied to it. The model will be exposed to new data, and the objective is to treat that data the same way as during model development.

Also, you have to present the project and the results of implementing it to a non-technical audience. At this point, the project must fulfill its objectives and have a positive impact on the business. This phase includes:

  • Document the project
  • Prepare a compelling story
  • Present to the audience
  • Know your audience
  • Next steps - growth & maintenance plan
  • Celebrate!

Deliverables: Documentation, deployment, final presentation, next steps

Closing notes

This was a very brief explanation of the CRISP-DM methodology. Don’t forget that it’s a standard practice for anyone in a data science role. It remains relevant today and it might even come up in an interview!

Hopefully, this article has helped you understand the methodology and sparked curiosity to learn even more about it.
