Data mining methodologies to build real-world data products

Avikumar Talaviya
The Modern Scientist
10 min read · May 29, 2024

Learn the must-know data mining techniques to develop and build machine learning-powered applications using flask


Introduction

A machine learning model that lives only in a notebook has little value beyond research and learning. For businesses to monetize data science and machine learning projects, models must be deployed to production and maintained there so they can deliver their intended business impact and improve the customer experience.

With the growing need for large-scale machine learning deployments, we must understand how to bring data mining and software engineering methodologies together to build successful ML-powered products and services. In this article, we will look at those methods for building real-world data products. So let’s get started!

Learning objectives:

  1. Appreciate the difference between data science and software engineering methodologies.
  2. Develop a machine learning model using data mining techniques and consume it through a web API.
  3. Set up a CI/CD pipeline to deploy the model to production using software engineering principles.

Table of contents:

  1. Data science vs. software engineering
  2. Data mining methodologies for data science projects
  3. Build, Deploy, and Consume model
  4. Taking the Machine learning model from Dev to staging
  5. Conclusion
  6. FAQs

Data science vs. software engineering

When it comes to data science projects, following the right methodology is crucial to achieving the desired results. There are many methodologies, such as Scrum, Agile, Kanban, and CRISP-DM, among others. All of them are used to make sure a project is delivered on time while fulfilling the user’s or client’s requirements.

Data science projects are typically executed using methodologies like CRISP-DM or KDD. These methods are designed specifically for data science projects, but they are heavily influenced by software engineering, and many software development principles were adopted when designing them. We will look at them in depth shortly; before that, let’s see how data science and software engineering differ from each other.

Data Science

  • Data science deals with collecting, organizing, and extracting insights from large amounts of data.
  • Data science focuses on tasks like data preprocessing, feature selection, model training, model evaluation, and generating insights from data.
  • The outcome of data science is probabilistic in nature.
  • For example, suppose we have built an email system that can send and receive emails, but we now face the problem of spam. Data science can help detect spam emails by analyzing the email text and filtering them into a spam folder.

Software engineering

  • The goal of software engineering is to design, develop, test, and deploy applications or systems that meet end users’ requirements.
  • Software engineering involves designing software architecture, coding, debugging, testing, and optimizing software applications.
  • The outcome of software engineering is deterministic in nature.
  • An example of a software product is an email system in which users can send and receive emails from other users. Here, the email is expected to be delivered exactly as specified; the behavior is fully defined in advance.

Data mining methodologies for data science projects

There are many data mining methodologies available to achieve different objectives with the available data, but CRISP-DM is the most widely used and popular choice for data science projects. Let’s look at the methodologies available:

  1. CRISP-DM: Cross-Industry Standard Process for Data Mining, developed by a consortium of companies including SPSS, NCR, and Daimler-Benz
  2. KDD: Knowledge Discovery in Databases, a process popularized by Fayyad et al. in AI Magazine
  3. SEMMA: Sample, Explore, Modify, Model, and Assess, developed by the SAS Institute
  4. ASUM-DM: Analytics Solutions Unified Method for Data Mining, developed by IBM
  5. TDSP: Team Data Science Process, developed by Microsoft

Among all these methodologies, we will focus on CRISP-DM.

CRISP-DM methodology

CRISP-DM process model (source: datascience-pm.com)

1. Business problem understanding

This initial phase focuses on understanding the project goals, objectives, and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan with a specific set of tasks and desired outcomes to achieve project-level objectives.

2. Data Understanding

The data understanding phase starts with initial data collection and proceeds with activities such as feature description, preliminary data analysis, and exploratory data analysis. These activities help you become familiar with the data, identify data quality problems such as missing values and inconsistent entries, and detect interesting subsets that let you form hypotheses about hidden information.
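In practice, a few pandas calls are often enough to get a first look at a tabular dataset. Below is a minimal sketch; the bank marketing file used later in this article is assumed here, and the exact path and column layout may differ in your setup.

import pandas as pd

#load the raw data and inspect its shape and column types
df = pd.read_csv("./bankData/bank.csv", header = None)
print(df.shape)
print(df.dtypes)

#summary statistics and a missing-value check
print(df.describe())
print(df.isna().sum())

#distribution of the target column (last column in this layout)
print(df.iloc[:, -1].value_counts())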

3. Data Preparation

The data preparation phase covers all activities needed to construct the final dataset that is fed into the modeling tools from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table and record selection, feature selection, feature engineering, feature transformation, and cleaning of the data for the modeling tools and techniques.

4. Modeling

In this phase, various modeling techniques are selected depending on the problem statement, applied, and their parameters calibrated to reach optimal modeling performance. Typically, several techniques exist for the same data mining problem type, and some of them have specific requirements for the format of the data they need. Therefore, going back to the data preparation phase is often necessary at this stage.

5. Evaluation

The evaluation phase is one of the most crucial phases of any data science project lifecycle. By this stage of the project, you will have built one or more machine learning models. A model might perform well on the training data, but it must be tested and evaluated on unseen data before it can be trusted to achieve its objectives. Appropriate evaluation metrics are measured and thoroughly checked at this stage. At the end of this phase, a decision on how to use the data mining results should be reached.
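As a minimal illustration of this phase (assuming a prepared feature matrix X_final and labels y like the ones built later in this article), a hold-out split plus standard classification metrics is a common starting point:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

#hold out 20% of the data to evaluate the model on unseen examples
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size = 0.2, random_state = 42)

#train on the training split only
clf = RandomForestClassifier(n_estimators = 100, random_state = 42)
clf.fit(X_train, y_train)

#report precision, recall, and F1 on the held-out split
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))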

6. Deployment

Creating the model is generally not the end of the project. Even when the purpose of the model is to analyze the data and deepen the understanding of it, the knowledge or insight gained from modeling needs to be presented so that end users can act on it. End users could be operational-level staff, business executives, or customers. Deployment often involves applying live models within an organization’s decision-making processes, for example real-time web-page personalization, a product recommender system, or scoring of marketing leads. After this stage, it is also necessary to monitor the performance of the deployed models.

Software engineering principles used in Data mining

Software engineering teams use Agile methodologies to deliver software products within the required timeframe and the resources available to the company or team. Agile-Scrum divides the development process into short iterations called “sprints,” usually lasting 2–4 weeks. This iterative approach ensures continuous user feedback so that the project stays aligned with the client’s requirements, while still being completed within the timeframe with optimal use of resources.

Agile-Scrum methodology

In the context of data science, each sprint involves defining specific goals and tasks for the data analysis and modeling process. The team, consisting of data scientists, analysts, and stakeholders, collaborates closely to plan, execute, and review these tasks. Regular meetings like sprint planning, daily stand-ups, and sprint reviews facilitate communication and adaptability. The aim is to deliver a product in an incremental manner rather than deliver a finished product in one go.

Data product lifecycle

Hence, data science projects incorporate software engineering principles to develop and build data-driven products and services on top of data science methodologies like CRISP-DM and KDD. Now that we have covered the theory, let’s apply these methods to develop a real-world project.

Build, Deploy, and Consume model

In the previous sections, we learned the methodologies used in data mining projects. This section is about implementing such a methodology on a real-world problem to develop, build, and consume a machine learning model. Let’s look at the problem statement and the objective of the project.

Business objective: Improve direct marketing efficiency

Background situation: In the banking sector, direct marketing is an integral part of the marketing strategy. Human agents reach out to customers via telephone or online banking services to offer them long-term deposit products. Banks always want to make such campaigns as efficient as possible.

Data mining goal: Identify the customer traits that affect the success of direct marketing campaigns

Now, we will follow the data mining methodology to build and develop the first version of the product.

Step 1: Load the dataset

We will load the bank dataset using pandas in the project environment.

import pandas as pd

#load data
df = pd.read_csv("./bankData/bank.csv", header = None)

Step 2: Data preparation

In this stage, we clean the dataset and convert the raw features into scaled and encoded features for modeling.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

#drop campaign-related columns (positions 8 to 15)
df.drop(df.columns[8:16], inplace = True, axis = 1)

#target variable (last column)
y = df.iloc[:, -1].values

#extract numeric features (age and balance)
numeric_data = df.iloc[:, [0, 5]].values
numeric_df = pd.DataFrame(numeric_data, columns = ['age', 'balance'], dtype = float)

#standard scaling age
age_std_scale = StandardScaler()
numeric_df['age'] = age_std_scale.fit_transform(numeric_df[['age']])

#standard scaling balance
balance_std_scale = StandardScaler()
numeric_df['balance'] = balance_std_scale.fit_transform(numeric_df[['balance']])

#extract categorical features
X_categoric = df.iloc[:, [1, 2, 3, 4, 6, 7]].values

#one-hot encode the categorical features
ohe = OneHotEncoder()
categoric_data = ohe.fit_transform(X_categoric).toarray()
categoric_df = pd.DataFrame(categoric_data, columns = ohe.get_feature_names_out())

#combine numeric and categorical features into the final training set
X_final = pd.concat([numeric_df, categoric_df], axis = 1)

Step 3: Model training

Once the data is prepared, we train a “RandomForestClassifier” on the final dataset. The trained model will then be served for inference through a Flask API.

from sklearn.ensemble import RandomForestClassifier

#train model for model inference
rfc = RandomForestClassifier(n_estimators = 100)
rfc.fit(X_final, y)
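If you prefer not to retrain the model every time the server starts, one option is to persist the fitted model and preprocessors and load them back in the API code. This is a sketch rather than part of the original pipeline, and the file names are arbitrary.

import joblib

#persist the fitted model and the preprocessors needed at inference time
joblib.dump(rfc, "rfc_model.joblib")
joblib.dump(ohe, "onehot_encoder.joblib")
joblib.dump(age_std_scale, "age_scaler.joblib")
joblib.dump(balance_std_scale, "balance_scaler.joblib")

#later, e.g. at the top of the Flask app, load them back
rfc = joblib.load("rfc_model.joblib")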

Step 4: Model inference

Flask acts as a lightweight backend server and lets us expose the model as an API for inference.

import numpy as np
from flask import Flask, request, Response, json

#create Flask instance
app = Flask(__name__)

#create api endpoint; it uses the encoder, scalers, and model fitted above
@app.route('/api', methods=['GET', 'POST'])
def predict():
    #get data from the request body
    data = request.get_json(force=True)

    #one-hot encode the categorical fields with the fitted encoder
    data_categoric = np.array([data["job"], data["marital"], data["education"],
                               data["default"], data["housing"], data["loan"]])
    data_categoric = np.reshape(data_categoric, (1, -1))
    data_categoric = ohe.transform(data_categoric).toarray()

    #scale age with the fitted scaler
    data_age = np.array([data["age"]])
    data_age = np.reshape(data_age, (1, -1))
    data_age = age_std_scale.transform(data_age)

    #scale balance with the fitted scaler
    data_balance = np.array([data["balance"]])
    data_balance = np.reshape(data_balance, (1, -1))
    data_balance = balance_std_scale.transform(data_balance)

    #combine the features in the same order used during training
    data_final = np.column_stack((data_age, data_balance, data_categoric))

    #make a prediction using the model and return it as JSON
    prediction = rfc.predict(data_final)
    return Response(json.dumps(prediction[0]), mimetype='application/json')

if __name__ == '__main__':
    #run the development server locally
    app.run(debug=True)
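With the server running locally (for example via python app.py, assuming the script is named app.py), the endpoint can be exercised with a small client script. The payload values below are only illustrative and must use categories present in the training data.

import requests

#example payload; the keys match the features used at training time
sample = {
    "age": 35,
    "balance": 1200,
    "job": "technician",
    "marital": "married",
    "education": "secondary",
    "default": "no",
    "housing": "yes",
    "loan": "no"
}

#assumes the Flask app is running locally on port 5000
response = requests.post("http://127.0.0.1:5000/api", json=sample)
print(response.json())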

Once we have developed the model on a local machine, the next goal is to test the code before pushing it to a remote GitHub repository as part of the CI/CD pipeline. A CI/CD pipeline is the standard approach to developing, testing, and delivering software products. The same principle applies to data products and machine learning models, ensuring timely delivery of the end product with high reliability. It also provides agility in development and makes it easier to incorporate feedback, as we learned in the previous sections.

CI/CD workflow and product development stages
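One lightweight way to test the endpoint before pushing is Flask’s built-in test client. The sketch below assumes the Flask instance and the fitted objects live in a module named app.py, which is an assumption about the repository layout; run it with pytest.

from app import app  #assumes the Flask instance is defined in app.py

def test_predict_endpoint():
    client = app.test_client()
    payload = {
        "age": 35, "balance": 1200, "job": "technician",
        "marital": "married", "education": "secondary",
        "default": "no", "housing": "yes", "loan": "no"
    }
    response = client.post("/api", json=payload)
    #the endpoint should respond successfully with a non-empty JSON-encoded label
    assert response.status_code == 200
    assert response.get_data(as_text=True) != ""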

In the next section, we will take the locally tested model to the staging and deployment stages on the Heroku cloud. You can also use other cloud platforms such as Vercel, Streamlit, AWS, or Azure.

Taking the Machine learning model from Dev to staging

After testing the model on a local machine, we need to add a couple of configuration files to deploy our model API on the Heroku cloud. These config files differ depending on the cloud platform you choose. Here’s the structure of the repo and the required files for the deployment.

Repo structure with Heroku config files
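For reference, the Heroku-specific files typically look something like the following. This is a sketch under the assumption that the Flask instance is named app inside a file called app.py; your file names and pinned versions may differ.

Procfile (tells Heroku how to start the web process):
web: gunicorn app:app

requirements.txt (dependencies installed at build time):
flask
gunicorn
scikit-learn
pandas
numpy

runtime.txt (optional, pins the Python version):
python-3.10.12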

After all the development and testing, we can push the files to the remote repository followed by setting up the pipeline on a Heroku cloud. Below are the steps to set up a staging and deployment pipeline.

Step 1: Sign up or log in to the Heroku Cloud platform

Signup/Login page of Heroku Cloud

Step 2: Create a new app

Create App page on the Heroku cloud

Step 3: Create a new pipeline

Create pipeline page inside created app

Step 4: Connect to the GitHub repo and deploy!

Connect with the repo and deploy!

Once you have connected the GitHub repository to Heroku Cloud as shown in the screenshot above, you can deploy the app without any additional configuration. That’s it! The first iteration of our deployed model is ready for demonstration. Following the data mining methodology, once user feedback comes in, developers can incorporate it in the next iteration or sprint.

Finally, here’s how the project workflow would look according to the CI/CD diagram.

Model code from dev to staging and deployment

GitHub Link of the project code: Click Here

Conclusion

In conclusion, the fusion of data science and software engineering is vital for building impactful real-world data products. Familiarity with the various data mining methodologies equips us with the tools needed to navigate the complexities of data science projects. The process of building, deploying, and consuming models bridges the gap between development and practical application. Moreover, moving machine learning models from development to staging enables seamless integration into operational environments. These methodologies also facilitate collaboration between data science and software engineering teams.

FAQs

Q1: What are the 6 processes of data mining?

A: The six processes of data mining are business understanding, data understanding, data preparation, modeling, evaluation, and deployment. This iterative sequence of processes helps data scientists develop and build machine learning models.

Q2: What do you mean by KDD?

A: Knowledge discovery in databases (aka KDD) aims to discover potential insights and knowledge in datasets to achieve specific objectives.

Q3: What is the agile methodology for data mining?

A: Agile-Scrum divides the development process into short iterations called “sprints,” usually lasting 2–4 weeks. It aims to deliver a minimum viable product in its first iteration followed by a series of sprints to build a data product.
