In this post I will give a brief overview of how a data science or machine learning project is handled end to end. There are around 11 steps, listed below. In some companies the data scientist is responsible for every step, while in others product managers, data engineers, and software engineers help with some of them.
1. Understand Business Requirement
This is the first step in a data science project, and an important one. At this stage we define the problem and understand who the customer (end user) is for whom we are developing the machine learning or deep learning solution. This is much like a product management role: you talk to the customers or business units you are building the solution for and try to understand their core problems.
2. Data Acquisition
Once you understand the problem, the next step is to acquire the right data to solve it. This is usually done through ETL, and the most popular tool is SQL. You can acquire data from various sources such as databases, data warehouses, log files, the web, and Hadoop/Spark. Data engineers help with this step.
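As a minimal sketch of SQL-based acquisition, the snippet below uses Python's built-in sqlite3 module with an in-memory database. The orders table, its columns, and the sample rows are hypothetical and invented for illustration; a real project would query a production database or data warehouse instead.

```python
import sqlite3

# Hypothetical source table, created in memory purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL, country TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 20.0, "IN"), (1, 35.5, "IN"), (2, 12.0, "US"), (3, 99.9, "US")],
)

def total_spend_per_customer(connection, country):
    """Extract step: pull only the rows and columns the problem needs."""
    query = """
        SELECT customer_id, SUM(amount) AS total_amount
        FROM orders
        WHERE country = ?
        GROUP BY customer_id
        ORDER BY customer_id
    """
    # The "?" placeholder lets the driver bind the value safely.
    return connection.execute(query, (country,)).fetchall()

rows = total_spend_per_customer(conn, "IN")
print(rows)  # [(1, 55.5)]
```

Filtering and aggregating in SQL keeps the heavy lifting in the database, so only the data you actually need reaches your analysis code.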
3. Data Preparation
Collected data can be noisy. To build a model, we need to clean and pre-process the data to bring it into a proper format so that models can be built on it. This is one of the most important steps.
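To make "cleaning" concrete, here is a small sketch using only the Python standard library. The records, the field names (age, city), and the choice of median imputation are assumptions made for illustration; the right cleaning rules always depend on your data and problem.

```python
from statistics import median

# Hypothetical raw records; "" and unparseable text stand in for bad values.
raw_records = [
    {"age": "34", "city": " Pune "},
    {"age": "", "city": "pune"},
    {"age": "29", "city": "MUMBAI"},
    {"age": "34", "city": " Pune "},   # exact duplicate of the first row
    {"age": "abc", "city": "Mumbai"},  # unparseable age
]

def parse_age(value):
    """Return the age as an int, or None if it is missing or not a number."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def clean(records):
    # 1. Drop exact duplicates while keeping the original order.
    seen, unique = set(), []
    for rec in records:
        key = (rec["age"], rec["city"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    # 2. Normalise text fields and parse numeric fields.
    parsed = [
        {"age": parse_age(r["age"]), "city": r["city"].strip().title()}
        for r in unique
    ]
    # 3. Impute missing ages with the median of the observed ages.
    observed = [r["age"] for r in parsed if r["age"] is not None]
    fill = median(observed)
    for r in parsed:
        if r["age"] is None:
            r["age"] = fill
    return parsed

cleaned = clean(raw_records)
for row in cleaned:
    print(row)
```

After cleaning, the five raw rows become four consistent rows: the duplicate is gone, city names share one spelling, and every age is a number.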
4. Exploratory Data Analysis
In this stage you try to understand the data using various plots and visualization techniques such as t-SNE. You also use statistical tools such as hypothesis testing to understand the nature of the data. You slice and dice the data, look at it from various angles, and try to identify the features that will help you achieve your goal.
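As one example of hypothesis testing during exploration, the sketch below runs a simple permutation test to check whether two groups differ in their mean. The permutation test is just one possible choice of test, and the group values, the number of permutations, and the seed are all hypothetical.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns an approximate p-value: the share of random relabelings whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign values to the two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical feature values for two customer segments.
segment_a = [12.1, 11.8, 13.0, 12.5, 11.9, 12.7, 13.2, 12.0]
segment_b = [14.0, 13.8, 14.5, 13.9, 14.2, 14.8, 13.7, 14.1]

p_value = permutation_test(segment_a, segment_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the feature really does differ between the segments, which makes it a candidate worth keeping for modelling.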
5. Modelling, Evaluation and Interpretation
All of your machine learning models, such as linear regression, logistic regression, KNN, decision trees, random forests, and deep learning models, are part of this step. Even if you are a good modeler, your model won't perform well if you don't focus on the previous steps.
Evaluation deals with defining the KPI: which performance metrics you are going to use to assess your models, and how these metrics connect to the business requirement.
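To illustrate what performance metrics look like in practice, here is a small sketch that computes accuracy, precision, recall, and F1 for a binary classifier from scratch. The labels and predictions are hypothetical, and these four metrics are only common examples; your KPI may call for something else entirely.

```python
def classification_metrics(y_true, y_pred):
    """Compute basic binary classification metrics (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical ground truth and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Which metric matters depends on the business requirement: if missing a positive case is costly you would watch recall, while if false alarms are costly you would watch precision.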
Sometimes interpretation is important. You want to understand why your model is performing well and which features contribute most to predicting the correct output.
Lots of people focus mostly on this step, but the previous steps are more important; they are the foundation of any data science project.
6. Communicating Results
After modelling, once you have your initial results, you need to communicate them to all the stakeholders clearly and simply, without technical language. One of the best ways is to prepare a short report of 1 to 6 pages, no more than 6, in very simple and clear terms. Communication is important because this is where you convince your manager and the business teams that the solution is the right one and should be deployed in the real world.
7. Deployment
Once you have communicated your results and received all approvals, you go ahead and deploy your model. Software engineering has a key role here. Software engineers help with this step because they know how to make a system robust, stable, and scalable.
8. Real World Testing
A/B testing is a widely used method for testing machine learning models in the real world. At the end of an A/B test you measure the true business impact of the model and of all the effort you have put in.
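As a rough sketch of how the outcome of an A/B test might be analysed, the snippet below runs a two-proportion z-test on conversion counts. The z-test is one common choice rather than the only one, and the traffic and conversion numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (A) and treatment (B).

    Returns (absolute lift, z statistic, two-sided p-value).
    """
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants are equal.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_b - rate_a, z, p_value

# Hypothetical experiment: A is the old system, B uses the new model.
lift, z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"lift: {lift:.3%}, z: {z:.2f}, p-value: {p_value:.4f}")
```

The lift is the business impact you report, and the p-value tells you how likely a difference this large would be if the model had no real effect.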
9. Customer/Business Buy-In
Once you have tested the business impact, you go back to the business owners with all the data, statistics, and model results and convince them to buy in. You need to convince them that the solution you developed will add business value.
10. Maintenance
This is the maintenance step. You need to handle failures in production. Data changes every day, so you need to retrain models to cope with the changing data. That means defining a process for handling failures and for retraining models.
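One simple way to decide when to retrain is to monitor how far recent production data has moved from the data the model was trained on. The sketch below is only an illustration of that idea: the feature values, the function name needs_retraining, and the threshold of 2 standard deviations are all assumptions, and real monitoring usually tracks many features and the model's own metrics as well.

```python
from statistics import mean, stdev

def needs_retraining(training_values, recent_values, threshold=2.0):
    """Flag drift when the recent mean of a feature moves more than
    `threshold` training standard deviations away from the training mean."""
    baseline_mean = mean(training_values)
    baseline_std = stdev(training_values)
    shift = abs(mean(recent_values) - baseline_mean) / baseline_std
    return shift > threshold

# Hypothetical values of one input feature.
training_values = [50, 52, 49, 51, 48, 50, 53, 47]  # seen at training time
recent_stable = [51, 49, 50, 52]                    # production, no drift
recent_drifted = [58, 60, 57, 59]                   # production, after drift

print(needs_retraining(training_values, recent_stable))   # False
print(needs_retraining(training_values, recent_drifted))  # True
```

A check like this can run on a schedule and trigger the retraining process you have defined whenever it returns True.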
11. Optimization
The model you developed in the modelling stage is a first-cut model that can certainly be improved. There may be data you skipped, and you can dig deeper to extract new information. So you go back to the drawing board and think about acquiring more data, building new features, and optimizing code. Optimization is a continuous process; large companies spend multiple years optimizing their models.
I will share one end-to-end case study, which will cover the first few steps.
Thanks for reading.