This article gives a high-level overview of what to expect in a typical data science pipeline, from framing your business problem to creating actionable insights.
The starting point of any data science problem is to formulate the questions you want your data to answer.
For example, suppose you have gathered data from online surveys, feedback from regular customers, historical purchase orders, historical complaints, past crises, etc. Using these piles of different data, you can ask questions such as: which customers are most likely to stop buying, which products drive repeat purchases, or which issues trigger the most complaints.
The more questions you ask of your data, the more insight you will get. This is how your own data unfolds the hidden knowledge that has the potential to transform your business.
The following steps depict a typical pipeline for addressing any data science problem:
- Getting your data
- Preparing / cleaning your data
- Exploration / visualization of the data, which lets you find patterns in the numbers
- Modeling the data
- Interpreting the findings
- Re-visiting / updating your model
Getting Your Data
Data science can’t answer any question without data. So the most important thing is to obtain not just any data, but authentic and reliable data. It’s simple: garbage in, garbage out.
As a rule of thumb, apply strict checks when obtaining your data. Gather all of your available datasets (from the internet, external/internal databases, or third parties) and extract them into a usable format (.csv, .json, .xml, etc.).
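As a minimal sketch, pandas can read these formats into one usable table; the column names and inline data here are hypothetical stand-ins for real exported files:

```python
import io

import pandas as pd

# In practice these would be file paths from your data sources;
# small inline strings stand in for them here.
csv_data = io.StringIO("customer_id,amount\n1,120.5\n2,80.0\n")
json_data = io.StringIO('[{"customer_id": 1, "rating": 4}, {"customer_id": 2, "rating": 5}]')

orders = pd.read_csv(csv_data)      # historical purchase orders (.csv)
surveys = pd.read_json(json_data)   # online survey feedback (.json)

# Merge the sources on a shared key into a single dataset
data = orders.merge(surveys, on="customer_id")
print(data.shape)  # (2, 3)
```

The same pattern applies to `pd.read_xml`, database connections, and API responses: get everything into one tabular format before moving on.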
Preparing / Cleaning Your Data
This phase of the pipeline is very time-consuming and laborious. Most of the time, data comes with its own anomalies: missing values, duplicate records, irrelevant features, etc. So it becomes very important to do a cleanup exercise and keep only the information relevant to the problem being asked, because the results of your machine learning model are only as good as what you put into it. Again, garbage in, garbage out.
The objective is to examine the data thoroughly and understand every feature you’re working with: identifying errors, filling data holes, removing duplicate or corrupt records, and sometimes throwing away an entire feature. Domain expertise is crucial at this stage to understand the impact of any feature or value.
- Coding languages: Python, R
- Data-wrangling tools: NumPy, Pandas (Python libraries), R
- Distributed processing: Hadoop MapReduce / Spark
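A sketch of this cleanup exercise with pandas, using a hypothetical dataset that shows the three anomalies mentioned above (duplicates, a missing value, an irrelevant feature):

```python
import numpy as np
import pandas as pd

# Hypothetical raw dataset with typical anomalies
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "amount": [120.5, 80.0, 80.0, np.nan, 60.0],
    "unused_code": ["a", "b", "b", "c", "d"],  # irrelevant feature
})

clean = (
    raw.drop_duplicates()              # remove duplicate records
       .drop(columns=["unused_code"])  # throw away an irrelevant feature
       # fill data holes, here with the column median
       .assign(amount=lambda d: d["amount"].fillna(d["amount"].median()))
)
print(clean)
```

How to fill a hole (median, mean, a domain default, or dropping the row) is exactly where the domain expertise mentioned above comes in.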
Exploration / Visualization of data
During the visualization phase, you try to find the patterns and values your data holds, using different types of visualizations and statistical tests to back up your findings. This is where your data starts revealing its hidden secrets through graphs, charts, and analysis. Domain expertise is desirable at this stage to fully understand the visualizations and their interpretations.
The objective is to find patterns through visualizations and charts, which also leads into the feature-extraction step: using statistics to identify and test significant variables.
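A small exploration sketch on synthetic data (the `visits`/`spend` relationship is invented for illustration): summary statistics, a correlation as a simple statistical check, and a scatter plot to make the pattern visible.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical dataset: spend tends to rise with the number of visits
visits = rng.integers(1, 20, size=200)
spend = visits * 5 + rng.normal(0, 5, size=200)
df = pd.DataFrame({"visits": visits, "spend": spend})

print(df.describe())                   # summary statistics per feature
corr = df["visits"].corr(df["spend"])  # simple statistical backing
print(f"correlation: {corr:.2f}")      # strongly positive here

df.plot.scatter(x="visits", y="spend")  # the pattern is obvious once plotted
plt.savefig("visits_vs_spend.png")
```

A variable that correlates this strongly with the target is exactly the kind of significant feature you would carry forward into modeling.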
Modeling the data (Machine learning)
Machine learning models are generic tools; you can access many algorithms and use them to accomplish different business goals. The better the features you use, the better your predictive power will be. After cleaning the data and finding the features most important to a given business problem, using a relevant model as a predictive tool will enhance business decision-making.
The first objective is in-depth analytics: mainly, creating relevant machine learning models, such as a predictive model to answer prediction-related problems.
The second objective is to evaluate and refine your model. This involves multiple evaluation and optimization cycles; no machine learning model performs at its best on the first attempt. You increase its accuracy by training it on fresh data, minimizing losses, and so on.
Various techniques are available to assess the accuracy or quality of your model, and evaluating your machine learning algorithm is an essential part of the pipeline. A model may give satisfying results when evaluated with one metric, say accuracy_score, but poor results against another, such as logarithmic_loss. Classification accuracy is the standard way to measure performance, but it is not enough on its own to truly judge a model.
So here you test multiple models for performance, error rate, etc., and choose the optimum one for your business problem.
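A sketch of that comparison with scikit-learn, using a synthetic dataset and two arbitrarily chosen candidate models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a cleaned business dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Test multiple candidate models and compare held-out performance
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in candidates.items()}
print(scores)

best = max(scores, key=scores.get)
print(f"best model: {best}")
```

In practice "optimum" also weighs training cost, interpretability, and the error rate your business can tolerate, not just the top score.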
Some of the commonly used evaluation metrics are:
- Classification Accuracy
- Logarithmic Loss
- Confusion Matrix
- Area Under the Curve (AUC)
- F1 Score
- Mean Absolute Error
- Mean Squared Error
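Most of the metrics listed above are available in scikit-learn’s `metrics` module. A sketch of evaluating one classifier against several of them side by side (the data is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]  # probabilities for the positive class

# A model can look good on one metric and weaker on another,
# so report several of them together.
acc = accuracy_score(y_test, pred)
loss = log_loss(y_test, proba)
auc = roc_auc_score(y_test, proba)
f1 = f1_score(y_test, pred)

print(f"accuracy: {acc:.3f}  log loss: {loss:.3f}  AUC: {auc:.3f}  F1: {f1:.3f}")
print("confusion matrix:\n", confusion_matrix(y_test, pred))
```

Note that accuracy, F1, and the confusion matrix work on hard predictions, while log loss and AUC need the predicted probabilities.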
- Machine Learning: supervised/unsupervised algorithms
- Evaluation methods
- Machine Learning Libraries: Python (scikit-learn, NumPy)
- Linear Algebra & Multivariate Calculus
Interpreting the findings
Interpreting the data means communicating your findings to the interested parties. If you can’t explain your findings to someone, then believe me, whatever you have done is of no use. Hence, this step is crucial.
The objective of this step is to first identify the business insight and then correlate it with your data findings. You might need to involve domain experts here: they can help you frame your findings along business dimensions, which also aids in communicating the facts to a non-technical audience.
- Business Domain Knowledge
- Data Visualization Tools: Tableau, D3.js, Matplotlib, ggplot, Seaborn
- Communication: Presenting/Speaking & Reporting/Writing
Re-visiting your model
Once your model is in production, it becomes important to revisit and update it periodically, depending on how often you receive new data and on changes in the nature of the business. The more data you receive, the more frequent the updates will be.
Assume you’re working for a transport provider, and one day fuel prices spike and the company has to bring electric vehicles into its fleet. Your old model doesn’t know about this category, so you must update the model to include the new type of vehicle. If you don’t, your model will degrade over time and won’t perform as well, dragging your business down with it. The introduction of new features alters model performance, either through different variations or through correlations with other features.
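One simple way to operationalize this revisit: periodically score the production model on fresh labeled data and retrain when it falls below a threshold. A sketch under assumptions (the threshold is hypothetical, and a differently generated synthetic dataset stands in for the shifted business):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

ACCURACY_FLOOR = 0.8  # hypothetical business threshold

# Model trained on last period's data
X_old, y_old = make_classification(n_samples=400, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_old, y_old)

# Fresh labeled data from production; a different random seed plus a
# feature shift simulates a change in the business (e.g. new vehicles)
X_new, y_new = make_classification(n_samples=400, shift=2.0, random_state=1)

score = model.score(X_new, y_new)
print(f"accuracy on fresh data: {score:.2f}")

if score < ACCURACY_FLOOR:
    # Retrain on the new data so the model tracks the changed business
    model.fit(X_new, y_new)
    print(f"retrained, new accuracy: {model.score(X_new, y_new):.2f}")
```

Real deployments usually retrain on a blend of old and new data and version each model, but the check-then-retrain loop is the core idea.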
Most of the problems you will face are, in fact, engineering problems. Even with all the resources of a great machine learning stack, most of the impact will come from great features, not great algorithms. So the basic approach is:
- Make sure your pipeline is solid end to end
- Start with a reasonable objective
- Understand your data intuitively
- Make sure that your pipeline stays solid
So, this is how I look at the data science pipeline. If there is anything you would like to add to this article, or if you find any slip-up, feel free to leave a message and don’t hesitate! Any sort of feedback is truly appreciated.
Connect with me on LinkedIn: