Experts often say that data preparation consumes around 80% of the overall time of an analytics project. Isn’t that staggering?
I have been working on some data analysis assignments at work, which made me curious about the data science process, since it is more scientific and grounded in factual data. To this end, I turned to Azure Machine Learning (AML) for hands-on practice and found the environment quite user-friendly and collaborative. To my amazement, data preparation consumed a considerable chunk of the time I spent building an analytical service, and it laid the foundation for the modeling and prediction steps that followed.
What is Data Preparation or Data Pre-processing?
The image below depicts the Cross Industry Standard Process for Data Mining, or CRISP-DM (refer to the link for more details), which is widely used in industry. It outlines a six-phase iterative framework for data analysts and data scientists to follow. In practice it is not necessarily executed linearly: many of the phases can be carried out in a different order, and analysts may move back and forth between phases as needed.
According to CRISP-DM, the data preparation phase covers all activities to construct the final dataset from the initial raw data in order to prepare the data for further processing. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order.
(For those who might not know, data mining is the process of analyzing raw data to identify patterns and establish relationships in data to solve complex problems.)
Among the various standard approaches, I am referring to CRISP-DM here because it includes business understanding and data understanding, both of which I consider quite important. Also, notice the two-way connections, which indicate the iterations that will be required as new data and relationships emerge, in order to refine the predictions and increase model accuracy.
Why Data Preparation?
Data comes from a multitude of sources; it can be high in volume and have a variety of attributes. Real-world data is generally noisy, incomplete and inconsistent. This means that raw data tends to be corrupt and to contain missing values or attributes, outliers, or conflicting values. The data preparation stage resolves these kinds of issues so that the dataset used in the modeling stage is of acceptable, improved quality. Analytical models fed with poor-quality data can produce misleading predictions.
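To make these issues concrete, here is a minimal sketch in plain Python of three common fixes: filling missing values, flagging outliers and reconciling inconsistent labels. The sample values, the function names, the choice of median imputation and the 1.5 x IQR outlier rule are all my own assumptions for illustration; they are not taken from AML or CRISP-DM, and real projects may call for different techniques.

```python
from statistics import median, quantiles

# Hypothetical raw data: None marks a missing value, 950.0 is a likely outlier,
# and the city labels are spelled inconsistently.
raw_incomes = [52.0, 48.0, None, 61.0, 55.0, 950.0, 50.0, None, 58.0]
raw_cities = [" Pune", "pune", "MUMBAI", "Mumbai ", "Pune"]


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_outliers(values, k=1.5):
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def normalize_labels(labels):
    """Trim whitespace and unify capitalization so equal labels compare equal."""
    return [label.strip().title() for label in labels]


incomes = impute_median(raw_incomes)
outlier_flags = flag_outliers(incomes)
cities = normalize_labels(raw_cities)

print(incomes)
print(outlier_flags)
print(cities)
```

The median is used for imputation here because it is less distorted by the extreme value than the mean would be. Whether a flagged outlier should be removed, capped or kept is a judgment call that depends on the business context.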
What does Data Preparation include?
Data understanding precedes data preparation (refer to the CRISP-DM image above). In this phase the data is scanned to become familiar with it, to identify data quality problems and to discover first insights. This can be done by checking the number and type of features, descriptive statistics and visualizations, missing values, inconsistent data records, etc.
Human discretion and decision-making skills are vital for adequately analyzing and preparing your data for the following stages of the data mining process. It is imperative to understand the nature of the data, the business objective, and the impact data preparation will have on the results of the analysis supporting that business objective. Other well-known terms in the data science world are Data Wrangling and Data Munging, which refer to data preparation done during the interactive data analysis and model building stages.
I’ve summarized my learnings in the table below, which gives a snapshot of the main activities involved in Data Preparation. Each of these tasks can be handled in several ways; these are extensive and hence not detailed here.
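As a rough illustration of such a first scan, the sketch below profiles a handful of records using only the Python standard library. The records, the field names and the `profile` helper are hypothetical and exist only for this example; in AML or another tool the same checks would be done through its own interface.

```python
from statistics import mean, median

# Hypothetical sample records; None marks a missing value.
records = [
    {"age": 34, "city": "Pune", "income": 52000.0},
    {"age": None, "city": "pune", "income": 48000.0},
    {"age": 29, "city": "Mumbai", "income": None},
    {"age": 41, "city": "Mumbai", "income": 61000.0},
]


def profile(rows):
    """Summarize each feature: inferred type, missing count, basic statistics."""
    summary = {}
    features = sorted({key for row in rows for key in row})
    for name in features:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        info = {
            "type": type(present[0]).__name__ if present else "unknown",
            "missing": len(values) - len(present),
        }
        if present and all(isinstance(v, (int, float)) for v in present):
            # Numeric feature: report descriptive statistics.
            info["min"] = min(present)
            info["max"] = max(present)
            info["mean"] = mean(present)
            info["median"] = median(present)
        else:
            # Non-numeric feature: list distinct values to expose inconsistencies.
            info["distinct"] = sorted(set(present))
        summary[name] = info
    return summary


report = profile(records)
print("number of features:", len(report))
for name, info in report.items():
    print(name, info)
```

Even on this tiny sample the scan surfaces two typical findings: missing values in `age` and `income`, and the same city recorded as both "Pune" and "pune".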
What are the benefits?
Good data preparation is crucial to producing valid and reliable models with high accuracy and efficiency. It is essential to spot data issues early to avoid misleading predictions, because the accuracy of any analytical model depends heavily on the quality of the data fed into it. High-quality data leads to more useful insights, which enhance organizational decision making and improve overall operational efficiency. Data preparation carried out carefully and with an analytical mindset can save a lot of time and effort, and hence cost.
Future of Data Preparation / Data Wrangling
As data science activities depend significantly on human experience and wisdom, it is not possible to automate every dimension of data science and machine learning. However, many self-service and cloud-based data preparation tools are rapidly emerging in the market to automate parts of the data preparation process. Trifacta is one such next-generation data wrangling company, aiming to use machine learning to automate data preparation tasks. Google has also launched Cloud Dataprep, which embeds the Trifacta interface, to ease data preparation for machine learning.
With technological advancements like IoT and Artificial Intelligence leading to a data deluge, effective data preparation is key to the success of any data science project. In the future, data preparation will increasingly be powered by machine learning to make it more automated. Greater user-friendliness, transparency and interactivity will also be major goals of future data preparation approaches. There remains a lot of evolution to be seen in this area.