DataPrep — Understanding the story behind the data

Shreyash Rawat
4 min read · Nov 5, 2022

The “scientist” in “data scientist” stands for experimentation, research, and repeated failure. Experimentation is a huge part of the ML pipeline. Most data scientists, when given a dataset, want to start trying different prediction techniques immediately (me included). It always feels like time spent without coding is time wasted, which is a deeply flawed mindset to have.

The experimentation phase of the ML pipeline is not the same as shooting an arrow in the dark. Yes, there are various models that can be tried on almost any dataset, but each model has its own character. If you dig into how a particular model works, you will realize that each one relies on different properties of the data when predicting. For example, a decision tree reaches its prediction through a series of if-else tests, whereas an SVM looks for the widest possible margin separating two classes.
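To make that contrast concrete, here is a minimal sketch in plain Python. The thresholds, labels, and weight vector are made-up values for illustration and do not come from any real dataset or trained model.

```python
import math


def tree_predict(user_rating, num_votes):
    """A hand-written decision tree: the prediction is a chain of if-else tests."""
    if user_rating >= 7.0:
        if num_votes >= 1000:
            return "hit"
        return "niche"
    return "flop"


def svm_margin_width(weights):
    """Margin width of a linear SVM, 2 / ||w||, assuming the usual scaling
    where the support vectors satisfy w.x + b = +1 or -1."""
    norm = math.sqrt(sum(w * w for w in weights))
    return 2.0 / norm


print(tree_predict(8.1, 2500))       # hit
print(svm_margin_width([3.0, 4.0]))  # 0.4
```

The tree only cares about which side of each threshold a value falls on, while the SVM cares about distances to the separating boundary, which is one reason the two can behave very differently on the same data.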

That said, how do you arrive at a good shortlist of models to start experimenting with? The answer is comprehensive data exploration. If a decent amount of time is spent on cleaning and understanding the nature of the data, the time required for experimentation can drop dramatically. The good news is that you do not need a lot of code to do this. In fact, just one line of code can do it.

DataPrep is a low-code data collection, cleaning, and visualization tool that makes it incredibly easy to create machine-learning-ready data and uncover the story behind a dataset. I strongly believe that if you know the story of a dataset, you can create much more effective prediction methods in far less time.

The best part about DataPrep is that it is designed to work in the major notebook environments used for experimentation, including Jupyter, Colab, and Kaggle. It can be installed with the following command.

pip install -U dataprep

I tried DataPrep on a movies dataset, using it for cleaning and visualization to get an idea of the data. The first step is data visualization. DataPrep offers two views: an overall, holistic view of the entire dataframe, and a more specific, detailed view of each column, as shown below.

Holistic viz
Specific view for user_rating column
Value table for rating column in dataframe
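A minimal sketch of how these two views can be produced, assuming `df` is a pandas DataFrame holding the movies data with a `user_rating` column. `show_views` is a hypothetical helper name used only for this example; `plot` is the function from `dataprep.eda`.

```python
def show_views(df, column="user_rating"):
    """Build both DataPrep views for a DataFrame (requires `pip install dataprep`)."""
    # Imported lazily so this sketch can be loaded even where DataPrep is absent.
    from dataprep.eda import plot

    overview = plot(df)        # holistic view of the whole dataframe
    detail = plot(df, column)  # detailed view of a single column
    return overview, detail


# In a notebook cell (assumed usage): overview, detail = show_views(df)
```

In a notebook, each returned object should render as an interactive report when it is the last expression in a cell.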

After thoroughly analyzing the visualizations, you get a clear picture of the distribution of the data, the nature of the individual variables, the correlations between variables, and a lot more. For instance, if you find that one particular column has a lot of missing values but still looks useful for the prediction, you could start your experimentation with tree-based models such as decision trees or a random forest, since many implementations can handle missing values without a separate imputation step (this depends on the library and version, so it is worth checking). Hence, approaching the experimentation process from the perspective of the dataset makes the process faster and more efficient.
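One way to act on this observation is to compute the share of missing values per column and flag the columns above a threshold. The sketch below uses plain Python; the rows, the column names, and the 30% threshold are assumptions chosen for illustration.

```python
def missing_fraction(rows):
    """Return {column: fraction of rows where the value is None or absent}."""
    if not rows:
        return {}
    columns = {key for row in rows for key in row}
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


rows = [
    {"title": "A", "user_rating": 7.9, "budget": None},
    {"title": "B", "user_rating": None, "budget": None},
    {"title": "C", "user_rating": 6.4, "budget": 1_000_000},
    {"title": "D", "user_rating": 8.2, "budget": None},
]

fractions = missing_fraction(rows)
mostly_missing = sorted(col for col, frac in fractions.items() if frac > 0.3)
print(mostly_missing)  # ['budget']
```

A column flagged this way is a candidate either for imputation or for a model family that tolerates gaps in its inputs.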

Next, the package also has some super helpful functions for cleaning your data without having to write the explicit steps yourself. Built-in functions such as clean_date, clean_country, and clean_address automatically parse and standardize the values in a column, which makes them really useful for cleaning common data formats quickly. Apart from these specific cleaning functions, there is also a holistic function called clean_df that takes the entire dataframe and cleans it to the best of its ability while also reducing its memory footprint, which can be very handy when working with big datasets.

Clean_df report
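A minimal sketch of how the cleaning functions might be combined, assuming `df` is a pandas DataFrame with a `country` column (the column name and the helper name `clean_movies` are hypothetical). As I understand the DataPrep documentation, `clean_country` adds a standardized `country_clean` column and `clean_df` returns a pair: a table of inferred data types and the cleaned dataframe. It is worth confirming both against the version you have installed.

```python
def clean_movies(df, country_column="country"):
    """Standardize a country column, then run the holistic clean_df pass."""
    # Imported lazily so this sketch can be loaded even where DataPrep is absent.
    from dataprep.clean import clean_country, clean_df

    # Column-specific cleaning: standardizes country names into a new column.
    df = clean_country(df, country_column)

    # Whole-dataframe cleaning: assumed to return (inferred dtypes, cleaned df).
    inferred_dtypes, cleaned = clean_df(df)
    return inferred_dtypes, cleaned


# In a notebook cell (assumed usage): inferred_dtypes, cleaned = clean_movies(df)
```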

Strengths

· DataPrep gets you most of the visualizations and aggregations needed to analyze the data in 2–3 lines of code.

· It can also clean some common kinds of values, such as country, address, and zip code.

· The holistic clean_df feature not only cleans the entire dataframe but also reduces the amount of memory the data takes up.

Limitations

· Columns that need custom cleaning still require you to write your own cleaning functions.

· Functionality for multivariate analysis is not available.

Conclusion

DataPrep is a very handy tool that can visualize a dataset appropriately, clean the data automatically, and reduce the dataset's size, streamlining the ML experimentation process with just a few lines of code.
