Data Structuring and Preparation For an ML Model

Chukwuka Chime
Published in Axum Labs
Sep 19, 2019

The importance of proper data structuring and preparation, usually just called data preparation, cannot be overemphasized. It is a highly valuable and often overlooked step in training a model. Newcomers skip it most often, but it’s not unheard of for more experienced data scientists to do the same. Today, organizations continue to look for ways to prepare data quickly for their ML models and solve their data problems. During my team’s first machine learning project, we discovered that we needed to do much more data preparation to get our model to train better and more efficiently. In this post I will look at how to go through this process to prepare data for an ML model.

What is Data Preparation?

Data preparation means taking data in its raw form and making sure it is organized properly, easy to read and understand, and in the right format. In a nutshell, it makes raw data well structured and readily and accurately available for an ML model.

Why Data Structuring and Preparation Is Good for an ML Model

Building a well-optimized ML model takes a lot of data, because a model essentially finds patterns in datasets and uses them to make predictions. Having good, accurate and clean data will save you a lot of time debugging the poor performance of your model. Truth be told, you’ll spend the majority of your time on a machine learning project constructing and cleaning datasets and transforming data. Well-structured data also helps your model train more efficiently and make better predictions. Briefly, here are some core benefits of data preparation and cleaning:

  • Data cleaning and preparation helps recover incomplete data by filling in missing values and reducing ambiguity.
  • Noisy data can be properly cleaned and reshaped, with outliers taken out and errors fixed.
  • The data preparation process usually reduces the size of the original data, which improves processing efficiency and speed.
  • Overall, it produces quality data, which leads to a better-performing model.

Steps Involved in Data Structuring and Preparation

To get your data ready for your ML model, you first need to understand the problem you are solving. That understanding will guide you through the steps of getting the data ready, which are:

  1. Data collection
  2. Data preprocessing

Depending on your model, the problem you are solving and the structure of your data, some of the steps I explain below may not apply. The idea is to understand them so that, irrespective of how your data is structured, you can use these techniques to get it ready for your model.

Data Collection

In machine learning there is a general notion that your model is only as good as your data. It is important to get a lot of data that is well structured, of good quality and from a reliable source, because this lays a good foundation for data preparation. Some people ask, “Is it important to have a lot of data, and how much is a lot?” The answer is yes, you need a lot of data to help your model work well. How much counts as a lot depends on the kind of problem you are solving. For example, if you are building a spam filtering system, the data required to train your model may not be the same as the data required to build a language translator. It boils down to how well you understand the problem you are solving, but in general a low-bias model trained on a lot of data tends to win.

The quality of the data collected should not be overlooked either. A lot of poor-quality data is like having no data at all. There are a few things to consider when thinking about data quality: the reliability of the data, how well features are represented in it, outliers, and whether numeric values are normalized. Most of the time data is labeled by humans, so there are bound to be errors, and those must be eliminated too.
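Two of the quality checks mentioned above, outliers and normalized numeric values, can be sketched in a few lines. This is only an illustration using Python's standard library. The function names, the 1.5 × IQR rule for flagging outliers, and the sample `ages` column are my own assumptions, not part of any particular toolkit.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR].

    The k=1.5 cutoff is a common rule of thumb, assumed here for illustration.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def min_max_normalize(values):
    """Rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every value is identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column of ages with one suspicious entry.
ages = [23, 25, 27, 29, 31, 33, 35, 240]
outliers = iqr_outliers(ages)
scaled = min_max_normalize([10, 20, 30])
```

With this sample, `outliers` contains only 240, which is more likely a data entry error than a real age. Whether to drop, correct or keep such a value is still a judgment call that depends on the problem you are solving.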

Data Preprocessing

Once you are sure of your data source and have collected the required data, the next very important step is data preprocessing. As the name implies, this step involves transforming noisy data into clean, well-structured data that can be run through an ML model. After the data collection process, you will generally discover that your data has issues like outliers, missing or empty columns, inconsistent units, incorrect, inconsistent or skewed information, or an inappropriate file format.

Formatting: Here you want to make sure everything is standardized, starting with the file format. Sometimes your data comes in the form of a relational database and you want it as a text document, or it comes as a text document and you want it as a flat file (CSV). Data is also sometimes collected from different sources, so there might be differences in how units are represented. For example, a price column from one source may be written as $20 while another source writes it as USD 20, and a state name may appear both in full and abbreviated (e.g. Lagos and LAG). Anomalies like these can affect model performance.
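As a rough sketch of this kind of standardization, the snippet below brings the two price spellings and the two state spellings from the example into one representation. The helper names and the `STATE_ALIASES` table are hypothetical, and the code assumes every price is in US dollars; real data would need its own mapping and currency handling.

```python
import re

# Hypothetical alias table: maps lowercase spellings to one canonical name.
STATE_ALIASES = {"lag": "Lagos", "lagos": "Lagos"}


def parse_price(raw):
    """Turn strings such as '$20' or 'USD 20' into a float amount.

    Assumes all prices are in US dollars, so the currency marker is dropped.
    """
    match = re.search(r"\d+(?:\.\d+)?", raw.replace(",", ""))
    if match is None:
        raise ValueError(f"no numeric amount in {raw!r}")
    return float(match.group())


def normalize_state(raw):
    """Map a known alias to its canonical name; leave unknown names trimmed."""
    key = raw.strip().lower()
    return STATE_ALIASES.get(key, raw.strip())


# Two records for the same kind of item, collected from different sources.
rows = [
    {"price": "$20", "state": "Lagos"},
    {"price": "USD 20", "state": "LAG"},
]
cleaned = [
    {"price": parse_price(r["price"]), "state": normalize_state(r["state"])}
    for r in rows
]
```

After this step both records carry the price 20.0 and the state Lagos, so the model no longer sees them as different values.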

Cleaning: In the cleaning process you handle all missing values, extreme values, outliers and erroneous data. You need to be careful here. For example, missing values could become pointers to new features later, so handling them is a bit tricky. There are various strategies, but you want to first analyze the reasons the data is missing. One quick fix is to remove all fields with missing data, but this might hurt your model, so consider an approach like imputation, which sets values for the missing fields. This can be done by:

  1. Using the values estimated by another predictive model
  2. Using the mean, median or mode of the column the missing field belongs to
  3. Using values from a randomly selected record
  4. Using a constant value that has meaning within the domain, like 0 or “missing”
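Strategies 2 to 4 in the list above can be sketched with Python's standard library as follows. The `impute` function, its parameters and the sample column are hypothetical names chosen for illustration, and missing entries are assumed to be stored as `None`. Strategy 1, imputing with another predictive model, is left out because it depends on which model you choose.

```python
import random
import statistics


def impute(values, strategy="mean", fill_value=0, rng=None):
    """Return a copy of `values` with every None entry filled in.

    strategy: "mean", "median", "mode", "random" or "constant".
    """
    observed = [v for v in values if v is not None]
    if strategy != "constant" and not observed:
        raise ValueError("no observed values to impute from")

    if strategy == "random":
        # Draw each replacement from a randomly selected observed record.
        rng = rng or random.Random()
        return [rng.choice(observed) if v is None else v for v in values]

    summaries = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    if strategy == "constant":
        fill = fill_value
    elif strategy in summaries:
        fill = summaries[strategy](observed)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [fill if v is None else v for v in values]


# Hypothetical numeric column with two missing entries.
ages = [18, None, 22, 22, 50, None]
by_mean = impute(ages, "mean")
by_median = impute(ages, "median")
by_mode = impute(ages, "mode")
by_random = impute(ages, "random", rng=random.Random(0))
by_constant = impute(["red", None, "blue"], "constant", fill_value="missing")
```

On this sample the mean fills the gaps with 28 while the median and mode both fill them with 22, which shows how a single large value (50) pulls the mean upward. That is one reason to look at why data is missing and how it is distributed before picking a strategy.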

Conclusion

In this post, you have learned what data preparation is and the role it plays in enhancing the performance of your ML model. You have also seen that you spend more time preparing your data than running your actual model, so this step deserves more attention. Data preparation is a skill that all data scientists and machine learning experts should have at their fingertips.

We have also covered some basic steps involved in the data preparation process. Following this process will save you a lot of time troubleshooting your model and help you solve your problem faster with a more efficient model.
