The Unsexy Data Science: Cleaning and Labeling Data


Shameless plug: We are a data annotation platform that makes it super easy for you to build ML datasets. Just upload data, invite your team, and build datasets super quick. Check us out.

We hate to disclose this upfront, but the least talked about and most important part of any machine learning pipeline is cleaning and preparing data so it is usable for training, validation, and testing. More than anything else, what separates your machine learning model from Google’s is the quality of the data.

“I kind of have to be a master of cleaning, extracting and trusting my data before I do anything with it”. — Scott Nicholson

Data Annotations Made Easy

Improving data quality generally involves many steps; let’s go through them:

Data formatting:

  • Data with units should have the same format throughout; for example money, hours, addresses, etc.
  • Make sure you do lowercasing/stemming where case sensitivity is not one of the features (see the sketch after this list).
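To make this concrete, here is a minimal sketch of the formatting step using pandas; the column names, currency strings, and values are made up for illustration and are not from any particular dataset.

```python
import pandas as pd

# Hypothetical raw data with inconsistent money formats and casing.
df = pd.DataFrame({
    "price": ["$1,200", "950 USD", "$87.50"],
    "city": ["New York", "new york", "NEW YORK "],
})

# Bring money into one numeric format: strip symbols/letters, cast to float.
df["price"] = (
    df["price"]
    .str.replace(r"[^0-9.]", "", regex=True)
    .astype(float)
)

# Lowercase (and trim) text columns where case carries no signal.
df["city"] = df["city"].str.lower().str.strip()

print(df)
```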

Data Cleaning:

Jason Bourne isn’t happy with your data quality
  1. Removing duplicate values could be a start to the cleaning exercise.
  2. Missing values can play a big part in the kind of output you get from your model, so there has to be a strategy in place to deal with them.
  3. Depending on the domain and the kind of data you have, you could substitute missing values with
  • Mean/Median value [for numerical values]
  • Dummy value [not for categorical data]
  • Most frequent [for categorical values]
  • New class ‘UNKNOWN’ [for categorical values]

4. Another thing people do to deal with missing data is to remove the rows where a column is missing a value. But make sure you still have enough data to train your model after removing them (see the sketch below).
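A minimal sketch of these missing-value strategies in pandas; the columns and values are made up, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [23, np.nan, 31, 27, np.nan],
    "city": ["delhi", None, "mumbai", "delhi", None],
})

# 1. Remove exact duplicate rows.
df = df.drop_duplicates()

# 2./3. Numerical column: substitute the mean (or use .median()).
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: most frequent value...
# df["city"] = df["city"].fillna(df["city"].mode()[0])
# ...or an explicit new 'UNKNOWN' class instead.
df["city"] = df["city"].fillna("UNKNOWN")

# 4. Alternatively, drop rows that still have missing values, provided
#    enough data remains for training afterwards.
df = df.dropna()

print(df)
```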

Normalization :

  • If all of your features are on a similar scale, it generally takes fewer iterations to train your model. While some ML algorithms are not affected by this, it’s a good idea to keep your features in similar ranges, so whenever you see that one of your features is an outlier in terms of its values, think about normalizing it. They don’t all have to lie in exactly the same range, but it’s good to have them in nearby ranges. There are various ways to do normalization, and depending on your data and domain you can choose any of them (a small sketch follows this list).
  • Generally this is referred to as feature scaling [1].
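A minimal sketch of two common feature-scaling approaches, min-max scaling and z-score standardization; the ‘income’ column and its values are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"income": [25_000, 48_000, 31_000, 120_000]})

# Min-max scaling: squashes values into the [0, 1] range.
df["income_minmax"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)

# Z-score standardization: zero mean, unit variance.
df["income_zscore"] = (df["income"] - df["income"].mean()) / df["income"].std()

print(df)
```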

Categorize data:

  • Sometimes it helps to convert your numeric data into categorical data; it gives you better intuition and makes the model more relevant. One example of such a conversion is turning age into an age group, so all students aged 5–8 could be in one category and those aged 8–11 in another (see the sketch below).
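A minimal sketch of such binning with pandas.cut; the bin edges simply mirror the 5–8 / 8–11 example above and the ages are made up.

```python
import pandas as pd

ages = pd.Series([5, 6, 8, 9, 11, 10, 7])

age_group = pd.cut(
    ages,
    bins=[5, 8, 11],        # edges of the two age groups
    labels=["5-8", "8-11"],
    include_lowest=True,    # keep age 5 in the first bucket
)
print(age_group)
```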

Labeling data:

When he says ‘Nasty’, you better be prepared for it.
  • For any supervised machine learning algorithm you are going to need a hell of a lot of labeled training data. For example, if you are writing a model to identify cats in images, then you need thousands of images that already carry a cat/non-cat label.
  • There are lots of ways to generate labeled data; depending on your problem, you could go for some algorithmic way of data generation. For example, when generating labeled data to learn query intent for search on an e-commerce site, pairing the clicked product’s attributes with the query tokens is one way to generate such labeled data (see the sketch after this list).
  • If there is no easy way to generate labeled data algorithmically, you could create a task on Amazon Mechanical Turk or CrowdFlower to get the data tagged by humans. This is often the most expensive part of any machine learning project.
  • Instead of creating a task on something like MTurk, you could get the data tagged by your own team, using one of the open-source tools that fits your problem statement. If none does, you could create a very simple web interface. We offer a tool that makes this really easy for you; it’s called DataTurks, check out more details here.
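For the click-log idea mentioned above, here is a hedged sketch of how such labeled pairs might be assembled; the log schema (query, clicked_product_category) is hypothetical, not a real dataset or API.

```python
import pandas as pd

# Hypothetical search click log from an e-commerce site.
click_log = pd.DataFrame({
    "query": ["red running shoes", "iphone case", "running shoes men"],
    "clicked_product_category": ["footwear", "phone_accessories", "footwear"],
})

# Each (query, clicked category) pair becomes one labeled example for a
# query-intent classifier, with the clicked category as the label.
labeled = click_log.rename(columns={"clicked_product_category": "intent_label"})
print(labeled)
```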

Happy Cleaning and Labeling!

