A lesson in Heuristics for Data Resampling

Arunabh
Aug 28, 2017 · 2 min read

Background:

Heuristics are methods or techniques used to quickly approximate a solution or build a model rather than compute an exact solution. This speed comes at the cost of some accuracy. Even so, heuristics are a good tool for exploring your data. When you do not have enough data, heuristic methods can be used to extrapolate from your existing data and create enough data points to reveal a useful trend or classification. We will look at one such example below.

The Problem:

Let’s assume we have a limited set of data points from a Vehicle Detection System. The total data available is shown below:

So, we have the above 17 readings, which I compiled into a DataFrame. The problem with this dataset is that it cannot provide any meaningful trend. There are so few rows that no model can be trained reliably on this data. On top of that, we have missing values (the missing values are timestamps at which no vehicle was passing through the Vehicle Detection System).

The way to go about understanding this data is to create new data from it.

The Approach:

Complete code can be found at this link

Summarize the data attributes:

First, always look at the data using the describe function, as shown below:

This gives us the upper and lower limits that our new data should stay within. It also gives the mean and standard deviation that the new data points should match so that they fall on the same distribution curve.
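Since the original table and output are not reproduced here, the following is only a minimal sketch of this step. The column names and values are made up for illustration and are not the readings from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: column names and values are illustrative only.
# NaN marks a timestamp at which no vehicle passed the detector.
df = pd.DataFrame({
    "vehicle_id": [1, 2, 3, 4, 5, 6],
    "reading": [42.0, 38.5, np.nan, 45.2, 40.1, np.nan],
})

# describe() ignores NaN values when computing count, mean, std, min and max.
summary = df["reading"].describe()
print(summary)

# Limits and moments that any synthetic reading should respect.
lower, upper = summary["min"], summary["max"]
mean, std = summary["mean"], summary["std"]
```

Note that the count reported by describe excludes the missing values, so it can be smaller than the number of rows.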

Creation of Synthetic Data:

As we can observe, 15 data points are not sufficient to fit a model correctly. We can increase the number of data points by resampling.

The function repetition simulates the probability with which a vehicle is repeated. The function reading uses the mean and standard deviation to generate a random reading from within a range.
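The original implementations of these two functions are not reproduced here, so what follows is one possible sketch, not the author's code. The parameter names, the default repeat probability of 0.3, the normal distribution, and the clipping to the observed range are all assumptions.

```python
import numpy as np

# Seeded generator so the sketch is reproducible.
rng = np.random.default_rng(seed=0)

def repetition(p_repeat=0.3):
    """Return True with probability p_repeat (assumed value), meaning
    the previous vehicle should be repeated in the next synthetic row."""
    return bool(rng.random() < p_repeat)

def reading(mean, std, lower, upper):
    """Draw a reading from a normal distribution with the given mean and
    standard deviation, then clip it to the observed [lower, upper] range."""
    value = rng.normal(loc=mean, scale=std)
    return float(np.clip(value, lower, upper))

# Example call with made-up statistics.
print(reading(mean=41.45, std=2.9, lower=38.5, upper=45.2))
```

Clipping keeps every value inside the observed limits, but it piles a little extra probability on the two boundary values. Redrawing until the value falls inside the range would be an alternative if that matters for your data.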

Finally, we can use for loops to create any number of rows of synthetic data, sampled randomly from the original data while still following the characteristics summarized above (range, mean and standard deviation).
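A rough sketch of such a loop is given below. It is self-contained, so it repeats the sampling logic inline, and the data, column names, helper name synthesize and repeat probability are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical original data; values are illustrative only.
original = pd.DataFrame({
    "vehicle_id": [101, 102, 103, 104, 105],
    "reading": [42.0, 38.5, np.nan, 45.2, 40.1],
})

stats = original["reading"].describe()       # NaN values are skipped
vehicle_ids = original["vehicle_id"].to_numpy()

def synthesize(n_rows, p_repeat=0.3):
    """Build n_rows synthetic rows. With probability p_repeat (assumed)
    the previous vehicle is kept, otherwise a vehicle is drawn at random
    from the original data."""
    rows = []
    current = int(rng.choice(vehicle_ids))
    for _ in range(n_rows):
        if rng.random() >= p_repeat:
            current = int(rng.choice(vehicle_ids))
        # Normal draw around the observed mean, clipped to the observed range.
        value = rng.normal(stats["mean"], stats["std"])
        value = float(np.clip(value, stats["min"], stats["max"]))
        rows.append({"vehicle_id": current, "reading": value})
    return pd.DataFrame(rows)

synthetic = synthesize(1000)
print(synthetic["reading"].describe())
```

Comparing the describe output of the synthetic data with that of the original is a quick sanity check. Because of the clipping, the synthetic standard deviation will typically come out somewhat smaller than the original one.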

Result:

We started from 15 points, but by understanding the distribution of the data, we figured out how to create new data using resampling. This is useful for:

  1. Fitting a machine learning model when only a few points are available.
  2. Feature engineering that needs more data in your original columns (e.g. a moving average in a time series).
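
To illustrate the second point, here is a small sketch with made-up values showing why a moving average needs more rows than you might expect. The window size of 5 is an arbitrary choice.

```python
import pandas as pd

# Six hypothetical readings in time order.
readings = pd.Series([42.0, 38.5, 45.2, 40.1, 43.3, 39.8])

# A rolling mean needs a full window before it produces a value,
# so the first window - 1 entries are NaN.
window = 5
moving_avg = readings.rolling(window=window).mean()

usable = int(moving_avg.notna().sum())
print(usable)  # 2 usable values out of 6 rows
```

With only a handful of rows, most of the moving-average column is empty, which is where the extra synthetic rows help.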