ML Dataset Creator

Patrick Coffey
Apr 9 · 3 min read

We are creating better datasets from jagged time series data!

Building datasets for machine learning is a time-consuming process. I find that I spend about 70–80% of my time on a project preparing and cleaning the datasets. How I process them usually follows a similar path: I filter my data down to only the relevant samples and features, and then I clean it. The cleaning process itself varies with the data type, but a simple solution is to drop the dirty rows, provided enough data remains for an adequate sample size afterwards.

Once the data is cleaned, I start manipulating it into the appropriate form. For time-series data, this manipulation often includes some form of quantisation. Sometimes time-series data is nice and clean, with events happening at regular intervals, but this is often not the case when dealing with healthcare-related data. People don’t get sick or visit the doctor on a schedule, and this irregular distribution of samples is what I like to call “Jagged Timeseries” data. The jagged nature of this data means it has to be quantised. I use the word quantisation to describe the process of aggregating/averaging and imputing values to massage the data into an evenly spaced format. This might mean taking the average of each measure during a month and imputing data for the months with no data.
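As a rough illustration of the aggregation half of quantisation, here is a minimal sketch using pandas. The column names (patient, measure, time, value), the monthly bucket size and the function name quantise_monthly are all assumptions made for this example, not the library's API. Months with no data are deliberately left as NaN rather than imputed.

```python
import pandas as pd

# Hypothetical irregular measurements in long form; column names are illustrative.
events = pd.DataFrame({
    "patient": [1, 1, 1, 2],
    "measure": ["pulse", "pulse", "pulse", "pulse"],
    "time": pd.to_datetime(["2008-01-03", "2008-01-20", "2008-03-11", "2008-02-05"]),
    "value": [80.0, 90.0, 70.0, 65.0],
})

def quantise_monthly(df):
    """Average each patient/measure pair per calendar month; empty months stay NaN."""
    month = df["time"].dt.to_period("M")
    means = df.groupby(["patient", "measure", month])["value"].mean()
    # One column per month, one row per patient/measure pair.
    table = means.unstack("time")
    # Make sure every month in the observed range has a column, even if empty.
    all_months = pd.period_range(month.min(), month.max(), freq="M")
    return table.reindex(columns=all_months)

monthly = quantise_monthly(events)
print(monthly)
```

Patient 1 has two readings in January, which collapse into a single monthly average, and nothing in February, which stays empty until a later imputation step.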

It is relatively common for an ML model in the healthcare setting to need the data prepared so that the absolute times of events are shifted onto a shared, relative timeline. This would mean that the first time step represents, for every patient, the onset of a given disease or their first admission to hospital. I commonly refer to this action as rolling. The name comes from the numpy operation of rolling an axis, and to me, it describes the situation well.
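To make the idea of rolling concrete, here is a small sketch built on numpy's np.roll. It assumes a patients-by-timestep matrix in which NaN means "nothing recorded", and it treats each patient's first recorded value as the anchor event. Both the anchor choice and the function name roll_to_first_observation are assumptions for illustration only.

```python
import numpy as np

def roll_to_first_observation(matrix):
    """Shift each row left so its first non-NaN value lands in column 0."""
    rolled = np.full_like(matrix, np.nan, dtype=float)
    for i, row in enumerate(matrix):
        observed = np.flatnonzero(~np.isnan(row))
        if observed.size == 0:
            continue  # nothing recorded for this patient; leave the row empty
        # np.roll wraps around, so the leading NaNs move to the end of the row.
        rolled[i] = np.roll(row, -observed[0])
    return rolled

# Rows are patients, columns are calendar time steps (for example, months).
calendar = np.array([
    [np.nan, np.nan, 5.0, 6.0],
    [1.0, np.nan, 2.0, np.nan],
    [np.nan, 7.0, np.nan, np.nan],
])
relative = roll_to_first_observation(calendar)
print(relative)
```

After rolling, column 0 holds every patient's first observation, regardless of when it happened on the calendar. Wrapping is only safe here because everything that wraps around is NaN; anchoring on a different event would need the wrapped values to be blanked out.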

One important thing to note is that I often leave the imputation of missing data until the final step, once all the cleaning, quantisation (excluding imputation) and rolling have been done. This means imputation can then happen in a more intuitive way for things like disease progression. After rolling, the average for the entire population at a given timestep encodes a little bit of helpful information, which is not the case before rolling occurs.
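One simple way to impute after rolling is to fill each gap with the population mean at that relative timestep. The sketch below assumes a rolled patients-by-timestep matrix with NaN for missing values; the function name is hypothetical, and other imputation strategies would slot into the same place.

```python
import numpy as np

def impute_with_timestep_mean(rolled):
    """Replace each NaN with the mean of its column (its relative timestep)."""
    # nanmean ignores missing values; a column that is entirely NaN stays NaN.
    timestep_means = np.nanmean(rolled, axis=0)
    return np.where(np.isnan(rolled), timestep_means, rolled)

# Rows are patients, columns are timesteps relative to the anchor event.
rolled = np.array([
    [5.0, 6.0, np.nan],
    [1.0, np.nan, 2.0],
    [3.0, 4.0, np.nan],
])
imputed = impute_with_timestep_mean(rolled)
print(imputed)
```

Because the columns are now relative to the same event for everyone, the column mean describes how the population typically looks that long after the event, which is what makes it a sensible fill value.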

After doing this many times, I realised these standard steps were a perfect candidate for a simple library that could handle them for me across many projects. In addition to simplifying the building process, I realised that the library could handle versioning datasets. Doing this in a scriptable and configurable way could be very useful for concepts like autoML, allowing the parameters for the dataset’s construction to be part of the hyperparameter optimisation space!

How can I write a library to do something like this and expect it to be usable against other datasets? It is simple: the EADV format. EADV stands for Entity, Attribute, Datetime, Value. An entity represents a unique instance of something (for instance, a person), which can have multiple attributes (such as a pulse rate), and each attribute can have multiple datetime stamps and values (their pulse was 80bpm on 01/01/2008 at 14:21:04). Using this format, you can express very complex data as a single four-column table. You can convert any time-series dataset to this format, and because of that, this library can work its magic!
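For a sense of what the conversion to EADV can look like, here is a minimal sketch that reshapes a wide table into the four-column form with pandas. The input table, its column names and the helper to_eadv are made up for this example and are not part of the library.

```python
import pandas as pd

# Hypothetical wide table: one row per visit, one column per measurement.
wide = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "recorded_at": pd.to_datetime(
        ["2008-01-01 14:21:04", "2008-02-10 09:00:00", "2008-01-15 11:30:00"]
    ),
    "pulse": [80.0, 76.0, None],
    "weight": [None, 71.5, 90.2],
})

def to_eadv(df, entity_col, datetime_col):
    """Reshape a wide table into Entity, Attribute, Datetime, Value rows."""
    long = df.melt(
        id_vars=[entity_col, datetime_col],
        var_name="attribute",
        value_name="value",
    )
    # A measurement that was not taken is simply absent in EADV.
    long = long.dropna(subset=["value"])
    long = long.rename(columns={entity_col: "entity", datetime_col: "datetime"})
    return long[["entity", "attribute", "datetime", "value"]].reset_index(drop=True)

eadv = to_eadv(wide, "patient_id", "recorded_at")
print(eadv)
```

Each measurement becomes its own row, so adding a new kind of measurement adds rows rather than columns, and the table keeps the same four-column shape.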

Before we can produce the dataset, some initial cleaning needs to take place so that the later code can make assumptions about the dataset. This includes renaming all features to their lowercase counterparts and fixing dates if you have data with two-digit years.
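The two cleaning steps mentioned above might look something like the following sketch. It assumes an EADV table whose dates are day/month/two-digit-year strings, and it assumes that any parsed date later than a chosen cutoff really belongs to the previous century. The date format, the cutoff and the function name clean_eadv are all assumptions for illustration.

```python
import pandas as pd

def clean_eadv(df, latest_valid="2021-12-31"):
    """Lowercase attribute names and repair two-digit years parsed into the future."""
    out = df.copy()
    out["attribute"] = out["attribute"].str.lower()
    # %y maps 00-68 to 2000-2068, so old records can land in the future.
    parsed = pd.to_datetime(out["datetime"], format="%d/%m/%y")
    in_future = parsed > pd.Timestamp(latest_valid)
    # Assumption: a date after the cutoff belongs to the previous century.
    out["datetime"] = parsed.where(~in_future, parsed - pd.DateOffset(years=100))
    return out

# Hypothetical raw EADV rows with mixed-case names and two-digit years.
raw = pd.DataFrame({
    "entity": [1, 2],
    "attribute": ["Pulse", "WEIGHT"],
    "datetime": ["01/01/08", "15/06/45"],
    "value": [80.0, 71.5],
})
cleaned = clean_eadv(raw)
print(cleaned)
```

The cutoff matters: a date of birth written as 45 should become 1945, while an admission written as 08 should stay 2008, so the cutoff should sit just after the newest record you expect to see.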

Once we have clean input data, we can produce the actual ML dataset. This is done using the following function, which filters, quantises, rolls and imputes!

Here is an example config file written in JSON that describes some actions needed for a real dataset I needed to build a regressor for at work.

Nerd For Tech

From Confusion to Clarification

NFT is an Educational Media House. Our mission is to bring the invaluable knowledge and experiences of experts from all over the world to the novice. To stay up to date on other topics, follow us on LinkedIn.

Written by Patrick Coffey

Patrick Coffey is currently a Data Architect and an avid practitioner of software development and data analysis/visualization.
