It’s Only a Munger of Time

Wranglin’ the Night Away

Nicholas Teague
Automunge
Aug 31, 2018


In my last post I drew up some functions for wrangling structured datasets. A natural extension is a function that evaluates each column of a dataframe to detect date or time series data, so that an appropriate processing algorithm can be applied. In this notebook we’ll create that function and update the automunge(.) function from the previous post to handle this new category of data. This leaves us with the ability to automatically identify and process numerical, binary, categorical, and time series data from structured datasets. In the interest of brevity I won’t repost the processing functions introduced in the prior notebook, although they are included in the companion Colaboratory notebook available [here].

1) Import data pre-processing functions from the last notebook

(not shown for brevity)

process_numerical_class(.)
process_binary_class(.)
process_text_class(.)

2) Define process_time_class(.) function

Here we define our function to process a date or time series column once it has been identified. Note that the approach is to split the data into separate fields for year, month, day, hour, minute, and second. My expectation is that this could prove beneficial for cases with distinct cyclical features on different time scales (such as day of week, season of year, business hours, etc.). A reasonable extension of this method could also add a column expressing each entry on a single scale of measurement, such as total days, hours, or minutes elapsed across the whole series.
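As a rough illustration of the splitting step described above, here is a minimal sketch using pandas. It is not the notebook's actual process_time_class(.): the function name split_time_column, the output column naming scheme, and the sample values are all assumptions made for this example, and it leaves out any scaling or missing-value handling the real function may apply.

```python
import pandas as pd

def split_time_column(df, column):
    """Hypothetical helper: replace `column` with six component columns."""
    out = df.copy()
    # errors="coerce" turns unparseable entries into NaT instead of raising
    parsed = pd.to_datetime(out[column], errors="coerce")
    for part in ("year", "month", "day", "hour", "minute", "second"):
        # e.g. a column named "stamp" yields "stamp_year", "stamp_month", ...
        out[f"{column}_{part}"] = getattr(parsed.dt, part)
    # drop the original column now that its components are stored separately
    return out.drop(columns=[column])

# invented sample values, for illustration only
sample = pd.DataFrame({"stamp": ["2018-08-31 14:05:09", "2018-12-25 00:00:00"]})
result = split_time_column(sample, "stamp")
```

Keeping each component in its own column is what lets a downstream model pick up on, say, an hour-of-day pattern independently of the year.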

3) Define evalcategory(.) and automunge(.) functions

Here we update the evalcategory(.) and automunge(.) functions introduced in our last notebook to handle date or time series data.
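One way the date detection inside a function like evalcategory(.) could work is sketched below. This is an assumption about the general technique rather than the notebook's code: the name looks_like_time_series and the 0.9 threshold are hypothetical, and the idea is simply to try parsing the non-null entries as dates and see what fraction succeeds.

```python
import warnings

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

def looks_like_time_series(series, threshold=0.9):
    """Hypothetical check: does most of this column parse as dates?"""
    if is_datetime64_any_dtype(series):
        return True
    # plain numbers would otherwise be read as epoch offsets, so rule them out
    if is_numeric_dtype(series):
        return False
    values = series.dropna().astype(str)
    if values.empty:
        return False
    with warnings.catch_warnings():
        # silence pandas' format-inference warnings on non-date strings
        warnings.simplefilter("ignore")
        parsed = pd.to_datetime(values, errors="coerce")
    # fraction of entries that parsed successfully
    return bool(parsed.notna().mean() >= threshold)

dates = pd.Series(["2018-01-01", "2018-02-15", "2018-03-30"])
words = pd.Series(["cat", "dog", "bird"])
numbers = pd.Series([1, 2, 3])
```

The threshold leaves room for a few malformed entries in an otherwise valid date column; a stricter or looser value may suit a given dataset better.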

4) Test Functions

Here we’ll create some sample Train and Test datasets for demonstration of our functions. Note that this is updated from our last post to now include time series data.
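For readers following along without the companion notebook, a toy dataset of the kind described could be put together as below. The column names and values are invented for illustration and are not the data used in this post; the only point is to have one column of each category (numerical, binary, categorical, time series) plus a label.

```python
import pandas as pd

# invented toy data: one column per category handled in this post
train = pd.DataFrame({
    "number": [1.0, 2.5, 3.2, 4.8],
    "binary": ["yes", "no", "yes", "no"],
    "category": ["cat", "dog", "bird", "cat"],
    "time": pd.date_range("2018-01-01 08:00:00", periods=4, freq="D"),
    "label": [0, 1, 0, 1],
})

# the test set has the same feature columns but no label column
test = train.drop(columns=["label"]).iloc[:2].copy()
```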

Our updated train data for testing the functions.

Now let’s apply our automunge(.) and see how we did.

train: output numpy array for train set
labels: output numpy array for labels set
validation: output numpy array for validation set
validationlabels: output numpy array for validation labels
test: output numpy array for test data

Great, well, I think I’ll chalk this one up as a success. Until next time.

Books that were referenced here or otherwise inspired this post:

Code Complete — Steve McConnell


(As an Amazon Associate I earn from qualifying purchases.)

Hi, I’m a blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share. I can also be reached on linkedin for professional inquiries or twitter for personal.
