Automated Data Ingestion Tools With Pandas: A Love Letter to a Job I Did Not Get

Louis Casanave · Published in Geek Culture · 10 min read · Feb 24, 2023

Valentine’s Day, 2023

(Photo by Alexander Sinn on Unsplash)

Last November, I graduated from Flatiron School’s Data Science Program and started the last and most important phase of my career change: getting my first job in data. And I fell hard in love with a job posting, even when trying not to become emotionally attached. C’est la vie. The job had so many features I’d been looking for:

  1. Mission Oriented
  2. Pandas and SQL
  3. Visualization
  4. Team Work Culture

The job had opportunities for creative solutions with R&D, which made me really excited. This job even had a gift for me, like a bouquet of flowers, hidden in the posting: the bulk of the job's responsibilities would be working on Python scripts to help automate the data-ingestion process. Wow! It felt like a gift to realize I could be using Pandas on a higher level.

Being a fan-boy of function-oriented code, I was quite excited to try building Exploratory Data Analysis scripts that were generalized enough to be used across projects. It made so much sense to try and make tools for cleaning data on a larger scale. Scripts offered so much flexibility and could be used together with pipelines in the future for an automated or semi-automated workflow. If the script had enough utility and intuitiveness, it could save time and headaches for my whole team. So, I applied enthusiastically, and started scripting and preparing for an interview I was hoping to land.

Unfortunately, that job was not for me to have. However, much like Sir Thomas Wyatt when he wrote love poems about Anne Boleyn, I have all this evidence of my heart's desire left to keep me company after my heart's desire has settled on someone else. And while my fledgling script is still being developed and is certainly no work of genius like Whoso List to Hunt, I Know Where Is an Hind, I've decided to share this script and what it can do with all of you. What we make while in love can be valuable, even when that love is not meant to be, as Wyatt evidenced.

Without (much) further emo poetry, let’s talk code. This module was designed to help automate getting to know your data through visualizations and tables. It is not intended to produce deliverables to a stakeholder. Rather, it was designed to help speed up the process of beginning to ask the data what it needs to produce a deliverable.

What Does This Module Do?

(Video by the author)

In short, the script takes the user through a series of prompts. The user supplies a CSV (comma-separated values) file with a categorical target (the dependent variable, or output column, of the data). The script then displays a number of useful visuals and returns a data frame for future use.

The script is designed to display:

  1. the target variable distribution, and the distribution of all other categorical features
  2. a pivot table with the target as the index and the mean of each numeric feature
  3. histograms of each numeric feature’s distribution
  4. bar plots of the mean of each numeric column on the x-axis, grouped by each target outcome

The idea here is to give the user a quick and dirty view under the hood of the data.

These visualizations and tables can help the user ask the following questions about their data:

  1. Is my target column balanced? If not, I'm going to start thinking about resampling vs. synthetic data vs. how I tune my model's hyperparameters.
  2. Do I have a categorical feature where there are many one-off values? If so, can I combine any of them?
  3. Are my numeric columns normally distributed (i.e., in a bell curve)? Do I need to normalize or transform any of my numeric features before modeling?

When the output of the deploy function deploy_categorical_target_EDA() is saved with df = deploy_categorical_target_EDA(), the user can easily keep working with the data in a Jupyter notebook once the function has run its course. This makes for a much smoother workflow than going through each of these steps by hand for every categorical-target dataset you encounter.
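
For instance, a typical notebook session with the module might look like the snippet below (a minimal sketch; only deploy_categorical_target_EDA() comes from the module itself, the rest is ordinary Pandas):

```python
# Run the interactive EDA flow: the function prompts for a CSV path, a target
# column, and the categorical/numeric columns, then shows the visuals.
df = deploy_categorical_target_EDA()

# The returned data frame is ready for further cleaning or modeling.
df.head()
```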

But how does this module work? How did I build it?

Making the Module:

I love functions. And this module showcases one of my favorite ways to write code: a bunch of tiny helper functions, organized by the larger functions that deploy them. So let's take a look at this module, mini-function by mini-function.

You'll notice I've instituted a try/except methodology in my functions. This is what allows my module to keep rolling even when the data doesn't have numeric features but does have categorical features, and vice versa. I've asked the code to display any error message as a dictionary, simply my preferred way to read that kind of output.

First things first: the module imports Pandas as pd, matplotlib.pyplot as plt, Seaborn as sns, and the built-in string module.

(Image by the author)
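
The screenshot isn't reproduced here, but the import block presumably looks something like this (the IPython display import is my own addition, assumed for the inline styled tables used later):

```python
import string                    # used to rebuild 'Title Case' labels from snake_case

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assumption: the styled value_counts tables render inline via IPython's display()
from IPython.display import display
```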

Now let’s take a look at the first function, input_csv():

(Image by the author)

This function:

  1. prompts the user for a CSV file, attempts to read it with Pandas, and displays the names of all the columns
  2. asks whether the columns are currently in 'Title Case'; if so, the user can enter "y" and the columns will be converted to 'snake_case' using some string manipulation
  3. displays df.info() to show the columns, how many non-null values each column has, each column's data type, how many rows there are, and the index range
  4. returns the df
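
Since the code itself lives in the screenshot, here is a minimal sketch of what input_csv() might look like, based only on the description above (the exact prompt wording and the snake_case conversion are assumptions):

```python
def input_csv():
    """Prompt for a CSV path, optionally snake_case the columns, and return the data frame."""
    try:
        path = input("Path to your CSV file: ")
        df = pd.read_csv(path)
        print(list(df.columns))

        # Offer to convert 'Title Case' column names into 'snake_case'
        if input("Are the columns currently in 'Title Case'? (y/n): ").lower() == "y":
            df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]

        df.info()  # columns, non-null counts, dtypes, row count, and index range
        return df
    except Exception as e:
        # Errors are surfaced as a dictionary, per the module's convention
        print({"error": repr(e)})
```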

The next function is select_target():

(Image by the author.)

This function:

  1. asks for input, prompting the user to select a column for the target
  2. splits the data into y (every row of the target column) and X (every row of the remaining feature columns), saving y as a Pandas series and X as its own Pandas data frame, while also saving the name of the target column as target_column
  3. displays the distribution of y using df[column_name].value_counts(), taking it a step further by styling it into an inline bar chart
  4. returns X, y, and target_column
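
A hedged sketch of select_target(), again reconstructed from the description rather than copied from the screenshot (display() assumes a notebook/IPython environment):

```python
def select_target(df):
    """Prompt for the target column, split X from y, and show the target distribution."""
    try:
        target_column = input("Which column is your target? ")
        y = df[target_column]                  # target as a Pandas series
        X = df.drop(columns=[target_column])   # remaining features as their own data frame

        # Inline, bar-styled distribution of the target
        display(y.value_counts().to_frame().style.bar())
        return X, y, target_column
    except Exception as e:
        print({"error": repr(e)})
```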

The next function is select_cat_cols():

This function requires X, which we have from our last function. It works by:

  1. prompting the user for the categorical columns, turning that input into a list called cat_cols, then slicing out those columns and saving them as cat_df
  2. displaying the distribution of each column in cat_df using .value_counts().to_frame().style.bar()
  3. returning cat_cols so that list can continue to be used
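
Sketched out, select_cat_cols() could look roughly like this (the comma-separated prompt format is an assumption):

```python
def select_cat_cols(X):
    """Prompt for the categorical columns and show each one's value distribution."""
    try:
        raw = input("List your categorical columns, separated by commas: ")
        cat_cols = [col.strip() for col in raw.split(",")]
        cat_df = X[cat_cols]

        # One bar-styled frequency table per categorical column
        for col in cat_df.columns:
            display(cat_df[col].value_counts().to_frame().style.bar())
        return cat_cols
    except Exception as e:
        print({"error": repr(e)})
```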

The next function is select_num_cols():

(Image by the author.)

This function also requires X, which we already have. It works by:

  1. prompting the user for the numeric columns and saving that input as a list called num_columns
  2. running pd.to_numeric() on each of those columns in X, so that any column that was read in as strings becomes numeric (when Pandas can't infer a number type it falls back to the object/string dtype, and this circumvents that)
  3. returning num_columns so we can continue to use that list in the future
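
A corresponding sketch of select_num_cols(), with the same caveat that the prompt format is assumed:

```python
def select_num_cols(X):
    """Prompt for the numeric columns and coerce them to numeric dtypes."""
    try:
        raw = input("List your numeric columns, separated by commas: ")
        num_columns = [col.strip() for col in raw.split(",")]

        # Columns read in as strings are converted so aggregation and plotting work
        for col in num_columns:
            X[col] = pd.to_numeric(X[col])
        return num_columns
    except Exception as e:
        print({"error": repr(e)})
```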

So far, nothing terribly interesting has happened. However, with our next series of functions, we start getting a little jazzier.

The function histogram_jam() looks like this:

This function needs X and num_columns, which we already know about. It also requires something we don't have yet: labels_dict. This has to do with the order in which the functions are deployed. Notice that the docstring says this function is deployed by another function called output_viz(). Let's skip ahead and take a look at a few lines from output_viz(), shown below:

(Image by the author.)

These three lines say that basic_labels is a list containing each column name in X, title_labels simply reverts 'snake_case' back into 'Title Case', and labels_dict is therefore a dictionary of the form {<snake_case_label>: <Title Case Label>, ...}.
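
A minimal sketch of those three lines, assuming string.capwords() is the string manipulation that turns snake_case back into Title Case:

```python
# Inside output_viz(): build a snake_case -> 'Title Case' lookup for plot labels
basic_labels = list(X.columns)
title_labels = [string.capwords(label.replace("_", " ")) for label in basic_labels]
labels_dict = dict(zip(basic_labels, title_labels))
```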

Now, looking back at histogram_jam(), we can see that this function:

  1. for each column in X that is also in our list num_columns, makes a Seaborn histogram and saves it as distributions
  2. uses labels_dict and f-string formatting with distributions.set_title() and distributions.set_xlabel() to make sure our histograms take their labels from labels_dict
  3. displays each histogram
  4. returns nothing
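
Put together, histogram_jam() might look something like this sketch (sns.histplot() is my assumption for the Seaborn histogram call):

```python
def histogram_jam(X, num_columns, labels_dict):
    """Plot a labeled histogram for every numeric feature."""
    try:
        for col in X.columns:
            if col in num_columns:
                distributions = sns.histplot(data=X, x=col)
                distributions.set_title(f"Distribution of {labels_dict[col]}")
                distributions.set_xlabel(labels_dict[col])
                plt.show()
    except Exception as e:
        print({"error": repr(e)})
```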

The next function is pivot_table_mean(). It makes and displays a pivot table from our numeric columns, needing X, num_columns, y, and target_column, which we have saved from all our previous functions.

(Image by the author.)

This function:

  1. makes num_cols_df, a data frame made of each numeric column in the data set, with its final column being y, our target column
  2. it then uses the groupby() method with .mean() to aggregate each numeric column by the mean, making a row for each target outcome. The displayed result looks like this:
(Image by the author.)

3. it returns that pivot table, so it can be saved and used for the next function
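
Reconstructed from that description, pivot_table_mean() could be as small as this sketch:

```python
def pivot_table_mean(X, num_columns, y, target_column):
    """Show and return the mean of every numeric feature, grouped by target outcome."""
    try:
        num_cols_df = X[num_columns].copy()
        num_cols_df[target_column] = y          # target as the final column

        # One row per target outcome, one column per numeric feature
        pivot_table = num_cols_df.groupby(target_column).mean()
        display(pivot_table)
        return pivot_table
    except Exception as e:
        print({"error": repr(e)})
```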

The last mini-function in this set of functions is viz_numeric_vars_to_target() and it needs our new pivot_table, labels_dict and target_column:

(Image by the author.)

What this function does is:

  1. for each column in pivot_table, makes a bar plot with the index (our categorical target outcomes) on the x-axis and that column's values on the y-axis
  2. uses labels_dict with .set_title(), .set_ylabel(), and .set_xlabel() to make sure our bar plots have their labels in the form we want
  3. returns nothing
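
As a sketch, again with sns.barplot() assumed as the plotting call:

```python
def viz_numeric_vars_to_target(pivot_table, labels_dict, target_column):
    """Bar plot of each numeric feature's mean for every target outcome."""
    try:
        for col in pivot_table.columns:
            ax = sns.barplot(x=pivot_table.index, y=pivot_table[col])
            ax.set_title(f"Mean {labels_dict[col]} by {target_column}")
            ax.set_xlabel(target_column)
            ax.set_ylabel(f"Mean {labels_dict[col]}")
            plt.show()
    except Exception as e:
        print({"error": repr(e)})
```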

Lastly, we have our deploy functions, which act as wrappers to organize the order in which our mini-functions pop off. Here's the first deploy function, output_viz(): it contains the execution of histogram_jam(), pivot_table_mean(), and viz_numeric_vars_to_target(). It requires X, num_columns, y, and target_column, which we have from our mini-functions, provided we've saved their returns as variables.

(Image by the author.)

This function operates by:

  1. makes our labels_dict in the way we've already explored, ready for use by the other functions contained within this one
  2. deploys histogram_jam(), passing in the necessary variables
  3. deploys pivot_table_mean() with the necessary variables and saves the resulting pivot table as pivot_table
  4. deploys viz_numeric_vars_to_target(), with the necessary variables including pivot_table as the variable we just made
  5. returns nothing
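
Stitching those steps together, output_viz() might read like this sketch (y is passed through so pivot_table_mean() can use it):

```python
def output_viz(X, num_columns, y, target_column):
    """Build the label lookup, then run the numeric visualizations in order."""
    try:
        basic_labels = list(X.columns)
        title_labels = [string.capwords(label.replace("_", " ")) for label in basic_labels]
        labels_dict = dict(zip(basic_labels, title_labels))

        histogram_jam(X, num_columns, labels_dict)
        pivot_table = pivot_table_mean(X, num_columns, y, target_column)
        viz_numeric_vars_to_target(pivot_table, labels_dict, target_column)
    except Exception as e:
        print({"error": repr(e)})
```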

And now we have our big deploy function, which ensures all the other functions execute in their proper order and have all the variables they require to be passed in.

(Image by the author.)

This function is the brain of the module. It tries, in order:

  1. saving the df as the return from input_csv()
  2. saving X, y, and target_column as the return from select_target() and uses our new df as the required variable for that function
  3. saving num_columns as the list of numeric columns, from the return of select_num_cols(), using X as the required variable for that function
  4. saving cat_cols as the list of categorical features returned from select_cat_cols(), using X as its required variable for that function
  5. deploys our middle-management function output_viz() with the required variables it needs
  6. returns our data frame to us, to be potentially saved for further use
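
In sketch form, the big deploy function is little more than the mini-functions called in order:

```python
def deploy_categorical_target_EDA():
    """Run the full prompt-driven EDA flow and return the data frame for further use."""
    try:
        df = input_csv()
        X, y, target_column = select_target(df)
        num_columns = select_num_cols(X)
        cat_cols = select_cat_cols(X)   # distributions are displayed; the list is kept for later use
        output_viz(X, num_columns, y, target_column)
        return df
    except Exception as e:
        print({"error": repr(e)})
```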

And there you have it!

Future Work:

This is but my beginning foray into the world of automation for data science and analysis. Any QA tester would be able to break this module easily. For example, if the target has 15+ different potential outcomes, the bar graphs become difficult to read. Increasing the complexity this module can handle is one obvious next step.

In the future, I’d also like to make similar modules for myself for when I’m working with time series data specifically because time series work is a favorite of mine.

I'd also like to keep building this up with functionality that can handle more cleaning tasks, like imputing missing values or looking for anomalies. I'm very motivated to keep working on automation (debugging is very fun for me), and I enjoy the idea of eventually making a customized graphical user interface for my own data projects.

I really must thank the company whose job posting I fell in love with, which prompted me to make this module. I’ve discovered a whole new way to think about the data process, even though I’ll not get the opportunity to use that new way of thinking with this company.

A Tiny Bit More Emo Poetry:

Here is my favorite poem by Sir Thomas Wyatt, about how he loved Anne Boleyn but, alas, could not pursue her, since she was being courted by King Henry the Eighth. Wyatt compares Boleyn to a beautiful doe whom he may not hunt, for all the deer in the king's woods at that time belonged to the king.

(Photo by Scapes by Jana on Unsplash)

While about a very different kind of falling in love, it being about falling in love with a person, not a new aspect of coding, it still helped me process my disappointment about this job that I did not get. Not having a "traditional background" for a career in data, I have sometimes felt that I was of them that farthest cometh behind in my search for a new opportunity. Whoso List to Hunt, I Know Where Is an Hind remains a truly beautiful and evocative poem, where we can feel the anguish of Sir Thomas Wyatt still to this day. And I must thank Sir Thomas Wyatt for helping me process my love for the job that was not for me to have, even though he would have no idea what a Python module would entail.

Whoso list to hunt, I know where is an hind,

But as for me, hélas, I may no more.

The vain travail hath wearied me so sore,

I am of them that farthest cometh behind.

Yet may I by no means my wearied mind

Draw from the deer, but as she fleeth afore

Fainting I follow. I leave off therefore,

Sithens in a net I seek to hold the wind.

Who list her hunt, I put him out of doubt,

As well as I may spend his time in vain.

And graven with diamonds in letters plain

There is written, her fair neck round about:

Noli me tangere, for Caesar’s I am,

And wild for to hold, though I seem tame.

-Sir Thomas Wyatt, Whoso List to Hunt, I Know Where is an Hind[1]

Happy Belated Valentine's Day, Data Lovers.

[1] T. Wyatt, Whoso List to Hunt, I Know Where is an Hind, (unknown), Poetry Foundation
