Pain in the Data

Aoun Lutfi
Published in AstroLabs
3 min read · Oct 5, 2017

I have been working on a data analytics project for around 3 weeks. The project aims to visualize a database of employees and allow querying it by skills, industry, and specialty. It is an interesting and challenging project: it sounds fairly simple, yet it is taking a surprising amount of time. This is not a bad thing, as I am taking the opportunity to verify a certain fact about data science.

The challenge lies in the data itself. I received the data as a CSV (comma-separated values) file, with each row being a record of Name, Email, ID, Skills List, Skills Scores List, Region, Industry, and Specialty. I was working in Python and developing a web app running on IBM Cloud, so I went with the Pandas library to handle the data. Just uploading the data to a database on the cloud was painful. Converting the CSV file to JSON (JavaScript Object Notation) was a challenge because each employee had one row for each region, skill, industry, or specialty. I essentially had to:

  1. Combine all rows for an employee into one row
  2. Clean the data types
  3. Convert to JSON
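
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the project: the sample rows, the column names (`Skill`, `Skill Score`, and so on), and the helper functions are all assumptions, since the actual file layout is not shown here. Types are cleaned before combining because that keeps the grouping simple in this sketch.

```python
import io
import json

import pandas as pd

# Hypothetical sample: one row per employee/skill/region combination.
# The real file's columns and values are not known; these are assumptions.
RAW_CSV = """ID,Name,Email,Skill,Skill Score,Region,Industry,Specialty
1, Ada ,ada@example.com,Python,4,EMEA,Finance,Analytics
1,Ada,ada@example.com,SQL,3,EMEA,Finance,Analytics
1,Ada,ada@example.com,Python,4,APAC,Finance,Analytics
2,Bob,bob@example.com,Java,5,EMEA,Retail,Backend
"""

TEXT_COLUMNS = ["Name", "Email", "Skill", "Region", "Industry", "Specialty"]


def clean_types(df):
    """Strip stray whitespace from text columns and make numeric columns numeric."""
    df = df.copy()
    for col in TEXT_COLUMNS:
        df[col] = df[col].str.strip()
    df["ID"] = pd.to_numeric(df["ID"])
    df["Skill Score"] = pd.to_numeric(df["Skill Score"])
    return df


def unique_in_order(values):
    """Return the distinct non-null values of a column, in first-seen order."""
    return list(dict.fromkeys(values.dropna()))


def combine_employee_rows(df):
    """Collapse all rows that share an ID into one JSON-ready dict per employee."""
    records = []
    for emp_id, group in df.groupby("ID", sort=True):
        # A skill can repeat across regions; keep each skill (and its score) once.
        skills = group.drop_duplicates("Skill")
        records.append({
            "id": int(emp_id),
            "name": group["Name"].iloc[0],
            "email": group["Email"].iloc[0],
            "skills": skills["Skill"].tolist(),
            "skill_scores": [int(score) for score in skills["Skill Score"]],
            "regions": unique_in_order(group["Region"]),
            "industries": unique_in_order(group["Industry"]),
            "specialties": unique_in_order(group["Specialty"]),
        })
    return records


# Read everything as text first, so type cleaning is explicit.
raw = pd.read_csv(io.StringIO(RAW_CSV), dtype=str)
records = combine_employee_rows(clean_types(raw))
as_json = json.dumps(records, indent=2)
print(as_json)
```

Each resulting dictionary is one JSON document per employee, which is the shape a document store generally expects. Real data would also need handling for missing or malformed scores, which this sketch leaves out.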

It took me a week just to clean the data types, and that was only the first step in the project: uploading the data to a Cloudant NoSQL database. One might ask why I used JSON and NoSQL when I could have used a table format and an SQL database. There are two main reasons: first, I am more comfortable working with NoSQL, and second, I was doing an experiment.

Then came the challenge of querying the data once I received a query identifying the requested combination of region, skills, industries, and specialty. Structuring the data correctly for a query took around 3 days; if it weren't for the Pandas library, it would have taken me maybe a week or two. Funnily enough, building the structure of the web app, the login, and the user interface took only around 2 or 3 days in total.
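
To give a feel for what such a query might look like in Pandas, here is a small sketch. The sample employees, the column names, and the `query_employees` helper are hypothetical, and the matching rules are an assumption as well: an employee must have every requested skill and at least one requested value for each of the other fields, and an empty filter means no constraint.

```python
import pandas as pd

# Hypothetical employees, one row each, with list-valued columns.
employees = pd.DataFrame([
    {"name": "Ada", "skills": ["Python", "SQL"], "regions": ["EMEA", "APAC"],
     "industries": ["Finance"], "specialties": ["Analytics"]},
    {"name": "Bob", "skills": ["Java"], "regions": ["EMEA"],
     "industries": ["Retail"], "specialties": ["Backend"]},
    {"name": "Cara", "skills": ["Python"], "regions": ["APAC"],
     "industries": ["Finance"], "specialties": ["Analytics"]},
])


def query_employees(df, skills=(), regions=(), industries=(), specialties=()):
    """Filter employees by the requested combination of fields.

    Assumed semantics: ALL requested skills must be present, and for every
    other field at least ONE requested value must be present.
    """
    mask = pd.Series(True, index=df.index)

    if skills:
        wanted_skills = set(skills)
        mask &= df["skills"].apply(lambda have: wanted_skills.issubset(have))

    other_filters = [
        ("regions", regions),
        ("industries", industries),
        ("specialties", specialties),
    ]
    for column, requested in other_filters:
        if requested:
            wanted = set(requested)
            # Bind `wanted` as a default argument so each lambda keeps its own set.
            mask &= df[column].apply(lambda have, w=wanted: not w.isdisjoint(have))

    return df[mask]


python_in_apac = query_employees(employees, skills=["Python"], regions=["APAC"])
print(python_in_apac["name"].tolist())
```

Building one boolean mask and narrowing it filter by filter keeps each condition independent, so adding another field later is a one-line change.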

This little experiment of mine shows a very important fact about data science and analytics:

80% of the time is spent cleaning the data

I spent around 10 days cleaning and preparing the data (about 70% of the total) and just 4 days querying it and building the web app. Luckily, I was doing everything in Python, which provides a great set of tools and libraries for data science. My choice of database was not the best for this application, but real-life situations are rarely so sweet: you almost always have to restructure, reformat, and reorganize the data.

Aoun Lutfi
AstroLabs

AI Solutions Engineer, Avid Researcher and Developer — Using AI to power the world💡🤖