Pandas for beginners: How to handle real-life data

Handling real life datasets can be painful when you are used to cleaned and ‘ready-to-use’ datasets that are used in books, tutorials and beginner challenges in Data Science. This tutorial aims to provide some useful tips and codes to get started with the Pandas library and data provided by your company or client.

1) Read your csv

If your data is in a csv file, the first step is to load it by using pd.read_csv(). Nonetheless, there are many other ways to load your data (excel files, SQL queries …)

If you encounter problems while reading the csv, these arguments can help you:

  • sep: often the separator will be the default one or ;, but it can be trickier like \t

You now have a Pandas data frame to play with!

Use df.shape,, df.head(), df.describe() to have a first glance of your data.

Image for post
Image for post
My data frame looks like this: A row for each newsletter of each user.

2) Get rid of your unlabelled rows

Concerning the labeling of your data, you may need to have to do some manual work at this step too. In my case, I have my labeled data in a second data frame like this:

Image for post
Image for post

Even if your case is different, a good technique is to have a list of your labeled rows id’s, labeled_ids in my code ( labeled_ids = target_df.id_user.tolist() ) and remove the unlabelled rows from your data frame with this simple line of code:

Image for post
Image for post

3) Basic Feature Engineering

In my data frame, neither the ‘read_mail’ nor the ‘unread_mail’ columns are very meaningful… Here comes feature engineering with Pandas!

I create a new feature (=column) named ‘percent_read’, which is the percentage of ‘read_mail’ on the total number of mails received:

Image for post
Image for post

4) Reshape Your Data Frame

In real life, unlike the most famous datasets such as the Irises flowers, Titanic passengers, or Boston house prices, yours will need a little more work before you can start to dive into analysis and model training. Likely, you will have to reshape your data frame.

My data frame, as for now, can not be used to train a model because I have several rows for each user. Furthermore, I cannot perform analysis about which newsletters are the most useful because I have a row for each newsletters of every users.

I need to reshape it using a pivot table to have each newsletter as a column and only one row per user:

Image for post
Image for post

It is pretty simple, I just chose what data I want in my new data frame (the values argument, here I chose the previously created feature ‘percent_read’) and the columns names (the columns argument). They both are columns of my previous data frame.

Be aware that pivot_table is very convenient but can be very CPU consuming if your resulting data frame has many dimensions.

5) Label your rows

The last step is add the ‘target’ column to our new data frame.

To do so, use the method, to map each target with its corresponding row.

Image for post
Image for post

The data frame is now ready to work with. We can now start to visualize and analyse it, and later train a model on it!

Have fun!

Written by


Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store