Pandas for beginners: How to handle real-life data

Handling real life datasets can be painful when you are used to cleaned and ‘ready-to-use’ datasets that are used in books, tutorials and beginner challenges in Data Science. This tutorial aims to provide some useful tips and codes to get started with the Pandas library and data provided by your company or client.

1) Read your csv

If your data is in a csv file, the first step is to load it by using pd.read_csv(). Nonetheless, there are many other ways to load your data (excel files, SQL queries …)

If you encounter problems while reading the csv, these arguments can help you:

  • sep: often the separator will be the default one or ;, but it can be trickier like \t
  • encoding: You may need to use encoding to read your file. Try different ones.
  • error_bad_lines: if despite all your efforts, you are still getting errors, this argument allows to skip the rows that generate errors. Set it to False to solve your bad data format problems (and your day, incidentally).

You now have a Pandas data frame to play with!

Use df.shape, df.info(), df.head(), df.describe() to have a first glance of your data.

My data frame looks like this: A row for each newsletter of each user.

2) Get rid of your unlabelled rows

Concerning the labeling of your data, you may need to have to do some manual work at this step too. In my case, I have my labeled data in a second data frame like this:

Even if your case is different, a good technique is to have a list of your labeled rows id’s, labeled_ids in my code ( labeled_ids = target_df.id_user.tolist() ) and remove the unlabelled rows from your data frame with this simple line of code:

3) Basic Feature Engineering

In my data frame, neither the ‘read_mail’ nor the ‘unread_mail’ columns are very meaningful… Here comes feature engineering with Pandas!

I create a new feature (=column) named ‘percent_read’, which is the percentage of ‘read_mail’ on the total number of mails received:

4) Reshape Your Data Frame

In real life, unlike the most famous datasets such as the Irises flowers, Titanic passengers, or Boston house prices, yours will need a little more work before you can start to dive into analysis and model training. Likely, you will have to reshape your data frame.

My data frame, as for now, can not be used to train a model because I have several rows for each user. Furthermore, I cannot perform analysis about which newsletters are the most useful because I have a row for each newsletters of every users.

I need to reshape it using a pivot table to have each newsletter as a column and only one row per user:

It is pretty simple, I just chose what data I want in my new data frame (the values argument, here I chose the previously created feature ‘percent_read’) and the columns names (the columns argument). They both are columns of my previous data frame.

Be aware that pivot_table is very convenient but can be very CPU consuming if your resulting data frame has many dimensions.

5) Label your rows

The last step is add the ‘target’ column to our new data frame.

To do so, use the pandas.Series.map method, to map each target with its corresponding row.

The data frame is now ready to work with. We can now start to visualize and analyse it, and later train a model on it!

Have fun!