Zero to Hero, With Udacity and Bertelsmann

Some of life’s greatest opportunities are found in moments of strong doubt, overwhelming fear, and impossible situations. I have found this to be true not just for me, but for a good number of the friends and colleagues I met at the Udacity/Bertelsmann Data Science Challenge Course, too.

There’s a mother of three who juggles giving her children the attention they need with finishing the challenge course. A dad who has stayed at home for seven years takes on the challenge as an opportunity to start a new career. Then there’s an inexperienced graduate who takes on the challenge as an opportunity to acquire the skills that open the door to a career in data science. That last one is me.

Prior to May 2 [the day this challenge started], my knowledge of data science was quite superficial and my knowledge of Python was poor. A month into the course, I had learned more than I was prepared for. This is because Udacity offers its students an amazing learning experience, which makes the students themselves one of its most valuable assets. They are always willing to learn, share, and solve problems, regardless of how daunting a problem may seem. This, of course, is a habit learned from how the study materials in Udacity’s classroom are structured.

The rest of this article will focus on some of the things I learned during this challenge course, in and outside of the classroom.

A little bit of Python

Let’s say I decided to keep a record of the friend requests I receive on Facebook for a period of 7 days. At the end of this weird exercise I have two different lists: names and ages.

names = ["Mike Gil", "Love Uche", "Femi Krane", "Sarah Jim", "Ahmed Crux", "Sarah Akor", "Openiyi Olu", "Kalu Isa", "Bekyy Buggy", "Itz Mhiz Gold"]
ages = [34, 24, 18, 43, 51, 33, 22, 18, 29, 31]

How to create a dictionary

This looks nice, but it will be more useful if each age sits next to the name it belongs to. To do this, we will create a dictionary.
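One simple way to do this is to pair the two lists with zip (the variable name requests is my choice):

```python
names = ["Mike Gil", "Love Uche", "Femi Krane", "Sarah Jim", "Ahmed Crux",
         "Sarah Akor", "Openiyi Olu", "Kalu Isa", "Bekyy Buggy", "Itz Mhiz Gold"]
ages = [34, 24, 18, 43, 51, 33, 22, 18, 29, 31]

# zip pairs each name with the age at the same position;
# dict turns those pairs into key-value entries
requests = dict(zip(names, ages))
print(requests)
```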

{'Mike Gil': 34, 'Love Uche': 24, 'Femi Krane': 18, 'Sarah Jim': 43, 'Ahmed Crux': 51, 'Sarah Akor': 33, 'Openiyi Olu': 22, 'Kalu Isa': 18, 'Bekyy Buggy': 29, 'Itz Mhiz Gold': 31}

The request dictionary gives the name and age of each person. This is way better than what we had earlier.

Each name in the dictionary is known as a key, and each age is known as its value. The values in a dictionary can be accessed using their keys. So, let’s answer the question, “How old is Femi Krane?”
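A sketch of that lookup, with the dictionary rebuilt inline so the snippet stands on its own:

```python
requests = {'Mike Gil': 34, 'Love Uche': 24, 'Femi Krane': 18, 'Sarah Jim': 43,
            'Ahmed Crux': 51, 'Sarah Akor': 33, 'Openiyi Olu': 22, 'Kalu Isa': 18,
            'Bekyy Buggy': 29, 'Itz Mhiz Gold': 31}

# Index the dictionary with a key to get its value
print(requests['Femi Krane'])  # prints 18
```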

18

That was easy. I used Femi Krane’s name to look up his age (18), and I think that is so cool.

How to update a dictionary

It is even easier to add new data to the dictionary. Let’s say I forgot to add one of the friend requests to the dictionary and I would like to do that now. Here is how to go about it.
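A sketch of that step — assigning to a key that doesn’t exist yet creates a new entry:

```python
requests = {'Mike Gil': 34, 'Love Uche': 24, 'Femi Krane': 18, 'Sarah Jim': 43,
            'Ahmed Crux': 51, 'Sarah Akor': 33, 'Openiyi Olu': 22, 'Kalu Isa': 18,
            'Bekyy Buggy': 29, 'Itz Mhiz Gold': 31}

# Assigning to a new key adds a new entry to the dictionary
requests['Monwe Uda'] = 45
print(requests)
```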

{'Mike Gil': 34, 'Love Uche': 24, 'Femi Krane': 18, 'Sarah Jim': 43, 'Ahmed Crux': 51, 'Sarah Akor': 33, 'Openiyi Olu': 22, 'Kalu Isa': 18, 'Bekyy Buggy': 29, 'Itz Mhiz Gold': 31, 'Monwe Uda': 45}

Hurrah! 🏆 The request dictionary is now complete.

How to delete an entry from a dictionary

The name 'Itz Mhiz Gold' is most likely an anonymous one. Not only will I decline this friend request, I also want the name out of my dictionary. ♻️
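A sketch of the deletion, starting from the full dictionary:

```python
requests = {'Mike Gil': 34, 'Love Uche': 24, 'Femi Krane': 18, 'Sarah Jim': 43,
            'Ahmed Crux': 51, 'Sarah Akor': 33, 'Openiyi Olu': 22, 'Kalu Isa': 18,
            'Bekyy Buggy': 29, 'Itz Mhiz Gold': 31, 'Monwe Uda': 45}

# del removes the key and its value; requests.pop('Itz Mhiz Gold') would work too
del requests['Itz Mhiz Gold']
print(requests)
```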

{'Mike Gil': 34, 'Love Uche': 24, 'Femi Krane': 18, 'Sarah Jim': 43, 'Ahmed Crux': 51, 'Sarah Akor': 33, 'Openiyi Olu': 22, 'Kalu Isa': 18, 'Bekyy Buggy': 29, 'Monwe Uda': 45}

'Itz Mhiz Gold' is gone for good, and now we have a reliable dictionary. 👏

Using pandas to read excel files

So far, I have used pandas to read .csv files as well as .xls and .xlsx Excel files. The first step is always to import pandas and give it an alias (pd). You can then go on to read your file.

Using a Jupyter Notebook saves you from the need to include the full file path, as long as you have navigated to the folder that holds the file you want to read.
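A minimal sketch of the pattern (the file name and contents here are placeholders I made up, so the snippet can run on its own):

```python
import pandas as pd

# Create a tiny .csv so the snippet stands on its own
with open('my_data.csv', 'w') as f:
    f.write('name,age\nMike Gil,34\nLove Uche,24\n')

df = pd.read_csv('my_data.csv')  # pd.read_excel('my_data.xlsx') works the same way
print(df)
```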

Cleaning and Reshaping Data with pandas

For this part, let’s clean and reshape this dataset (Cost of Transportation May 2018). It contains details on the cost of transportation in Nigeria over a period of 29 months. I have renamed the file to "trans_cost.xlsx". Let’s load the Excel file.
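A sketch of the loading step, assuming trans_cost.xlsx sits in the working directory (the variable name data matches the text below):

```python
import pandas as pd

# Load the whole workbook; each state's figures live on a separate sheet
data = pd.ExcelFile('trans_cost.xlsx')
```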

data.sheet_names prints the names of all the sheets in the file.

['ABIA', 'ABUJA', 'ADAMAWA', 'AKWA IBOM', 'ANAMBRA', 'BAUCHI', 'BAYELSA', 'BENUE', 'BORNO', 'CROSS RIVER', 'DELTA', 'EBONYI', 'EDO', 'EKITI', 'ENUGU', 'GOMBE', 'IMO', 'JIGAWA', 'KADUNA', 'KATSINA', 'KEBBI', 'KOGI', 'KWARA', 'KANO', 'LAGOS', 'NASSARAWA', 'NIGER', 'OGUN', 'ONDO', 'OSUN', 'OYO', 'PLATEAU', 'RIVERS', 'SOKOTO', 'TARABA', 'YOBE', 'ZAMFARA']

Let’s count how many sheets there are in this file.
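len does the counting (again, a sketch assuming the file is present):

```python
import pandas as pd

data = pd.ExcelFile('trans_cost.xlsx')
print(len(data.sheet_names))  # the number of sheets in the workbook
```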

37

This value is correct. There are 36 states in Nigeria, plus the Federal Capital Territory, Abuja. This implies that each sheet contains the data for one state.

Let’s take a look at the first sheet.

We can use either the name of the sheet or its index. The index of the first sheet is 0, but we will use the sheet name “ABIA”.

We will use skiprows to skip the first two rows, and .drop to exclude the rows and columns we do not need. Finally, we will round the values to whole numbers.
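A sketch of those steps, assuming the data object from above; exactly which rows and columns to drop depends on the sheet’s layout, so the .drop call here is only an example:

```python
import pandas as pd

data = pd.ExcelFile('trans_cost.xlsx')

# Read the ABIA sheet, skipping the first two rows
abia = data.parse('ABIA', skiprows=2)

# Drop what we do not need -- here, an example: the first (label) column
abia = abia.drop(abia.columns[0], axis=1)

# Round every value to a whole number
abia = abia.round(0)
```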

😘This looks good but there is more we can do. Let’s rename the columns and rows properly.

👯 Almost done. We need to repeat this same task for the 36 remaining states. To do this, we need to define a function and run it on each sheet.
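A sketch of that function (clean_sheet is my name for it, and the cleaning body is the simplified example from above):

```python
import pandas as pd

data = pd.ExcelFile('trans_cost.xlsx')

def clean_sheet(xl_file, sheet_name):
    # The same steps we applied to ABIA, bundled into one function
    df = xl_file.parse(sheet_name, skiprows=2)
    df = df.drop(df.columns[0], axis=1)
    return df.round(0)

# Run the function on every sheet and collect the results in a list
df_list = []
for name in data.sheet_names:
    df_list.append(clean_sheet(data, name))
```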

List Comprehension

List comprehension is the preferred way to create a list from an iterable.

Let’s say we want a list of the names that are no more than 10 characters long from the list names. We could do it this way…
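That is, with a plain for-loop (short_names is my name for the result):

```python
names = ["Mike Gil", "Love Uche", "Femi Krane", "Sarah Jim", "Ahmed Crux",
         "Sarah Akor", "Openiyi Olu", "Kalu Isa", "Bekyy Buggy", "Itz Mhiz Gold"]

# Keep only the names with 10 characters or fewer
short_names = []
for name in names:
    if len(name) <= 10:
        short_names.append(name)

print(short_names)
```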

['Mike Gil', 'Love Uche', 'Femi Krane', 'Sarah Jim', 'Ahmed Crux', 'Sarah Akor', 'Kalu Isa']

…but here is what the code looks like using list comprehension
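The same logic in one line — the condition filters, the expression builds the list:

```python
names = ["Mike Gil", "Love Uche", "Femi Krane", "Sarah Jim", "Ahmed Crux",
         "Sarah Akor", "Openiyi Olu", "Kalu Isa", "Bekyy Buggy", "Itz Mhiz Gold"]

# Four lines of loop collapsed into a single comprehension
short_names = [name for name in names if len(name) <= 10]
print(short_names)
```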

['Mike Gil', 'Love Uche', 'Femi Krane', 'Sarah Jim', 'Ahmed Crux', 'Sarah Akor', 'Kalu Isa']

Yeah! 😂 List comprehension is the queen, 😕and king, too. It’s simple, easy, and neat.

Let’s now go back to the transportation data and implement this beautiful idea.
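With a per-sheet cleaning function, the loop over sheets collapses into a single comprehension (a sketch, assuming the file is present; clean_sheet is my name for the function and its body is simplified):

```python
import pandas as pd

data = pd.ExcelFile('trans_cost.xlsx')

def clean_sheet(xl_file, sheet_name):
    df = xl_file.parse(sheet_name, skiprows=2)
    return df.round(0)

# The for-loop version, rewritten as a list comprehension
df_list = [clean_sheet(data, name) for name in data.sheet_names]
```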

Concatenation

Rather than having a list we can join everything together like one big happy family 👪 using concatenation. Here’s an example.
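For strings, concatenation is just the + operator; joining the two course hashtags gives the output below (the variable names are my choice):

```python
tag_one = '#UdacityDataScholar '
tag_two = '#PoweredByBertelsmann'

# + joins two strings end to end
print(tag_one + tag_two)
```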

#UdacityDataScholar #PoweredByBertelsmann

Another option is to use .extend, but it modifies the initial list. For our list of cleaned dataframes, df_list, we will use pandas’ .concat and save the result using .to_excel. You can use .to_csv if you want to save it as a csv file.
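A sketch of that step with two tiny stand-in dataframes (the real df_list holds the 37 cleaned state dataframes):

```python
import pandas as pd

# Two small stand-ins for the cleaned per-state dataframes
df_list = [pd.DataFrame({'cost': [100, 150]}),
           pd.DataFrame({'cost': [200, 250]})]

# Stack them into one dataframe; ignore_index renumbers the rows 0..n
df_concat = pd.concat(df_list, ignore_index=True)

df_concat.to_csv('trans_cost_cleaned.csv')
# df_concat.to_excel('trans_cost_cleaned.xlsx') saves it as an Excel file instead
```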

Let’s take a look at the first 10 rows of our cleaned data. We will use df_concat.head(10). Without the argument 10, .head() prints the first 5 rows.

💩 In its present structure, the data won’t help us that much. It has to be restructured so that the states are reflected and the years run down a single column. It will be nice to split this data into multiple dataframes using .loc.

.loc slices a dataframe using the name of the column or row. Another option is to use .iloc which uses the index of the column or row.
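A small illustration of the difference, on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({'2016': [100, 120], '2017': [110, 130]},
                  index=['ABIA', 'ABUJA'])

by_name = df.loc[:, '2016']   # select the 2016 column by its name
by_position = df.iloc[:, 0]   # the same column, selected by position

print(by_name.tolist())  # [100, 120]
```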

Now, let’s use .stack() to pivot each of these dataframes (a, b, c, d, e) and merge them together. That is, each of these dataframes will become a single column in our new dataframe.
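What .stack() does, shown on a toy version of one of those dataframes:

```python
import pandas as pd

a = pd.DataFrame({'2016': [100, 120], '2017': [110, 130]},
                 index=['ABIA', 'ABUJA'])

# .stack() pivots the columns into the index, giving one value per (state, year)
stacked = a.stack()
print(stacked)
```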

After a few touches, here is what our final table looks like.

The first 10 rows of our table look really nice. You will notice that I added new columns (region and price of pms). Now we can do whatever we want with our final table. 👯 💪

Data Visualization with Matplotlib

Matplotlib is a powerful tool for data visualization. You can plot line charts, bar graphs, and histograms. Let’s do a little bit of visualization. Plotting the cost of an Okada drop and the price of pms is a good place to start.
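A sketch of such a plot, with made-up sample numbers since only the shape of the code matters here (in a notebook you would drop the Agg backend line and use %matplotlib inline):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr']
okada = [152, 160, 158, 171]   # sample values, not the real dataset
pms = [145, 145, 150, 162]     # sample values, not the real dataset

plt.plot(months, okada, label='Okada per drop')
plt.plot(months, pms, label='price of pms')
plt.xlabel('Month')
plt.ylabel('Cost (Naira)')
plt.legend()
plt.savefig('transport_costs.png')
```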

Whew! 😫 That was long. I feel like writing about everything I learned, but there is so much to write about. So I will stop here, for now.

Months ago, who would have thought that I would be writing this article? Not even me. From an inexperienced graduate who had no clue whatsoever what he would do with his life, I have moved on to a person who is focused on being the best version of himself and changing the world in his own little capacity. This is a story of Zero to Hero, because Udacity and Bertelsmann dared to open this door of many possibilities. I will forever be grateful to both companies for this opportunity. What is left for me to do is to hope earnestly that I make it into the next round of this challenge.

#UdacityDataScholar #PoweredByBertelsmann

PS: The article Seven Clean Steps To Reshape Your Data With Pandas Or How I Use Python Where Excel Fails was really helpful when I started out. A big thanks to everyone who has contributed to Stack Overflow and Stack Exchange. My study group (Achievers Study Group) adapted this data and improved on it as a project.
