Lessons I Learnt From My First Proper Data Analysis Project

Sakina Fakhruddin
Women Data Greenhorns
7 min read · Nov 1, 2018
Image Credits: https://www.techemergence.com/wp-content/uploads/2018/07/data-mining-medical-records-with-machine-learning-5-current-applications-851x315.png

Ever since I received the Bertelsmann Data Science Scholarship 2018, I have set my mind to learning the best I can and finally snagging a Data Scientist job. That was three months ago, and today I finally finished, submitted, and had reviewed my first proper data analysis project as part of the Data Analyst Nanodegree I was awarded.

Number of Patients per illness

The Project Requirement

I had to analyze the data given to me as best I could. The no-show appointments dataset, taken from Kaggle, contains about 100K medical records of whether or not a patient missed their doctor's appointment.

The requirement was to do it in Python. I used numpy, pandas, and matplotlib in the analysis. They say it is better to learn by doing, and that is exactly what I experienced: it was in this project that I discovered the power of these libraries, along with some handy IPython widgets for creating an interactive data analysis report.

I learnt quite a bit from this project about data analysis and how to do it properly.

Following is a list of what I learned from abstract to specifics:

Asking the Right Questions

Chances are, two people with the same data set may ask completely different questions and as a result come up with completely different conclusions for their analysis. However, neither of them would be incorrect because both individuals asked different questions from their data set.

For the most part, analysts are playing with a sample of the population, i.e. they do not have all the data, only a subset of it. Thus, the type of questions one asks depends solely on one's knowledge and perspective. To counter this, it is recommended not to ask broad questions such as: what is my customer base like?

An analyst can approach this from various dimensions, such as customer demographics or customers' buying habits. Both are valid, but they answer different questions. Thus, it is important to know what your end goal is: know who you will pitch the results to and frame every question accordingly.

For my project, here are the questions I asked of the data I had and then tried to answer:

  • What is the age of patients who make it to their appointment? What is the age of those who do not?
  • How many patients who received the SMS showed up? How many did not show up even after receiving the SMS?
  • What are the most popular days of visiting the hospital?
  • Has the number of people visiting each year increased or decreased?
  • How many patients with a scholarship have visited the hospital?
  • Which neighbourhoods visit the doctor, and which do not?
  • What are the trends of Hypertension, Diabetes, Alcoholism and Handicap in the data?

I assumed nothing about the data; all my answers are formulated directly from the data set.
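To show the shape of this kind of question-answering, here is a minimal sketch in pandas. The column names (`Age`, `No-show`, where `"No"` means the patient did show up) follow the public Kaggle dataset, but the toy rows below are made up for illustration; the real file has around 100K records.

```python
import pandas as pd

# Toy sample mimicking the Kaggle no-show dataset (hypothetical rows;
# in "No-show", the value "No" means the patient DID show up)
df = pd.DataFrame({
    "Age": [25, 60, 33, 47, 8, 72],
    "No-show": ["No", "No", "Yes", "No", "Yes", "No"],
})

# Mean age of patients who showed up vs. those who missed the appointment
mean_age = df.groupby("No-show")["Age"].mean()
print(mean_age)
```

The same groupby pattern answers most of the questions in the list above, just with a different column in place of `Age`.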

Visualization Types To Use

Another thing I learnt is that there is more than one way to show the same data. A categorical data set can be shown with a bar chart, but also with scatter plots (more information can be found here). Comparisons can be done with line charts and also bar charts. Once again, the choice of the graph is highly subjective and usually based on the knowledge of the analyst.

My analysis contains bar charts and line charts, but in a couple of places I have simply shown my results and analyzed the numbers, as they give the information needed. A visual seemed redundant to me in those cases.

Plotted Comparison of SMS and Show Ups

This is an example of one of those cases: I analyze the proportion of the sample who showed up and who missed their appointments, given that they did or did not receive the SMS. Since I only have to show numbers here, making an actual chart did not appeal to me much. Instead, I focused on the numbers themselves and their impact on my analysis.
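A comparison like this reduces to a small table of proportions; one way to sketch it (with made-up rows, not the project's actual numbers) is `pd.crosstab` with `normalize="index"`:

```python
import pandas as pd

# Hypothetical mini-sample: SMS_received is 1 if the patient got the
# reminder; in "No-show", "No" means the patient showed up
df = pd.DataFrame({
    "SMS_received": [1, 1, 0, 0, 1, 0, 1, 0],
    "No-show":      ["No", "Yes", "No", "No", "No", "Yes", "No", "No"],
})

# Row-normalized crosstab: the proportion of show-ups within each SMS group
proportions = pd.crosstab(df["SMS_received"], df["No-show"], normalize="index")
print(proportions)
```

Each row sums to 1, so the two SMS groups are directly comparable even if their sizes differ.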

Data Scales

While plotting a population pyramid, I was comparing the number of individuals who showed up for their appointments and those who did not, based on where they lived. The resulting pyramid, based on the raw data, looked beautiful, but it was misleading.

Having proved previously in my analysis that the number of patients missing their appointments was far smaller than the number who showed up, the scale of the pyramid was warped: the range for those who showed up was 0–7000, but for those who did not it was 0–2000, and yet on the visual they looked like they had the same range.

The longest bars had the same lengths and the number of intervals were the same.

This was odd to me. If the number of patients who showed up for their appointments was higher than the number who did not, shouldn't that bar be longer? Then I realized my mistake: they were on different ranges. As a result, I turned the raw numbers into percentages.

The resulting visual shows the percentage of people who showed up and people who didn't, within the same 0–100 range. It was no longer misleading, as the bar sizes shared the same scale, as seen in the following chart:

Visualized Plotting shows the percentage of the people who showed and people who didn’t within 0–100

Now I can clearly see that there are distinct differences between patients who showed up and patients who missed their appointments, along with the extreme cases.
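The fix itself is a one-liner: divide each column by its own total. The neighbourhood labels and counts below are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Assumed raw counts per neighbourhood (hypothetical numbers)
counts = pd.DataFrame(
    {"showed_up": [7000, 3500, 500], "no_show": [2000, 1000, 100]},
    index=["A", "B", "C"],
)

# Convert each column to a percentage of its own total, so both sides
# of the pyramid share the same 0-100 scale
percentages = counts / counts.sum() * 100
print(percentages)
```

After this transformation, a bar of the same length on either side of the pyramid means the same share of its group, which is exactly what the raw-count version failed to convey.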

Pandas Group-By Aggregations

Pandas group-by aggregation is a very powerful tool if one can utilize it properly. It turns long code into one-liners, which is the Pythonic way of writing code.

First, pandas’ size() aggregation. I used size() time and again in my analysis because I found it more efficient than the count() function: size() simply counts the number of rows in each group (missing values included), whereas count() counts non-null values per column. As a result, for plain row counts, the size() aggregation is simpler and faster.

Code blocks for size function, reset_index function and unstack function used with groupby

Another new function I learned is reset_index(). It literally resets the index, turning the result back into a single-level dataframe, which makes it easier (and simpler) to perform further operations on it.

Finally, the unstack() function, which pivots a level of the row index into columns. The fill_value=0 argument fills any resulting empty cells with 0. To learn more (since stack and unstack are hard to understand), here's a YouTube tutorial that helped me:
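Putting the three together, a minimal sketch (with invented weekday/show-up rows) looks like this:

```python
import pandas as pd

# Hypothetical appointments: weekday vs. whether the patient showed up
df = pd.DataFrame({
    "weekday": ["Mon", "Mon", "Tue", "Tue", "Tue", "Wed"],
    "No-show": ["No", "Yes", "No", "No", "Yes", "No"],
})

# size() counts the rows in each group
counts = df.groupby(["weekday", "No-show"]).size()

# reset_index() flattens the grouped result back into a plain DataFrame
flat = counts.reset_index(name="n")

# unstack() pivots the "No-show" index level into columns;
# fill_value=0 fills groups that never occurred (e.g. Wed/Yes)
table = counts.unstack(fill_value=0)
print(table)
```

The unstacked table, with weekdays as rows and show-up status as columns, is exactly the shape matplotlib wants for a grouped bar chart.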

Pandas Stack and Unstack helpful tutorial.

Using Pandas Datetime

While this may be quite obvious to those who have been working with date times for a while, mastering them is one of the hardest things to do. Pandas, however, provides a handy-dandy accessor for this and turns everything into a one-liner, which is why I found myself using the datetime (dt) accessor extensively in my analysis.

Pandas dt class with the most used functions
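As a quick illustration of why `.dt` is a one-liner machine, here is a sketch with assumed timestamps (the real dataset stores `AppointmentDay` as strings that first need `pd.to_datetime`):

```python
import pandas as pd

# Assumed appointment dates, stored as strings like in the raw dataset
df = pd.DataFrame({"AppointmentDay": ["2016-04-29", "2016-05-02", "2016-05-03"]})
df["AppointmentDay"] = pd.to_datetime(df["AppointmentDay"])

# The .dt accessor turns date arithmetic into one-liners
df["year"] = df["AppointmentDay"].dt.year
df["weekday"] = df["AppointmentDay"].dt.day_name()
print(df)
```

With `weekday` extracted like this, the "most popular days of visiting the hospital" question reduces to a single value_counts() call.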

Using IPython Widgets

IPython widgets are built into Jupyter notebooks for interactive analysis, and I had no idea about the power these widgets possess until I started searching for a way to drill up and down in matplotlib.

Starting from code on Stack Overflow, which took me some time to understand, I was able to generate dynamic plots over time. Whoever thought of this was quite brilliant.
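The core of such a drill-down is just a function that slices the daily counts to a date range; in the notebook, wrapping that function with `ipywidgets.interact` attaches the date sliders. The data and function names below are hypothetical, not the project's actual code:

```python
import pandas as pd

# Hypothetical daily patient counts on a DatetimeIndex; in a notebook this
# function would be wrapped with ipywidgets for interactivity, e.g.:
#   from ipywidgets import interact
#   interact(lambda start, end: patients_in_range(start, end).plot(), ...)
daily = pd.Series(
    [5, 8, 3, 9, 4],
    index=pd.to_datetime(
        ["2016-02-10", "2016-03-01", "2016-04-09", "2016-05-20", "2016-06-08"]),
)

def patients_in_range(start, end):
    """Return the slice of daily counts between two dates (the drill-down)."""
    return daily.loc[start:end]

print(patients_in_range("2016-02-10", "2016-04-09"))
```

Each slider movement simply re-calls the function with new endpoints and re-plots the narrower slice, which is what produces the drill-down screenshots below.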

Here are screenshots of my dynamic plot in action:

The plot of Number of Patients per day from 2016–02–10 to 2016–06–08
Drilling down on the previous image. Plot from 2016–02–10 to 2016–04–09
Another drill down of the previous plot: from 2016–02–10 to 2016–03–10

And here’s a brilliant tutorial for ipython widgets.

Overall, these were the most important things I learnt in the course of doing this project. Of course, I used quite a few online resources to complete it, but in the process I managed to master some very useful tools that will help my journey forward.

The project is up on GitHub, so please have a look at it.

I’d love constructive feedback on it. Cheers!
