Adventures with Python: Storytelling with pandas and Matplotlib (ft. Seaborn)
Data scientists are magicians. You give them data, and they turn it into stories. Visuals included.
A crucial part of my journey to becoming a Pythonista for data science is meeting new friends called Python libraries. A library is a collection of pre-written code that can make your work a lot easier. There are libraries for general purposes and there are libraries for more specific needs. Each library is installed as a package and imported into the script so we can call its functions and use its contents in our program.
Preparing the Story Outline
pandas is a Python package that provides fast, flexible data structures for working with structured (tabular) and time-series data. Its two main data structures are the Series (one-dimensional) and the DataFrame (two-dimensional). pandas makes the initial analysis of our data quite efficient.
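To follow along, the dataset first has to be loaded into a DataFrame. Here is a minimal sketch (the file name insurance.csv is an assumption; point pd.read_csv at wherever your copy of the data lives):

import pandas as pd

# Load the insurance dataset into a DataFrame (the file name is assumed)
insurance_df = pd.read_csv("insurance.csv")

# Selecting a single column returns a one-dimensional Series
charges = insurance_df["charges"]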
We can start by inspecting how much data we have and which data types we are dealing with, using a handful of methods and attributes of the DataFrame object.
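For example (a quick sketch, assuming the insurance_df DataFrame from above):

insurance_df.head()    # first five rows, for a quick peek at the values
insurance_df.shape     # (number of rows, number of columns)
insurance_df.dtypes    # data type of each column
insurance_df.info()    # column names, non-null counts, and memory usage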
Next, we generate descriptive statistics for our data so we can check the averages, extreme values, and how spread out our data points are. This is easily accomplished by using describe().
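For instance, on our insurance DataFrame:

insurance_df.describe()               # count, mean, std, min, quartiles, max for numeric columns
insurance_df.describe(include="all")  # include the categorical columns as well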
You may choose to go deeper and apply a few more specific statistical tests on the data. The pandas website has thorough documentation that can help anyone get started.
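As one small example of going deeper (a sketch of my own, not something from the original walkthrough), pandas can compute correlation coefficients directly:

# Pearson correlation between age and charges; a value close to +1
# indicates a strong positive linear relationship
insurance_df["age"].corr(insurance_df["charges"])

# Pairwise correlations across all numeric columns at once
insurance_df.corr(numeric_only=True)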
Building the Story Plot
The plot is the most important part of a story. This is where we get to know and see the relationships among characters. This is where we see all the action. It pulls at our heartstrings. It makes us happy, sad, surprised, and the whole gamut of other emotions.
This is how we want to communicate with our audience. As data scientists, we don’t want to throw arrays and tables of numbers at people. We show them the trends and relationships through compelling visuals in the hopes that they will also have the insights that we gained from our data.
This is where Matplotlib and Seaborn enter the picture. These are beautiful data visualization packages that take your analysis to the next level. Let's take a look at some examples below and learn the stories they tell.
I used an insurance dataset for this exercise. It can be used to predict the premium for a customer based on several factors. Based on the plots, we can say that age and insurance charges have a positive correlation that we can further explore. This is a good starting point.
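The original plot is not reproduced here, but a scatter plot along these lines (a hedged sketch, not the exact code used for the article) shows the age-versus-charges relationship:

import matplotlib.pyplot as plt
import seaborn as sns

# Each point is one client; an upward-drifting cloud of points
# reflects the positive correlation between age and charges
sns.scatterplot(x="age", y="charges", data=insurance_df)
plt.show()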
Histograms give us quick insight into how our data points are distributed. Check out the image below. Notice which BMI range has the highest density among the insurance clients in our data?
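A minimal sketch of such a histogram (using seaborn's histplot, which requires seaborn 0.11 or newer; the original figure may have been produced with a different call):

# Histogram of BMI values; the tallest bars show where most clients fall
sns.histplot(data=insurance_df, x="bmi", bins=30, kde=True)
plt.show()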
Different kinds of graphs show different relationships between factors. Try experimenting with different visualizations to have a better understanding of your data.
The kernel density estimation (KDE) graph above was generated with the code below (reusing the insurance_df, sns, and plt names from earlier):
# jointplot figures are always square, so height alone controls the size
# (jointplot has no aspect argument)
g = sns.jointplot(x="bmi", y="charges", data=insurance_df, kind="kde", color="r", height=10)
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")  # overlay the raw points as white "+" markers
g.ax_joint.collections[0].set_alpha(0)                           # make the first contour collection transparent
g.set_axis_labels("$BMI$", "$Charges$")
The arguments may seem intimidating, but familiarity with the properties of these graphs and some exposure to these libraries will help you get the hang of creating awesome visualizations in Python. Tune in for my next Python adventures!
(This is the third article in my series. Check out the first and second.)