Analytics Vidhya

Analytics Vidhya is a community of Generative AI and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Adventures with Python: Storytelling with pandas and Matplotlib (ft. Seaborn)

Data scientists are magicians. You give them data, and they turn it into stories. Visuals included.

A crucial part of my journey to becoming a Pythonista for data science is meeting new friends called Python libraries. A library is a collection of powerful scripts that can make your work a lot easier. There are libraries for general purposes and there are libraries for more specific needs. Each library is downloaded as a package and is imported into the script so we can call it and use its contents in our program.

Preparing the Story Outline

The library called pandas is a Python package that provides data structures for structured and time-series data. It mainly works with Series (one-dimensional) and DataFrames (two-dimensional). pandas makes the initial analysis of our data quick and efficient.

A screenshot of a cell in Google Colab, showing the code for importing libraries.
We always import the needed libraries first so we can easily call them later.
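Since the screenshot is not reproduced here, the cell probably looked something like the sketch below. The aliases pd, plt, and sns are common conventions, and insurance_df is the name used later in this article. The file name in the comment is hypothetical, and the rows are made-up sample values standing in for the real file.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# With the real file you would load the data from disk.
# The file name below is hypothetical:
# insurance_df = pd.read_csv("insurance.csv")

# Made-up sample rows so this sketch runs on its own.
insurance_df = pd.DataFrame(
    {
        "age": [19, 28, 33, 45, 52, 61],
        "bmi": [27.9, 33.0, 22.7, 30.1, 36.4, 29.1],
        "charges": [1800.0, 4400.0, 5200.0, 8100.0, 11300.0, 14200.0],
    }
)

# Peek at the first few rows to confirm the data loaded as expected.
print(insurance_df.head())
```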

We can start by checking how much data we have and which data types we are dealing with, using a few methods and attributes of the df object.

A screenshot of a cell in Google Colab, showing the code for inspecting data.
Based on the output, the file we are working with has 1339 rows and 8 columns and contains strings, integers, and float values as data types.
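The exact calls in the screenshot are not shown here, so the following is only a sketch of the usual ways to get this information. The tiny DataFrame and its values are made up, and the "smoker" column name is a hypothetical stand-in for a text column.

```python
import pandas as pd

# Made-up sample rows; "smoker" is a hypothetical text column.
df = pd.DataFrame(
    {
        "age": [19, 28, 33, 45],
        "smoker": ["yes", "no", "no", "yes"],
        "bmi": [27.9, 33.0, 22.7, 30.1],
        "charges": [1800.0, 4400.0, 5200.0, 8100.0],
    }
)

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# dtypes lists the data type of each column. Text columns usually
# show up as "object" (or a dedicated string dtype in newer pandas).
print(df.dtypes)

# info() combines row count, column names, non-null counts, and dtypes.
df.info()
```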

Next, we generate descriptive statistics for our data so we can check the averages, extreme values, and how spread out our data points are. This is easily accomplished by using describe().

A screenshot of a cell in Google Colab, showing the code for inspecting data.
Showing descriptive statistics of our dataset.
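As a rough illustration of what describe() returns, here is a minimal sketch on made-up numbers (not the article's dataset). The result is itself a DataFrame, so individual statistics can be looked up by row label.

```python
import pandas as pd

# Made-up sample values, only to show the shape of the output.
df = pd.DataFrame(
    {
        "age": [19, 28, 33, 45],
        "charges": [1800.0, 4400.0, 5200.0, 8100.0],
    }
)

# describe() summarizes every numeric column:
# count, mean, std, min, quartiles (25%, 50%, 75%), and max.
summary = df.describe()
print(summary)

# Individual statistics can be picked out by row and column label.
print("Average age:", summary.loc["mean", "age"])
print("Highest charge:", summary.loc["max", "charges"])
```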

You may choose to go deeper and apply a few more specific statistical tests on the data. The pandas website has thorough documentation that can help anyone get started.

Building the Story Plot

The plot is the most important part of a story. This is where we get to know and see the relationships among characters. This is where we see all the action. It pulls at our heartstrings. It makes us happy, sad, surprised, and the whole gamut of other emotions.

This is how we want to communicate with our audience. As data scientists, we don’t want to throw arrays and tables of numbers at people. We show them the trends and relationships through compelling visuals in the hopes that they will also have the insights that we gained from our data.

This is where Matplotlib and Seaborn enter the picture. These are beautiful data visualization packages that take your analysis to the next level. Let’s take a look at some examples below and learn the stories they tell.

A screenshot of a cell in Google Colab, showing several scatterplots.
Scatterplots are great for exploring relationships between variables.

I used an insurance dataset for this exercise. It can be used to predict the premium for a customer based on several factors. Based on the plots, age and insurance charges appear to be positively correlated, which is something we can explore further. This is a good starting point.
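A scatterplot like the ones described can be sketched as follows. This is not necessarily the code behind the screenshot: the rows are made-up sample values, and the Pearson correlation at the end is my own addition as one way to put a number on the trend.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up sample rows standing in for the insurance data.
insurance_df = pd.DataFrame(
    {
        "age": [19, 28, 33, 45, 52, 61],
        "charges": [1800.0, 4400.0, 5200.0, 8100.0, 11300.0, 14200.0],
    }
)

# One point per client: age on the x-axis, charges on the y-axis.
fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(data=insurance_df, x="age", y="charges", ax=ax)
ax.set_title("Age vs. insurance charges (sample data)")

# Pearson correlation: values near +1 mean the two rise together.
r = insurance_df["age"].corr(insurance_df["charges"])
print(f"Pearson r = {r:.2f}")

# In a notebook the figure renders automatically;
# in a plain script, call plt.show() to display it.
```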

Histograms give us quick insight into how our data points are distributed. Check out the image below. Can you spot which BMI range has the highest density among the insurance clients in our data?

A screenshot of a cell in Google Colab, showing several scatterplots.
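To answer that kind of question in code, a histogram can be drawn with Matplotlib directly. The sketch below uses made-up BMI values and an arbitrary choice of four bins, so the tallest bar here says nothing about the real dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up BMI values standing in for the insurance data.
insurance_df = pd.DataFrame(
    {"bmi": [27.9, 33.0, 22.7, 30.1, 36.4, 29.1, 31.2, 28.4]}
)

# hist() returns the count per bin and the bin edges,
# which lets us find the most crowded BMI range.
fig, ax = plt.subplots(figsize=(6, 4))
counts, bin_edges, _ = ax.hist(insurance_df["bmi"], bins=4, edgecolor="black")
ax.set_xlabel("BMI")
ax.set_ylabel("Number of clients")

# Index of the tallest bar, then its left and right edges.
densest = counts.argmax()
low, high = bin_edges[densest], bin_edges[densest + 1]
print(f"Most clients fall between BMI {low:.1f} and {high:.1f}")
```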

Different kinds of graphs show different relationships between factors. Try experimenting with different visualizations to have a better understanding of your data.

A screenshot of a cell in Google Colab, showing correlation between two factors.

The kernel density estimation (kde) graph above was generated by using the code below:

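import matplotlib.pyplot as plt
import seaborn as sns

# insurance_df is the DataFrame loaded earlier
g = sns.jointplot(x="bmi", y="charges", data=insurance_df, kind="kde", color="r", height=10)
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
# Make the first collection drawn on the joint axes fully transparent
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("$BMI$", "$Charges$")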

The arguments may seem intimidating, but familiarity with the properties of these graphs and exposure to these libraries will help you get the hang of creating awesome visualizations in Python. Tune in for my next Python adventures!

(This is the third article in my series. Check out the first and second.)


Published in Analytics Vidhya


Written by Roch Derilo

A lover of data, tech, and hot choco. Supports anything open source and its power in the data value chain.
