An Introduction to Python Libraries for Data Visualization and Data Analysis

Alejandra Budar
Published in Fields Data
4 min read · Apr 1, 2021
Photo by Alfons Morales on Unsplash

Now that your organization has established its own data guidelines and you have familiarized yourself with Google Colaboratory, it is time to learn about the coding language we will be using to analyze data and build reports.

Python is a programming language that works well with text, as well as with categorical and numeric data. It also has many libraries, which consist of reusable chunks of code that you can use to achieve different results in your projects. These packages range from statistics to Natural Language Processing (NLP) and machine learning, which can be used for more in-depth analysis. Each library should be thought of as a product with its own purpose, like a broom, whose intended purpose is to sweep. Just as you would not use a broom to cook, you should not use a library for anything other than its intended function. However, if one package (or broom) does not achieve the intended result on its own, you can combine it with another tool (an extra broom or a leaf blower) to get there. For the purpose of our reporting, we will keep the usage high-level, although I will take a closer look at the libraries that are important to know for data visualization and analysis.
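To make the idea of a library concrete, here is a minimal sketch that borrows Python's built-in statistics module instead of writing the calculations by hand. The ratings list is made-up data used only for illustration.

```python
import statistics  # a library that ships with Python itself

# Hypothetical numeric data, e.g. survey ratings on a 1-5 scale.
ratings = [4, 5, 3, 5, 5]

# The library already contains the reusable code for these calculations.
average = statistics.mean(ratings)
middle = statistics.median(ratings)

print("Average rating:", average)
print("Median rating:", middle)
```

The same pattern (import the library, then call the function built for the job) applies to every package mentioned in this article.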

Data Visualization

The two main libraries that I use to create data visualizations for reports are Seaborn and Matplotlib. Both can be used to create a large variety of visualizations, including bar charts, stacked bar charts, pie charts, scatter plots, bubble charts and more. Seaborn is built specifically for data visualization. It allows you to customize the appearance of your report by selecting the color scheme and design of each chart. Additionally, if you are using categorical data, its count plot can calculate and draw the frequency of each category in as little as two lines of code. An example of the result is displayed below.

A Fields Data count plot on the sources of the data in its database.
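For a rough idea of what those two lines can look like, here is a minimal sketch. The records table and its "source" column are hypothetical stand-ins, not the actual Fields Data database, and the theme and title are just one possible styling choice.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data: one row per record, with the source it came from.
records = pd.DataFrame(
    {"source": ["survey", "website", "survey", "partner", "website", "survey"]}
)

# The two lines that do the work: pick a look, then draw the count plot.
sns.set_theme(style="whitegrid")
ax = sns.countplot(data=records, x="source")

# Optional: add a title to the chart.
ax.set_title("Records per source")
```

In Google Colaboratory the chart should appear directly under the cell. Seaborn counts how often each category occurs, so you do not have to compute the frequencies yourself.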

However, in order to adjust the tick marks, align the charts and include additional customizations, a deeper understanding of code is required. We will cover the coding aspect in more detail in the next segment of this series, but for now, simply familiarize yourself with the concept of libraries and their uses. If you need to refresh your memory on which different types of charts exist and when to use them, feel free to refer to the resources below. I am currently writing a separate piece on this topic for another organization, which I will link below once it is published.

The library I use second most is Matplotlib, which is also built for data visualization. It is important to familiarize yourself with this package, since you may need to combine Seaborn code with Matplotlib code to achieve the exact format you want. Below is an example in which I used Matplotlib to change the size of my Seaborn visualization.

The frequency of each province within the data (colored by country).
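One common way to combine the two libraries is to let Matplotlib create and size the figure, and then ask Seaborn to draw into it. The sketch below assumes a small made-up table with "country" and "province" columns. It is not the exact code behind the chart above.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data: one row per record with its country and province.
records = pd.DataFrame({
    "country": ["Canada", "Canada", "Canada", "Mexico", "Mexico"],
    "province": ["Ontario", "Quebec", "Ontario", "Jalisco", "Jalisco"],
})

# Matplotlib creates the figure and sets its size (width, height in inches).
fig, ax = plt.subplots(figsize=(12, 4))

# Seaborn draws into that Matplotlib axes; hue colors the bars by country.
sns.countplot(data=records, x="province", hue="country", ax=ax)

# Further Matplotlib tweaks: rotate the tick labels and tidy the layout.
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```

Because Seaborn is built on top of Matplotlib, the object it draws on is a regular Matplotlib axes, so any Matplotlib customization can be applied to it afterwards.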

However, I find Matplotlib more cumbersome and less visually appealing than Seaborn for data visualization. For our purposes, a more general understanding of this library will suffice. The resources linked below will help you accomplish that.

Data Analysis

Pandas is a powerful library for data manipulation and analysis. It can be used to load files, such as CSV, JSON or Excel files, and then perform calculations on the data. It provides various options to organize data based on the frequency of a variable, using methods such as groupby and value_counts. Again, the actual code for the library will be explained further in the next post. For now, the goal is to familiarize yourself with what Pandas is and what it is used for. Below are examples of analysis done at Fields Data.

Analysis of the top sectors (above a count of 5) sorted by country and province.
Analysis of the different types of organizations present (grouped by country).
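As a preview, here is a minimal sketch of value_counts and groupby on a small made-up table. The column names and the threshold are assumptions chosen for illustration. With a real file you would typically start from something like pd.read_csv("organizations.csv"), where the file name is hypothetical.

```python
import pandas as pd

# Hypothetical data standing in for a file loaded with pd.read_csv(...).
organizations = pd.DataFrame({
    "country": ["Canada", "Canada", "Canada", "Mexico", "Mexico", "Mexico"],
    "sector": ["Health", "Education", "Health", "Health", "Education", "Health"],
})

# value_counts: how often does each sector appear overall?
sector_counts = organizations["sector"].value_counts()

# Keep only the sectors above a chosen count (3 here, since the table is tiny).
top_sectors = sector_counts[sector_counts > 3]

# groupby + value_counts: the same frequency count, split by country.
by_country = organizations.groupby("country")["sector"].value_counts()

print(sector_counts)
print(top_sectors)
print(by_country)
```

The result of each step is itself a Pandas object, so it can be filtered, sorted or passed on to Seaborn for plotting.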

Conclusion

To avoid overwhelming or confusing a reader, it is important to find a balance between data analysis and visualizations. Both are equally important. As such, data visualization and data analysis will be the two main components of the data reports that we will build in the next article. Now that you have learned the whats and the hows of our metaphorical broom, you are ready to begin using it for its intended purpose.

Resources

  1. Seaborn: https://seaborn.pydata.org/
  2. Matplotlib: https://matplotlib.org/stable/users/index.html
  3. NumPy: https://numpy.org/doc/1.20/
