Data Visualization with Python

Data Management and Visualization — Week 4

This is the fourth in a series of posts I’m writing as part of Wesleyan University’s online Data Management and Visualization course.

Overview

This week we finally get to the visualization portion of our Data Management and Visualization series. Now that we’ve learned to load, summarize, sort, cut, and slice our data, let’s take a look at how we can visualize it.

Visualization is a powerful tool, both for presenting our findings to others and for our own analysis. Properties like spread, skew, and modality can all be quickly gleaned from a good visualization.

Here’s an example using the same Outlook on Life data from my previous posts:

Fig. 1: Count of respondents’ age group, by gender

This graph shows us the breakdown of age groups of the people who responded to this study, grouped by gender. At a glance it’s clear that both genders have a very similar distribution. Both are unimodal (they have a single peak, in the 55–65 age range) and skew to the left (i.e., the longer tail stretches towards younger respondents).

Visualizing Politics, Anger, and Optimism

Before I begin, I want to discuss the type of data I’m working with and how it affects visualization.

The 2012 Outlook on Life survey consists almost entirely of categorical data. Categorical data consists of variables with a fixed number of possible values, like the answers to a multiple choice questionnaire. Compared to quantitative data, categorical data is more limited in the ways it can be described and, by extension, visualized. It can’t, for example, be meaningfully drawn as a scatter plot or fitted with a regression line.
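To make that difference concrete, here is a minimal sketch using only the standard library. The responses and ages are invented for illustration and are not taken from the survey.

```python
from collections import Counter
from statistics import mean

# Hypothetical responses, invented for illustration (not survey data).
hope = ["Optimistic", "Neither", "Optimistic", "Pessimistic", "Optimistic"]
age = [34, 58, 61, 45, 72]

# A categorical variable is summarized by counting each value...
hope_counts = Counter(hope)
# ...and its "typical" value is the most frequent category (the mode).
hope_mode = hope_counts.most_common(1)[0][0]

# A quantitative variable also supports arithmetic summaries such as the mean.
age_mean = mean(age)

print(hope_counts)
print(hope_mode, age_mean)
```

Averaging the hope column would be meaningless, which is why counts (and therefore bar graphs) dominate the rest of this post.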

To refresh your memory, here’s the Python code that I use to load and clean my data (for more detail, see last week’s post):

First, I will create a single-variable bar graph to see the distribution of how optimistic participants said they felt about the future. Using Python and the (amazing) Seaborn module, we execute the following lines of code:

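import matplotlib.pyplot as plt
import seaborn as sns

# catplot is the current name for factorplot (renamed in seaborn 0.9)
sns.catplot(x='hope', data=data, kind='count')
plt.xlabel('Level of Optimism')
plt.ylabel('Count')
plt.title('How optimistic are you about your future?')
plt.show()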

Which produces this simple graph:

Fig. 2: Count of optimism

This plot summarizes the responses to the question “When you think about your future, are you generally optimistic, pessimistic, or neither optimistic nor pessimistic?” Clearly, the majority of participants are optimistic about their future. Only a very small portion of respondents said they were pessimistic.

The problem is that this visualization is so flat. We’re really only looking at one variable and its count. As I mentioned, categorical data is somewhat limited in how it can be visualized, but with Seaborn we can definitely add some extra dimensionality to this plot.

I’ll run the same code as before, but this time add a new parameter, hue:

# catplot is the current name for factorplot (renamed in seaborn 0.9)
sns.catplot(x='hope', hue='poli', data=data, kind='count')

Which transforms the previous figure into this:

Fig. 3: Levels of optimism by political orientation

Great! By adding the hue parameter, we can now see two variables at once: the level of optimism on the x-axis and political orientation in the coloring of the bars.

This chart largely tells the same story as the previous one. Each political orientation has the same shape and skew as Fig. 2. But we can make some new observations. For example, we can see that, in comparison to the other orientations, Democrats tend to have a low level of pessimism. The number of Independents who said they were optimistic is very close to the number who said they were neither optimistic nor pessimistic, a pattern not seen among the Democrats or Republicans in the study.
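Because the political groups differ in size, raw bar heights can make comparisons like these hard to judge. One way to check them is to look at shares within each group rather than counts. The following is a rough sketch on invented rows; only the column names (hope, poli) come from the post, and the numbers say nothing about the real survey.

```python
import pandas as pd

# Hypothetical stand-in for the survey data: column names follow the post,
# but every row is invented for illustration.
data = pd.DataFrame({
    "poli": ["Democrat"] * 6 + ["Independent"] * 4 + ["Republican"] * 2,
    "hope": ["Optimistic", "Optimistic", "Optimistic", "Optimistic", "Neither", "Pessimistic",
             "Optimistic", "Optimistic", "Neither", "Neither",
             "Optimistic", "Pessimistic"],
})

# Raw counts: the same numbers a hue-split count plot draws as bar heights.
counts = pd.crosstab(data["poli"], data["hope"])

# Row-normalized shares: each row sums to 1, so groups of different
# sizes can be compared directly.
shares = pd.crosstab(data["poli"], data["hope"], normalize="index")

print(counts)
print(shares.round(2))
```

With the real data, a low pessimistic share in the Democrat row would support the observation above independently of how many Democrats answered the survey.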

We can add one more dimension to our analysis by splitting the graph into two subplots, as in Fig. 1, using the col parameter:

# catplot is the current name for factorplot (renamed in seaborn 0.9)
sns.catplot(x='hope', hue='poli', col='gender', data=data, kind='count')

This produces:

Fig. 4: Level of optimism by political orientation and gender

In this plot the overall trend is again the same, but we can still notice some interesting differences, like the relatively high level of pessimism among male Independents.
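For anyone who wants to try the three-variable plot without the survey file, here is a self-contained sketch. It assumes a recent Seaborn release, where the function is called catplot, and it uses invented rows that only mimic the post’s column names. The non-interactive backend and the output filename are also assumptions made so the script runs anywhere.

```python
import matplotlib
matplotlib.use("Agg")  # assumed: non-interactive backend, so no display is needed
import pandas as pd
import seaborn as sns

# Invented rows that only mimic the post's column names (hope, poli, gender).
data = pd.DataFrame({
    "hope":   ["Optimistic", "Neither", "Pessimistic", "Optimistic",
               "Optimistic", "Neither", "Optimistic", "Pessimistic"],
    "poli":   ["Democrat", "Democrat", "Independent", "Republican",
               "Independent", "Republican", "Democrat", "Independent"],
    "gender": ["Female", "Male", "Male", "Female",
               "Female", "Male", "Male", "Female"],
})

# catplot returns a FacetGrid; col= creates one subplot per gender,
# and hue= colors the bars within each subplot by political orientation.
g = sns.catplot(x="hope", hue="poli", col="gender", data=data, kind="count")
g.set_axis_labels("Level of Optimism", "Count")
g.savefig("optimism_by_party_and_gender.png")  # hypothetical output filename
```

Working with the returned grid (rather than plt.xlabel) labels every subplot at once, which matters as soon as col splits the figure.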

Because my data has almost no quantitative variables, I’m mostly stuck with bar graphs, which get boring quickly. Fortunately, I do have one quantitative variable I can use: age.

Let’s say we want to see a breakdown of respondents by age, political orientation, and race. We can visualize that using a swarm plot:

sns.swarmplot(x='poli', y='age', hue='ethn', data=data)

The result is pretty striking:

Fig. 5: Swarm plot of political orientation by age, with ethnicity as hue

Here are some of the things we can readily glean from this graph: the distribution of ages by political orientation, the total count of respondents by political orientation, the distribution of races by political orientation, the distribution of races by age, and the total count of respondents of each race. That’s a lot of information for a single graph.
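A self-contained version of the swarm plot might look like the sketch below. The rows are invented and only reuse the post’s column names (poli, age, ethn); the backend choice, axis labels, and filename are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumed: non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented rows that only mimic the post's column names (poli, age, ethn).
data = pd.DataFrame({
    "poli": ["Democrat", "Democrat", "Democrat", "Independent",
             "Independent", "Republican", "Republican", "Republican"],
    "age":  [24, 57, 81, 33, 62, 29, 48, 70],
    "ethn": ["Black", "Black", "White", "Hispanic",
             "White", "White", "White", "White"],
})

# One dot per respondent: horizontal position is the party,
# height is the age, and color is the ethnicity.
fig, ax = plt.subplots()
sns.swarmplot(x="poli", y="age", hue="ethn", data=data, ax=ax)
ax.set(xlabel="Political Orientation", ylabel="Age")
fig.savefig("age_by_party_swarm.png")  # hypothetical output filename
```

Because every respondent is drawn individually, the width of each swarm doubles as a count, which is what lets a single figure carry so many readings.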

This graph suggests that ages are roughly uniformly distributed for Republicans and Independents. For Democrats, however, the distribution is multimodal, with peaks in the mid-20s and from 50 to 70, and a large concentration at 81.
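Shapes read off a swarm plot are eyeballed, so it can help to tabulate them as well. The sketch below bins ages by decade for each political orientation. The ages are invented to mimic the pattern described above and are not the survey’s values.

```python
import pandas as pd

# Invented ages, chosen only to illustrate the idea (not survey data).
data = pd.DataFrame({
    "poli": ["Democrat"] * 9 + ["Republican"] * 3,
    "age":  [24, 25, 26, 55, 58, 62, 66, 81, 81,
             30, 45, 60],
})

# Floor each age to its decade: 24 -> 20, 58 -> 50, 81 -> 80.
data["decade"] = data["age"] // 10 * 10

# Rows are decades, columns are political orientations, cells are counts.
age_table = pd.crosstab(data["decade"], data["poli"])
print(age_table)
```

A column with similar counts in every row points to a roughly uniform distribution, while a column with several separated high rows points to more than one mode.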

It’s also clear that, in this survey sample, Republican respondents are almost exclusively white, while African Americans make up the majority of Democratic respondents. Overall, Independents are a more varied group.

Conclusions

This has been neither a comprehensive overview of Seaborn nor a tour of every possible combination of variables in my dataset. Rather, I wanted to show the power and elegance of Seaborn, even when working with limited categorical data. To see additional ways Seaborn can visualize categorical data, be sure to read its fantastic documentation.