Moving Outside the Box for Data Visualization

Graphing and data visualization are all about storytelling. We must always consider the question, “How can I most effectively communicate the story that I am trying to tell with my data?” When we think about graphing, we typically think about building different types of graphs — “bar charts”, “line charts”, “scatterplots” — a veritable potpourri of middle school math class topics.

Each of these graphs can effectively tell the story that it is designed to tell. However, not all data stories fit neatly into the box of one of these graph types. If we consider only these named graphs, our storytelling abilities are limited.

In the graphs above, we see many elements that are common to the graphs we traditionally think of. There are bars. There are lines. There are points. However, none of the graphs themselves fit neatly into the box of a “bar chart”, a “line chart”, or a “scatterplot”. Each graph is something more than any of the boxes we might try to put it in.

In order to build complex visualizations like these and truly unleash our data storytelling capabilities, we must rethink how we traditionally build graphs. We need to consider a new framework that moves beyond traditional graphing conventions. We call this framework a “grammar of graphics”.

One of the most popular implementations of the “grammar of graphics” framework is Hadley Wickham’s¹ `ggplot2`² in `R`. He succinctly describes what a “grammar of graphics” is in his introduction to `ggplot2`:

A grammar of graphics is a tool that enables us to concisely describe the components of a graphic. Such a grammar allows us to move beyond named graphics (e.g., the “scatterplot”) and gain insight into the deep structure that underlies statistical graphics.

The key point that differentiates the “grammar of graphics” framework from how we traditionally think about graphs is that it builds a plot out of layers. In `ggplot2`, each layer is in turn made up of five components:

• data — what data is being used to build the layer?
• mapping — what field(s) from the data are being used to build the layer?
• geometry — what shape should our data take when building the layer?
• statistic — how should our data be transformed when building the layer?
• position — where should our data be placed when building the layer?

When building data visualizations using `ggplot2`, we do not necessarily need to think about each of these components for every new layer that we build. `ggplot2` will take care of much of that work behind the scenes for us. In most cases, you will simply need to consider the “data”, “mapping”, and “geometry” components of any given layer that you build. As you first begin working with the `ggplot2` framework, the “statistic” and “position” components will (largely) stay working behind the scenes.
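To make the five components concrete, here is a minimal sketch that builds the same layer twice: once with every component spelled out, and once leaning on the defaults. It uses the `mtcars` dataset that ships with R rather than anything from this article, and the names `explicit` and `implicit` are only illustrative.

```r
library(ggplot2)

# Every component of the layer is written out by hand.
explicit <- ggplot() +
  geom_point(                          # geometry: points
    data     = mtcars,                 # data
    mapping  = aes(x = wt, y = mpg),   # mapping: wt -> x, mpg -> y
    stat     = "identity",             # statistic: use the values as they are
    position = "identity"              # position: draw each point where it falls
  )

# The same layer, letting ggplot2 fill in the statistic and position.
implicit <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
```

Both objects should draw the same scatter of points, which is why the “statistic” and “position” components can usually be left alone at first.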

Here’s a toy example to consider:

This is a (very) made-up dataset about clothing purchases that consists of 4 fields: `Date`, `Item`, `Quantity`, and `Price`. From the data, specific fields can be mapped to different characteristics of a given geometry. For example, in this case I want to understand how the total quantities purchased of each item compare. A great way to do this is by using a bar geometry. To create a bar graph using `ggplot2`³, I am required to use an `x` aesthetic and a `y` aesthetic. Other aesthetics are available, but none of them are required. The code that I would use to build this plot (after a bit of data manipulation) would look similar to:
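The sketch below is one way such code could look. The rows in `purchases` are hypothetical stand-ins that only reuse the four field names from the text, and the use of base R's `aggregate()` for the data manipulation step is an assumption, since the text does not say how the totals were computed.

```r
library(ggplot2)

# Hypothetical purchase records with the four fields named in the text.
purchases <- data.frame(
  Date     = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02",
                       "2020-01-02", "2020-01-03")),
  Item     = c("Shirt", "Pants", "Shirt", "Socks", "Pants"),
  Quantity = c(2, 1, 3, 5, 2),
  Price    = c(20, 40, 20, 5, 40)
)

# The "bit of data manipulation": total quantity purchased per item.
totals <- aggregate(Quantity ~ Item, data = purchases, FUN = sum)

# One layer: data = totals, mapping = Item -> x and Quantity -> y,
# geometry = bars whose heights come straight from the y values.
bar_plot <- ggplot(totals, aes(x = Item, y = Quantity)) +
  geom_col()

# Printing the object draws the chart.
print(bar_plot)
```

Only the data, mapping, and geometry are stated here; the statistic and position components are left to the defaults of `geom_col()`.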

My visualization would look like:

This is an extremely simple example that relies upon using one layer and one geometry that, ultimately, produces one of the more common types of graphs — a bar chart. To get a better sense of how more complex visualizations are built, look at the work done by Gina Reynolds in creating the “ggplot flipbook”. Below, you can see some of the examples that she built and how the different layers work together to bring a plot to life, including some of the visualizations used earlier in this article.
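As a much smaller illustration of the same idea, and not taken from the flipbook, the following sketch stacks two layers that use different data. It relies on the `mtcars` dataset that ships with R; `cyl_means` and `layered_plot` are illustrative names.

```r
library(ggplot2)

# Mean miles per gallon for each cylinder count: the data for the bar layer.
cyl_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

layered_plot <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  # Layer 1: bars drawn from the summary data. The mapping is inherited
  # from ggplot(), so cyl_means needs the same column names (cyl, mpg).
  geom_col(data = cyl_means, alpha = 0.4) +
  # Layer 2: one point per car, drawn from the full dataset.
  geom_point()

print(layered_plot)
```

The result has bars and points, yet it is neither a plain bar chart nor a plain scatterplot. Each layer carries its own data and geometry while sharing one mapping.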

In the next post, we’ll explore how to use the “grammar of graphics” framework with `ggplot2` in order to explore data and build complex visualizations. We’ll slowly work through examples that require increasingly complex visualizations, moving beyond the traditional boxes of bar charts, line charts, and scatterplots.

¹ Hadley Wickham is, perhaps, the most influential figure in the R community. Besides `ggplot2`, Hadley has built some of the community’s most popular libraries.

² `ggplot2` traces its roots back further than Hadley Wickham, though, to a textbook called The Grammar of Graphics, written by Leland Wilkinson. Ultimately, Wilkinson can be considered the godfather of the “grammar of graphics” framework.

³ There are actually multiple ways to create a bar chart in `ggplot2`, and we’ll be using both of them in this article. The first example that we present here uses the `geom_col()` function. For this method, both the `x` and `y` aesthetics are mapped. We can also use `geom_bar()`, which only requires the `x` aesthetic. In this case, the height of each bar is calculated for us, using the count of each `x` value as a default.
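A short sketch of the `geom_bar()` behavior described in this footnote, using hypothetical purchase rows (the values are invented for illustration). The second plot uses the `weight` aesthetic, which is not mentioned above, to make `geom_bar()` sum a column instead of counting rows.

```r
library(ggplot2)

# Hypothetical purchase records, one row per purchase.
purchases <- data.frame(
  Item     = c("Shirt", "Pants", "Shirt", "Socks", "Pants"),
  Quantity = c(2, 1, 3, 5, 2)
)

# geom_bar() with only x mapped: bar height is the number of rows per Item.
count_plot <- ggplot(purchases, aes(x = Item)) +
  geom_bar()

# geom_bar() with a weight aesthetic: bar height is the sum of Quantity per Item.
weighted_plot <- ggplot(purchases, aes(x = Item, weight = Quantity)) +
  geom_bar()
```

With `geom_col()`, the totals would have to be computed beforehand and mapped to `y`; with `geom_bar()`, the default statistic does the counting or summing inside the layer.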