Anatomy of a chart

Published in Data Science at Microsoft, Nov 30, 2021

How do you communicate with data? Do you tend to show everything and let the readers figure out what’s important, or do you provide a curated view where insights are right there in front of them? As Andy Kirk has said in Data Visualisation: A Handbook for Data Driven Design, data visualization is a form of communication, transitioning from raw datasets and metrics into something trustworthy, accessible, and elegant. The question, of course, is how to get there.

By engaging our perception, data visualization uses individual visual elements, geometry, and the composition of those elements to translate the information from text into shapes. Data scientists and analysts do this daily, communicating the results of business actions, financial results, and customer insights to explain their findings and improve the knowledge of domain owners. Unfortunately, this process sometimes results in a data dump — throwing everything out in front of the audience, whose task is then to visually decode the message.

To avoid that problem, I recommend following a highly curated process in which every element is placed intentionally and meaningfully. Starting with data, this involves carefully selecting and adding each piece on the canvas and assembling everything in a structured manner.

In this way, every detail matters and each one plays a role. Following this approach can make a big difference in what your final output looks like and, more importantly, how it’s received and consumed. In this article, I illustrate this process and explain how to properly think about designing a data visualization.

Process and guidelines

Data visuals reflect not just their underlying data, but a real-world phenomenon. They tell stories about people, places, and events. The charts you create represent a certain reality, and they answer important questions about a specific subject: who, where, when, how, or why. Taken together, charts convey a message and amplify understanding for the intended audience.

Boxes describing the process stages: real world, data, visuals and understanding.

My focus here is on stages 2 and 3 as shown in the diagram above — the transition from data to viz — while keeping the other two in mind. Remember, the goal is for readers to learn something new, and as quickly as possible, without friction, ambiguity, or confusion. The message to be delivered must be easy to digest as well as effective, clear, and memorable. I now review some key concepts that make this possible.

Grammar of graphics

This framework was created and named by Leland Wilkinson and pushed into the data science mainstream by Hadley Wickham. It essentially involves mapping raw data to visual elements through layers, following a specific syntax pattern consisting of data, aesthetic mappings, geometric objects, statistics, positioning, scales, and facets.
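As a rough sketch of how those components can appear in ggplot2 code, the example below builds one plot from R's built-in mtcars dataset, with roughly one line per component. The variables chosen (wt, mpg, am), the linear fit, and the axis label are illustrative assumptions, not part of the framework itself.

```r
library(ggplot2)

# Each line adds one grammar component; mtcars ships with base R.
p_grammar <- ggplot(data = mtcars,                    # data
                    mapping = aes(x = wt, y = mpg)) + # aesthetic mappings
  geom_point(position = "identity") +                 # geometric object and positioning
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                           # statistic (linear fit)
  scale_x_continuous(name = "Weight (1000 lbs)") +    # scale
  facet_wrap(~ am)                                    # facets: one panel per transmission code

print(p_grammar)
```

Removing or swapping any single line changes one aspect of the chart while leaving the others intact, which is the practical benefit of thinking in layers.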

Pre-attentive attributes

These are the visual elements that our brain can translate into meaning — even if we are not aware that it’s happening. They represent the visual encoding of specific values (such as metrics) to one of the following visual properties: color, shape, size, angle, length, width, orientation, or grouping.

Gestalt principles

When interpreting the groups of visual objects and their relationships, our brain relies on the following principles: proximity, similarity, connectedness, continuity, and closure. They help us to translate the individual parts into a meaningful whole so that we can automatically recognize the underlying patterns.

You have already used many of these concepts, even if you have never heard of them before. Every time you create a chart for presenting data, many of these aspects are already built in. By recognizing what is happening behind the scenes, however, and keeping that in mind before you start the design process, you can make your data visualizations even more effective.

The data-to-viz transition can be shown with a simple table-to-graph illustration. Each data point in the table is encoded as a mark in a Cartesian coordinate system of x and y values:

This is an oversimplified version of a scatterplot, encoding only two observations with two numerical variables. But if we look closely, we can see that it already employs several pre-attentive attributes and Gestalt principles:

Scatterplot example showing gestalt principles.
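The same table-to-graph step can be sketched in ggplot2. The two-row data frame below is hypothetical (the labels A and B and their coordinates are made up for illustration); each row becomes one mark on the plot.

```r
library(ggplot2)

# Hypothetical two-row table: one row per observation, one column per variable.
pts <- data.frame(
  label = c("A", "B"),
  x     = c(4, 2),
  y     = c(3, 1)
)

# Each row becomes one mark; position along the axes encodes the x and y values.
p_two <- ggplot(pts, aes(x = x, y = y, label = label)) +
  geom_point(size = 3) +
  geom_text(nudge_y = 0.3) +   # name each point right next to its mark
  coord_cartesian(xlim = c(0, 5), ylim = c(0, 4))

print(p_two)
```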

On their own, the two points represent individual values based on their location along the axes, but the principles describe the relationship between them. The story gets more interesting as we start adding additional marks to the chart:

Different types of charts based on two values between x and y axis.

These additions may seem minimal, but they dramatically change the message conveyed in each chart. Chart 1, for example, indicates that point A barely crossed some threshold defined by a dotted line, while point B is not even close to it. Charts 2 and 3 show that A is larger than B based on some quantity measured on the x-axis and y-axis. Chart 4 introduces a third variable — size as a pre-attentive attribute — so object B is now much more prominent than before.

Now, if you’re thinking “It would have been better if these comments were added directly on the charts,” then great job. You are unintentionally tapping into the Gestalt principle of proximity, which highlights the connection among visual elements simply by having them close to each other. Because I placed the comments below the charts, embedded in this paragraph, your eyes had to make a few trips up and down the text to make the connection.

If you were to guess, by some criteria, whether A or B is better — without knowing the units, scales, or meaning of those points — the answer would differ depending on which version of the chart you were viewing. So, even in this trivial example, we can see that each addition or subtraction can have a big impact on what the audience takes away from the chart.
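The four variants discussed above can be approximated in ggplot2 as follows. This is only a sketch: the values for A and B, the threshold of 2.8, and the third variable z are hypothetical, and rendering charts 2 and 3 as bar charts is my assumption about the original figures. Each comment is attached to its chart as a subtitle, so the explanation sits next to the marks it describes.

```r
library(ggplot2)

# Hypothetical values for points A and B (units deliberately unspecified).
pts <- data.frame(label = c("A", "B"), x = c(4, 2), y = c(3, 1), z = c(1, 5))

base <- ggplot(pts, aes(x, y)) +
  coord_cartesian(xlim = c(0, 5), ylim = c(0, 4))

# Chart 1: a dotted reference line acts as a threshold.
chart1 <- base + geom_point(size = 3) +
  geom_hline(yintercept = 2.8, linetype = "dotted") +
  labs(subtitle = "A barely crosses the threshold; B is far below it")

# Chart 2: vertical bars compare A and B on the y quantity.
chart2 <- ggplot(pts, aes(label, y)) + geom_col() +
  labs(subtitle = "A is larger than B on y")

# Chart 3: horizontal bars compare A and B on the x quantity.
chart3 <- ggplot(pts, aes(label, x)) + geom_col() + coord_flip() +
  labs(subtitle = "A is larger than B on x")

# Chart 4: a third variable mapped to size makes B more prominent.
chart4 <- base + geom_point(aes(size = z)) +
  geom_text(aes(label = label), nudge_y = 0.5) +
  labs(subtitle = "Size encodes a third variable; B now stands out")

print(chart1); print(chart2); print(chart3); print(chart4)
```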

Applying the concepts in ggplot2

Now it’s time to apply some of these concepts in ggplot2, an R library for data visualization. The example I present is based on sample data that shows the specifications of various car models.

Just as in the first illustration, we start with points scattered around the coordinate system based on two numerical values: the number of cylinders (cyl) on the x-axis and miles per gallon (mpg) on the y-axis. A pattern is already visible: the higher the number of cylinders, the lower the mpg. The principles of proximity and grouping help us reach this conclusion.

install.packages("ggplot2")  # one-time install
library(ggplot2)
# cylinders on the x-axis, miles per gallon on the y-axis
ggplot(mtcars, aes(cyl, mpg)) + geom_point()

Scatterplot created in R, using ggplot2 library.
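The visual pattern can be cross-checked numerically. The snippet below is a quick sanity check rather than part of the original walkthrough: it uses base R's aggregate() to compute the mean mpg for each cylinder count in mtcars.

```r
# Mean mpg per cylinder count in the built-in mtcars data.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mpg_by_cyl)

# TRUE if the mean falls each time the cylinder count rises (rows are sorted by cyl).
print(all(diff(mpg_by_cyl$mpg) < 0))
```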

Next, we include horsepower (hp), also a numerical measure, as one more dimension, using size as its visual attribute. Points are now converted into bubbles, and we observe that they tend to be bigger in the lower right corner of the chart: the higher the number of cylinders, the greater the horsepower of the car.

ggplot(mtcars, aes(cyl, mpg, size = hp)) + geom_point()

The next step is to add data about the transmission, which is a categorical variable (or factor) with two levels: 0 for automatic and 1 for manual. To encode it visually, we use color as a pre-attentive attribute to make a distinction between the two groups. Most of the observations in the lower right corner are colored red, which represents automatic transmission.

ggplot(mtcars, aes(cyl, mpg, size = hp, color = factor(am))) + geom_point()
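One possible refinement, offered as a sketch rather than as the article's own code: giving the transmission codes readable labels makes the legend self-explanatory, and a small amount of transparency helps when bubbles overlap. The column name transmission, the legend titles, and the alpha value are my own choices.

```r
library(ggplot2)

# Copy the data and label the transmission codes (0 = automatic, 1 = manual).
cars <- mtcars
cars$transmission <- factor(cars$am, levels = c(0, 1),
                            labels = c("automatic", "manual"))

p_color <- ggplot(cars, aes(cyl, mpg, size = hp, color = transmission)) +
  geom_point(alpha = 0.7) +   # transparency keeps overlapping bubbles visible
  labs(color = "Transmission", size = "Horsepower")

print(p_color)

# Count of cars per cylinder/transmission combination, to back up the visual impression.
print(table(cars$cyl, cars$transmission))
```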

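Finally, we add reference lines and tap into the Gestalt principle of closure. If we are interested in the group of cars with more than seven cylinders and less than 12 mpg, only one bubble (bottom right) appears to meet those criteria. In mtcars that bubble is actually two cars, the Cadillac Fleetwood and the Lincoln Continental, which share the same cyl and mpg values and are therefore drawn on top of each other.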

ggplot(mtcars, aes(cyl, mpg, size = hp, color = factor(am))) + geom_point() +
  geom_vline(xintercept = 7, linetype = "dashed") +
  geom_hline(yintercept = 12, linetype = "dashed")
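What falls inside the region marked by the reference lines can be checked directly against the data. The filter below is a small verification sketch using base R's subset(). In the standard mtcars dataset, the single bubble visible in the bottom right corresponds to two rows that share the same cyl and mpg values.

```r
# Rows of mtcars that satisfy both reference-line conditions.
selected <- subset(mtcars, cyl > 7 & mpg < 12, select = c(mpg, cyl, hp, am))
print(selected)

# Number of cars behind the bubble in the bottom right corner.
print(nrow(selected))
```

Overlapping marks like this are easy to miss in a chart, so a quick count is a cheap safeguard before describing what a region contains.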

The finished example shows a basic view that most chart readers are familiar with. It can be further enhanced to reflect more complex scenarios and polished into publication-ready visuals. Either way, the concept and thinking behind this process remain the same: a layered approach for mapping the data to corresponding visual objects in a structured and organized way. This approach makes it easy to update and iterate on the base visual, because the main graph can be saved as an object and further refinements added as layers. In this way, charts can be enhanced, personalized, and branded with themes specific to an organization or an individual, without risking the core structure, data mappings, and key relationships that may be embedded in the dataset.
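A minimal sketch of that workflow, assuming the same mtcars example: the base plot is stored once, and reference lines, labels, and a theme are then added as separate layers. The object names, title text, and theme choices are hypothetical stand-ins for whatever house style an organization might use.

```r
library(ggplot2)

# Base visual: data and mappings are defined once and stored as an object.
base_plot <- ggplot(mtcars, aes(cyl, mpg, size = hp, color = factor(am))) +
  geom_point()

# Refinements are added as layers without touching the base mappings.
with_lines <- base_plot +
  geom_vline(xintercept = 7, linetype = "dashed") +
  geom_hline(yintercept = 12, linetype = "dashed")

# Hypothetical "house style": labels and a theme applied on top.
branded <- with_lines +
  labs(title = "Fuel economy by cylinder count",
       x = "Cylinders", y = "Miles per gallon",
       size = "Horsepower", color = "Transmission (0 = auto, 1 = manual)") +
  theme_minimal() +
  theme(legend.position = "bottom")

print(branded)
```

Because base_plot is never modified, a different set of refinements can be layered onto it later without redefining the data or the mappings.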

Conclusion

Successful chart design is not a random transition from data to visual. It requires an understanding of important concepts such as pre-attentive attributes, Gestalt principles, and how to apply the Grammar of Graphics. Each component must be handled with care as it is placed on the coordinate system used to systematically create a graph. Each piece has a purpose and plays an important role in the overall story. Applying these core principles helps data designers tell that story and deliver valuable insights to their audience.

Gordan Kuvac is on LinkedIn.