How to Choose the Right Chart: The Decision-Making Process
DataViz Series
Data types define the way we interact with data
If you have read the welcome post, you will know that I was a mathematics tutor for five years. My students may not have realised it, but their questions often strengthened my understanding of mathematics.
On one such occasion, a student directed my attention to a page of their statistics assignment. The question read, “Stand on a sidewalk for 15 minutes and tally the cars you see in their colour, be it red, white, blue, black or green. Plot your results”. Below this was a neatly drawn histogram with appropriate plotting elements.
This student couldn’t understand why they received full marks for their data collection but only half of the allocated marks for their plot. I replied, “It seems to me that you used a histogram instead of a bar or column chart; histograms aren’t the right chart for categorical data.” Until that moment, I was unaware that I had unconsciously absorbed some unspoken but foundational data visualisation rules, chiefly that appropriate chart usage depends on the data types of the variables in the diagram.
Why is this? Histograms have no spaces between the columns, suggesting that the data are at least ordinal, if not continuous. In contrast, column charts have gaps between the columns to imply that the data are separated categorically, though numbers may also represent these categories.
I was never explicitly taught that certain charts suit specific combinations of variable types. My discovery that an underlying set of rules governs visualisation choice was mostly self-directed. The conversation with my student taught me to base my data visualisation choices on variable data types. This article will define the processes I have learnt for creating a visualisation, from nominating a set of possible charts to filtering them and adding dimensionality.
Contents of the current article
- The data visualisation process
- Choosing a chart by data type: Further categorisation of data types
- Choosing a chart by data type: Univariate data
- Choosing a chart by data type: Bivariate data
- Filtering visualisations by purpose
- Adding dimension through plotting elements
- Conclusion
A practical example follows in the next article.
The data visualisation process
We all have unconscious processes that only change with external input. Though my student helped me realise that I had developed an intuitive mathematical reasoning for deciding between statistical charts, it wasn’t until I stepped into the business world and developed an analytical eye that I honed my decision process.
I have found that the most effective way to create a data visualisation is to follow a process modelled on the scientific method:
- Question: Before we collect or interrogate data for information, we must always begin by formulating a question. For example, you may be curious about the correlation between certain variables.
- Hypothesise: Develop a null hypothesis for your question.
- Predict: Predict the results of your question. If you are visually inclined, you might try mapping these in a flow chart.
- Collect: Collect data and determine the space of possible charts by assessing the data types of your selected variables.
- Filter: Filter these visualisations to ensure the purpose of the charts is congruent with the question that you need to answer.
- Test: Test whether any of the chosen visualisations provide a satisfactory response to your hypothesis.
After testing, we can consider incorporating further variables. Your unique situation determines whether another variable adds value to the hypothesis or narrows the narrative overmuch. Once a visualisation is chosen, we can add dimensions through plot elements such as colour, size, data point shapes and faceting.
At times, we might need to combine datasets to satisfy our questions; this situation commonly occurs when either formulating a question (the first step, Question) or incorporating more variables (after the sixth step, Test). If this happens, additional cleaning and processing may be required before proceeding to the subsequent steps.
There are also times when we require a collection of visualisations, such as when building a dashboard (as shown in the article banner) 😉. In this case, we might want to formulate a collection of questions.
Choosing a chart by data type
Further categorisation of data types
In the first post of the Fundamentals Series, we categorised numeric data as continuous or discrete. Though it may not be mathematically correct, we can extend this thought process to other types of data:
Textual data
- Continuous: free-text data
- Discrete: categorical data
Temporal data
- Continuous: continuous-time intervals
- Discrete: sequenced time-series data
Similar to the example of histograms and column charts, categorising all data types in this way allows us to perceive nuanced differences between standard statistical graphs, and therefore understand (from first principles) which type of visualisation is appropriate.
Univariate data
Univariate visualisations describe the composition of a single random variable. For quantitative variables, these visualisations typically show distributions and measures of central tendency; for qualitative variables, they may display category hierarchies and their frequencies. For temporal variables, we assess interval patterns for regularity, seasonality or trends.
In Figure 1, we noted that the separation (or lack thereof) between adjacent columns in histograms, column charts and bar charts depends on the independent variable. Note also the continuity of time implied by the connected points within the line chart in Figure 4, since we can fractionate time into infinitesimally small increments. I have suggested some visualisation options for univariate data types in Figure 5.
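To make the idea concrete, here is a minimal sketch of the taxonomy as a lookup table. It only encodes the pairings named in the text so far (histograms, column and bar charts, line charts); the names `Family`, `Continuity`, `UNIVARIATE_CHARTS` and `candidate_charts` are my own for illustration, and the table is not a substitute for Figure 5.

```python
from enum import Enum


class Family(Enum):
    NUMERIC = "numeric"
    TEXTUAL = "textual"
    TEMPORAL = "temporal"


class Continuity(Enum):
    CONTINUOUS = "continuous"
    DISCRETE = "discrete"


# Assumption: only the pairings mentioned in the article are listed here.
# Any other combination is deliberately left out rather than guessed.
UNIVARIATE_CHARTS = {
    (Family.NUMERIC, Continuity.CONTINUOUS): ["histogram"],
    (Family.TEXTUAL, Continuity.DISCRETE): ["column chart", "bar chart"],
    (Family.TEMPORAL, Continuity.CONTINUOUS): ["line chart"],
}


def candidate_charts(family, continuity):
    """Return the charts listed for this data type, or an empty list."""
    return UNIVARIATE_CHARTS.get((family, continuity), [])


# Car colours are categorical, i.e. discrete textual data.
print(candidate_charts(Family.TEXTUAL, Continuity.DISCRETE))
```

An empty list simply means this sketch has no entry for that combination, not that no suitable chart exists.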
Creating a chart invariably requires a quantitative aspect. This is inherent in numerical variables; for qualitative variables, we can use raw frequencies or derive percentages, probabilities or categorical scores, as in the example below.
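As a rough sketch of how a qualitative variable gains its quantitative aspect, the snippet below tallies car colours (as in my student's assignment) and derives percentages. The observations are made up for illustration, and the function names `frequencies` and `percentages` are my own.

```python
from collections import Counter

# Made-up observations for illustration; not real survey data.
observations = ["red", "white", "white", "black", "blue", "white", "black", "green"]


def frequencies(values):
    """Raw count of each category."""
    return Counter(values)


def percentages(counts):
    """Each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


counts = frequencies(observations)
shares = percentages(counts)

# A text-only bar chart: one row per category, longest bar first.
for colour, n in counts.most_common():
    print(f"{colour:<6}{'#' * n} {n} ({shares[colour]:.1f}%)")
```

Either the counts or the percentages can then serve as the dependent variable of a bar or column chart.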
Bivariate data
Bivariate visualisations are most commonly used to show relationships between two variables, comparisons within hierarchies, or trends over time.
If both variables are quantitative, we likely want to determine the existence or pattern of a relationship. If one of the variables is categorical while the other is quantitative, we are probably comparing values within categories. If both variables are categorical, then we would be comparing proportions rather than values. The following table provides options for bivariate visualisation based on this premise. Note that independent and dependent variables differ between coordinate systems:
Cartesian coordinate system
- Independent variable: x-axis
- Dependent variable: y-axis
Polar coordinate system
- Independent variable: the angle, theta (θ)
- Dependent variable: the radius (r)
Note also that when the visualisation utilises the Cartesian coordinate system, we may invert the axes to read category names more clearly. This way, the viewer doesn’t have to turn their head. The difference between a column chart (vertical columns) and a bar chart (horizontal bars) exemplifies this.
There is at least one exception for axis inversion in the Cartesian coordinate system: continuous-time charts must have the temporal variable displayed on the x-axis. This is for ease of chronological reading and because time varies independently of the quantity being measured.
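The reasoning in this section can be sketched as two small functions: one that guesses the likely purpose of a bivariate chart from the two variable types, and one that applies the axis-inversion exception for continuous time. This is only a sketch of the rules of thumb described above; the `Variable` class, the three kind labels and the example column names are hypothetical.

```python
from dataclasses import dataclass

QUANTITATIVE = "quantitative"
CATEGORICAL = "categorical"
CONTINUOUS_TIME = "continuous-time"
KNOWN_KINDS = {QUANTITATIVE, CATEGORICAL, CONTINUOUS_TIME}


@dataclass(frozen=True)
class Variable:
    name: str  # hypothetical column name
    kind: str  # one of the three constants above


def likely_purpose(a, b):
    """Guess what a chart of these two variables is most likely showing."""
    for variable in (a, b):
        if variable.kind not in KNOWN_KINDS:
            raise ValueError(f"unknown kind: {variable.kind!r}")
    kinds = sorted([a.kind, b.kind])
    if CONTINUOUS_TIME in kinds:
        return "trend over time"
    if kinds == [QUANTITATIVE, QUANTITATIVE]:
        return "relationship between the variables"
    if kinds == [CATEGORICAL, QUANTITATIVE]:
        return "comparison of values within categories"
    return "comparison of proportions"  # both categorical


def axes_may_be_inverted(a, b):
    """Continuous time stays on the x-axis; other pairs may be flipped."""
    return CONTINUOUS_TIME not in (a.kind, b.kind)


colour = Variable("car_colour", CATEGORICAL)
count = Variable("cars_seen", QUANTITATIVE)
print(likely_purpose(colour, count))
print(axes_may_be_inverted(colour, count))
```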
Filtering visualisations by purpose
Once we have a selection of possible visualisations, we must determine whether they are useful for our purpose. Tables 2 and 3 separate the visualisations mentioned earlier into three broad objectives:
- To analyse distribution, composition or change;
- To determine the existence of patterns, relationships or trends; or
- To compare subsets within the data.
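The filtering step itself is a simple lookup, sketched below. Please note that the objectives I have assigned to each chart here are illustrative assumptions based on common usage, not a reproduction of Tables 2 and 3, and the labels `analyse`, `relate` and `compare` are my own shorthand for the three objectives above.

```python
# Assumption: an illustrative catalogue, limited to charts named in this article.
CHART_OBJECTIVES = {
    "histogram": {"analyse"},           # distribution
    "line chart": {"analyse", "relate"},  # change over time, trends
    "column chart": {"compare"},        # subsets within the data
    "bar chart": {"compare"},
}


def filter_by_objective(candidates, objective):
    """Keep only the candidate charts whose objectives include the one we need."""
    return [
        chart
        for chart in candidates
        if objective in CHART_OBJECTIVES.get(chart, set())
    ]


candidates = ["histogram", "line chart", "column chart"]
print(filter_by_objective(candidates, "analyse"))
```

Charts missing from the catalogue are dropped rather than guessed at, which keeps the filter conservative.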
Adding dimension through plotting elements
Colour
One of my favourite games is called I Love Hue. The objective of the game is to sort an unordered mosaic of coloured tiles into a harmonious spectrum. It’s visually appealing, and I find it calming.
Along the perimeter of the mosaic is a change in hue (colour). If you follow an outer tile towards the mosaic centre, you will notice that the path gradually becomes greyer: this indicates a decrease in saturation (colour intensity). Now, notice that some tiles appear to be within the same colour family as their neighbours, and only vary in brightness: this shows a change in lightness. These three properties (hue, saturation and lightness) make up the colour model that we call HSL.
I find HSL easier to interpret than other colour code systems, such as RGB (red-green-blue) or HEX codes. If you have ever tried your hand at graphic design, photo editing or creating a style sheet, these may sound familiar.
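To see how the systems relate, here is a small sketch that converts an HSL colour into a HEX code using Python's standard `colorsys` module. The helper name `hsl_to_hex` is my own; note that `colorsys` expects the arguments in hue, lightness, saturation order, each scaled to the range 0 to 1.

```python
import colorsys


def hsl_to_hex(hue_degrees, saturation, lightness):
    """Convert HSL (hue in degrees, saturation and lightness in 0..1) to a HEX code."""
    # colorsys works in HLS order, with hue as a fraction of a full turn.
    red, green, blue = colorsys.hls_to_rgb(
        (hue_degrees % 360) / 360, lightness, saturation
    )
    return "#{:02x}{:02x}{:02x}".format(
        round(red * 255), round(green * 255), round(blue * 255)
    )


print(hsl_to_hex(0, 1.0, 0.5))    # fully saturated red
print(hsl_to_hex(0, 0.0, 0.5))    # same hue with no saturation: mid grey
print(hsl_to_hex(120, 1.0, 0.5))  # rotate the hue by a third of a turn: green
```

Changing one HSL value at a time has a predictable visual effect, which is much harder to reason about when editing the three RGB channels directly.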
In data visualisation, colour palettes generally fall into three types:
- Qualitative palettes vary in hue while remaining constant in saturation and lightness. The resulting palette contains distinct colours which we can think of as discrete and unordered; therefore, they are most suitable for nominal categories.
- Diverging palettes vary in lightness/saturation on either side of a neutral midpoint, and their extremities are usually different colours. Lightness and saturation are both ordered concepts, so these palettes are best used for ordered variables that have a meaningful centre, such as those containing positive, negative and neutral values.
- Sequential palettes also vary in lightness/saturation, but in one direction only; a diverging palette can be thought of as two sequential palettes joined at the midpoint. They are therefore suitable for unidirectional ordered variables.
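The three palette types can be sketched by varying one HSL component at a time. The code below is a minimal illustration using the standard `colorsys` module; the function names and the default hue, saturation and lightness values are arbitrary choices of mine, not recommended settings.

```python
import colorsys


def hsl(hue_degrees, saturation, lightness):
    """HSL to HEX; colorsys takes hue, lightness, saturation, each in 0..1."""
    r, g, b = colorsys.hls_to_rgb((hue_degrees % 360) / 360, lightness, saturation)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


def qualitative(n, saturation=0.65, lightness=0.5):
    """Evenly spaced hues at constant saturation and lightness."""
    return [hsl(i * 360 / n, saturation, lightness) for i in range(n)]


def sequential(n, hue=210, saturation=0.65, lightest=0.9, darkest=0.3):
    """One hue, stepping from light to dark."""
    if n < 2:
        return [hsl(hue, saturation, darkest)] * n
    step = (lightest - darkest) / (n - 1)
    return [hsl(hue, saturation, lightest - i * step) for i in range(n)]


def diverging(n_per_side, low_hue=10, high_hue=210):
    """Two sequential ramps in different hues, meeting at a neutral midpoint."""
    low = sequential(n_per_side, hue=low_hue)[::-1]  # dark to light
    high = sequential(n_per_side, hue=high_hue)      # light to dark
    return low + [hsl(0, 0.0, 0.95)] + high


print(qualitative(5))
print(sequential(4))
print(diverging(3))
```

Building the diverging palette from two sequential ramps mirrors the relationship between those two palette types described above.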
Size, shape and facets
Quiz time: Name the data types that are best suited to the following plotting elements. The answers can be found in the table that follows.
a) Variations in size
b) Variations in shape
c) Faceting
By matching variable data types to these plotting elements, we can add as many dimensions as are suitable for a 2D plot. It is important to note that additional variables do not always add value — sometimes, they may be redundant or add unnecessary complexity. At the heart of a chart, one must always choose clarity.
Conclusion
Selecting an appropriate chart depends on the initial question, the number of variables required to answer it, and the data types of those variables.
Univariate visualisations describe a variable's composition, whereas bivariate and multivariate visualisations show relationships, comparisons or trends. To convert a bivariate visualisation into a multivariate visualisation, we can layer suitable plotting elements such as colour, size, shape and facets according to the data types of the additional variables.