Data Visualization — Grammar of Graphics

Data Visualization

Swathi Sharma
AI Skunks
7 min read, Mar 14, 2023


Data visualization is the representation of data through the use of common graphics, such as charts, plots, infographics, and even animations. These visual displays communicate complex data relationships and data-driven insights in a way that is easy to understand.

“Do not trust your data blindly, always model on your data”

Summary statistics can be deceptive. Always visualize and understand the data attributes before moving on to feature engineering and building statistical, machine learning, and deep learning models.
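Anscombe's quartet (the subject of reference [3]) is the classic illustration of this point. The sketch below uses only the Python standard library; the helper name `summarize` is made up for illustration, and the values are the quartet as it is commonly published.

```python
from statistics import mean, variance

# Anscombe's quartet: four small datasets that share summary statistics.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # x values for sets I-III
x_four = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]         # x values for set IV

quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x_four, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return a few summary statistics, rounded to two decimals."""
    return {
        "mean_x": round(mean(xs), 2),
        "var_x": round(variance(xs), 2),   # sample variance
        "mean_y": round(mean(ys), 2),
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summaries.items():
    print(name, stats)
```

All four datasets print the same rounded values, yet their scatterplots look nothing alike (one is roughly linear, one is curved, and two are dominated by a single outlier). Only a plot reveals the difference.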

Different components and pieces of a visualization

The Grammar of Graphics gives us a common language for thinking about the design choices we make in a visualization.

This language describes everything about the data we use: how we divide and process that data, which visual channels (size, color, position) we display on our marks, and how we convert data into those channels.

The Grammar of Graphics is so important because it is the foundation of many packages for designing and developing visualizations, including Altair.

Grammar of Graphics

Grammar vs Grammar of Graphics

- Grammar is a system of rules that helps define and establish the components of a language.

- Grammar of Graphics is a framework that lets us describe the components of any visual in a straightforward manner. Instead of random trial and error, it uses a layered technique that builds a visualization from predetermined components.

Grammar of Graphics is a theory developed by Leland Wilkinson for visualizing and communicating data in a structured and coherent way. The Grammar of Graphics consists of a set of principles and rules that guide the design of graphics for data visualization.

The Grammar of Graphics is based on the idea that every graphic can be broken down into a series of components or layers. These components include the data, the aesthetic mapping, the geometric shapes, the statistical transformation, and the scales.

The data component refers to the raw data that is being visualized, while the aesthetic mapping refers to the way that data is mapped to visual properties such as color, size, and shape. The geometric shapes component refers to the basic visual elements used to represent the data, such as points, lines, and bars. The statistical transformation component refers to any calculations or analyses performed on the data, such as aggregations or summaries. Finally, the scales component refers to the way that data is transformed to fit within the visual space, such as scaling the axis.

By breaking down graphics into these components, the Grammar of Graphics provides a systematic approach to designing and interpreting data visualizations. It allows for greater flexibility in visualizing data, as well as a standardized way of communicating the design choices made in the creation of a graphic.

A variant of this, known as the layered grammar of graphics, was proposed by Hadley Wickham, a well-known data scientist and the creator of the popular R visualization package ggplot2.

Why do we need Grammar of Graphics?

We need the Grammar of Graphics to visualize multi-dimensional data effectively. Its systematic approach helps us reason about each component of a visualization separately, whatever the dimensionality of the data.

Components of the visualization grammar

Grammar of graphics divides a visualization into the following core categories:

Data: What is it that we are choosing to show?

Always start with the data, including the specific target dimensions or attributes.

Aesthetics: How do we choose to show the data?

Includes axes, relative positions, and encoding channels (e.g., shape, size, color), which are useful for plotting multiple data dimensions.

Scale: How do we get it from the data to a visual representation?

Do we need to scale the potential values, or use a specific scale to represent multiple values or a range? What range of values can be visualized?

Geometric objects: What kinds and shapes of marks are we using?

These are popularly known as ‘geoms’. They cover the way we depict the data points in the visualization: should they be points, bars, lines, and so on?

Statistics: How do we choose to pre-process or wrangle our data?

What computed information is presented? Includes how data is grouped

Facets: How do we choose to divide our data into multiple representations?

How is the data broken down into different visualizations? Do we need to create subplots based on specific data dimensions?

Coordinate system: How do we choose to position the data on the screen?

How are data values mapped to visual elements? What kind of coordinate system should the visualization be based on: Cartesian or polar?

By breaking down a graphic into these components, the Grammar of Graphics provides a standardized and flexible approach to designing and interpreting data visualizations. It allows for greater creativity and customization in the design of graphics, as well as a standardized way of communicating the design choices made in the creation of a graphic.

Example of deconstructing a graph into core components

ggplot2

The layered grammar of graphics was proposed by Hadley Wickham, the creator of the R visualization package ggplot2. His paper, ‘A layered grammar of graphics’, covers the proposed framework in detail and also discusses ggplot2, his open-source implementation for the R programming language.

ggplot2 is an R package developed by Hadley Wickham that implements the Grammar of Graphics theory for data visualization. It provides a powerful and flexible system for creating graphics by allowing users to specify the components of a graphic in a modular and customizable way.

The ggplot2 package allows users to create a wide variety of graphics, including scatterplots, line charts, bar charts, histograms, and more. It also provides a range of tools for customizing and formatting graphics, such as changing the color, size, and shape of points, lines, and bars, adding titles, labels, and legends, and adjusting the overall appearance of the graphic.

The core of ggplot2 is the ggplot() function, which creates a plot object that can be further modified using various other functions in the package. The syntax of ggplot2 is based on the idea of layering, where each component of a graphic is added to the plot object in a separate layer. For example, to create a scatterplot, the data is first loaded into the plot object using the ggplot() function, then the x and y variables are mapped to the aesthetic properties of the plot using the aes() function, and finally, the geometric objects that represent the data points are added to the plot using the geom_point() function.

One of the strengths of ggplot2 is its ability to handle complex data structures, such as data with multiple dimensions or hierarchical structures. The package also provides a range of tools for working with categorical data, including faceting and grouping by different variables.

Overall, ggplot2 provides a powerful and flexible system for creating high-quality graphics in R, and has become a widely used tool for data visualization in both academia and industry.

In this book, however, we will learn how to use ggplot in Python to create data visualizations with the Grammar of Graphics. Several Python packages implement the grammar of graphics; we will focus on plotnine. plotnine is based on ggplot2 from the R programming language and can be considered its Python equivalent.

plotnine

Plotnine is a Python package that is inspired by the Grammar of Graphics theory for data visualization. It is based on the ggplot2 package in R and provides a similar set of tools for creating graphics using a modular and customizable approach.

Plotnine allows users to create a wide variety of graphics, including scatterplots, line charts, bar charts, histograms, and more. It also provides a range of tools for customizing and formatting graphics, such as changing the color, size, and shape of points, lines, and bars, adding titles, labels, and legends, and adjusting the overall appearance of the graphic.

The core of plotnine is the ggplot() function, which creates a plot object that can be further modified using various other functions in the package. The syntax of plotnine is also based on layering, where each component of a graphic is added to the plot object in a separate layer. For example, to create a scatterplot, the data is first loaded into the plot object using the ggplot() function, then the x and y variables are mapped to the aesthetic properties of the plot using the aes() function, and finally, the geometric objects that represent the data points are added to the plot using the geom_point() function.

One of the strengths of plotnine is its ability to handle complex data structures, such as data with multiple dimensions or hierarchical structures. The package also provides a range of tools for working with categorical data, including faceting and grouping by different variables.

Overall, plotnine provides a powerful and flexible system for creating high-quality graphics in Python, and has become a popular tool for data visualization in both academia and industry.

References

[1] : Wickham, H. (2012). A layered grammar of graphics. Taylor & Francis. https://www.tandfonline.com/doi/abs/10.1198/jcgs.2009.07098

[2] : Szafir, D. A. (2021, July 13). A grammar of graphics. https://www.youtube.com/watch?v=RCaFBJWXfZc

[3] : Gupta, S. (2020, July 27). Importance of data visualization — anscombe’s quartet way. https://towardsdatascience.com/importance-of-data-visualization-anscombes-quartet-way-a325148b9fd2

[4] : Sarkar, D. (D. J.). (2018, September 13). A comprehensive guide to the grammar of graphics for effective visualization of multi-dimensional. https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149

[5] : Garcia, M. (2012). Using ggplot in Python: Visualizing Data With plotnine. https://realpython.com/ggplot-python/

[6] : The Carpentries. (2014–2021). Data visualization with ggplot2. https://datacarpentry.org/R-ecology-lesson/04-visualization-ggplot2.html

[7] : The Carpentries. (2014–2021). Making plots with plotnine. https://datacarpentry.org/python-ecology-lesson/07-visualization-ggplot-python/index.html

[8] : OpenAI. (2021). ChatGPT: A Large-Scale AI Language Model. [Computer software]. Retrieved from https://openai.com
