A Crash Course in DATA: Control your charts like an expert with Altair

A guide to instantly boost your charting skills

Dushyant Mahajan
AI Skunks
12 min read · Mar 2, 2023

Data skills have become a must-have for employees in today’s data-driven world. It’s more important than ever to have accessible ways to view and understand data to obtain useful insights.

One of the best ways to understand data is to see it.

Data visualization can allow you to get an instant understanding of the data that is just not possible by observing rows of data in a table.

There are two main approaches to creating data visualizations:

1. Imperative - involves writing code to manually create charts. It’s used in libraries such as Matplotlib or Seaborn, where you specify the data, chart type, and appearance details. This approach offers more control but can be more complex and take longer.

2. Declarative - you define what information you want to display in a visualization rather than how to display it. This approach is simpler and more intuitive, because libraries such as Altair or ggplot2 handle the technical details of generating the chart. You can concentrate on telling the story with your data.

It is up to you to decide which approach suits your data and which one you feel more comfortable with. Many libraries, however, send you back to their long and tiring documentation whenever you need to look up syntax.

This is where Altair comes in. Its syntax is clean and easy to understand, as we will see in the examples. I believe that after reading this post, you will feel confident about making charts.

Why Altair?

Altair offers a powerful and concise visualization grammar for quickly building a wide range of statistical graphics. The declarative approach is often more intuitive, as you can focus on the story you want to tell rather than the technical details of how to create the visualization. Altair can be used to create a variety of beautiful plots, such as bar charts, pie charts, histograms, scatterplots, etc.

The advantage of Altair lies in its API capabilities. As a Python API to Vega-Lite, Altair’s main purpose is to convert plot specifications to a JSON string that conforms to the Vega-Lite schema and then render the visualization inline inside the browser using Vega and Vega-Lite.
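
To make this concrete, here is a rough, hand-written sketch of the kind of Vega-Lite specification a simple scatter plot boils down to. It is built as a plain Python dictionary and serialized with the standard json module. The two data rows are made up for illustration, and the JSON that Altair actually emits differs in details such as how the data is embedded.

```python
import json

# A hand-written Vega-Lite specification for a scatter plot.
# The data rows are invented for illustration only.
spec = {
    "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
    "data": {
        "values": [
            {"Horsepower": 130, "Miles_per_Gallon": 18.0},
            {"Horsepower": 95, "Miles_per_Gallon": 24.0},
        ]
    },
    "mark": "point",
    "encoding": {
        "x": {"field": "Horsepower", "type": "quantitative"},
        "y": {"field": "Miles_per_Gallon", "type": "quantitative"},
    },
}

# Serialize to the JSON string that a Vega-Lite renderer would consume.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Everything a renderer needs (data, mark, and encodings) sits in one declarative document, which is why Altair code can stay so short.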

Fundamentals of Charting with Altair

The key elements we will need to remember are:

  1. Chart object - It takes the dataframe as a single argument. It knows how to emit a JSON dictionary representing the data and visualization encodings.
  2. Chart marks - These decide what sort of mark we would like to use to represent our data, e.g., point, line, bar, area, tick, circle, and more.
  3. Chart encodings - An encoding channel specifies how a given data column should be mapped onto the visual properties of the visualization.

Installing required packages:

$ pip install altair vega_datasets jupyter vega pandas

Basic Plotting

Let’s move ahead and get the dataset from vega_datasets and perform some simple analysis to understand it better. You can also import your data using pandas; for now, I’m using the built-in dataset.

from vega_datasets import data
import altair as alt
cars = data.cars()
cars.sample(5)

In this case, let’s begin by looking at the relationship between mileage (Miles_per_Gallon) and engine power (Horsepower) in a scatter plot.

For that, we’ll want to make a chart with points. So we use mark_point() and encode the x-axis with “Miles_per_Gallon” and the y-axis with “Horsepower.”

alt.Chart(cars).mark_point().encode(
x="Miles_per_Gallon",
y="Horsepower"
)
Voila! It’s as simple as that.

Want to add colors and shapes? Just mention the visual encodings, color and shape.

alt.Chart(cars).mark_point().encode(
x='Horsepower',
y='Miles_per_Gallon',
color='Origin',
shape='Origin'
)
The shapes and colors are automatically mapped based on the data column (here, ‘Origin’).

The key to creating meaningful visualizations is to map properties of the data to visual properties in order to effectively communicate information.

Do you need a bar chart? Change mark_point() to mark_bar(). Adding color to the bar chart (by using the color encoding) creates a stacked bar chart by default.

alt.Chart(cars).mark_bar().encode(
x=alt.X('Miles_per_Gallon', bin=True), #binning the columns
y='count()', #aggregating the data
color='Origin'
)
Mapping a binned quantitative field to x and an aggregate count to y produces a histogram.

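Need a line plot for your data? Use mark_line(). This example uses the stocks dataset, which is also available from vega_datasets:

stocks = data.stocks()
alt.Chart(stocks).mark_line().encode(
x='date',
y='price',
color='symbol'
)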

Here are some of the more commonly used mark_*() methods supported in Altair and Vega-Lite:

  • mark_area()
  • mark_bar()
  • mark_circle()
  • mark_line()
  • mark_point()
  • mark_rule()
  • mark_square()
  • mark_text()
  • mark_tick()

Encoding

The key to creating meaningful visualizations is to map the properties of the data to visual properties in order to communicate information effectively. In Altair, this mapping of visual properties to data columns is referred to as Encoding.

Encoding Data types

The details of any mapping depend on the type of data. Altair recognizes five main data types:

  • quantitative (Q): a continuous real-valued quantity
  • ordinal (O): a discrete ordered quantity
  • nominal (N): a discrete unordered category
  • temporal (T): a time or date value
  • geojson (G): a geographic shape

If types are not specified for data input as a DataFrame, Altair defaults to quantitative for any numeric data, temporal for date/time data, and nominal for string data.

How Data Type affects Color Scales

Specifying the correct type for your data is important, as it affects the way Altair represents your encoding in the resulting plot.

Below is an example that shows how the data type influences the way Altair decides on the color scale to represent the value and also influences whether a discrete or continuous legend is used.

base = alt.Chart(cars).mark_point().encode(
x='Horsepower:Q',
y='Miles_per_Gallon:Q',
).properties(
width=250,
height=250
)

# Plotting the same chart with different data types
(
base.encode(color='Cylinders:Q').properties(title='quantitative')
| base.encode(color='Cylinders:O').properties(title='ordinal')
| base.encode(color='Cylinders:N').properties(title='nominal')
)
Notice the different legends and colors based on the data type.

Encoding Channels

What is a channel? In data visualization, a visual variable used to represent data is called a channel. A channel is typically mapped to a specific aspect of the data, such as its value, position, or color.

Examples of channels include position (e.g., x and y axes), size, shape, color, texture, and orientation.

Each encoding channel allows for a number of additional options to be expressed; these can control things like axis properties, scale properties, headers and titles, binning parameters, aggregation, sorting, and many more.

Binning data

One of the most common uses of binning is the creation of histograms. One interesting thing is that Altair’s declarative approach allows us to start assigning these values to different encodings to see other views of the exact same data.

So, for example, if we assign the binned miles per gallon to the color, we get this view of the data:

alt.Chart(cars).mark_bar().encode(
color=alt.Color('Miles_per_Gallon', bin=True),
x='count()',
y='Origin'
)
This gives us a better appreciation of how MPG is distributed within each region of origin. We see that well over half of US cars were in the “low mileage” category.

Changing the encoding again, let’s map the color to the count instead:

alt.Chart(cars).mark_rect().encode(
x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=20)),
color='count()',
y='Origin',
)
The result is a heatmap.

This is one of the beautiful things about Altair: it shows you through its API grammar the relationships between different chart types.

For example, a 2D heatmap encodes the same data as a stacked histogram!

Aggregation of Data

Beyond simple channel encodings, Altair’s visualizations are built on the concept of database-style grouping and aggregation, that is, the split-apply-combine abstraction that underpins many data analysis approaches.

One key operation in data exploration is the group-by; the group-by splits the data according to some condition, applies some aggregation within those groups, and then combines the data back together.

For the car data, you might split by Origin, compute the mean of the miles per gallon, and then combine the results. In Pandas, the operation looks like this:
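
A minimal sketch of that split-apply-combine step, using a small made-up stand-in for the cars table with the same column names (Origin and Miles_per_Gallon); the values are invented for illustration:

```python
import pandas as pd

# Made-up stand-in for the cars dataset, using the same column names.
cars_small = pd.DataFrame({
    "Origin": ["USA", "USA", "Europe", "Europe", "Japan"],
    "Miles_per_Gallon": [18.0, 16.0, 26.0, 30.0, 32.0],
})

# Split by Origin, apply the mean to Miles_per_Gallon, combine into a Series.
mean_mpg = cars_small.groupby("Origin")["Miles_per_Gallon"].mean()
print(mean_mpg)
```

With the real dataset loaded as cars, the same call would be cars.groupby('Origin')['Miles_per_Gallon'].mean().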

In Altair, this sort of split-apply-combine can be performed by passing an aggregation operator within a string to any encoding.

For example, we can display a plot representing the above aggregation as follows:

alt.Chart(cars).mark_bar().encode(
x='mean(Miles_per_Gallon)',
y='Origin',
color='Origin'
)
Mean of Miles_per_Gallon based on Origin

Aggregates can also be used with data that is only implicitly binned. For example, look at this plot of MPG over time:

alt.Chart(cars).mark_point().encode(
x='Year:T',
y='Miles_per_Gallon',
color='Origin'
)
Without aggregation, it’s difficult to follow the data.

The fact that the points overlap so much makes it difficult to see important parts of the data; we can make it clearer by plotting the mean in each group (here, the mean of each Year/Country combination):

alt.Chart(cars).mark_line().encode(
x='Year:T',
y='mean(Miles_per_Gallon)',
color='Origin'
)
Each line traces the mean per year for each country.

The mean aggregate only tells part of the story, though: Altair also provides built-in tools to compute the lower and upper bounds of confidence intervals on the mean.

We can change the mark to mark_area(), specify the lower and upper bounds of the area using the y and y2 channels, and use the ci0 and ci1 aggregates to plot the confidence interval of the estimate of the mean:

alt.Chart(cars).mark_area(opacity=0.3).encode(
x='Year:T',
color='Origin',
y='ci0(Miles_per_Gallon)',
y2='ci1(Miles_per_Gallon)'
)

3. Compounding charts

Support for multi-panel and layered charts is essential in a data visualization library, and Altair provides a concise API for creating them. Let’s explore a few options:

  • Layering
  • Horizontal Concatenation
  • Vertical Concatenation

Layering

Layering lets you place multiple marks on a single chart. One common example is creating a plot with both points and lines representing the same data. Let’s use the stocks data for this example:

stocks = data.stocks()
stocks.head()

alt.Chart(stocks).mark_line().encode(
x='date',
y='price',
color='symbol'
)
Line chart

And here is the same plot with a circle mark:

alt.Chart(stocks).mark_circle().encode(
x='date:T',
y='price:Q',
color='symbol:N'
)
Circle Chart

We can layer these two plots together using the + operator. One pattern we can use often is to create a base chart with the common elements and add together two copies with just a single change:

base = alt.Chart(stocks).encode(
x='date:T',
y='price:Q',
color='symbol:N'
)

base.mark_line() + base.mark_circle()
Layered chart with Circle and Line chart

Horizontal Concatenation

Just as we can layer charts on top of each other, we can concatenate horizontally using alt.hconcat, or equivalently the | operator:

alt.hconcat(base.mark_line(), base.mark_circle())
# Another way of writing the same thing
# base.mark_line() | base.mark_circle()
Horizontal Concatenation

Vertical Concatenation

Vertical concatenation looks a lot like horizontal concatenation but uses either the alt.vconcat() function, or the & operator:

base.mark_line() & base.mark_circle()

4. Grammar of interaction

The grammar of interaction lets you build interactive features of a plot from components. Let’s take our scatter plot from above and add some interaction to it.

Interactivity and Selections

Altair’s interactivity and grammar of selections are among the features that set it apart from other plotting libraries.

Just adding .interactive() puts Vega to work: the chart becomes interactive in the browser, allowing you to zoom and pan.

alt.Chart(cars).mark_point().encode(
x='Horsepower',
y='Miles_per_Gallon',
color='Origin',
shape='Origin'
).interactive()

Let me walk you through the selection types that are available so you can start creating interactive charts and dashboards.

There are three basic types of selections available (these are the Altair 4 names; Altair 5 replaces the single and multi selections with alt.selection_point()):

  • Interval Selection: alt.selection_interval()
  • Single-Selection: alt.selection_single()
  • Multi Selection: alt.selection_multi()

Interval Selection

As an example of a selection, let’s add an interval selection to a chart. We’ll start with our canonical scatter plot:

interval = alt.selection_interval()

alt.Chart(cars).mark_point().encode(
x="Miles_per_Gallon",
y="Horsepower",
color="Origin"
).add_selection(
interval
)

An interval selection lets you click and drag to select a region of the chart. By default the selection is two-dimensional; passing the encodings argument (for example, encodings=['x']) restricts it to one dimension. To make the chart more interesting, we can make an encoding conditional on the selection, such as coloring only the selected points.

interval = alt.selection_interval(encodings=['x'])

alt.Chart(cars).mark_point().encode(
x="Miles_per_Gallon",
y="Horsepower",
color=alt.condition(interval, "Origin", alt.value("lightgray"))
).add_selection(
interval
)

We can combine interactive charts using compounding, and the selection is shared between them. Let’s place a second copy of the chart alongside the first, with a different variable on the x-axis.

interval = alt.selection_interval(encodings=['x', 'y'])

chart = alt.Chart(cars).mark_point().encode(
x="Miles_per_Gallon",
y="Horsepower",
color=alt.condition(interval, "Origin", alt.value("lightgray"))
).add_selection(
interval
)

chart | chart.encode(x='Acceleration')

Tooltips

When a user hovers over a mark, a tooltip can display the values of selected fields for that data point. This helps your audience engage with the data at a deeper level while staying in the current view. In Altair, tooltips are added with the tooltip encoding channel, which accepts a single field or a list of fields.

Restricting the selection to the x-direction and adding tooltips gives us more detail about the individual data points.

interval = alt.selection_interval(encodings=['x'])

chart = alt.Chart(cars).mark_point().encode(
x="Miles_per_Gallon",
y="Horsepower",
color=alt.condition(interval, "Origin", alt.value("lightgray")),
tooltip=['Name', 'Origin', 'Horsepower', 'Miles_per_Gallon']
).add_selection(
interval
)

chart | chart.encode(x='Acceleration')

We can do even more sophisticated things with selections as well. For example, let’s make a bar chart of the number of cars by Origin, filtered by the selection, and stack it below our scatter plots:

interval = alt.selection_interval()
base = alt.Chart(cars).mark_point().encode(
y='Horsepower',
color=alt.condition(interval, 'Origin', alt.value('lightgray')),
tooltip='Name'
).add_selection(
interval
)
hist = alt.Chart(cars).mark_bar().encode(
x='count()',
y='Origin',
color='Origin'
).properties(
width=800,
height=80
).transform_filter(
interval
)
scatter = base.encode(x='Miles_per_Gallon') | base.encode(x='Acceleration')
scatter & hist

5. Saving Altair Charts

It would all be a waste if these cool interactive charts only ran in a Python notebook. Because Altair produces Vega-Lite specifications, it is relatively straightforward to export charts and publish them on the web as Vega-Lite plots.

HTML format

All that is required is to load the Vega-Lite JavaScript library and pass it the JSON plot specification output by Altair. For convenience, Altair provides a Chart.save() method, which will save any chart to HTML.

This HTML file can be transferred and opened in any browser without having to load the data separately.

base.save('chart.html')

JSON format

The fundamental chart representation output by Altair is a JSON string format; one of the core methods provided by Altair is Chart.to_json(), which returns a JSON string that represents the chart content. Additionally, you can save a chart to a JSON file using Chart.save(), by passing a filename with a .json extension.

The JSON format can be useful for sharing and loading the Chart in a notebook. This imported chart can then be viewed easily.

base.save('chart.json')

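Altair charts can also be saved as PNG, SVG, or PDF images by passing a filename with the corresponding extension to Chart.save(). Image export depends on an additional conversion package being installed (for example, altair_saver with Altair 4 or vl-convert-python with Altair 5):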

chart.save('chart.png')
chart.save('chart.svg')
chart.save('chart.pdf')

Notebook to refer to for code and explanations: https://colab.research.google.com/drive/1Rxg87Pn9Crpd0Kg8VhHFdtxp4EWkYIqY?usp=sharing#scrollTo=9odJ7cqKHipQ

Final words

And there you have it — from basic data visualization to beautiful interactive visualizations in just a couple of minutes and a couple of lines of code. Altair requires some time to get used to the syntax, but that’s true for all visualization libraries.

At the end of the day, the visualizations look much better than the equivalent produced with Matplotlib and Seaborn.

But what about Plotly? Well, that’s a topic for another time. Stay tuned for more content like this.

Thanks for reading.
