A Crash Course in DATA: Control your charts like an expert with Altair
A guide to instantly boost your charting skills
Data skills have become a must-have for employees in today’s data-driven world. It’s more important than ever to have accessible ways to view and understand data to obtain useful insights.
One of the best ways to understand data is to see it.
Data visualization can allow you to get an instant understanding of the data that is just not possible by observing rows of data in a table.
There are two main approaches to creating data visualizations:
1. Imperative - involves writing code to manually create charts. It’s used in libraries such as Matplotlib or Seaborn, where you specify the data, chart type, and appearance details. This approach offers more control but can be more complex and take longer.
2. Declarative - you define what information you want to display in a visualization rather than how to display it. This approach is simpler and more intuitive, as libraries like Altair or ggplot2 handle the technical details of generating the chart. You can concentrate on telling the story with your data.
It is up to you to decide which one is more suitable for your data and the one you feel comfortable with. Most of the libraries require you to refer to their long and tiring documentation for syntax.
This is where Altair comes in. Its syntax is clean and easy to understand, as we will see in the examples. I believe that after reading this post, you will feel confident about making charts.
Why Altair?
Altair offers a powerful and concise visualization grammar for quickly building a wide range of statistical graphics. The declarative approach is often more intuitive, as you can focus on the story you want to tell rather than the technical details of how to create the visualization. Altair can be used to create a variety of beautiful plots, such as bar charts, pie charts, histograms, scatterplots, etc.
The advantage of Altair lies in its API capabilities. As a Python API to Vega-Lite, Altair’s main purpose is to convert plot specifications to a JSON string that conforms to the Vega-Lite schema and then render the visualization inline inside the browser using Vega and Vega-Lite.
Fundamentals of Charting with Altair
The key elements we will need to remember are:
- Chart object - It takes the dataframe as a single argument. It knows how to emit a JSON dictionary representing the data and visualization encodings.
- Chart marks - These decide what sort of mark we would like to use to represent our data, e.g., point, line, bar, area, tick, circle, and more.
- Chart Encodings - An encoding channel specifies how a given data column should be mapped onto the visual properties of the visualization.
Installing required packages:
$ pip install altair vega_datasets jupyter vega pandas
Basic Plotting
Let’s move ahead and get the dataset from vega_datasets and perform some simple analysis to understand it better. You can also import your data using pandas; for now, I’m using the built-in dataset.
from vega_datasets import data
import altair as alt
cars = data.cars()
cars.sample(5)
In this case, let’s begin by looking at the relationship between mileage (Miles_per_Gallon) and engine power (Horsepower) in a scatter plot.
For that, we’ll want to make a chart with points. So we use mark_point() and encode the x-axis with “Miles_per_Gallon” and the y-axis with “Horsepower.”
alt.Chart(cars).mark_point().encode(
x="Miles_per_Gallon",
y="Horsepower"
)
Want to add colors and shapes? Just mention the visual encodings, color and shape.
alt.Chart(cars).mark_point().encode(
x='Horsepower',
y='Miles_per_Gallon',
color='Origin',
shape='Origin'
)
Do you need a bar chart? Change mark_point() to mark_bar(). Adding color to the bar chart (by using the color encoding) creates a stacked bar chart by default.
alt.Chart(cars).mark_bar().encode(
x=alt.X('Miles_per_Gallon', bin=True), #binning the columns
y='count()', #aggregating the data
color='Origin'
)
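Need a line plot for your data? Use mark_line(). This example uses the stocks dataset, which has one price series per symbol:
stocks = data.stocks()  # load the stocks dataset from vega_datasets
alt.Chart(stocks).mark_line().encode(
x='date',
y='price',
color='symbol'
)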
Here are some of the more commonly used mark_*() methods supported in Altair and Vega-Lite:
mark_area()
mark_bar()
mark_circle()
mark_line()
mark_point()
mark_rule()
mark_square()
mark_text()
mark_tick()
Encoding
The key to creating meaningful visualizations is to map the properties of the data to visual properties in order to communicate information effectively. In Altair, this mapping of visual properties to data columns is referred to as Encoding.
Encoding Data types
The details of any mapping depend on the type of data. Altair recognizes five main data types:
- quantitative (Q): a continuous real-valued quantity
- ordinal (O): a discrete ordered quantity
- nominal (N): a discrete unordered category
- temporal (T): a time or date value
- geojson (G): a geographic shape
If types are not specified for data input as a DataFrame, Altair defaults to quantitative for any numeric data, temporal for date/time data, and nominal for string data.
How Data Type affects Color Scales
Specifying the correct type for your data is important, as it affects the way Altair represents your encoding in the resulting plot.
Below is an example that shows how the data type influences the way Altair decides on the color scale to represent the value and also influences whether a discrete or continuous legend is used.
base = alt.Chart(cars).mark_point().encode(
x='Horsepower:Q',
y='Miles_per_Gallon:Q',
).properties(
width=250,
height=250
)
# Plotting the same chart with different data types
base.encode(color='Cylinders:Q').properties(title='quantitative') | base.encode(color='Cylinders:O').properties(title='ordinal') | base.encode(color='Cylinders:N').properties(title='nominal')
Encoding Channels
What is a channel? In data visualization, a visual variable used to represent data is called a channel. A channel is typically mapped to a specific aspect of the data, such as its value, position, or color.
Examples of channels include position (e.g., x and y axes), size, shape, color, texture, and orientation.
Each encoding channel allows for a number of additional options to be expressed; these can control things like axis properties, scale properties, headers and titles, binning parameters, aggregation, sorting, and many more.
Binning data
One of the most common uses of binning is the creation of histograms. One interesting thing is that Altair’s declarative approach allows us to start assigning these values to different encodings to see other views of the exact same data.
So, for example, if we assign the binned miles per gallon to the color, we get this view of the data:
alt.Chart(cars).mark_bar().encode(
color=alt.Color('Miles_per_Gallon', bin=True),
x='count()',
y='Origin'
)
Changing the encoding again, let’s map the color to the count instead:
alt.Chart(cars).mark_rect().encode(
x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=20)),
color='count()',
y='Origin',
)
This is one of the beautiful things about Altair: it shows you through its API grammar the relationships between different chart types.
For example, a 2D heatmap encodes the same data as a stacked histogram!
Aggregation of Data
Beyond simple channel encodings, Altair’s visualizations are built on the concept of database-style grouping and aggregation, that is, the split-apply-combine abstraction that underpins many data analysis approaches.
One key operation in data exploration is the group-by; the group-by splits the data according to some condition, applies some aggregation within those groups, and then combines the data back together.
For the car data, you might split by Origin, compute the mean of the miles per gallon, and then combine the results. In Pandas, the operation looks like this:
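A minimal sketch of that group-by, using a small stand-in table with made-up values so it runs on its own; with the full dataset, the same groupby call would be applied to cars.

```python
import pandas as pd

# Stand-in for the cars data: only the two columns the operation needs.
cars_sample = pd.DataFrame({
    "Origin": ["USA", "USA", "Europe", "Japan", "Japan"],
    "Miles_per_Gallon": [18.0, 22.0, 30.0, 32.0, 36.0],
})

# split by Origin, apply the mean to Miles_per_Gallon, combine into one Series
mean_mpg = cars_sample.groupby("Origin")["Miles_per_Gallon"].mean()
print(mean_mpg)
```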
In Altair, this sort of split-apply-combine can be performed by passing an aggregation operator within a string to any encoding.
For example, we can display a plot representing the above aggregation as follows:
alt.Chart(cars).mark_bar().encode(
x='mean(Miles_per_Gallon)',
y='Origin',
color='Origin'
)
Aggregates can also be used with data that is only implicitly binned. For example, look at this plot of MPG over time:
alt.Chart(cars).mark_point().encode(
x='Year:T',
y='Miles_per_Gallon',
color='Origin'
)
The fact that the points overlap so much makes it difficult to see important parts of the data; we can make it clearer by plotting the mean in each group (here, the mean of each Year/Origin combination):
alt.Chart(cars).mark_line().encode(
x='Year:T',
y='mean(Miles_per_Gallon)',
color='Origin'
)
The mean aggregate only tells part of the story, though: Altair also provides built-in tools to compute the lower and upper bounds of confidence intervals on the mean.
We can change the mark to mark_area() here, specify the lower and upper bounds of the area using y and y2, and use the ci0 and ci1 aggregates to plot the confidence interval of the estimate of the mean:
alt.Chart(cars).mark_area(opacity=0.3).encode(
x='Year:T',
color='Origin',
y='ci0(Miles_per_Gallon)',
y2='ci1(Miles_per_Gallon)'
)
3. Compounding charts
Creating multi-panel and layered charts is an essential property to have for a data visualization library. Altair provides a concise API for creating multi-panel and layered charts. Let’s explore a few of them:
- Layering
- Horizontal Concatenation
- Vertical Concatenation
Layering
Layering lets you put multiple marks on a single chart. One common example is creating a plot with both points and lines representing the same data. Let’s use the stocks data for this example:
stocks = data.stocks()
stocks.head()
alt.Chart(stocks).mark_line().encode(
x='date',
y='price',
color='symbol'
)
And here is the same plot with a circle mark:
alt.Chart(stocks).mark_circle().encode(
x='date:T',
y='price:Q',
color='symbol:N'
)
We can layer these two plots together using the + operator. One pattern we can use often is to create a base chart with the common elements and add together two copies with just a single change:
base = alt.Chart(stocks).encode(
x='date:T',
y='price:Q',
color='symbol:N'
)
base.mark_line() + base.mark_circle()
Horizontal Concatenation
Just as we can layer charts on top of each other, we can concatenate horizontally using alt.hconcat(), or equivalently the | operator:
alt.hconcat(base.mark_line(), base.mark_circle())
# Another way of writing the same thing
# base.mark_line() | base.mark_circle()
Vertical Concatenation
Vertical concatenation looks a lot like horizontal concatenation but uses either the alt.vconcat() function, or the & operator:
base.mark_line() & base.mark_circle()
4. Grammar of interaction
The grammar of interaction allows the building of interactive features of the plot from components. Let’s take our scatter plot from above and add some interaction to it.
Interactivity and Selections
Altair’s interactivity and grammar of selections is one of its unique features among available plotting libraries.
Just adding .interactive() fires up the Vega power: the chart becomes interactive in the browser, allowing you to zoom and pan.
alt.Chart(cars).mark_point().encode(
x='Horsepower',
y='Miles_per_Gallon',
color='Origin',
shape='Origin'
).interactive()
Let me walk you through the variety of selection types that are available, and begin to practice creating interactive charts and dashboards.
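There are three basic types of selections available:
- Interval selection: alt.selection_interval()
- Single selection: alt.selection_single()
- Multi selection: alt.selection_multi()
These names follow the Altair 4 API. Altair 5 deprecates selection_single() and selection_multi() in favor of alt.selection_point(), and add_selection() in favor of add_params().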
Interval Selection
As an example of a selection, let’s add an interval selection to a chart. We’ll start with our canonical scatter plot:
interval = alt.selection_interval()
alt.Chart(cars).mark_point().encode(
x="Miles_per_Gallon",
y="Horsepower",
color ="Origin"
).add_selection(
interval  # attach the selection to the chart
)
An interval selection lets you click and drag to select a region of the chart. The encodings argument restricts the selection to particular channels, so it can span both dimensions (the default) or just one. To make the chart more interesting, we can make an encoding conditional on the selection, for example coloring the selected points by Origin and the rest light gray.
interval = alt.selection_interval(encodings=['x'])
alt.Chart(cars).mark_point().encode(
x="Miles_per_Gallon",
y="Horsepower",
color =alt.condition(interval,"Origin",alt.value("lightgray"))
).add_selection(
interval  # attach the selection to the chart
)
We can combine interactive charts using the compound operators; the selection is shared between the panels, so brushing in one also updates the other. Let’s reuse the same chart with a different variable on the x-axis.
interval = alt.selection_interval(encodings=['x','y'])
chart = alt.Chart(cars).mark_point().encode(
x="Miles_per_Gallon",
y="Horsepower",
color =alt.condition(interval,"Origin",alt.value("lightgray"))
).add_selection(
interval  # attach the selection to the chart
)
chart | chart.encode(x='Acceleration')
Tooltips
When a user hovers over a mark, a tooltip displays details about the data point behind that mark. Tooltips let your audience explore the data at a deeper level while staying in the current view, without spending any chart space on extra labels.
Restricting the selection to the x-direction and adding a tooltip encoding, we get more details about individual data points.
interval = alt.selection_interval(encodings=['x'])
chart = alt.Chart(cars).mark_point().encode(
x="Miles_per_Gallon",
y="Horsepower",
color =alt.condition(interval,"Origin",alt.value("lightgray")),
tooltip =['Name', 'Origin', 'Horsepower', 'Miles_per_Gallon']
).add_selection(
interval  # attach the selection to the chart
)
chart | chart.encode(x='Acceleration')
We can do even more sophisticated things with selections as well. For example, let’s make a bar chart of the number of cars by Origin and stack it under our scatterplots, so that it counts only the selected points:
interval = alt.selection_interval()
base = alt.Chart(cars).mark_point().encode(
y='Horsepower',
color=alt.condition(interval, 'Origin', alt.value('lightgray')),
tooltip='Name'
).add_selection(
interval
)
hist = alt.Chart(cars).mark_bar().encode(
x='count()',
y='Origin',
color='Origin'
).properties(
width=800,
height=80
).transform_filter(
interval
)
scatter = base.encode(x='Miles_per_Gallon') | base.encode(x='Acceleration')
scatter & hist
5. Saving Altair Charts
It would be a waste if these interactive charts could only run in a Python notebook. Because Altair produces Vega-Lite specifications, it is relatively straightforward to export charts and publish them on the web as Vega-Lite plots.
HTML format
All that is required is to load the Vega-Lite JavaScript library and pass it the JSON plot specification output by Altair. For convenience, Altair provides a Chart.save() method, which will save any chart to HTML.
This HTML file can be transferred and opened in any browser; the chart’s data is embedded in the file, so no separate data file is needed.
base.save('chart.html')
JSON format
The fundamental chart representation output by Altair is a JSON string format; one of the core methods provided by Altair is Chart.to_json(), which returns a JSON string that represents the chart content. Additionally, you can save a chart to a JSON file using Chart.save(), by passing a filename with a .json extension.
The JSON format can be useful for sharing and loading the Chart in a notebook. This imported chart can then be viewed easily.
base.save('chart.json')
Altair charts can also be saved in PNG, SVG, and PDF formats. Image export needs an additional package beyond Altair itself (altair_saver for Altair 4, vl-convert-python for recent Altair 5 releases). With that installed, you can use:
chart.save('chart.png')
chart.save('chart.svg')
chart.save('chart.pdf')
Notebook to refer to for code and explanations: https://colab.research.google.com/drive/1Rxg87Pn9Crpd0Kg8VhHFdtxp4EWkYIqY?usp=sharing#scrollTo=9odJ7cqKHipQ
Final words
And there you have it — from basic data visualization to beautiful interactive visualizations in just a couple of minutes and a couple of lines of code. Altair requires some time to get used to the syntax, but that’s true for all visualization libraries.
At the end of the day, the visualizations look much better than the equivalent produced with Matplotlib and Seaborn.
But what about Plotly? Well, that’s a topic for another time. Stay tuned for more content like this.
Thanks for reading.