Altair: 🪄Vizardry in Python

Numerical Addiction
12 min read · Feb 12, 2023


Dynamic and Interactive Visualizations using Altair

Python has tons of visualization libraries for putting your data into visual perspective, many of them interactive. There is Matplotlib, a very mature and popular library with tons of customization options. Libraries like seaborn offer a high-level wrapper on top of Matplotlib that produces better-looking visuals with less complex code. Similarly, for dynamic and interactive plots, we have plotly, bokeh and folium.

Amongst all of these, we also have the beautiful Altair library, which is declarative in nature and whose grammar is based on Vega and Vega-Lite. One of its key features is a simple and intuitive syntax, which lets users create complex visualizations quickly with minimal code.

One of the most powerful features of Altair is the ability to easily combine different chart types in a single visualization. This allows users to create complex graphics that effectively communicate multiple layers of data.

Another major advantage of Altair is its interactivity. With just a few lines of code, users can create dynamic visualizations that allow them to explore their data in new ways.

Here’s a look at Altair’s Syntax:

import altair as alt
from vega_datasets import data

iris = data.iris.url

chart1 = alt.Chart(iris).mark_point().encode(
x='petalLength:Q',
y='petalWidth:Q',
color='species:N'
).properties(
height=300,
width=300
)

chart2 = alt.Chart(iris).mark_bar().encode(
x='count()',
y=alt.Y('petalWidth:Q', bin=alt.Bin(maxbins=30)),
color='species:N'
).properties(
height=300,
width=100
)

chart1 | chart2

The above code may seem complicated and hard to understand, but rest assured: we will decode it by the end of this article.

Let’s dive in… and as we go along, we will come to appreciate the beautiful Vega grammar. This may be an introductory article, but it has the power to take you from getting started to considering Altair as a daily driver.

Altair.

Installing & Importing

  1. pandas for data wrangling
  2. vega_datasets for data
  3. altair for … well…

#Installation
pip install pandas
pip install vega_datasets
pip install altair

#Imports
import pandas as pd
from vega_datasets import data
import altair as alt

Loading Dataset

We will use my favourite dataset ‘cars’. Here’s how to call it.

cars = data.cars()

The dataset contains information on different car models and their fuel efficiency (measured in miles per gallon, or mpg). It also has information around the car origin, displacement, number of cylinders, horsepower, weight, model year and acceleration. And, obviously the name of the car.

Plotting

  1. Plotting an altair graph needs us to declare the data in the base level Chart first. The data can be a Pandas dataframe, an alt Data object, a URL pointing to a JSON or CSV file, or geo-data.
  2. We also need to define the marks, which is information on how we want our visual attributes to be shown on the plot. It can be point, line, bar, etc.
  3. Encodings are then used to map columns to visual attributes of the plot.

Let’s put all this together.

We first call the Chart method and point it to our cars dataframe. Now we tell altair what shape of marks we want. Here, we call on the mark_point method since we want points for this plot.

alt.Chart(cars).mark_point()

Since we haven’t yet given altair the co-ordinates, it places all the points in one place. So we have 406 points overlapping at a single point on the Chart.

All that is left is to tell altair where we want these points. Here, we will plot ‘miles per gallon’ on the x-axis as points.

alt.Chart(cars).mark_point().encode(
x='Miles_per_Gallon'
)

We get a 1D plot of all the datapoints depicting the ‘mpg’ for each car.

Let's add ‘weight’ to the y-axis.

alt.Chart(cars).mark_point().encode(
x='Miles_per_Gallon',
y='Weight_in_lbs'
)

We get a nicely laid out scatter plot as our points are now being encoded with values on each axis.

How about we change the y-axis variable to maybe a categorical variable. Let’s use the country of ‘origin’.

alt.Chart(cars).mark_point().encode(
x='Miles_per_Gallon',
y='Origin'
)

We can see how the ‘mpg’ data is spread across the three countries.

How about using a different mark type. Let’s switch to ticks using the mark_tick method.

alt.Chart(cars).mark_tick().encode(
x='Miles_per_Gallon',
y='Origin'
)

We can add another dimension by defining the color parameter. Let’s use the ‘Cylinders’ variable. Let’s only take cars with 4, 6 and 8 cylinders.

#Filtering to get cars with 4, 6 and 8 cylinders
cars = cars.loc[cars.Cylinders.isin([4,6,8])]
alt.Chart(cars).mark_tick().encode(
x='Miles_per_Gallon',
y='Origin',
color='Cylinders'
)

Let’s switch back to a scatter plot between ‘mpg’ & ‘weight’ and break up the x-axis into bins. Altair grammar shines here as we can simply assign an alt.X object to x-axis. Now we can pass in other parameters to that. Let’s activate binning by passing a boolean value to the bin parameter. We will keep the dataset filtered to 4, 6 and 8 cylinder cars.

alt.Chart(cars).mark_point().encode(
x=alt.X('Miles_per_Gallon', bin=True),
y='Weight_in_lbs',
color='Cylinders'
)

We can see that binning has now clustered the points into vertical lines. Let’s replace the marks with bars.

alt.Chart(cars).mark_bar().encode(
x=alt.X('Miles_per_Gallon', bin=True),
y='Weight_in_lbs',
color='Cylinders'
)

We get a good-looking stacked bar plot that clearly shows that heavier vehicles have lower fuel efficiency. Also, cars with more cylinders have lower efficiency, as expected.

How about replacing the y-axis with a count of values in each bin on the X-axis. We can do so by replacing the column name allotted to the Y-axis with the count() function. Let’s get rid of the color parameter for now.

alt.Chart(cars).mark_bar().encode(
x=alt.X('Miles_per_Gallon', bin=True),
y='count()'
)

We now have a nice histogram that counts values in each bin defined on the X-axis. We can switch orientation by merely reassigning the axes.

alt.Chart(cars).mark_bar().encode(
x='count()',
y=alt.Y('Miles_per_Gallon', bin=True)
)
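To make it concrete what a binned axis plus count() computes, here is a rough pandas equivalent. The `mpg` values and the bin `edges` are made up for illustration; Altair chooses "nice" bin edges on its own, so this is only a sketch of the idea, not of its binning algorithm.

```python
import pandas as pd

# Hypothetical mpg values, made up for illustration.
mpg = pd.Series([12, 14, 18, 21, 23, 24, 29, 31, 36, 44])

# Hand-picked bin edges of width 10 (an assumption for this sketch).
edges = [10, 20, 30, 40, 50]
bins = pd.cut(mpg, bins=edges, right=False)  # left-closed intervals: [10, 20), ...

# Count the rows that fall into each bin: this is the height of each bar.
counts = mpg.groupby(bins, observed=False).size()
print(counts)
```

With Altair we get the same summary without touching the dataframe, because the binning and counting are declared inside the encoding.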

You can alter the number of bins by passing an alt.Bin object to the bin parameter instead of a boolean. This enables more arguments that can go in.

Let’s increase the number of bins here by using the maxbins parameter inside the alt.Bin object. Be careful of the information-to-noise balance when you use such a parameter.

alt.Chart(cars).mark_bar().encode(
x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=25)),
y='count()'
)

How about we bring back the color parameter.

alt.Chart(cars).mark_bar().encode(
x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=10)),
y='count()',
color='Origin'
)

We now have a stacked histogram.

Let’s introduce another dimension here by defining the column parameter. I’m going to use the ‘Cylinders’ variable for this. This will split the plot into multiple plots stacked in columns, each depicting a category in the Cylinders variable. Use the row parameter if you want the plots to be laid out in rows.

alt.Chart(cars).mark_bar().encode(
x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=10)),
y='count()',
color='Origin',
column='Cylinders'
)

What if we assign the color parameter to a continuous variable.

alt.Chart(cars).mark_bar().encode(
x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=10)),
y='count()',
color='Horsepower',
column='Cylinders'
)

Notice the color palette changing to a continuous one and the legends being replaced by a colorbar to depict the same.

Here I have been using the Cylinders variable as a categorical one. We can have altair change how it reads that variable by appending :Q to the column name. This avoids the need to change the datatypes of columns in our dataset.

Let’s switch to a single plot and use Cylinders as color. This time we will use it as a continuous variable.

alt.Chart(cars).mark_bar().encode(
x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=10)),
y='count()',
color='Cylinders:Q'
)

And if we want to switch back to Cylinders being a categorical variable, using :N will make altair read Cylinders as a nominal variable.

alt.Chart(cars).mark_bar().encode(
x=alt.X('Miles_per_Gallon', bin=alt.Bin(maxbins=10)),
y='count()',
color='Cylinders:N'
)

There is frequently a need to aggregate data and we can do it right here without first having to groupby our dataframe and then feeding it as an input to altair. We use the mean function here to do so.

Let’s look at the average fuel efficiency (‘mpg’) of cars in each country of origin. Let’s stick to bars for now. Since we want to compare the mean mpg of each country, we can assign each bar a color of its own by mapping ‘Origin’ to the color parameter.

alt.Chart(cars).mark_bar().encode(
x='mean(Miles_per_Gallon):Q',
y='Origin:N',
color='Origin:N'
)
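For comparison, this is roughly the groupby that the mean() aggregate saves us from writing. The `cars_toy` dataframe is a made-up stand-in for the real cars data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the cars dataframe.
cars_toy = pd.DataFrame({
    "Origin": ["USA", "USA", "Europe", "Japan", "Japan"],
    "Miles_per_Gallon": [18.0, 22.0, 28.0, 30.0, 34.0],
})

# One mean per country: the length of each bar in the chart.
mean_mpg = cars_toy.groupby("Origin")["Miles_per_Gallon"].mean()
print(mean_mpg)
```

In pandas we would then have to feed this aggregated frame to the plotting call. In Altair the aggregation lives in the encoding string itself.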

Let’s have the color be assigned to an aggregate. Let’s do count.

alt.Chart(cars).mark_bar().encode(
x='mean(Miles_per_Gallon):Q',
y='Origin:N',
color='count()'
)

We can break up the X-axis into bins instead of mean and have each section assigned a color as per count.

alt.Chart(cars).mark_bar().encode(
x=alt.X('Miles_per_Gallon:Q', bin=alt.Bin(maxbins=10)),
y='Origin:N',
color='count()'
)

How about we switch to a pie chart. We now plot the number of cars in the dataset by the country of origin. We switch the mark type to an arc, assign the angle of each slice in the theta parameter and then define how the color will work. Remember, we have relevant alt objects that we can pass to these parameters and these objects can take in additional arguments for us to customize our output.

Notice that we have now switched to a more explicit way of passing arguments. This is mainly a matter of preference.

alt.Chart(cars).mark_arc().encode(
theta=alt.Theta(
field='Miles_per_Gallon',
type='quantitative',
aggregate='count'
),
color=alt.Color(
field='Origin',
type='nominal'
)
)
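The theta channel maps a value to the angle of each slice, so with a count aggregate every slice gets an angle proportional to its share of rows. A small pandas sketch of that arithmetic, using made-up `origin` values rather than the real dataset:

```python
import pandas as pd

# Hypothetical origins, made up for illustration: 5 USA, 3 Japan, 2 Europe.
origin = pd.Series(["USA"] * 5 + ["Japan"] * 3 + ["Europe"] * 2, name="Origin")

counts = origin.value_counts()
# Each slice's angle in degrees is its share of the full circle.
angles = counts * 360 / counts.sum()
print(angles)
```

Half of the rows take half of the circle, and so on. Altair does this bookkeeping for us once the count is assigned to theta.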

We can define further parameters to the mark_arc method and change this to a donut plot. Let’s pass the arguments innerRadius & radius.

alt.Chart(cars).mark_arc(
innerRadius=65,
radius=120
).encode(
theta=alt.Theta(
field='Miles_per_Gallon',
type='quantitative',
aggregate='count'
),
color=alt.Color(
field='Origin',
type='nominal'
)
)

How about we assign a variable to the radius. This can be done using the alt.Radius object.

alt.Chart(cars).mark_arc(
innerRadius=30
).encode(
theta=alt.Theta(
field='Miles_per_Gallon',
type='quantitative',
aggregate='count',
stack=True
),
color=alt.Color(
field='Cylinders',
type='ordinal'
),
radius=alt.Radius(
field='Miles_per_Gallon',
type='quantitative',
aggregate='count'
)
)

This chart would have made more sense had there been more categories (say 5) with comparable counts.

Let’s visualize a bubble plot showing a relation between mpg and weight of the vehicles. We can add other dimensions by allocating size to displacement and color to origin.

alt.Chart(cars).mark_point(
fillOpacity=0.2
).encode(
x='Miles_per_Gallon',
y='Weight_in_lbs',
size='Displacement',
color='Origin'
)

Altair gives you the option to combine multiple chart types. To combine multiple plots, you can use operators like ‘+’, ‘&’ and ‘|’, or the vconcat and hconcat functions. Examples to follow.

How about checking how the overall fuel efficiency (mpg) has changed over time for these countries? While we’re at it, let’s plot the confidence interval. For this we combine mark_line and mark_errorband.

line = alt.Chart(cars).mark_line().encode(
x=alt.X(
field='Year',
type='temporal'
),
y=alt.Y(
field='Miles_per_Gallon',
type='quantitative',
aggregate='mean'
)
)

band = alt.Chart(cars).mark_errorband(
extent='ci'
).encode(
x=alt.X(
field='Year',
type='temporal'
),
y=alt.Y(
field='Miles_per_Gallon',
type='quantitative',
title='Miles Per Gallon'
)
)

line + band
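If you are wondering what extent='ci' computes: my understanding from the Vega-Lite documentation is that it is a bootstrapped 95% confidence interval of the mean at each x value. Here is a rough sketch of that idea using only the standard library. The `bootstrap_ci` helper and the `mpg_1975` numbers are made up for illustration, and this is not Altair's actual implementation.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=1000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    # Resample with replacement many times and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    # Take the percentiles that cut off the tails on both sides.
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical mpg values for a single model year.
mpg_1975 = [13.0, 15.0, 16.0, 18.0, 19.0, 21.0, 23.0, 24.0, 29.0, 33.0]
low, high = bootstrap_ci(mpg_1975)
print(low, statistics.fmean(mpg_1975), high)
```

The band in the chart is this kind of interval, computed once per year, while the line runs through the yearly means.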

Let’s mark a vertical line on the plot to separate cars made up to 1975 from those made after.

line = alt.Chart(cars).mark_line().encode(
x=alt.X(
field='Year',
type='temporal'
),
y=alt.Y(
field='Miles_per_Gallon',
type='quantitative',
aggregate='mean'
)
)

band = alt.Chart(cars).mark_errorband(
extent='ci'
).encode(
x=alt.X(
field='Year',
type='temporal'
),
y=alt.Y(
field='Miles_per_Gallon',
type='quantitative',
title='Miles Per Gallon'
)
)

xrule = alt.Chart(cars).mark_rule(
color='gray',
strokeWidth=1.5,
strokeDash=[10,5]
).encode(
x=alt.datum(
alt.DateTime(
year=1975,
month='December',
date=31
)
)
)

line + band + xrule

You can see an interactive plot which you can hover your mouse on for further information.

Speaking of interactivity. Let’s see how that works.

I’m going to switch back to a scatter plot between mpg & weight. Color denotes the origin of the vehicle. After the encoding, I call the interactive method. This gives us a basic level of interactivity with the plot, like pan & zoom.

alt.Chart(cars).mark_point().encode(
x='Weight_in_lbs:Q',
y='Miles_per_Gallon:Q',
color='Origin:N'
).interactive()

Let’s plot multiple plots and try to create interactivity between them.

Let’s use the above plot as is and add another plot beneath using the vconcat method. The second plot can feature the number of vehicles in a given country of origin.

What we aim to build is a set of plots that interact with each other. For example, choosing points on the scatter plot automatically changes the data displayed in another related plot.

Since pan & zoom are not the only interactive elements we want to have, let’s get rid of the interactive method.

scatter_plot = alt.Chart(cars).mark_point().encode(
x='Weight_in_lbs:Q',
y='Miles_per_Gallon:Q',
color='Origin:N'
)

bar_plot = alt.Chart(cars).mark_bar().encode(
x='count()',
y='Origin:N',
color='Origin:N'
)

alt.vconcat(scatter_plot, bar_plot)

Now that we have the plots in place, step 1 is to activate selection of points in the scatter plot. We use the alt.selection_interval object to control the same. Then we feed the selection object to the properties of the scatter plot so it responds to the selection.

select = alt.selection_interval(encodings=['x', 'y'])

scatter_plot = alt.Chart(cars).mark_point().encode(
x='Weight_in_lbs:Q',
y='Miles_per_Gallon:Q',
color='Origin:N'
).properties(
selection=select
)

bar_plot = alt.Chart(cars).mark_bar().encode(
x='count()',
y='Origin:N',
color='Origin:N'
)

alt.vconcat(scatter_plot, bar_plot)

Currently the selection does nothing, but we know it exists. Let’s map it so that the points inside the selection stay highlighted while the others turn, let’s say, gray. To do so, we pass an alt.condition object to the color parameter and use the earlier defined selection as the condition.

select = alt.selection_interval(encodings=['x', 'y'])

scatter_plot = alt.Chart(cars).mark_point().encode(
x='Weight_in_lbs:Q',
y='Miles_per_Gallon:Q',
color=alt.condition(select, 'Origin:N', alt.value('Lightgray'))
).properties(
selection=select
)

bar_plot = alt.Chart(cars).mark_bar().encode(
x='count()',
y='Origin:N',
color='Origin:N'
)

alt.vconcat(scatter_plot, bar_plot)

Now that the selection works as we want, let’s connect it to the barchart so that the barchart displays the frequency distribution of the selected marks. We use the transform_filter method on the barchart to do so.

select = alt.selection_interval(encodings=['x', 'y'])

scatter_plot = alt.Chart(cars).mark_point().encode(
x='Weight_in_lbs:Q',
y='Miles_per_Gallon:Q',
color=alt.condition(select, 'Origin:N', alt.value('Lightgray'))
).properties(
selection=select
)

bar_plot = alt.Chart(cars).mark_bar().encode(
x='count()',
y='Origin:N',
color='Origin:N'
).transform_filter(select)

alt.vconcat(scatter_plot, bar_plot)

An amazing interactive plot with a very intuitive code base. How about some reverse selection, where we choose a bar on the barchart and display only the points of the corresponding country in the scatterplot?

select = alt.selection_interval(encodings=['x', 'y'])
multiple_select = alt.selection_multi(fields=['Origin'])

scatter_plot = alt.Chart(cars).mark_point().encode(
x='Weight_in_lbs:Q',
y='Miles_per_Gallon:Q',
color=alt.condition(select, 'Origin:N', alt.value('Lightgray'))
).properties(
selection=select
).transform_filter(multiple_select)

bar_plot = alt.Chart(cars).mark_bar().encode(
x='count()',
y='Origin:N',
color='Origin:N'
).transform_filter(select).properties(selection=multiple_select)

alt.vconcat(scatter_plot, bar_plot)

One last example of this awesome functionality.

select = alt.selection_interval(encodings=['x', 'y'])
multiple_select = alt.selection_multi(fields=['Origin'])

scatter_plot = alt.Chart(cars).mark_point().encode(
x='Weight_in_lbs:Q',
y='Miles_per_Gallon:Q',
color=alt.condition(select, 'Origin:N', alt.value('Lightgray'))
).properties(
selection=select
).transform_filter(multiple_select)

bar_plot = alt.Chart(cars).mark_bar().encode(
x='mean(Weight_in_lbs)',
y=alt.Y('Miles_per_Gallon', bin=alt.Bin(maxbins=10)),
color='Origin:N'
).transform_filter(select).properties(selection=multiple_select)

alt.hconcat(scatter_plot, bar_plot)

This should get you onboarded to the altair visualization library. There are tons of features and customizations you can use. For reference you can visit the official documentation. Happy Plotting!

Numerical Addiction

Leader at American Express with a passion for the science behind data. My articles aim at providing quick reads to help elevate understanding of the science.