We need more Interactive Data Visualization tools (for the Web) in Python
This semester (Spring 2018) I taught a Data Visualization course for our Masters in Data Science program. Our MS in Data Science program is a 15-month intensive program that has been successful at producing high-quality data scientists.
Our students come from a variety of backgrounds and have a good understanding of R and Python by the time they take the Data Visualization course. They have used ggplot2 and matplotlib through their various courses and are ready to learn about techniques to visualize large, multivariate data.
Being a data visualization researcher, I wanted to introduce all the wonderful techniques that have emerged from the data visualization research field, so part of the class consisted of lectures based on material from research papers, online visualizations, and D3 examples.
Visualization in Python
Most data visualization research today is conducted using D3. Unfortunately, I had only 8 weeks with the students, and I wanted to focus on a mix of theory and practical information that they could use as data scientists. While students were interested in using visualization techniques to explore and explain, most of them were less interested in creating beautiful bespoke visualizations using D3. Based on feedback from other professors who had taught the class before, there just wasn’t enough time to teach D3 in such a short time frame.
Given my love for Python and the students’ comfort level with Python, I decided to introduce students to the amazing (I hoped!) packages in Python that could do everything that I was showing the students.
Static Visualization with Seaborn
Given my past experience with seaborn, I was excited to introduce the students to the beautiful visualizations that seaborn generates. They already had experience with matplotlib and so picking up seaborn was quick with a huge upside. Students were able to make scatterplots (bivariate and multivariate), swarmplots, violin plots, bar charts, boxplots, and histograms with faceting. Students learned that generating swarmplots with large datasets was very time consuming and summarization-based plots (such as violin plots) were a better alternative.
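The swarmplot versus violin plot trade-off can be sketched roughly as follows. The data is synthetic and the column names (group, value) are assumptions for illustration. The swarmplot is drawn from a small sample because it places every point individually, while the violin plot summarizes the full dataset.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data (an assumption for illustration): 3 groups of 2,000 points each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 2000),
    "value": np.concatenate([rng.normal(loc, 1.0, 2000) for loc in (0, 1, 2)]),
})

fig, (ax_swarm, ax_violin) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# A swarmplot positions each point so that none overlap, which gets slow
# as the data grows, so only 100 points per group are drawn here.
sample = df.groupby("group").sample(n=100, random_state=0)
sns.swarmplot(data=sample, x="group", y="value", size=3, ax=ax_swarm)

# A violin plot draws a density estimate per group, so it can use all 6,000 rows.
sns.violinplot(data=df, x="group", y="value", ax=ax_violin)

fig.tight_layout()
```

In a notebook, the figure renders inline without the Agg backend line; in a script, fig.savefig(...) writes it to disk.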
Interactive Visualization using Bokeh or Plot.ly
While seaborn produces beautiful visualizations, they are all static, and I wanted the students to experience the benefits of interaction techniques such as brushing, filtering, zoom, and hover. To that end, I introduced Bokeh and Plot.ly, which are visualization libraries that allow the easy creation of interactive data visualizations. For the time-series visualization assignment, students could choose either Bokeh or Plot.ly to implement multi-line charts, heatmaps, animated bubble charts, and so on.
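As a rough idea of what this looks like in Bokeh, here is a minimal sketch of a multi-line chart with zoom and hover. The data values and column names (week, series_a, series_b) are made up for illustration.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Made-up weekly values for two series (illustrative only).
source = ColumnDataSource(data={
    "week": [1, 2, 3, 4, 5, 6],
    "series_a": [3, 5, 4, 6, 8, 7],
    "series_b": [2, 2, 3, 5, 4, 6],
})

# The tools string enables panning, zooming, and reset in the toolbar.
p = figure(
    title="Two time series",
    x_axis_label="week",
    y_axis_label="value",
    tools="pan,wheel_zoom,box_zoom,reset",
)
p.line("week", "series_a", source=source, legend_label="A", color="navy")
p.line("week", "series_b", source=source, legend_label="B", color="firebrick")

# Hover shows the values stored in the data source at the cursor position.
p.add_tools(HoverTool(tooltips=[
    ("week", "@week"),
    ("A", "@series_a"),
    ("B", "@series_b"),
]))
```

Calling show(p) from bokeh.plotting opens the chart in a browser or renders it inline in a notebook.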
Visualizing Trees, Graphs, and Networks
When discussing techniques for visualizing hierarchical data, I was delighted to show off the treemap visualization technique, and we compared it with node-link diagrams. Unfortunately, when I dug deeper I found that there was no multi-level treemap implementation in Python :( Even with the squarify library, you can only generate single-level treemaps!
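To make the multi-level idea concrete, here is a minimal pure-Python sketch of a nested treemap layout. It uses the simple slice-and-dice scheme (alternating the split direction at each depth), not the squarified algorithm that squarify implements. The tree format (dicts with name, size, and children keys) and the function names are assumptions for illustration.

```python
def tree_size(node):
    """Size of a node: its own 'size' for a leaf, else the sum over its children."""
    children = node.get("children")
    if not children:
        return node["size"]
    return sum(tree_size(child) for child in children)


def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Return a list of (name, depth, x, y, w, h) rectangles, one per node."""
    if out is None:
        out = []
    out.append((node["name"], depth, x, y, w, h))
    children = node.get("children") or []
    total = tree_size(node)
    offset = 0.0  # fraction of the parent rectangle already used
    for child in children:
        frac = tree_size(child) / total
        if depth % 2 == 0:
            # Even depth: split the parent left to right.
            slice_and_dice(child, x + offset * w, y, frac * w, h, depth + 1, out)
        else:
            # Odd depth: split the parent along the y axis.
            slice_and_dice(child, x, y + offset * h, w, frac * h, depth + 1, out)
        offset += frac
    return out


# Hypothetical two-level hierarchy.
tree = {
    "name": "root",
    "children": [
        {"name": "A", "children": [
            {"name": "A1", "size": 2},
            {"name": "A2", "size": 2},
        ]},
        {"name": "B", "size": 4},
    ],
}

rects = slice_and_dice(tree, 0.0, 0.0, 8.0, 4.0)
for rect in rects:
    print(rect)
```

Each rectangle could then be drawn with a matplotlib Rectangle patch, using the depth to pick a color or border width.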
Graphs and networks can be analyzed using the fantastic networkx package. Visualizing the networks, though, must be done using matplotlib, igraph, or plotly (see the tutorial on visualizing networks using plotly). igraph has many different options to help a user experiment with configuring a graph, but it is clunky to set up and many students ran into problems when using it. Plot.ly, on the other hand, works well but has very few options for customizing a network graph. :|
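For reference, the networkx plus matplotlib route looks roughly like the sketch below. It uses the karate club graph that ships with networkx; the styling choices (node size and color by degree) are just one illustrative option, and the result is a static image.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import networkx as nx

# Small example graph bundled with networkx.
G = nx.karate_club_graph()

# Force-directed layout; the seed makes node positions reproducible.
pos = nx.spring_layout(G, seed=42)

# Size and color each node by its degree (an illustrative styling choice).
degree = dict(G.degree())
sizes = [40 * degree[n] for n in G.nodes()]
colors = [degree[n] for n in G.nodes()]

fig, ax = plt.subplots(figsize=(6, 6))
nx.draw_networkx(
    G,
    pos=pos,
    ax=ax,
    node_size=sizes,
    node_color=colors,
    cmap="viridis",
    edge_color="lightgray",
    with_labels=False,
)
ax.set_axis_off()
```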
Given that creating interactive maps is a big part of data visualization, I was a bit more hopeful of finding packages that would allow the creation of choropleth maps, symbol maps, cartograms, transit maps, and maybe even flow maps. Here is what I found with respect to geovisualization libraries in Python:
- Plot.ly allows you to create choropleth maps and symbol maps but with very little control over the creation of the map.
- geoplotlib is a neat package that is built on pyglet but it is a bit unstable and crashes frequently. It uses OpenStreetMap tiles and even allows for animation-based visualization of spatio-temporal data. I loved this package since it had a neat collection of examples.
- geoplot looks great and has some wonderful-looking examples, but neither I nor any of our students were able to install it. Given that most of us were not using conda, we should have heeded the warning: “Use caution however, as this probably will not work on Windows, and possibly will not work on OSX and Linux.” :|
- cartopy and geopandas + matplotlib were not tried since they produce static visualizations.
We learned a lot about various text visualization techniques, including tag clouds (such as Wordle), docubursts, parallel tag clouds, phrase nets, and word trees. Topic exploration and sentiment visualization techniques were also introduced.
Unfortunately, other than the word_cloud package, there are very few options for anyone interested in visualizing a single document or a large corpus of text in Python.
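The computation behind a basic tag cloud is small enough to sketch with the standard library alone: count words, then scale counts to font sizes. The stopword list, the linear scaling, and the function name below are assumptions for illustration, not how word_cloud does it.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def tag_sizes(text, top_n=5, min_pt=10, max_pt=40):
    """Map the top_n most frequent words to font sizes between min_pt and max_pt."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    top = counts.most_common(top_n)
    if not top:
        return {}
    highest, lowest = top[0][1], top[-1][1]
    span = (highest - lowest) or 1  # avoid dividing by zero when all counts match
    return {
        word: min_pt + (count - lowest) * (max_pt - min_pt) / span
        for word, count in top
    }


sizes = tag_sizes("data data data viz viz python", top_n=3)
print(sizes)  # {'data': 40.0, 'viz': 25.0, 'python': 10.0}
```

Placing the words without overlap is the hard part of a real tag cloud, and that is what a dedicated package handles.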
Interactive Data Visualization for the Web
Bokeh and Plot.ly Dash are the current answers to creating interactive dashboards that allow multi-view brushing and filtering. Bokeh has very few examples, and Plot.ly Dash is non-trivial to use even for users experienced in creating visualizations in Python.
Plot.ly Dash is built on Flask, Plotly.js, and React.js, which raises the barrier to creating synchronized multi-view visualizations. A few student teams in my class used Plot.ly Dash for their final projects, but they experienced a steep learning curve. Here is a neat example that visualizes a dataset about TED talks using Dash by Ryan Campa and Shikhar Gupta.
Is Altair going to be the one?
As the course was progressing, there was news of Python + Vega combining in the form of Altair! Knowing that Vega is from the UW Interactive Data Lab, I was excited to use it. Jim Vallandingham’s excellent “Introduction to Altair” tutorial was a great starting point.
Jake VanderPlas, the primary developer for Altair, recently posted links to his Python notebook and video from PyCon 2018. I have been playing with it ever since and I like it a lot! I hope that it serves the needs of data scientists who want to explore their data and create interactive visualizations to explain their data internally and externally.
Data scientists would love to use visualization libraries and packages in Python and I hope tools such as Altair are the answer. Packages such as plotly, seaborn, bokeh, geoplotlib, etc. will continue to evolve and more functionality will be added. Here’s to a brighter future for Interactive Data Visualization (for the web) using Python! :)