PyCatFlow: Visualizing Categorical Data Over Time
PyCatFlow is a Python package for visualizing temporal changes in categorical data. It is inspired by Bernhard Rieder’s visualization tool RankFlow, which visualizes ranked lists over time, for example the changes in search results for queries on Google or YouTube. In my opinion RankFlow is an immensely useful tool despite its minimalistic user interface and the difficulties I had preparing data for it. As it turned out, these difficulties largely stemmed from “misusing” RankFlow or, to put it in more positive terms, from appropriating RankFlow for purposes other than those it was designed for.
In this article I first provide some background on how the idea for PyCatFlow emerged as a result of appropriating RankFlow for studying the technical evolution of Facebook’s APIs. The second part describes how the tool can be used in Python. I created a Jupyter notebook to run the tutorial and tool in a cloud environment via Binder. While the background information in the first sections is by no means necessary for learning how to use the tool, it provides some context for understanding the visualizations PyCatFlow generates.
Background: Appropriating RankFlow
RankFlow makes it possible to compare ranked lists (over time). In its simplest form, it requires tabular data arranged so that each column represents a ranked list.
Each ranked list can be supplemented by weights, thereby adding another layer of information to the data. If we take YouTube search results data, for example, the view count, the upvote count or the upvote-downvote ratio could be used as weight information. For the sake of simplicity, the example data consists only of ranked lists and results in the following flow diagram.
Each column in the data table is presented as a stack of nodes that is ordered according to the rank within the given data set. Identical nodes are furthermore connected between columns. This brings continuities as well as changes to the fore and enables the analysis of patterns of stasis and change. In their paper From Ranking Algorithms to ‘Ranking Cultures’ Bernhard Rieder, Ariadna Matamoros-Fernández, and Òscar Coromina used RankFlow to study different morphologies of topics on YouTube.
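The connection logic described above can be sketched in a few lines of Python. This is not RankFlow's actual implementation, only an illustration of the idea: the column names, the items and the helper `links_between_columns` are all made up for this example.

```python
# Hypothetical ranked lists: one list per column, best rank first.
ranked_lists = {
    "t1": ["A", "B", "C"],
    "t2": ["B", "A", "D"],
    "t3": ["B", "D", "A"],
}


def links_between_columns(columns):
    """Return (item, left_col, left_rank, right_col, right_rank) tuples
    for every item that appears in two adjacent columns."""
    names = list(columns)  # dicts keep insertion order
    links = []
    for left, right in zip(names, names[1:]):
        # Look up the rank of each item in the right-hand column.
        right_rank = {item: rank for rank, item in enumerate(columns[right], start=1)}
        for rank, item in enumerate(columns[left], start=1):
            if item in right_rank:
                links.append((item, left, rank, right, right_rank[item]))
    return links


links = links_between_columns(ranked_lists)
for link in links:
    print(link)
```

Items that appear in only one of two neighbouring columns, such as "C" in this made-up data, get no connection, which is what makes appearances and disappearances visible in the diagram.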
In a number of recent projects I was confronted with research questions and data that called for similar visualizations. Against the background of public controversies about how Facebook accumulates and shares data, colleagues and I, for example, were interested in how the platform governs the circulation of data through permissions.* Based on scrapes of live web data as well as on archived materials from the Internet Archive, we assembled a list of permissions that existed for different versions of the API.
Creating a RankFlow visualization based on this data requires rearranging the data set. For each version of the API there needs to be a column containing a ranked list of permissions. Yet the information presented in the publicly available developer documentation is not ranked like search results, i.e. permissions are not ordered by some relevance metric. Therefore, creating an order for the RankFlow diagram is a design decision: items can, for example, be sorted alphabetically, by their frequency of occurrence in the data set or based on additional data.
In practice, adapting our data to the required data structure of RankFlow was not only tedious work, but also a strong indicator that we were misusing the existing visualization tool. A similar challenge emerged once the graph was generated, because certain information within our data was not yet represented in the visualization. The app_review column in the above table contains information about whether requesting a permission had to undergo a review process at a specific time. Finally, we wanted to highlight in the visualization when a permission was introduced and when it was deprecated. To do this, we removed the color coding of the generated RankFlow graph and manually assigned distinct colors to the respective nodes.
To speed up the post-processing of the diagram, I wrote a Python script that post-processed the XML data in the SVG file that RankFlow generated. Yet this was less a solution than a very messy workaround that possibly only worked for the data at hand. Instead of investing time in perfecting this workaround, I decided to create a visualization tool similar to RankFlow that is well suited for temporal data that does not contain explicit ranking information, but potentially additional categorical data.
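To make this rearranging step concrete, here is a minimal sketch that turns a long list of (version, permission) records into one list per API version and orders each list by one of two of the options mentioned above. The records and the helper `to_columns` are made up for illustration; they are not the data or code from the study.

```python
from collections import Counter, defaultdict

# Made-up long-format records: (api_version, permission).
records = [
    ("v1", "email"), ("v1", "user_likes"), ("v1", "user_photos"),
    ("v2", "user_photos"), ("v2", "email"),
    ("v3", "email"), ("v3", "user_events"),
]


def to_columns(records, order="alphabetical"):
    """Group items by version and sort each group by the chosen criterion."""
    frequency = Counter(item for _, item in records)
    columns = defaultdict(list)
    for version, item in records:
        columns[version].append(item)

    if order == "alphabetical":
        key = lambda item: item
    elif order == "frequency":
        # Most frequent first; ties are broken alphabetically.
        key = lambda item: (-frequency[item], item)
    else:
        raise ValueError(f"unknown order: {order}")

    return {version: sorted(items, key=key) for version, items in columns.items()}


print(to_columns(records)["v1"])
print(to_columns(records, order="frequency")["v1"])
```

Both orderings are equally valid inputs for the diagram, which illustrates why the "ranking" here is a design decision rather than a property of the data.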
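Finding out which nodes to highlight boils down to determining, for every permission, the first and the last version in which it appears. The following sketch shows one way to do that. The version labels and permissions are made up, and the sketch assumes that a permission counts as deprecated when it is missing from the latest version.

```python
# Made-up permissions per API version, listed in chronological order.
versions = ["v1", "v2", "v3"]
permissions = {
    "v1": {"email", "user_likes"},
    "v2": {"email", "user_likes", "user_events"},
    "v3": {"email", "user_events"},
}


def lifecycle(versions, items_by_version):
    """Return two dicts: the first and the last version each item appears in."""
    first_seen, last_seen = {}, {}
    for version in versions:
        for item in items_by_version[version]:
            first_seen.setdefault(item, version)  # keep the earliest version
            last_seen[item] = version             # overwrite with the latest
    return first_seen, last_seen


first_seen, last_seen = lifecycle(versions, permissions)

# Assumption: items present from the first version on are not "introduced",
# and items still present in the latest version are not "deprecated".
# Items that disappear and later reappear are not handled separately.
introduced = {item: v for item, v in first_seen.items() if v != versions[0]}
deprecated = {item: v for item, v in last_seen.items() if v != versions[-1]}

print(introduced)
print(deprecated)
```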
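To give an impression of what such post-processing can look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. It is not the original script. The SVG snippet, the `data-label` attribute and the helper `recolor_nodes` are assumptions made for this example; the files RankFlow actually produces may be structured differently.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # write plain <rect> instead of <ns0:rect>

# Hypothetical SVG: each node is a <rect> carrying its label in "data-label".
sample_svg = (
    f'<svg xmlns="{SVG_NS}">'
    '<rect data-label="email" fill="#1f77b4"/>'
    '<rect data-label="user_likes" fill="#ff7f0e"/>'
    "</svg>"
)


def recolor_nodes(svg_text, colors, default="#cccccc"):
    """Replace the fill of every <rect> based on its label."""
    root = ET.fromstring(svg_text)
    for rect in root.iter(f"{{{SVG_NS}}}rect"):
        label = rect.get("data-label")
        rect.set("fill", colors.get(label, default))
    return ET.tostring(root, encoding="unicode")


# Example: highlight one node in green and grey out everything else.
print(recolor_nodes(sample_svg, {"email": "green"}))
```

A script like this depends entirely on how the SVG happens to be structured, which is one reason why such a workaround is fragile.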
How to use PyCatFlow
Currently the tool exists as a Python package called PyCatFlow. It has been released on PyPI and is maintained on GitHub. The initial implementation was created in close collaboration with Herbert Natta. If you know Python, then using PyCatFlow is pretty straightforward. Install it with pip and follow the instructions in the README.md on GitHub. If you don’t know any Python or simply want to try out the tool, I created a Jupyter Notebook that you can run in Binder. Jupyter Notebooks are interactive notebooks in which Python code is executed step by step in your browser, i.e. code cell by code cell. Binder provides an infrastructure to do this online so you do not have to install anything on your local computer.
The notebook makes use of interactive widgets so even non-programmers should be able to create visualizations with PyCatFlow. You can run it on mybinder.org. The notebook is published as a GitHub Gist. Using the Gist URL you can easily recreate the Binder environment in case the link no longer works for some reason. Simply paste the Gist URL into the interface on mybinder.org and select Gist from the dropdown menu. After you press launch, Binder starts an existing environment or creates a new environment from scratch. The latter may take a few minutes.
Once the environment has started, double-click the file PyCatFlow.ipynb on the left-hand side of the interface to open the notebook. The notebook contains a combination of code and text cells. The blue bar on the left highlights the currently active cell. You can progress through the notebook by pressing the play icon in the menu bar or by pressing shift+enter on your keyboard. Once a code cell has been executed, a small number appears before it, indicating the order of execution. These execution numbers are relevant because code cells can be run in any order and rerun as often as you like. Running code cells in an arbitrary order often leads to wrong results and execution errors. Markdown text cells have no execution number, but are rendered. The notebook of this tutorial is meant to be executed from top to bottom. I grouped the code cells into multiple steps that are indicated by headlines. The tutorial refers to these steps. In case something goes wrong or you want to start over, you can reset the notebook by selecting the menu “Kernel/Restart Kernel and Clear All Outputs”.
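For readers who take the pip route, a script might look roughly like the sketch below. The function and parameter names (`read_file`, `visualize`, `savePng`, `saveSvg`, `maxValue`, `minValue`) follow the README at the time of writing and should be treated as assumptions; check them against the version you install. The wrapper `render_catflow` is a hypothetical helper, and the column names are those of the example data set described in Step 1.

```python
def render_catflow(csv_path, out_stem):
    """Render a CSV file with PyCatFlow and save the result as PNG and SVG."""
    # Imported inside the function so this file can be loaded even where
    # the package is missing. Install it first with: pip install pycatflow
    import pycatflow as pcf

    # Assumed call signature; the column names match the example data set
    # and need to be adjusted for other files.
    data = pcf.read_file(
        csv_path,
        columns="column",
        nodes="items",
        categories="category",
        column_order="column order",
    )

    # Assumed parameters: spacing between nodes, overall width, and the
    # maximum and minimum node height.
    viz = pcf.visualize(data, spacing=20, width=800, maxValue=20, minValue=2)
    viz.savePng(f"{out_stem}.png")
    viz.saveSvg(f"{out_stem}.svg")
    return viz
```

Calling `render_catflow("my_data.csv", "my_viz")` with a file of your own would then be expected to write `my_viz.png` and `my_viz.svg` next to the script.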
Step 1: Loading data
Once you have executed the second code cell, an interactive widget appears prompting you to select a data file for upload. The notebook only accepts CSV files. Choose the separator that your file uses as the delimiter between data columns. Available options are tab, comma or semicolon.
After choosing the file, navigate to the next cell and execute it. Do not execute the second code cell again, since this overrides your selection! The first five rows of your data set will be printed on screen. You can use the example data set published alongside the source code of the tool, which contains four columns: column, items, category and column order. The data does not stem from the study of the evolution of Facebook’s APIs that I described above, but looks at which code libraries an open source project depends on. Each row is a so-called “dependency” that the chatbot framework Chatterbot relied on at a certain point in time. In case you use your own data and it is not printed correctly, please check whether you used the correct separator.
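If you are unsure what a wrong separator looks like, the following sketch reproduces the effect with Python's standard `csv` module. It does not use the notebook's own loading code. The rows are made up and merely follow the four-column layout of the example data set.

```python
import csv
import io

# Made-up rows in the four-column layout described above
# (not the content of the real example file).
sample = (
    "column;items;category;column order\n"
    "release-1;lib-a;parsing;1\n"
    "release-1;lib-b;storage;1\n"
    "release-2;lib-a;parsing;2\n"
)

SEPARATORS = {"tab": "\t", "comma": ",", "semicolon": ";"}


def load_rows(text, separator):
    """Parse CSV text into a list of dicts using the named separator."""
    reader = csv.DictReader(io.StringIO(text), delimiter=SEPARATORS[separator])
    return list(reader)


# Correct separator: four columns per row.
rows = load_rows(sample, "semicolon")
for row in rows[:5]:
    print(row)

# Wrong separator: every line collapses into one single column.
wrong = load_rows(sample, "comma")
print(list(wrong[0].keys()))
```

A preview in which each row consists of one long string is therefore a good hint that the separator does not match the file.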
Step 2: Mapping data columns to the visualization
In step 2, data columns are mapped to aspects of the visualization. You can select up to four columns from your data, but only two, viz columns and viz nodes, are required for PyCatFlow to render graphics.
- Viz columns* refer to the columns of the visualization that represent distinct points in time.
- Viz nodes* are the items that are visualized over time.
- Viz category contains additional categorical data that will be color coded. This is optional and can be left blank.
- Column order contains optional information about how the columns shall be arranged in the visualization. In case this is not specified, the tool checks whether the data in viz columns is numerical or a string and orders the columns in the visualization accordingly. Be aware that entries like “v. 2.10” will be treated as strings and as a consequence “v. 2.10” follows “v. 2.1”, but is displayed before “v. 2.2”.
The interactive widget assumes that the columns of the visualization are contained in the first data column and the nodes in the second data column. Adjust accordingly if your data set is structured differently. The other two dimensions are optional. Select the columns “category” and “column order” for the third and fourth options if you use the example data set.
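The ordering caveat mentioned for column order is easy to reproduce in plain Python. The sketch below only illustrates string versus numeric sorting and is not taken from PyCatFlow's code. The helper `version_key` is a hypothetical way to derive values for a column order field from labels like “v. 2.10”.

```python
labels = ["v. 2.2", "v. 2.10", "v. 2.1"]

# Strings are compared character by character, so "v. 2.10" ends up
# between "v. 2.1" and "v. 2.2".
print(sorted(labels))  # ['v. 2.1', 'v. 2.10', 'v. 2.2']

# An explicit order value per column avoids the problem.
explicit_order = {"v. 2.1": 1, "v. 2.2": 2, "v. 2.10": 3}
print(sorted(labels, key=explicit_order.get))  # ['v. 2.1', 'v. 2.2', 'v. 2.10']


def version_key(label):
    """Turn a label such as 'v. 2.10' into a tuple of integers, here (2, 10)."""
    numbers = label.replace("v.", "").strip()
    return tuple(int(part) for part in numbers.split("."))


# The same order can be derived from the labels themselves.
print(sorted(labels, key=version_key))  # ['v. 2.1', 'v. 2.2', 'v. 2.10']
```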
Step 3: Set properties of the visualization
In the third step you can set a number of properties for the visualization: the width, the minimum and maximum height of nodes, the spacing between nodes, one of three connection types, and the sort order of the nodes.
Once these parameters are set and the following code cell is executed, the generated visualization is printed on the screen. You can now adjust the parameters in the above cell and rerun the visualization cell or proceed.
Step 4: Save the result
Once you are satisfied with the resulting visualization, you can save it to a file. You can choose the file name in the respective widget. Each visualization is automatically stored as both a PNG and an SVG file using the same name. Make sure to download your results from the file list on the left-hand side. To do so, select all desired files, open the context menu with a right mouse click (or two-finger tap) and press “Download”.
Advanced Settings
Besides the parameters that can be adjusted using the interactive widgets in the tutorial, PyCatFlow offers a number of additional parameters to optimize the visualization. I describe how these advanced settings can be used at the very end of the notebook.
*In late 2019 and early 2020 Anne Helmond, Tatjana Seitz, Angeles Briones, Fernando van der Vlist and I met for two data sprints at CAIS Bochum. During these sprints the described study on Facebook was conducted. We published the results as a working paper The Technicity of Platform Governance: Structure and Evolution of Facebook’s APIs.