UpSet.js — Behind the (technical) Scenes

an interactive JavaScript re-implementation of UpSet(R)

May 3, 2020

This article is part of a three-part article series.

Technical Stack and Software Architecture

In this last part of the article series about UpSet.js, I want to focus on the software architecture, tools, decisions, and concepts behind UpSet.js. The writing style differs from the other parts, since it discusses the decisions I made and the experiences I had during development.

Based on my experience with LineUp.js, I had a pretty good idea of how to approach this project. I decided to create the underlying library as a purely functional React component published as @upsetjs/react. The advantages are that one doesn’t have to worry about how to update the DOM, about sophisticated DOM diff updates, or about caching strategies to avoid updates in the first place.

Another option would have been the also popular Vue.js framework. However, I decided against it for the following reasons. One advantage of Vue.js is its templates: one writes HTML with annotations and Vue.js renders the content. While this is useful for developers coming from the HTML world, it is a disadvantage for me. I come from the JavaScript world and am used to creating DOM elements through code, e.g., using D3. Thus, the separation between HTML templates and code was suboptimal for me. Another advantage of Vue.js is its integrated reactivity. One can declare properties and computed properties, and Vue.js automatically updates the elements that need to be updated. A central assumption of this approach is that you have one Vue.js instance. It happened to me more than once that, in order to use a certain Vue.js component, I had to add a plugin or configure my global Vue.js instance. In this respect, I prefer React, since the idea of a virtual DOM and JSX/TSX is more general. For example, one can configure the TypeScript compiler to use a different JSX factory and thereby target a different framework. A good example is Preact, which offers a React-compatible API.

It is true that React has no integrated reactivity features and uses a “naive” approach to determine whether a component needs to be re-rendered. But creating JavaScript objects is fast, and with functional React components and functions like React.memo, React.useMemo, and React.useCallback, one can avoid unnecessary updates. The last reason was React’s TypeScript support. I’m a big fan of TypeScript and don’t want to miss it anymore. UpSet.js is completely written in it and fully typed. Since React components are either classes or functions, one can fully type the properties and the state. In contrast, at the time of writing this article, it was more difficult to properly type all properties in Vue.js, especially when handing over an array or object. There are ways to do it, but they are not as clean as with React. Moreover, you need special IDE plugins for proper recognition and type checking, while with React / TSX it just works. Thus, I decided to focus on React for this project. But due to the popularity of Vue.js, I created a Vue.js adapter at @upsetjs/vue.

I decided against explicit CSS files and against a CSS-in-JS solution. Instead, I embed the styles directly in the generated SVG via a style tag. This way, users don’t have to import another resource such as a CSS file, and no CSS-in-JS solution appends a style element to the DOM on its own. In addition, it addresses another design goal: avoiding complex runtime dependencies. In my experience, the more complex the dependencies are, the more likely it is that something goes wrong because of versioning conflicts.

I decided to use Yarn 2 for this whole project wherever possible. The main reason was its Plug’n’Play (PnP) feature. While support for this technique by various tools is still at an early stage, it is a promising idea that addresses a serious problem with regular node_modules: the number of files. Especially in bigger monolithic repositories, the number of modules is almost countless. With PnP, the compressed NPM package ZIP files aren’t extracted but served directly, and the central .pnp.js file is used to manage package resolutions. Last but not least, yarn install no longer takes ages, nor does it fill up my hard drive with countless small files that are a burden on the file system.

Add-ons

Another design decision was to extract the rendering of box plots as aggregations of numerical attributes into its own repository and integrate it into the UpSet.js plot using an add-on mechanism. The reasons were twofold. On the one hand, I think that most people will use the UpSet plot without any numerical attributes, so integrating box plots into the core library would increase its complexity without a big benefit. On the other hand, the add-on idea allows users to render not just one attribute but an arbitrary number. Moreover, users can develop their own add-ons, for example a histogram variant for showing aggregations of categorical data.

Styling

As mentioned in the previous section, a single style tag is used within the SVG element to customize the style of the UpSet plot. In an early version of the library, I just used inline styles style=" " for styling. However, one cannot implement simple hover effects like .test:hover or ::before and ::after rules without implementing custom logic. Thus, I switched to a style tag, generating class names and rules automatically. In a first version, I used static class names that were independent of the data provided to the React component. I assumed that the styles would automatically be scoped to the SVG element. Well, I was wrong. As soon as I had multiple UpSet plots in an application (in my example a Jupyter Notebook), changing the style of one changed the styles of all. For example, switching the theme from light to dark changed the output of all UpSet plots in the view.

Thus, I needed a way to generate “unique” class names. A common way is to create a global side effect, for example by introducing a global counter. However, UpSet.js is designed to be stateless and side-effect free. A closer look at the issue revealed that class names only have to be unique with respect to the style they apply. So, only when two classes apply different styles do they need different class names across UpSet component instances or updates.

My solution was to combine the React.useMemo hook with a random number generator. The useMemo hook is usually used to memoize expensive computations and only recompute them when one of their dependencies has changed. In my use case, I generate a new class name suffix each time a style property affecting this class has changed. Thus, two UpSet updates with the same style attributes result in the same class names, avoiding unnecessary updates while still preserving the stateless character. This heuristic is not optimal in terms of DOM updates but good enough that most updates, for example selection updates, don’t trigger a change in class names.
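The idea of memoized random suffixes can be sketched without React. This is a minimal sketch, not the UpSet.js source: memoByDeps is a hypothetical stand-in for React.useMemo, and randomSuffix, classNameFor, and the color and fontSize style properties are illustrative names only.

```typescript
// Hypothetical stand-in for React.useMemo: caches the last value and
// recomputes it only when one of the dependencies has changed.
function memoByDeps<T>(): (factory: () => T, deps: readonly unknown[]) => T {
  let lastDeps: readonly unknown[] | null = null;
  let lastValue: T | undefined;
  return (factory, deps) => {
    const previous = lastDeps;
    const unchanged =
      previous !== null &&
      previous.length === deps.length &&
      deps.every((d, i) => d === previous[i]);
    if (!unchanged) {
      lastValue = factory();
      lastDeps = deps;
    }
    return lastValue as T;
  };
}

// Random suffix of lowercase letters and digits. Uniqueness is only
// probabilistic, which is acceptable for a heuristic like this one.
function randomSuffix(length = 6): string {
  const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789";
  let out = "";
  for (let i = 0; i < length; i++) {
    out += alphabet.charAt(Math.floor(Math.random() * alphabet.length));
  }
  return out;
}

// One "component instance": the suffix only changes when a style
// property that affects the class changes (assumed: color, fontSize).
const memoizedSuffix = memoByDeps<string>();
function classNameFor(color: string, fontSize: number): string {
  const suffix = memoizedSuffix(() => randomSuffix(), [color, fontSize]);
  return `bar-${suffix}`;
}
```

Calling classNameFor twice with the same style properties returns the same class name, while a changed property produces a fresh suffix. No global counter is needed, so separate plots do not share any state.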

Project and Repository Structure

Model and (React) View

In order to better reuse certain model classes, I separated the React View from the underlying model classes. However, both are managed in a single monolithic repository at https://github.com/upsetjs/upsetjs to avoid fragmentation of the two projects. The model classes are released as @upsetjs/model on NPM as a separate package.

React and other frameworks

Integrating a React component into other applications and systems can be difficult. For instance, one has to set up React and ReactDOM first before using the library. Moreover, when trying to use it within another web framework such as Vue.js, one has to create a binding between two different virtual DOMs and worlds. Thus, I created a bundled version of UpSet.js released as @upsetjs/bundle. The bundled version has no runtime dependencies and uses Preact under the hood to render the UpSet React component. Preact is like the little brother of React, with a compatible API but just 3 kB in size.

So, there is just one JavaScript file that you have to embed to start playing. The package.json file references the @upsetjs/model library. However, this dependency is only needed at compile time to provide proper TypeScript typings and thus a better developer experience. An important aspect of the bundle’s typings is that they contain no references to React or Preact. This not only hides the internals of the rendering completely but also prevents React or Preact typings from polluting the typings space when importing the library. That was another lesson I had to learn, when suddenly two global “JSX.Elements” were available that were not compatible with each other.

Repositories

The core of UpSet.js, along with the UpSet.js app and its integrations into other web frameworks, is located in the monolithic repository at https://github.com/upsetjs/upsetjs. In the end, the following packages are managed by this repository:

@upsetjs/model is the base package defining model classes and functions to compute set overlaps or set intersections.
@upsetjs/react is the main visual UpSet.js component, implemented as a stateless React component using the hooks API.
@upsetjs/math contains some utility math functions, for example for computing the statistics to render box plots.
@upsetjs/addons contains the boxplotAddon for rendering box plots as aggregations of numerical attributes in an UpSet.js plot.
@upsetjs/bundle is the main library when integrating UpSet.js in other frameworks. It has no runtime dependencies and uses Preact to render the @upsetjs/react component under the hood. Moreover, it bundles the @upsetjs/addons library, too.
@upsetjs/app contains the UpSet.js App web application hosted on https://upset.js.org/app.
@upsetjs/vue contains a Vue.js component that uses the @upsetjs/bundle version to render UpSet within Vue.js.
@upsetjs/vue-example is a sample application using the @upsetjs/vue library.

Besides the monolithic upsetjs repository, each integration has its own repository on GitHub. The reason is that different systems provide different template repositories and require a different layout and bundling logic. For example, a PowerBI Custom Visual has to be managed by NPM according to the Microsoft guidelines. However, I keep the different repositories in sync by following the semantic versioning rules. In particular, major and minor version updates are published simultaneously, while patch releases can be independent in order to react to dependency updates or hot fixes.

CI/CD and Project Website

Github Actions

For this whole project, I’m using GitHub Actions to automate testing and building of the different libraries. It is well integrated into GitHub and thus avoids the need for an external service like Travis or CircleCI. GitHub Actions is easy to use and, with the various third-party actions, very powerful.

The project website at https://upset.js.org is stored at https://github.com/upsetjs/upsetjs.github.io. Thanks to the great https://js.org/ project, I can register and maintain the domain upset.js.org free of charge.

The website itself is mostly generated and is automatically updated by the different repositories. For example, the README.md and all built files of the upsetjs repository are automatically committed to the website after a successful build. Another example is the RMarkdown files and the generated R package documentation, which are hosted at https://upset.js.org/integrations/R.

In addition, I follow the convention of having a master and a develop Git branch. The master branch reflects the latest published version and its build artifacts are published on the website. The develop branch contains the not yet published changes and its build artifacts are published at https://upset.js.org/next.

Generated API Documentation and Storybook Stories

The different components and libraries are documented depending on their type and purpose. For @upsetjs/model, an API documentation generated with TypeDoc is located at https://upset.js.org/api/model. For @upsetjs/react, I use Storybook; its stories are located at https://upset.js.org/api/react. The R package uses pkgdown to generate a package documentation at https://upset.js.org/api/r, and the Jupyter extension uses Sphinx at https://upset.js.org/api/jupyter. While readthedocs is a common choice for automatically generating Sphinx documentation, I decided to host it myself (or better, let GitHub Pages host it) to have everything in one place and under my control.

UpSet.js App

The UpSet.js app is implemented in React using the Material UI library and MobX for state management. Since it is an isolated application, I used the Material UI embedded CSS-in-JS solution. A main focus was the import and export functionality.

The provided datasets are dynamically loaded from the original UpSet web application by proxying them through https://cors-anywhere.herokuapp.com/ to avoid CORS issues. However, the default dataset is a custom one about Game of Thrones characters. Though, I’m currently reading the books and the dataset contained some spoilers :-(. Moreover, users can upload their own data in CSV format. Uploaded data are stored in the IndexedDB of the browser. IndexedDB is an interesting API that is easy to use, especially with Dexie. Exporting the data was more interesting. In one of my other applications, the LineUp.js App, I had implemented similar features, so I already knew how to use the APIs of external services like CodePen, CodeSandbox, or JSFiddle. Challenges included how to encode the data and how to generate the JavaScript code that produces the same plot as in the UpSet App itself.

Embedded Version

Another part of the application is generating an embedded version in which the whole dataset is encoded only in the URL. Generating a short URL was quite challenging. I use LZ-string to encode the JSON data dump that should be transferred. However, generating a short JSON data dump that is fully interactive required some tweaks. In order to be fully interactive, the overlaps between all sets and set intersections are needed, such that the proper fraction can be highlighted when hovering over a set. In the simplest form, one can derive the overlap by storing all individual data items for each set and computing it on the fly. However, for larger datasets with multiple thousand elements, the data overhead becomes bigger and bigger. Thus, for larger datasets, I used a different approach: pre-computing all overlaps and providing the overlap information to UpSet.js. The number of visible set combinations is at most 100, such that they remain separable in the plot. Thus, the number of possible set overlaps is around 100^2 = 10,000. Since the overlap matrix is symmetric, it reduces to around 5,000 possible overlaps. By pre-computing them and storing the result in the data dump, the data dump becomes independent of the number of elements in the dataset while preserving full interactivity.
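The pre-computation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the UpSet.js implementation: the Combo shape, the "i,j" key format, and the function names are hypothetical, and element ids are assumed to be unique within each combination.

```typescript
// A visible set or set combination with the ids of its elements
// (hypothetical shape, for illustration only).
interface Combo {
  name: string;
  elems: string[];
}

// Number of elements shared by two combinations.
function overlapSize(a: Combo, b: Combo): number {
  const inA = new Set(a.elems);
  return b.elems.filter((e) => inA.has(e)).length;
}

// Pre-compute only the upper triangle (including the diagonal) of the
// symmetric overlap matrix, keyed by "i,j" with i <= j.
function precomputeOverlaps(combos: Combo[]): Map<string, number> {
  const overlaps = new Map<string, number>();
  combos.forEach((a, i) => {
    combos.slice(i).forEach((b, offset) => {
      overlaps.set(`${i},${i + offset}`, overlapSize(a, b));
    });
  });
  return overlaps;
}

// Look up an overlap in either order by normalizing the key.
function lookupOverlap(overlaps: Map<string, number>, i: number, j: number): number {
  const key = i <= j ? `${i},${j}` : `${j},${i}`;
  return overlaps.get(key) ?? 0;
}
```

For n combinations the table holds n(n+1)/2 entries, so 100 combinations yield 100 * 101 / 2 = 5,050 numbers, in line with the "around 5,000" estimate. The table size depends only on the number of combinations, not on the number of elements.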

The embedded version of the app at https://upset.js.org/app/embed.html supports three data sources. First, the data can be given as an encoded URL search string, as described before. Second, the user can upload the data manually through a file chooser field. Third, the page listens to message events in case it is used within an iframe. Moreover, as soon as the embedded version receives a data dump, it manipulates its own DOM to persist the data in a hidden script tag. The rationale is that when the website is stored as an HTML file through the browser, it is still self-contained, since the data are part of the saved DOM elements. The use case of storing the website is also the reason why all scripts and styles are directly embedded into the HTML file, so that there are no external references.

Vega Lite

One of the recent standard export formats in UpSet.js is a generated Vega-Lite specification. Vega-Lite is a grammar of interactive graphics. The ideas are similar to the grammar of graphics: one specifies data, marks, and how the data should be encoded in the marks, such as in the x position. Generating the two bar charts was easy. The dot plot in the bottom right corner required some creativity. The dot plot shows a matrix with the dimensions sets x set intersections. If a dot is filled, the corresponding set is part of the set intersection. Since Vega only renders marks that have a corresponding data point, I needed a full matrix to render. However, I didn’t want to store a data structure of the form {combination: 'S1,S2', set: 'S1', value: 1}, since it would take too much space. Thus, I played with the transform operations. In the best case, there would be an operation similar to an SQL outer join, but I wasn’t that lucky. (If you know how to do it, please let me know.) However, there is the flatten operation, which converts a structure like {combination: 'S1,S2', sets: ['S1', 'S2'], value: 1} to [{combination: 'S1,S2', set: 'S1', value: 1}, {combination: 'S1,S2', set: 'S2', value: 1}]. Great, halfway done: I could generate the marks for the filled dots. Creating the marks for the unfilled dots required another trick and, in the end, the following data structure:

[
{ combination: 'S1,S2', sets: ['S1', 'S2'], value: 1, nsets: [''] },
{ combination: 'S1,S2', sets: [''], value: 0, nsets: ['S3', 'S4'] }
]

So for each combination, I needed two rows. The first one for storing the positive sets, the other for storing the negative sets. Then I first applied a flatten transform on the sets attribute and then another one on the nsets attribute. In this case the result looks like

[ 
{ combination: 'S1,S2', set: 'S1', value: 1, nset: '' },
{ combination: 'S1,S2', set: 'S2', value: 1, nset: '' },
{ combination: 'S1,S2', set: '', value: 0, nset: 'S3' }, 
{ combination: 'S1,S2', set: '', value: 0, nset: 'S4' }
]

The last step was to create a computed property set_nset, defined as set_nset = set + nset, which yields the actual set name for each dot by concatenating the two properties. This also explains the need for the [''] entries. Without them, either the first or the second flatten operation would have removed the row, since there would be an empty array to flatten. The final plot was then a simple heatmap showing marks identified by combination and set_nset and colored according to the value property.
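The effect of the two flatten steps and the computed set_nset field can be mimicked in plain TypeScript. This is only a sketch of the data flow described above, not a Vega-Lite specification; the interface and function names are hypothetical, and the behavior for empty arrays follows the description in the text.

```typescript
// One input row per combination and polarity, as in the data structure above.
interface ComboRow {
  combination: string;
  sets: string[];
  nsets: string[];
  value: number;
}

// One output row per dot in the matrix.
interface DotRow {
  combination: string;
  set_nset: string;
  value: number;
}

// Emulates: flatten on "sets", then flatten on "nsets", then the
// computed field set_nset = set + nset.
function toDotRows(rows: ComboRow[]): DotRow[] {
  const out: DotRow[] = [];
  for (const row of rows) {
    for (const set of row.sets) {       // first flatten: sets -> set
      for (const nset of row.nsets) {   // second flatten: nsets -> nset
        out.push({
          combination: row.combination,
          set_nset: set + nset,         // one of the two is always ''
          value: row.value,
        });
      }
    }
  }
  return out;
}
```

With an empty array instead of [''], the corresponding loop runs zero times and the row disappears, which is why the placeholder entries are needed.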

With these tricks, I needed to store N set elements for showing the vertical bar chart on the left, M combination elements for showing the horizontal bar chart, and another M combination elements for generating the dot plot. So there are actually only two datasets: N sets and 2*M combinations. The horizontal bar chart just keeps the elements that have value: 1 before rendering. Neat, isn’t it?

Integrations

Developing the UpSet.js library was just one part of the ecosystem. Another major design goal was to create integrations into major data science tools. For LineUp.js, I developed similar integrations for R and Jupyter Notebooks, which gave me a head start but also showed me things I wanted to do better this time.

R/RShiny/RMarkdown

R is a popular language among statisticians. It is easy to use; however, each time I use it I have to remind myself that arrays in R start at 1 instead of 0. This was the first time I saw the benefit of having a stateless React component as the core, since it avoids (buggy) state synchronization between R and the library. So, I just have to ship the data to the integration and render the plot. One goal was to support the same data formats as UpSetR, namely list input, expression input, and data.frame input. Moreover, I used the builder pattern via the %>% operator. This saved me from writing a single function with numerous arguments and makes it easier to separate the different data sources.

I created interactivity by using custom Shiny events that are sent back to the server. https://github.com/upsetjs/upsetjs_r/blob/master/r_package/shiny/events.R is a simple R Shiny application that uses the custom R events to synchronize the selection with other R Shiny widgets. Thus, linking and brushing can be easily implemented.

In addition, I created an adapter for the Crosstalk library. Crosstalk is part of the HTMLWidget environment, and the idea is to implement linking and brushing directly on the client without the need for a server. However, I haven’t seen many practical examples using this concept.

Jupyter Notebooks

Jupyter Widgets are an extension mechanism for creating interactive widgets in Jupyter. The library takes care of the management and, foremost, the synchronization between the front-end JavaScript view and the back-end Python model. A great idea in theory, but tricky due to version conflicts. In order to make Jupyter Widgets work, the JavaScript client library @jupyter-widgets/jupyterlab-manager, the server Python library ipywidgets, and Jupyterlab have to be compatible with each other. The current version of Jupyterlab while writing this article was 2.1.1. After some testing, a working configuration was Jupyterlab 2.1.1 with ipywidgets 8.0.0a0 and @jupyter-widgets/jupyterlab-manager 3.0.0-alpha.0. For Jupyterlab 1.2.x, it is ipywidgets 7.5.1 and @jupyter-widgets/jupyterlab-manager 2.0.0. This took a while to find out and also cost some nerves when the front-end kept saying that my UpSetJS view class couldn’t be instantiated because of a version mismatch.

PowerBI

Creating an integration for PowerBI from Microsoft turned out to be quite easy. It is web-based in the first place and uses TypeScript for declaring its API. https://powerbi.microsoft.com/en-us/developers/custom-visualization/ is a good starting point. The basic idea is to use their pbiviz CLI tool to create a new project, launch a development server (with a custom certificate to serve https), and package the project. The capabilities.json file is the central point for declaring what type of data and which options the visual supports. Thanks to JSON-Schema, you even have proper auto completion and validation. The data can be transformed in various ways; for my purposes, I used the simplest one and computed the sets myself. The input format is like the data frame option in R or Jupyter. The elements slot requires one dimension that identifies the rows, for example the names of the Game of Thrones characters. The sets slot requires one or more dimensions or measures. Each dimension/measure represents one set in the final plot. Determining whether element E_i from the first slot belongs to the set S_j means checking whether S_j[i] is truthy, like a number that is not zero, the boolean value true, or a text that starts with a “t”. Finally, the attributes slot requires zero or more numerical measures that are used for generating box plots.
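The membership rule described above can be sketched as a small helper. This is a minimal sketch, not the code of the PowerBI visual: the CellValue type and the function names are hypothetical, and ignoring case and leading whitespace for text values is an assumption that the text does not state.

```typescript
// Possible raw values in a PowerBI column (hypothetical simplification).
type CellValue = number | boolean | string | null | undefined;

// A value counts as "in the set" if it is a non-zero number, the
// boolean true, or a text starting with "t".
// Assumption: case and leading whitespace are ignored for text.
function isTruthyCell(value: CellValue): boolean {
  if (typeof value === "number") {
    return value !== 0;
  }
  if (typeof value === "boolean") {
    return value;
  }
  if (typeof value === "string") {
    return value.trim().toLowerCase().charAt(0) === "t";
  }
  return false; // null, undefined, or missing
}

// Element E_i belongs to set S_j if column[i] is truthy.
function membersOf(elements: string[], column: CellValue[]): string[] {
  return elements.filter((_, i) => isTruthyCell(column[i]));
}
```

For example, with the elements Jon, Arya, and Ned and the column values 1, 0, and 'true', membersOf returns Jon and Ned.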

Tableau Dashboard Extension

The Tableau extension was more difficult to achieve. One reason is that Tableau Desktop is a fat client using Qt. Their extension API is limited to creating dashboard extensions. A dashboard extension is a website that communicates via iframe messages with the Tableau instance and is embedded using QtWebKit. A dashboard extension can only access data from sheets within the same dashboard. The same holds for setting selected items. Tableau supports debugging dashboard extensions via the Chrome remote debugging protocol. Still, it is quite painful in practice, since you have to make sure to use a matching Chromium version. For example, using the latest Chrome didn’t work, due to a removed JavaScript function for defining custom elements. Thus, it was quite painful to test the extension. Therefore, I dumped all possible data from the Tableau extension API to a JSON file and then used it to mock the API when developing the extension.

The data format is similar to the one for PowerBI. However, I had to implement the user interface myself. Tableau is known for dragging features onto the plot or onto slots to apply them. I tried to mimic this user interaction by allowing the user to drag the features to the elements/sets/attributes slots. I implemented it in my mocked system using the standard Drag-n-Drop API. However, it didn’t work within Tableau, since it seems that the window itself rejects all dragged objects. I filed an issue about it at https://github.com/tableau/extensions-api/issues/310, and in the meantime I used a custom implementation based on mouseDown, mouseMove, mouseUp, mouseEnter, and mouseLeave MouseEvents. A simple demo is at https://codepen.io/sgratzl/pen/MWaoOXz.
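A drag interaction built from plain mouse events can be modeled as a small state machine. The following is only a sketch of that general technique, independent of the DOM and of the actual extension code. All names (DragState, DragAction, reduceDrag) are hypothetical; in a real page, the actions would be dispatched from the corresponding mouse event listeners.

```typescript
// State of an emulated drag operation (hypothetical shape).
interface DragState {
  dragging: string | null;                  // feature being dragged
  over: string | null;                      // slot under the pointer
  pointer: { x: number; y: number } | null; // position for a drag preview
  dropped: { feature: string; slot: string }[];
}

type DragAction =
  | { type: "mouseDown"; feature: string; x: number; y: number }
  | { type: "mouseMove"; x: number; y: number }
  | { type: "mouseEnter"; slot: string }
  | { type: "mouseLeave" }
  | { type: "mouseUp" };

const initialDragState: DragState = { dragging: null, over: null, pointer: null, dropped: [] };

function reduceDrag(state: DragState, action: DragAction): DragState {
  switch (action.type) {
    case "mouseDown": // start dragging a feature
      return { ...state, dragging: action.feature, pointer: { x: action.x, y: action.y } };
    case "mouseMove": // follow the pointer only while dragging
      return state.dragging !== null ? { ...state, pointer: { x: action.x, y: action.y } } : state;
    case "mouseEnter": // remember the slot only while dragging
      return state.dragging !== null ? { ...state, over: action.slot } : state;
    case "mouseLeave":
      return { ...state, over: null };
    case "mouseUp": {
      // drop onto the current slot, if any, and reset the drag
      const dropped =
        state.dragging !== null && state.over !== null
          ? [...state.dropped, { feature: state.dragging, slot: state.over }]
          : state.dropped;
      return { dragging: null, over: null, pointer: null, dropped };
    }
  }
}
```

Keeping the logic in a pure function makes it testable against a mocked environment, without a running Tableau instance.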

Conclusion

In this last part of the article series about UpSet.js, I described the rationale behind the architecture of UpSet.js and its internal structure. Moreover, I described the challenges I encountered during the development of the library and its integrations, along with their solutions. Developing UpSet.js was fun, and I will use the experience from this project for the next one.