Designing for Machine Learning — Part 2

Cluster Analysis in Kira — A WIP Case Study

In Part 1 I explained clustering and how it helps Kira’s app users get their jobs done; check it out, it’s a good primer! In Part 2 I’m going to cover some exploration and ideation my colleagues and I have done around clustering.

Layering clusters

The designs in Part 1 only consider what would happen if a user wanted information on one field (e.g., clusters in Change of Control). What if we want to cluster all the fields for more insight? Or cluster all documents? Or see all the field clusters in one document type?

Example of constructing a widget in Google Analytics, with some of the configuration options available.

If we allow users to choose what data they want to view, we can create charts to answer the questions above. I am a huge Google Analytics fan, from my former life in marketing and communications. Their dashboard creation is an inspiration; you can create custom widgets by selecting a chart type and then the dimensions and metrics to be measured.

The following is an exploration of some configuration options that let users construct their own cluster charts. Essentially, I am designing for the possibility of users layering clusters upon clusters.

In Step 1, a user selects a base dimension (in this case, Document Type). In Step 2, the user selects the type of document (Employment Agreements) and can then choose to layer a second dimension (field) on this cluster.
With these choices, a user can see clusters of all the business agreements in this project, and then dig deeper to find the Change of Control clusters in those agreements. The system suggests a bar graph, but the user could also select to view this information in any of the other chart types (e.g. Bar & Text).
These bar graphs would be generated from the choices above. Left: “Cluster this dimension” is selected for both document type and field. Middle: the fields are not clustered. Right: the base dimension (Document Type) clustering is disabled, as it would confusingly yield the same visualization as if Fields were selected as the base dimension.
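To make the two-step configuration above more concrete, here is a minimal sketch of how a chart's bars could be derived from a user's choices. Everything in it is an assumption for illustration: the record shape (docType, field, cluster), the config keys and the buildBars helper are hypothetical names, not Kira's actual data model or API.

```javascript
// Hypothetical records: one row per extracted field instance.
const records = [
  { docType: "Employment Agreement", field: "Change of Control", cluster: "A" },
  { docType: "Employment Agreement", field: "Change of Control", cluster: "B" },
  { docType: "Employment Agreement", field: "Term", cluster: "A" },
  { docType: "Lease", field: "Change of Control", cluster: "A" },
];

// Step 1 picks the base dimension; Step 2 picks a value and a second dimension.
// clusterSecond mirrors a "Cluster this dimension" toggle (assumed name).
function buildBars(rows, config) {
  // Keep only the rows that match the chosen base value.
  const selected = rows.filter(
    (row) => row[config.baseDimension] === config.baseValue
  );
  // Count rows per bar label, preserving first-seen order.
  const counts = new Map();
  for (const row of selected) {
    const label = config.clusterSecond
      ? `${row[config.secondDimension]} / Cluster ${row.cluster}`
      : row[config.secondDimension];
    counts.set(label, (counts.get(label) || 0) + 1);
  }
  return Array.from(counts, ([label, count]) => ({ label, count }));
}

const baseConfig = {
  baseDimension: "docType",
  baseValue: "Employment Agreement",
  secondDimension: "field",
};

// One bar per field and cluster combination.
const layeredBars = buildBars(records, { ...baseConfig, clusterSecond: true });
// One bar per field, with clusters merged.
const flatBars = buildBars(records, { ...baseConfig, clusterSecond: false });

console.log(layeredBars);
console.log(flatBars);
```

The point of the sketch is that the same records can feed either chart: the toggle only changes how rows are grouped into bars, which is why the chart type can stay a separate choice.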

Exploring all the chart types

For users to be able to create their own charts, we need to have a ready suite of chart types available, aside from the traditional bar graph above. Ideally, we would recommend a chart type, or make only certain types available, based on the base dimension the user has chosen.

Because of our familiarity with JavaScript, the Lead Designer and I looked to D3.js, a JavaScript library for rendering charts in browsers, for inspiration. Some of its benefits:

  • It is agile and fast, and it takes care of all the DOM manipulation and transitions.
  • We have already used it elsewhere in our app (which is written in Clojure), so we knew it could integrate well.
  • D3 has an extensive library of examples that we could look through to see what fit well with our data. 

The following sections outline three chart types and how well we think they work with our data.

Treemaps

Treemaps are a great way to show distinct groups of data as a representation of a whole, while also giving insight into the sub-groups of that data.

The Cancer Genome treemap shows extra information when the map blocks are selected, and overlays a second layer of info on the blocks when you select a variable (e.g., race or gender). The Zoomable treemap lets you go deeper into each subset of colour-coded blocks and zoom back out when needed.
This mockup shows all the clusters in all the fields in a Kira project; mousing over a block would give you information about each cluster that is grouped into that block (e.g., Cluster A, Outliers). Clicking on any block would take you to a list of all the variants in that cluster (not shown).
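A treemap like the mockup above needs its data as a tree: project, then fields, then clusters. The sketch below shows one way flat cluster counts could be nested into that shape. The field names, cluster names and variant counts are made up, and toHierarchy and totalValue are hypothetical helpers. The { name, children, value } shape is a common convention for hierarchical chart data, and it is the kind of nested object that D3's hierarchy utilities read by default, but the code here does not call D3.

```javascript
// Hypothetical cluster sizes per field; names and numbers are illustrative.
const clusterRows = [
  { field: "Change of Control", cluster: "Cluster A", variants: 12 },
  { field: "Change of Control", cluster: "Outliers", variants: 3 },
  { field: "Assignment", cluster: "Cluster A", variants: 7 },
  { field: "Assignment", cluster: "Cluster B", variants: 5 },
];

// Nest flat rows into root -> field -> cluster.
function toHierarchy(rows, rootName) {
  const root = { name: rootName, children: [] };
  const byField = new Map();
  for (const row of rows) {
    if (!byField.has(row.field)) {
      const fieldNode = { name: row.field, children: [] };
      byField.set(row.field, fieldNode);
      root.children.push(fieldNode);
    }
    // Each cluster is a leaf whose value drives the size of its block.
    byField.get(row.field).children.push({
      name: row.cluster,
      value: row.variants,
    });
  }
  return root;
}

// Sum of all leaf values under a node; a block's area is proportional to it.
function totalValue(node) {
  if (!node.children) return node.value;
  return node.children.reduce((sum, child) => sum + totalValue(child), 0);
}

const tree = toHierarchy(clusterRows, "Project");
for (const fieldNode of tree.children) {
  const share = totalValue(fieldNode) / totalValue(tree);
  console.log(fieldNode.name, (share * 100).toFixed(1) + "% of the map");
}
```

Adding a second layer of clustering would mean giving the cluster leaves their own children, which the recursive total already handles.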

Pros of treemaps:

This would be an ideal way to show clustering data, especially if we added a second layer of clustering to this chart, for example, a view that showed all the clusters of a Change of Control field within each of the blocks above.

Cons of treemaps: 

It is not always easy to spot which block is the largest, especially when blocks are arranged in creative ways to save space.

Heatmaps

Heatmaps show you the intensity of an event where two datasets intersect.

The Day/Hour heatmap (left) is a static heatmap, more for visual inspiration. Ian YF Chang’s heatmap (right) can sort items along both the horizontal and vertical axes, and different datasets can be loaded from the dropdown.
The proposed Kira App heatmap shows, in this example, documents on the x-axis, intersected with fields on the y-axis, and the ability to sort on both axes. This design shows the data sorted by the field in the first column.
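As a rough sketch of the data behind such a heatmap, the snippet below builds a document-by-field grid and reorders the document axis by the values in one field. The rows, the outliers measure and the function names are all hypothetical. A null cell stands for "no extraction", which is an assumption about how empty intersections might be represented.

```javascript
// Hypothetical measurements: outlier count per document and field.
const extractions = [
  { document: "Doc 1", field: "Change of Control", outliers: 4 },
  { document: "Doc 1", field: "Assignment", outliers: 0 },
  { document: "Doc 2", field: "Change of Control", outliers: 1 },
  { document: "Doc 2", field: "Assignment", outliers: 2 },
  { document: "Doc 3", field: "Change of Control", outliers: 6 },
];

// Build the grid: cells[field][document] holds the intensity for that cell.
function buildHeatmap(rows) {
  const documents = [...new Set(rows.map((r) => r.document))];
  const fields = [...new Set(rows.map((r) => r.field))];
  const cells = {};
  for (const field of fields) {
    cells[field] = {};
    for (const doc of documents) cells[field][doc] = null; // null = no extraction
  }
  for (const r of rows) cells[r.field][r.document] = r.outliers;
  return { documents, fields, cells };
}

// Reorder the document axis by one field's values, largest first.
// Documents with no extraction for that field sort to the end.
function sortDocumentsByField(heatmap, field) {
  const row = heatmap.cells[field];
  const documents = [...heatmap.documents].sort(
    (a, b) => (row[b] ?? -1) - (row[a] ?? -1)
  );
  return { ...heatmap, documents };
}

const heatmap = buildHeatmap(extractions);
const byChangeOfControl = sortDocumentsByField(heatmap, "Change of Control");
console.log(byChangeOfControl.documents);
```

Sorting only changes the order of an axis, not the cell values, so the same approach could be applied to the field axis.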

Pros of heatmaps:

It allows you to quickly compare data across an entire project and identify trends. For example, you can quickly spot if one document has many outliers, or one field does not have extractions across many documents.

Cons of heatmaps:

There is no easy way to get all the information on one dataset. For example, if you wanted to see all the outliers in this entire project grouped together, it’s not possible on a heatmap because of the limits of the table layout.

Sunburst graphs

These are our absolute favourite, because they are both fun to interact with and allow clusters to be nested easily in more than one dimension.

The Sequences sunburst shows the progression of a single path as a slice of its previous section, while the Zoomable sunburst (right) allows you to see each subsequent section of the graph in its own isolated view.

Our graph is a combination of the two: it zooms in when a user selects a section while retaining the view of the whole slice, with a lighter version of the entire graph in the background for context.
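The nesting in a sunburst comes down to giving each child a share of its parent's angle. The sketch below computes those angles by hand for a small, made-up hierarchy. The data, nodeValue and layoutArcs are hypothetical, and a charting library would normally do this step for you. It is only meant to show why clusters can nest inside clusters at any depth.

```javascript
// Hypothetical nested clusters: document types, then clusters within them.
const sunburstData = {
  name: "Project",
  children: [
    {
      name: "Employment Agreements",
      children: [
        { name: "Cluster A", value: 6 },
        { name: "Cluster B", value: 2 },
      ],
    },
    { name: "Leases", children: [{ name: "Cluster A", value: 4 }] },
  ],
};

// Sum of leaf values under a node.
function nodeValue(node) {
  if (!node.children) return node.value;
  return node.children.reduce((sum, child) => sum + nodeValue(child), 0);
}

// Give each node an arc; children split their parent's arc in proportion
// to their values, and depth says which ring the arc is drawn on.
function layoutArcs(node, startAngle = 0, endAngle = 2 * Math.PI, depth = 0, arcs = []) {
  arcs.push({ name: node.name, depth, startAngle, endAngle });
  if (node.children) {
    const total = nodeValue(node);
    let cursor = startAngle;
    for (const child of node.children) {
      const span = (endAngle - startAngle) * (nodeValue(child) / total);
      layoutArcs(child, cursor, cursor + span, depth + 1, arcs);
      cursor += span;
    }
  }
  return arcs;
}

const arcs = layoutArcs(sunburstData);
for (const arc of arcs) {
  const degrees = ((arc.endAngle - arc.startAngle) * 180) / Math.PI;
  console.log("ring " + arc.depth, arc.name, degrees.toFixed(0) + " degrees");
}
```

Zooming into a section could then be treated as re-running the same layout with the selected node as the root, so that it fills the full circle.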

Pros of sunburst graphs:

You can segment information while showing it as part of the greater dataset, and nest clusters within clusters to any depth! They also have a lot of interactive possibilities.

Cons of sunburst graphs:

Similarly to treemaps, it is difficult to see at a glance how large a slice is as part of the whole. Bar and line graphs are still the most reliable way to compare values visually, so perhaps it is best to think of these charts as bringing other kinds of insight to our data.

Since this is a work in progress, I will continue to update this article as I explore other graphs. In the meantime, stay tuned for Part 3, where I will discuss prototyping and user testing!