Designing a mental model for Computer Vision applications

A case study on graph interfaces

At Cogniac we are building a platform to let people create custom computer vision models without needing to code. The workflow is simple:

  1. Create an application to classify images into groups (e.g. cats and dogs).
  2. Upload example images.
  3. Give feedback to train a deep neural network model.
  4. Distinguish cats from dogs in a matter of minutes.

It worked very well for a simple cat/dog classifier, but real-life applications are much more complex; one model can’t answer every problem. To solve this issue, applications can be chained together to create complex workflows.

Let’s say you want to classify dog breeds when given a set of cat and dog images. You can first separate cats from dogs and then classify dog breeds. Using this method, the second classifier has less variability to work with, so it’s faster to train. It also enables us to re-use the cat/dog classifier for other purposes.

Two applications linked together to classify animals (just cats and dogs) into cats and dog breeds
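To make the chaining idea concrete, here is a minimal sketch in Python. It is not Cogniac's implementation: `chain`, the two stand-in classifiers, and the dictionary used as an "image" are all hypothetical names invented for illustration.

```python
def chain(first, second, gate_label):
    """Run `first`; only images it labels `gate_label` are sent on to `second`."""
    def classify(image):
        label = first(image)
        return second(image) if label == gate_label else label
    return classify

# Stand-in classifiers (assumption): a real application would run a trained model.
def cat_dog_classifier(image):
    return image["species"]

def dog_breed_classifier(image):
    return image["breed"]

classify_animal = chain(cat_dog_classifier, dog_breed_classifier, gate_label="dog")

print(classify_animal({"species": "cat"}))                     # cat
print(classify_animal({"species": "dog", "breed": "poodle"}))  # poodle
```

The second classifier never sees a cat, which is why it has less variability to deal with.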

As the only Designer and Web Developer at Cogniac, I was asked to create a web interface to make chaining applications possible.

Context

Designing for internal teams

Cogniac is at an early stage. We are still trying to find the right application for our technology. When a client approaches us, we work with them to understand their problem and find a solution using our internal tools. The web-app is the primary interface through which Cogniac employees build, train, and monitor computer vision models. Our employees are the main users of the web-app.

The existing information architecture

Cogniac’s building blocks are “Applications” and “Channels”:

  • Applications are like functions: they take images in, process them and send images out. Applications can detect a region of interest, classify, relight, or crop images. Most applications use deep neural network models.
  • Channels are repositories for similar images: they move images from one application to another.
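As a rough sketch, the two building blocks could be modeled like this in Python. The class and field names (`Channel`, `Application`, `input_channel`, `outputs`, `downstream_of`) and the breed channels are my own assumptions for illustration, not Cogniac's actual data model.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Channel:
    name: str

@dataclass
class Application:
    name: str
    kind: str                      # e.g. "classification", "crop"
    input_channel: Channel
    outputs: list[Channel] = field(default_factory=list)

def downstream_of(app, apps):
    """Applications that read from one of `app`'s output channels."""
    return [other for other in apps if other.input_channel in app.outputs]

animals, cats, dogs = Channel("animals"), Channel("cats"), Channel("dogs")
cat_dog = Application("Cat/Dog Classifier", "classification", animals, [cats, dogs])
breeds = Application("Dog Breed Classifier", "classification", dogs,
                     [Channel("labrador"), Channel("poodle")])  # hypothetical breeds

linked = downstream_of(cat_dog, [cat_dog, breeds])
print([a.name for a in linked])  # ['Dog Breed Classifier']
```

Two applications are linked whenever one reads from a channel the other writes to, so the links can be derived from the channels rather than stored separately.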

The existing web-app did a fine job at showing everything related to an application, but each application lived in its own silo. Once you re-use a channel in a different application, it is difficult to visualize how those applications are linked to one another. Here is the corresponding information architecture:

The navigation between applications breaks down quickly.

Cat/Dog Classifier and Dog Breed Classifier use the same “dog” channel, but navigating from one application to the other is not possible unless going back to the list of applications.

The 3 stages of a computer vision application

Experimentation and early client projects told us that computer vision applications go through three stages:

  1. Creation. A client contacts us with a problem they think computer vision can solve. Most people, especially at old-fashioned companies, have a hard time translating their goal into actionable steps. We help them find the right computer vision application and set it up for them.
  2. Feedback. After setup, users need to provide example images. The more training images (also known as labeled images) an application receives, the better it performs. This is the most time-consuming stage. Some applications require thousands of training images to reach decent performance.
  3. Monitoring. Once the application has reached a reasonable performance, clients will use the Cogniac API to upload images and receive real-time predictions. The team will monitor the application performance and give additional feedback if needed.

An engineer’s approach to graphs

It took a few iterations to arrive at the final design.

From an engineering perspective, applications and channels are nothing more than nodes on a graph. Representing them as a graph is as accurate and complete as it gets. I came up with a graph that strangely resembles a UML diagram or Blender’s node editor. Each application is its own bubble with an icon to represent the type. Channels, represented as squares, connect applications together.

I was inspired by drawing and diagramming tools. I got rid of the nested pagination and exposed the objects in the center of the page. The large fullscreen canvas is the main focus. Visualizing a complex linkage of applications and channels feels empowering.

As the user interacts with the graph, contextual panels slide up. They bring additional information while staying unobtrusive.

That was easy, right? Let’s implement it! Hold on, we missed a few important things. What happens when there are 100 applications displayed at the same time? How do we teach people to use it? How do we guide users to accomplish their goal?

The problem with graphs

Engineers love graphs. If you want to impress someone, just tell them you created a graph visualization composed of 10k nodes that uses a physics simulation engine!

Now that’s cool! Is it useful? I am not certain.

The truth is that graphs are not used that often. Only niche applications use graphs: mind mapping, electrical circuits, or IT network applications, to name a few. They mostly look like digital drawing boards. Those tools are dedicated to experts who need the freedom to experiment.

Diagramming programs have been around for ages

Our world is full of graph-based systems, yet user interfaces don’t expose them. You navigate the web, one link at a time. Your LinkedIn network is just a list of contacts. Why is that?

Graphs offer too much flexibility and not enough guidance.

Graphs suffer from the following flaws:

Hard to learn. New users don’t yet have the right mental model to understand how pieces fit together, and a graph doesn’t provide any context. Graphs are designed for experienced users who are already familiar with the system.

Poor error prevention. Users will experiment and a good interface should prevent users from making mistakes. Unfortunately, graphs are able to represent any kind of linkage, even the most unrealistic ones.

Loops don’t make any sense whatsoever, yet the graph representation allows them
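An interface can refuse such a link before it is ever drawn. Here is a minimal sketch of that check in Python, assuming the links are stored as a dictionary from each application to the applications downstream of it. The function name `would_create_loop` and that storage format are hypothetical.

```python
def would_create_loop(links, source, target):
    """Return True if adding the link source -> target would close a loop.

    `links` maps each application to the set of applications downstream of it.
    A loop appears exactly when `source` is already reachable from `target`.
    """
    stack, seen = [target], set()
    while stack:
        node = stack.pop()
        if node == source:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(links.get(node, ()))
    return False

links = {"Cat/Dog Classifier": {"Dog Breed Classifier"}}

# Feeding the breed classifier back into the cat/dog classifier closes a loop.
print(would_create_loop(links, "Dog Breed Classifier", "Cat/Dog Classifier"))  # True
# Adding a new downstream application does not.
print(would_create_loop(links, "Cat/Dog Classifier", "Cat Breed Classifier"))  # False
```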

Doesn’t scale. Above 100 nodes, it gets very messy. A graph of only 10 nodes already has 45 edges if every node is connected to every other node. To reduce confusion, we want to minimize edge crossings, group related nodes together, and keep the size of the graph as small as possible. There is some research on “elegant graph layouts”, but there is no ideal solution.
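The 45 comes from counting pairs: a graph in which every one of n nodes is connected to every other has n(n − 1)/2 edges. A quick Python check (the function name is mine) shows how fast that grows:

```python
def complete_graph_edges(n: int) -> int:
    """Number of edges when each of n nodes is linked to every other node."""
    return n * (n - 1) // 2

print(complete_graph_edges(10))   # 45
print(complete_graph_edges(100))  # 4950
```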

No clear intent. Graphs don’t tell users what to do; they have to figure it out on their own. This can be empowering as well as discouraging. Graphs are not well suited for task-driven applications.

Poor information hierarchy. It’s the perfect tool for data exploration, but it makes it hard to find what you are looking for. Imagine trying to find your favorite show on Netflix if it were a graph: you would probably get lost quickly.


How do we improve graphs?

Making graphs more usable means removing some flexibility by adding constraints. Constraints reduce the number of decisions users have to make. They lower the learning curve by telling users how objects relate to each other and how one uses an object. Constraints are defined by the problem you are trying to solve.

A graph is not a bad choice in itself; rather, it’s the use the designer makes of it that’s inappropriate

There are many types of graphs with more or less freedom: directed graphs, trees, complete graphs, etc. The key is to find the right representation.

Coming back to our example, we can reduce the problem space by taking into account the following insights:

  • Direction matters. Channels carry images from one application to another.
  • No loops. Images can’t pass through an application more than once; otherwise they would loop forever!
  • Branching outward. Most applications have one input and up to ten outputs, but most of the time users classify images from a source (e.g. animals) into 2–3 categories (e.g. cats/dogs).
  • Few nodes and edges. Users only have 5–50 applications and 10–100 channels.
  • Same starting node. Most applications use the same input channel (e.g. the same image folder).
  • Flat structure. Most applications are not chained from one to the other. Just a few applications re-use the output channel of another application.

What we just described is a directed tree. We can represent it in layers, with each child application depending on its parent application.
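A small Python sketch of that layering, assuming the tree is stored as a mapping from each application to its parent (`None` for the root). The `parent_of` format, the `layers` function, and the "Cat Breed Classifier" are hypothetical, and the code relies on the no-loops constraint above to terminate.

```python
from collections import defaultdict

def layers(parent_of):
    """Group applications by depth: the root in layer 0, its children in layer 1, etc."""
    def depth(app):
        d = 0
        while parent_of.get(app) is not None:
            app = parent_of[app]
            d += 1
        return d

    by_depth = defaultdict(list)
    for app in parent_of:
        by_depth[depth(app)].append(app)
    return [sorted(by_depth[d]) for d in sorted(by_depth)]

parent_of = {
    "Cat/Dog Classifier": None,
    "Dog Breed Classifier": "Cat/Dog Classifier",
    "Cat Breed Classifier": "Cat/Dog Classifier",  # hypothetical sibling
}

for depth_index, apps in enumerate(layers(parent_of)):
    print(depth_index, apps)
# 0 ['Cat/Dog Classifier']
# 1 ['Cat Breed Classifier', 'Dog Breed Classifier']
```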

It’s an improvement, but we can do better. So far we have talked about the objects themselves and their relationships, but we missed the bigger picture. Why are we building this? What is the user’s goal?

Cogniac is a platform that allows people to solve real world problems using computer vision. To “solve a problem” is key here. Users have a goal they are trying to accomplish. They are using computer vision tools to assist them. They also know their environment so it’s our job to bridge the gap between their current state and the end goal. Here are a few examples:

  • An insurance company wants to assess the damage to a roof after a storm. Input: roof images, output: number of broken tiles.
  • A car manufacturer wants to detect defects on a critical engine part. Input: images of parts, output: pass/fail signal to keep or eject the part.
  • An airport wants to increase safety. Input: security camera footage, output: alert when an abandoned bag is detected.

Everything revolves around the output. In our case, the output is the last application in the chain. On the other hand, the input is the channel clients upload their images to. The only thing that really matters is what’s directly impacting the output (i.e. the upstream applications). This means we can reduce the tree to a single branch:

Only applications 1 and 4 impact the outputs H and J

If we do the same for each final application, we can transform the original tree into 4 linear pipelines:

Four goals means four pipelines. We have reduced the overly complex graph to just a few linear pipelines.
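The reduction can be sketched in a few lines of Python: find the final applications (those with nothing downstream), then walk upstream from each one to the root. The `parent_of` mapping and the function names are assumptions for illustration, and the tree below is a made-up example with four final applications, not the exact one from the figures.

```python
def final_applications(parent_of):
    """Applications that are nobody's parent, i.e. the last step of a chain."""
    parents = set(parent_of.values())
    return [app for app in parent_of if app not in parents]

def pipeline_for(final_app, parent_of):
    """Walk upstream from a final application to the root, then reverse."""
    chain = []
    app = final_app
    while app is not None:
        chain.append(app)
        app = parent_of.get(app)
    chain.reverse()
    return chain

# Hypothetical tree: application "1" is the root, "2" has two children.
parent_of = {"1": None, "2": "1", "3": "1", "4": "1", "5": "2", "6": "2"}

pipelines = [pipeline_for(app, parent_of) for app in final_applications(parent_of)]
for p in pipelines:
    print(" -> ".join(p))
# 1 -> 3
# 1 -> 4
# 1 -> 2 -> 5
# 1 -> 2 -> 6
```

Each pipeline only contains the applications that sit upstream of its goal; everything else in the tree is left out.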

Encoding

So far we have talked about the general structure of the graph/pipeline. The representation of the nodes and edges is just as important, if not more so. I iterated on the design a few times until I found a good solution:

The last iteration looks nothing like a traditional graph. It is much more constrained, but it is also more meaningful.


I lied, the pipeline is not always linear. A goal can depend on multiple applications at once. Take the following example:

To make the “Bird filter” work better we just want to give it animal or pet images

We want to categorize birds from two existing channels, Pets and Animals. However, pet and animal images are generated by two existing applications (Pet filter and Animal filter). It doesn’t really look like a single linear pipeline anymore.

Nevertheless, because it is such a rare case, we decided to represent it as a pipeline. Applications and channels on the same level are aggregated.

It is a bit harder to understand which application is linked to which channel. However, it displays all applications and channels impacting the final goal in a reduced amount of space.
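Here is a minimal sketch of that aggregation in Python, using the Bird filter example. I assume each node (application or channel) lists what feeds into it; the `upstream` mapping and the `aggregate_levels` function are hypothetical names.

```python
def aggregate_levels(goal, upstream):
    """Group everything upstream of `goal` by its distance from the goal.

    Returns the levels ordered from the farthest inputs down to the goal, so
    nodes at the same distance end up side by side in one pipeline step.
    """
    levels, frontier, seen = [], [goal], {goal}
    while frontier:
        levels.append(frontier)
        next_frontier = []
        for node in frontier:
            for up in upstream.get(node, []):
                if up not in seen:
                    seen.add(up)
                    next_frontier.append(up)
        frontier = next_frontier
    levels.reverse()
    return levels

upstream = {
    "Bird filter": ["Pets", "Animals"],   # the two input channels
    "Pets": ["Pet filter"],               # channel produced by an application
    "Animals": ["Animal filter"],
}

for level in aggregate_levels("Bird filter", upstream):
    print(level)
# ['Pet filter', 'Animal filter']
# ['Pets', 'Animals']
# ['Bird filter']
```

The output loses the information about which filter feeds which channel, which is exactly the trade-off described above.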

Putting everything together

The new pipeline is a central element of the redesign. It serves two primary functions: displaying application metrics and navigating between applications and channels of the same pipeline.

Learnings

After I finished creating the mocks, I implemented the design in Ember.js and deployed the new pipeline view a few weeks later. The team gave me very positive feedback. The pipeline overview made model building much easier.

I also learned a lot during this process. Here are a few points I would like to share:

  • Think about edge cases but don’t get caught up in the details. It’s easy to put too much emphasis on edge cases and end up with a cluttered design that puts everything on the same level. It often happens when you are thinking too much about the technology. Instead, prioritize the more important and frequently used features.
  • Back up every design decision. I like to see interfaces as physical products. Cheap and poorly designed products wear out as people use them. Thoughtful design decisions are a measure of the quality of an interface. Bad design will be bulky and full of last-minute patches.
  • You can always do better. It’s easy to get attached to a design concept and find excuses to not change it. “Look, it fits the user mental model perfectly!” “Yes, but this whole part is missing.” Throw everything away and start from scratch. I find it liberating sometimes!

Thanks for reading! Have you dealt with graphs? Let me know in the comments what challenges you faced.