Thinking Like the Machine: Using Metrics to Generate and Quantify UX Design

Owen Schoppe
Salesforce Designer
Jul 24, 2020 · 7 min read

In the popular imagination, there’s still a wide gulf between machine-generated content and human creativity. But that gulf is narrowing as architects, product designers, filmmakers, and a wide range of other creative professionals use generative design to create and iterate on design possibilities.

UX design is no exception. Over the last year, Salesforce’s Einstein Designer team has been working to create models and algorithms that generate design variations, which human designers can adapt or riff on, speeding the design process and helping designers stretch their creative muscles. Our vision is to create multiple machine-learning models that help people and companies create better designs, faster. Even professional designers benefit from such tools, which free them to focus on user needs and higher-level problem solving, improving the user experience.

Impressive goals. But first we needed to build the models — and to do that, we had to think like the machine.

The Challenge of Quantifying Design

When taking on a machine learning problem, one of the first questions to ask is, “What data do we have?” And the kind of data we needed — codified design rules presented in ways the machine can understand — wasn’t exactly hanging out in a large, publicly available database, ready to hop into our models. We had to create it ourselves. As we thought more deeply about how to generate training data about design, we soon realized that what we needed were quantitative ways to measure design.

OK, measuring a creative process that blends the intuitive and the practical. No problem, right?

One natural place to look, we thought, would be the design curriculum, where foundational principles such as Gestalt theory could help us measure aspects of design. In practice, however, we found these principles to be too fine-grained, and too tied to human perception. It’s easy for a human to identify a figure on a ground, for example, but it turns out to be ridiculously hard to create a computational model of that phenomenon. Just look at the decades of work in computer vision it’s taken to get computers to identify objects in photos with some level of accuracy.

What we needed instead, we realized, were metrics that are measurable in the design itself.

A Search for Metrics

So who’s successfully measuring aspects of UX design? That question led us to the accessibility community, which has developed a robust set of metrics for evaluating design standards. For example, color contrast is a well-defined metric that quantifies the readability of any given piece of text. But while accessibility metrics are fantastic starting points, and must be integrated into any generative design system, they aren’t sufficient to fully measure a design.
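
To make “well-defined” concrete: the WCAG contrast ratio is computed from a published luminance formula, so it can be checked directly against a design’s colors. A minimal sketch in Python:

```python
def relative_luminance(rgb):
    """Relative luminance of an sRGB color, per the WCAG 2.x definition."""
    def linearize(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(foreground, background):
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)

# WCAG AA requires at least 4.5:1 for normal body text.
print(contrast_ratio((0, 0, 0), (255, 255, 255)))        # 21.0
print(contrast_ratio((118, 118, 118), (255, 255, 255)))  # ~4.5, borderline
```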

Design is fundamentally a method for organizing information for communication. So to quantify a design, we need metrics that capture how its information is organized. With this idea as our guide, the Einstein Designer team identified three major aspects of design that can be measured and used to express that design’s intent: grouping, prominence, and coding.

Grouping is the practice of sorting aspects of a design into groups based on their relationships. For example, we might group a title and body text, or an image and its caption. Together these two groups combine to form a larger grouping — an article tile.

Grouped article tiles, each with nested groups of design elements
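
One way to picture this: a grouping is simply a tree. A minimal sketch, with hypothetical element names:

```python
# A hypothetical article tile expressed as nested groups.
# Leaf strings are design elements; dict keys are groups.
article_tile = {
    "tile": [
        {"text_group": ["title", "body"]},
        {"media_group": ["image", "caption"]},
    ]
}
```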

Prominence describes perceptual relationships between elements and identifies a hierarchy of perception, in which the controlling element is perceived first. These prominence relationships can be represented as a tree-like information hierarchy. For example, a title should ideally be seen before its supporting paragraph, which in turn is seen before the associated image caption.

Tile elements, numbered in order of prominence

Coding is the use of symbolic references to associate an element with a class of elements. Coding creates meaning without using words. For example, a blue rectangle with rounded corners and white text tells users that a design element is a button. In early UX design, skeuomorphic elements used the design language of the analog world to teach users what they meant.

Representative button designs of the last 20 years

Measuring Metrics

OK, now that we’ve identified these three important metrics, how do we measure them?

Grouping

With grouping, our measurement work was partially done — HTML is fundamentally a tree structure that describes the grouping of elements. The problem here: Data collected from the web is noisy. There are hundreds of ways to build a website, and web design techniques are becoming ever more complex and diverse. And while it would be awesome if every web page featured perfect semantic markup, the reality is more sobering. So on top of the HTML structure of each web page, we analyzed pixel distance, element alignment, and a handful of other heuristics. With this data set, we built an initial grouping metric that we used to bootstrap our data collection process.
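
To give a flavor of those heuristics, here’s a simplified sketch of a pairwise grouping signal; the bounding boxes, thresholds, and tolerances are illustrative assumptions, not the values we actually used:

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Rendered bounding box of a page element, in pixels."""
    x: float
    y: float
    w: float
    h: float

def vertical_gap(a: Box, b: Box) -> float:
    """Pixel distance between two stacked elements (0 if they overlap)."""
    top, bottom = (a, b) if a.y <= b.y else (b, a)
    return max(0.0, bottom.y - (top.y + top.h))

def left_aligned(a: Box, b: Box, tol: float = 2.0) -> bool:
    """Elements sharing a left edge, within a small tolerance."""
    return abs(a.x - b.x) <= tol

def likely_grouped(a: Box, b: Box, max_gap: float = 16.0) -> bool:
    """Crude grouping signal: close together and left-aligned."""
    return vertical_gap(a, b) <= max_gap and left_aligned(a, b)

title = Box(x=24, y=100, w=300, h=28)
body = Box(x=24, y=136, w=300, h=80)
print(likely_grouped(title, body))  # True: small gap, shared left edge
```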

When measuring any design metric, you also need a way to quantify and analyze your accuracy, so you know how to improve the metric. We validated our direction by having human testers identify related text elements. The team behind Bricolage used a similar method, asking multiple humans to manually label data, then comparing agreement between individuals to create a baseline accuracy for the metric. Our human testers agreed on roughly 70% of a design’s labels, which set the benchmark for the automated solution. In essence, this is a way to separate the objective and subjective aspects of design.
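
The core of that baseline calculation is simple. A minimal sketch of pairwise agreement between two labelers, with hypothetical “same group?” judgments (our real protocol involved more annotators and many more comparisons):

```python
def pairwise_agreement(labels_a, labels_b):
    """Fraction of items two annotators labeled the same way."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical "same group?" judgments for six element pairs.
annotator_1 = [1, 1, 0, 1, 0, 1]
annotator_2 = [1, 0, 0, 1, 0, 0]
print(pairwise_agreement(annotator_1, annotator_2))  # ~0.67
```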

Matrix and arc graph representing design element grouping. Visualizations by Moritz Stefaner.

Prominence

Like grouping, prominence is defined by a combination of subjective and objective qualities. We started by evaluating text prominence, developing a scoring method designed to sort text styles from most to least prominent. Using font size, font weight, and text/background contrast, we created a useful approximation; while individual text comparisons may be debatable, most people would find the text at the top of the list more prominent than that at the bottom. Most important, a rudimentary metric like this one can be used to bootstrap more sophisticated models. Here again, we created tools that let human testers compare the prominence of two pieces of text, and used the results to generate a baseline accuracy metric.

Sorted font list. Visualizations by Moritz Stefaner.
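
As a concrete illustration of that kind of scoring, here’s a toy version; the weights and example styles are assumptions, not the ones behind our model:

```python
def prominence_score(font_size, font_weight, contrast_ratio):
    """Toy prominence score: bigger, bolder, higher-contrast text ranks
    higher. The weights are illustrative, not our trained model's."""
    return 0.5 * font_size + 0.005 * font_weight + 2.0 * contrast_ratio

# (name, font size in px, font weight, text/background contrast ratio)
styles = [
    ("caption", 12, 400, 4.6),
    ("body", 16, 400, 7.0),
    ("headline", 32, 700, 12.0),
]

# Sort text styles from most to least prominent.
for name, size, weight, contrast in sorted(
    styles, key=lambda s: prominence_score(*s[1:]), reverse=True
):
    print(name)  # headline, body, caption
```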

This process revealed one immediate limitation: Prominence is highly contextual. Placed side by side, a bold font is more visible than a thin font. But a thin font that is large enough, with enough space around it, can be more eye-catching than bold text. Color is another factor. Generally, red is more eye-catching than other colors, but on a site where red is used frequently, a complementary green will stand out. We can use computer vision models to estimate visual salience and account for many such contextual effects.

Heat map of an advertising photo measuring visual prominence
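
One way to produce such an estimate with off-the-shelf tools is OpenCV’s saliency module (in the opencv-contrib package), which implements the classic spectral-residual model; a sketch, with a hypothetical screenshot file:

```python
import cv2

# Requires opencv-contrib-python; spectral residual is a classic
# bottom-up saliency model, standing in for fancier learned ones.
image = cv2.imread("screenshot.png")  # hypothetical page screenshot
saliency = cv2.saliency.StaticSaliencySpectralResidual_create()
success, saliency_map = saliency.computeSaliency(image)

# saliency_map is a float image in [0, 1]; bright regions are the
# parts of the design predicted to draw the eye first.
heatmap = (saliency_map * 255).astype("uint8")
cv2.imwrite("saliency_heatmap.png", heatmap)
```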

Coding

With coding, the play of subjectivity and objectivity is driven by viewer experience. For example, with the transition from skeuomorphic to flat design, users have become accustomed to a new UI language. Coding is highly dependent on the design zeitgeist: For someone to think that something looks like a button, they first need to have experienced similar buttons. This raises the possibility that, as in fast fashion, visual designers could plot the evolution of button design over the decades, then project forward to predict the next big style.

In one of our first attempts at measuring coding, we took screenshots of all Salesforce Lightning Design System (SLDS) components, clustered them based on visual appearance, and used our knowledge of what each one is and does to generate a rich labeled dataset. We can use this dataset to create models that generate designs for new components that look like existing components with similar functionality. For example, when generating designs with buttons, a model could ensure that the new buttons appear similar to those in the SLDS system, without explicitly limiting the generator to their templates.

Visual map of component clusters for use in coding. Visualization by Moritz Stefaner.
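
A minimal sketch of that clustering step, using a pretrained CNN for embeddings and k-means for the clusters; the model choice, file names, and cluster count are all illustrative assumptions:

```python
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import KMeans
from torchvision import models, transforms

# Embed each component screenshot with a pretrained CNN, then cluster
# the embeddings so visually similar components land in the same group.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()  # keep the 512-d penultimate features
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical screenshot files, one per component.
paths = ["button.png", "checkbox.png", "toast.png", "alert.png"]
with torch.no_grad():
    embeddings = np.stack([
        model(preprocess(Image.open(p).convert("RGB")).unsqueeze(0))
        .squeeze(0)
        .numpy()
        for p in paths
    ])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(dict(zip(paths, labels)))  # component -> visual cluster id
```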

A Foundation for Quantifying Design Success

These three metrics — grouping, prominence, and coding — lay a foundation for quantifying design, and eventually for building machines that know how to design for humans. The resulting models crystallize the objective aspects of design while leaving room for designers’ and users’ subjective perceptions. Building on concepts at the root of accessibility, they offer concrete metrics and tools that people and companies can use to create humanist, human-friendly designs that communicate their function clearly and consistently.

Thanks

Sönke Rohde, Jessica Lundin, Michael Sollami, Tim Sheiner, Alan Ross, Brian Lonsdorf, David Woodward

Follow us at @SalesforceUX.

Want to work with us? Contact us at uxcareers@salesforce.com.

Check out the Salesforce Lightning Design System
