Understanding User Interfaces with Screen Parsing

Jason Wu
Published in ACM UIST · Oct 22, 2021

This blog post summarizes our paper Screen Parsing: Towards Reverse Engineering of UI Models from Screenshots, which was published in the proceedings of UIST 2021.

The Benefits of Machines that Understand User Interfaces

A screenshot of the Apple Notes home screen is fed into our Screen Parsing system which predicts its hierarchy. This result is visualized, revealing, for example, that several items exist together in a list.
Screen Parsing aims to uncover the hidden structures of UIs from their visual appearance. By modeling element relationships, we can answer questions about what and how information is being presented.

Machines that understand and operate user interfaces (UIs) on behalf of users could offer many benefits. For example, a screen reader (e.g., VoiceOver and TalkBack) could facilitate access to UIs for blind and visually impaired users, and task automation agents (e.g., Siri Shortcuts and IFTTT) could allow users to automate repetitive or complex tasks with their devices more efficiently. These benefits are gated on how well these systems can understand and interact with the underlying applications. Many rely on the availability of UI metadata (e.g., provided by accessibility services) to function and fail when this information is unavailable. To maximize the range of apps they support and the situations in which they are helpful, these systems can benefit from understanding UIs solely from their visual appearance.

UIs are, unsurprisingly, designed for consumption by human beings, and it can be difficult for machines to understand what functionality is present in a UI, how its different components work together, and how it can be operated to accomplish some goal. Recent efforts have focused on predicting the presence of an app’s on-screen elements and semantic regions solely from its visual appearance. These have enabled many useful applications, such as allowing assistive technology to work with inaccessible apps and supporting example-based search for UI designers. However, they constitute only a surface-level understanding of UIs, as they primarily focus on extracting what elements are on a screen and where they appear spatially. To further advance the UI understanding capabilities of machines and support more valuable tasks, we focus on modeling higher-level relationships by predicting UI structure.

Achieving Better Understanding of UIs through Hierarchy

Example of an input screenshot and corresponding screen parse. The screen parse is a tree that connects all visible elements on the screen with edges describing their relationship.
An example of an input screen (left) and the corresponding UI hierarchy (right). The tree contains all of the visible elements on the screen (the output is complete), groups them together to form higher-level structures (the output is abstractive), and its nodes can be used to reference UI elements (the output is grounded).

Structural representations enhance the understanding of many types of content by capturing higher-level semantics. For example, scene graphs enrich visual scenes by making sense of interactions between individual objects and parse trees disambiguate sentences by analyzing their grammar. Similarly, structure is a core property of UIs reflected in how they are constructed (i.e., stacking together views and widgets) and used (i.e., how users perceive and interact with UIs). Modeling element relationships can help machines perceive UIs as humans do — not as a set of elements but as a coordinated presentation of content.

We introduce the problem of screen parsing, which we use to predict structured UI models (a high-level definition of a UI) from visual information. We focus on generating an app screen’s UI hierarchy (i.e., presentation model), which specifies how UI elements are grouped and rendered on the screen. UI hierarchies have the following properties (a minimal data-structure sketch follows the list):

  • Complete — the output is a single directed tree that spans all of the UI elements on a screen
  • Grounded — nodes in the output reference on-screen elements and regions
  • Abstractive — the output can group elements (potentially more than once) to form higher-level structures
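
For concreteness, here is a minimal sketch of how such a hierarchy might be represented in code. The `UINode` structure, field names, and labels are illustrative assumptions, not the paper’s actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class UINode:
    # Leaf nodes are grounded: they reference an on-screen element by its
    # bounding box (left, top, right, bottom) and detected type.
    bbox: Optional[Tuple[float, float, float, float]] = None
    element_type: Optional[str] = None          # e.g., "Text", "Button"
    # Intermediate ("container") nodes are abstractive: they group children
    # into higher-level structures such as collections or tab bars.
    container_type: Optional[str] = None        # e.g., "Collection", "Tab Bar"
    children: List["UINode"] = field(default_factory=list)

def is_complete(root: UINode, num_detected_elements: int) -> bool:
    """The output is complete if every detected element appears as a leaf."""
    leaves, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
        else:
            leaves.append(node)
    return len(leaves) == num_detected_elements
```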

Predicting UI Hierarchy from a Screenshot

Diagram of our system’s three steps to screen parsing. First, UI Element Detection detects on-screen elements using object detection. Next the UI Hierarchy is predicted which results in a tree-like structure. Finally, container nodes in the tree are labeled as groups.
An overview of our implementation of screen parsing. To infer the structure of an app screen, our system (i) detects the location and type of UI elements from a screenshot, (ii) predicts a graph structure that describes the relationships between UI elements, and (iii) classifies groups of UI elements.

To predict UI hierarchy from a screenshot, we built a system to

  1. detect the location and type of UI elements from a screenshot,
  2. predict a hierarchical structure that describes the relationships between them, and
  3. classify semantic groups.

The first step of our system processes a screenshot image using an object detection model (Faster-RCNN), which produces a list of UI element detections. The output is post-processed using standard techniques such as confidence thresholding and non-max suppression. This list tells us what elements are on the screen and where they are, but it does not provide any information about their relationships.
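
As a rough illustration of this step (not the exact model, weights, or thresholds from the paper), the detection and post-processing could look like the following sketch, which uses torchvision’s Faster-RCNN as a stand-in for a detector trained on UI screenshots.

```python
import torch
import torchvision
from torchvision.ops import nms
from torchvision.transforms.functional import to_tensor
from PIL import Image

# COCO-pretrained stand-in; the paper's detector is trained on UI screenshots.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_elements(image_path: str,
                    score_thresh: float = 0.5,
                    iou_thresh: float = 0.5):
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        output = model([image])[0]      # dict with "boxes", "labels", "scores"
    # Confidence thresholding.
    keep = output["scores"] >= score_thresh
    boxes = output["boxes"][keep]
    labels = output["labels"][keep]
    scores = output["scores"][keep]
    # Non-max suppression to remove duplicate detections of the same element.
    keep = nms(boxes, scores, iou_thresh)
    return boxes[keep], labels[keep], scores[keep]
```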

Next, we use a stack-pointer parsing model to generate a tree structure representing UI hierarchy. Like other transition-based parsers, our model incrementally predicts a tree structure by generating a sequence of actions that build connections between UI elements using a pointer mechanism. We made two modifications to adapt the parsing model for UI hierarchies. First, we injected a “container” token into the input, allowing the model to create multi-level groupings. Second, we trained the model using a dynamic oracle to reduce exposure bias since the multi-level nature of UI hierarchies leads to exponentially more “optimal” action sequences that produce the same output.
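A heavily simplified sketch of this decoding loop is shown below. The `next_action` callable stands in for the learned stack-pointer model, which scores candidates at each step; the real action set, features, and architecture are more involved than this.

```python
from typing import Callable, Dict, List, Optional

CONTAINER = -1  # special "container" token that enables multi-level groupings

def parse(num_elements: int,
          next_action: Callable[[int, List[int]], Optional[int]]):
    """Greedy transition-based decoding with a pointer over candidates.
    Given the node on top of the stack and the remaining elements,
    `next_action` returns an element index to attach, CONTAINER to open a
    new group, or None to pop (close the current subtree)."""
    children: Dict[int, List[int]] = {}
    remaining = list(range(num_elements))
    next_id = num_elements                    # ids for newly created containers
    root = next_id; next_id += 1
    children[root] = []
    stack = [root]
    while remaining and stack:
        head = stack[-1]
        action = next_action(head, remaining)
        if action is None:
            stack.pop()                       # finished this subtree
        elif action == CONTAINER:
            node = next_id; next_id += 1      # open an intermediate node
            children[node] = []
            children[head].append(node)
            stack.append(node)
        else:
            children[head].append(action)     # attach a detected element
            remaining.remove(action)
    return root, children
```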

Finally, we apply a set classification model to label containers (i.e., intermediate nodes) based on their descendants. We defined seven container types (including an “Other” class) that represent common groupings such as collections (e.g., lists, grids), tables, and tab bars.
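One way such a set classifier could be implemented is a Deep-Sets-style model that pools descendant features before classifying; the label list, feature encoding, and architecture below are illustrative assumptions rather than the exact model from the paper.

```python
import torch
import torch.nn as nn

# Illustrative labels; the paper defines seven container types including "Other".
CONTAINER_TYPES = ["Collection", "Table", "Tab Bar", "Navigation Bar",
                   "Toolbar", "Segmented Control", "Other"]

class ContainerClassifier(nn.Module):
    """Embed each descendant (type one-hot + bounding box), pool with an
    order-invariant mean, then classify the container."""
    def __init__(self, num_element_types: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(num_element_types + 4, hidden), nn.ReLU())
        self.classify = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, len(CONTAINER_TYPES)))

    def forward(self, descendant_features: torch.Tensor) -> torch.Tensor:
        # descendant_features: (num_descendants, num_element_types + 4)
        pooled = self.embed(descendant_features).mean(dim=0)
        return self.classify(pooled)          # logits over container types
```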

The Apple Notes app is fed through Screen Parser. UI elements are iteratively inserted into a tree structure by our model, then intermediate nodes in the tree are assigned labels such as “Collection” and “Button.” We show a gallery of 4 screenshots and their UI hierarchies.
Screen Parser uses a multi-step process to infer the UI hierarchy from a screenshot. Element detections are iteratively grouped together using a parsing model that produces a sequence of special actions called transitions (transition-based parsing).

We trained our models on two mobile UI datasets: (i) AMP dataset of ~130,000 iOS screens, and (ii) RICO, a publicly available dataset of ~80,000 Android screens. Both datasets were collected by crowdworkers who installed and explored popular apps across 20+ categories (in some cases excluding certain ones such as games, AR, and multimedia) on the iOS and Android app stores. Each dataset contains screenshots, annotated screens, and a type of metadata called a view hierarchy. The view hierarchy is an artifact generated during UI rendering that describes which interface widgets are used and “stacked” together to produce the final layout. Not all screens in our dataset contain this metadata (e.g., apps created using third-party UI toolkits or game engines). We apply heuristics to detect and exclude examples with missing or incomplete view hierarchies. The view hierarchies are similar to the presentation model we aim to predict, with a few differences, so we transform them into our target representation by applying graph smoothing, filtering, and element matching between different data sources.

More details about our machine learning models and training procedures can be found in our paper.

Experiments

We used several metrics (e.g., F1 score, graph edit distance) to perform a quantitative evaluation of our system using the test split of our mobile UI datasets. Our main point of comparison was a heuristic-based approach to inferring screen groupings used in previous work, and we found that our system was much more accurate in inferring UI hierarchy. We also found that our final training procedure (i.e., using a dynamic oracle) led to significant performance gains (23%) over standard methods for training parsers.
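
To make the metrics concrete, here is a hedged sketch of how agreement between a predicted and a ground-truth tree could be scored. The exact metric definitions in the paper may differ; this version compares parent-child edges for F1 and uses networkx’s exact graph edit distance, which can be slow for large screens.

```python
import networkx as nx

def edge_f1(pred_edges: set, gold_edges: set) -> float:
    """F1 over parent-child edges; a simplified stand-in for the paper's metric."""
    if not pred_edges or not gold_edges:
        return 0.0
    tp = len(pred_edges & gold_edges)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_edges)
    recall = tp / len(gold_edges)
    return 2 * precision * recall / (precision + recall)

def hierarchy_edit_distance(pred_edges: set, gold_edges: set) -> float:
    """Graph edit distance between the predicted and ground-truth trees."""
    g_pred, g_gold = nx.DiGraph(pred_edges), nx.DiGraph(gold_edges)
    return nx.graph_edit_distance(g_pred, g_gold)
```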

Bar chart showing the performance of each system using F1 score and Edit Distance. Screen Parser Dynamic consistently outperforms all baseline systems.

Our system’s performance is affected by a number of factors, such as screen complexity and object detection errors. Accuracy is highest for screens with up to 32 elements and degrades beyond that point, in part due to the increased number of actions the parsing model must correctly predict. Complex and crowded screens introduce the additional difficulty of detecting small UI elements, which our analysis with a matching-based oracle (which computes the best possible matching between object detection output and the ground truth) shows to be a limiting factor.

Chart comparing the F1 score of Screen Parser and baselines across screens of increasing complexity. Performance of all systems declines for more complex screens, with the largest drop occurring after 32 elements. Complexity is divided into 5 buckets: 0 to 16 elements, 16 to 32, 32 to 48, 48 to 64, and more than 64 elements.

UI Hierarchy Facilitates and Improves Applications

We present a suite of example applications implemented using our screen parsing system. These applications show the versatility of our approach and how the predicted UI hierarchy facilitates many downstream tasks.

UI Similarity

Scatter plot of screens represented as 2-D points corresponding to similarity in embedding space. We show four pairs of screenshots where each pair is similar structurally, but has surface-level differences such as scaling, language, theme, and dynamic content.
We used our system to generate embedding vectors for different UI screens that capture their structure, instead of their surface-level appearance. We show that embeddings for the same screen are minimally affected by different display settings (e.g., scaling, language, theme, dynamic content).

Recent efforts in modeling UIs have focused on representing them as fixed-length embedding vectors that encode different properties such as layout, content, and style. Some downstream tasks, such as app crawling and information extraction, rely on characterizing screens by semantic structure rather than aesthetic appearance. The intermediate representation of our parsing model can be used to produce a screen embedding, which describes the hierarchical structure of an app. Our structural embedding can help minimize variations from display settings such as (i) scaling, (ii) language, (iii) theme, and (iv) small dynamic changes.
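
For example, given fixed-length embeddings taken from the parsing model’s intermediate representation (assumed as inputs here), screens can be compared with cosine similarity; this is only a sketch of how the embedding might be used downstream.

```python
import torch
import torch.nn.functional as F

def screen_similarity(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    """Cosine similarity between two fixed-length screen embeddings."""
    return F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()

def most_similar(query: torch.Tensor, corpus: torch.Tensor) -> int:
    """corpus: (num_screens, dim). Returns the index of the closest screen,
    e.g., for structure-based retrieval or app crawling."""
    sims = F.cosine_similarity(query.unsqueeze(0), corpus)
    return int(sims.argmax())
```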

Accessibility Improvement

An app screen processed with heuristics and our screen parsing model. We show that our model leads to fewer grouping errors, which is beneficial to screen reader navigation experience.
Element boxes are annotated using their navigation ordering, where the number represents how many swipes are needed to access the element when using a screen reader. While both results contain errors, in this case, Screen Parser correctly groups more elements, which decreases the number of swipes needed to access elements.

Recent work has successfully generated missing metadata for inaccessible apps by running an object detection model on the UI screenshot. Their approach to generating hierarchical data relies on manually defined heuristics that detect localized patterns between elements (e.g., a heuristic for finding image captions could search for text elements located near images). However, these approaches may sometimes fail because they do not have access to global information necessary for resolving ambiguities. We use the predicted UI hierarchy to compute the element groupings and navigation order for screen readers.
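
A minimal sketch of deriving a navigation order from a predicted hierarchy is shown below, reusing the illustrative `UINode` structure from earlier. Real screen readers apply additional rules (reading order, headings, element traits), so this only illustrates the idea.

```python
def navigation_order(root):
    """Depth-first traversal of the predicted hierarchy: leaves are announced
    in order, and siblings under the same container can be presented as one
    navigable group, reducing the number of swipes needed."""
    order = []
    def visit(node):
        if not node.children:
            order.append(node)            # a focusable element: one swipe
        else:
            for child in node.children:   # containers group their descendants
                visit(child)
    visit(root)
    return order
```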

Code Generation

A UI screenshot re-rendered on a tablet form factor using the code generated by our system. Some errors are visible in the generated output, such as incorrect element ordering.
By mapping nodes in the UI hierarchy to declarative view-creation methods, we can generate code for a UI from its screenshot. Here, a restaurant app is re-rendered on a tablet form-factor.

Existing approaches to code generation also rely on heuristics to detect a limited subset of container types. We employed a technique that compilers use to generate code from abstract syntax trees (the visitor pattern) and applied it to the predicted UI hierarchy. Specifically, we performed a depth-first traversal of the UI hierarchy using a visitor function that generates code based on the current state (current node and stack). The visitor function emits a SwiftUI control (e.g., Text, Toggle, Button) at every leaf node and a SwiftUI container (e.g., VStack, HStack) at every intermediate node. Additional parameters required by view constructors (e.g., label text, background color) were extracted using OCR and a small set of heuristics.
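
The sketch below illustrates the idea with a simplified visitor over the illustrative `UINode` tree from earlier; the control and container mappings, and the label placeholder, are stand-ins rather than the exact rules we used.

```python
# Hypothetical leaf-control templates; a real generator would cover more types.
LEAF_CONTROLS = {
    "Text":   'Text("{label}")',
    "Button": 'Button("{label}") {{ }}',
    "Toggle": 'Toggle("{label}", isOn: .constant(true))',
}

def emit_swiftui(node, indent: int = 0) -> str:
    """Depth-first visitor that emits SwiftUI-like source from a UI hierarchy."""
    pad = "    " * indent
    if not node.children:
        template = LEAF_CONTROLS.get(node.element_type, 'Text("{label}")')
        # In the real system, label text comes from OCR; use a placeholder here.
        label = getattr(node, "label", None) or "Label"
        return pad + template.format(label=label)
    # Containers map onto SwiftUI stacks; a real generator would pick VStack,
    # HStack, List, TabView, etc. from the container type and child layout.
    container = "TabView" if node.container_type == "Tab Bar" else "VStack"
    body = "\n".join(emit_swiftui(child, indent + 1) for child in node.children)
    return f"{pad}{container} {{\n{body}\n{pad}}}"
```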

Acknowledgments

Many people contributed to this work and gave feedback on this blog post: Xiaoyi Zhang, Jeff Nichols, and Jeff Bigham. This work was done while Jason Wu was an intern at Apple.

For more information about machine learning research at Apple, check out the Apple Machine Learning website.

Paper Citation

Jason Wu, Xiaoyi Zhang, Jeffrey Nichols, and Jeffrey P. Bigham. 2021. Screen Parsing: Towards Reverse Engineering of UI Models from Screenshots. In Proceedings of the 2021 ACM Symposium on User Interface Software & Technology (UIST). Association for Computing Machinery, New York, NY, USA, 1–10. https://doi.org/10.1145/3472749.3474763
