Exploring and Mining the Designs of Webpages, Apps, and Code

One beautiful aspect of both engineering and design is that there are multiple ways to solve the same problem. The problem could be getting cars across a river or presenting information to a user. There are many means to an end, and the optimality of each solution may be context-dependent. A good engineer or designer can think flexibly about what each situation calls for.

Sub-optimality can have high personal, safety, and economic costs. For example, in the software industry, maintenance dominates the cost of producing software (Engineering SaaS). Poorly designed code may be the culprit because it requires significantly more maintenance.

There’s no silver bullet for becoming a great engineer. However, best practices will probably always include analyzing others’ designs, highlighting their successes and failures, as well as composing one’s own designs and receiving critiques from others.

This method of training is particularly exciting today, when there is such a diversity of design examples available online. Thousands of apps are available for download from app stores. Kaggle machine learning competition submissions are not public by default, but are sometimes released by authors and collected for the benefit of others. Github and Bitbucket host millions of repositories, many of them public and searchable.

If these solutions can be indexed in useful ways, they can be mined to ask basic questions like,

  1. In the face of a common design choice, what do people most commonly pick? Has this changed over time?
  2. What are popular design alternatives?
  3. What are examples of design fails that should be learned from and never repeated?
  4. What are examples of design innovations that are clearly head-and-shoulders above the rest?

The sections that follow highlight research efforts that have pushed the frontier of mining design forward. These papers have already been published at top-tier conferences and journals, like the ACM User Interface Software and Technology (UIST) conference and the ACM Transactions on Computer-Human Interaction journal (TOCHI).


“d.tour: Style-based Exploration of Design Example Galleries” [PDF]

At UIST’11, Ritchie, Kejriwal, and Klemmer described a user interface for finding “relevant and inspiring design examples” from a curated database of web pages. This work is intended to support designers who like to refer to or adapt previous designs for their own purposes. Traditional search engines only index the content of web pages; this system indexes web pages’ design style by automatically extracting global stylistic and structural features from each page. Instead of manually browsing, users can search and filter a gallery of design-indexed pages, either by providing an example design in order to find similar (or dissimilar) designs, or by supplying high-level style terms like “minimal.”

“Webzeitgeist: Design Mining the Web” [PDF]

Two years after d.tour, Kumar et al. published a paper on design mining web pages at CHI ’13. Design mining is defined as “using knowledge discovery techniques to understand design demographics, automate design curation, and support data-driven design tools.” This work goes beyond searching and filtering a gallery of hundreds of curated webpages. Their Webzeitgeist design mining platform allows users to query a repository of hundreds of thousands of web pages based on the properties of their Document Object Model (DOM) tree and the look of the rendered page. A 1679-dimensional vector of descriptive features is computed for each DOM node in each page.

Webzeitgeist enables users to ask and answer some of those originally highlighted questions, with respect to this large web page repository:

  1. What are all the distinct cursors?
  2. What are the most popular colors for text?
  3. How many DOM tree nodes does a typical page have? How deep is a typical DOM tree?
  4. What is the distribution of aspect ratios for images?
  5. What are the spatial distributions for common HTML tags?
  6. How do web page designers use the HTML canvas element?

To dig into examples of a particular design choice, users can, for example, query for all pages with very wide images. The result is a set of horizontally scrolling pages. Alternatively, users can query for webpages that have a particular layout, like a large header, a navigational element at the top of the page, and a text node containing more than some threshold number of words, in order to see all the examples of pages that fit those layout specifications. Specific combinations of page features can imply high-level designs as well, so with careful query construction, users can query for high-level ideas. For example, querying for pages with a centered HTML input element AND low visual complexity retrieves many examples that look like the front pages of search engines.
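To make the querying model concrete, here is a toy Python sketch of this style of design-mining query. The feature names and the tiny in-memory “repository” are invented for illustration; this is not Webzeitgeist’s actual schema or API.

```python
# Toy sketch of a design-mining query. Feature names ("max_image_aspect",
# "visual_complexity", etc.) and the page records are hypothetical.
pages = [
    {"url": "a.example", "max_image_aspect": 6.2, "has_top_nav": True,  "visual_complexity": 0.71},
    {"url": "b.example", "max_image_aspect": 1.3, "has_top_nav": False, "visual_complexity": 0.12},
    {"url": "c.example", "max_image_aspect": 1.5, "has_top_nav": False, "visual_complexity": 0.09},
]

def query(records, predicate):
    """Return every page whose feature record satisfies the predicate."""
    return [r for r in records if predicate(r)]

# "Pages with very wide images": a proxy for horizontally scrolling designs.
wide = query(pages, lambda r: r["max_image_aspect"] > 4.0)

# "Search-engine-like front pages": low visual complexity, no top navigation.
simple = query(pages, lambda r: r["visual_complexity"] < 0.2 and not r["has_top_nav"])
```

The point is that once every page is reduced to a vector of design features, high-level design questions become ordinary predicates over records.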

(Android) Apps

Two recent papers, “Insights into Layout Patterns of Mobile User Interfaces by an Automatic Analysis of Android Apps” [ACM DL page] by Shirazi et al. and “Collect, Decompile, Extract, Stats, and Diff: Mining Design Pattern Changes in Android Apps” [ACM DL page] by Alharbi and Yeh, describe automated processes for taking apart and analyzing Android app code, as well as empirical analyses of corpuses of Android apps available on the Google Play app store. Shirazi et al. analyzed the 400 most popular free Android applications, while Alharbi and Yeh tracked over 24,000 Android apps over a period of 18 months, capturing each update within their collection window as well. They decompiled apps into code from which UI design and behavior could be inferred, e.g., XML and click handlers, and tracked changes across versions of the same app. Both papers analyzed population-level characteristics of their corpuses, answering questions like:

  1. What is the distribution of layout design patterns, among the seven standard Android layout containers?
  2. What are the most common design patterns for navigation, e.g., tab layout and horizontal paging? Have any apps switched from one pattern to another?
  3. How quickly are newly introduced design patterns adopted?
  4. What are the most frequent interface elements? And combinations of interface elements? How many applications does that combination cover?

The authors each answer some of these questions, with respect to their corpuses, in their respective papers.

Mining Patterns in Programs

There’s a lot of code out there. Ideally, it is not just correct, it is simple, readable, and ready for the inevitable need for future changes (The Zen of Python, MIT’s largest software engineering course). How can we help students reach this level of programming composition zen? How can we learn from others’ code, even after we become competent, or even an expert, at the art of programming?

For the same reason we look at patterns in design across web pages and mobile apps, we can look at the design choices already made by humans who share their programs. Rather than using web crawlers or app stores, we can process millions of public repositories hosted online. What can we learn about good and bad code design decisions from these collections?

Regularity In Code

Several papers make similar observations and arguments about, and provide empirical validation of, the regularity that can be found in code. Hindle et al. (ICSE ’12) were motivated by the assertion that human-produced natural language and human-produced programming language may both be “complex and admit a great wealth of expression, but what people write … is largely regular and predictable.” The authors argue that the assertion may be even more true for code than for natural language. Allamanis and Sutton (FSE ’14) observe that there are syntactic fragments, i.e., idioms, that serve a single semantic purpose and recur frequently across software projects. Fast et al. (CHI ’14) observe that poorly written code is often syntactically different from well written code, with the caveat that not all syntactically divergent code is bad.

Mining Idioms

Idiomatic code is written in a manner that experienced programmers perceive as “normal” or “natural.” Idioms are roughly equivalent to mental “chunks.” I will borrow a few examples from Allamanis and Sutton:

  • for(int i=0;i<n;i++){…} is a common idiom for looping in Java.
  • do-while and recursive looping strategies are not.

An experienced Java programmer will be able to understand the code whether it’s idiomatic or not, but it may take longer. They may even be distracted by questions, e.g., “Why did the author make this choice?”

Fast et al. (CHI ’14) break the definition of idioms into two levels. An example of a “high-level” idiom is code that initializes a nested hash. An example of a “low-level” idiom is code that returns the result of an addition operation. Some languages support a variety of different, equally good ways to do the same thing; others encourage a single, idiomatic way to achieve each task.
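As a concrete illustration of the “initialize a nested hash” idiom, here is a hedged Python analogue (the paper’s examples are in Ruby). Both versions build the same nested counts; the idiomatic one reads as a single mental chunk.

```python
from collections import defaultdict

observations = [("run", "VERB"), ("run", "NOUN"), ("walk", "VERB")]

# Non-idiomatic: manually check for and create each missing inner dict.
counts = {}
for word, tag in observations:
    if word not in counts:
        counts[word] = {}
    if tag not in counts[word]:
        counts[word][tag] = 0
    counts[word][tag] += 1

# Idiomatic: let defaultdict create the nested structure on first access.
idiomatic_counts = defaultdict(lambda: defaultdict(int))
for word, tag in observations:
    idiomatic_counts[word][tag] += 1
```

An experienced reader parses the second loop at a glance; the first forces them to verify three separate conditionals before trusting it.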

Idioms can and do recur throughout distinct projects and domains (unlike repeated code nearly verbatim, i.e., clones) and commonly involve syntactic sugar (unlike API patterns). In general, clone detectors look for the largest repeated code fragments and API mining algorithms look for frequently used sequences of API calls. Idiom mining is distinct because idioms have syntactic structure and are often wrapped around or interleaved with context-dependent blocks of code, like the block of statements within the idiomatic for loop in the previous paragraph.

There are enough idioms for some languages that they have lengthy, highly “starred” and shared online guides, e.g., JavaScript Patterns. StackOverflow has many questions asked and answered about the appropriate language or library-specific idioms for particular, common tasks. It is difficult for expert users of each language or library to catalogue all the idioms. It is much more practical to simply look at how programmers are using the language or library and extract idioms from the data.

Hindle et al. (ICSE ’12) used statistical language models from natural language processing to identify idiom-like patterns in Java code. They found that corpus-based n-gram language models captured a high level of project- and domain-specific local regularity in programs. Local regularities are valuable for statistical machine translation of natural language; they may prove useful in analogous tasks for software as well. For example, the authors trained and tested a corpus-based n-gram model token suggestion engine that looks at the previous two tokens already entered into the text buffer and predicts the next one the programmer might type.
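A minimal sketch of that suggestion engine, and not the authors’ actual system, might look like this: count which token follows each pair of consecutive tokens in a corpus, then predict the most frequent continuation of the two tokens just typed.

```python
from collections import Counter, defaultdict

def train_trigram(tokens):
    """Map each pair of consecutive tokens to a Counter of the tokens
    that followed that pair in the training corpus."""
    model = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        model[(a, b)][c] += 1
    return model

def suggest(model, prev2, prev1):
    """Predict the most frequent continuation of the last two tokens, if any."""
    following = model.get((prev2, prev1))
    return following.most_common(1)[0][0] if following else None

# Toy "corpus": one tokenized Java-like statement. A real model would be
# trained on millions of tokens from a project or domain.
corpus = "for ( int i = 0 ; i < n ; i ++ )".split()
model = train_trigram(corpus)
```

After typing `int i`, the model suggests `=`, exactly because that continuation dominates the training data. The ICSE ’12 result is that real code corpora are regular enough for this simple scheme to work surprisingly well.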

Allamanis and Sutton (FSE ’14) automatically mine idioms from a corpus of idiomatic code using nonparametric Bayesian tree substitution grammars. The mined idioms correspond to important programming concepts, e.g., object creation, exception handling, and resource management, and are, as expected, often library-specific. They found that 67% of the idioms mined from one set of open source projects were also found in code snippets posted on StackOverflow.

Fast et al. (CHI ’14) computed statistics about the abstract syntax trees (ASTs) of three million lines of popular open source code in the 100 most popular Ruby projects hosted on Github. AST nodes are normalized, and all identical normalized nodes are collapsed into a single database entry. The unparsed code snippets that correspond to each normalized node are saved. Their system, Codex, normalizes these snippets by renaming variable identifiers, strings, symbols, and numbers to var0, var1, var2, str0, str1, etc. Note that this fails when primitives, like specific strings and numbers, are vital to interpreting the purpose of the statement.
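Codex targets Ruby; as a rough analogue, here is a Python sketch of that renaming scheme using the standard ast module. Structurally identical statements collapse to the same normalized string, which is what lets frequency counting work.

```python
import ast

def normalize(snippet):
    """Rename identifiers, strings, and numbers to canonical placeholders
    (var0, str0, num0, ...) so structurally identical snippets collapse
    to the same key. A sketch of Codex-style normalization, not its code."""
    tree = ast.parse(snippet)
    names, strings, numbers = {}, {}, {}

    class Normalizer(ast.NodeTransformer):
        def visit_Name(self, node):
            new = names.setdefault(node.id, f"var{len(names)}")
            return ast.copy_location(ast.Name(id=new, ctx=node.ctx), node)

        def visit_Constant(self, node):
            if isinstance(node.value, str):
                new = strings.setdefault(node.value, f"str{len(strings)}")
            elif isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                new = numbers.setdefault(node.value, f"num{len(numbers)}")
            else:
                return node
            return ast.copy_location(ast.Constant(value=new), node)

    return ast.unparse(Normalizer().visit(tree))
```

For example, `a.split(',')` and `b.split(';')` normalize identically (method names are preserved, receivers and literals are not), while `x = 1` and `x = 'a'` stay distinct because the placeholder encodes the literal’s kind.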

Codex can warn programmers when they chain or compose functions, place a method call in a block, or pass an argument to a function in a way that is infrequently seen in the corpus. It is fast enough to run in the background of an IDE, highlighting problem statements and annotating them with messages like, “We have seen the function split 30,000 times and strip 20,000 times, but we’ve never seen them chained together.” Codex can be queried for nodes by code complexity; type, i.e., function call; frequency of occurrence across files and projects; and containment of particular strings.

Mining Larger Patterns in Code

In the code of working applications, Ammons et al. (POPL ’02) observed that “common behavior is often correct behavior.” Based on that observation, they use probabilistic learning over program execution traces to infer a program’s formal correctness specifications. Inferring formal specifications is valuable because programmers have historically been reluctant to write them. Frequent patterns in the traces are summarized as state machines that can be inspected by the programmer. With this approach, the authors identified correct protocols and some previously unknown bugs.
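A toy sketch of the underlying intuition: mine frequent call-to-call transitions from a set of traces, then flag transitions the corpus has rarely or never seen. (The actual work learns probabilistic state machines from traces; this counting sketch only captures the “common behavior is often correct” premise.)

```python
from collections import Counter

def mine_transitions(traces):
    """Count how often each call directly follows another across all traces."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

def violations(trace, counts, min_support=2):
    """Flag transitions seen fewer than min_support times in the mined corpus."""
    return [(a, b) for a, b in zip(trace, trace[1:]) if counts[(a, b)] < min_support]

# Hypothetical execution traces of a file-handling protocol.
traces = [
    ["open", "read", "read", "close"],
    ["open", "read", "close"],
    ["open", "write", "close"],
]
model = mine_transitions(traces)
```

A trace that reads before opening would be flagged, because the transition ("read", "open") never occurs in the mined corpus.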

Buse and Weimer (ICSE ’12) go beyond idioms to mining API usage. Starting from a corpus of Java code, they find examples that reference a target class and symbolically execute them to compute intraprocedural path predicates, recording all subexpression values. They identify expressions that correspond to one use of the class and capture the order of method calls in those concrete examples. They then use K-medoids to cluster these extracted concrete use examples, with a custom formal parameterized distance metric that penalizes differences in method ordering and type information. Concrete use examples within the same cluster are merged into abstract uses, represented as graphs with edge weights that correspond to counts of how many times node X happens before node Y. Finally, a synthesis method expresses these abstract use graphs in a human-readable form, i.e., representative, well-formed, and well-typed Java code fragments.
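The clustering step can be sketched as follows. This uses plain Levenshtein (edit) distance over method-call sequences as a stand-in for the paper’s parameterized metric, a naive K-medoids loop rather than their implementation, and invented call sequences.

```python
def edit_distance(a, b):
    """Levenshtein distance between two call sequences: a simple stand-in
    for the paper's metric over method ordering and type information."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n]

def k_medoids(examples, k, iters=10):
    """Naive K-medoids: assign each example to its nearest medoid, then pick
    the point minimizing total within-cluster distance as the new medoid."""
    medoids = examples[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for ex in examples:
            nearest = min(range(k), key=lambda i: edit_distance(ex, medoids[i]))
            clusters[nearest].append(ex)
        medoids = [min(c, key=lambda ex: sum(edit_distance(ex, o) for o in c))
                   if c else medoids[i]
                   for i, c in enumerate(clusters)]
    return medoids, clusters

# Hypothetical concrete uses of two different APIs, as call sequences.
uses = [
    ["new", "connect", "read", "close"],
    ["iterator", "hasNext", "next"],
    ["new", "connect", "write", "close"],
    ["new", "setTimeout", "connect", "read", "close"],
    ["iterator", "hasNext", "next", "remove"],
]
medoids, clusters = k_medoids(uses, k=2)
```

On this toy data the connection-style uses and the iterator-style uses land in separate clusters; each cluster could then be merged into one abstract use graph, as the paper describes.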

Mining Names

Without modifying execution, names can express to the human reader the type and purpose of an object, as well as suggest the kinds of operators used to manipulate it (Jones ’07). Perhaps as a direct result, variable names can exhibit some of the same regularity exhibited by code in general. Høst and Østvold go so far as to treat method names as a restricted natural language, which they dub Programmer English.

Høst and Østvold ran an analysis pipeline over a corpus of Java code that performs semantic analysis on method bodies and grammatical analysis on method names; it generates a data-driven phrasebook that Java programmers can consult when naming methods. In a follow-up ECOOP ’09 paper, they formally defined and then automatically identified method naming bugs in code, i.e., giving a method a name that incorrectly implies what the method takes as an argument or does with that argument.

They did this by identifying prevalent naming patterns, e.g., contains-*, which occur in over half of the applications in the corpus and match at least 100 method instances. They also determined and catalogued the attributes of each method body, such as whether it read or wrote fields, created new objects, or threw exceptions. If almost all the methods whose names match a particular pattern, e.g., contains-*, share an attribute (or consistently lack one), that association is automatically promoted to an implementation rule that all matching methods in the corpus should follow. On a large corpus of Java projects, this analysis pipeline found a variety of naming bugs.
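A toy sketch of that rule mining, with hypothetical method attributes (the real analysis extracts many more attributes, from bytecode, and handles grammatical structure in names):

```python
# Hypothetical corpus: (method name, set of semantic attributes of its body).
methods = [
    ("containsKey",   {"reads_field", "returns_boolean"}),
    ("containsValue", {"reads_field", "returns_boolean"}),
    ("containsAll",   {"reads_field", "returns_boolean"}),
    ("containsFix",   {"writes_field"}),   # suspicious: a contains-* that mutates state
    ("setName",       {"writes_field"}),
]

def mine_rules(corpus, prefix, threshold=0.7):
    """For methods matching a name pattern, keep any attribute that is present
    in (or absent from) at least `threshold` of them as an implementation rule."""
    matching = [attrs for name, attrs in corpus if name.startswith(prefix)]
    all_attrs = set().union(*(attrs for _, attrs in corpus))
    rules = {}
    for attr in all_attrs:
        frac = sum(attr in attrs for attrs in matching) / len(matching)
        if frac >= threshold:
            rules[attr] = True     # the pattern implies this attribute
        elif frac <= 1 - threshold:
            rules[attr] = False    # the pattern implies its absence
    return rules

def naming_bugs(corpus, prefix, rules):
    """Methods matching the pattern whose bodies violate a mined rule."""
    return [name for name, attrs in corpus if name.startswith(prefix)
            and any((attr in attrs) != expected for attr, expected in rules.items())]

rules = mine_rules(methods, "contains")
bugs = naming_bugs(methods, "contains", rules)
```

Here the mined rules say contains-* methods read fields, return booleans, and do not write fields, so the field-writing containsFix is flagged as a naming bug.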

Five years later, Fast et al.’s Codex produced similar results; by keeping track of variable names in variable assignment statements, it can warn programmers when their variable name violates statistically-established naming conventions, such as the (probably confusing) naming of a Hash object “array.”


There will probably always be top-down direction available about design: recommendations based on past experience or folk wisdom. However, audiences evolve: new conventions and shared understandings spread among users. Technological advances change design constraints. Insights that can be mined from the products of the design process, whether those products are interactive web pages or files full of method definitions, are a key piece of reflecting on what’s good and what’s bad about what we make.
