Towards Automating Insight

An E³ — Expressive, Efficient, and Effortless — Visual Data Exploration System

Published in Data People, Sep 1, 2017

This blog post describes our latest research paper on the Zenvisage system, published at VLDB’17; the blog post was largely written by my PhD student Tarique Siddiqui, and edited by me — the other co-authors on the paper are Albert Kim, John Lee, and my colleague Karrie Karahalios.

Data visualization is the primary means via which data analysts — many of whom have limited programming skills — explore their data. While the usability and visual encoding capabilities of data visualization tools have undergone a massive evolution over the years, when it comes to searching for patterns, trends, and insights in large and complex datasets, these tools are severely limited. Using visualizations to identify the needle in the haystack has largely been a tedious and manual process of trial and error, preventing analysts from rapidly deriving desired insights.

Example: Ad analysts often examine the behavior of keywords over time to help their clients better manage their ad portfolio. This involves searching for keywords that behave differently from other keywords. Lacking extensive programming skills, analysts load their datasets into visualization tools such as Tableau or PowerBI, and specify constraints over region and time attributes to generate a plot of click-through rate over time for each keyword. Then, they manually compare these plots to find the ones that behave unusually. If they find a few keywords that match the requirements, they move on to the next step in their analysis; otherwise, they search over a new collection of plots generated by applying a different set of constraints. This process is repeated until they find the desired keywords. However, when the number of keywords or constraints gets too large, this manual examination process becomes simply unsustainable: analysts often have to go through tens of thousands of visualizations, which adds to their frustration and makes exploration severely error-prone.

Similar examples are seen across a plethora of other domains involving genomic data, astronomical data, environmental data and others, where analysts have to manually peruse thousands of visualizations to search for each insight.

What if you could “skip ahead to the insights”?

Zenvisage is a visual data exploration system that automates the manual examination of visualizations described above, and fast-forwards users to their desired visualizations by automatically identifying and recommending interesting ones. The name Zenvisage is a portmanteau of zen and envisage, meaning to effortlessly visualize.

Zenvisage provides three key features to address the challenges involved in automating the manual search process:

An Intuitive Visual Interface with built-in interactions, summarization, and recommendations;
A Powerful Query Language (ZQL) for issuing sophisticated multi-step queries; and
A Scalable Optimization Engine (SmartFuse) that traverses tens of thousands of visualizations to find a few interesting ones within interactive response times.

I. An Intuitive Visual Interface

Zenvisage’s visual interface provides a number of features that help users get started with their exploration process, and lets them search for patterns directly via sketching and drag-and-drop mechanisms. Figure 1 depicts the key components of the interface.

Users select the attributes in Box 1 to specify the space of visualizations they are interested in. Zenvisage then automatically populates Box 2 with typical or representative trends to give users a high-level overview of the trends and patterns present in the dataset. Thereafter, users can either draw the shape or pattern they are looking for in the editable sketching canvas (Box 3), or drag and drop one of the displayed visualizations into the canvas. Zenvisage then computes the matching visualizations and displays them in Box 4. Zenvisage also lets users customize the distance metric used for matching, the degree of smoothing, and the x-axis range of interest, among other options.
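To make the matching step concrete, here is a minimal sketch of how a drawn pattern could be compared against candidate trends. It is not Zenvisage's actual implementation: the moving-average smoothing, the min-max normalization, the Euclidean distance, and all function and variable names are assumptions chosen for illustration.

```python
from math import sqrt

def smooth(series, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

def normalize(series):
    """Rescale to [0, 1] so that shape, not magnitude, drives the match."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0] * len(series)
    return [(v - lo) / (hi - lo) for v in series]

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_sketch(sketch, candidates, k=2, window=3, distance=euclidean):
    """Return the k candidate names closest to the sketch (assumed scoring)."""
    target = normalize(sketch)
    scored = []
    for name, series in candidates.items():
        d = distance(target, normalize(smooth(series, window)))
        scored.append((d, name))
    return [name for _, name in sorted(scored)[:k]]

# Hypothetical data: one trend per keyword, each as long as the sketch.
candidates = {
    "rising": [1, 2, 3, 4, 5, 6],
    "noisy_rising": [1, 3, 2, 5, 4, 7],
    "falling": [6, 5, 4, 3, 2, 1],
    "flat": [3, 3, 3, 3, 3, 3],
}
sketch = [0, 1, 2, 3, 4, 5]  # the user draws an upward line
print(match_sketch(sketch, candidates))
```

Passing a different `distance` function or `window` size mirrors, in spirit, the customization options mentioned above.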

II. Zenvisage Query Language (ZQL)

As a second and more advanced mode of query specification, Zenvisage provides a powerful query language, called ZQL, for issuing sophisticated multi-step queries. Forming the core of Zenvisage (in that all interactions get compiled down to ZQL), ZQL is a high-level visual data exploration language that operates over a collection of visualizations, i.e., it takes as input a collection of visualizations, and outputs a collection of visualizations. ZQL also captures key operations and data-mining primitives for operating at the level of visualizations.

At a high level, a ZQL query consists of one or more rows, where each row has a well-defined set of columns, namely Name, X, Y, Z, Viz, Constraints, and Process. These columns fall conceptually into two groups: the X, Y, Z, Viz, and Constraints columns define a collection of visualizations, and are thus used for composing visualizations; the Process column operates on one or more collections of visualizations by specifying high-level comparison, sorting, and filtering operations.
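One way to picture this row structure is as a small record type. The sketch below is only an illustration: the class, its methods, the row names f1 to f3, and the cell contents are hypothetical, and the cell contents are plain-English placeholders rather than real ZQL syntax (see the paper for that).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ZQLRow:
    """One row of a ZQL query table (illustrative model, not the real parser)."""
    name: str
    x: Optional[str] = None
    y: Optional[str] = None
    z: Optional[str] = None
    viz: Optional[str] = None
    constraints: Optional[str] = None
    process: Optional[str] = None

    def composing_cells(self):
        """Cells that define the collection of visualizations."""
        cells = {"X": self.x, "Y": self.y, "Z": self.z,
                 "Viz": self.viz, "Constraints": self.constraints}
        return {col: val for col, val in cells.items() if val is not None}

    def has_process(self):
        """True if the row also operates on collections via Process."""
        return self.process is not None

# Placeholder cell text, loosely following a similarity search over products.
query = [
    ZQLRow(name="f1", x="year", y="sales", z="product = chair"),
    ZQLRow(name="f2", x="year", y="sales", z="every other product",
           process="keep the 10 products most similar to f1"),
    ZQLRow(name="f3", x="year", y="sales", z="the products kept by row f2"),
]
for row in query:
    print(row.name, row.composing_cells(), "| process:", row.process)
```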

For more details on the formal syntax and semantics of the language, please refer to our VLDB’17 paper, where we show that ZQL is complete, and can handle a large number of data exploration operations.

Here are some examples that illustrate the capabilities of ZQL at a high level.

Example 1

[Similarity Search over Z] A ZQL query that visualizes the top 10 products whose sales-over-year trends are similar to that of the product chair.

As depicted in the table above, the query can be expressed in three rows. The first row retrieves the visualization for the product chair. The second row retrieves the visualizations for the rest of the products, and then selects the top 10 products whose trends are most similar to that of chair. The third row outputs the visualizations of the selected products.

Example 2

[Similarity Search over X and Y] A ZQL query retrieving two different visualizations (among different combinations of x and y) for chair and desk that are the most dissimilar.

Here, in the first two rows, for each of the products chair and desk, we generate a collection of visualizations formed by different combinations of X and Y, and then select the pair of X and Y attributes for which the corresponding visualizations in the two collections are the most dissimilar. In the last two rows, we output the visualizations for the selected X and Y attributes.

Example 3

[Outlier Search over Z] A ZQL query which returns the sales visualizations for the 10 products whose sales visualizations are the most different from the others.

In this example, we first create two instances of the same collection of visualizations, corresponding to all the products. Then, using the Process column in row 2, we compare the visualization of each product with the visualizations of the rest of the products, selecting the 10 products that are most different from the others. Finally, we output the visualizations of the selected products.
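The comparison step can be sketched in a few lines. This is an assumption-laden illustration rather than ZQL's actual semantics: here a product's "difference" is taken to be its mean Euclidean distance to every other product's trend, and the data and function names are made up.

```python
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_different(trends, k):
    """Rank products by mean distance to all other products (assumed score).

    Expects at least two products with equal-length series.
    """
    scores = {}
    for name, series in trends.items():
        others = [s for n, s in trends.items() if n != name]
        scores[name] = sum(euclidean(series, o) for o in others) / len(others)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical sales-over-year trends.
trends = {
    "chair": [10, 12, 14, 16],
    "desk": [11, 13, 15, 17],
    "lamp": [9, 11, 13, 15],
    "stapler": [30, 5, 28, 2],
}
print(most_different(trends, k=1))
```

Note that this compares every pair of products, which is fine for a handful of trends but is exactly the kind of work that needs optimization at the scale of thousands of visualizations.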

Example 4

[Shape Search over Z] A ZQL query which returns the sales visualizations for all products which have a negative trend.

Unlike the previous examples, here we operate over a single collection of visualizations, first composing the collection and then filtering it to select the products whose overall trend is negative. Finally, we output the visualizations of the selected products.
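As a rough illustration of such a filter, one could fit a straight line to each product's trend and keep the products whose slope is below zero. The least-squares slope is an assumption made for this sketch; the post does not say how Zenvisage decides that a trend is negative, and the data is hypothetical.

```python
def slope(series):
    """Least-squares slope of the series against its index (needs >= 2 points)."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(series))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def negative_trends(trends):
    """Names of products whose fitted slope is negative, in input order."""
    return [name for name, series in trends.items() if slope(series) < 0]

# Hypothetical sales-over-year trends.
trends = {
    "chair": [10, 12, 14, 16],
    "desk": [20, 18, 15, 11],
    "lamp": [5, 9, 4, 3],
}
print(negative_trends(trends))
```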

III. Query Optimization

Zenvisage runs on top of a relational database. While executing a query, it first retrieves the necessary data by issuing SQL queries constructed from the X, Y, Z, and Constraints columns, and then applies further processing to the retrieved results based on the Process column. Unfortunately, naively translating each line of a ZQL query into SQL leads to thousands of SQL queries; generating and processing each one independently as an aggregate query would take several hours, rendering the tool non-interactive.

To remedy this, we developed SmartFuse, a ZQL optimization engine. SmartFuse first converts a ZQL query into an acyclic graph of visualization collection (C) nodes and processing (P) nodes, and then uses a mix of parallelization, speculation, and combination to group and batch SQL queries before issuing them to the underlying database. Overall, with these schemes, SmartFuse can achieve up to a 60X improvement in execution time over the naive approach.
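The general idea of batching can be shown with a toy example. The sketch below is not SmartFuse's algorithm; it only contrasts issuing one aggregate query per visualization with issuing a single grouped query that returns the data for all of them. The table, column names, and data are hypothetical, and SQLite stands in for the underlying database.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("chair", 2015, 10.0), ("chair", 2015, 5.0), ("chair", 2016, 20.0),
     ("desk", 2015, 7.0), ("desk", 2016, 9.0), ("lamp", 2016, 3.0)],
)

def naive(conn):
    """One aggregate query per product, i.e. per visualization."""
    products = [p for (p,) in conn.execute(
        "SELECT DISTINCT product FROM sales ORDER BY product")]
    result = {}
    for p in products:
        cur = conn.execute(
            "SELECT year, SUM(amount) FROM sales "
            "WHERE product = ? GROUP BY year ORDER BY year", (p,))
        result[p] = cur.fetchall()
    return result, len(products)  # data, number of per-visualization queries

def batched(conn):
    """A single grouped query whose rows are split per product afterwards."""
    cur = conn.execute(
        "SELECT product, year, SUM(amount) FROM sales "
        "GROUP BY product, year ORDER BY product, year")
    result = defaultdict(list)
    for product, year, total in cur:
        result[product].append((year, total))
    return dict(result), 1

naive_result, naive_queries = naive(conn)
batched_result, batched_queries = batched(conn)
print(naive_queries, "queries vs", batched_queries)
print(naive_result == batched_result)
```

With three products the saving is trivial, but the number of per-visualization queries grows with the number of products while the batched version stays at one.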

Results from a User Study

We conducted a user study to compare Zenvisage with existing visualization tools, such as Tableau, that require searching for insights manually. Participants were asked to perform tasks involving matching and comparison among one or more visualizations in a small collection of 30–40. While it might not be extremely surprising that Zenvisage enables faster (50 to 100%) exploration, we also found that participants settled for suboptimal answers when using the existing tools, often not looking further than the first few visualizations in the collection. Because of this, accuracy with the existing visualization tools was significantly lower than with Zenvisage (70% vs 96%).

We also explicitly asked participants to compare Zenvisage with other data analytics tools. Here are some interesting user reactions.

“If I am doing my social science study, and I want to see some specific behavior among users, then I can use tool A [Zenvisage] since I can find the trend I am looking for and easily see what users fit into the pattern.”

Another participant experienced in Tableau commented:

“In Tableau, there is no pattern searching. If I see some pattern in Tableau, such as a decreasing pattern, and I want to see if any other variable is decreasing in that month, I have to go one by one to find this trend. But here I can find this through the query table.”

Zenvisage complements existing database and data mining systems, as well as programming languages, by allowing quicker initial data exploration before complex analysis is performed. An experienced data analyst said:

“The obvious good thing is that you can do complicated queries, and you don’t have to write SQL queries… I can imagine a non-cs student [doing] this.”

Acknowledgements

Many thanks to the NSF, NIH, Google, the Siebel Energy Institute, and Adobe for funding this work. We owe a huge debt of gratitude to our collaborators and beta testers across a variety of domains, ranging from astrophysics to battery science to genomics.