# Draw Conclusions Early With IncVisage

## A New Paradigm for Incremental Generation of Visualizations

*This blog post is a high-level overview of our VLDB’17 paper, titled “I’ve Seen Enough: Incrementally Improving Visualizations for Rapid Decision Making”, with **code on github**. An extended version of our paper is **here**. Written primarily by Sajjadur Rahman, with small edits from me; coauthors include Maryam Aliakbarpour, Hidy Kong, Eric Blais, Karrie Karahalios, and Ronitt Rubinfeld.*

Data visualization on increasingly large datasets remains cumbersome: when datasets are large, generating visualizations can take hours, impeding interaction, preventing exploration, and delaying the extraction of insights.

One approach to generating visualizations faster is to use *sampling* — we can display visualizations that incrementally improve over time, and eventually converge to the visualization computed on the entire data. However, such intermediate visualizations are approximate, and often fluctuate drastically, leading to *incorrect conclusions*. *We propose sampling-based incremental visualization algorithms that reveal the “salient” features quickly while minimizing error, enabling rapid and error-free decision making.*

**Example:** In the first row of Figure 1 below, we depict how the output of present sampling algorithms varies as more samples are taken: at *t_1*, *t_2*, *t_4*, *t_7*, and when all of the data has been sampled. This is what visualizing the results of a standard sampling algorithm might look like. If a user sees the visualization at any of the intermediate time points, they may make incorrect decisions. For example, at time *t_1*, the user may conclude that the values at the start and the end are lower than most of the trend, while in fact the opposite is true. This anomaly is due to the skewed samples that were drawn to reach *t_1*. The visualization continues to vary at *t_2*, *t_4*, and *t_7*, with values fluctuating randomly based on the samples that were drawn.

Another approach, which we call **IncVisage** and depict in the second row, is the following: at each time point *t_i*, reveal one additional segment of an *i*-segment trendline by splitting one of the segments of the trendline at *t_{i-1}*, when the tool is confident enough to do so. Thus, IncVisage is very conservative at *t_1* and just provides a mean value for the entire range. Then, at *t_2*, it splits the single segment into two segments, indicating that the trend increases towards the end. Overall, by *t_7*, the tool has indicated many of the important features of the trend: it starts off high, has a bump in the middle, and then increases towards the end. This approach reveals features of the eventual visualization in order of prominence, allowing users to gain insights and draw conclusions early. The proposed approach can be applied to heatmap visualizations as well: row 4 depicts the IncVisage heatmap, and row 3 shows the corresponding standard sampling approach. As is typical in heatmaps, the higher the value, the darker the color.

### Incrementally Improving Visualizations: The IncVisage Approach

How do we go about generating these increments? Consider first the setting where the aggregates of all the groups (x-axis values) are known beforehand. In that case, we don’t need to perform sampling, and the problem reduces to finding the best *k*-segment approximation at each iteration *k*. At each iteration, we end up splitting a segment from the previous iteration into two new segments.

Now, how do we decide where to split? Let’s look at an example (see Figure 2). Given segment *S*, there are many candidates for splitting; we only show three. Which one do we pick? In this case, the candidate at the bottom gives us the maximum “jump”. Intuitively, we want to split a segment that is large enough to be worth splitting, at a point where the difference between the values on either side is also large. This notion of a good split is captured by a concept called *improvement potential*, which considers both the difference between values and the segment size. (See our paper for more details.)
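To make the splitting step concrete, here is a minimal Python sketch for the setting where all group aggregates are known. It is not the IncVisage implementation. It assumes squared error as the quality measure, so the gain from splitting a segment *S* into *T* and *U* is `|T||U|/|S| * (mean_T - mean_U)^2`, a quantity that grows with both the segment sizes and the difference in values. The function names (`split_gain`, `incremental_segments`, `segment_means`) are hypothetical, and the paper should be consulted for the exact definition of improvement potential.

```python
from typing import List


def split_gain(prefix: List[float], lo: int, mid: int, hi: int) -> float:
    """Reduction in squared error from splitting groups [lo, hi) at mid.

    Assumption: squared error is the quality measure (see lead-in).
    """
    n_left, n_right = mid - lo, hi - mid
    mean_left = (prefix[mid] - prefix[lo]) / n_left
    mean_right = (prefix[hi] - prefix[mid]) / n_right
    return n_left * n_right / (n_left + n_right) * (mean_left - mean_right) ** 2


def incremental_segments(values: List[float], k: int) -> List[List[int]]:
    """Return the segment boundaries after each greedy iteration (at most k)."""
    m = len(values)
    # Prefix sums let us compute any segment mean in constant time.
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)

    boundaries = [0, m]            # iteration 1: one segment over everything
    history = [list(boundaries)]
    for _ in range(k - 1):
        best_gain, best_mid = 0.0, None
        # Try every split point inside every current segment.
        for lo, hi in zip(boundaries, boundaries[1:]):
            for mid in range(lo + 1, hi):
                gain = split_gain(prefix, lo, mid, hi)
                if gain > best_gain:
                    best_gain, best_mid = gain, mid
        if best_mid is None:       # no split improves the approximation
            break
        boundaries = sorted(boundaries + [best_mid])
        history.append(list(boundaries))
    return history


def segment_means(values: List[float], boundaries: List[int]) -> List[float]:
    """Mean value displayed for each segment."""
    return [sum(values[lo:hi]) / (hi - lo)
            for lo, hi in zip(boundaries, boundaries[1:])]


values = [5, 5, 5, 1, 1, 1, 1, 9, 9]
for bounds in incremental_segments(values, 3):
    print(bounds, segment_means(values, bounds))
```

On this toy input, the first split isolates the two high values at the end, because that is the largest jump, and the second split separates the high start from the low middle.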

We now generalize our approach to online sampling where we draw samples in iterations. *It turns out that we can use an empirical version of the same measure of improvement potential and still obtain optimal results.* Now, how many samples should we draw in each iteration to obtain optimal results? Somewhat surprisingly, if we draw samples uniformly across x-axis groups at each iteration, we can still satisfy the guarantees. We also derive a matching lower bound for the sampling complexity of the underlying algorithm. Moreover, we demonstrate that by geometrically decreasing the total number of samples drawn across iterations, we obtain an algorithm that is highly interactive with only a small effect on the accuracy of the approximations. We discuss the theorems that support these claims in detail in the paper.
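The sampling variant can be sketched in the same spirit. The Python below is an illustration under stated assumptions, not the algorithm from the paper: it draws the same number of samples (with replacement) from every x-axis group in each iteration, shrinks the per-iteration sample budget geometrically, pools samples across iterations, and picks the split with the largest empirical squared-error gain. The parameters `first_batch` and `decay` are made-up illustrative values rather than the constants derived in the paper, and the confidence check described earlier is omitted.

```python
import random
from typing import List, Optional, Sequence, Tuple


def best_split(est: List[float], boundaries: List[int]) -> Optional[int]:
    """Split point with the largest empirical squared-error gain, if any."""
    prefix = [0.0]
    for v in est:
        prefix.append(prefix[-1] + v)
    best_gain, best_mid = 0.0, None
    for lo, hi in zip(boundaries, boundaries[1:]):
        for mid in range(lo + 1, hi):
            n_l, n_r = mid - lo, hi - mid
            mean_l = (prefix[mid] - prefix[lo]) / n_l
            mean_r = (prefix[hi] - prefix[mid]) / n_r
            gain = n_l * n_r / (n_l + n_r) * (mean_l - mean_r) ** 2
            if gain > best_gain:
                best_gain, best_mid = gain, mid
    return best_mid


def sampled_trendline(
    groups: Sequence[Sequence[float]],
    k: int,
    first_batch: int = 900,   # hypothetical budget for iteration 1
    decay: float = 0.5,       # hypothetical geometric decay factor
    seed: int = 0,
) -> List[Tuple[List[int], List[float]]]:
    """Return (boundaries, segment means) after each of k iterations."""
    rng = random.Random(seed)
    m = len(groups)
    sums, counts = [0.0] * m, [0] * m
    boundaries = [0, m]
    snapshots = []
    batch = float(first_batch)
    for iteration in range(1, k + 1):
        # Spread this iteration's budget uniformly across the groups.
        per_group = max(1, round(batch / m))
        for g, rows in enumerate(groups):
            for _ in range(per_group):
                sums[g] += rng.choice(rows)
                counts[g] += 1
        # Empirical group means from all samples seen so far (an assumption).
        est = [s / c for s, c in zip(sums, counts)]
        if iteration > 1:
            mid = best_split(est, boundaries)
            if mid is not None:
                boundaries = sorted(boundaries + [mid])
        means = [sum(est[lo:hi]) / (hi - lo)
                 for lo, hi in zip(boundaries, boundaries[1:])]
        snapshots.append((list(boundaries), means))
        batch *= decay        # fewer samples in each later iteration
    return snapshots
```

Calling `sampled_trendline(groups, k)` with one list of raw values per x-axis group returns one snapshot per iteration, which a front end could render as the incrementally refined trendline.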

### Experimental Takeaways

So how does this work in practice? Here are some key takeaways from our experimental evaluation. There’s lots more in the paper!

**IncVisage is Extremely Fast.**

Our approach is orders of magnitude faster than the sequential scan approach of reading the entire dataset. In Figure 3, we plot the execution times of Scan and different iterations of IncVisage on three datasets of varying size. The dark vertical arrows highlight the difference in completion time between IncVisage and Scan. Figure 3 shows that IncVisage facilitates faster decision making by revealing important features fairly quickly (iterations 5, 10, and 50 in Figure 3). At the same time, as the size of the dataset grows, the execution time for sequential scan also increases, whereas the execution time for different iterations of IncVisage stays almost the same, irrespective of the dataset.

**IncVisage provides Highly Interpretable Visualizations and Better Decision Making.**

We conducted two user studies to evaluate the interpretability and usability of IncVisage. The first study demonstrated that the visualizations generated by IncVisage are highly interpretable. In the second study, we compared the decision-making capability of IncVisage with that of the standard approach of depicting the entire visualization estimate as samples are taken. The study showed that IncVisage outperforms the standard approach in terms of accuracy (IncVisage = 94.55%, Standard = 45.83%), with comparable latencies. Moreover, users felt that it was easier to pinpoint the answers using IncVisage and terminate early. We present some user reactions here:

“… easier to know when I wanted to stop, because I had the overall idea first. And then I was just waiting to get the precise answer because I knew it was coming…”

“…really interesting to watch over time. Especially, at first when things were one or two blocks of color and then to sort of watch the data emerge and to then watch the different boxes become something…I actually caught myself watching for a long time.”

“… I preferred IncVisage, because it made it easier to kind of narrow down at least the range…Versus with the standard approach…if you picked the wrong one, you were very wrong, potentially.”

### Conclusions

We developed an incrementally improving visualization generation algorithm that reveals insights to the analyst when confident, thus allowing faster and more accurate decision making, along with a tool called *IncVisage* that implements this algorithm.

*We are hopeful that IncVisage represents a promising first step in the development of visualization tools that embody these principles, helping users get early but confident insights on very large datasets.*

#### Acknowledgements

Many thanks to NSF, NIH, the Siebel Energy Institute, Adobe, and Google for supporting this research.