The Challenges of Dataset for Data Visualization Tools
Data influences the design of data visualization tools in many ways. The data type directly guides how the data can be visually displayed. Looking beyond the question of visual display, this article asks “how do characteristics of the dataset affect the design of an interactive data analysis tool?” Instead of the classic static/dynamic classification, this article examines three aspects of a dataset: familiarity, delivery and request frequency.
Static vs. Dynamic
In addition to the common classification based on data types (multi-dimensional, temporal, tree, graph, etc.), datasets are often classified as static or dynamic.
Static dataset: the dataset is fixed and never changes. Most work in storytelling and data journalism uses this kind of data.
In this situation, because everything is known and can be tested beforehand, the design can be tailored and optimized to deliver the best experience for the target data.
Dynamic dataset: the dataset may change. The visualization may receive a dataset that it was never tested against. This is more common for dashboards and data analysis tools. Examples include the distribution of users by country in the past 30 days, where the data are updated daily, or a stream of Tweets that is continuously being delivered.
Unexpected data behavior can lead to many kinds of problems, ranging from small glitches to apps crashing. When developing for a dynamic dataset, many practitioners start from a static dataset to avoid dealing with unexpected behavior. This allows one to focus on exploring the design space based on the sample first and adjust the solution to generalize later.
However, this broad classification into static or dynamic does not fully guide us through the variety of design considerations that should be taken into account. Instead, we may characterize the type of dataset based on these aspects:
1. Familiarity
Will the visualization need to work with datasets that were not tested during development?
If so, these are a few things to consider:
- Edge cases: Situations that are unlikely but can happen. For example, a scatterplot was designed for the test dataset and works fine. However, the new dataset may contain 100 points at the same location.
- Higher volume: For example, the test dataset has 100 points, so the scatterplot works nicely. What will happen if the dataset has 1,000,000 points?
- More variety: An example is a table that displays all attributes for each row. What if there are hundreds of attributes?
You may end up designing a solution that can accommodate the extreme conditions above, which may involve additional data processing to aggregate the output or special treatment in the visual design. Another option is to define the limits of the datasets that the visualization can handle.
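As a minimal sketch of the aggregation option for the scatterplot example, the function below falls back to binned counts when the dataset is too large or contains exact overlaps. The names `prepare_scatter` and `bin_points`, the `MAX_RAW_POINTS` limit, and the grid cell size are all assumptions made for illustration; a real tool would pick them based on its renderer and chart size.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]
Cell = Tuple[int, int]

# Hypothetical limit on how many raw points the chart draws directly.
MAX_RAW_POINTS = 5_000


def bin_points(points: Iterable[Point], cell: float) -> Dict[Cell, int]:
    """Count how many points fall into each square grid cell of side `cell`."""
    counts: Counter = Counter()
    for x, y in points:
        counts[(int(x // cell), int(y // cell))] += 1
    return dict(counts)


def prepare_scatter(points: List[Point], cell: float = 1.0) -> dict:
    """Return raw points for small, overlap-free data; otherwise binned counts."""
    has_overlap = len(set(points)) < len(points)  # edge case: identical locations
    if len(points) <= MAX_RAW_POINTS and not has_overlap:
        return {"mode": "raw", "points": points}
    return {"mode": "binned", "cells": bin_points(points, cell)}
```

With this shape, the visual design only has to handle two cases: a plain scatterplot for the `raw` mode, and something like a heatmap or sized circles for the `binned` mode.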
2. Delivery
Given a request, how are the data delivered?
Whether it is familiar or unfamiliar, the data must be stored somewhere. This could be one or more text files, database tables, web services, Hadoop, etc., which may or may not be the final form in which the data are delivered to the visualization.
For instance, a file users.json is transferred as-is to the visualization. On the other hand, shopping order information is stored in a database table and needs to be queried when requested.
The different kinds of data storage, and the transformation steps that happen along the journey from stored data to final data (anything from SQL queries to MapReduce jobs), can all lead to different delivery behaviors, which in turn influence the design.
2.1 Delivered all at once
When the data are delivered all at once, the key factor is speed. How long does it take for the data to arrive from the moment users request it?
- Very Fast: The data are delivered in milliseconds. These are usually small text files that can be loaded quickly or even bundled with the visualization code itself.
- Fast: The data are delivered in less than a minute. This is common for simple SQL queries. Users will experience a wait from the moment they request data until the visualization is ready, but are likely to stay on the same page until it is loaded. A loading indicator is recommended.
- Slow: The data are delivered in minutes due to more expensive queries. Users may switch to other tabs or applications while waiting and are likely to check back for progress. A progress bar and a gentle notification once the data are ready are recommended. The key is to reassure users that the request is still being processed. Seeing the percent completion number increase gives a lot of confidence. Emotional interface design can play a part here. I once added a cute dancing duck to one of the dashboards and it helped people feel less angry while waiting for results.
- Offline: The data are ready in hours. This situation is more common for MapReduce jobs or the like. At this point, nobody is likely to wait. The more of a preview you can provide before users request the full data, the better. A notification email or some other kind of alert once the job completes is recommended, as users are very unlikely to still have the web app tab open or to still be in front of the computer. The notification should provide a deep link that sends the user back to the visualization of the requested data with very little effort.
An alternative that avoids these wait times is to schedule and precompute the final data. This could be done for recurring reports, such as a daily report on the number of users who log in.
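The four speed tiers in this section can be sketched as a small lookup from expected delivery time to the kind of feedback the interface gives. This is only an illustration: the function name `choose_feedback` and the exact thresholds (1 second, 1 minute, 1 hour) are assumptions, since the tiers above are described only as milliseconds, under a minute, minutes, and hours.

```python
from typing import NamedTuple


class Feedback(NamedTuple):
    tier: str
    ui: str


# Thresholds in seconds are illustrative assumptions, not fixed rules.
VERY_FAST_MAX = 1.0
FAST_MAX = 60.0
SLOW_MAX = 3600.0


def choose_feedback(expected_seconds: float) -> Feedback:
    """Map an expected delivery time to a tier and a suggested UI treatment."""
    if expected_seconds < VERY_FAST_MAX:
        return Feedback("very fast", "render immediately")
    if expected_seconds < FAST_MAX:
        return Feedback("fast", "loading indicator")
    if expected_seconds < SLOW_MAX:
        return Feedback("slow", "progress bar and notification when ready")
    return Feedback("offline", "preview first, then email with deep link")
```

For example, `choose_feedback(90).ui` suggests a progress bar with a notification, while a job expected to take several hours falls into the offline tier.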
2.2 Delivered partially until completed
Another option for big data processing is to output some data during the process to provide incremental feedback along the way. This could be:
- Partially completed data: By breaking the dataset into smaller independent parts and computing them separately, the completed parts can be used to make decisions with confidence while waiting for the other parts to complete.
- Overall approximation: A rough idea of what the final data may look like. This feedback can hint at whether the output is as expected. It also helps users catch mistakes earlier, so they can abort the job before it reaches 100% completion. See examples of systems that use this approach in [2,3].
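To make the overall approximation idea concrete, here is a small sketch that reports a running mean with a rough 95% interval after each chunk of data arrives. It is not the method used by the systems in [2,3]; it is a generic running-statistics approach (Welford's algorithm) and it assumes the chunks arrive in random order, so that the rows seen so far behave like a random sample. The function name `incremental_mean` is made up for this example.

```python
import math
from typing import Iterable, Iterator, List, Tuple


def incremental_mean(
    chunks: Iterable[List[float]],
) -> Iterator[Tuple[int, float, float]]:
    """Yield (rows_seen, running_mean, half_width) after each chunk.

    half_width is an approximate 95% interval around the mean, assuming
    the rows seen so far are a random sample of the full dataset.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks:
        for x in chunk:
            # Welford's update: numerically stable running mean and variance.
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        if n > 1:
            half_width = 1.96 * math.sqrt(m2 / (n - 1) / n)
        else:
            half_width = float("inf")
        yield n, mean, half_width
```

A visualization could redraw a bar with an error band on each yielded tuple, letting users abort early if the estimate already looks wrong.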
2.3 Delivered continuously
Also known as real-time or live data, this is a special case in which the data are delivered continuously and indefinitely. It is common for sensors or social media that show data in the moment. The challenges are:
- Never-ending: There is no completed state. Users may pause the stream, but they may also resume it. Accumulating data without overloading the system can be challenging.
- Speed: Data transformation must be fast. Otherwise the displayed data will not be up-to-date.
- Information overload: The visualization should highlight the most recent changes and may provide context from accumulated data. Visual Sedimentation [4] is an example of such a technique.
- Synchronization: It is harder to ensure that every user of the tool sees the same thing at the same moment in time. For example, I open the visualization first. One minute later, you open it on another laptop. Ensuring that we both see the same thing can be challenging if data accumulation is needed.
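One simple way to accumulate a never-ending stream without unbounded memory is to keep only a fixed-size window of recent values plus a few running aggregates for context. The sketch below assumes numeric values and a hypothetical `StreamBuffer` class; it does not address synchronization between users.

```python
from collections import deque
from typing import Deque


class StreamBuffer:
    """Keep the most recent values plus running totals for the whole stream."""

    def __init__(self, max_recent: int = 1000) -> None:
        # deque with maxlen silently drops the oldest item when full.
        self.recent: Deque[float] = deque(maxlen=max_recent)
        self.count = 0
        self.total = 0.0

    def push(self, value: float) -> None:
        """Record one incoming value."""
        self.recent.append(value)
        self.count += 1
        self.total += value

    def snapshot(self) -> dict:
        """Return what a chart would draw: recent detail and overall context."""
        mean = self.total / self.count if self.count else 0.0
        return {"recent": list(self.recent), "count": self.count, "mean": mean}
```

The recent window supports highlighting the latest changes, while the count and mean give the accumulated context at constant memory cost.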
3. Request frequency
Will the user request data only once, with everything operating in memory from the moment the data are delivered, or will there be one or more actions that can trigger new requests?
For a system with the potential for multiple requests, delivery speed becomes even more significant. Even fast queries can have an effect. Research [5] has shown that interacting with a higher-latency tool (+500ms) causes users to unconsciously interact less with the tool.
There is a lot of active research on improving the capabilities of databases for interactive analysis [1] to speed up and improve the experience of the entire process. Hopefully, it will get better and better.
Nevertheless, the visualization team should still keep these considerations in mind when making decisions on characteristics of the dataset and consequently the design of the visualization tool.
[1] The 1st Workshop on Data Systems for Interactive Analysis. www.interactive-analysis.org
[2] Barnett, Mike, et al. “Stat!: An interactive analytics environment for big data.” Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013.
[3] Fisher, Danyel, Igor Popov, and Steven Drucker. “Trust me, I’m partially right: incremental visualization lets analysts explore large datasets faster.” Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2012.
[4] Huron, Samuel, Romain Vuillemot, and Jean-Daniel Fekete. “Visual sedimentation.” IEEE Transactions on Visualization and Computer Graphics 19.12 (2013): 2446–2455.
[5] Liu, Zhicheng, and Jeffrey Heer. “The effects of interactive latency on exploratory visual analysis.” IEEE Transactions on Visualization and Computer Graphics 20.12 (2014): 2122–2131.
Thanks to Micah Stubbs for suggesting the grammar and writing style fixes.