Anyone Can Visualize!

Natural language for authoring data visualization.

Abhijith Reddy
VisUMD
4 min read · Dec 16, 2022



A picture is supposedly worth a thousand words. Yet authoring rich visualizations to present data is a challenge for many, and the learning curve of visualization tools remains steep for new users. Recent advances in natural language processing (NLP), however, are making data visualization more flexible and accessible to a wider audience. The premise of so-called natural language interfaces (NLIs) is to let people create visualizations for analyzing large-scale and complex data using speech or text. Implementing such NLP capabilities for data visualization, though, is far from straightforward. In this blog post, we discuss NL4DV, a new toolkit from Georgia Tech for creating NLIs for visualization authoring.

NL4DV is a Python package that developers can use to rapidly incorporate natural language capabilities into a visualization tool. It was developed to address common challenges in designing and implementing an NLI specifically for visualization: incorporating natural language technologies into an application normally requires working proficiency with the NLP methods involved.

NL4DV aims to reduce this learning curve to a minimum. Given a dataset and an input query, NL4DV returns a response object tailored for visualization. Developers can build functionality directly by passing this response to other components in their system. NL4DV’s output for a query is modular and consists of three components: attributes, tasks, and visualizations. Projects that do not use the default visualization grammar can therefore work with selected components and integrate them with their existing systems programmatically. Natural language queries are also inherently ambiguous, from partial references to implicit intent, and addressing this ambiguity in interpretation is critical. NL4DV therefore allows developers and users to specify dataset-specific aliases for terms that would otherwise be interpreted differently or ignored.
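To make this concrete, here is a minimal sketch of how a developer might call the toolkit from Python. The dataset path and query are placeholders, and exact argument names may vary between releases, so treat this as illustrative rather than definitive:

```python
from nl4dv import NL4DV

# Point NL4DV at a tabular dataset (placeholder path). The toolkit profiles
# the attributes (types, value ranges) when the data is loaded.
nl4dv_instance = NL4DV(data_url="movies.csv")

# Ask a question in plain English. analyze_query returns a JSON-style
# response describing the inferred attributes, tasks, and visualizations.
response = nl4dv_instance.analyze_query("show average gross across genres")

# Top-level keys (names per the toolkit's documentation): attributeMap,
# taskMap, visList.
print(response.keys())
```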

NL4DV starts by processing the dataset to gather information like the type and range of values for each attribute. Next, to generate relevant visualizations in response to a given query, NL4DV performs the following steps.

  1. Query Parsing: The query parser runs a series of NLP functions on the input text to extract part-of-speech (POS) tags, a dependency tree that captures relationships between words, and a sequence of keywords and phrases known as N-grams.
  2. Attribute Inference: NL4DV iterates through the N-grams to detect data attributes, whether mentioned explicitly by name or implicitly through an attribute’s values, and returns an attributeMap. Developers can specify aliases or domain-specific terms via an optional argument.
  3. Task Inference: After mapping N-grams to data attributes, the remaining N-grams are checked for references to analytic tasks (e.g., correlation, distribution, trend). This is done by leveraging the POS tags and dependency trees generated by the query parser. The identified attributes, values, and tasks are compiled into a taskMap.
  4. Visualization Specification Generation: Visualization types mentioned explicitly in the query are identified from the extracted keywords and encoded. Other suitable visualizations are derived from the inferred attributes using predefined heuristics. Finally, all inferred visualizations are appended to a visList (a simplified sketch of the resulting response object follows this list).
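As a rough illustration of what these steps produce, the response to a query such as “show average gross across genres” might look like the simplified sketch below. The exact keys and nesting are assumptions based on the names above; refer to the NL4DV documentation for the authoritative schema:

```python
# Simplified, illustrative shape of an NL4DV response object (not the exact schema).
response = {
    "attributeMap": {
        # Data attributes inferred from the query and how they were matched.
        "Worldwide Gross": {"inferenceType": "explicit"},
        "Genre": {"inferenceType": "explicit"},
    },
    "taskMap": {
        # Analytic tasks inferred from the remaining N-grams.
        "derived_value": [{"operator": "AVG", "inferenceType": "explicit"}],
    },
    "visList": [
        # Candidate visualizations, each carrying a Vega-Lite specification.
        {"visType": "barchart", "vlSpec": {"mark": "bar"}},  # spec truncated
    ],
}
```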

Applications

By default, NL4DV generates JSON output that conforms to Vega-Lite, a popular high-level grammar of interactive graphics, so it can be used in any environment that supports Vega-Lite. Rendering visualizations in Python environments such as Jupyter Notebook is straightforward: data scientists and novice programmers can render visualizations without having to know about visualization design or Python packages such as Matplotlib, Plotly, or Pandas.
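For example, in a Jupyter notebook one might take the Vega-Lite specification of the first candidate visualization and hand it to the notebook’s Vega-Lite renderer. The key names below follow the sketch above and are assumptions; check the toolkit’s documentation for the exact field names:

```python
from IPython.display import display

# Pull the Vega-Lite spec of the first candidate visualization
# (key names assumed; see the NL4DV docs for the exact schema).
vl_spec = response["visList"][0]["vlSpec"]

# JupyterLab and recent notebook frontends understand the Vega-Lite MIME type,
# so the spec can be rendered directly, without Matplotlib or Plotly.
display({"application/vnd.vegalite.v4+json": vl_spec}, raw=True)
```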

Similarly, NL4DV can be used to develop web-based NLIs: a backend call to analyze_query produces the JSON output in Python, which frontend JavaScript code can then render in real time using Vega-Embed.
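As a sketch of what such a backend could look like, the snippet below uses Flask as a hypothetical web framework; the route name and dataset path are placeholders, not part of NL4DV itself:

```python
from flask import Flask, jsonify, request
from nl4dv import NL4DV

app = Flask(__name__)
nl4dv_instance = NL4DV(data_url="movies.csv")  # placeholder dataset

@app.route("/analyze")
def analyze():
    # The frontend sends the natural language query as a URL parameter,
    # e.g. /analyze?query=show+average+gross+across+genres
    query = request.args.get("query", "")
    response = nl4dv_instance.analyze_query(query)
    # On the client, JavaScript can pick a spec from visList and pass it to
    # vegaEmbed("#vis", spec) to draw the chart in the browser.
    return jsonify(response)
```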

The result is an NL-based application that lets users create data visualizations simply by typing queries.

With the help of NL4DV, natural language interfaces for visualization can be built with just a few lines of code to parse the JSON output. In future releases, the team plans to support follow-up queries, enabling the conversational style of interaction that NLIs often offer. Currently, the package only interprets visualization-oriented queries; to support wider use cases, the query interpreter should also be able to generate visualizations for commands that do not specify tasks or that are purely question-based, like “How many people watched a movie with the least rating?”. Finally, based on feedback, the team plans to allow broader customization of attribute and task inference via custom NLP models. If you are interested in learning more about NL4DV and using it for your work, feel free to visit nl4dv.github.io.

References

  • Narechania, A., Srinivasan, A., & Stasko, J. (2021). NL4DV: A toolkit for generating analytic specifications for data visualization from natural language queries. IEEE Transactions on Visualization and Computer Graphics, 27(2), 369–379. https://doi.org/10.1109/tvcg.2020.3030378
