Everyone’s an Analyst: Enabling Natural Language-style Queries over Structured Data
Author: Anmol Takiar, Software Engineer Intern
Every summer, we host talented student interns to work on challenging data science and data engineering projects at Mobikit. In this article, we share some details about the project that our software engineer intern, Anmol Takiar, completed this past summer. Anmol is a rising senior at Ohio State studying Computer Science. (PS: we’re always looking for awesome teammates, contact us!)
Data is on everyone’s mind, whether it’s big data or small, structured or unstructured, in a database or floating around the organization. As we leap into the age of data, a time when actionable insights are tied closely to competitive advantage, companies have an existential need to democratize access to big data.
A key observation we made when working with customers is that our users often possess immense domain knowledge about their data but are unfamiliar with the tools or languages needed to perform complex data analysis. This gap can be found in nearly every large enterprise. It is clear that we need to increase the accessibility of data exploration and analytics tools, helping everyone decipher large, complex datasets — from the newly hired data analyst to employees who’ve never analyzed data before.
Our tool, AnalystBox, enables autocompletion-based natural language-style search over structured data. It’s as if you literally had an “analyst in a search box.”
What could you do if you were able to ask your data a question as effortlessly as you would ask Google? What insights could you derive? What business decisions could you empower? How useful would it be to know what songs were playing when the car speed was high, at what times accidents occur most frequently, or perhaps how many times drivers were changing the radio when an accident occurred?
While we’re not the first to try to solve this problem, what sets us apart is the simplicity of our approach. Traditional approaches include incremental query builders and natural-language-to-query translators. Yet query builders come with learning curves, and translators often require GPUs, can take minutes to translate a question, and vary widely in accuracy, with the current state of the art around 80% (Min 2019, Ma 2020, Yu et al. 2020). Instead, AnalystBox immediately derives questions from configurable rules and the data itself. Then, as you type, it autocompletes your query with the closest generated question that could answer it.
How does it work? First, AnalystBox’s backend Python package connects to your dataset and generates useful questions, results, and keywords based on the data present. For example, numerical columns get questions about their maximum, minimum, or average values. Each question is paired with a similarly generated result and set of keywords. A result can be a SQL expression, a set of geographical coordinates, or anything else. Yet this is just the start. Writing a few additional rules can generate millions of questions, ranging from basic univariate to complex multivariate. Once the questions are generated, AnalystBox’s React component enables powerful search with autocomplete, tokenized bolding, and keyword highlighting. Not only do these capabilities save time, they also educate the user on the types of questions that are worth asking. Finally, once a question is selected, we quickly surface the insight: a database could run the corresponding SQL query, a map could navigate to the geographical coordinates, and so on. The possibilities are endless.
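To make the idea concrete, here is a minimal Python sketch of rule-based question generation followed by a simple autocomplete ranking. It is not AnalystBox’s actual code or API: the names `GeneratedQuestion`, `NUMERIC_RULES`, `generate_questions`, and `autocomplete`, the `trips` table, and the prefix-matching score are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GeneratedQuestion:
    text: str        # question shown in the search box
    result: str      # here a SQL expression; could be coordinates or anything else
    keywords: tuple  # terms a UI could highlight


# Hypothetical rule set: each entry maps a question word to a SQL aggregate.
NUMERIC_RULES = {"maximum": "MAX", "minimum": "MIN", "average": "AVG"}


def is_numeric(values):
    """True if every sample value is an int or float (bools excluded)."""
    return bool(values) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


def generate_questions(table, columns):
    """Derive questions from the data: `columns` maps column name -> sample values."""
    questions = []
    for name, values in columns.items():
        if not is_numeric(values):
            continue  # this sketch only has rules for numeric columns
        label = name.replace("_", " ")
        for word, func in NUMERIC_RULES.items():
            questions.append(GeneratedQuestion(
                text=f"What is the {word} {label}?",
                # Assumes trusted schema names; real code should quote identifiers.
                result=f"SELECT {func}({name}) FROM {table}",
                keywords=(word, label),
            ))
    return questions


def autocomplete(query, questions, limit=5):
    """Rank questions by how many query tokens are a prefix of some question word."""
    tokens = query.lower().split()
    scored = []
    for question in questions:
        words = question.text.lower().rstrip("?").split()
        score = sum(any(w.startswith(t) for w in words) for t in tokens)
        if score:
            scored.append((score, question))
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep generation order
    return [question for _, question in scored[:limit]]


# Example with made-up data: one numeric column and one text column.
columns = {"speed_mph": [42.0, 67.5, 71.2], "song_title": ["A", "B", "C"]}
questions = generate_questions("trips", columns)
suggestions = autocomplete("max speed", questions)
```

In this sketch, the top suggestion for "max speed" is the maximum-speed question, and its `result` field holds the SQL a database could run once the user selects it. Adding an entry to the rule table, or a new rule for another column type, multiplies the number of generated questions without changing the search code.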
With AnalystBox, you can empower anyone to explore data, ask meaningful questions, and harness complex insights. Find AnalystBox on Mobikit’s GitHub. If you’re interested in any of our analytics capabilities or the other work that we do, send us a note at hello@mobikit.io.