How Instabase Makes “Complex Data” Usable

Dylan Lockman
Instabase Blogs
Jan 28, 2020

At Instabase, we make unstructured and structured data useful for business productivity. Complex data, meaning data that today only humans can use, presents a variety of gray problems. Day to day, working with it looks like transcribing customer data from PDFs into business systems, screening contracts for specific phrases, and even validating other people’s manual work. The tools developed to make this information useful are often either brittle rules or opaque machine learning models under the hood. Given the variety of complex data and the gray area that surrounds it, the best approach is to make algorithmic rules and machine learning techniques available to be used together.

Something like a simple receipt is deceptively complex. While the information presented may be easy for humans to process, this isn’t necessarily the case for a computer. A computer or business system expects structured data with clearly defined tags like “subtotal” or “return-policy” to give it meaning or context.

Regardless of industry (financial services, insurance, healthcare, and so forth), organizations collect and receive incredible volumes of data, but it rarely arrives in the exact form they need, and it often lacks context. As a result, companies throw an immense amount of time, resources, and technology at this “complex data” to render it usable for a variety of purposes. For some data, rule-based systems work. For other forms of data, predictive or probabilistic techniques such as machine learning can be used. And there is always a subset of exceptions that falls back into the hands of large in-house or outsourced operational processing teams.

Despite these investments, organizations continue to face the same challenge: incomplete results from incomplete solutions.

Because the scenarios, data, and objectives vary so widely, Instabase’s philosophy is that diverse tools and proper context provide the most coverage and completeness.

So, what are gray problems and why describe them as gray? Consider the similarities and differences between the following types of documents:

  • Your last receipt from the grocery store
  • A contract drawn up between you and a supplier
  • A scanned image of your passport

They may contain common features such as a date, a total, and perhaps some addresses or multiple line items. An individual can identify these features and then consider each one as a distinct data point as they visually navigate the document and decide what to take away from it. But humans are smart. A person subconsciously uses multiple techniques at once and applies context, which leads them to a thought or decision.

  • This is a grocery receipt… there are 4 line items of ice cream ranging from $3.99 to $7.99.
  • This is a contract… it is effective today and it describes the statement of work, terms, and material costs.
  • This is my passport… I see multiple date values on the page. My passport expiration date is next year.

There are multiple layers of context we use to quickly get the information we need.

  1. First, knowing a document’s type will shape our intent and help us find what we want. A date printed on a passport carries much greater weight than a date on a grocery receipt. Without consciously deciding to, we are employing this context to determine what we are looking for and the reason it is important.
  2. Second, information has many contextual hints buried in how it’s presented. Clues such as relative position on a page, data labels, or standard date formats enable an individual to quickly identify and accurately ingest this information.
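The two layers of context above can be sketched in a few lines of Python. This is only an illustration, not how Instabase implements it: the `PRIORITY_LABEL` mapping, the `pick_date` helper, the ISO date format, and the sample passport text are all assumptions made for the example. The document type decides which date matters, and the label printed next to each date decides which value to take.

```python
import re
from typing import Optional

# Hypothetical mapping: for each document type, the label of the date we care about.
PRIORITY_LABEL = {
    "passport": "expiration",
    "contract": "effective",
    "receipt": "date",
}

# Assumes dates are printed as YYYY-MM-DD; real documents use many formats.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def pick_date(doc_type: str, text: str) -> Optional[str]:
    """Return the date on the first line whose label matches the document type."""
    label = PRIORITY_LABEL.get(doc_type)
    if label is None:
        return None  # unknown document type: no context to apply
    for line in text.splitlines():
        if label in line.lower():
            match = DATE_PATTERN.search(line)
            if match:
                return match.group(0)
    return None


passport_text = """Date of birth: 1990-04-12
Date of issue: 2016-03-01
Date of expiration: 2026-03-01"""

print(pick_date("passport", passport_text))
```

The same page holds three dates, and only the combination of document type and label singles out the one that matters.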

People are not limited to a single approach. They process the available details, overlay context, and select the best tool, which is often a combination of tools. This process may seem trivial given the examples above. However, when considering the mountains of documents and inbound information that organizations struggle to address, this mental model provides an elegant framework to mimic in computing.

Great advancements have been made in algorithmic data parsing and machine learning, but these techniques remain disjointed and challenging to implement.

No matter how advanced our tools become, why do we struggle to give them the contextual awareness that we, as humans, find so innate?

There is an immense opportunity to make these techniques available in such a way that they can be connected and contextualized to solve the mixed bag of complex data problems. If one considers algorithms as efficient shortcuts and machine learning models as context, the best solution is a hybrid: a combination of black and white.

Consider the supplier contract mentioned above. On any given page, there can be individual features such as an effective date or a table of materials and corresponding prices, but there can also be paragraphs of text that outline terms and conditional terms. To fully review, extract, and understand the contents, two completely different lenses must pass across the same document to retrieve and analyze key details. Individual features (numbers, dates, addresses) are best addressed by algorithmic extraction, as it is precise and has an explainable audit trail. By contrast, interpreting information from sentences and paragraphs (risky terms in a contract) is better suited to Natural Language Processing (NLP) toolkits, which are grounded in machine learning.
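A minimal sketch of those two lenses, under stated assumptions, might look like the following. None of this is Instabase code: the `RULES` patterns, the `Extraction` record, and the sample contract text are invented for illustration, and `keyword_scorer` is a deliberately crude stand-in for a trained NLP model. The rule-based half records which pattern fired and where, which is what makes its output auditable; the prose half accepts any scoring function, so a real model could be swapped in.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Extraction:
    field: str
    value: str
    rule: str   # the pattern that produced the value (the audit trail)
    start: int  # character offsets of the value in the source text
    end: int


# Illustrative rules only; real contracts need many more patterns.
RULES = {
    "effective_date": r"effective\s+(?:as\s+of\s+)?(\d{4}-\d{2}-\d{2})",
    "amount": r"\$(\d+(?:,\d{3})*(?:\.\d{2})?)",
}


def extract_fields(text: str) -> List[Extraction]:
    """Lens 1: precise, explainable extraction of individual features."""
    results = []
    for field, pattern in RULES.items():
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            results.append(Extraction(field, m.group(1), pattern, m.start(1), m.end(1)))
    return results


def flag_clauses(text: str, scorer: Callable[[str], float], threshold: float = 0.5) -> List[str]:
    """Lens 2: score each sentence and keep the ones at or above the threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [s for s in sentences if scorer(s) >= threshold]


def keyword_scorer(sentence: str) -> float:
    """Placeholder for a trained model: 1.0 if a risky keyword appears, else 0.0."""
    risky = ("penalty", "terminate", "liable")
    return 1.0 if any(word in sentence.lower() for word in risky) else 0.0


contract = (
    "This agreement is effective as of 2020-01-28. "
    "Materials cost $1,250.00 per unit. "
    "The supplier is liable for late delivery."
)

for item in extract_fields(contract):
    print(item)
print(flag_clauses(contract, keyword_scorer))
```

Because each `Extraction` stores its offsets, `contract[item.start:item.end]` always reproduces the extracted value, so a reviewer can trace every field back to the exact characters it came from.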

Instabase is making complex data usable through a platform of connected black, white, and gray tools.

These tools, targeting unstructured and semi-structured data, can be grouped into three primary categories: data extraction, natural language processing, and document analysis. While the tools have many strengths as individual components, their overall capability greatly increases when they are stacked on each other as building blocks. Instabase’s native workflow engine serves to connect and sequence apps, which enables organizations to think holistically about their business process.

In a single flow, a document can be parsed into structured machine-readable text, classified based on a variety of criteria, processed to extract specific fields of interest, and analyzed with NLP models. This same Instabase flow can be tuned to retrieve an entirely different set of facts and run different analyses across scenarios like contracts, financial reports, news articles, claims reports, emails, and more. Considering that the flow described above only uses 5 tools and there are nearly 30 on the platform today, the addressable problem space with this framework grows exponentially.
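The idea of stacking tools into a flow can be illustrated with plain function composition. This is a hypothetical sketch, not the Instabase workflow engine or its API: the step names (`parse`, `classify`, `extract`), the `run_flow` helper, and the keyword-based classifier are all assumptions chosen to keep the example small.

```python
import re
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[Record], Record]


def parse(record: Record) -> Record:
    """Normalize raw input into clean text by collapsing whitespace."""
    text = " ".join(str(record["raw"]).split())
    return {**record, "text": text}


def classify(record: Record) -> Record:
    """Toy classifier: keyword checks stand in for a real classification model."""
    text = str(record["text"]).lower()
    if "agreement" in text:
        doc_type = "contract"
    elif "subtotal" in text:
        doc_type = "receipt"
    else:
        doc_type = "unknown"
    return {**record, "doc_type": doc_type}


def extract(record: Record) -> Record:
    """Pull dollar amounts out of the parsed text."""
    amounts = re.findall(r"\$(\d+(?:\.\d{2})?)", str(record["text"]))
    return {**record, "amounts": amounts}


def run_flow(steps: List[Step], record: Record) -> Record:
    """Run each step in order, passing the enriched record along."""
    for step in steps:
        record = step(record)
    return record


result = run_flow([parse, classify, extract], {"raw": "Subtotal:   $3.99\nTotal: $7.99"})
print(result["doc_type"], result["amounts"])
```

Since every step takes a record and returns an enriched record, retargeting the flow at a different document type is a matter of swapping or reordering entries in the `steps` list.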

The all-too-often-used metaphor of “right tool for the right job” fails to capture the fact that challenging problems are rarely solved by a single tool and require a combination of approaches. Whether the objective is to extract the key fields from a tax document for a loan decision, to harvest terms and conditions, dollar amounts, and dates from supplier contracts, or to screen web media for adverse news events, a comprehensive toolset that invites context is invaluable. By breaking down the difficult aspects of complex data into components that familiar tools can solve, Instabase is changing the way individuals, teams, and organizations make use of information in its rawest form. To effectively address the gray problems complex data presents today, and to be prepared for the many that will arise in the future, it is essential to adopt a toolset that is as contextualizable as it is diverse.

Dylan Lockman
Sales Engineering @ Instabase. In my spare time I enjoy photography, traveling, and tennis.