The “Swiss Army Knife” of Document Processing — Why we invested in thingsTHINKING

Published in Earlybird's view · 7 min read · May 12, 2021

We are excited to have led the € 4.5 million seed financing round of Karlsruhe-based thingsTHINKING alongside a syndicate of experienced angel investors. The exceptional team around the founders Dr. Sven Körner, Dr. Mathias Landhäußer, Malik El Guesaoui and Georg Müller built a semantic platform that works out-of-the-box, speaks 100+ languages and can be trained within minutes to process any type of document.

You hate it, I hate it, everyone hates it: searching for information in a mess of data. As Drew Houston, the Co-Founder and CEO of Dropbox, once said: “We live in a world where it’s easier to search all of human knowledge than my company’s knowledge”.

⏳ The bottleneck of knowledge work

Knowledge workers globally spend on average more than 20% of their time searching, comparing and gathering unstructured information, resulting in huge efficiency losses. While this number stems from a McKinsey study from about a decade ago, the increasing amount of data, together with its widespread distribution across organisations, has certainly worsened the problem. Luckily, natural language processing (NLP) is here to help and remove the inefficiencies, at least in theory; I will come to the crux in a second. According to recent estimates, the resulting NLP market is expected to grow to more than $25 billion by 2024, at a CAGR of approximately 20%. Let’s first get on the same page and discuss what NLP is and what it isn’t.

👩‍🎓 What is NLP?

By definition, NLP is a subfield of AI and linguistics concerned with the interactions between computers and human language. The field is commonly traced back to Alan Turing’s 1950 paper “Computing Machinery and Intelligence”, and a range of techniques has emerged since: symbolic NLP (1950s to 1990s), statistical NLP (1990s to 2010s) and neural NLP (2010s to today), all aimed at programming computers to process and analyze large amounts of language data. Today, the field is very broad and consists of a variety of approaches, each with its own benefits and shortcomings. The most recent and most promising approaches are so-called transformer models. These weigh the influence of different parts of the input data and thereby provide the context that confers meaning on a word. Google’s BERT and OpenAI’s GPT-3 are two impressive examples of pre-trained transformer models. Yet however powerful these NLP approaches are, they remain useless at solving inefficiency unless they are properly productised and integrated into the workflows of knowledge workers.
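To make the “weighing” idea a bit more tangible, here is a minimal sketch of scaled dot-product attention, the core operation inside transformer models. The three token vectors are made-up numbers chosen for illustration and are not taken from BERT, GPT-3 or any other real model.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Similarity of the query to every key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The context vector is the weighted average of the value vectors.
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# Hypothetical 2-dimensional vectors for three tokens (illustrative only).
tokens = ["river", "bank", "money"]
vectors = [[1.0, 0.0], [0.6, 0.6], [0.0, 1.0]]

# How strongly does "bank" attend to each token, itself included?
weights, context = attend(vectors[1], vectors, vectors)
```

The weights say how much each surrounding token contributes to the representation of “bank”. In a real transformer the vectors are learned, and separate projections produce the queries, keys and values.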

🌊 The first wave of NLP startups

Thankfully, a range of researchers have productised their own (or others’) academic results and applied them to real-world problems. To give some context, I’d like to map the market along two dimensions: one axis describes the industries and the other axis describes the specific use cases within an industry.

Industry/use case framework for document processing/NLP (source: Earlybird)

Over the last few years, we’ve closely followed the NLP and document processing market. Most startups claim to build a horizontal platform that removes inefficiencies across industries and use cases, and most founders claim they will expand in concentric circles over time. In reality, their focus starts, and oftentimes remains, very narrow, i.e. at the intersection of one or a few specific industries and one or a few specific use cases. Looking at this in more detail, it becomes clear that the chicken-and-egg problem of horizontal AI is the explanation.

🧩 A highly fragmented landscape

As described in my “Deconstructing the AI Landscape” post, horizontal tech providers often have difficulties getting access to proprietary training data. Without suitable training data that contains industry and use case specific language, the respective models hardly provide any real value. Once the flywheel starts spinning, startups must decide whether to double down on the industry/use case intersection (by serving customers and thereby receiving more data, which in turn creates more value for the same type of customers), or whether to focus their resources on new industry/use case intersections to prove their horizontal potential. Seemingly, most startups opt for the former to build up commercial traction and competitive moats, i.e., they dive into a specific industry/use case rabbit hole. As a result, they transition from horizontal tech providers to vertical integrators, which limits their focus and thus their addressable market size.

On a macro level, this leads to a highly fragmented landscape of what we call “one-trick ponies”: providers that serve one industry/use case intersection perfectly well but have difficulty replicating their value proposition along the industry or use case dimensions. As customers tend to have a colourful bouquet of use cases, they are not only forced to spend significant resources training new models for their very specific use cases, they also end up with a messy NLP stack from a range of providers with different interfaces. This is counterproductive, as the ultimate goal was to remove inefficiencies rather than create new ones.

🚀 Clear the stage for semantha

Based on more than 14 years of NLP research at Karlsruhe Institute of Technology (KIT), thingsTHINKING has built a semantic platform that combines the superpowers of various NLP techniques. Unlike incumbent solutions, their pre-trained platform has a comprehensive understanding of context and the actual meaning of words (LIBRARY in figure below). For example, semantha understands that “the car was very fast” means the same as “the vehicle was moving quickly” or “der Rennwagen fährt 260 km/h”. Different words and different languages (100+ already covered), same meaning.
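To illustrate the general idea of “different words, same meaning”, here is a toy sketch that compares sentences by the concepts they mention rather than by their surface words. This is not how semantha works internally (the source does not describe that): the `CONCEPTS` lexicon, the `fingerprint` helper and the concept ids are all hypothetical, and real systems typically rely on learned multilingual embeddings instead of a hand-written dictionary.

```python
import math
from collections import Counter

# Hypothetical concept lexicon: words from different languages share concept ids.
CONCEPTS = {
    "car": "VEHICLE", "vehicle": "VEHICLE", "rennwagen": "VEHICLE",
    "fast": "SPEED", "quickly": "SPEED", "km/h": "SPEED",
    "invoice": "BILLING", "rechnung": "BILLING",
    "overdue": "LATE",
}

def fingerprint(sentence):
    """Count the concepts mentioned in a sentence, ignoring unknown words."""
    words = sentence.lower().replace(".", "").split()
    return Counter(CONCEPTS[w] for w in words if w in CONCEPTS)

def cosine(a, b):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(a[c] * b[c] for c in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

english = fingerprint("The car was very fast")
paraphrase = fingerprint("The vehicle was moving quickly")
german = fingerprint("Der Rennwagen fährt 260 km/h")
unrelated = fingerprint("The invoice is overdue")
```

In this toy setup, `cosine(english, paraphrase)` and `cosine(english, german)` come out at (numerically) 1, while `cosine(english, unrelated)` is 0, even though the three matching sentences share no content words.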

⚙️ How does semantha work?

To start with, customers present semantha with a few specific examples of what they are looking for in the documents, just like they would with a new co-worker. For example, if an insurer needs to compare different contracts and make sure that a specific type of clause gets removed, they provide a few examples of such clauses and have the model learn the actual meaning. While these examples educate the industry and use case specific LANGUAGE MODEL, the respective anonymized semantic fingerprints are used to advance the ever-growing LIBRARY.

Moreover, thingsTHINKING creates an ONTOLOGY GRAPH on the basis of the LIBRARY and the LANGUAGE MODEL. With it, semantha semantically analyzes any kind of text document, identifies the relevant clauses on the basis of meaning and either just highlights them or substitutes them with pre-defined alternatives, at scale. thingsTHINKING started with text data, but transformers will soon allow them to feed any data type (from video to images to audio) into their platform, i.e. to provide multi-modal functionality. So that customers can stay in their accustomed workflows, semantha provides integrations into MS Office and allows users to build their own solutions through its 150+ REST API web services.
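The “give a few examples, then highlight or substitute” workflow described above can be sketched roughly as follows. Everything here is an assumption made for illustration: `review`, `token_overlap`, the `[FLAGGED]` marker and the threshold are hypothetical and are not part of semantha’s API. The similarity function is a purely lexical stand-in; a semantic platform would plug a meaning-based comparison into the same slot.

```python
import re

def token_overlap(a, b):
    """Stand-in similarity: Jaccard overlap of word sets (lexical, not semantic)."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def review(clauses, examples, replacement=None,
           similarity=token_overlap, threshold=0.5):
    """Flag clauses that resemble any example; substitute them if asked to."""
    result = []
    for clause in clauses:
        score = max((similarity(clause, ex) for ex in examples), default=0.0)
        if score < threshold:
            result.append(clause)                 # unrelated clause: keep as-is
        elif replacement is None:
            result.append(f"[FLAGGED] {clause}")  # highlight only
        else:
            result.append(replacement)            # swap in the alternative
    return result

# A single example clause, as an insurer might provide it (made-up text).
examples = ["The insurer may cancel the policy without notice."]
contract = [
    "The insurer may cancel this policy without prior notice.",
    "Premiums are due on the first day of each month.",
]
alternative = "The insurer may cancel the policy with 30 days' notice."

highlighted = review(contract, examples)
rewritten = review(contract, examples, replacement=alternative)
```

Keeping `similarity` as a parameter is the point of the sketch: the matching loop stays the same whether the comparison is word overlap, as here, or a measure that actually captures meaning.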

When we spoke to experts from our portfolio company UiPath, the RPA leader at the center of any kind of document processing and one of thingsTHINKING’s partners, it became crystal clear why semantha is so special. They call her “the Swiss Army Knife of document processing” because she is highly versatile, easy to use and provides value from day one. This is exactly what distribution partners are looking for when approaching new customers with previously unseen NLP or document processes.

With a “too good to be true” gut feeling, we spoke to a range of thingsTHINKING’s license customers and were impressed by the glimmer in their eyes. As you can imagine, we conduct hundreds of customer reference calls over the course of a year, but the ones for thingsTHINKING truly stood out. While some customers spoke about significant ROI expectations and next-level efficiency in their core business, others were already brainstorming how they could roll out and leverage the semantha platform across departments and use cases. Serving a range of customers from SMEs to enterprises such as BASF, Gothaer or Hella, the team decided to take the next step and raise a seed financing round for further expansion.

We are happy to share that together with a strong angel syndicate, Earlybird led the € 4.5 million Seed financing round of thingsTHINKING to revolutionize document processing.

Team thingsTHINKING, welcome to the #EBVCgang! We’re excited about the journey ahead.

To learn more about thingsTHINKING and the document processing revolution, please see their website. They are hiring! 🕵️‍♂️

Are you a founder, industry expert, VC or researcher interested in the field of AI? I’d be more than happy to learn about your work, so feel free to reach out via andre@earlybird.com.