Why is ML-Based Data Extraction in Insurance a Game Changer?

Discover how machine learning can help insurers automate data extraction and retrieve key information from PDFs within a few seconds.

Volodymyr Mudryi
Intelliarts AI
11 min read · Jul 15, 2022

ML-based data extraction

Insurance data is a real goldmine for insurance companies and insurtechs. Policy submissions, claims and complaints, cost evaluations, contracts, expert and health reports — all this gets documented in everyday business operations.

Unfortunately, at least 80% of this data is collected and stored in unstructured formats, PDFs included. This complicates access to insurance data and, as a result, slows down decision-making.

As a subfield of AI, machine learning (ML) can help insurers automate data extraction and retrieve key information from PDFs in a matter of seconds. Underwriters and insurance agents will get quick access to critical data, without the need to go through thousands of pages manually.

What is information extraction in insurance?

Data mining vs. data extraction in insurance

Information extraction fits into the wider concept of text mining, an AI technique used to convert raw, unstructured data into a structured form. The purpose is straightforward: a computer understands only structured information, which makes text mining critical in many industries, from finance to law to healthcare to insurance.

Within text mining, we can then talk about data extraction. Its idea is to retrieve useful information from a large body of text by understanding its entities, attributes, and relationships. How is this possible? With the help of machine learning: ML algorithms automatically scan the text and retrieve the core words or phrases from unstructured insurance documents.

Data extraction in insurance

Here is the simplest example: an insurance company gets a request for dog insurance. During automated claims processing, an insurance agent looks for this specific case and types the keywords "dog insurance" into the ML-based system. Instead of scanning through hundreds of pages, the agent only has to review the few dozen that the system returns. The system can even highlight the passages where "dog insurance" is mentioned.
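To make this workflow a bit more concrete, here is a minimal sketch of the lookup step, assuming the PDF pages have already been converted to plain text by an OCR engine. The `pages` dictionary and the `find_mentions` helper are purely illustrative, not part of any real product.

```python
# A minimal sketch of the keyword lookup, assuming pages are already plain text.
def find_mentions(pages, query):
    """Return, per page number, the sentences that mention the query."""
    hits = {}
    for page_number, text in pages.items():
        matching = [
            sentence.strip()
            for sentence in text.split(".")
            if query.lower() in sentence.lower()
        ]
        if matching:
            hits[page_number] = matching
    return hits

pages = {
    1: "General terms and conditions apply.",
    7: "The policyholder requests dog insurance for a two-year-old husky.",
}
print(find_mentions(pages, "dog insurance"))
# {7: ['The policyholder requests dog insurance for a two-year-old husky']}
```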

Benefits of data extraction in insurance

Information extraction has become increasingly useful in insurance in recent years, as more insurers keep up with the times and go online as the next logical step of digitization. (Take a look at the EY Insurance Industry report below.)

How insurance business goes online

As more data is not only stored but also collected in digital format, insurance companies can use this to their advantage. Specifically, ML-based text extraction can help insurers turn the volumes of information they store into useful data, extract it seamlessly, and improve operational efficiency.

Let’s list specific benefits of the use of ML-powered text extraction in the insurance industry:

  • Optimized document processing. Since ML algorithms extract critical information automatically, an insurer can expect enhanced operational efficiency. Unstructured documents are handled faster, which speeds up business processes and can cut costs by 40–60%, as mentioned in the Capgemini report.
  • Reduced manual data extraction. With ML, insurance agents no longer need to dig through claims, policies, contracts, and agreements by hand to find the information they need. Instead, they get the data seamlessly, in a format ready for further integration into the insurer's document management system.
  • Increased accuracy. An ML model doesn't look for words or phrases in isolation; it analyzes the surrounding context, which yields more accurate results. Because it scans the context, it can also identify synonyms and other related words: if you're looking for a "dog", an ML model will likely also flag words such as "pet" or "husky" (a short embedding-similarity sketch follows this list). Lastly, one more critical argument is the self-learning nature of ML: the more you use an ML data extraction solution, the more it "learns", and the better efficiency you can expect.
  • Improved customer experience. If an insurer can handle claims and underwriting faster and more accurately, customers become more loyal to the company. So you can think of ML-based data extraction as a business strategy to differentiate yourself in the market and boost your competitive advantage.
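As a rough illustration of the "related words" point above, here is a small sketch using off-the-shelf spaCy word vectors (the en_core_web_md model ships with static embeddings). The exact similarity scores depend on the model version; the only takeaway is that "dog" sits much closer to "pet" and "husky" than to an unrelated term.

```python
# Hedged illustration of context/synonym matching with pretrained word vectors.
import spacy

nlp = spacy.load("en_core_web_md")  # requires: python -m spacy download en_core_web_md

query = nlp("dog")
for candidate in ("pet", "husky", "invoice"):
    print(candidate, round(query.similarity(nlp(candidate)), 2))
# Expected pattern: "pet" and "husky" score noticeably higher than "invoice".
```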

Data extraction use cases in insurance

But how exactly can text extraction be useful to you in the insurance industry? Here we distinguish the two most prominent applications, both related to document processing in insurance.

1. Underwriting

Underwriters' task is to determine the level of risk associated with each specific contract. During the underwriting process, agents evaluate every piece of information the client submits, from financial status to health reports. This analysis usually takes a lot of time and effort, and as often happens, the most important information is buried under hundreds of pages of PDF files.

As discussed, an ML-based data extraction solution helps insurers unlock this valuable applicant data faster and more productively. ML reduces processing time for most standard cases, with underwriters extracting the information they need in a couple of minutes, which leaves professionals extra time to focus on more complex cases.

Why use data extraction in underwriting?

2. Claims processing

Claims processing is another area where data extraction is of great benefit to insurance companies. The procedure includes analyzing insurance claims and complaints to understand how accurate the provided information is, whether it's authentic, and whether the company should accept or reject the claim request. Here, insurance agents also filter claims by type, by the insurer's products or services, and by complexity. A critical part of the process is also checking the request for signs of fraud.

As with underwriting, claims analysis involves processing a vast amount of material, and this is where an ML approach to data extraction can be useful. An ML solution allows insurers to retrieve valuable information quickly and accurately, so the insurance agent can reach a conclusion about the claim sooner and estimate the expected costs more efficiently. As a result, this reduces processing time and operational errors in claims settlement.

Automated data extraction: What’s going on under the hood?

In the world of machine learning, extracting text from images, PDFs included, is known as the optical character recognition (OCR) task. This is how computers make sense of texts and make them machine-readable.

A common scenario for OCR is extracting data from PDFs. On the one hand, printed documents like PDFs are structured, which makes them easy to parse, and there are many tools developed specifically for this popular type of OCR task.

On the other hand, the very nature of the PDF format makes text extraction difficult: it was designed to share information between platforms easily while preserving both the content and the layout of the document, which is why PDFs are usually so hard to edit. How complicated the OCR task is also depends on the type of information needed. Is it just text? Or do position, fonts, etc. matter too? Everything is possible with machine learning, but every extra layer of information requires more expertise from your data scientists.
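For digitally generated PDFs, where the characters are embedded in the file rather than scanned as an image, the text and its positions can often be pulled out directly, before any OCR is needed. Below is a minimal sketch using pdfplumber, one of several libraries that can do this; "policy.pdf" is a placeholder file name.

```python
# Minimal sketch: reading text (and word positions) from a digitally generated PDF.
import pdfplumber

with pdfplumber.open("policy.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())  # plain text in reading order

    # If layout matters, each word also comes with its bounding box coordinates.
    for word in first_page.extract_words()[:5]:
        print(word["text"], word["x0"], word["top"])
```

Scanned PDFs are just images of pages, so this shortcut does not apply to them; they go through the OCR pipeline described next.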

Strategies for OCR

From the technology point of view, text extraction can be divided into two steps. First, your ML engineers need to detect text appearances in the image: an ML algorithm scans the document and isolates the areas where there is any text. One way of doing this is to draw boxes around any text the ML model identifies; a single word or a group of characters gets locked in a separate box.

Automated data extraction: Optical character recognition
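As a quick illustration of the "boxes around text" idea, the sketch below uses pytesseract, a Python wrapper around the Tesseract engine discussed later. Tesseract performs detection and recognition in one pass, but its word-level output exposes exactly the per-word bounding boxes described above; "claim_page.png" is a placeholder scan.

```python
# Minimal sketch: word-level bounding boxes from a scanned page.
import pytesseract
from PIL import Image
from pytesseract import Output

data = pytesseract.image_to_data(Image.open("claim_page.png"), output_type=Output.DICT)
for text, left, top, width, height in zip(
    data["text"], data["left"], data["top"], data["width"], data["height"]
):
    if text.strip():  # skip empty detections
        print(f"'{text}' at box ({left}, {top}, {width}, {height})")
```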

The next step for ML is to convert the text into a machine-understandable format. This means presenting the unstructured PDF text in a structured format so the agent can use it. Generally, we can distinguish between three main approaches here:

  1. Classic computer vision techniques: In this scenario, ML engineers apply filters so the characters stand out against the background, then use contour detection so the characters can be recognized one by one. Image classification is the last step, identifying each character (a minimal OpenCV sketch follows this list).
  2. Specialized deep learning approaches: As a special form of ML, deep learning is based on neural network architectures with many (deep) layers to be trained, so ML engineers don't need to hand-select features before training the algorithm(s). With specialized deep learning approaches for information extraction, we can speak of algorithms like EAST (Efficient and Accurate Scene Text detector) or CRNN (Convolutional Recurrent Neural Network).
The EAST (Efficient and Accurate Scene Text detector) algorithm

3. Standard deep learning: ML engineers can also choose a more standard deep learning detection approach, using algorithms like SSD (Single Shot Detector), YOLO (You Only Look Once), or Mask R-CNN (Mask Region-Based Convolutional Neural Network).
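Here is the minimal OpenCV sketch promised in point 1: thresholding makes the characters stand out from the background, and contour detection isolates candidate character regions. A full pipeline would then pass each cropped region to an image classifier, which is omitted here; "scanned_form.png" is a placeholder file.

```python
# Minimal sketch of the classic computer vision route: threshold, then contours.
import cv2

image = cv2.imread("scanned_form.png", cv2.IMREAD_GRAYSCALE)  # placeholder scan

# Otsu's thresholding separates dark ink from a light background.
_, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Each external contour is a candidate character or connected group of strokes
# (OpenCV 4.x returns contours and hierarchy).
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    if w > 2 and h > 8:  # drop speckle noise
        character_crop = binary[y:y + h, x:x + w]  # would be fed to a classifier
```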

Steps of data extraction development

Now let’s briefly outline how your insurance company should proceed with the development of an ML-powered data extraction solution. We can mention four major steps here:

  1. A good starting point is to determine the business goals and specific objectives you want to achieve with your data extraction solution. For instance, this could include details like which documents you'll retrieve information from or what kind of information it will be (text only, graphics, etc.). All this will affect the approach and the tools you use (we'll talk about them in a few minutes).
  2. Next, you will work with data, which is the backbone of your ML solution. An insurance company should have a good understanding of the data it has and the data it needs to strengthen the future solution. Research your data sources, explore their quality and quantity, and make an informed decision about data collection and the potential use of open datasets.
  3. During data preparation, data scientists transform the raw data so they can run it through ML algorithms to get important insights and make predictions. Data preparation includes data pipeline design, data processing, and transformation.
  4. Finally, data engineers can move on to building an ML model and training a neural network (see the minimal training sketch after this list). Both large and small datasets can work; what matters is having relevant data, since data quality directly impacts the efficiency of the future ML model. Monitoring the results of your solution is also a must for the project to be successful.
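The sketch below illustrates step 4 in a deliberately minimal way: a tiny PyTorch convolutional network trained to classify character crops such as those produced by the detection step. The class count, input size, and random stand-in data are placeholders; a real project would load labeled crops from the prepared dataset.

```python
# Deliberately minimal training sketch for a character-crop classifier.
import torch
import torch.nn as nn

NUM_CLASSES = 36  # e.g. 26 letters + 10 digits, purely illustrative

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, NUM_CLASSES),  # 32x32 input halved twice -> 8x8
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Stand-in batch: 64 grayscale 32x32 character crops with random labels.
images = torch.randn(64, 1, 32, 32)
labels = torch.randint(0, NUM_CLASSES, (64,))

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```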

At the end of the project, your insurance company should get a workable tool that extracts text from PDFs, preferably in meaningful blocks that the system then presents to the user on request. In a full-cycle ML project, the team can also develop a user-friendly interface so that employees with no technology background can use the ML solution easily.

ML-based data extraction tools

As mentioned, today's market offers multiple tools for data extraction from PDFs. The OCR task is complicated, but the availability of ready-made solutions that data engineers can use instead of building a model from scratch is obviously a big advantage. Below we review the three most popular document extraction tools commonly used to build an ML solution.

Amazon Textract

Amazon Textract is a deep learning-based service for automated data extraction from PDFs, and it also works well for handwriting and other types of scanned documents. Unlike a lot of OCR software that relies on manual configuration, this tool can read and process a PDF document effortlessly and extract the information accurately and quickly.

With this tool, an agent simply uploads, for example, a claims document and gets back all the text, tables, and forms in a more structured form. As with any ML tool, Amazon Textract keeps learning: the more data is fed into the system built on top of it, the more productive data extraction becomes for your company.
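For reference, calling Textract from Python takes only a few lines via boto3. The sketch below assumes AWS credentials and a region are already configured in the environment, and "claim.png" stands in for a real claims document.

```python
# Minimal sketch: synchronous Textract call on a single-page scan.
import boto3

textract = boto3.client("textract")

with open("claim.png", "rb") as document:
    response = textract.analyze_document(
        Document={"Bytes": document.read()},
        FeatureTypes=["TABLES", "FORMS"],  # ask for tables and key-value pairs
    )

# Textract returns a list of typed blocks (PAGE, LINE, WORD, TABLE, CELL, ...).
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])
```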

Why choose Amazon Textract for automated data extraction from PDFs

Tesseract OCR

Tesseract OCR is an open-source text recognition engine and one of the most widespread and well-regarded OCR libraries. Its latest major version, Tesseract 4.00, introduces line recognition built on a new neural network subsystem based on LSTM (Long Short-Term Memory), while still preserving the legacy Tesseract 3 engine, which works by recognizing character patterns.

Tesseract OCR Architecture

The primary advantage of this tool over other data extraction tools is its support for an extensive variety of languages, including Arabic and Hebrew. Another notable feature of Tesseract is its compatibility with many programming languages and frameworks.
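A minimal Tesseract call through the pytesseract wrapper might look like the sketch below. The `--oem 1` flag selects the LSTM engine introduced in Tesseract 4, and `lang` can combine any installed language packs (English plus Arabic here, as an example of the wide language support mentioned above); "scanned_policy.png" is a placeholder file.

```python
# Minimal sketch: Tesseract OCR via pytesseract with the LSTM engine.
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(
    Image.open("scanned_policy.png"),
    lang="eng+ara",            # requires the corresponding traineddata files
    config="--oem 1 --psm 6",  # LSTM engine, assume a uniform block of text
)
print(text)
```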

Why choose Tesseract OCR for data extraction?

Cloud Vision API

Delivered as a Google Cloud service, Cloud Vision API helps developers integrate vision detection features, including OCR. Like the two other tools discussed, Cloud Vision can detect and retrieve text from images, PDF files included.

The tool has two annotation features: text detection and document text detection, and for this task your data engineers will be interested in the latter. Document text detection is optimized for extracting dense text. In this context, density is associated with printed and handwritten documents, in contrast to sparse text written "in the wild", for example, graffiti on a wall.
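A minimal sketch of calling document text detection through the official google-cloud-vision client is shown below. It assumes application credentials are already set up (for example via GOOGLE_APPLICATION_CREDENTIALS), and "dense_report.png" is a placeholder image.

```python
# Minimal sketch: DOCUMENT_TEXT_DETECTION with the Cloud Vision client library.
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("dense_report.png", "rb") as image_file:
    image = vision.Image(content=image_file.read())

response = client.document_text_detection(image=image)
# full_text_annotation preserves the page/block/paragraph/word hierarchy.
print(response.full_text_annotation.text)
```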

Why choose Cloud Vision API for data extraction?

Wrap up

ML-based data extraction can be very useful in insurance, where companies collect volumes of data daily and via multiple channels. Since insurance is a business of information, insurers should strive for automation and look for ways to harness this information, on the one hand, and process it faster, on the other. With an ML-powered data extraction solution, insurers can retrieve information from PDFs seamlessly and quickly, which can significantly improve daily tasks in underwriting and claims processing.

However, developing such a tool requires expertise in machine learning and data science as well as deep industry knowledge. If your insurance company or insurtech cannot cover this expertise on its own, Intelliarts has a great team of ML professionals, and we'll be glad to help you.

Together we’ll make any manual data detection and processing a thing of the past for your insurance company.

Volodymyr Mudryi
Intelliarts AI

Data scientist at Intelliarts, who seeks to change and improve the world through machine learning and math.