Automating Spec Sheet Data for Manufacturing with NLP & OCR

Published in Sciforce · 9 min read · Aug 5, 2024

Our solution uses advanced Natural Language Processing (NLP) and Machine Learning (ML) to automate the extraction of detailed specifications and measurements from complex product spec sheets.

Client Profile

The client is a company in the manufacturing industry that provides advanced technology solutions to improve and simplify production processes. They needed a sophisticated system to manage various types of documents, such as spec sheets, technical manuals, and product datasheets. The client’s request was as follows:

Data Management System: The client required a robust system capable of efficiently handling and organizing various forms of documentation.
Integrating Advanced Technologies: They sought to leverage Natural Language Processing (NLP) and Machine Learning (ML) to automate data extraction and enhance accuracy.
Consistency Across Formats: The system must process and extract data consistently from documents in different formats and structures.
Error Reduction: The client aimed to reduce manual data entry errors and ensure the reliability of extracted data.

Challenge

Our task was to automatically extract data fields from spec sheets that presented several complexities:

1. Multiple Model Numbers with Shared Details

Each spec sheet often contained information about multiple product models. These models shared certain details, while other details varied, typically presented in a table format. This required our system to accurately distinguish and associate the correct details with each model.

2. Heterogeneous Data from OCR

The spec sheets were processed using Optical Character Recognition (OCR) services, which, while useful, introduced inconsistencies. OCR-generated data often included errors in text recognition and formatting issues. The variability in data quality made it difficult to apply a uniform extraction approach.

3. Semi-Structured Document Formats

The spec sheets were semi-structured, meaning they contained both structured elements (like tables) and unstructured text. The tables detailing differences between models were not always clearly defined or consistently formatted, further complicating the extraction process.

4. Complex Table Structures

The tables in the spec sheets varied in complexity, with some including nested information, merged cells, or irregular layouts. These variations made it challenging to parse and extract relevant data accurately.

5. Manual Data Entry and Verification

While some clients provided manually filled data to guide the extraction process, the variability in manual entries and the need for precise matching with the spec sheets added another layer of complexity.

6. Field and Unit Identification

Identifying and extracting specific fields and their corresponding units of measurement required sophisticated techniques to ensure accuracy, especially when dealing with non-standardized terminology and formats.

Solution

To address the challenge of extracting data fields from semi-structured spec sheets, we developed a dual-approach solution tailored to the client’s needs. This involved creating two distinct pipelines to handle different scenarios of data availability:

1. Manual Data Matching

For scenarios where the client provides specific texts and categories that need to be extracted, we developed a detailed script designed to accurately match manually filled text with corresponding locations in the spec sheet. The process includes:

Precise Matching: The script takes a PDF document and manually filled data as input. It matches the manually entered specification parameters, values, and model names with their corresponding locations in the spec sheet.
Output Details: The script outputs bounding boxes and page numbers for each manually filled entry. This ensures that each piece of data is precisely located within the document, allowing for easy verification and further processing.

2. Automatic Spec Category Filling

For scenarios where the client provides only the expected spec categories, we developed a comprehensive script to automatically populate the values and units of measurement for each spec category. The process involves:

Automated Filling: The script takes a PDF document and a list of expected spec categories as input. It automatically identifies and fills in the relevant spec values and units of measurement.
Output Details: The script outputs bounding boxes, page numbers, spec values, and units of measurement for each spec category. If the script’s confidence level is low, it outputs several candidates for human review to ensure accuracy.
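To make the output described above more concrete, here is a minimal sketch of what one result record could look like. The class names, the `min_confidence` threshold, and the `max_review` limit are hypothetical and chosen only for illustration; the article does not specify the actual output schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (x0, y0, x1, y1) in page coordinates; the coordinate convention is an assumption.
BoundingBox = Tuple[float, float, float, float]


@dataclass
class SpecCandidate:
    value: str
    unit: Optional[str]
    page: int
    bbox: BoundingBox
    confidence: float


@dataclass
class SpecResult:
    category: str
    best: Optional[SpecCandidate]
    review_candidates: List[SpecCandidate] = field(default_factory=list)


def build_result(category, candidates, min_confidence=0.8, max_review=3):
    """Return one confident answer, or several candidates for human review."""
    ranked = sorted(candidates, key=lambda c: c.confidence, reverse=True)
    if ranked and ranked[0].confidence >= min_confidence:
        return SpecResult(category, best=ranked[0])
    # Low confidence: no single answer, hand the top candidates to a reviewer.
    return SpecResult(category, best=None, review_candidates=ranked[:max_review])
```

Keeping the confident answer and the review candidates in separate fields lets downstream code route uncertain categories to a person without re-ranking.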

Features

1. Text Extraction

Text extraction begins with custom OCR, which extracts text along with bounding boxes from scanned documents and converts the scanned images into a machine-readable format. For text-based documents, a PDF reader directly reads and extracts text and bounding boxes, ensuring precise text location mapping.

2. Candidate Matching

Candidate matching identifies potential matches between manually filled data points, such as specification parameter names, values, and model names, and the text extracted by OCR. Algorithms such as Levenshtein distance, Jaccard similarity, and cosine similarity measure how similar the manually filled data is to the OCR-extracted text, and candidates are ranked by similarity score to select the best matches.

3. Clustering

This is a crucial step where numbers, units of measurement, names of categories, and their synonyms are extracted from the OCR output. The OCR output is then post-processed to create an index of words and their corresponding bounding boxes.

Clustering algorithms such as Chinese Whispers construct a graph in which nodes represent words and edges carry the distances between their bounding boxes, so that words are clustered by proximity. Alternatively, a predefined pixel range is assigned around each word, and the words with the most neighbors inside that range are taken as cluster centers. Small or irrelevant clusters are eliminated to focus on meaningful groups related to spec categories.
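The graph-based variant can be sketched as follows. This is a minimal, standard-library illustration of Chinese Whispers label propagation, not the production code. The inverse-distance edge weight, the `max_dist` cut-off, and the use of box centers are assumptions made for the example.

```python
import random
from collections import defaultdict
from math import hypot


def box_center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def build_graph(boxes, max_dist):
    """Connect words whose box centers are closer than max_dist.

    Edge weight grows as the distance shrinks (assumed weighting).
    """
    centers = [box_center(b) for b in boxes]
    edges = defaultdict(dict)
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            d = hypot(centers[i][0] - centers[j][0], centers[i][1] - centers[j][1])
            if d < max_dist:
                weight = 1.0 / (1.0 + d)
                edges[i][j] = weight
                edges[j][i] = weight
    return edges


def chinese_whispers(n_nodes, edges, iterations=20, seed=0):
    """Each node repeatedly adopts the label with the highest total edge weight among its neighbors."""
    rng = random.Random(seed)
    labels = list(range(n_nodes))  # every word starts in its own cluster
    nodes = list(range(n_nodes))
    for _ in range(iterations):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            votes = defaultdict(float)
            for neighbor, weight in edges.get(node, {}).items():
                votes[labels[neighbor]] += weight
            if not votes:
                continue  # isolated word keeps its own label
            best = max(votes, key=votes.get)
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels
```

Words that end up with the same label form one cluster. Isolated words keep a unique label and can be dropped later as small clusters.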

4. Periodicity Exploitation

Periodicity exploitation looks for repeating structures in the document layout. By calculating the distances between clusters and finding pairs with similar relative distances, the system identifies the repeating pattern and groups clusters accordingly, so that each piece of data is associated with the correct product.

5. Spec Extraction

Here we identify clusters that are likely to contain the necessary specs and extract the relevant information. The extraction methods are adjusted to handle complex table structures within the documents, and text patterns like number x number x number are utilized to identify and extract product dimensions.

6. Anomaly Detection

Anomaly detection extracts the important words in each document to improve clustering, and builds distributions of spec values so that outliers can be detected. Comparing extracted product specs across documents in this way improves the overall quality and reliability of the data.
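As an illustration of checking a value against the distribution of a spec category, the sketch below flags outliers with a modified z-score based on the median absolute deviation. The article does not name the statistic used, so the choice of method and the `threshold` of 3.5 are assumptions.

```python
from statistics import median


def flag_anomalies(values, threshold=3.5):
    """Return the indices of values that sit far from the bulk of the distribution."""
    if len(values) < 3:
        return []  # too few samples to say anything about the distribution
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All typical values are identical; anything different stands out.
        return [i for i, v in enumerate(values) if v != med]
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Weights (kg) of similar products across documents; 120.0 looks like a
# dropped decimal point from OCR.
weights = [12.1, 12.4, 11.9, 12.0, 120.0, 12.3]
suspicious = flag_anomalies(weights)
```

A median-based score is used here because a single badly recognized value would distort a mean and standard deviation computed over a small sample.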

Development process

Initial Stage

1. Requirement Analysis

Collaborated with the client to understand the specific requirements, including the types of documents, the data to be extracted, and the expected output formats.

2. Design

Created the architecture for two distinct approaches to handle the different scenarios of data availability: Plan A (manual data matching) and Plan B (automatic spec category filling).

Intermediate Stage

1. Custom OCR Development

Developed a custom OCR solution tailored to the specific needs of scanned documents, ensuring accurate text extraction with bounding boxes.

2. Algorithm Development

Designed and implemented algorithms for text extraction, candidate selection, and ranking for Plan A, and for location finding, clustering, and spec extraction for Plan B.

3. Pipeline Integration

Integrated the OCR and algorithms into a cohesive pipeline that could process the spec sheets end-to-end.

Final Stage

1. Testing and Validation

Conducted extensive testing on a variety of spec sheets to validate the accuracy and reliability of both approaches. Adjusted algorithms and refined processes based on feedback and observed performance.

2. Optimization

Optimized the system for performance and scalability, ensuring it could handle large volumes of documents efficiently.

3. Deployment

Deployed the solution on the client’s infrastructure, providing training and support to ensure smooth adoption and integration into their existing workflows.

How it works

Scenario A

1. Text Extraction

For scanned documents, we use custom OCR to extract text along with its bounding boxes. For text-based documents, a PDF reader is used to read and extract text and bounding boxes directly.

2. Candidate Selection

We identify potential matches between manually filled data points, such as specification parameter names, values, and model names, and the extracted text. Each manually filled data point is then associated with the bounding box of its matching text.

3. Candidate Ranking

The candidate ranking process uses the Levenshtein distance algorithm to measure how closely the manually filled data matches the OCR-extracted text. To ensure the most accurate matches, additional ranking methods, such as Jaccard similarity and cosine similarity, are also applied.

Finally, we rank the identified candidates based on their similarity scores and select the best possible match for each manually filled entry.
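The ranking step can be sketched with the three measures named above. This is a simplified, standard-library illustration. Computing Jaccard similarity on word tokens, cosine similarity on character bigrams, and combining the three scores with an unweighted average are assumptions of this example, not details confirmed by the project.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def bigrams(s: str) -> Counter:
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def cosine_similarity(a: str, b: str) -> float:
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def rank_candidates(query: str, candidates):
    """Return (candidate, score) pairs, best match first."""
    q = query.lower().strip()
    scored = []
    for text in candidates:
        t = text.lower().strip()
        score = (levenshtein_similarity(q, t)
                 + jaccard_similarity(q, t)
                 + cosine_similarity(q, t)) / 3
        scored.append((text, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_candidates("Rated Voltage", ["Net Weight", "Rated Current", "Rated Vo1tage"])
```

Character-level measures tolerate OCR slips such as `l` read as `1`, while the token-level measure rewards candidates that share whole words with the query, which is one reason to combine them.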

Scenario B

1. Text Extraction

The same as in Scenario A: custom OCR for extracting the information from scanned documents or a PDF reader to read the document directly.

2. Location Finding and Clustering

Numbers, units of measurement, names of categories, and their synonyms are extracted from the OCR output. This output is post-processed to create an index of words and bounding boxes. WordNet is used to find related terms. Clustering algorithms, like Chinese Whispers or predefined pixel ranges, group related data points.

Chinese Whispers constructs a graph to cluster words based on proximity, while the predefined pixel range method identifies clusters by assigning a range around each word. Small or irrelevant clusters are eliminated to focus on meaningful groups related to the spec categories.
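The pixel-range method can be sketched as a greedy neighbor count. The `radius` and `min_size` values, the use of box centers, and the greedy assignment order are hypothetical choices for this illustration.

```python
from math import hypot


def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def pixel_range_clusters(words, radius=60.0, min_size=2):
    """Cluster words by a fixed pixel range around each word.

    words: list of (text, (x0, y0, x1, y1)).
    Returns a list of clusters, each a sorted list of word indices.
    """
    centers = [center(box) for _, box in words]
    neighbors = []
    for i, (xi, yi) in enumerate(centers):
        neighbors.append([
            j for j, (xj, yj) in enumerate(centers)
            if j != i and hypot(xi - xj, yi - yj) <= radius
        ])
    # Words with the most neighbors are tried first as cluster centers.
    order = sorted(range(len(words)), key=lambda i: len(neighbors[i]), reverse=True)
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        members = [i] + [j for j in neighbors[i] if j not in assigned]
        assigned.update(members)
        clusters.append(sorted(members))
    # Drop clusters that are too small to describe a spec.
    return [c for c in clusters if len(c) >= min_size]


words = [
    ("Weight", (0, 0, 40, 10)),
    ("12", (50, 0, 60, 10)),
    ("kg", (65, 0, 80, 10)),
    ("Page", (500, 500, 530, 510)),
]
clusters = pixel_range_clusters(words)
```

In this example the three words on the same line fall into one cluster, and the distant page label forms a single-word cluster that is filtered out.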

3. Cluster Association

We start by exploiting the periodicity of the document structure to find a period, measured either in lines of text or in pixels.

We collect distances between clusters by searching for pairs of element pairs with similar relative distances (e.g. for elements A, B, C, and D with distances A-B similar to distances C-D). Then we extract a period by averaging distance A-C across all pairs of pairs. One period contains data about one product.

Clusters can then be grouped by the obtained period and returned as results. To improve the grouping, we exploit periodicity in the types of words (numbers, units of measurement, or spec categories) in the same way.
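A one-dimensional sketch of the period search follows, using the vertical positions of clusters. For each pair of neighboring elements A and B, it looks for the nearest later pair C and D with a similar distance and records the offset from A to C; the period is the average of those offsets. Restricting the search to neighboring elements and to the nearest repeating pair is an assumption of this sketch, made so that multiples of the period do not skew the average. The `tolerance` value is also hypothetical.

```python
from statistics import mean


def estimate_period(positions, tolerance=3.0):
    """Estimate the repeat distance from cluster positions (lines or pixels)."""
    ys = sorted(positions)
    # (start of pair A, distance A-B) for each pair of neighboring elements.
    gaps = [(ys[i], ys[i + 1] - ys[i]) for i in range(len(ys) - 1)]
    offsets = []
    for i, (start_a, gap_ab) in enumerate(gaps):
        for start_c, gap_cd in gaps[i + 1:]:
            if abs(gap_ab - gap_cd) <= tolerance:
                offsets.append(start_c - start_a)  # distance A-C
                break  # keep only the nearest repeating pair
    return mean(offsets) if offsets else None


def group_by_period(positions, period, tolerance=3.0):
    """Start a new group whenever roughly one full period has passed."""
    groups = []
    for y in sorted(positions):
        if groups and y - groups[-1][0] < period - tolerance:
            groups[-1].append(y)
        else:
            groups.append([y])
    return groups


positions = [100, 130, 200, 230, 300, 330]
period = estimate_period(positions)
products = group_by_period(positions, period)
```

Each resulting group holds the clusters that belong to one product, matching the idea that one period contains data about one product.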

4. Spec Extraction

As a first step, we search for a number, a unit of measurement, or the spec category within the same cluster or its neighbor clusters. Where possible, we detect whether the cluster is a table and switch to a table-specific extraction algorithm.
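The search within a cluster and its neighbors might look like the sketch below. The synonym table, the unit list, and both function names are hypothetical placeholders; a real system would derive the synonyms from WordNet and client data rather than a hard-coded dictionary.

```python
import re

# Hypothetical lookup tables, for illustration only.
CATEGORY_SYNONYMS = {
    "weight": {"weight", "net weight", "mass"},
    "voltage": {"voltage", "rated voltage", "input voltage"},
}
UNITS = {"kg", "g", "lb", "v", "kv", "mm", "cm", "w", "kw", "hz"}
VALUE_WITH_OPTIONAL_UNIT = re.compile(r"^(\d+(?:\.\d+)?)\s*([A-Za-z]+)?$")


def extract_spec(category, cluster_tokens):
    """Return (value, unit) if the cluster mentions the category and holds a number."""
    synonyms = CATEGORY_SYNONYMS.get(category, {category})
    text = " ".join(cluster_tokens).lower()
    if not any(s in text for s in synonyms):
        return None
    value, unit = None, None
    for token in cluster_tokens:
        m = VALUE_WITH_OPTIONAL_UNIT.match(token.strip())
        if m and value is None:
            value = float(m.group(1))
            # Handles tokens such as "230V" where value and unit are fused.
            if m.group(2) and m.group(2).lower() in UNITS:
                unit = m.group(2)
        elif value is not None and unit is None and token.lower() in UNITS:
            unit = token
    return (value, unit) if value is not None else None


def extract_with_neighbors(category, cluster_tokens, neighbor_clusters):
    """Try the cluster alone, then the cluster combined with each neighbor."""
    found = extract_spec(category, cluster_tokens)
    if found:
        return found
    for neighbor in neighbor_clusters:
        found = extract_spec(category, cluster_tokens + neighbor)
        if found:
            return found
    return None
```

For example, `extract_with_neighbors("weight", ["Weight"], [["3.2", "kg"]])` falls back to the neighbor cluster because the label cluster contains no number.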

To enhance the extraction quality, we compare the product specs to those of other products in the document. We also rely on recurring text patterns, such as number x number x number, which commonly denotes product dimensions.
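The number x number x number pattern maps naturally onto a regular expression. The sketch below is one possible form. The accepted separators, the short unit list, and the treatment of a comma as a decimal separator are assumptions for illustration.

```python
import re

NUMBER = r"\d+(?:[.,]\d+)?"
DIMENSIONS = re.compile(
    rf"({NUMBER})\s*[x×X*]\s*({NUMBER})\s*[x×X*]\s*({NUMBER})"
    # Optional unit; the lookahead stops "m" from matching the start of a word like "max".
    r"(?:\s*(mm|cm|m|in)(?![A-Za-z]))?"
)


def extract_dimensions(text):
    """Find every 'number x number x number' triple, with its unit if present."""
    results = []
    for m in DIMENSIONS.finditer(text):
        # Assumes a comma is a decimal separator, not a thousands separator.
        values = tuple(float(g.replace(",", ".")) for g in m.group(1, 2, 3))
        results.append({"values": values, "unit": m.group(4), "span": m.span()})
    return results


found = extract_dimensions("Dimensions (W x D x H): 120 x 45.5 x 30 mm")
```

Returning the character span alongside the values makes it possible to map each match back to the bounding boxes of the words it came from.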

Result

1. Enhanced Efficiency:

The automated system reduced the time required to process spec sheets by 70%, allowing for faster data availability and decision-making. The need for manual data entry was significantly diminished, freeing up human resources for more strategic tasks and increasing overall productivity by 50%.

2. Improved Accuracy:

The custom OCR and advanced text extraction techniques achieved an accuracy rate of 95%, significantly reducing errors associated with manual data entry. The implementation of candidate ranking algorithms, including Levenshtein distance, Jaccard similarity, and cosine similarity, ensured the most accurate matches for manually filled data, reducing data entry errors by 80%.

3. Scalability:

The solution is capable of processing up to 10,000 spec sheets per month without compromising performance, meeting the growing data processing needs of the client.

4. Cost Savings:

By automating the data extraction process, the client experienced a 60% reduction in labor costs and a 50% reduction in error rectification costs. The efficient processing pipeline decreased operational costs associated with delayed data processing by 40%.

5. Client-Specific Benefits:

The client reported a 70% reduction in manual data entry errors and a 60% faster processing time. The ability to handle heterogeneous and semi-structured data formats expanded the client’s capability to manage diverse document types, enhancing their overall data management strategy.