Structured data acquisition, cleaning, normalization, and matching

The Internet (which, remember, came before AI) contains a massive amount of structured and unstructured data. Google indexes the unstructured data and makes it searchable. Problem solved. Structured data on web pages, however, is generated from offline databases and, in the case of stores, sometimes from corporate databases. Data that was nicely organized in those offline databases is displayed in HTML pages. Distinct fields are merged together. Optional data appears in some pages but not in others. Data that was incorrectly entered, transformed, or loaded into the offline database appears in web pages. Fields are omitted from web pages. The end result is that web pages contain data records that are incomplete, that often have undocumented or partially documented schemas, that contain errors, that appear as visible and invisible data field names (DFNs) and data field values (DFVs), and that are distributed throughout a hierarchical HTML structure. Reverse engineering online product databases, merging the same product data records from different sites, and making the product data searchable is an unsolved problem. No company has demonstrated at scale that it has Quality Product Data.

A store web site is normally created from a template and a database. The template describes the HTML structure, the use of CSS and JavaScript, and the variables that inject database values into the web page at specific locations. The template is the key to understanding the data record in the web page: it represents the schema that contains that record. The template may contain visible and hidden data fields. The template may contain standalone DFVs and/or DFN/DFV pairs.
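To make the template idea concrete, here is a minimal sketch that renders one database record through a page template. The template markup, the field names, and the sample record are all hypothetical, and real store platforms use their own template engines; Python's standard string.Template stands in for them here.

```python
from string import Template

# Hypothetical page template. The $name variables mark where database
# values are injected into the HTML structure.
PAGE_TEMPLATE = Template("""\
<div class="product">
  <h1>$title</h1>
  <table>
    <tr><th>Brand</th><td>$brand</td></tr>
    <tr><th>Color</th><td>$color</td></tr>
  </table>
  <span class="price">$price</span>
  <input type="hidden" name="sku" value="$sku">
</div>
""")

# Hypothetical database record for one product.
RECORD = {
    "title": "Trail Runner 2",
    "brand": "Acme",
    "color": "Blue",
    "price": "$59.99",
    "sku": "AC-1001",
}


def render(record: dict) -> str:
    """Inject the record's values into the template to produce a page."""
    return PAGE_TEMPLATE.substitute(record)


if __name__ == "__main__":
    print(render(RECORD))
```

In the rendered page, the title and price are standalone DFVs (no field name is shown), Brand and Color are DFN/DFV pairs, and the SKU is a hidden field. The database column names themselves never reach the page, which is part of why the schema has to be reverse engineered.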

Scraping data is typically done in one of two ways: by manually writing a program to extract data from the pages of a site that use a specific template, or by manually writing a description of the template and feeding that description to an extraction program.
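A hand-written extractor of the first kind might look like the sketch below. It is tied to one hypothetical template (an h1 title, a table of DFN/DFV rows, a price span, and hidden input fields), and the sample page and class names are assumptions made for illustration. It uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Hypothetical single-record product page built from one template.
SAMPLE_HTML = """\
<div class="product">
  <h1>Trail Runner 2</h1>
  <table>
    <tr><th>Brand</th><td>Acme</td></tr>
    <tr><th>Color</th><td>Blue</td></tr>
  </table>
  <span class="price">$59.99</span>
  <input type="hidden" name="sku" value="AC-1001">
</div>
"""


class ProductExtractor(HTMLParser):
    """Extractor written by hand for the one template shown above."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._context = None       # which kind of element's text we are in
        self._pending_name = None  # last DFN seen, waiting for its DFV

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden":
            # Hidden field: name and value both live in attributes.
            if attrs.get("name"):
                self.record[attrs["name"]] = attrs.get("value")
        elif tag in ("h1", "th", "td"):
            self._context = tag
        elif tag == "span" and attrs.get("class") == "price":
            self._context = "price"

    def handle_endtag(self, tag):
        self._context = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._context is None:
            return
        if self._context == "h1":
            self.record["title"] = text      # standalone DFV
        elif self._context == "price":
            self.record["price"] = text      # standalone DFV
        elif self._context == "th":
            self._pending_name = text        # DFN
        elif self._context == "td" and self._pending_name:
            self.record[self._pending_name] = text  # DFV paired with its DFN
            self._pending_name = None


def extract(html: str) -> dict:
    parser = ProductExtractor()
    parser.feed(html)
    parser.close()
    return parser.record


if __name__ == "__main__":
    print(extract(SAMPLE_HTML))
```

Every rule in this class encodes knowledge of a single template, so a site redesign or a second site means writing it again. That cost is what motivates generating extractors instead of writing them by hand.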

Automatically or semi-automatically generating extractors for multi-record web pages has been done in the past with mixed results. Researchers and engineers generally think that generating extractors for single-record pages is a bridge too far.

This patent describes an outline for a method to generate extractors for single-record web pages.

https://www.google.com/patents/US8190556

Data Record Science (http://www.datarecordscience.com) has a new semi-automatic method for generating extractors for multi-record and single-record pages. Two different types of extractors are generated. The results of an extraction by the two different types of extractors can be compared to determine the correctness of the extracted data.
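The comparison step can be illustrated with a small sketch. This is not the Data Record Science method, which is not described here. It only assumes that each extractor returns a dictionary of field names to string values for the same page, and the function name and report layout are hypothetical.

```python
def compare_extractions(a: dict, b: dict) -> dict:
    """Compare the records two independent extractors produced for one page.

    Fields on which both extractors agree are more likely to be correct;
    conflicts and fields found by only one extractor are flagged for review.
    """
    report = {"agree": {}, "conflict": {}, "only_in_one": {}}
    for field in sorted(set(a) | set(b)):
        if field in a and field in b:
            left, right = a[field].strip(), b[field].strip()
            if left == right:
                report["agree"][field] = left
            else:
                report["conflict"][field] = (left, right)
        else:
            # Present in exactly one of the two records.
            report["only_in_one"][field] = a.get(field, b.get(field))
    return report


if __name__ == "__main__":
    # Hypothetical outputs of two extractors run on the same page.
    first = {"title": "Trail Runner 2", "price": "$59.99", "brand": "Acme"}
    second = {"title": "Trail Runner 2 ", "price": "$59.90"}
    print(compare_extractions(first, second))
```

The value of such a check depends on the two extractors failing in different ways. If they shared the same rules, they would agree on the same mistakes.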

Once the product data is extracted, it is processed by an advanced product data record processing pipeline that cleans, normalizes, and matches/groups the same product from different sites, including variants, to make Quality Product Data at scale.
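The three stages can be sketched in a few lines. This is a toy illustration under strong assumptions, not the pipeline described above: the field names (brand, model, price), the US-style price format, and the exact-match grouping key are all hypothetical, and variant handling is left out.

```python
import re
from collections import defaultdict
from decimal import Decimal


def clean(record: dict) -> dict:
    """Lowercase field names, collapse whitespace, drop empty values."""
    out = {}
    for name, value in record.items():
        value = re.sub(r"\s+", " ", str(value)).strip()
        if value:
            out[name.strip().lower()] = value
    return out


def normalize(record: dict) -> dict:
    """Put comparable fields into one canonical form."""
    rec = dict(record)
    if "brand" in rec:
        rec["brand"] = rec["brand"].casefold()
    if "model" in rec:
        # "TR-2" and "tr 2" should compare equal.
        rec["model"] = re.sub(r"[^a-z0-9]", "", rec["model"].casefold())
    if "price" in rec:
        digits = re.sub(r"[^\d.]", "", rec["price"])  # assumes "1,059.99" style
        if re.fullmatch(r"\d+(\.\d+)?", digits):
            rec["price"] = Decimal(digits)
        else:
            del rec["price"]  # unparseable price: drop rather than guess
    return rec


def group_products(records: list) -> dict:
    """Group records from different sites that describe the same product."""
    groups = defaultdict(list)
    for raw in records:
        rec = normalize(clean(raw))
        groups[(rec.get("brand"), rec.get("model"))].append(rec)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical records extracted from two different sites.
    scraped = [
        {"Brand": " ACME ", "Model": "TR-2", "Price": "$59.99", "site": "a.example"},
        {"brand": "Acme", "model": "tr 2", "price": "59.90 USD", "site": "b.example"},
        {"brand": "Acme", "model": "TR-3", "price": "$79.00", "site": "a.example"},
    ]
    for key, members in group_products(scraped).items():
        print(key, [m["site"] for m in members])
```

Exact matching on a normalized brand and model number only works when both fields are present and correct. Records with missing or wrong fields need fuzzier matching, which is where most of the real difficulty lies.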