DataElixir: A framework for cleansing and deduplicating unstructured data

Applications that leverage data for decision making address a wide range of business problems across sectors such as healthcare, insurance, finance and social media. In many scenarios, however, the available data is either scarce or not readily usable. The available data (especially text) is messy: full of inconsistent entities and varied in nature. In more practical scenarios it is also redundant: there are instances where text records look different but refer to the same thing. In this article, I describe DataElixir, an automated framework to clean, standardize and deduplicate unstructured data sets.

Consider a data set of organization names collected from a user survey, which contains several kinds of ambiguities such as the ones below (a small code sketch after the list shows how such records might be loaded):

N-grams: IBM corporation private, IBM comrporation limited, IBM limited
Spellings: IBM corporation, Ibm1 corparation, ibm pyt ltd
Keywords: IBM Co, IBM Company, ibm#, ibm@ com
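
To make the problem concrete, here is how such a messy sample could be loaded for processing. This is only an illustrative sketch; the pandas DataFrame and the column name "org_name" are my own choices, not part of DataElixir.

```python
import pandas as pd

# Hypothetical sample of messy organization names from a survey.
records = [
    "IBM corporation private", "IBM comrporation limited", "IBM limited",
    "IBM corporation", "Ibm1 corparation", "ibm pyt ltd",
    "IBM Co", "IBM Company", "ibm#", "ibm@ com",
]
df = pd.DataFrame({"org_name": records})
print(df.head())
```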

DataElixir is a framework to resolve these ambiguities. It is a complete package for performing large-scale data cleansing and deduplication quickly. The overall architecture of DataElixir is shown in the following image:

DataElixir — Architecture

User configurations control the overall workflow. Input text records are first passed through the Cleaner Block, which cleans them up. This block comprises three types of functions: entity removal (for example punctuation, stop words and slang), entity replacement (for example abbreviation fixes and custom lookups) and entity standardization (using patterns and regular expressions).
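
As a rough illustration of what the Cleaner Block does, the sketch below applies the three kinds of functions to a record using plain Python and regular expressions. The configuration dictionary and function names are hypothetical and only mirror the idea; they are not DataElixir's actual API.

```python
import re

# Hypothetical cleaner configuration; DataElixir's real config format may differ.
config = {
    "remove": [r"[^\w\s]", r"\b(the|of)\b"],               # entity removal: punctuation, stop words
    "replace": {r"\bpyt\b": "pvt", r"\bco\b": "company"},  # entity replacement: abbreviation fixes / lookup
    "standardize": [(r"\s+", " ")],                        # entity standardization: collapse whitespace
}

def clean(text: str, cfg: dict) -> str:
    text = text.lower()
    for pattern in cfg["remove"]:
        text = re.sub(pattern, " ", text)
    for pattern, repl in cfg["replace"].items():
        text = re.sub(pattern, repl, text)
    for pattern, repl in cfg["standardize"]:
        text = re.sub(pattern, repl, text)
    return text.strip()

print(clean("IBM pyt ltd.", config))   # -> "ibm pvt ltd"
```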

The cleansed records are then indexed by the Indexer Block. The main role of the Indexer is to speed up the processing (cleansing and deduplication) of text records. Since every record would otherwise have to be compared with every other record, the total number of comparisons grows as n*n. The Indexer Block uses Lucene for indexing and returns closely matched candidate pairs that can be retrieved with a quick lookup, so only those pairs need to be compared. In my experiments, the Indexer Block reduced the overall processing time to about one-third.
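
DataElixir relies on Lucene for this step; as a simplified stand-in, the sketch below builds a token-based inverted index in pure Python so that only records sharing at least one token become candidate pairs, instead of all n*n combinations. The function and variable names here are mine, for illustration only.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records):
    """Group records by shared tokens and return candidate pairs for matching."""
    index = defaultdict(set)            # token -> set of record ids
    for rid, text in enumerate(records):
        for token in text.split():
            index[token].add(rid)

    pairs = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs

records = ["ibm corporation", "ibm pvt ltd", "acme inc"]
print(candidate_pairs(records))         # {(0, 1)}: only records sharing "ibm" are compared
```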

The candidate pairs are then compared using the Matcher Block, which offers flexible text comparison techniques such as Levenshtein matching, Jaro-Winkler, n-gram matching and phonetic matching. The Matcher also provides a confidence score, which indicates how closely two records match.
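
The sketch below shows the idea behind the Matcher Block using two of the mentioned techniques, Levenshtein and n-gram matching, combined into a single confidence score. It is a minimal pure-Python illustration under my own assumptions; DataElixir's actual scoring and weighting are not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def ngrams(text: str, n: int = 2) -> set:
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def confidence(a: str, b: str) -> float:
    """Average of Levenshtein similarity and n-gram (Jaccard) similarity."""
    lev_sim = 1 - levenshtein(a, b) / max(len(a), len(b), 1)
    ga, gb = ngrams(a), ngrams(b)
    ngram_sim = len(ga & gb) / len(ga | gb) if ga | gb else 1.0
    return (lev_sim + ngram_sim) / 2

# A near-duplicate pair gets a high confidence score.
print(round(confidence("ibm corporation", "ibm corparation"), 2))
```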

For one of my academic research projects, the task was to perform machine translation of all the village names of Asia. I used pre-trained supervised learning models for this purpose. The results were decent, but there was a lot of redundancy in the translated names. So I created DataElixir to solve this problem. A data set of about one million rows of village names was cleansed and deduplicated in about 3 minutes.

Currently, the source code is not open-sourced; however, I have released a light version of DataElixir called dupandas, which follows the same architecture.

It is written in pure Python. Though it works in the same manner as DataElixir, there is some leftover work and clearly room for improvement. Feel free to check it out and share your feedback.