How to Use Duke for Deduplication of Big Data
I recently came across Duke, an amazing deduplication engine written in Java that finds matching records, either within a single file or between two files by linking their records. It indexes the records using Lucene and then compares candidate records using comparators (specified in an XML config file), giving each pair a confidence score. You can find more on how it works here.
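To make the idea of "comparators plus a confidence score" concrete, here is a minimal sketch in plain Java. It is not Duke's code: the class name, the linear mapping from similarity to probability, and the bounds low and high are my own assumptions for illustration. It scores each property with a Levenshtein-based similarity and then combines the per-property probabilities with a naive Bayes formula.

```java
public class ConfidenceSketch {
    // Classic dynamic-programming Levenshtein edit distance (two rolling rows).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; identical strings score 1.0.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) return 1.0;
        return 1.0 - (double) editDistance(a, b) / longest;
    }

    // Assumption: map similarity linearly onto [low, high].
    // low  = probability of a match when the values are completely different,
    // high = probability of a match when the values are identical.
    static double toProbability(double sim, double low, double high) {
        return low + sim * (high - low);
    }

    // Naive Bayes combination of two independent match probabilities.
    static double combine(double p1, double p2) {
        return (p1 * p2) / (p1 * p2 + (1 - p1) * (1 - p2));
    }

    public static void main(String[] args) {
        // Two hypothetical records with a name and a city property.
        double name = toProbability(similarity("John Smith", "Jon Smith"), 0.1, 0.9);
        double city = toProbability(similarity("Oslo", "Oslo"), 0.3, 0.7);
        System.out.println("name probability: " + name);
        System.out.println("city probability: " + city);
        System.out.println("combined confidence: " + combine(name, city));
    }
}
```

A probability of 0.5 is neutral in the combination step, values above it push the confidence up, and values below it pull it down. That is why a weak property (like city here) gets bounds close to 0.5 while a strong one (like name) gets bounds far from it.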
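To set it up on your machine, step 1 is to write a controller file that calls Duke to perform the record matching. The controller loads the XML configuration, creates a processor from it, registers a listener to receive the matches, and starts the run.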
You can set the number of threads that Duke runs with by calling proc.setThreads(), where proc is the processor object created in the controller.
Next, specify the comparators in an XML config file. Each property (read: column) is compared using its comparator, and the records are given a confidence score based on how similar their properties are.
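As a rough illustration of what such a config can look like, the sketch below holds a small sample as a Java string and checks that it is well-formed XML using only the standard library. The element names follow Duke's config format as I understand it, so check them against the documentation for your Duke version. The file names, column names, threshold, and low/high values are made up for this example, and the class SampleDukeConfig is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class SampleDukeConfig {
    // Sample config: one schema, then one <group> per input file.
    // All file names, column names and numbers are illustrative assumptions.
    static final String CONFIG = String.join("\n",
        "<duke>",
        "  <schema>",
        "    <threshold>0.8</threshold>",
        "    <property type=\"id\"><name>ID</name></property>",
        "    <property>",
        "      <name>NAME</name>",
        "      <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>",
        "      <low>0.1</low>",
        "      <high>0.9</high>",
        "    </property>",
        "  </schema>",
        "  <group>",
        "    <csv>",
        "      <param name=\"input-file\" value=\"first.csv\"/>",
        "      <column name=\"id\" property=\"ID\"/>",
        "      <column name=\"name\" property=\"NAME\"/>",
        "    </csv>",
        "  </group>",
        "  <group>",
        "    <csv>",
        "      <param name=\"input-file\" value=\"second.csv\"/>",
        "      <column name=\"id\" property=\"ID\"/>",
        "      <column name=\"name\" property=\"NAME\"/>",
        "    </csv>",
        "  </group>",
        "</duke>");

    // Parses the sample and counts elements with the given tag name.
    // Throws a RuntimeException if the XML is not well-formed.
    static int countElements(String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(CONFIG.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName(tagName).getLength();
        } catch (Exception e) {
            throw new RuntimeException("Sample config is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(CONFIG);
        System.out.println("properties: " + countElements("property"));
        System.out.println("groups: " + countElements("group"));
    }
}
```

In practice you would save the XML itself to a file and point Duke at that file. The Java wrapper is only there so the sample can be checked for well-formedness before use.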
Now initialize the Duke engine and provide the file path to the XML in the configuration. Duke works in two modes:
- Record Linkage mode
- Deduplication mode
You can specify the mode in your main file by calling link() on the processor for record linkage, or deduplicate() for deduplication.
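The difference between the two modes comes down to which pairs of records get compared. The sketch below shows that with a brute-force pairing in plain Java. It is only an illustration with hypothetical names (ModeSketch and its methods), and Duke itself uses its index to look up candidate records instead of enumerating every pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ModeSketch {
    // Deduplication mode: every record is compared with every other record
    // in the SAME data set, so duplicates inside one file are found.
    static List<String[]> deduplicationPairs(List<String> records) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j < records.size(); j++) {
                pairs.add(new String[] {records.get(i), records.get(j)});
            }
        }
        return pairs;
    }

    // Record linkage mode: records from the first group are compared only
    // with records from the second group, never with their own group.
    static List<String[]> linkagePairs(List<String> first, List<String> second) {
        List<String[]> pairs = new ArrayList<>();
        for (String a : first) {
            for (String b : second) {
                pairs.add(new String[] {a, b});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> fileA = Arrays.asList("a1", "a2", "a3");
        List<String> fileB = Arrays.asList("b1", "b2");
        System.out.println("deduplication pairs in file A: "
            + deduplicationPairs(fileA).size());
        System.out.println("linkage pairs between A and B: "
            + linkagePairs(fileA, fileB).size());
    }
}
```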
The data sources for the engine have to be specified in the XML file (near the end). If two files are to be matched, each file's data source has to be placed inside its own <group> element.