Blockchain-based deduplication: Towards a standardized data management practice

Samer Haffar
Published in Frontier Tech Hub
6 min read · Mar 23, 2023

This is the third article I’m writing about our findings and learnings from piloting a blockchain-based technology to address the beneficiary deduplication problem in Nigeria. The purpose of the pilot is to test whether the technology can detect duplicate beneficiaries in Nigeria. It was previously piloted in Syria, where it proved effective and efficient at detecting duplicate beneficiaries, so the aim is to build on the success of the Syrian pilot and deploy the same technology in Nigeria while taking into account the differences between the Syrian and Nigerian contexts.

The first sprint of the pilot was all about raising awareness of the problem, announcing the pilot, gathering momentum, and encouraging humanitarian actors to participate. In the second sprint, we engaged with humanitarian agencies and tackled our assumptions about what would motivate them to use the technology, namely their data collection and management workflows, the use of biometrics, and compliance with data protection laws and regulations. In this sprint, we wanted to onboard humanitarian agencies onto the system so that they could start running deduplication checks. We also wanted to work with the agencies to arrive at a shared standard for data collection, which the technology needs in order to work (please see “How the system works” below).

How the system works

The GeniusChain system uses blockchain technology to provide a platform where the stakeholders involved in humanitarian work can collaborate and coordinate to make the entire humanitarian process more efficient and effective. One application of GeniusChain is the beneficiary deduplication problem, a challenging issue for humanitarian agencies. GeniusChain allows duplication checks to be conducted in real time while data is being collected; agencies can also check their beneficiaries in bulk after completing data collection and processing. Importantly, personal information is kept safe and secure. The steps for using GeniusChain for duplication checks are as follows:

  1. Humanitarian agencies come together to agree on a shared standard of data end points that the system will use to carry out the duplication checks. The data end points are pieces of information that, when combined, can uniquely identify a beneficiary; examples are the beneficiary’s first name, last name, gender, date of birth, place of birth, etc. The agencies also agree on a standard way of processing the data end points to reduce the effect of data entry mistakes.
  2. Humanitarian agencies then agree on a standard way to generate beneficiary unique identifiers (UIDs) from the data end points. The system compares these UIDs to check for duplication, since no two distinct beneficiaries should share the same UID.
  3. After agreeing on the data end points and on the duplication check configurations and methods, the agencies configure those standards and methods in the GeniusChain system. To do that, the agencies create accounts on the system, define data preprocessing and cleaning rules (to correct data entry mistakes as agreed in the standard in step 1), and define the rules by which the UID is generated.
  4. After finishing the system configuration, agencies can begin conducting duplication checks among each other. They can do that in the field through the GeniusChain mobile application, directly from within their data collection tool of choice (such as Kobo Toolbox or CommCare), or at the office from the GeniusChain web portal.

The system carries out duplication checks by generating a universally unique identifier (UUID, the same identifier referred to as the UID in the steps above) for each beneficiary from their information (i.e., the data end points), following the agreed-upon standards. Each time a beneficiary is registered, their UUID is generated and recorded on the blockchain. If the system finds the same UUID already registered on the blockchain, the beneficiary to whom the UUID belongs is flagged to the user as a duplicate.
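The check described above can be illustrated with a minimal Python sketch. It is not the GeniusChain implementation: the field list, the use of SHA-256, the `make_identifier` function, and the in-memory `Registry` class (a stand-in for the blockchain) are all assumptions made for illustration, and the sample record is made up.

```python
import hashlib

# Assumed set and order of data end points; the real list is whatever the agencies agree on.
FIELDS = ["first_name", "last_name", "gender", "date_of_birth"]

def make_identifier(record: dict) -> str:
    """Hash the agreed data end points, in a fixed order, into a hex digest."""
    # The unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    joined = "\x1f".join(str(record[name]) for name in FIELDS)
    # SHA-256 is an assumption; the article only says identifiers are hashes.
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

class Registry:
    """In-memory stand-in for the shared ledger; it stores identifiers only."""

    def __init__(self) -> None:
        self._seen = set()

    def register(self, identifier: str) -> bool:
        """Record the identifier and return True if it was already present."""
        if identifier in self._seen:
            return True
        self._seen.add(identifier)
        return False

# Made-up sample record, registered twice as if by two different agencies.
person = {"first_name": "amina", "last_name": "bello",
          "gender": "female", "date_of_birth": "1990-03-05"}

registry = Registry()
first_is_duplicate = registry.register(make_identifier(person))
second_is_duplicate = registry.register(make_identifier(dict(person)))
print(first_is_duplicate, second_is_duplicate)  # prints: False True
```

Because only the digest is stored and compared in this sketch, the registry never holds the underlying personal information, which is one plausible way to read the claim that personal data stays protected.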

The system offers a wide range of options for defining data preprocessing rules, which correct data entry mistakes and automatically convert incoming data into the agreed-upon standard. Preprocessing is crucial to the accuracy of duplication checks: because UUIDs are hashes of the data end points, any change to the data, however small, can produce a different UUID. The system also offers a comprehensive set of tools for configuring the duplication check rules to match the humanitarian agencies’ contexts and needs.
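To see why preprocessing matters when identifiers are hashes, consider the following sketch. The cleaning rules (Unicode normalization, whitespace collapsing, case folding, and a small list of accepted date formats), the helper names, and the use of SHA-256 are assumptions chosen for illustration, not the rules configured in the pilot.

```python
import hashlib
import unicodedata
from datetime import datetime

def clean_text(value: str) -> str:
    """Normalize Unicode, collapse whitespace, and ignore letter case."""
    value = unicodedata.normalize("NFKC", value)
    return " ".join(value.split()).casefold()

def clean_date(value: str) -> str:
    """Accept a few assumed input formats and emit ISO 8601 (YYYY-MM-DD)."""
    # Day-before-month order is an assumption; agencies would need to agree on it.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

def digest(parts: list) -> str:
    """Hash the parts in order, separated by a character unlikely to occur in values."""
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

def clean_record(raw: list) -> list:
    """Apply the cleaning rules to [first name, last name, date of birth]."""
    return [clean_text(raw[0]), clean_text(raw[1]), clean_date(raw[2])]

# The same made-up person as two agencies might have typed it.
raw_a = ["Amina ", "BELLO", "05/03/1990"]
raw_b = ["amina", "Bello", "1990-03-05"]

raw_match = digest(raw_a) == digest(raw_b)
clean_match = digest(clean_record(raw_a)) == digest(clean_record(raw_b))
print(raw_match, clean_match)  # prints: False True
```

Without cleaning, a trailing space or a different date format is enough to hide the duplicate; with the same rules applied on both sides, the two entries hash to the same value.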

Shared standard of data end points

To arrive at a shared standard of data end points to be used for deduplication, we asked the participating agencies to provide us with the surveys and questionnaires that they use for collecting beneficiary information. We then analyzed the questionnaires and compiled a form of the data end points, with each data end point corresponding to a question that the participating agencies ask in their surveys. The form lists the data end points, explains what each one means, and describes what data to expect as an answer. We then shared the form with both agencies and asked them to confirm that they collect each data end point, to provide samples of the data collected at each one, and to provide samples of the common data entry mistakes they come across while processing data before use.

The data end points collected by both agencies are:

  • Project information: this is general information about the project for which duplication checks will be conducted; the information includes: project name, donor, sector (cash assistance, food, etc.), number of beneficiaries, status (whether ongoing, completed, or upcoming) and implementation regions.
  • Respondent data: this is the personal information of the individual that responded to the survey, which includes: the respondent’s first name, last name, date of birth (or age), gender and community. If the respondent is not the household head, then the same information is collected about the household head.
  • Household data: this is information collected about the household as a whole which includes: whether there’s a disability in the household, whether the household head is disabled, whether there’s an income for the household, as well as the household size disaggregated by the number of males, females, and age brackets.
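One way to picture the three groups of data end points above is as a small set of record types. The sketch below is only an illustration: the class names, field names, and types are assumptions, as is the reading of "disaggregated" as a count per (sex, age bracket) pair; the sample values are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class Project:
    name: str
    donor: str
    sector: str                      # e.g. "cash assistance", "food"
    number_of_beneficiaries: int
    status: str                      # "ongoing", "completed", or "upcoming"
    implementation_regions: List[str]

@dataclass
class Person:
    first_name: str
    last_name: str
    gender: str
    community: str
    date_of_birth: Optional[str] = None
    age: Optional[int] = None

    def __post_init__(self) -> None:
        # The surveys collect a date of birth or an age, so require at least one.
        if self.date_of_birth is None and self.age is None:
            raise ValueError("either date_of_birth or age is required")

@dataclass
class Household:
    has_disability: bool
    head_is_disabled: bool
    has_income: bool
    # Assumed layout: (sex, age bracket) -> number of household members.
    members: Dict[Tuple[str, str], int]

    @property
    def size(self) -> int:
        return sum(self.members.values())

@dataclass
class SurveyRecord:
    project: Project
    respondent: Person
    household: Household
    # Filled in only when the respondent is not the household head.
    household_head: Optional[Person] = None

    def head(self) -> Person:
        return self.household_head or self.respondent

record = SurveyRecord(
    project=Project("Sample project", "Sample donor", "food", 500, "ongoing", ["Region A"]),
    respondent=Person("Amina", "Bello", "female", "Community X", age=33),
    household=Household(False, False, True, {("female", "18-59"): 1, ("male", "0-17"): 2}),
)
print(record.household.size, record.head().first_name)  # prints: 3 Amina
```

Making the record shape explicit like this also makes it easier to see which fields are shared by both agencies and can therefore feed the duplication check.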

Onboarding participating agencies on the system

The GeniusChain platform comes with many configuration options that enable agencies to define their context and conduct duplication checks precisely according to their requirements. Because of that, we took an approach to onboarding that lets agencies gradually learn about the features and capabilities of the system while testing it in their own context. The first step in that approach was getting the agencies familiar with the system.

To do that, we configured the GeniusChain system to perform a basic duplication check that relies on the beneficiary’s first name, last name, gender and household disability status. We also recorded a 10-minute video with step-by-step instructions on how to create an account and run a first duplication check. The video made the task much easier to understand and perform, and both agencies completed it smoothly.

The steps explained in the video are:

  • Creating an account and getting the account activated.
  • Navigating to where the user can run a duplication check.
  • Testing the system by running a duplication check for beneficiaries in two modes: one-by-one, where the user validates a beneficiary’s information by entering it into a form in the system, and bulk validation, where the user validates the information of several beneficiaries at the same time by uploading an Excel spreadsheet that contains the beneficiary information.
  • Sharing the results back with GeniusTags.
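The two validation modes in the steps above can be sketched as follows, using the four fields of the basic configuration (first name, last name, gender and household disability status). This is a hypothetical illustration rather than the GeniusChain API: the function names, column names, and SHA-256 are assumptions, a Python set stands in for the blockchain, CSV text stands in for the Excel spreadsheet so that the example needs only the standard library, and the rows are made up.

```python
import csv
import hashlib
import io

# Fields of the basic check; the column names themselves are assumed.
CHECK_FIELDS = ["first_name", "last_name", "gender", "household_disability"]

def identifier(row: dict) -> str:
    """Clean each field lightly, then hash the fields in a fixed order."""
    parts = [" ".join(row[name].split()).casefold() for name in CHECK_FIELDS]
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()

def check_one(row: dict, seen: set) -> bool:
    """One-by-one mode: return True if this beneficiary was already registered."""
    uid = identifier(row)
    is_duplicate = uid in seen
    seen.add(uid)
    return is_duplicate

def check_bulk(csv_text: str, seen: set) -> list:
    """Bulk mode: check every row of an uploaded sheet, in order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [check_one(row, seen) for row in reader]

SAMPLE = """first_name,last_name,gender,household_disability
Amina,Bello,female,no
Chidi,Okafor,male,yes
amina ,BELLO,Female,No
"""

seen = set()
bulk_flags = check_bulk(SAMPLE, seen)
single_flag = check_one(
    {"first_name": "Chidi", "last_name": "Okafor",
     "gender": "male", "household_disability": "yes"},
    seen,
)
print(bulk_flags, single_flag)  # prints: [False, False, True] True
```

A check on only four fields like this will flag different people who happen to share those values, which is why the configuration is expected to be refined with the agencies.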

Challenges and next steps

Our primary challenge in this sprint was the holiday season: wrapping up 2022 and kickstarting 2023 took around one month off the sprint time. However, we still managed to get the planned work for the sprint done. Now that the participating agencies are onboarded onto the system, we will work with them to improve the accuracy of the basic duplication check configuration we created by:

  • Reviewing the data end points, data preprocessing and UID generation rules, and updating them based on feedback from the participating agencies.
  • Experimenting with the updated configurations and adjusting them through trial and error to improve the accuracy of the duplication check results.
