Advancing Public Sector Data Accuracy with AI-Powered Text Similarity Solutions
Writers: Cindy Adelia Setiawan and Rizka Azmira
Overview
Good quality data is fundamental to delivering high-quality services to citizens. In the public sector, data serves as the backbone for policy making, program implementation, and service delivery. However, achieving high-quality data requires more than just collecting information; it demands accuracy, consistency, and alignment across multiple datasets.
Interoperability of data is one of the government’s key priorities, as outlined in Indonesia’s One Data Policy. This policy emphasizes the importance of integrating and standardizing data across government entities nationwide to create a unified, accurate source of information. Despite this commitment, implementing the regulation remains an ongoing challenge: differences in how data is recorded and managed across institutions often result in inconsistencies, creating gaps that hinder effective collaboration and service delivery.
In Indonesia’s Ministry of Education, Culture, Research, and Technology, data quality challenges are an unfortunate reality. With vast amounts of data collected nationwide, these issues affect the quality of services provided to the public, and the problem becomes even more complex when the data is integrated with other ministries and agencies across the country. In this article, we share two use cases: one from the Directorate of Higher Education (DIKTI) and one from the Directorate of Primary and Secondary Education (DIKDASMEN). First, managing educational data requires resolving duplicate lecturer records caused by errors in NIK, name, gender, and date of birth; these duplicates can lead to fraud, legal issues, or individuals being denied rightful benefits, making their resolution critical. Second, reconciling differences in data structure between the Dapodik and BKN systems is crucial to ensure consistency in employment, payroll, and operational data for educators. We elaborate on these use cases in the following sections.
Classical and Contemporary Approaches
As practical scientists, we begin by applying simple, well-established methods to address our challenges. Our standard approach is to prioritize techniques that are straightforward yet effective; if these fall short, we then explore more advanced and complex alternatives to ensure the problem is resolved.
- Levenshtein Distance and Jaro-Winkler Similarity
Text similarity is essential for reconciling mismatched data, especially with typos or variations in names. It measures how alike two strings are, aiding in tasks like identity verification and data cleaning. Classical methods like Levenshtein Distance and Jaro-Winkler Similarity effectively address these discrepancies. Levenshtein measures the number of single-character edits (insertions, deletions, substitutions) required to transform one string into another. It’s effective at detecting differences in text structure but can undervalue abbreviations and reordered names. Jaro-Winkler focuses on the alignment of common characters and gives higher scores to strings that match early. It excels at handling abbreviations and prefix variations, making it ideal for names with common modifications like initials or titles.
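As a minimal illustration of the two metrics, here is a sketch using the python-Levenshtein package (one of several libraries that implement both; the name pair is hypothetical):

import Levenshtein

# A hypothetical pair: the same person written with and without an abbreviated suffix.
a, b = "RINA WIJAYA", "RINA WIJAYA S."
print(Levenshtein.distance(a, b))      # number of single-character edits needed
print(Levenshtein.ratio(a, b))         # normalized Levenshtein similarity in [0, 1]
print(Levenshtein.jaro_winkler(a, b))  # boosted by the long shared prefix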
- Semantic Search via Generative AI
Semantic search leverages the power of Generative AI (gen-AI), such as GPT-4, to go beyond simple keyword matching and focus on understanding the contextual meaning of words. Unlike classical methods, which rely on syntactic similarity, gen-AI enables more nuanced analysis by capturing the semantic relationships between strings. This approach is particularly useful when dealing with complex datasets that involve synonyms, abbreviations, or paraphrased content. By transforming text into embeddings, semantic search allows for efficient comparison at a deeper level of meaning. The result is a more accurate and flexible system for aligning datasets and uncovering hidden connections.
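One way embedding-based comparison could look in code is sketched below; the OpenAI client, the text-embedding-3-small model name, and the example strings are illustrative assumptions, not necessarily what was used in these projects.

from openai import OpenAI
import numpy as np

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def embed(texts):
    # Request embedding vectors for a batch of strings; the model name is illustrative.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [np.array(d.embedding) for d in resp.data]

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Compare two ways of writing the same school name at the level of meaning rather than characters.
v1, v2 = embed(["SMA Negeri 1 Surakarta", "SMAN 1 Solo"])
print(cosine_similarity(v1, v2))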
Identifying Duplicates and Inconsistencies in University Lecturer Data
Managing educational data effectively requires addressing challenges such as identifying duplicate lecturer records. Duplicates are identified based on attributes like NIK (Indonesian National ID), name, gender, and date of birth, which should ideally be unique to one person. NUPTK (Unique Identification Number for Educators) provides a permanent identifier, but names often vary due to titles and abbreviations. Duplicates arise from several factors: data entry errors by users, verification mistakes, or even potential misuse of data. Addressing them is critical because:
- duplicate accounts potentially lead to fraud, such as claiming multiple benefits or allowances;
- misuse of data may result in legal implications for the affected individual; and/or
- data entry errors can prevent individuals from receiving their rightful benefits and allowances.
Variations in lecturer name formatting across systems present significant challenges in detecting duplicates, stemming largely from the use of academic titles, honorifics, and abbreviations. These inconsistencies make traditional string similarity methods insufficient on their own, because they fail to account for abbreviations, semantic equivalences, or structural differences in names. Traditional approaches therefore struggle to resolve these challenges effectively, highlighting the need for AI-driven solutions tailored to these complexities.
Preprocessing: Refining Names by Removing Stopwords and Standardizing Variations
Before applying the text similarity algorithms, we preprocess the data to address common inconsistencies in lecturer names. This involves standardizing names by removing or replacing predefined stopwords such as academic titles, honorifics, and common abbreviations, so that the algorithms focus on meaningful comparisons rather than formatting differences. An example of the stopword replacement map:
word_replace = {
    'm ': 'muhammad', 'dr ': '', 'hj. ': '',
    's.': '', 'se.': ''
}
Examples
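One possible way to apply this normalization is sketched below as a token-based variant (which avoids accidental mid-word replacements); the token map mirrors word_replace above, and the sample name is hypothetical.

import re

# Token-level equivalents of the replacement map above; purely illustrative.
token_replace = {'m': 'muhammad', 'dr': '', 'hj': '', 's': '', 'se': ''}

def normalize_name(name: str) -> str:
    # Lowercase, strip punctuation, then drop titles/honorifics and expand abbreviations.
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    replaced = [token_replace.get(t, t) for t in tokens]
    return " ".join(t for t in replaced if t)

print(normalize_name("Hj. M RINAWATI"))  # -> "muhammad rinawati"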
Method: Hybrid Text Similarity Models
We explored two text similarity algorithms — Levenshtein and Jaro-Winkler — to compare lecturer names. Levenshtein excels at positional accuracy, while Jaro-Winkler handles abbreviations better. To enhance results, we used our custom stopword list as in the preprocessing steps above.
Sample Outputs:
- RINAWATI KASRIN vs. RINAWATI: Average similarity 0.801 (Levenshtein 0.696, Jaro-Winkler 0.907)
- FARIANSYAH HASSAN BASRIE vs. FARIANSYAH HB: Average similarity 0.806 (Levenshtein 0.703, Jaro-Winkler 0.908)
- CINDY ADELIA SETIAWAN vs. CINDY ANGELINA: Average similarity 0.749 (Levenshtein 0.629, Jaro-Winkler 0.869)
Balancing Accuracy with Dual Scoring
To manage the trade-offs, we averaged the scores from Levenshtein and Jaro-Winkler. Levenshtein’s sensitivity to positional differences complements Jaro-Winkler’s strength with prefixes and abbreviations, ensuring robust identification across name variations. Based on our annotated experiments, we established a threshold of 80% to distinguish between different individuals and duplicates:
- [0, 0.8): Different individuals
- [0.8, 1]: Likely the same person
This threshold ensures reliable identification while accounting for the nuances of name variations.
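A minimal sketch of the dual scoring and threshold rule, again using the python-Levenshtein package as one possible implementation (the article does not name a specific library); inputs are assumed to be preprocessed with normalize_name above.

import Levenshtein

def average_similarity(a: str, b: str) -> float:
    # Mean of the normalized Levenshtein score and the Jaro-Winkler score.
    return (Levenshtein.ratio(a, b) + Levenshtein.jaro_winkler(a, b)) / 2

def is_likely_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    # Pairs scoring at or above 0.8 are flagged as likely the same person.
    return average_similarity(a, b) >= threshold

print(average_similarity("RINAWATI KASRIN", "RINAWATI"))  # about 0.80, consistent with the sample outputs above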
Results
The AI-driven analysis enabled us to provide clear, data-informed recommendations to stakeholders for resolving duplicate lecturer data (dosen ganda). We identified more than 45,000 cases, with around 8,000 lecturers actively teaching, and categorized them into three risk levels, each paired with recommended actions.
These actions aim to address duplication issues efficiently, prioritize cases by risk level, and support stakeholders in maintaining accurate lecturer records. In addition to identifying duplicate cases, our analysis also uncovered inconsistencies between the user and Human Resource Department tables, where names in the SISTER database did not match the names recorded by the Human Resource Department.
We leveraged text similarity models to identify these inconsistencies and flagged the affected records. This information was then passed to the relevant teams to take corrective actions, ensuring data integrity across the system.
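As a sketch of how such cross-table checks can be automated, the snippet below reuses the normalize_name and average_similarity helpers above; the record structure and field names (nuptk, name) are hypothetical, since the actual SISTER and Human Resource schemas are not shown here.

def flag_name_mismatches(sister_rows, hr_rows, threshold=0.8):
    # Index HR records by identifier, then flag pairs whose names score below the threshold.
    hr_by_id = {row['nuptk']: row for row in hr_rows}
    flagged = []
    for row in sister_rows:
        hr = hr_by_id.get(row['nuptk'])
        if hr is None:
            continue
        score = average_similarity(normalize_name(row['name']), normalize_name(hr['name']))
        if score < threshold:
            flagged.append({'nuptk': row['nuptk'], 'sister_name': row['name'],
                            'hr_name': hr['name'], 'score': round(score, 3)})
    return flagged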
Impact
Our efforts ensure the integrity of user records, safeguard lecturers’ rights and allowances, and prevent fraudulent activities such as claiming multiple benefits through duplicate accounts. AI-driven solutions significantly outperform manual methods, where categorizing the ~8K active cases would demand exhaustive validation effort. Using AI, nearly 25% of duplicate cases are resolved automatically, while the remaining 75% are efficiently prioritized for further review.
Data Readiness for Supporting Inter-Ministerial Data Interoperability
The lecturer name matching problem taught us a fundamental lesson about text similarity. When dealing with personal names, the variations tend to follow predictable patterns based on social and professional conventions. But our next data matching challenge would prove far more complex, forcing us to rethink our approach to text similarity. What worked for personal names was just the tip of the iceberg. We needed something more comprehensive — an approach that could understand context, not just compare text. Here’s how we evolved our solution.
The Context: Why Match Data Between Dapodik and BKN?
Dapodik (Data Pokok Pendidikan) is the Indonesian Ministry of Education’s database for schools, tracking detailed records of students, teachers, and institutions. On the other hand, BKN (Badan Kepegawaian Negara) manages employment data for all government employees across Indonesia, including teachers.
Matching data between these systems ensures consistency in employment, payroll, and operational data for educators. However, the two systems are structured for different purposes, leading to significant discrepancies:
- Dapodik emphasizes school-specific data.
- BKN employs a generalized schema for all government institutions, introducing additional administrative details.
These differences result in variations in how school names are recorded. For example:
- Dapodik: “SMA Negeri 1 Surakarta”
- BKN: “UPT SMAN 01 Solo Kec. Banjarsari”
This mismatch necessitates reconciling the two datasets to align institutional records accurately.
The Challenges
School name variations between BKN and Dapodik arise from three key factors: regional naming differences (for example, “Solo” versus “Surakarta”), structural variations such as added prefixes and abbreviations (“UPT”, “SMAN” versus “SMA Negeri”), and additional administrative details such as district names. These differences make traditional string similarity methods ineffective.
Traditional methods like Levenshtein distance or Jaro-Winkler fail to account for:
- Semantic meaning, such as equivalent terms or regional differences.
- Structural changes, like additional prefixes or reordered information.
For example, traditional methods would see “UPT SMAN 01 Solo Kec. Banjarsari” and “SMA Negeri 1 Surakarta” as unrelated strings due to low textual similarity, even though they refer to the same school.
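To make the point concrete, here is what the classical scores from the first use case look like on this pair (a sketch using the same python-Levenshtein package as before):

import Levenshtein

a = "UPT SMAN 01 Solo Kec. Banjarsari"
b = "SMA Negeri 1 Surakarta"
# Both scores come out well below the 0.8 threshold used for lecturer names,
# even though the two strings refer to the same school.
print(Levenshtein.ratio(a, b), Levenshtein.jaro_winkler(a, b))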
Preprocessing: Metadata Extraction
The first step was to extract key attributes from school names using GPT-3.5, distilling each name into structured metadata. To ensure consistency, this step also standardized regional names against a reference database, for example mapping “Solo” to “Surakarta.”
By focusing on attributes rather than raw text and standardizing regional names, this step reduced variability and allowed for more consistent comparisons grounded in our standardized reference data.
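A minimal sketch of this extraction step, assuming the OpenAI chat completions API; the prompt wording, the JSON schema, and the REGION_ALIASES entries are illustrative rather than the exact production setup.

import json
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

EXTRACTION_PROMPT = (
    'Extract metadata from this Indonesian school name and answer only with JSON '
    'using the keys "level", "status", "city", "kecamatan".\n'
    'School name: "{name}"'
)

# A small slice of the regional reference data; real lookups would come from the reference database.
REGION_ALIASES = {'solo': 'surakarta', 'mekar jaya': 'mekarjaya'}

def standardize_region(region: str) -> str:
    region = region.strip().lower()
    return REGION_ALIASES.get(region, region)

def extract_metadata(school_name: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(name=school_name)}],
        temperature=0,
    )
    meta = json.loads(resp.choices[0].message.content)  # real code should validate the returned JSON
    meta["city"] = standardize_region(meta.get("city", ""))
    return meta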
Method: Filtering by Metadata and Semantic Evaluation with AI
Once the metadata extraction step was complete, the next challenge was identifying the correct match for each school name. The process involved two key steps: filtering by metadata and semantic evaluation.
Filtering by Metadata
The first step was to reduce the pool of potential matches. For example:
- Input name: “UPT SMAN 03 Mekar Jaya Kec. Bukit Indah”
- Extracted metadata:
- Level: SMA
- Status: Negeri
- City: Mekarjaya
- Kecamatan: Bukit Indah
Using these attributes, we filtered all schools in the Dapodik database to find candidates matching the key criteria. This narrowed the pool from thousands of schools nationwide to a smaller, localized subset; for the example above, it produced five candidates.
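A sketch of this filtering step, reusing standardize_region from the extraction sketch; the Dapodik record fields and the exact set of attributes required to match are assumptions for illustration.

def filter_candidates(meta: dict, dapodik_schools: list[dict]) -> list[dict]:
    # Keep only schools whose level, status, and standardized city match the extracted metadata.
    return [
        school for school in dapodik_schools
        if school['level'] == meta['level']
        and school['status'] == meta['status']
        and standardize_region(school['city']) == standardize_region(meta['city'])
    ]

# e.g. candidates = filter_candidates(extract_metadata("UPT SMAN 03 Mekar Jaya Kec. Bukit Indah"), dapodik_schools)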
Semantic Evaluation
While filtering by metadata significantly reduced the pool, it wasn’t enough to guarantee accuracy: multiple schools might meet the same criteria but differ in name formatting or semantic details. To resolve these ambiguities, we utilized GPT-4 to perform a semantic evaluation (sketched after the list below), scoring each candidate for the input name “UPT SMAN 03 Mekar Jaya Kec. Bukit Indah” based on:
- Regional naming variations: Recognizing that “Mekarjaya” and “Mekar Jaya” refer to the same city.
- Structural variations: Understanding the meaning of abbreviations like “SMA Negeri” ↔ “SMAN.”
- Administrative context: Giving preference to candidates with matching district or school numbers.
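A sketch of how this evaluation call might look, reusing the client defined in the extraction sketch; the prompt and the JSON answer format are illustrative.

import json

def evaluate_candidates(input_name: str, candidate_names: list[str]) -> dict:
    numbered = "\n".join(f"{i + 1}. {name}" for i, name in enumerate(candidate_names))
    prompt = (
        f'BKN school name: "{input_name}"\n'
        f"Dapodik candidates:\n{numbered}\n"
        "Which candidate refers to the same school? Consider regional naming variations, "
        "abbreviations such as SMAN for SMA Negeri, and matching district or school numbers. "
        'Answer only with JSON: {"best_match": <candidate number>, "confidence": <0-100>}.'
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)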
Threshold and Processing
To balance accuracy and efficiency, the pipeline ran in three stages.
Determine the optimal candidate threshold
- Sampled a set of school names and analyzed where their correct matches typically ranked when candidates were sorted by Levenshtein string similarity.
- Plotted the cumulative distribution to find the ranking position that captures most true matches (recall).
- Selected the threshold that optimized the trade-off between processing load and match accuracy (a sketch of this step appears after these stages).
Process names through initial filtering
- Filtered candidates using the extracted metadata.
- Sorted the remaining matches by Levenshtein similarity.
- Kept the top candidates based on the threshold.
Run the final evaluation using GPT-4
GPT-4 then returned the closest match for each input name, along with its confidence score.
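Below is a sketch of how the candidate threshold could be chosen from an annotated sample using the normalized Levenshtein score; the 95% recall target and the data structures are assumptions for illustration.

import Levenshtein

def choose_candidate_threshold(true_matches: dict, candidate_pool: dict, target_recall: float = 0.95) -> int:
    # true_matches maps a sampled BKN name to its known correct Dapodik name;
    # candidate_pool maps the same BKN name to its metadata-filtered candidate list.
    ranks = []
    for bkn_name, correct_name in true_matches.items():
        ranked = sorted(
            candidate_pool[bkn_name],
            key=lambda c: Levenshtein.ratio(bkn_name, c),
            reverse=True,
        )
        ranks.append(ranked.index(correct_name) + 1)
    # Smallest cutoff k whose cumulative recall (share of true matches ranked in the top k) hits the target.
    for k in range(1, max(ranks) + 1):
        if sum(r <= k for r in ranks) / len(ranks) >= target_recall:
            return k
    return max(ranks)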
Results
The AI-driven approach processed 121,466 unique school names and identified:
- 68,105 high-confidence matches with over 90% confidence.
- 62,399 validated matches after further filtering and cross-checks.
Impact
The implementation of AI-driven matching between the Dapodik and BKN systems has demonstrated meaningful improvements in both processing efficiency and data quality compared with manual effort for the 120K+ school names.
The improved data alignment between Dapodik and BKN has yielded practical benefits for key stakeholders. Teachers and education staff experience faster employment verification and streamlined administrative processes. Educational institutions benefit from reduced overhead in maintaining consistent records across both systems. Government agencies can now access unified data for policy planning and resource allocation, enabling more informed decision-making in areas such as teacher distribution and training programs.
This systematic approach to data matching also provides a practical framework that can be adapted for similar data integration challenges across other government agencies, supporting broader improvements in public service delivery.
Key Takeaways
The task of matching Dapodik and BKN data highlights the need for flexible solutions tailored to the complexity of the problem. While structured variations are manageable with traditional text similarity methods, semantic and contextual differences require AI-based approaches. By leveraging AI, we were able to achieve accurate matches, ensuring the alignment of data between these two critical systems.
Concluding Remarks
Working on these two projects taught us something important about solving text similarity problems. The key isn’t just having powerful algorithms — it’s about understanding the context of why text varies in the first place. People shorten names, use different titles, or write things in multiple valid ways. A good solution needs to account for all these real-world patterns.
The approaches we developed can be applied to many similar problems: matching company names across databases, finding duplicate customer records, or comparing product descriptions. Anytime you need to figure out if different pieces of text refer to the same thing, these methods can help.
What started as specific problems with matching names turned into insights about handling text similarity at scale. Sometimes the best solutions come from looking past the surface-level differences in text and understanding the patterns that connect them.
While the government continues its efforts to establish standardized data frameworks, tools like text similarity are critical to bridging the gap. By enabling the accurate matching of datasets from different ministries and agencies, text similarity plays a significant role in supporting data interoperability. It ensures that, even in the absence of fully standardized data, public institutions can still work collaboratively, delivering efficient and reliable services to the citizens they serve.