Sciforce

We rock the science

Building a Unified Medical Vocabulary Framework Aligned with OMOP CDM

Client Profile

Our client develops standardized medical vocabularies that unify data from different sources, providing the consistency and interoperability needed to support observational studies and generate evidence. Their work helps researchers, data scientists, and healthcare professionals integrate and analyze medical information using established standards such as SNOMED CT, LOINC, and RxNorm.

By structuring and maintaining the unified medical dictionaries, they support large-scale research, improve data accessibility, and help healthcare providers to make decisions based on up-to-date evidence. Their approach allows experts worldwide to collaborate more efficiently and apply data-driven insights in medical practice and research.

Challenge

A standardized vocabulary must include:

  • Unique Identifiers — Consistent coding for medical concepts
  • Nomenclature — Standardized naming conventions
  • Thesaurus — Synonyms to support terminology consistency
  • Taxonomy — A classification system for organizing concepts
  • Network hierarchy — Upward/downward relationships augmented with lateral associations between concepts

Additionally, all input data must be normalized into a unified format, utilizing standardized terminologies and controlled vocabularies, to enable consistent querying and interoperability across healthcare systems. This requires expertise in medical ontology, engineering, hierarchical modeling, and domain-specific contexts to ensure accuracy and seamless integration for robust observational research.

Semantic Integration Across Sources

Medical coding systems such as ICD, SNOMED CT, and LOINC vary in structures and follow different conventions. The challenge was to align these diverse terminologies into a single vocabulary while preserving the integrity of the original datasets.

Complex Hierarchical Relationships

Medical concepts follow a structured, multi-axial, and dynamically evolving hierarchy that links diagnoses, drugs, measurements, procedures, and observations. The system had to accurately represent parent-child and lateral relationships, support continuous updates, and adapt to rapid changes in medical knowledge.

Harmonization of Data Formats

Healthcare data originates from various systems with different formats, creating inconsistencies. Transforming these datasets into a standardized OMOP CDM-compatible format required preserving the essential metadata and ensuring lossless integration.

Maintaining Contextual Accuracy

Medical concepts vary across healthcare settings due to the differences in local practices, coding granularity, and terminology interpretations. Aligning these variations with standardized definitions was essential to maintain clinical and operational relevance.

Dynamic Evolution and Scalability

Medical coding systems continuously evolve with new discoveries and practices. The vocabulary framework had to support ongoing updates without disrupting integrations, ensuring long-term reliability and adaptability to future advancements.

Solution

Development and Harmonization of Standardized Vocabularies

We integrated multiple medical terminologies to create a unified structure aligned with the OMOP CDM. This included:

  • Core Clinical Vocabularies: SNOMED CT, LOINC, OMOP Extension, RxNorm, RxNorm Extension
  • Pharmacological Classification: ATC
  • Procedural and Diagnostic Codes: ICD9Proc, ICD10, ICD10PCS, ICD10CM, CPT4, HCPCS, OPS
  • Oncology-Specific Resources: HemOnc, ICDO3, JAX, NCIt, OncoKB
  • Genetics and Genomics Resources: OMOP Genomic, ClinVar, CIViC, CGI, HGNC

Ensuring Consistency and Interoperability

We set clear rules to standardize raw datasets, organized vocabularies into structured hierarchies, and developed algorithms to fix inconsistencies in definitions and classifications. To ensure accurate and efficient querying and analysis, we converted vocabularies into a common format and applied strict quality checks, including automated validation for metadata accuracy and completeness.

Addressing Semantic and Structural Challenges

To ensure accuracy and consistency across different medical vocabularies, we developed solutions to unify terms, maintain relationships, and improve data clarity.

  • Mapping Algorithms: Linked source terms to standard concepts while keeping the original data intact.
  • Dynamic Hierarchies: Organized medical concepts into structured parent-child relationships and lateral property-based associations.
  • Human-Readable Descriptions: Enriched vocabulary entries with detailed metadata covering concept identification (preferred names, synonyms, concept classes), semantic context (domains, standardness), temporal validity (validity periods, creation/deprecation dates), and cross-references (associated mappings), all supplemented by comprehensive documentation.
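As a rough illustration of the mapping idea, the sketch below resolves a source code to its standard concept through a 'Maps to' relationship while leaving the source row untouched. It is not the project's actual algorithm: it runs on an in-memory SQLite database instead of PostgreSQL, keeps only a few columns of the OMOP concept and concept_relationship tables, and uses made-up concept IDs. The function name map_to_standard is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id       INTEGER PRIMARY KEY,
    concept_name     TEXT,
    vocabulary_id    TEXT,
    concept_code     TEXT,
    standard_concept TEXT      -- S marks a standard concept
);
CREATE TABLE concept_relationship (
    concept_id_1    INTEGER,
    concept_id_2    INTEGER,
    relationship_id TEXT,
    invalid_reason  TEXT       -- NULL while the link is active
);
-- Illustrative rows; the concept IDs are made up for this sketch.
INSERT INTO concept VALUES
    (1001, 'Essential (primary) hypertension', 'ICD10CM', 'I10', NULL),
    (2001, 'Essential hypertension', 'SNOMED', '59621000', 'S');
INSERT INTO concept_relationship VALUES (1001, 2001, 'Maps to', NULL);
""")


def map_to_standard(conn, vocabulary_id, concept_code):
    """Return (source code, standard concept_id, standard name) rows."""
    return conn.execute(
        """
        SELECT src.concept_code, std.concept_id, std.concept_name
        FROM concept AS src
        JOIN concept_relationship AS cr
          ON cr.concept_id_1 = src.concept_id
         AND cr.relationship_id = 'Maps to'
         AND cr.invalid_reason IS NULL
        JOIN concept AS std
          ON std.concept_id = cr.concept_id_2
         AND std.standard_concept = 'S'
        WHERE src.vocabulary_id = ? AND src.concept_code = ?
        """,
        (vocabulary_id, concept_code),
    ).fetchall()


mapped = map_to_standard(conn, "ICD10CM", "I10")
unmapped = map_to_standard(conn, "ICD10CM", "Z99")
print(mapped, unmapped)
```

Because the source concept stays in the table and only a relationship row is added, the original coding remains available for auditing alongside the standard concept.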

Continuous Maintenance and Quality Assurance

We implemented a comprehensive vocabulary management system, encompassing:

  • Staged Loading and Integration: Utilizing SQL scripts and functions, dedicated manual input tables (e.g., concept_manual, concept_relationship_manual), staging tables (e.g., concept_stage, concept_relationship_stage), and a generic_update.sql script to systematically build and integrate vocabulary updates into the basic OMOP Vocabulary tables (e.g., concept, concept_relationship).
  • Versioning System: Establishing a versioning system, utilizing Git principles, that empowers users to track the updates and apply changes as needed.
  • Manual Review and Validation: Conducting thorough manual reviews by medical domain experts to maintain accuracy, reliability, and clinical relevance.
  • Automated Quality Checks: Employing automated checks to verify data completeness and consistency.
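The staged-loading flow can be sketched as follows. This is a simplified illustration under stated assumptions, not a reproduction of generic_update.sql: it uses in-memory SQLite, a reduced set of columns, a hypothetical vocabulary called DemoVocab, and one possible policy in which codes absent from the new release are marked deprecated ('D'). The helper names stage_problems and merge_stage are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    concept_name   TEXT NOT NULL,
    vocabulary_id  TEXT NOT NULL,
    concept_code   TEXT NOT NULL,
    invalid_reason TEXT,
    UNIQUE (vocabulary_id, concept_code)
);
CREATE TABLE concept_stage (
    concept_name  TEXT,
    vocabulary_id TEXT,
    concept_code  TEXT
);
-- Current release (illustrative rows).
INSERT INTO concept (concept_name, vocabulary_id, concept_code) VALUES
    ('Old name for A', 'DemoVocab', 'A'),
    ('Concept B',      'DemoVocab', 'B');
-- Staged source release: A is renamed, C is new, B is no longer present.
INSERT INTO concept_stage VALUES
    ('New name for A', 'DemoVocab', 'A'),
    ('Concept C',      'DemoVocab', 'C');
""")

# Correlated predicate reused below: "this concept row has a staged counterpart".
IN_STAGE = """EXISTS (SELECT 1 FROM concept_stage AS s
                      WHERE s.vocabulary_id = concept.vocabulary_id
                        AND s.concept_code = concept.concept_code)"""


def stage_problems(conn):
    """Automated checks on the stage; returns labels of failed checks."""
    checks = {
        "empty concept_name": """
            SELECT COUNT(*) FROM concept_stage
            WHERE concept_name IS NULL OR TRIM(concept_name) = ''""",
        "duplicate concept_code": """
            SELECT COUNT(*) FROM (
                SELECT 1 FROM concept_stage
                GROUP BY vocabulary_id, concept_code
                HAVING COUNT(*) > 1)""",
    }
    return [label for label, sql in checks.items()
            if conn.execute(sql).fetchone()[0] > 0]


def merge_stage(conn, vocabulary_id):
    """Apply the staged release to concept in a single transaction."""
    problems = stage_problems(conn)
    if problems:
        raise ValueError(f"stage rejected: {problems}")
    with conn:  # commits on success, rolls back on error
        # 1. Refresh names of concepts that are still present in the stage.
        conn.execute(f"""
            UPDATE concept
            SET concept_name = (SELECT s.concept_name FROM concept_stage AS s
                                WHERE s.vocabulary_id = concept.vocabulary_id
                                  AND s.concept_code = concept.concept_code)
            WHERE vocabulary_id = ? AND {IN_STAGE}""", (vocabulary_id,))
        # 2. Deprecate active concepts that disappeared from the source.
        conn.execute(f"""
            UPDATE concept SET invalid_reason = 'D'
            WHERE vocabulary_id = ? AND invalid_reason IS NULL
              AND NOT {IN_STAGE}""", (vocabulary_id,))
        # 3. Insert concepts that are new in this release.
        conn.execute("""
            INSERT INTO concept (concept_name, vocabulary_id, concept_code)
            SELECT s.concept_name, s.vocabulary_id, s.concept_code
            FROM concept_stage AS s
            WHERE s.vocabulary_id = ?
              AND NOT EXISTS (SELECT 1 FROM concept AS c
                              WHERE c.vocabulary_id = s.vocabulary_id
                                AND c.concept_code = s.concept_code)""",
            (vocabulary_id,))


merge_stage(conn, "DemoVocab")
rows = conn.execute(
    "SELECT concept_code, concept_name, invalid_reason "
    "FROM concept ORDER BY concept_code").fetchall()
print(rows)
```

Running the checks before the merge, and the merge inside one transaction, means a rejected or failed release leaves the published tables unchanged.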

Features

Comprehensive Multivocabulary Coverage

The solution harmonizes diverse medical vocabularies across clinical, pharmacological, laboratory, procedural, device, oncology, and genomics domains. By unifying these resources into a single system, healthcare professionals and researchers can efficiently access standardized data for drug and disease classifications and coding, genetic research, and a broad spectrum of clinical and translational applications.

Dynamic Hierarchy Exploration

The OMOP Vocabulary infrastructure allows users to navigate complex hierarchies with ease by:

  • Visualizing parent-child relationships between medical concepts
  • Exploring conceptual linkages to uncover associations and patterns
  • Enhancing data interpretation through structured categorization

This feature improves accessibility and helps researchers analyze the relationships within medical datasets more effectively.
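To make the navigation concrete, here is a minimal sketch of looking up a concept's parents, children, and lateral associations from relationship rows. It assumes an in-memory SQLite database, made-up concept IDs, and relationships stored in one direction only; the function name neighbours is hypothetical and the real infrastructure may expose this differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (concept_id INTEGER PRIMARY KEY, concept_name TEXT);
CREATE TABLE concept_relationship (
    concept_id_1    INTEGER,
    concept_id_2    INTEGER,
    relationship_id TEXT
);
-- Illustrative rows; IDs are made up. An Is a row points from child to parent.
INSERT INTO concept VALUES
    (1, 'Heart disease'),
    (2, 'Myocardial infarction'),
    (3, 'Acute myocardial infarction'),
    (4, 'Myocardium structure');
INSERT INTO concept_relationship VALUES
    (2, 1, 'Is a'),
    (3, 2, 'Is a'),
    (2, 4, 'Has finding site');
""")


def neighbours(conn, concept_id):
    """Group the direct neighbours of a concept by kind of link."""
    result = {"parents": [], "children": [], "lateral": []}
    # Outgoing links: 'Is a' leads to a parent, anything else is lateral.
    outgoing = conn.execute(
        """
        SELECT cr.relationship_id, c.concept_name
        FROM concept_relationship AS cr
        JOIN concept AS c ON c.concept_id = cr.concept_id_2
        WHERE cr.concept_id_1 = ?
        ORDER BY c.concept_name
        """,
        (concept_id,),
    )
    for relationship_id, name in outgoing:
        if relationship_id == "Is a":
            result["parents"].append(name)
        else:
            result["lateral"].append((relationship_id, name))
    # Incoming 'Is a' links come from children.
    incoming = conn.execute(
        """
        SELECT c.concept_name
        FROM concept_relationship AS cr
        JOIN concept AS c ON c.concept_id = cr.concept_id_1
        WHERE cr.concept_id_2 = ? AND cr.relationship_id = 'Is a'
        ORDER BY c.concept_name
        """,
        (concept_id,),
    )
    result["children"] = [name for (name,) in incoming]
    return result


mi = neighbours(conn, 2)
print(mi)
```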

Adaptive Scalability and Versioning

The modular framework supports continuous updates, allowing new terminologies and refinements to be integrated seamlessly. A versioning system tracks changes, ensuring full transparency and alignment with evolving healthcare standards.

Development Process

The development process focused on building a robust relational database infrastructure to harmonize diverse medical vocabularies. Each vocabulary required a customized loading process, implemented using SQL, with dedicated documentation on GitHub for transparency and reproducibility.

Vocabulary Transformation and Standardization

Each vocabulary (e.g., SNOMED CT, LOINC, ICDO3) was transformed based on its unique structure, coding conventions, and relationships. To ensure semantic consistency while preserving essential details, we implemented standardized mapping rules and algorithms to align overlapping concepts accurately across vocabularies.

Managing Hierarchical Relationships

Medical vocabularies contain complex parent-child and lateral relationships and taxonomies that required:

  • Optimized SQL procedures to efficiently process large datasets and maintain relationship integrity
  • Dynamic hierarchy management to accommodate updates without disrupting already existing structures
  • Scalable processing techniques to handle expanding datasets while ensuring fast query performance and avoiding bottlenecks
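One common way to keep hierarchy queries fast is to precompute every ancestor-descendant pair, as the OMOP concept_ancestor table does. The sketch below derives such a table from 'Is a' rows with a recursive query. It is an assumption-laden illustration rather than the project's procedure: it runs on in-memory SQLite, uses made-up IDs, assumes the hierarchy has no cycles, and omits any incremental-update logic. The function name rebuild_concept_ancestor is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_relationship (
    concept_id_1    INTEGER,
    concept_id_2    INTEGER,
    relationship_id TEXT
);
CREATE TABLE concept_ancestor (
    ancestor_concept_id      INTEGER,
    descendant_concept_id    INTEGER,
    min_levels_of_separation INTEGER,
    max_levels_of_separation INTEGER
);
-- Child to parent edges. Concept 4 reaches concept 1 directly and via 3 and 2.
INSERT INTO concept_relationship VALUES
    (2, 1, 'Is a'),
    (3, 2, 'Is a'),
    (4, 3, 'Is a'),
    (4, 1, 'Is a');
""")


def rebuild_concept_ancestor(conn):
    """Recompute all ancestor/descendant pairs from the 'Is a' edges."""
    with conn:
        conn.execute("DELETE FROM concept_ancestor")
        conn.execute("""
            WITH RECURSIVE walk(ancestor, descendant, depth) AS (
                -- Direct parents are one level away.
                SELECT concept_id_2, concept_id_1, 1
                FROM concept_relationship
                WHERE relationship_id = 'Is a'
                UNION ALL
                -- Climb one more level from the ancestor found so far.
                SELECT cr.concept_id_2, w.descendant, w.depth + 1
                FROM walk AS w
                JOIN concept_relationship AS cr
                  ON cr.concept_id_1 = w.ancestor
                 AND cr.relationship_id = 'Is a'
            )
            INSERT INTO concept_ancestor
            SELECT ancestor, descendant, MIN(depth), MAX(depth)
            FROM walk
            GROUP BY ancestor, descendant
        """)


rebuild_concept_ancestor(conn)
closure = conn.execute("""
    SELECT * FROM concept_ancestor
    ORDER BY ancestor_concept_id, descendant_concept_id
""").fetchall()
print(closure)
```

With the closure materialized, "all descendants of X" becomes a single indexed lookup instead of a recursive traversal at query time. The trade-off is that the table has to be rebuilt or patched whenever the hierarchy changes.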

Version Control and Change Tracking

We implemented a version control system to track changes, apply updates efficiently, and allow users to reference or revert to specific versions as needed. This ensured consistency across integrations and maintained compatibility with evolving standards.
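Change tracking between two vocabulary releases can be illustrated with a simple snapshot comparison. This is only a sketch of the idea, not the versioning system described above: the snapshot tables concept_v1 and concept_v2, their contents, and the function diff_releases are all hypothetical, and the example runs on in-memory SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical snapshot tables, one per vocabulary release.
CREATE TABLE concept_v1 (
    concept_code TEXT PRIMARY KEY, concept_name TEXT, invalid_reason TEXT);
CREATE TABLE concept_v2 (
    concept_code TEXT PRIMARY KEY, concept_name TEXT, invalid_reason TEXT);
INSERT INTO concept_v1 VALUES
    ('A', 'Alpha', NULL),
    ('B', 'Beta',  NULL),
    ('C', 'Gamma', NULL);
INSERT INTO concept_v2 VALUES
    ('A', 'Alpha',          NULL),
    ('B', 'Beta (renamed)', NULL),
    ('C', 'Gamma',          'D'),
    ('E', 'Epsilon',        NULL);
""")


def diff_releases(conn):
    """List (concept_code, change) pairs between release v1 and v2."""
    return conn.execute("""
        SELECT n.concept_code, 'added' AS change
        FROM concept_v2 AS n
        LEFT JOIN concept_v1 AS o ON o.concept_code = n.concept_code
        WHERE o.concept_code IS NULL
        UNION ALL
        SELECT n.concept_code, 'renamed'
        FROM concept_v2 AS n
        JOIN concept_v1 AS o ON o.concept_code = n.concept_code
        WHERE n.concept_name <> o.concept_name
        UNION ALL
        SELECT n.concept_code, 'deprecated'
        FROM concept_v2 AS n
        JOIN concept_v1 AS o ON o.concept_code = n.concept_code
        WHERE o.invalid_reason IS NULL AND n.invalid_reason IS NOT NULL
        ORDER BY 1, 2
    """).fetchall()


changes = diff_releases(conn)
print(changes)
```

A report like this gives downstream users a concrete list of concepts to review before they move an integration to a newer release.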

Technical Highlights

  • Domain Expertise: The project was enriched by insights from various medical specialties, including Internal Medicine, Pediatrics, Psychiatry, Pathology, Neurology, Intensive Care, Oncology, and Obstetrics and Gynaecology.
  • Medical Ontology Engineering: The core of the infrastructure was built upon SQL-based solutions, facilitating the integration and management of complex medical taxonomies within a relational database framework.
  • Programming Languages: The build process predominantly used PL/pgSQL.
  • Version Control and Collaboration: GitHub served as the central platform for version control and collaborative development, hosting the building processes and associated documentation for each vocabulary.

Impact

We developed a standardized medical vocabulary infrastructure aligned with OMOP CDM, enabling seamless data interoperability and supporting global research and analysis. The system now offers:

  • A unified framework for diverse coding systems that lets hospitals, research institutions, and healthcare networks in different countries integrate data consistently and interoperably in multinational healthcare and research collaborations.
  • Scalability and adaptability to incorporate new medical knowledge, ensuring long-term usability.
  • Dynamic hierarchies combined with detailed metadata for improved searchability and more precise data analysis.

This transformation allows researchers, clinicians, and policymakers to conduct evidence-based studies more efficiently, improving data-driven decision-making in global healthcare.

Hospitals and research institutions using this vocabulary infrastructure can now streamline patient data integration across different systems, improving large-scale observational studies. For example, a research team studying cardiovascular disease outcomes across multiple countries can use harmonized OMOP data with standardized vocabulary mappings and ETL transformations, which ensures interoperability and makes it possible to generate large-scale evidence.

Researchers can then define study-specific concept sets, construct cohorts using reproducible phenotype algorithms and conduct patient characterization to analyze baseline demographics and clinical features. This enables reliable cross-site comparison, improves predictive modeling for patient risk factors, and strengthens population-level effect estimation in real-world evidence studies.
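As an illustration of that workflow, the sketch below expands a concept set to all of its descendants and selects each person's earliest qualifying condition record as the cohort entry date. It is a toy example under stated assumptions: in-memory SQLite, made-up concept IDs and patients, a reduced condition_occurrence table, and a hypothetical build_cohort function. Real studies would normally rely on established cohort-definition tooling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_ancestor (
    ancestor_concept_id   INTEGER,
    descendant_concept_id INTEGER
);
CREATE TABLE condition_occurrence (
    person_id            INTEGER,
    condition_concept_id INTEGER,
    condition_start_date TEXT
);
-- Made-up IDs: 100 heart disease, 110 myocardial infarction, 111 acute MI.
-- Self rows are included so that a concept also selects itself.
INSERT INTO concept_ancestor VALUES
    (100, 100), (100, 110), (100, 111),
    (110, 110), (110, 111),
    (111, 111);
INSERT INTO condition_occurrence VALUES
    (1, 111, '2021-03-01'),
    (1, 110, '2020-01-15'),
    (2, 100, '2019-07-04'),
    (3, 999, '2022-02-02');
""")


def build_cohort(conn, concept_set):
    """Return (person_id, index_date) for people with any concept in the
    set or one of its descendants; index_date is the earliest record."""
    placeholders = ",".join("?" for _ in concept_set)
    return conn.execute(
        f"""
        SELECT co.person_id, MIN(co.condition_start_date) AS index_date
        FROM condition_occurrence AS co
        WHERE co.condition_concept_id IN (
            SELECT descendant_concept_id
            FROM concept_ancestor
            WHERE ancestor_concept_id IN ({placeholders})
        )
        GROUP BY co.person_id
        ORDER BY co.person_id
        """,
        list(concept_set),
    ).fetchall()


mi_cohort = build_cohort(conn, [110])
heart_cohort = build_cohort(conn, [100])
print(mi_cohort, heart_cohort)
```

Because every site maps its local codes to the same standard concepts, the same concept set and query can be run unchanged against each site's database, which is what makes the cross-site comparison reproducible.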

Written by Sciforce

IT company specializing in the development of software solutions based on science-driven information technologies #AI #ML #Healthcare #DataScience #DevOps