Building a Unified Healthcare Data Platform: Architecture

Over the past 4–5 years, our data platform has evolved alongside Doctolib, transitioning from a startup with a small data team to a scale-up with a team of over a hundred. This rapid growth has shaped the platform into what it is today, supporting the ever-changing needs of the business.

This blog series aims to share the journey of creating our Unified Healthcare Data Platform. In this initial post, we will:

  1. Reflect on our current data platform: Explore how it has successfully supported and enabled the business until recently.
  2. Identify the limitations of the current platform: Understand the challenges it faces in meeting our ambitious goal of moving from a reporting platform to becoming the leader in AI for healthcare.
  3. Lay the foundation of this new platform: Introduce the functional architecture and building blocks of the new platform that will serve as the cornerstone to support AI and reporting use cases.

In subsequent posts, we will dive deeper into the technical aspects of this transformation. This includes:

  • Our North Star architecture, built around the sociotechnical principles of Data Mesh
  • The technical solutions we’ve chosen
  • Other critical decisions and implementation details that will shape this new platform

From a Data platform built to support a startup

Like many startups, our data journey began with a small team tasked with addressing a wide variety of data needs, all the while operating with limited resources and short-term visibility — often planning only one month ahead.

In this context, it made sense to prioritize building the platform in the most efficient way possible: minimizing friction, avoiding unnecessary complexity, and steering clear of long implementation cycles that might take months without delivering tangible results.

Monolithic and centralized approach

In the early stages, it was only natural to adopt a monolithic and centralized approach to building the data platform, characterized by:

  • A single Git repository: Centralized for all data contributors, with a unified release cycle (once per day).
  • A single AWS account: Shared for all data operations (per environment), with a single IAM role.
  • A single data warehouse: One Redshift cluster handling all data modeling and transformations.
  • A single orchestrator: A single Airflow instance orchestrating all workflows.

This foundation supported the growth of the data teams at Doctolib for several years. Over time, the platform evolved, step by step, into its current state:

Doctolib’s Data Platform
  • Airflow on EKS: Deployed with the KubernetesExecutor for better scalability.
  • Two Redshift clusters: One for computational workloads and another dedicated to read operations.
  • A unified dbt DAG: Orchestrating the computation of all data marts across teams using dbt-core.
  • Metadata layer: Leveraging Lambda and DynamoDB to trigger systems subscribed to specific events (e.g., updates to a Redshift table); a simplified sketch follows below.
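
To make the metadata layer more concrete, here is a minimal sketch of what such an event-driven trigger could look like. It assumes a hypothetical DynamoDB table named table_subscriptions, one SNS topic per subscriber, and delivery of the table-update event through SQS; the actual implementation may differ.

```python
import json
import boto3
from boto3.dynamodb.conditions import Key

# Illustrative sketch only: table, attribute, and topic names are hypothetical.
dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")
subscriptions = dynamodb.Table("table_subscriptions")


def handler(event, context):
    """Lambda entry point: fan out a 'table updated' event to its subscribers."""
    for record in event.get("Records", []):
        payload = json.loads(record["body"])  # e.g. {"table": "analytics.appointments"}
        updated_table = payload["table"]

        # Look up every system subscribed to this Redshift table.
        response = subscriptions.query(
            KeyConditionExpression=Key("table_name").eq(updated_table)
        )

        # Notify each subscriber (e.g. a downstream refresh job) via its SNS topic.
        for item in response.get("Items", []):
            sns.publish(
                TopicArn=item["subscriber_topic_arn"],
                Message=json.dumps({"table": updated_table, "event": "updated"}),
            )
```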

Despite these iterations, the platform’s core principles remain unchanged:

A tightly coupled team where data modelers and platform owners operate as one unit, with shared permissions and a centralized, monolithic approach to platform management.

Limitations

To be candid, given our ambitions and needs until now, this centralized and monolithic approach proved to be highly effective. It allowed us to quickly establish Doctolib’s data platform and enabled fast iteration and deployment of new features — placing data at the heart of business decision-making.

For example, while it can take some companies more than a year to adopt a tool like Tableau Server on-premise, we were able to deploy the infrastructure and complete the migration in under three quarters.

However, Doctolib’s new ambitions require a shift in approach. The goal now is to give operational teams ownership of their analytical data and broader access to the platform, so that as much operational data as possible can be made available within the data platform to support not only reporting needs but also the training and knowledge-base requirements of AI tools.

This shift has highlighted several key limitations in our current approach, which we’ll explore in more detail.

Monolithic git repository

Our entire codebase for managing data modeling is consolidated in a single GitHub repository with a single daily release cycle. While this approach simplifies management in some ways, it also introduces significant challenges:

  • Complex CI pipeline: With all code in one place, theoretically every downstream and upstream dependency can be tested for any modification. This leads to overly complex CI pipeline logic, with test runs taking 30–40 minutes to complete, delaying the development cycle.
  • Redundant pre-production testing: Having full access to the entire codebase drives the need for heavy pre-production testing. For example, our process involves running the full set of models on a production-like dataset to ensure data availability by 8 AM the next day. This results in duplication — pre-production runs and production runs perform identical operations, instead of using pre-production run results to incrementally reduce the production workload and SLA pressures.

This centralized structure also becomes problematic when syncing with the operational world, where release cycles differ. For instance:

  • Changes to a PostgreSQL table will not be reflected in the analytics CI, leaving downstream impacts undetected.
  • The absence of a decentralized paradigm prevents the introduction of concepts like data contracts, which could bridge this gap and support more robust workflows (see the sketch below).
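
To illustrate, a data contract can be as simple as a declared schema that the analytics side validates against what the operational side actually publishes. The sketch below is purely illustrative and uses hypothetical table and column names; it is not a contract format we have adopted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    name: str
    dtype: str
    nullable: bool = False


# Hypothetical contract for an operational PostgreSQL table consumed by analytics.
APPOINTMENTS_CONTRACT = [
    ColumnSpec("id", "bigint"),
    ColumnSpec("patient_id", "bigint"),
    ColumnSpec("starts_at", "timestamp"),
    ColumnSpec("status", "text", nullable=True),
]


def validate_contract(observed_schema: dict[str, str]) -> list[str]:
    """Return the list of violations between the published schema and the contract."""
    violations = []
    for column in APPOINTMENTS_CONTRACT:
        observed_type = observed_schema.get(column.name)
        if observed_type is None:
            violations.append(f"missing column: {column.name}")
        elif observed_type != column.dtype:
            violations.append(
                f"type drift on {column.name}: expected {column.dtype}, got {observed_type}"
            )
    return violations


# A producer-side CI step could fail the build whenever violations are found.
if __name__ == "__main__":
    published = {"id": "bigint", "patient_id": "bigint", "starts_at": "timestamptz"}
    print(validate_contract(published))
```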

Additionally, this monolithic approach slows down innovation by imposing stringent production requirements on all systems, regardless of criticality. Since some systems are critical, the platform is designed around the worst-case scenario for everyone, creating inefficiencies. This often necessitates workaround solutions, like hotfixes, to address issues that arise from the rigid release process.

Ultimately, this setup limits flexibility and adaptability, highlighting the need for a paradigm shift to support scalability and agility.

Centralized orchestrator

While Airflow has been a solid choice for our current needs, its architecture and limitations make it challenging to scale for future ambitions.

Event-Driven Runs
Airflow relies on scheduled runs, complicating event-driven workflows. For example:

  • When source data is republished, workflows must be manually rerun.
  • Triggering workflows based on external events, like Tableau datasource refreshes, is cumbersome due to the DAG scheduling model.

Python Package and Code Dependencies

  • Shared dependencies across all workflows mean a single update can break multiple teams’ processes.
  • Insufficient test coverage exacerbates risks, making upgrades a challenge.

Dynamic Workflow Limitations

  • Airflow struggles with dynamic configurations, such as workflows depending on the nature or content of data.
  • For use cases like A/B testing or Tableau orchestration, this limitation hampers flexibility.

AWS Permissions Handling

All DAGs share the same IAM role, creating potential security vulnerabilities. Even if segregation is set up between tasks and the roles they use, this association is managed at the code level rather than the infrastructure level, so it cannot be considered a secure solution.
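
As an illustration of why code-level segregation is insufficient, the snippet below shows the kind of pattern involved: a task assumes a more privileged role through STS from within the DAG code. Because every DAG runs under the same execution role, nothing at the infrastructure level prevents another DAG from copying these lines. The role and bucket names are hypothetical.

```python
import boto3


def read_sensitive_data(**context):
    # Any task in any DAG could include these same lines, because the
    # permission to assume the role belongs to the shared execution role,
    # not to this specific task. Segregation exists only by convention.
    sts = boto3.client("sts")
    credentials = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/sensitive-data-reader",  # hypothetical role
        RoleSessionName="airflow-task",
    )["Credentials"]

    s3 = boto3.client(
        "s3",
        aws_access_key_id=credentials["AccessKeyId"],
        aws_secret_access_key=credentials["SecretAccessKey"],
        aws_session_token=credentials["SessionToken"],
    )
    return s3.list_objects_v2(Bucket="sensitive-bucket")  # hypothetical bucket
```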

Testing and Upgrading Challenges

  • Testing orchestration logic and triggering conditions is cumbersome.
  • Upgrading Airflow is complex, requiring alignment of all DAGs and associated dependencies, making scalability even harder.

Monolithic data warehouse

The same challenges extend to other tools in our stack, particularly our monolithic data warehouse. Key issues include:

  • Permissions: All users have administrative rights, making it impossible to enforce fine-grained access control. This prevents the platform from supporting use cases involving sensitive data.
  • Resource Allocation: Managing cluster resources for multi-tenancy and workload prioritization remains a significant challenge, often leading to inefficiencies and bottlenecks.
  • And More: The centralized approach inherently limits flexibility and scalability, creating similar constraints across other components of the stack.

A shift toward a more modular and decentralized architecture is essential to address these systemic issues and support the platform’s growth and broader use cases.

To a Unified Healthcare Data Platform built to support the growth of a scale-up

To address the limitations of our current platform and meet Doctolib’s evolving needs, we are embarking on a complete rebuild of the data platform. This new platform will focus on:

  1. Leveraging Data for AI and Analytics: Ensuring even the most sensitive data can be utilized securely and in a governed manner to drive AI development and advanced analytics.
  2. Enhancing Engineering Experience: Providing a streamlined, efficient, and user-friendly environment for our engineering teams to accelerate development and innovation.
  3. Addressing Technical and Security Debt: Resolving existing issues to create a more robust, secure, and scalable foundation for the platform.
  4. Optimizing Costs: Reducing infrastructure and operational costs while enabling the platform to scale with increasing data volumes without proportionally increasing team size.
  5. Serving as a Technical Enabler for Products: Acting as a foundational layer to support product innovation and operational efficiency across the organization.

This strategic overhaul will position Doctolib to better harness its data, empowering teams to meet current and future challenges effectively.

Unified Healthcare Data Platform — Functional Architecture

Platform Landing Zone

A modular, scalable foundation enabling organizations to adopt a cloud provider for operational and analytical needs.

Enterprise Networking
Implements a hub-and-spoke topology for secure, scalable global communication, granting platform teams autonomy in managing environments while ensuring secure cross-platform interaction. Tag management enhances organization, policy enforcement, and cost tracking without reducing team autonomy.

Resource Hierarchy Management
Structures cloud resources to simplify management, enhance security, and improve operational efficiency with centralized control.

Automation
Streamlines application delivery and infrastructure management with workflows and infrastructure-as-code (IaC) for consistent, reliable releases:

  • CI/CD Automation: Automates testing, deployment, and pipeline management.
  • Infrastructure Provisioning: Ensures scalable, automated resource deployment.

Version Control
Tracks changes in code and configurations, enabling collaboration and rollback.

Platform Monitoring & Observability
Centralizes logs, metrics, and traces to monitor health and performance, including:

  • Incident Management: Standardizes incident handling to reduce disruptions and document incident investigations.
  • Events & Metrics: Tracks trends and optimizes performance.
  • Logs: Collects and analyzes logs for troubleshooting and audit.

Security
Protects the platform via backups, access controls, encryption, and real-time monitoring for vulnerabilities and threats.

Cloud Identity
Centralized authentication and access policy management for secure resource interaction.

Unified Healthcare Data Platform Foundations

A shared foundation across platform teams ensures accessibility, interoperability, security, and scalability for analytical platforms and data assets.

Infrastructure Component Libraries
Standardized libraries enable secure, compliant deployment of platform components, such as Kubernetes clusters.

Networking
Grants platform teams autonomy over networking while maintaining secure cross-platform communication:

  • Spoke Shared VPC: Securely shares a central VPC across all analytical platforms, enabling resource sharing with individual management.
  • Hub Integration: Centralized routing, connectivity, and security between spokes.
  • Firewall Rules: Enforces security policies for precise access control across resources.
  • Cloudflare Integration: Enhances security and performance with DDoS protection, WAF, CDN, and DNS optimization.

Resource Hierarchy Management
Structures cloud resources for scalable governance within the platform.

Lakehouse

Combines the scalability of a data lake with the governance and optimization of a data warehouse, supporting real-time analytics, machine learning, and advanced processing.

Datalake
Centralized, scalable storage for diverse data, supporting schema-on-read for flexibility:

  • Structured Data: Preserves schema for efficient querying and analysis.
  • Unstructured Data: Stores raw data like logs, images, and videos, enabling scalable, flexible access.

Datawarehouse
Centralizes structured data for optimized querying, reporting, and business intelligence.

Federated Queries Engine
Provides a unified, secure interface to query multi-format and multimodal data for analytics and AI applications.

Healthcare Ontology / Polysemes
Manages and integrates healthcare data using standards like HL7, FHIR, OMOP and DICOM, ensuring compliance while enabling advanced analytics and machine learning.

ML Storage

Vector Database
Optimized for storing, indexing, and searching high-dimensional vector data, enabling efficient similarity searches for AI applications.
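
As a toy illustration of the capability such a store provides, the snippet below performs a brute-force cosine-similarity search over embeddings with NumPy; a real vector database adds indexing (e.g., HNSW) to make this efficient at scale. The vectors are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Placeholder embeddings: 10,000 documents in a 384-dimensional space.
document_vectors = rng.normal(size=(10_000, 384)).astype(np.float32)
query_vector = rng.normal(size=384).astype(np.float32)


def top_k_similar(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k most cosine-similar vectors to the query."""
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    similarities = corpus_norm @ query_norm
    return np.argsort(similarities)[::-1][:k]


print(top_k_similar(query_vector, document_vectors))
```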

Data Ingestion

Ingestion Endpoints
Endpoints for receiving data from different sources, supporting multiple formats and protocols for seamless flow.

Change Data Capture
Tracks database changes in real time to provide an explicit view of a given dataset over time.

Self-Service Ingestion Engine
Empowers teams to ingest data independently with pre-built connectors, validation, and transformation for analytics readiness.

Data Sharing

Analytics Hub
Centralized platform for securely sharing curated datasets, metrics, and insights, fostering collaboration and governance.

Reverse ETL
Syncs enriched Lakehouse data into operational systems (e.g., CRM, marketing platforms), embedding insights into business workflows.

Transformation Framework

Large-Scale Data Processing Framework
Supports distributed batch and stream processing for massive datasets with scalable, fault-tolerant performance.

SQL Framework
Enables consistent, automated, and scalable data transformations in the warehouse using SQL.

DevX Tools
Enhances SQL workflows with code editor integrations, documentation generators, and performance optimizers.

Data Mesh Support
Provides infrastructure and tooling for decentralized data ownership, scalability, and autonomy across domains:

  • Model Versioning: Tracks and manages multiple data model versions, ensuring data contracts and rollback capabilities.
  • Cross-Project Dependencies: Manages interdependencies across datasets, models, and transformations for consistent analytics workflows.

DataShield Transformer
Enforces security measures such as encryption and pseudonymization, making it easier for data product developers to comply with legal and regulatory standards during transformations.

Data Product Orchestrator

Orchestrator
Automates and manages data workflows, ensuring reliable execution, scheduling, error recovery, and dependency coordination.

Metadata Layer
Centralized repository providing insights into data state, quality, and lineage:

  • Asset Checks: Automates validations to detect anomalies and ensure data quality before downstream usage (see the sketch after this list).
  • Asset Observability: Offers real-time metrics for monitoring the health and performance of data assets.
  • Asset Lineage: Maps transformations and dependencies across assets for transparency and tracking.
  • Asset Orchestration: Ensures reliable materialization by coordinating execution, scheduling, and SLA compliance.
  • Asset Versioning: Tracks and manages data asset versions for reproducibility and traceability.
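
For example, an asset check can be a small validation that runs right after an asset materializes and blocks downstream consumers when it fails. The sketch below is orchestrator-agnostic and uses hypothetical table names and thresholds; most modern orchestrators offer a native construct for this.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class AssetCheckResult:
    asset: str
    passed: bool
    detail: str


def check_freshness(asset: str, last_loaded_at: datetime, max_lag: timedelta) -> AssetCheckResult:
    """Fail the check if the asset has not been refreshed within the allowed lag."""
    lag = datetime.now(timezone.utc) - last_loaded_at
    return AssetCheckResult(
        asset=asset,
        passed=lag <= max_lag,
        detail=f"last load {lag} ago (max allowed {max_lag})",
    )


def check_row_count(asset: str, row_count: int, minimum: int) -> AssetCheckResult:
    """Fail the check if the asset is suspiciously small, e.g. after a partial load."""
    return AssetCheckResult(
        asset=asset,
        passed=row_count >= minimum,
        detail=f"{row_count} rows (minimum {minimum})",
    )


# The orchestrator would run these after materializing the asset and only
# trigger downstream assets when every check passes.
results = [
    check_freshness(
        "marts.appointments_daily",
        last_loaded_at=datetime.now(timezone.utc) - timedelta(hours=2),
        max_lag=timedelta(hours=6),
    ),
    check_row_count("marts.appointments_daily", row_count=1_250_000, minimum=1_000_000),
]
assert all(r.passed for r in results), [r.detail for r in results if not r.passed]
```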

ML Training Platform

Model Registry
Centralized management of machine learning model lifecycles, ensuring governance, traceability, and streamlined deployment.

Feature Store
Centralized repository for managing, storing, and serving features used in machine learning models.

LLMOps tooling
Provides the infrastructure, workflows, and management capabilities necessary to operationalize large language models (LLMs) in production. This includes tools for model fine-tuning, deployment, monitoring, versioning, prompt optimization, and cost management.

Annotation Platform
Enables collaborative labeling and annotation for diverse data formats, with workflow management and quality control for ML projects.

ML Experiment Tracking
Helps data scientists and ML engineers log, organize, and compare experiments by recording metadata such as hyperparameters, model architectures, datasets, evaluation metrics, and results.

Training Workstation
A high-performance computing setup designed for training machine learning models.

Inference Platform

LLM Providers
Offers APIs and tools for large language models (LLMs) to enhance text generation, summarization, translation, and language understanding.

Model Monitoring
Tracks production ML models’ performance, monitoring accuracy, drift, and resource usage, with alerts for anomalies.

Model Serving
Deploys ML models for real-time predictions, managing scaling, versioning, and API endpoints.

Model Inference Engine
Facilitates efficient inference by optimizing the execution of models across multiple hardware backends, such as GPUs, CPUs, and specialized accelerators.

Model As a Service
Catalog of pre-trained machine learning models exposed as APIs or endpoints.

Data Governance

Access Control
Manages secure and compliant data access with defined policies and monitoring:

  • Identity and Access Management (IAM): Centralized system for managing roles and permissions.
  • Audit Logs: Tracks user activities and data interactions for compliance.
  • Column-Level Access: Restricts specific columns to protect sensitive data (e.g., PII).
  • Row-Level Access: Controls access to specific rows based on user attributes.
  • Perimeters: Isolates resources at the network level to prevent unauthorized data transfers.

Data Stewardship
Protects sensitive data and ensures compliance across its lifecycle:

  • Data Masking: Obscures sensitive data while maintaining usability.
  • Encryption: Secures data at rest and in transit with encryption standards.
  • Metadata Management: Categorizes and tags data assets for better discovery and governance.

Data Quality
Maintains data accuracy, completeness, and consistency:

  • Data Profile Scan: Analyzes data patterns, distributions, and anomalies.
  • Data Lineage: Tracks data flow and transformations for transparency.
  • Data Quality Scans: Validates data against rules to ensure it meets standards.

Data Discovery
Enhances data accessibility with organization and search tools:

  • Data Catalog: Centralized metadata repository for easy discovery and documentation.
  • Data Portal: Self-service interface for exploring and accessing data resources.

Data Exploration and Reporting

Workspace Environment
Secure and collaborative space for developing and sharing data insights:

  • Notebook Service: Managed cloud-based environment for data analysis, modeling, and source integration.
  • Low-Trust Environment: Secure solution for collaborating on sensitive data, ensuring privacy and compliance.

Reporting
Tools for creating, managing, and distributing insights to drive data-informed decisions:

  • Self-Serve Dashboards: User-friendly platform for ad-hoc exploration and customizable dashboards.
  • Industrialized Dashboards: Centrally managed dashboards aligned with KPIs for consistent insights at scale.
  • GenAI Assistant: Conversational AI tool enabling natural language data exploration for non-technical users.
  • Tracking: Monitors user interactions and engagement to provide actionable behavior and performance insights.

Unified Healthcare Data Platform Interfaces

Data Domain Controller
Automates provisioning of infrastructure (storage, processing, and access controls) to operationalize data domains and support the data mesh strategy.

Technical Team Controller
Manages team resources, including infrastructure, access, and tools, aligning with roles and project requirements.

Data Product Controller
Oversees the lifecycle of data products, ensuring their creation, deployment, and maintenance meet quality and compliance standards.

Data Contract Controller
Enforces governance, quality, and compliance across domains to ensure secure and consistent data sharing while safeguarding privacy.

Data Contract Request Access Controller
Automates access requests and subscriptions to data products, ensuring compliance with predefined governance terms.

Team ownership

On this new platform, we have paid special attention to defining the scope of each team. The goal is to avoid scope overlaps, which have occurred in the past following team splits implemented to support team growth over time. For example:

  • Two teams managing different instances of Redshift (one for compute and another for reporting), leading to feature drift between the two clusters.
  • A shared Kubernetes cluster without clear ownership regarding its lifecycle management and the inherent multi-tenancy.

We therefore have four teams within our “One Team: Data and Machine Learning Platform”:

  • Data Engineering Platform: Responsible for establishing the infrastructure foundations on which our platform relies (network, shared standard packages, integration with Doctolib’s cross-platform services) and the foundations for modeling data (lakehouse, orchestrator, transformation framework, data product interface, etc.).
  • Data Ingestion & Output: Responsible for managing the input and output of the data platform, including integration with the operational world and providing self-service tools that enable data owners to handle integration with external partners.
  • Data Tools: Responsible for providing all tools and frameworks that make data discoverable, accessible, and usable — whether for development needs or reporting through dashboards — in a secure and trustworthy manner.
  • ML Platform: Responsible for implementing all platform components that allow data scientists and ML engineers to explore, train, deploy, and serve models that can be integrated into Doctolib’s products at a production-grade level.

Note: The team names are not final and may differ from their actual names in the future.

With this structure, we ensure a clear definition of each team’s responsibilities and, most importantly, limit interdependencies (and therefore interactions) to the strict boundaries of each scope (and its associated components and services). These boundaries can be formalized through interface contracts. For example:

  • Between Data Engineering Platform (DEP) and Data Tools: DEP implements the Lakehouse, while Data Tools manages access to it by calling the Lakehouse API, which serves as the contract defining the interaction.
  • Between Data Tools and Data Governance Team: The Data Governance team defines the information that should appear in the Data Contract, and Data Tools implements the interface and translates it into the Data Catalog.
  • etc.

Next part of the series

In this article, we’ve explored the building blocks we plan to use to construct our new Healthcare Data Platform and the functional solutions they provide. Notably, this architecture is not specific to Doctolib — it could be adapted and reused by any organization!

In the next two posts, we will dive deeper into the implementation decisions made to address Doctolib’s specific challenges, focusing on two key architectures:

North Star Architecture

  • Demonstrates how these building blocks come together through the Data Mesh approach we aim to adopt.
  • Highlights the data flow requirements and the overarching vision of our platform.

Technical Architecture

  • Details the technical choices (or aspirations) made to address both the functional requirements and the needs of a Data Mesh.
  • Explains the tools and solutions enabling this vision.

This series reflects a significant shift in our methodology. Instead of tackling problems by starting with technical tools, as we or others have done in the past, we begin with functional needs. The process is structured as follows:

  1. The functional architecture identifies the required building blocks.
  2. The North Star Architecture outlines the data flow requirements (readable schema link).
  3. The technical architecture defines the technical solutions to make it all possible.

By prioritizing needs over tools, this approach ensures that our platform is purpose-built to meet Doctolib’s present and future goals.

Written by Alexandre Guitton
Engineering Manager - Data Engineering Platform @Doctolib