Lessons Learned from a Large-scale Data Mesh

Dan Sullivan
4 Mile Analytics
Aug 23, 2022

Data silos hinder analysis, and one way to mitigate their effects is to use a data mesh. The term data mesh is fairly new, but its practices have been in use for decades in bioinformatics, which is essentially the data science of genetics and molecular biology. Looking into bioinformatics data practices, we can find some important lessons and begin to incorporate them into our own data mesh practices.

Failure to Scale

The concept of a data mesh was described by Zhamak Dehghani in a 2019 article called “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh.” The author notes that in spite of many advances in data management technology, we are still failing to scale along several key dimensions, including:

“changes in the data landscape, proliferation of sources of data, diversity of data use cases and users, and speed of response to change.”

To address these unmet challenges, Dehghani proposes four principles:

  • Domain-oriented decentralized data ownership and architecture.
    Data mesh is based on decentralization and distribution of responsibility. This immediately raises the question of how to decide on domains. In a business setting, we can use organizational structures as a guide to create bounded contexts.
  • Data as a Product.
    Data as a product is meant to address problems with discovering, understanding, trusting, and using data. Collecting and storing data is of little value if it is not used, and treating data as a product makes it readily available.
  • Self-serve Data Infrastructure as a Platform.
    Self-serve data infrastructure is meant to enable domain autonomy: “tooling that supports a domain data product developer’s workflow” without requiring detailed knowledge of how to configure, deploy, and monitor compute, storage, and networking resources.
  • Federated Computational Governance.
    Federated computational governance entails a governance model that promotes decentralization and interoperability through standardization. This is especially challenging because this is less of a technology problem and more of an organization, people, and process issue.
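
The four principles above can be made concrete in a product descriptor that each domain team publishes. The sketch below is hypothetical; the field names, endpoint URL, and schema label are illustrative, not drawn from any real data mesh platform.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a minimal descriptor a domain team might publish
# for each data product. It makes domain ownership, addressability,
# and the governance standard the product conforms to explicit.
@dataclass
class DataProduct:
    name: str                 # discoverable product identifier
    domain: str               # owning domain (decentralized ownership)
    owner: str                # accountable team or contact
    endpoint: str             # addressable access point (self-serve)
    schema_version: str       # shared standard (federated governance)
    tags: list = field(default_factory=list)

# An illustrative registration, loosely modeled on a genomics product
genbank_like = DataProduct(
    name="nucleotide-sequences",
    domain="genomics",
    owner="sequence-curation-team",
    endpoint="https://example.org/api/v1/sequences",
    schema_version="feature-table-1.0",
    tags=["dna", "public"],
)
print(genbank_like.domain)  # genomics
```

A catalog of such descriptors is what lets consumers discover products without talking to each owning team first.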

The principles of data mesh have been applied in bioinformatics for many years and at a global scale. This makes the field an ideal resource for learning how to implement data mesh architectures in real-world situations.

Scaling Genetics Research

Bioinformatics is a trans-disciplinary subject area and includes life sciences, particularly biology and biochemistry, computer science, and statistics. The data of bioinformatics represents biological structures and processes at widely varying levels of scale, from molecules and reactions to organisms and communities of organisms. When we talk about bounded domains in bioinformatics, there are several major domains and within those domains are subdomains that represent subsets of knowledge and data within the major domain.

  • Genomics is an area of molecular biology that focuses on the structure and function of genes and genomes; much of the data in genomics describes DNA sequences and other components of genomes.
  • Transcriptomics is another area of molecular biology; it elucidates how the information stored in genes is transcribed into RNA and how the set of genes transcribed changes under different conditions.
  • Proteomics is the study of the structure and function of proteins.
  • Metabolomics studies how the interactions of proteins and other biomolecules drive biological processes, as reflected in the metabolites those processes produce.
  • Biochemistry is a foundational discipline that underlies these other domains.

The advent of bioinformatics data sources dates back at least 50 years, to the creation of the Protein Data Bank at Brookhaven National Laboratory in the United States in 1971. GenBank, a repository of genetic and genomic data, became public in 1982, and in the 1990s the International Nucleotide Sequence Database Collaboration (INSDC) formed as a partnership among the DDBJ in Japan, EMBL-EBI (the European Molecular Biology Laboratory’s European Bioinformatics Institute), and NCBI (the National Center for Biotechnology Information in the US), a global-scale collaboration for sharing genomic data. As sequencing technologies advanced, more organisms were sequenced and studied. This led to further specialization, for example, with the development of bioinformatics resource centers focused on infectious diseases. One of the benefits of these long-standing data resources was the ability to respond to the COVID-19 pandemic with large-scale monitoring and drug discovery.

GenBank is a repository for data about genes and genomes, and it has adopted specialized file formats and metadata specifications for the data it collects. The file formats and metadata are sufficient to represent data and metadata about genes and genomes across species, from viruses and bacteria to humans and other vertebrates. GenBank also links to publications on genes and genomes, which are housed in another component of the bioinformatics data mesh known as PubMed.
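
GenBank’s flat-file format is a good example of a standardized, parseable data product interface: top-level keywords occupy the first twelve columns, with continuation lines indented beneath them. The sketch below parses a shortened fragment of the (real) SARS-CoV-2 reference record; real records carry many more fields, and production work would use a library such as Biopython rather than this minimal parser.

```python
# A shortened, illustrative fragment of a GenBank flat-file record
record = """\
LOCUS       NC_045512            29903 bp ss-RNA     linear   VRL 18-JUL-2020
DEFINITION  Severe acute respiratory syndrome coronavirus 2 isolate
            Wuhan-Hu-1, complete genome.
ACCESSION   NC_045512
VERSION     NC_045512.2
"""

def parse_header(text: str) -> dict:
    """Collect top-level keywords; columns 1-12 hold the keyword,
    and lines indented 12 spaces continue the previous field."""
    fields = {}
    key = None
    for line in text.splitlines():
        keyword, body = line[:12].strip(), line[12:].strip()
        if keyword:            # a new top-level keyword starts here
            key = keyword
            fields[key] = body
        elif key:              # continuation of the previous keyword
            fields[key] += " " + body
    return fields

header = parse_header(record)
print(header["ACCESSION"])  # NC_045512
```

The fixed-column convention means every consumer, in any language, can extract the same metadata without coordinating with GenBank’s maintainers.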

The Protein Data Bank focuses on protein sequences and structures as well as groups of closely related proteins. Here again, the data product is discoverable, addressable, and interoperable. For example, a user can navigate from the PDB to UniProt, another protein database hosted by EMBL-EBI, and from UniProt to GenBank to get data on the gene that encodes a particular protein.
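
This kind of navigation works because each record carries typed cross-references into other domains’ data products. The sketch below models that with an in-memory link table; the accessions are real SARS-CoV-2 spike protein entries, but in practice the links would be resolved through each resource’s public API rather than a dict.

```python
# Typed cross-references between data products, keyed by (database, accession).
# 6VXX is a PDB structure of the SARS-CoV-2 spike; P0DTC2 is its UniProt entry.
cross_refs = {
    ("PDB", "6VXX"): [("UniProt", "P0DTC2")],
    ("UniProt", "P0DTC2"): [("GenBank", "NC_045512")],
}

def follow(db: str, accession: str, target_db: str):
    """Breadth-first walk of cross-references until target_db is reached."""
    frontier = [(db, accession)]
    seen = set()
    while frontier:
        node = frontier.pop(0)
        if node in seen:
            continue
        seen.add(node)
        if node[0] == target_db:
            return node
        frontier.extend(cross_refs.get(node, []))
    return None

print(follow("PDB", "6VXX", "GenBank"))  # ('GenBank', 'NC_045512')
```

Because identifiers are stable and globally agreed, a structure in one product can be traced back to the gene that encodes it in another, with no central coordinator.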

Following the data mesh principles enables analysis from multiple perspectives.

Genes encode proteins and proteins interact with other proteins. Those interactions are captured in pathways, some of which are quite complex. The KEGG database is a bioinformatics data resource that specializes in representing and cataloging these pathways.

We can see in these examples how the four principles of data mesh were factors in the development of bioinformatics data resources.

  • Decentralized and distributed — there is no single controlling organization and data resources were developed around particular domains, such as genomics.
  • Data products — the bioinformatics resources provide data products, with access to multiple types of data and offer levels of service that allow researchers to build on the resources with confidence about the reliability and trustworthiness of the data.
  • Self-serve — data is publicly available, and many of the resources also provide analysis pipelines and related functionality for garnering insights from the data.
  • Federated governance — federated governance is clearest at the global level, where the INSDC ensures data is synchronized across all three major genomic data stores.

The lessons learned in bioinformatics are applicable to data science in general.

Lessons Learned from Global-Scale Data Sharing

Data products will vary in scale and level of specialization, so we should expect to see the evolution of hubs and satellite data products. Hubs tend to provide widely used data, while satellites are more specialized and build on data provided by hubs. For example, the infectious disease resource centers use data from GenBank, UniProt, and other large-scale data services.

Hubs, or major data products, will have metadata, but that alone may not be sufficient for more specialized use cases. For example, standard genome metadata is not sufficient for describing pathogen genomes. The pathogen may have been isolated from a host (a human, animal, or plant) with a particular disease state, and the specimen may be stored in a bio-repository. Expect domain experts to augment metadata standards to accommodate these additional features.
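
One way a satellite domain can augment a hub’s standard without breaking it is schema extension. The sketch below is illustrative; the field names are hypothetical and not drawn from any published metadata standard.

```python
from dataclasses import dataclass
from typing import Optional

# Baseline metadata standard as published by the hub (illustrative fields)
@dataclass
class GenomeMetadata:
    accession: str
    organism: str
    sequence_length: int

# A satellite's extension for pathogen genomes: host, disease state,
# and the bio-repository holding the specimen. Hypothetical field names.
@dataclass
class PathogenGenomeMetadata(GenomeMetadata):
    host: Optional[str] = None              # human, animal, or plant
    disease_state: Optional[str] = None     # host condition at isolation
    biorepository_id: Optional[str] = None  # where the specimen is stored

m = PathogenGenomeMetadata(
    accession="NC_045512", organism="SARS-CoV-2",
    sequence_length=29903, host="human", disease_state="COVID-19",
)
print(m.host)  # human
```

Consumers that only understand the hub standard can still read the base fields, while pathogen specialists get the extra context they need.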

Data standardization is needed early. This includes policies on what kind of data is provided, what formats are used, access controls, and any restrictions on volume of data accessed. These are both technical standards and “rules of the road” for users.

Bounded domains may have some conceptual overlap. Where they do, they should use the same terminology, and the definitions of terms should be clear and consistent. This can require the development of taxonomies and ontologies that specify relationships between terms. These can also serve as the source of allowed terms used in metadata descriptions.
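
In the simplest case, an agreed vocabulary can be enforced mechanically at submission time. The sketch below, with made-up terms, validates metadata values against a flat controlled vocabulary; real projects use formal ontologies (such as the Gene Ontology) with hierarchical relationships rather than a dict of sets.

```python
# Controlled vocabulary shared across domains (terms are illustrative)
taxonomy = {
    "host": {"human", "animal", "plant"},
    "molecule_type": {"dna", "rna", "protein"},
}

def validate(field: str, value: str) -> bool:
    """Accept only terms the domains have agreed on for this field."""
    return value.lower() in taxonomy.get(field, set())

print(validate("host", "Human"))   # True
print(validate("host", "fungus"))  # False: not yet an allowed term
```

Encoding the vocabulary this way turns a slow human consensus into a fast, automatable check — the “computational” part of federated computational governance.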

Service level agreements let data consumers know what they can expect from a service, including how frequently data is updated, how much data may be retrieved, and any restrictions on data use. In the case of GenBank, any submitted data about human genomes must follow rules that ensure a person’s privacy is not violated.

How do you know what data is available in a data mesh? In the case of bioinformatics, the answer rests with a common practice of publishing information about data resources in an annual journal issue dedicated to bioinformatics databases. This is a de facto catalog of data products in bioinformatics. Of course, there are serious drawbacks to using journal papers to enable discoverability. Metadata catalogs that automatically collect at least the technical metadata about data services are a better alternative.
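
Automatic collection of technical metadata can be as simple as introspecting each product’s published files. The sketch below is hypothetical: it harvests column names and a row count from a small delimited sample, the kind of entry a catalog could refresh on a schedule instead of waiting for an annual journal issue.

```python
import csv
import io

# Stand-in for a data product's published file (contents are illustrative)
sample = io.StringIO(
    "accession,organism,length\n"
    "NC_045512,SARS-CoV-2,29903\n"
)

# Harvest technical metadata: schema (column names) and size (row count)
reader = csv.DictReader(sample)
rows = list(reader)
catalog_entry = {
    "columns": reader.fieldnames,
    "row_count": len(rows),
}
print(catalog_entry)
```

Even this minimal technical metadata — what columns exist and how big the data is — answers the first questions a would-be consumer asks of a catalog.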

Perhaps the most difficult aspect of implementing a data mesh architecture is the part that requires human consensus, like developing data standards and taxonomies. We need to plan for extended periods of time for developing consensus around standards.

What’s Old is New Again

The principles of data mesh architecture have been in use for decades, and the results have been exceptionally positive. Much more is known about chronic and infectious diseases than would have been possible without genomic and proteomic data. But it isn’t just the data; it is how the data is organized and made accessible that makes it useful. It is hard to imagine the rapid development of COVID-19 vaccines or the tracking of variants without the data infrastructure that has been developed over the past decades. Bioinformatics demonstrates that applying the principles of data mesh can significantly contribute to the effectiveness of data science.


Dan Sullivan is a Principal Data Architect with 4 Mile Analytics. He focuses on data architecture, data modeling, machine learning, and analytics.