The Right Data for Good Results: Introducing the 5 ‘V’s of Drug Discovery Data

Leo Wossnig
25 min read · Jul 27, 2023


Co-authored with Winston Haynes.

In part 1 of this series we outlined the huge variety of machine learning methods that are applied to drug discovery and that there is no single approach that can solve all challenges. Different drug discovery stages and different tasks require different tools — there is no ‘one size fits all’.

While machine learning holds immense potential in drug discovery and development, the current focus needs to shift from creating more models to acquiring ‘good’ high-quality data that will optimise the functionality of these models. This is an opinion shared by many in the field (e.g. here by Pat Walters, or here by Jacob Oppenheim, or here by Benchling). Note that this is fundamentally different to large language models, where we hope that the average information will get it right, and so we rely on large sets of data available on the internet. We believe that specialised systems will prevail in drug discovery for a long time, since the amount of data that can be produced relative to the space of candidates is minuscule. General models are unlikely to suffice for later stage programs (lead optimisation onwards), or even for early stage programs involving entirely new candidates and targets.

In this post we will first define what good data looks like based on what we call the 5 ‘V’s of drug discovery data. We will then discuss what a tech stack looks like that can generate such data, and the layers that affect the 5 ‘V’s. Since tech is only as good as the teams that leverage it, we go beyond the data and the technology to discuss what mindsets, teams, and culture organisations need to create to enable the real impact of these technologies.

What does ‘good’ data look like?

‘Good’ data has two key requirements: The relevance of the data, or how well it translates into clinical readouts, and the quality of the data. The latter is a broad term but in reality the devil is in the detail — in the less desirable work of data collection, storage, protocols, reproducibility, and standardisation. This is often the work that no-one really wants to do and, due to a lack of incentive, few people write papers about this. But if this work is not done well, at best the machine learning models won’t work particularly well, and at worst there can be a complete loss of data integrity which can lead to fatal outcomes [ref].

Exploratory data analysis to build a basic understanding of the data and rigorous model validation are crucial in machine learning, biostatistics, and bioinformatics modelling. These steps strongly influence the predictive accuracy of any model and can counteract data inconsistency or relevance issues, making proper statistical analysis and model validation a discussion worthy of a separate blog post.

While improper statistical analysis also impacts the accuracy or reliability of in silico approaches, inaccurate chemical and biological data remain the central issue in data-driven methods for drug discovery (e.g. here or here). There is a long list of articles that discuss how noise can limit the performance of predictive models (see here, here and here), or more broadly how data quality impacts predictive models (e.g. here, here, here in context of QSAR models). Reducing noise and improving consistency in the data generation process is hence a key aspect of generating high quality data.

A decade ago, Fourches et al. outlined important steps for data curation for QSAR modelling (part 1 and part 2). However, with the growth of chemical, genomic, and protein datasets, some steps from the original study, like manual data curation, have become impractical. Further challenges arise because many essential aspects of data preparation and modelling aren’t available in a single standalone program nor are they standardised across the industry. Underlying processes and workflows are standardised to a lesser degree, imposing more fundamental limits. Building data-processing pipelines that allow for the consistent curation and standardisation of collected data is another essential part of generating good quality data.

In order to achieve consistency and minimise noise in the data and processing, we need to understand the entire stack that is required for machine learning-based predictions. We will come back to this later on, but first we will clearly define the right properties of high quality data. Next, we develop a framework to assess the quality of the data itself and introduce 5 key properties to evaluate.

Introducing the 5 ‘V’s of machine learning-grade data applied to the drug discovery tech stack

In this article we repurpose the 5 ‘V’s of data science and derive 5 ‘V’s of machine learning-grade data in drug discovery. These are properties to consider when dealing with data used to train machine learning models in drug discovery. They can be applied in most drug discovery settings and, if satisfied, will result in much better predictive models and meaningful outcomes. Any part of the tech stack described later in this post should be continually evaluated based on how it impacts the principles below.

We define the 5 ‘V’s of drug discovery data as follows:

  • The right data veracity: The quality, integrity, accuracy, and consistency of the data that is generated or available
  • The right data variety: How balanced is the data? What dynamic range does it cover? What different types of data do we have available? For proteins this could, for example, mean high sequence or structural diversity in the training data
  • The right data volume: The amount of data that’s available to train the models
  • The right data velocity: How fast and cheap can data be generated, accumulated, and curated for analysis purposes? This impacts how easily we can validate or retrain our models
  • The right data value: How translatable is the data to clinical outcome, which is typically related to the biological complexity?

While this list could be extended to additional properties, in this article we stick to the main 5. In general, data veracity, variety, and volume are qualities of the data itself, while velocity helps assess the ease at which we can acquire new data and hence what processes and methods we want to use. The latter is particularly important if there is little data in the public domain, or if we want to use an active learning-based approach. The value of the data is based on the biological complexity and its relevance to clinical outcome, which impacts the predictive validity of the data readout. There is also the opportunity to provide the model with more context and introduce conditionality by adding metadata.

Let’s look more closely at each of the 5 ‘V’s:

1. Data veracity

Consistent, accurate, good quality, labelled data generated using biologically relevant functional assays is rare in many areas of drug discovery, or it doesn’t exist at all. Large corporate datasets typically aren’t useful as they aren’t collected in a consistent manner or digitised. Similarly, data from CROs might not include the controls and metadata needed for machine learning. Variability also arises between sites and providers as a result of varying protocols, equipment, etc. See Figure 1 for an example of the impact of normalisation via controls. If public data already exists, it usually contains a lot of noise, duplicates, and errors, and requires careful analysis, filtering, normalisation, and other preprocessing steps before it can be used (more examples in the appendix).

Figure 1: Data from repeated measurements of two different control molecules evaluated in cell-based assays from different campaigns, before (left) and after (right) normalisation. The normalisation with respect to a control can dramatically change the picture. Here we clearly see two clusters (control-1 and control-2) emerging (Source: LabGenius internal data).
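
As a minimal sketch of the kind of control-based normalisation illustrated in Figure 1, the snippet below expresses each readout as a percentage of the mean reference control measured on the same plate. The record keys (`plate`, `sample`, `signal`), the choice of `control-1` as the reference, and the percent-of-control scheme are illustrative assumptions, not the pipeline used to produce the figure.

```python
from collections import defaultdict
from statistics import mean


def normalise_to_control(records, control_name="control-1"):
    """Express each readout as a percentage of the plate's mean control signal.

    `records` is a list of dicts with hypothetical keys
    'plate', 'sample' and 'signal'.
    """
    # Collect the control signals measured on each plate.
    control_signals = defaultdict(list)
    for rec in records:
        if rec["sample"] == control_name:
            control_signals[rec["plate"]].append(rec["signal"])

    normalised = []
    for rec in records:
        controls = control_signals.get(rec["plate"])
        if not controls:
            # Without an on-plate control the readout cannot be normalised.
            raise ValueError(f"plate {rec['plate']!r} has no {control_name!r} wells")
        plate_control = mean(controls)
        normalised.append(
            {**rec, "percent_of_control": 100.0 * rec["signal"] / plate_control}
        )
    return normalised


# Two plates with very different raw signal levels (made-up numbers).
raw = [
    {"plate": "P1", "sample": "control-1", "signal": 1000.0},
    {"plate": "P1", "sample": "control-2", "signal": 500.0},
    {"plate": "P2", "sample": "control-1", "signal": 4000.0},
    {"plate": "P2", "sample": "control-2", "signal": 2000.0},
]
result = normalise_to_control(raw)
```

In this toy example the raw `control-2` signals differ fourfold between plates, but they coincide once each is expressed relative to its own plate's reference control.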

As we move towards more complex biology, we increasingly have to deal with more noise, more variability, and more confounding variables for each readout. For example, cell-based assays are highly sensitive in comparison to simpler binding assays, and have much higher plate-to-plate and campaign-to-campaign variations. From a translational perspective, more complex assays or systems are preferred because they are more indicative of molecules’ translational value. Increasing the predictive validity by utilising more complex, biologically relevant data is usually better than increasing the quantity of less relevant data. However, increased complexity brings greater variability, amplifying the significance and difficulty of the normalisation and integration process. Similarly, more specific data (e.g. a specific disease-relevant cell line vs a range of different ones) is usually harder to standardise but has higher predictive validity. This is one of the main reasons why standardisation across different programs is also much harder than across a single one.

Normalisation (e.g. via controls) hasn’t typically been done for historical data because it wasn’t created with the intention of being used for machine learning. This means that a lot of pre-existing data is not suitable for modern machine learning systems. The issue extends beyond a single type of measurement in drug discovery: variability and lack of reproducibility are common across many different areas, arguably anything that touches biology. This ranges from solubility to protein kinase assays (or here) to toxicology (see also the appendix).

Even today, most organisations do not pay enough attention to the quality and consistency of the data they produce. This is why machine learning doesn’t have the impact that it could. Perhaps a reason for this is that the collection and curation of data is much less exciting than developing or using a new machine learning model. Experimental and computational scientists, industry, and investors alike need a big shift in mindset here. We’ll come back to this later in the post.

It is also important to understand that machine learning-grade data requirements are different, and more stringent, compared to conventional drug discovery data. To create machine learning-grade data, we need the right number of controls, repeats, a certain dynamic range, balance in the labels/categories, and metadata (context). This is necessary because biological data can often contain huge variances, and even slight changes in protocols or setup dramatically impact the readout. Without the context of these changes, it is unlikely that an algorithm will be able to make sense of the data. Beyond the algorithm, a data scientist or anyone else analysing the data without the larger context might face the challenge of trying to identify an invisible cause.

Creating the right data (beyond just FAIR data principles) is key to enabling the development of models that can truly impact the drug discovery pipeline, move assets into the clinic and ultimately treat patients.

This article by Andreas Bender and Isidro Cortés-Ciriano has highlighted many of these points. In the context of this discussion, Figure 2 (see below) is especially relevant and highlights a few challenges with biological data, namely cell line drift and response heterogeneity. When we work with biology we suddenly face a whole new set of data quality challenges, and therefore machine learning problems, that are very different to other fields.

Figure 2: Challenges with biological data in contrast to chemical data. Source: Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 1: Ways to make an impact, and why we are not there yet; A. Bender and I. Cortés-Ciriano, Table 1

When talking about data quality, it’s also important to touch on the lack of reproducibility between studies, assays, and labs. When using such data it is important to understand its limitations. Readouts from different labs have high variability even when measuring the same compound. This is again because experiments generate very different results when performed under different conditions. If this is the case for the data used to train a machine learning model, it is unreasonable to expect the model to provide accurate predictions. This can be significant, as shown in the toxicity measurements in Figure 3. In general, while it’s not always as bad as shown here, data from mixed sources can introduce additional challenges and noise. While this data can still be helpful, for example for pre-training, transfer learning, or multi-task learning, one needs to be aware of the limitations and risks.

Beyond the variability, public data also contains plenty of errors. A fundamental assumption in most machine learning or chem/bioinformatics papers (and the proposed models) is the correctness of input data. However, error rates in databases can be significant. For example, Olah et al. showed that on average there are two errors per medicinal chemistry publication, resulting in an overall error rate as high as 8% in some databases, as is the case for WOMBAT (also here). Similar results have been found in other databases (here, here, or here), and have resulted in efforts to produce better data pipelines and to refine and curate existing databases. As we’ve discussed before, errors within datasets can significantly reduce the predictive ability of any model and require particular care when the data is used for model building.

Figure 3: This paper by Cortes-Ciriano and Bender studied the comparability of independent cytotoxicity measurements on a large scale, in this case in the ChEMBL database. In overlapping compound–cell line systems measured in independent laboratories, one can observe a poor correlation. This is in part due to annotation errors, which highlights the importance of data curation when extracting or accumulating public data.
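
To make the idea of comparing independent measurements concrete, here is a minimal sketch that pairs values reported by different labs for the same compound–cell line system and summarises their agreement. It is not the analysis from the paper; the record keys (`compound`, `cell_line`, `lab`, `pIC50`) and all values are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def paired_measurements(records):
    """Pair values from different labs for the same compound-cell line system."""
    by_system = defaultdict(list)
    for rec in records:
        by_system[(rec["compound"], rec["cell_line"])].append(rec)

    pairs = []
    for measurements in by_system.values():
        for a, b in combinations(measurements, 2):
            if a["lab"] != b["lab"]:  # only compare independent sources
                pairs.append((a["pIC50"], b["pIC50"]))
    return pairs


# Hypothetical records; cpd-D was measured by one lab only and yields no pair.
records = [
    {"compound": "cpd-A", "cell_line": "HeLa", "lab": "lab-1", "pIC50": 5.0},
    {"compound": "cpd-A", "cell_line": "HeLa", "lab": "lab-2", "pIC50": 6.1},
    {"compound": "cpd-B", "cell_line": "HeLa", "lab": "lab-1", "pIC50": 7.0},
    {"compound": "cpd-B", "cell_line": "HeLa", "lab": "lab-2", "pIC50": 6.4},
    {"compound": "cpd-C", "cell_line": "MCF7", "lab": "lab-1", "pIC50": 6.0},
    {"compound": "cpd-C", "cell_line": "MCF7", "lab": "lab-2", "pIC50": 6.9},
    {"compound": "cpd-D", "cell_line": "MCF7", "lab": "lab-1", "pIC50": 4.5},
]
pairs = paired_measurements(records)
xs, ys = zip(*pairs)
r = pearson(list(xs), list(ys))
mean_abs_diff = sum(abs(x - y) for x, y in pairs) / len(pairs)
```

Running a check like this before pooling data from mixed sources gives a rough sense of how much disagreement a model trained on that data would inherit.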

To make sure we get the best possible data set for training machine learning models we must make sure we have the correct controls and data processing steps to minimise variability and maximise reproducibility. We need to ensure that we have good processes in place to minimise errors and obtain data with high consistency. This requires assessing the entire process, including instruments, cell lines, assay and curve fitting protocols, and other confounding variables. Constant data monitoring can ensure that inconsistencies are immediately detected. To ensure that this is the case, a lot of effort and resources should be spent on setting up the lab processes and data processing pipelines accordingly, which will in turn enable the creation of veracious data.
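
One simple way to approach this kind of monitoring is a control chart on a reference control. The sketch below flags new plates whose control readout falls outside the historical mean plus or minus three standard deviations. The three-sigma threshold, the plate labels, and the numbers are illustrative assumptions rather than a recommended standard.

```python
from statistics import mean, stdev


def flag_outlying_controls(history, new_values, n_sigma=3.0):
    """Return the new control readouts outside mean +/- n_sigma * SD of the history.

    `history` is a list of past control readouts; `new_values` is a list of
    (label, value) tuples for newly measured plates.
    """
    centre = mean(history)
    spread = stdev(history)  # sample standard deviation
    lower, upper = centre - n_sigma * spread, centre + n_sigma * spread
    return [(label, value) for label, value in new_values if not lower <= value <= upper]


# Made-up historical control readouts (e.g. percent of expected signal).
history = [98.0, 101.0, 100.0, 102.0, 99.0, 100.0]
new_plates = [("plate-41", 100.5), ("plate-42", 97.5), ("plate-43", 112.0)]
flagged = flag_outlying_controls(history, new_plates)
```

Here only `plate-43` is flagged, which would prompt a look at the instrument, cell line passage, or reagent batch before its data enters a training set.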

For more on this topic, see the publications by Christian Kramer and colleagues (e.g. here) or this recent RDKit blog post.

2. Data variety

Large amounts of balanced data that have a uniform distribution, including negative examples (which often don’t exist) with a wide dynamic range (e.g. data with pIC50 across several log orders), are rarely available in a program. For selectivity, this is usually even more challenging, as highly selective compounds are very rare (in particular for small molecule kinase inhibitors). In general, this topic refers to the availability of positive and negative data (if we do binary classification), and to how well the data spans the entire domain that we want to make predictions on (when we perform a regression, i.e. prediction on a continuous variable). A common term for large and diverse sets that is used in the literature is ‘representativeness’. This means that, for example, all target labels (classes) should be sufficiently present, or that all target values (across the entire range of possible values) should be present and approximately evenly distributed. Beyond this, if the data covers only a smaller dynamic range, we typically achieve not only lower correlation coefficients using predictive models but also a lower mean absolute error. Understanding the impact of such data set qualities is hugely important for modelling purposes.
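
The effect of dynamic range on performance metrics can be illustrated with a small simulation. This is a toy sketch under stated assumptions: ‘true’ values are drawn uniformly from a wide or narrow pIC50 range, and a hypothetical model is mimicked by adding Gaussian noise of fixed size. The ranges, noise level, and sample size are arbitrary choices, not values from any real dataset.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mae(xs, ys):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


def simulate(low, high, noise_sd=0.5, n=2000, seed=0):
    """Simulate noisy predictions over a given range of true values."""
    rng = random.Random(seed)
    true = [rng.uniform(low, high) for _ in range(n)]
    # Hypothetical model: the true value plus noise of fixed size.
    predicted = [t + rng.gauss(0.0, noise_sd) for t in true]
    # Trivial baseline: always predict the mean of the data.
    baseline = [sum(true) / n] * n
    return {
        "pearson_r": pearson(true, predicted),
        "model_mae": mae(true, predicted),
        "baseline_mae": mae(true, baseline),
    }


wide = simulate(4.0, 10.0)    # values spanning six log units
narrow = simulate(6.5, 7.5)   # values spanning one log unit
```

In this setup the correlation coefficient is lower on the narrow range even though the simulated model’s error is unchanged, while the constant mean baseline already reaches a lower mean absolute error there than on the wide range. This is one reason to read both metrics in the context of the range the data covers.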

‘Representative’ data often doesn’t exist since historically (and still today in many instances) scientists have chosen not to take all compounds forward for experimental testing or discarded the data for all failed attempts. Whether prioritised through simulation, machine learning, or human assessment, it is still common practice only to progress the molecules with the best predicted performance, which means that data might not be selected in a fashion that is optimal for learning. Beyond this, even with the best selection approach, we will likely find that most compounds fall within a narrow dynamic range (e.g. within 2–3 logs).

All of these limitations can drastically limit the model’s ability to learn. A lack of negative examples will limit the ability to predict poorly performing compounds, and a lack of dynamic range will make it hard to extrapolate to regions of interest and limit the model’s applicability domain. Due to lack of variety, a predictive model might miss crucial information and never learn to identify the top performing compounds, for example if they are in an unexplored region of the chemical or biological (sequence) space. The same is true for both antibodies and small molecules, which is why active learning approaches can only truly succeed if there is no final ‘human filter’ in the loop.

Data variety might also result in higher experimental costs. For example, it is usually easier to synthesise molecules around a specific scaffold, rather than explore a wide range of random structures. But diversity from a sequence/structure, physicochemical and functional perspective can all drive the model’s ability to make better overall predictions. So there is a clear trade-off between cost and speed vs better learning. Being aware of the constraints that the training data might impose on the model is crucial for its successful application. And testing the examples that the model proposes, even if we disagree with the choice, might result in a less successful campaign but in a more successful program.

Beyond variety in the context of one metric, data ‘variety’ also matters across assays. For instance, measurement of activation, selectivity, thermostability, aggregation, and yield for the same protein can be crucial to drive a program forward. For meaningful impact in drug discovery we need to understand the multi-dimensional optimisation space and have a sufficiently high variety in the assays we perform and data we capture.

3. Data volume

Quantity is of course still important. If it’s of the right quality, the more data the better. But for many problems the data just doesn’t exist yet, e.g. for particular antibody therapeutics such as VHHs or T-cell engagers, or in chemistry. Even just a quick search for single domain antibody (‘VHH’) structures in the PDB shows that there is very little data available. And here we’re not even looking for functional data, of which there is even less (for example T-cell activation of multispecific or multivalent molecules; please share any data sources if you have them!). The volume of data required to train a machine learning model can vary hugely. Depending on the scope of the project, the model being used, and the endpoint that needs to be predicted, the number of data points needed can range from tens to several hundred thousand. For instance, AlphaFold was successful, in part, because it was trained on around 100,000 protein structures that had been collected in the PDB since 1971 and the data clearly covered all folds (see Figure 4, from the PDB). It is less surprising that machine learning can solve this problem since the problem space is comprehensively covered.

Figure 4: Number of unique folds added to the PDB, 1 August 2018, calculated using CATH. The total number of unique folds in each year (red), and the yearly addition of new folds (blue) are shown. Source: PDB

In contrast, predicting structures for the smaller universe of antibody sequences has been much more challenging. For small datasets it is harder to cover the problem space, and conventional machine learning models or specialised models that incorporate a lot of prior information are usually better. For larger datasets, neural network / deep learning based approaches increasingly become the better choice.

Figure 5: Number of crystal structures in the PDB for all proteins (blue), antibodies (red), and single domain antibodies (‘VHHs’, yellow). Source: PDB.

4. Data velocity

The speed and cost at which we can acquire high quality data is essential to generate large datasets, use active learning, and fine-tune our models. Simpler assays with less complex biology typically come with a higher data velocity but at the cost of reduced data value (i.e. translatability). The right trade-off between the two usually depends on which methods a biotech company can use and how effectively it can move its programs forward. Innovation of higher velocity methods with a high relevance to the clinical endpoint can give fundamental advantages in the age of AI and machine learning. Active learning, for example, requires fast cycle times with sufficiently high throughput to complete the design-build-test-learn cycles and to rapidly iterate on the compounds.

5. Data value

While we can control veracity, volume, and variety of the data (all a matter of resources), we are usually limited by the relevance of the assay to the in vivo outcome (for example in mice for in vivo PoC or ideally in humans). We’ve discussed the challenges of increasing the biological complexity above, but we also need to consider the increases in cost as the complexity of the system increases. Investing in that complexity could, however, ultimately save us millions by avoiding later stage failures. For this reason, it is essential that we continue developing methodologies that enable the interrogation of biological systems at a higher resolution and complexity.

It is worth highlighting again that quality becomes a particular issue when we explore more complex biology. In the past there has been a tendency to scale up systems (high throughput) with low biological complexity — for example, defaulting to large-scale screening for binding affinities. The big challenge with this is that this data rarely translates directly to the desired functional response, which means the candidates fail to deliver the desired response in a more relevant system. This is why we need complex biology data with a higher predictive validity, even if it is challenging to generate from a data quality and quantity perspective.

For example, in past programs we have repeatedly observed compounds that bind strongly in a biochemical assay but fail to inhibit the same target in a cell-based assay, or high affinity antibodies that don’t result in any functional response in a cellular environment. The reasons for the lack of translatability are varied but always come down to the inadequacies of proxy systems that have little correlation with the desired endpoint. For example, for T-cell engagers we are highly interested in measuring T-cell activation. These multispecific antibodies use different anti-CD3 binders. We observed that the molecules that include a weaker anti-CD3 binder have a consistently higher activation than the ones that include a binder with 30x higher affinity (NB: they compete in a binding assay).

The use of irrelevant proxy assays is a common issue, and while this can often be fixed early in the pipeline, the later you are in the drug development process, the harder, more time consuming and expensive it is to go back and start again. Consequently, the pursuit of superior data encompasses not only more and better quality data, but also novel information and data that has intrinsically higher predictivity for in vivo behaviour. As was recently highlighted in this blog by Nick, Carrie, and Galen at Innovation Endeavours and previously in many other places (or here), emphasis should be placed on approaches that acknowledge complexity and demonstrate the following 4 subcategories:

  • Contextualised: Approaches that facilitate measurements in the most native environments possible, or, more plausibly, an environment more akin to the in vivo context. Native environments are perpetually influenced by various dynamic contexts. For example, the conditions in the tumour microenvironment (TME) are hard to reproduce in 2D or even 3D cell cultures (see e.g. here in context of organoid models). Note that the conditions can be so different that one approach for building tumour-selective antibodies is reliant on this premise, specifically mask cleavage by enzymes that are present at a higher concentration in the TME. This also holds true even earlier in the pipeline and on a smaller scale: proteins are mostly treated as static, although protein dynamics are known to play a huge role. Recent efforts try to incorporate more of the dynamics, for example for flexible docking (DiffDock is indirectly attempting to do this).
  • Functional: Approaches that directly assess activity instead of relying on a proxy. For example, antibody campaigns often pursue high-affinity binders, yet it is evident from the example above, and many others in the literature, that lower affinity is frequently sufficient or even essential for desired function such as activation or selectivity (see also Figure 6). Sometimes, function does not even correlate with binding. So, in order to find molecules that have the potential to overcome current limitations of solid tumour-targeting TCEs in the clinic (e.g. dose-limiting tox — see here) such approaches will be essential. In these instances, it is more important to find compounds that are not just strong binders but selective, or have other therapeutically valuable properties. Avidity-driven selectivity tries to accomplish this (see here in the context of HER2).
  • Multi-scale: Approaches that integrate diverse data modalities to deduce causality, stemming from minute biochemical alterations that propagate to functionality at the system level. Large scale initiatives like TCGA provide great examples of diverse data collection, spanning clinical, genomic, expression, and imaging modalities. Multi-modal clinical-genomic datasets, for example, facilitate the examination of the intricate biology underlying therapeutic efficacy in real-world scenarios.
  • Translational: Approaches that are cognizant of their inherent limitations and strive for translational relevance as a design objective, be it in vivo drug activity or a scaled-up industrial process. Complex in vitro models are progressively demonstrating predictive capabilities (e.g. here or here). That said, there is still a long way to go and many open questions remain (e.g. here and here). Doing more is not always better, and finding the right proxies with the highest translational probability is key to success. This recent Nature Reviews Drug Discovery article makes this point really well.

Figure 6: Example of a functional readout that is hard to estimate with simpler assays. Here we see T-cell-mediated killing of a cancer cell, which is a complex mechanism depending on cross-linking of the T-cell and the target receptor on the cancer cell. Stronger binding usually results in lower selectivity which can be detrimental for the safety profile of the compound. (Source: LabGenius)

Summary of the 5 ‘V’s

We have seen that data resources often require extensive curation and preprocessing by experts to extract relevant and informative data and to eliminate noise, errors, and variability that could compromise the utility of implemented machine learning methods.

The other aspect to consider is the most important part: generating data that is meaningful, i.e. more relevant to our target outcome.

Being aware of the 5 ‘V’s means that one can actively strike a balance between cost and speed of generation/data availability, quality of the data, and translatability. We can also leverage this framework to assess where our data generation pipeline can be improved to maximise the quality and utility of our data.

Satisfying the 5 ‘V’s as much as possible is crucial for the successful application of machine learning in drug discovery. Improving any of these will likely translate into improved outcomes. This is key to moving medicines from the bench to the patient.

This thinking is supported by a lot of evidence (e.g. here, here and here), and yet many companies still get this wrong. For example, generating a million data points from a simplistic binding assay doesn’t necessarily mean you’ll have more success identifying compounds that do something meaningful in a (3D) cell culture, a PDX model, or the complex tumour microenvironment in a human. The good thing is that these insights are slowly arriving and companies are starting to adapt.

What do we need to get good data and predictions?

Now that we have a good understanding of the properties of the data that we need, we turn to the ‘how’.

There are two main ingredients:

  1. A technology stack that maximises consistency and reproducibility, and
  2. A company culture and organisation that allows for the effective use and application of the technology stack.

1. The complete tech stack for machine learning-driven drug discovery

In general, we need to distinguish between data that is used to train intra-program machine learning models (i.e. models that are only used for a specific program), and data that is used to train inter-program machine learning models (i.e. models used across multiple programs).

  • For inter-program models, the conditionality of the data often results in further complexities because differences between the programs and processes make the standardisation of data harder. Examples include different indication- or disease-specific cell lines, or different assay conditions. All factors can influence the data and introduce noise that limits the ability of machine learning systems to learn.
  • The steps and requirements for data standardisation in intra-program models are simplified significantly but still require a fair amount of work in practice. This is particularly true for more complex biology.

The pyramid below captures the full stack that constitutes a machine learning process in drug discovery. Each layer is needed, and errors or noise in any of them will degrade the final performance of any data analysis or machine learning. The stack here applies to both cases, but requires more work for cross-program learning.

Figure 7: The entire data stack. The lower layers typically have the biggest impact on the actual outcome of a drug discovery program. Without good foundations (i.e., predictive assays, data generation, data capturing, and data pre-processing steps) the best analysis can only achieve so much. Data analysis & machine learning could be further broken down into data representation and machine learning models.

When designing a data analysis or machine learning pipeline, it is important to understand each layer and carefully assess sources of noise, errors, and inconsistencies. When assessing the data that is generated by a tech stack (example shown in Figure 7), it is important to acknowledge that any changes made at each layer will require a close collaboration between the tech and the science teams. Identifying which layer(s) have the biggest impact on data quality is crucial to maintaining the integrity of the stack, and should therefore be assessed on a continual basis.

The essential layers in a technology stack are laid out below. We have omitted perhaps the most fundamental layer, which is the valid therapeutic hypothesis, which of course is the most important requirement for any program to succeed. Below we start from the bottom and work our way to the top of the pyramid:

  1. Data Context And Relevance: The available models and the context of the data that is generated are important for the translatability of the generated results into meaningful (clinical) outcomes. We can optimise the models and capture additional data and metadata to try to obtain more context.
  2. Data Generation: Consistent data generation is possible through the use of standardised processes, automation, and standardised equipment. Business rules, standard operating procedures, and automation are key.
  3. Data Capture And Storage: Raw data should be captured automatically with the relevant metadata, and stored in a consistent, secure manner according to FAIR data standards. This would ideally happen in a way such that it is immediately accessible to anyone in the company. Version control and provenance for both data and models help ensure that the right models are trained on the right set of data, and increase resilience to errors and changes.
  4. Data Processing: Processing of raw data needs to be standardised across the company and normalised where appropriate (e.g. EC50 values from fitted curves). Automation and business rules can reduce variability at this stage.
  5. Data Analysis & Machine learning: Only once all other layers are in place should data analysis and machine learning be performed. This requires careful data curation and model validation. In particular feature choice, data splits, and choice of relevant performance metrics play an important role here.
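To make the provenance idea in layer 3 more concrete, here is a minimal sketch of what a capture record and a training manifest could look like. It is only an illustration under our own assumptions: the function names (capture_record, training_manifest), the metadata fields, and the assay and plate identifiers are all hypothetical, and a real system would typically rely on a dedicated data platform rather than hand-rolled code.

```python
import hashlib
import json
from datetime import datetime, timezone


def capture_record(raw_bytes: bytes, metadata: dict) -> dict:
    """Bundle raw instrument output with its metadata and a content hash.

    The SHA-256 hash identifies the exact bytes that were captured, so any
    later change to the raw data yields a different identifier.
    """
    return {
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "n_bytes": len(raw_bytes),
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "metadata": dict(metadata),
    }


def training_manifest(model_name: str, model_version: str, records: list) -> str:
    """Serialise which data versions a given model version was trained on."""
    manifest = {
        "model": model_name,
        "model_version": model_version,
        "data_sha256": sorted(r["sha256"] for r in records),
    }
    return json.dumps(manifest, indent=2, sort_keys=True)


# Hypothetical plate-reader export and metadata (illustrative values only).
raw = b"well,signal\nA1,1023\nA2,987\n"
record = capture_record(
    raw,
    {
        "assay": "hypothetical-binding-assay",
        "plate_id": "P0001",
        "instrument": "reader-01",
        "protocol_version": "1.2",
    },
)
print(training_manifest("potency-model", "0.1.0", [record]))
```

Storing such a manifest next to each trained model makes it possible to check later whether a model was trained on the data version one believes it was.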

Transitioning to a data-driven biotech company requires strategic changes across people, processes, and systems. Here are some potential actions to drive this transformation for the technology stack.

Creating a good drug discovery tech stack: an action checklist

  1. Process standardisation: Establish consistent processes, captured in SOPs and business rules, to be used throughout the organisation. Minimise manual steps through the use of automation and scheduling.
  2. Automated data capture: Develop pipelines and scripts to capture, track, and version data and metadata automatically and reliably. It is also important to choose instruments that can be integrated with the existing system, which typically requires the availability of suitable drivers.
  3. Cloud and warehouse adoption: Leverage services like Google Cloud, AWS, Azure, and data warehouses such as Snowflake and BigQuery for secure data storage of both the raw and processed data. This enables immediate, global access to the data.
  4. Implement model and data tracking: Use systems like MLflow, DVC, or Google Vertex AI Pipelines to track models, model versions, and the associated data and data versions.
  5. Track and manage biological variability: Establish processes for quality control such as automatic tracking of key assay metrics and outlier detection. Establish processes to minimise variability across a program by performing data normalisation (e.g. through various controls).
  6. Lab automation: Use automated / robotic labs to maximise process consistency, and perform lab and user acceptance tests when updating a system.
  7. User-friendly systems: Ensure systems are easy to navigate for experimental teams and capable of supporting modern data analysis methods. This may require a combination of custom-built and purchased solutions.
  8. Data accessibility and interaction: Ensure data accessibility with visualisation and interaction capabilities for all team members, and implement flexible governance systems to manage user permissions.
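As an illustration of item 5 in the checklist above, the sketch below computes one widely used assay quality metric, the Z'-factor, and normalises raw signals against plate controls. The control values, the function names, and the 0.5 acceptance threshold are assumptions made for the example (0.5 is a common rule of thumb, not a universal standard), and the convention here is that negative controls are uninhibited wells and positive controls are fully inhibited wells.

```python
from statistics import mean, stdev


def z_prime(pos_controls, neg_controls):
    """Z'-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    separation = abs(mean(pos_controls) - mean(neg_controls))
    if separation == 0:
        raise ValueError("controls have identical means; plate cannot be scored")
    return 1 - 3 * (stdev(pos_controls) + stdev(neg_controls)) / separation


def percent_inhibition(signal, pos_controls, neg_controls):
    """Rescale a raw signal so the negative-control mean maps to 0%
    and the positive-control mean maps to 100%."""
    neg_mean, pos_mean = mean(neg_controls), mean(pos_controls)
    if neg_mean == pos_mean:
        raise ValueError("controls have identical means; cannot normalise")
    return 100 * (neg_mean - signal) / (neg_mean - pos_mean)


def plate_passes_qc(pos_controls, neg_controls, threshold=0.5):
    """Flag a plate as acceptable if its Z'-factor reaches the threshold.

    The default threshold is an assumption; each assay should set its own.
    """
    return z_prime(pos_controls, neg_controls) >= threshold


# Illustrative control wells from a single hypothetical plate.
neg = [1000, 1020, 980, 1010]  # uninhibited signal
pos = [100, 110, 90, 105]      # fully inhibited signal

print("Z' =", round(z_prime(pos, neg), 3))
print("passes QC:", plate_passes_qc(pos, neg))
print("inhibition at signal 550:", round(percent_inhibition(550, pos, neg), 1), "%")
```

Tracking a metric like this automatically for every plate, rather than inspecting it by hand, is one way to spot assay drift before it reaches a training set.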

So to wrap up: We always need to ask ourselves the question — are we more likely to achieve our goal of designing better drugs by either predicting complex properties or biology with lower precision or predicting simpler biology with higher precision? Once we have made this decision, which will vary from project to project, we can then optimise all parameters across the entire technology stack, including the experiments, in the best possible way for the training of models that can answer meaningful biological questions.

The second ingredient is the right company culture and organisation of the team.

2. Company culture and team organisations that enable a data-driven approach

As discussed earlier, the company culture and organisation of the team will play a key role when creating a data-driven organisation. We’ve compiled a list of potential actions that can support the creation of the right culture and teams. We also refer the reader to this practical guide on how to work with data scientists.

Company culture:

  1. Visionary leadership: The C-suite should articulate a compelling vision for becoming a data-driven organisation and infuse this vision throughout the company. This vision should be communicated clearly and consistently across all levels. Having strong support from senior leadership is essential and also requires appropriate representation at the senior management level. GSK, for example, has established a Senior Vice President and Global Head of Artificial Intelligence and Machine Learning, and many biotech companies start with a Chief Data Officer in their team.
  2. Data-centric reward system: Implement a reward system that incentivises data quality, accessibility, and data-driven innovation, not just short-term milestones. This could be in the form of awards, bonuses, or public recognition but should also be part of annual goal setting and work-acceptance.
  3. Data literacy: Prioritise data literacy by establishing regular training programs and workshops. Ensure all team members understand their role in the data value chain and how their contributions impact the overall success of the organisation.
  4. Shared data ownership: Encourage a sense of shared ownership of the data. Provide scientists with easy-to-access tools to perform simple data science tasks themselves, and to evaluate and better understand the data and quality of the data that they produce. This fosters a culture of proactive data management.
  5. End-to-end process focus: Promote an understanding of the end-to-end data processes instead of focusing only on individual technologies. This helps everyone appreciate the importance of their role in the larger context.
  6. Shared responsibility and accountability: Foster a culture of shared responsibility for both short-term and long-term outcomes. The application of machine learning in projects, for example, should be the shared responsibility of data scientists and wet lab teams.
  7. Continuous learning: Commit to the ongoing education of staff. Allocate time and resources for continuous training on new systems and technologies. This extends to data scientists and machine learning experts who need to develop a good understanding of the wet lab processes and the data that is generated.

Organisational structure:

  1. Cross-functional teams: Form cross-functional project teams with shared data goals and accountability. This should include data scientists as core team members within drug discovery projects.
  2. Co-location: Whenever possible, co-locate teams to facilitate spontaneous interaction and collaboration. In cases where this isn’t feasible, use technology to keep teams connected and promote frequent communication.
  3. Standardised processes: Enforce the use of standard data pipelines and systems across all teams throughout the company. As discussed earlier, this consistency helps to minimise errors and facilitates better data analysis.
  4. Shared success and failure: Celebrate wins as a joint effort and hold all team members accountable for any shortcomings. This approach fosters a sense of unity and shared commitment.
  5. Knowledge sharing: Promote knowledge sharing between different teams and backgrounds. Explain technologies and communicate requirements in regular sessions, and make resources and information readily available across the business. Project teams should maintain clear documentation of all decisions and steps, including models and data used. This information should be easily accessible (upon request) across the business.
  6. Science- or user-led product teams: Product teams (i.e. any team building a computational workflow or analysis system) need to be science or user led, but software developers or data scientists need to be part of the core product team. Working together to define product specifications and perform user acceptance tests is key to building useful tools.

Putting it all together

By implementing some of the proposed actions and building a good technology stack as well as developing a data-driven culture and organisation, we believe modern biotech companies will be able to produce high quality machine learning-grade data (5 ‘V’s) and get better results using machine learning. This can ultimately lead to the selection of better compounds and improved patient outcomes. We hope that these guidelines will help some of the operators in the field develop their own thinking and strategy for building data-driven organisations, and educate investors and other stakeholders who are less familiar with the space on the need to look at technology and data more holistically.
If you have any questions or comments, please do reach out to us.

We’d like to thank Philipp Harbach and Gino Van Heeke for their input on many aspects that we touched on in this article.


More examples.

Alves et al. gave a good example in toxicology regarding the impact of data quality on the confidence in model prediction. They showed that models generated with un-curated data had a 7–24% higher correct classification rate, but that this perceived performance was inflated owing to the high number of duplicates in the training set. Chemical toxicity mechanisms, like many mechanisms in drug discovery, are complex, involving multiple molecular pathways, cell types, and organ systems, and interpretations of toxicology studies are further complicated by differences in study protocols, exposure conditions, chemical purity, test subject attributes, and dose selection. Other sources of biological and random variability can also impact the concordance of toxicology outcomes from different studies, with one-third of variance in ‘no effect levels’ from rodent studies unaccounted for by obvious study characteristics [Ref]. Another study evaluated a corpus of human transcriptomics studies by comparing the provided annotations of sex to the expression levels of sex-specific genes. The authors identified apparently mislabeled samples in 46% of the datasets studied, yielding a 99% confidence lower-bound estimate for all studies of 33%.
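The duplicate problem described above can be checked for with very little code. The following is a minimal sketch, not the curation procedure used by Alves et al.: the record layout, the canonical_key function, and the compound identifiers are hypothetical, and in practice the key would usually be a standardised chemical structure representation rather than a cleaned-up identifier string.

```python
def canonical_key(record):
    """Hypothetical canonical identifier for a record.

    Here we only normalise whitespace and case; a real pipeline would
    usually derive the key from a standardised structure.
    """
    return record["compound_id"].strip().upper()


def deduplicate(records):
    """Keep one record per canonical key and report conflicting labels.

    Keys whose duplicates disagree on the label are dropped entirely and
    returned separately so they can be reviewed by a scientist.
    """
    first_seen = {}
    conflicts = set()
    for rec in records:
        key = canonical_key(rec)
        if key not in first_seen:
            first_seen[key] = rec
        elif first_seen[key]["label"] != rec["label"]:
            conflicts.add(key)
    unique = [rec for key, rec in first_seen.items() if key not in conflicts]
    return unique, conflicts


def leaked_keys(train, test):
    """Return canonical keys that appear in both the training and test sets."""
    return {canonical_key(r) for r in train} & {canonical_key(r) for r in test}


# Illustrative records: two spellings of cmpd-1, and conflicting labels for cmpd-3.
records = [
    {"compound_id": "cmpd-1", "label": 1},
    {"compound_id": "CMPD-1 ", "label": 1},
    {"compound_id": "cmpd-2", "label": 0},
    {"compound_id": "cmpd-3", "label": 1},
    {"compound_id": "CMPD-3", "label": 0},
]

# A naive split on the raw records puts copies of the same compound on both sides.
naive_train = [records[0], records[2], records[3]]
naive_test = [records[1], records[4]]
print("leaked between train and test:", sorted(leaked_keys(naive_train, naive_test)))

unique, conflicts = deduplicate(records)
print("unique records kept:", [canonical_key(r) for r in unique])
print("conflicting labels to review:", sorted(conflicts))
```

Running the deduplication before any train/test split means that a model cannot be rewarded for simply memorising compounds it has already seen.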