Herophilus Orchestra: the best Drug Discovery Operating System you’ve never heard of

The origin story

We founded Herophilus to bring together the recent revolutions in artificial intelligence, robotic automation, and experimental biology into a totally new approach to discovering therapeutics for neurological disease. We knew from day one that world-class software engineering was a requirement, given the complexity of the automated and human data generation processes, the heterogeneity of data objects, and the sophistication of the modern AI-based analysis pipelines critical to our scientific approach. Google Apps weren't going to cut it.

Orchestra: coordinating the symphony of drug discovery from generation to insight — hardware integration, data provenance, ML/AI pipelines and human interaction

We had to build a product that didn’t exist

Any project should start with the question "Build or buy?" Before we started building, we looked far and wide. We were already using Electronic Lab Notebooks (ELNs), Laboratory Information Management Systems (LIMS), schedulers, image viewers, and every manner of modern-stack developer tools and cloud computing infrastructure. All of these isolated products helped scientists create or interact with data, but none did so in a scalable, reproducible, extensible, auditable, and, most importantly, integrated way. In these apps, data was still treated as it was in the '90s: individual files in a file system with little context, no linkage to related data and metadata, no protection or permanence, and no provenance or reproducibility. In short, a complete lack of data integrity. Every scientist has endured the pain, often years after the data was generated, of lost files, forgotten analysis settings, and broken code when trying to re-engage with their preciously acquired data. As the saying goes, "no useful data gets touched only once." We realized we had to build something that didn't exist: integration as a product.
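To make the contrast concrete, here is a minimal sketch (not Orchestra's actual schema, just an illustration of the principle) of what data integrity looks like in practice: every dataset carries its own metadata and provenance and is content-addressed, rather than living as a bare file on disk.

```python
# Hypothetical sketch: a data object with linked context and provenance,
# instead of an anonymous file in a folder.
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRecord:
    """An immutable data object with linked context."""
    content: bytes        # the raw measurement
    metadata: dict        # e.g. instrument, protocol, operator
    parents: tuple = ()   # digests of upstream records (provenance chain)

    @property
    def digest(self) -> str:
        # Content-addressing makes the record tamper-evident and permanent.
        h = hashlib.sha256(self.content)
        for p in self.parents:
            h.update(p.encode())
        return h.hexdigest()


raw = DataRecord(b"plate-42 image bytes", {"instrument": "confocal-1"})
derived = DataRecord(b"segmentation mask", {"pipeline": "neurite-v3"},
                     parents=(raw.digest,))
# The derived record can always be traced back to the exact raw data
# that produced it, even years later.
```

Because each derived record names its parents by digest, the full lineage of any analysis result is recoverable, which is exactly what a loose folder of files cannot guarantee.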

Ideas coming together: Orchestra conception written on the proverbial napkin

Integration: from generation to insight

What is the primary purpose of generating data in drug discovery? To produce insight. We needed to go from generation to insight many times a day, in many different ways, without the usual scientist pain and loss caused by an amalgam of point solutions and ad-hoc data management. The only way to avoid this mess is to integrate: integrate data, integrate data sources, integrate human interfaces, integrate code, integrate machine learning pipelines. Integration is the key to happiness.

The top-level “single pane of glass” for our entire discovery operation: Orchestra’s Dashboard GUI
Bringing collaborative science to life on Orchestra touchscreens

Trained analytics is the new normal

The advent of data-trained analysis pipelines, aka machine learning, aka AI, has only made the need for data integrity and integration more critical. Training data must be filtered, annotated, and curated, then revisited when artifacts are discovered. Analysis pipelines are constantly tweaked and re-trained. Nothing is static, nothing is clean, and the data pool is always growing. There is no one-off batch learning; there is a constant stream of heterogeneous training data. Data schemas are constantly changing and growing. Analysis code must be bonded with data so that analyses can always be re-generated as new insights, or insight-generating ideas, come to light.
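One way to "bond" code with data, sketched here as a hypothetical illustration (the function names and storage format are invented for this example, not Orchestra's real mechanism), is to store a fingerprint of the exact analysis code alongside every result, so stale results can be detected and re-generated when the pipeline changes.

```python
# Hypothetical sketch: record which version of the analysis code
# produced each result, so analyses can be re-generated when code changes.
import hashlib


def code_fingerprint(fn) -> str:
    """Hash a function's compiled bytecode as a cheap version identifier."""
    return hashlib.sha256(fn.__code__.co_code).hexdigest()


def run_analysis(fn, data):
    # Bond the output to the code that produced it.
    return {"result": fn(data), "code_hash": code_fingerprint(fn)}


def mean_intensity(pixels):
    # Stand-in for a real pipeline stage.
    return sum(pixels) / len(pixels)


record = run_analysis(mean_intensity, [10, 20, 30])
# If mean_intensity is later edited and re-deployed, stored records no
# longer match its fingerprint and can be flagged for re-analysis.
```

In a production system the fingerprint would more likely be a container image digest or a git commit hash, but the principle is the same: a result without the identity of the code that produced it is not reproducible.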

Living in the high performance cloud

These days, there's a common solution for enterprise-strength, access-anywhere applications: the cloud. This paradigm is so successful that it eclipsed selling things as the killer app for a certain company valued at over $1 trillion. Science, as an industry, has been as slow to adopt cloud technologies as other traditional sectors over the last decade. For us, as veterans of several non-science, data-driven companies that digitized the enterprise through cloud-deployed infrastructure, it was the obvious place to start.

Examples of Orchestra GUI workflows
Honeydew facilitates exploring and annotating complex images: here we see 3D neurite segmentation in a live GFP-labeled organoid

But what does it really DO?

The core functionality of Orchestra is quite general: it is an enterprise-strength data generation, management, and analysis platform. It enables rapid development of analytical pipelines, fluid data exploration, and fast production of insights, and it is particularly suited to complex datasets and the deployment of machine learning systems.
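As a toy illustration of what "rapid development of analytical pipelines" can look like, here is a hypothetical registration-and-chaining pattern (the decorator and function names are invented for this sketch, not Orchestra's real API): analysts write pure functions, and the platform handles wiring them together.

```python
# Hypothetical sketch: scientists register pure analysis steps; the
# platform, not the scientist, moves data between them.
pipelines = {}


def pipeline(name):
    """Register a function as a named analysis step."""
    def register(fn):
        pipelines[name] = fn
        return fn
    return register


@pipeline("normalize")
def normalize(values):
    peak = max(values)
    return [v / peak for v in values]


@pipeline("threshold")
def threshold(values, cutoff=0.5):
    return [v for v in values if v >= cutoff]


def run(steps, data):
    # Chain registered steps by name; no file I/O in the analyst's code.
    for step in steps:
        data = pipelines[step](data)
    return data


result = run(["normalize", "threshold"], [2, 5, 10])
```

The appeal of this pattern is that each step is testable in isolation and carries no knowledge of storage or infrastructure, which is what lets data scientists focus on the analysis itself.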

A sampling of Orchestra Modules

So, what’s next?

We may have over-engineered a one-customer product. The consequence is that our scientists are the happiest scientists we know, because the software has disappeared for them: they live in the data. They can achieve a flow state, knowing they can always recreate their analyses and will never lose data. We have rescued them from a hell of spreadsheets and shared folders. Our data scientists rapidly develop new analysis pipelines without writing file I/O code or thinking about infrastructure. But our joy seems almost… selfish.



CTO of Herophilus
