Herophilus Orchestra: the best Drug Discovery Operating System you’ve never heard of
The origin story
We founded Herophilus with the goal of bringing together the recent revolutions in artificial intelligence, robotic automation, and experimental biology to build a totally new approach to discovering therapeutics for neurological disease. We knew that bringing world-class software engineering into the project from day one was a requirement, given the complexity of the automated and human data generation processes, the heterogeneity of data objects, and the sophistication of the modern AI-based analysis pipelines critical to our scientific approach. Google Apps weren’t going to cut it.
We needed to interconnect equipment of different kinds for data acquisition and control. We needed to plan, guide, and annotate a broad range of long-running, complex human and automated wet lab assays. We needed to integrate complex data and metadata streams, maintain data provenance, and serve data up for flexible analysis and query. We needed to enable rapid development of machine learning pipelines. We needed to facilitate rich human exploration, curation, and annotation of the data. Finally, we needed enterprise-strength scalability, access-anywhere availability, and security. Above all, we needed a unified platform to tie it all together: to oversee and facilitate our symphony of needs for cutting-edge therapeutics discovery.
We had to build a product that didn’t exist
Any project should start with the question “Build or Buy?” Before we started building, we looked far and wide. We were already using Electronic Lab Notebooks (ELNs), Laboratory Information Management Systems (LIMS), schedulers, image viewers, and every manner of modern-stack developer tools and cloud computing infrastructure. All of these isolated products helped scientists create or interact with data, but none did so in a scalable, reproducible, extensible, auditable, and, most importantly, integrated way. In these apps, data was still thought of as it was in the ’90s: individual files in a file system with little context, no linkage to related data and metadata, no protection or permanence, and no provenance or reproducibility. In short, a complete lack of data integrity. Every scientist has endured the pain, often years after the data was generated, of lost files, forgotten analysis settings, and broken code when trying to re-engage with their preciously acquired data. As the saying goes, “no useful data gets touched only once.” We realized we had to build something that didn’t exist: integration as a product.
Integration: from generation to insight
What is the primary purpose for generating data in drug discovery? It is to produce insight. We needed to go from generation to insight many times a day and in many different ways without the usual scientist pain and loss caused by an amalgam of point application solutions and ad-hoc data management. The only way to avoid this mess is to integrate. Integrate data, integrate data sources, integrate human interfaces, integrate code, integrate machine learning pipelines. Integration is the key to happiness.
And so we built Orchestra. We needed to define a new product category, so allowing ourselves a bit of terminology abuse, we called it a Drug Discovery Operating System. Orchestra is the unifying platform that allows us to execute a vast range of data generation tasks (assay specification and logging, instrument control, accessioning of biobank samples and reagents) and then automagically serves the results up for data insight tasks: machine-learning feature extraction, traditional statistical analysis, and high-definition human exploration of data. It is the central nervous system of our scientific discovery operation.
Trained analytics is the new normal
The advent of data-trained analysis pipelines, aka machine learning, aka AI, has only made the need for data integrity and integration even more critical. Training data must be filtered, annotated, curated, then revisited when artifacts are discovered. Analysis pipelines are constantly tweaked and re-trained. Nothing is static, nothing is clean, and the data pool is always growing. There is no batch learning; there is a constant stream of heterogeneous training data. Data schemas are constantly changing and growing. Analysis code must be bonded with data so we can always re-run, or revise, our analysis methods as new insights, or insight-generating ideas, come to light.
In fact, as machine learning becomes ubiquitous, it’s the data engineering that unlocks the value opportunity. Orchestra was built to fill that need. No more file transfer. No more lost data. No more data munging in Prism, Excel, or MATLAB. No more hand-built Google Sheets for tracking. No more painfully reconstructing data analysis steps for publication. From the beginning, we built an ML-first platform, highly focused on data provenance, with a plug-and-play pipeline architecture that makes it possible to run, configure, and rerun training and inference jobs over and over again with complete reproducibility and artifact integrity. For our team, these analytics close the gap between mountains of data collection and human translational hits.
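To make the idea of bonding analysis code to data a little more concrete, here is a minimal sketch in Python. It is not Orchestra's implementation, which this post does not describe; the names `ProvenanceRecord`, `make_record`, and `run_id` are hypothetical and chosen only for illustration. The assumption is that a run can be identified by a digest of its input data, the version of the analysis code, and its configuration, so that the same three ingredients always map to the same identifier.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record tying one analysis run to its exact inputs."""

    input_digest: str   # digest of the raw input data
    code_version: str   # e.g. a version tag or commit hash of the pipeline
    config_digest: str  # digest of the canonicalized configuration

    @property
    def run_id(self) -> str:
        # Serialize the fields with sorted keys so the ID is deterministic.
        canonical = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return sha256_hex(canonical)[:16]


def make_record(input_bytes: bytes, code_version: str, config: dict) -> ProvenanceRecord:
    """Build a provenance record; config key order does not affect the result."""
    config_bytes = json.dumps(config, sort_keys=True).encode("utf-8")
    return ProvenanceRecord(
        input_digest=sha256_hex(input_bytes),
        code_version=code_version,
        config_digest=sha256_hex(config_bytes),
    )


if __name__ == "__main__":
    record = make_record(b"raw image bytes", "v1.2.0", {"threshold": 0.5})
    print(record.run_id)
```

Under these assumptions, rerunning the same code on the same data with the same settings yields the same `run_id`, while changing any one of the three yields a new one. That is one simple way to tell whether a stored result can be reused or must be regenerated.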
Living in the high performance cloud
These days, there’s a common solution to enabling enterprise-strength access-anywhere applications. This paradigm is so successful that it eclipsed selling things as the killer app for a certain company valued over $1 trillion. Science, as an industry, is not immune to the slow adoption of cloud technologies seen by other traditional sectors over the last decade. So for us, as veterans of several non-science data-driven companies that have digitized the enterprise through cloud-deployed infrastructure, the cloud was the obvious place to start.
Orchestra is built atop the AWS cloud. We get all the benefits of secure virtualized infrastructure with detailed cost analytics, massive parallelism and scale, high-performance compute on both CPU and GPU, just-in-time resource allocation, and effectively unlimited storage. We get to make broad use of the configuration-as-code paradigm (in AWS land this is called CloudFormation, but every cloud computing vendor has its own flavor). This allows us to deploy or modify entire functionality stacks at the click of a button, to stably support ever-evolving and expanding uses. This was such an implementation win that AWS did a case study about our stack.
Our processing infrastructure is completely containerized. This allows us to quickly create, configure and deploy various workflows of several flavors with complete confidence that they will work when deployed in the final environment. This standardized approach also allows us to create process and procedure tooling and integration+deploy harnesses that facilitate the rapid, nearly effortless, inclusion of new machine learning modules as they are developed by our analysis team.
Lastly, all the human interaction functionality for the platform lives in the unified Orchestra UI. Although generalized commodity workflows are a critical piece of platform-style technology, you can’t ever get away from the fact that plenty of real day-to-day, on-the-ground lab work is highly opinionated and specific to the experiment at hand. People need a platform that actually helps them do their job. We have built GUIs for over 30 types of on-the-ground workflows, and more are built every day.
Our discovery techniques make heavy use of images. Images are complex and bulky, and difficult for individuals to store, move and analyze. Perhaps our most important (and dare I say, impressive) module is a flexible and highly performant image exploration and annotation service we call Honeydew, built to facilitate rapid access to terabytes of high-resolution 2D, 3D, 4D & 5D microscopy images and videos. This feature set includes browser-based viewing, pyramid zooming, annotation and labeling, scrubbing, histogram manipulation, in-situ metadata and much more. And to make it truly collaborative, all Orchestra UI states are permanent and shareable across the organization.
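For readers unfamiliar with pyramid zooming, the general technique is to store an image at successively halved resolutions and cut each level into fixed-size tiles, so a viewer only fetches the tiles it needs for the current zoom and viewport. The sketch below illustrates that generic idea in Python; it does not describe how Honeydew itself works, and the function names and the 256-pixel tile size are assumptions made for the example.

```python
import math
from typing import List, Tuple


def pyramid_levels(width: int, height: int, tile: int = 256) -> List[Tuple[int, int]]:
    """List (width, height) per level, halving until the image fits in one tile."""
    if width <= 0 or height <= 0 or tile <= 0:
        raise ValueError("width, height, and tile must be positive")
    levels = [(width, height)]
    while width > tile or height > tile:
        # Round up so odd dimensions never lose their last row or column.
        width = max(1, math.ceil(width / 2))
        height = max(1, math.ceil(height / 2))
        levels.append((width, height))
    return levels


def tiles_at_level(width: int, height: int, tile: int = 256) -> int:
    """Count the tiles needed to cover one level, including partial edge tiles."""
    return math.ceil(width / tile) * math.ceil(height / tile)


if __name__ == "__main__":
    for level, (w, h) in enumerate(pyramid_levels(1024, 512)):
        print(level, (w, h), tiles_at_level(w, h))
```

The point of the structure is that a zoomed-out view reads one small tile from the coarsest level, while a zoomed-in view reads only the handful of full-resolution tiles under the viewport, rather than the entire image.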
But what does it really DO?
The core functionality of Orchestra is quite general. It is an enterprise-strength data generation, management, and analysis platform. It enables rapid development of analytical pipelines, data exploration, and rapid production of insights. It is particularly suited for complex datasets and deployment of machine learning systems.
But we’re a science company. So on top of Orchestra we built out a suite of functionality modules specific to our scientific process. We built dozens of ML pipelines, deployed in production, to detect signals in high-dimensional data such as transcriptomic data, neural activity, IHC, neurite density, microglia morphology, live-cell activity, phagocytosis, and spine density, to name just a few. Here is just a sample of the modules that we built and rely on virtually every day:
So, what’s next?
We may have over-engineered a one-customer product. The consequence is that our scientists are the happiest scientists we know, because the software has disappeared for them: they live in the data. They can achieve a flow state, and they know they can always recreate their analyses and never lose data. We have rescued them from a hell of spreadsheets and shared folders. Our data scientists rapidly develop new analysis pipelines without writing file I/O code or thinking about infrastructure issues. But our joy seems almost… selfish.
We aren’t the only team out there with our needs — we know this from the oohs and aahs when people visit our labs and see Orchestra on the touchscreens. There is a life sciences enterprise software product category here, and there is investment capital flowing into the space. We’d love to share our software with the world; but there is major activation energy needed to adopt an extensive platform like ours for those already operating in a legacy environment. We don’t know how easy this will be to deploy in a big, global pharma company, obsoleting the old spreadsheets and SAP databases. We’d like to try deploying it with an intrepid partner. Or maybe we just keep it to ourselves — a jet-powered science software platform with one very happy customer.