Turning the ENCODE pipelines into community resources

Karl Sebby
truwl
Oct 13, 2021

After the completion of the Human Genome Project (HGP) in 2003, which deciphered the order of the bases in the human genome (represented by the letters A, C, T, and G), it was time to start figuring out how the arrangement of all those bases corresponds to biological function.

Enter the Encyclopedia of DNA Elements (ENCODE) Project. ENCODE is a large undertaking involving researchers from over two dozen of the top research institutions in the world, with the goal of identifying and characterizing the functional elements in the human genome. Beginning in 2003 with a pilot phase, and now in its fourth and final phase, ENCODE requires a high degree of collaboration among researchers spread across many time zones and spanning a wide range of expertise.

The HGP set the bar for what large-scale, distributed, and collaborative science could accomplish and set a new precedent for turning publicly funded scientific output into a community resource that could be accessed and used by any researcher. From the beginning, ENCODE was also designed to generate a valuable community resource. As stated in the opening line of the abstract of A User’s Guide to the Encyclopedia of DNA Elements: “The mission of the [ENCODE] Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve human health.”¹

The amount of raw and processed data, along with the accompanying annotations, is truly impressive and has indeed become an important resource for the scientific community. According to the ENCODE portal, at the time of this writing the consortium has produced 1,487 publications, and an additional 1,949 publications have been published by the community using ENCODE resources.² This large degree of community output was intentional and expected: “However, we expect that deep insights into the function of most elements will ultimately come from the community of biologists who will build on ENCODE data or use them to complement their own experiments.”

Meeting the goals of ENCODE depends heavily on being able to integrate and compare results coming from many different investigators. To achieve this, it was recognized early on that the handling of samples and data had to be standardized across labs to ensure that signals in the data were caused by the underlying biology and not by artifacts of the way the data was collected or processed.

The consortium developed and adopted a set of experimental guidelines and protocols for collecting and evaluating raw assay data, including standardized growth conditions and antibody characterization, requirements for controls and biological replicates, and a data quality review process prior to public release. Once through the quality review process, the raw data also has to be processed in a standardized fashion. As has been highlighted for mapping and variant calling in whole-genome sequencing, differences in bioinformatics processing impede the ability to compare results from different labs.³ This is also true for the functional assays used by ENCODE. To ensure robust comparability of results, the ENCODE Data Coordination Center (DCC) at Stanford developed a set of uniform processing pipelines for the major assay types used by ENCODE and was tasked with processing the raw data produced by the consortium with those pipelines.

Once released, the data is made available through a publicly accessible portal (encodeproject.org), which provides a user-friendly interface for exploring experiments along with their associated raw data, processed data, and metadata. This ease of access is foundational to enabling community use of the data produced by the consortium. Although the raw data and analyzed results provided by the consortium are valuable resources, the DCC realized that the computational methods developed to turn raw data into results were an equally important deliverable of the project’s activities.

Providing access to the reproducible and reusable computational methods used by ENCODE enhances the value of the project by allowing anybody to process similar data types. The number of cell lines, tissues, cell states, and chromatin-binding proteins studied by ENCODE is necessarily limited, so to ensure that the ENCODE results remain a valuable community resource, broad community access to the processing pipelines is a must. The benefit of providing access to these methods is twofold. First, a set of trusted and robust pipelines enables researchers to get from data to results confidently. Bioinformatics is complex, and pipelines and analyses can use dozens of different tools, each with its own quirks, pitfalls, and range in quality of documentation and support. Using methods that have been battle-tested provides a faster and more secure path to reliable results. Second, the results of experiments can be directly compared to the results produced by the ENCODE consortium. Data is always much more valuable when placed in the context of other experiments.

So how can researchers access and use the ENCODE pipelines in their own research? First, the pipelines have been released on the ENCODE-DCC’s GitHub page (https://github.com/ENCODE-DCC/) under a free and open software license, so anybody can clone, modify, or use them. The pipelines are designed within what the ENCODE-DCC has named the ‘reproducibility framework’, which leverages the Workflow Description Language (WDL), streamlined access to all the underlying software through Docker or Singularity containers, and a Python wrapper for the workflow management system Cromwell. This framework enables the pipelines to be run reliably and reproducibly in a variety of environments, including the cloud and compute clusters. The pipeline maintainers are also very helpful and quick to respond to issues on GitHub. The reproducibility framework is a leap forward in distributing bioinformatics pipelines in a reproducible, usable, and flexible fashion, yet it still requires users to be comfortable cloning repositories, installing tools from the command line, accessing compute resources, and properly defining inputs with text files. This falls short of the level of access and usability that the ENCODE portal provides for raw data, processed data, and experimental details.
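To make those moving parts concrete, here is a minimal sketch of what running one of these pipelines on your own machine might look like, using the ENCODE ATAC-seq pipeline as an example. It assumes the pipeline repository has been cloned and that Caper, the DCC’s Python wrapper around Cromwell, has been installed (pip install caper); the input keys follow the pattern documented in the pipeline repository, but the file names and values below are placeholders rather than a working configuration.

```python
import json
import subprocess

# Illustrative inputs for the ENCODE ATAC-seq pipeline. Key names follow the
# pattern documented in the pipeline repo; check the input documentation for
# the pipeline version you clone. All paths here are placeholders.
inputs = {
    "atac.title": "my-atac-experiment",
    "atac.pipeline_type": "atac",
    "atac.genome_tsv": "hg38.tsv",  # reference bundle (hypothetical path)
    "atac.paired_end": True,
    "atac.fastqs_rep1_R1": ["rep1_R1.fastq.gz"],
    "atac.fastqs_rep1_R2": ["rep1_R2.fastq.gz"],
}

# Write the inputs file that the workflow manager will consume.
with open("inputs.json", "w") as fh:
    json.dump(inputs, fh, indent=2)

# Launch the workflow with Caper, which drives Cromwell and pulls the
# Docker/Singularity images declared in the WDL.
subprocess.run(
    ["caper", "run", "atac.wdl", "-i", "inputs.json"],
    check=True,
)
```

Because the containers and the workflow definition travel together, the same invocation works whether Caper is configured for a laptop, an HPC scheduler, or a cloud backend; only the backend configuration changes, not the inputs.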

To increase the ease of access and use of the ENCODE pipelines, Truwl partnered with the ENCODE-DCC to complete the ‘last mile of usability’ for these pipelines. As with data on the ENCODE portal, access to these pipelines on Truwl is available to anyone with an internet connection. Without being logged in, users can see complete examples of how the pipelines are run in practice and get the files required to run the pipelines on their own system. Once a user has an account and is associated with a project account, these pipelines are available to run directly on the cloud from truwl.com. The inputs are defined from a web-based input editor that has embedded documentation about each input, and then a job can be launched with the push of a button. Figuring out proper pipeline settings is confusing for users who are not intimately familiar with them; on Truwl, analyses run by other users can be found and forked (copied) to pre-populate parameters and inputs from similar experiments. Once a job completes, the pipeline outputs can be accessed from a provided link to a bucket on the cloud or copied to another system with a provided command. Analyses can then be shared with a select group or published openly for others to evaluate or reuse.
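As an illustration of that last step, and assuming the provided link points at a Google Cloud Storage bucket (an assumption on my part; the bucket path below is purely hypothetical), pulling the outputs down to another system amounts to a single copy command, invoked here from Python for consistency with the sketch above:

```python
import subprocess

# Hypothetical example: copy pipeline outputs from a cloud bucket to the
# local machine with gsutil. The bucket path is a placeholder, not a real
# Truwl location.
subprocess.run(
    ["gsutil", "-m", "cp", "-r",
     "gs://example-bucket/my-atac-run/outputs", "./outputs"],
    check=True,
)
```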

[Screenshot: ENCODE workflows on Truwl]

Users who do not run their analyses from Truwl can still take advantage of it by browsing complete analyses run by other users, using the input editor to generate input files, and sharing their own usage examples by providing us with the files they used to run their experiments.

Truwl is proud to have worked with the DCC to help turn these methods into more accessible community resources that anybody can use, and we are excited to have helped a range of researchers who would not otherwise have been able to use these methods. Special thanks to Jin Lee, Idan Gabdank, Seth Strattan, and the other members of the DCC team for making this all possible.

** Special thanks to Idan Gabdank for providing comments and useful insights for this post.

1. The ENCODE Project Consortium. A User’s Guide to the Encyclopedia of DNA Elements (ENCODE). PLoS Biol 9, e1001046 (2011).

2. ENCODE: Encyclopedia of DNA Elements. https://www.encodeproject.org/.

3. Regier, A. A. et al. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects. Nat. Commun. 9, 1–8 (2018).
