How to Perform Large-scale Data Processing in Bioinformatics

A Complementary Discussion for “Ten Simple Rules for Large-scale Data Processing”

Contributors: Arkarachai Fungtammasan, Yih-Chii Hwang, Jason Chin

Most biomedical students and early career professionals might think that large-scale data processing belongs to the realm of computer engineering and that they do not need to worry about it. However, at some point, many scientists find that they need to perform large-scale data analysis. In this article, we provide practical guidelines to help you run that analysis with fewer problems.

Recently, we published an article, “Ten Simple Rules for Large-scale Data Processing,” in PLOS Computational Biology in collaboration with members of the Greene lab, Alex’s Lemonade Stand Foundation, and the Center for Health AI at the University of Colorado School of Medicine. This article builds on that work and describes a technical planning process that scientists can apply to both high-performance computing and the cloud. Although it was written using examples from genomics, these rules are applicable to other data analysis scenarios.

While it is conventional for articles in this series to have only 10 rules, there are additional rules that scientists should take into consideration. In this blog post, we discuss some of those important supplemental rules.

Recap of 10 simple rules for large-scale data processing

To start, let’s quickly summarize the 10 rules that we published earlier this year. For those who are interested, the full manuscript is open access and takes about 10 minutes to read.

Rule 1: Don’t reinvent the wheel: Check for published or pre-processed data that could be useful.

Rule 2: Document everything: Track all rationale and decisions used in your work and make it easy for others to track what you have done.

Rule 3: Understand hardware and regulatory limitations and tradeoffs: Be aware of the regulations and standards that govern the platforms and data you are working with. Simply put, know where you “can” and “should” analyze your data.

Rule 4: Automate your workflows: Set up a system that supports repeated analysis of your data. There are various workflow tools available that can be helpful in this regard.

Rule 5: Design for testing: Plan multiple test input sets and expected outcomes to identify problem areas and sources of error.

Rule 6: Version both software and data: Use version control for all related software, workflows, and data in your pipelines.

Rule 7: Continuously measure and optimize performance: Measure the performance of your workflows and algorithms at intervals to identify areas that can be optimized to reduce costs and save time, and adjust tool options if needed.

Rule 8: Manage compute, memory, and disk resources at scale: Consider trade-offs in memory, storage, and computing resource use. Depending on what is available, you may need to make different choices about data management and analyses.

Rule 9: Learn to recover from failures: The more data you analyze, the greater the risk of errors. This is why it is important to include error handling in your workflows.

Rule 10: Monitor execution: Keep track of potential production issues, and scale up your analysis gradually.

In the next section, we will discuss other rules that are crucial for large-scale data analysis. We will use the rule number with the suffix “.5” for a new rule to indicate where it would be inserted in the ordered list above. For example, Rule 2.5 should be considered after Rule 2 and before Rule 3. Note that even though we tried to write down all these rules in a particular order, in reality, scientists may jump back and forth between rules in many scenarios.

Rule 2.5: Understand where the burden is and design around it

It is best to first understand the analysis type and its major requirements. Large-scale data analysis may demand large quantities of computing power, memory, or storage, involve a large number of files, or some combination of these. A good understanding of where the burden comes from can guide scientists to viable solutions. For example, classical molecular dynamics simulations require large quantities of compute power, but the input and output are usually small. Therefore, it is relatively easy to perform this type of analysis in various settings and find the consensus among results later. Processing next-generation sequencing data requires a different approach: the input data is sizable, so scientists also have to consider their data storage and transfer requirements.

The type of analysis also determines how scientists distribute the work to speed it up. For example, when calling variants in a large number of samples, the first step is to detect variants independently in each sample using the same process, so distributing one sample per analysis is the most natural approach. The next step might be to use the variants detected in each sample to perform joint variant calling, which improves the accuracy of the calls. Because this step requires combining all samples, it makes more sense to distribute the analysis per chromosome or per genomic region (see the sketch after Figure 1).

Figure 1: Slicing data for distributed computing in a variant calling and joint calling application
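To make these two scatter patterns concrete, here is a minimal Python sketch. The functions call_sample_variants and joint_call_region are hypothetical placeholders for whatever variant caller you would actually invoke; only the distribution pattern, one task per sample followed by one task per region, is the point.

```python
from concurrent.futures import ProcessPoolExecutor

samples = ["SRR1234", "SRR1235", "SRR1236"]
regions = [f"chr{i}" for i in range(1, 23)]

def call_sample_variants(sample):
    # Step 1: per-sample variant calling. Each sample is independent,
    # so one sample per task is the natural unit of distribution.
    return f"{sample}.g.vcf.gz"  # placeholder for the real caller's output

def joint_call_region(region, gvcfs):
    # Step 2: joint calling needs all samples together, so the natural
    # unit of distribution becomes a chromosome or genomic region.
    return f"joint.{region}.vcf.gz"  # placeholder for the real caller's output

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        # Scatter by sample for step 1 ...
        gvcfs = list(pool.map(call_sample_variants, samples))
        # ... then scatter by region for step 2, passing all per-sample
        # outputs to every region task.
        joint = list(pool.map(joint_call_region, regions, [gvcfs] * len(regions)))
    print(joint[:2])
```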

One way to approach the problem is to model it as a computational graph: a directed acyclic graph that lays out all the required computing tasks and their dependencies. If you use Spark or Dask, these tools already build computational graphs natively. The image below is designed to help you visualize which components could be analyzed in parallel, even when the workflow does not have an obvious parallel structure (a runnable example follows Figure 2).

Figure 2: An example of a computational graph. Each box represents a processing step. Each path leading to the merge point can be computed independently, and this independence is what makes parallel or distributed computing possible.
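If you want to experiment with this idea directly, the sketch below uses Dask's delayed interface (assuming Dask is installed) to build a tiny graph with three independent branches that meet at a single merge step, mirroring the structure in Figure 2; the preprocess and merge functions are toy placeholders.

```python
from dask import delayed

def preprocess(x):
    return x * 2

def merge(parts):
    return sum(parts)

# Three independent branches -> one merge point, as in Figure 2.
# Each delayed call becomes a node in the graph; branches with no
# dependencies between them can be scheduled in parallel.
branches = [delayed(preprocess)(x) for x in [1, 2, 3]]
result = delayed(merge)(branches)

print(result.compute())            # Dask runs the independent branches in parallel
# result.visualize("graph.png")    # optionally render the graph (requires graphviz)
```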

Rule 3.5: Forecast the cost and time

After selecting a deployment platform, the next step is to set up your end-to-end pipeline and automate it (Rule 4) to prove that your analysis can be done. Part of the setup process involves checking the estimated time and resources required to complete your analysis on a small scale. Try to extrapolate from this information whether the whole project is feasible in terms of cost and timeline. Gaining an early understanding of how the computing and resource requirements scale (linearly, quadratically, etc.) with the amount of data greatly helps with cost predictions. Make sure a reasonable margin of error is included in the calculation.
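As a rough illustration of this kind of extrapolation, the sketch below fits a simple power law to runtimes measured on small test inputs and projects the runtime and budget for the full dataset. All numbers, including the test sizes, runtimes, instance price, and margin, are made up for illustration; substitute your own measurements and your platform's pricing.

```python
import numpy as np

sizes_gb = np.array([1, 2, 4, 8])              # small-scale test input sizes
runtimes_hr = np.array([0.5, 1.1, 2.3, 4.8])   # measured wall-clock times

# Fit a line on a log-log scale: runtime ~ a * size^b.
b, log_a = np.polyfit(np.log(sizes_gb), np.log(runtimes_hr), 1)
print(f"estimated scaling exponent: {b:.2f}")  # ~1 = linear, ~2 = quadratic

full_size_gb = 2000                            # size of the real dataset
predicted_hr = np.exp(log_a) * full_size_gb ** b

price_per_hr = 0.50                            # assumed instance price (USD/hour)
margin = 1.5                                   # 50% margin of error
print(f"predicted runtime: {predicted_hr:,.0f} h, "
      f"budget with margin: ${predicted_hr * price_per_hr * margin:,.0f}")
```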

Remember that this is an estimate and that your costs will change over time. This early in the project, the end-to-end pipeline is not yet finished or optimized, and there may be additional costs to consider as development continues. Also, the margin of error depends heavily on the error rate of your tools of choice, the reliability of your system, and the quality of the data, and such information only becomes available later in the process. Therefore, it is important to frequently revisit and reevaluate your estimated costs and time to ensure they remain within an acceptable range as the project progresses.

Rule 4.5: Organize files and analyses with metadata

Metadata, the data or information associated with the data of interest, is crucial in analysis because it provides important context and meaning to the data. Aside from improving data interpretation, metadata is also useful for organizing execution plans for large-scale analysis. For example, when grouping data into multiple batches for analysis, you can add tags that make it easier to identify which data belongs to a given batch. This tagging system makes it simpler to quickly gather and verify related data, and it also helps you track down related files if there is an issue with a particular analysis batch. Depending on your chosen system (Rule 3), there could be multiple metadata options. These metadata can be mutable or immutable, structured or unstructured, and accessible through different mechanisms. Each type has pros and cons, and the best approach may be to use multiple metadata types to meet your requirements.

Some platforms (e.g., the UK Biobank Research Analysis Platform) support attaching metadata to each analysis job. This is particularly powerful for managing the overall data analysis and reanalyzing certain batches when necessary.

Figure 3: Using metadata to tag analyses with the experiment name, analysis batch number, and whether the analysis is the original run or a rerun of a failed analysis. See more info.
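Even without a platform-level tagging feature, the same idea can be captured with a small, platform-neutral record per analysis. The sketch below uses an example schema with experiment, batch, and attempt fields that mirrors Figure 3; the field names and values are illustrative, not a platform API.

```python
import json

# One small record of tags per launched analysis job.
analyses = [
    {"name": "rawReadQCProtocol1_SRR1234_rep1_rerun3",
     "tags": {"experiment": "rawReadQCProtocol1", "batch": 7, "attempt": "rerun3"}},
    {"name": "rawReadQCProtocol1_SRR1235_rep1_original",
     "tags": {"experiment": "rawReadQCProtocol1", "batch": 7, "attempt": "original"}},
]

# Filter the tags to find every analysis in batch 7 that had to be rerun.
reruns = [a["name"] for a in analyses
          if a["tags"]["batch"] == 7 and a["tags"]["attempt"].startswith("rerun")]
print(json.dumps(reruns, indent=2))
```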

Even if you perform your analysis on a platform with no metadata management system, this principle still applies. Two of the most under-appreciated kinds of metadata are the data filename and the analysis name, which can be implemented in almost all systems. It is good practice to use a patterned, self-explanatory name so that users can quickly identify samples and related analyses. In addition, the filename should be unique within the project. Do not rely on a file path to make distinctions: although both filenames and file paths are mutable, paths are more likely to change because files might be reorganized or transferred to other locations.

For example, the filename could have the pattern <sampleID>_<replicate_or_other_identifier_if_any>_<previous_and_overall_processing>.<file format extension> such as SRR1234_rep1_rawReadQCProtocol1.html.

At the same time, the analysis name could follow a comparable pattern <analysis_protocol>_<sampleID>_<replicate_or_other_identifier_if_any>_<attempt>, such as rawReadQCProtocol1_SRR1234_rep1_rerun3.
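A small helper can keep these patterns consistent across a project. The sketch below simply assembles names from the fields in the patterns above; adapt the fields to your own project.

```python
def build_filename(sample_id, replicate, processing, ext):
    # <sampleID>_<replicate>_<previous_and_overall_processing>.<extension>
    return f"{sample_id}_{replicate}_{processing}.{ext}"

def build_analysis_name(protocol, sample_id, replicate, attempt):
    # <analysis_protocol>_<sampleID>_<replicate>_<attempt>
    return f"{protocol}_{sample_id}_{replicate}_{attempt}"

print(build_filename("SRR1234", "rep1", "rawReadQCProtocol1", "html"))
# SRR1234_rep1_rawReadQCProtocol1.html
print(build_analysis_name("rawReadQCProtocol1", "SRR1234", "rep1", "rerun3"))
# rawReadQCProtocol1_SRR1234_rep1_rerun3
```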

Having a patterned name is also helpful for communicating with collaborators, and it prevents confusion about which files to use when you revisit your analysis in the future. If you currently use filenames such as test1.txt or foo.txt, this would be a good time to make a change.

As we mentioned earlier, each type of metadata has pros and cons. The filename, in particular, has several limitations, including a low character limit, mutability, and support for text only. These limitations can be compensated for with manifest files that record the filename, file path, file signature (e.g., md5), and a description of any abbreviations used in the filename. It is a lot of work to set up a system like this from scratch, but the time investment will pay off in the long run: it will be much simpler to keep track of the files and analyses needed for your work.
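As a starting point, the sketch below generates a simple CSV manifest with one row per file, recording the filename, path, and md5 checksum; the abbreviation glossary could live in an accompanying README or an extra column.

```python
import csv
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    # Compute the md5 checksum of a file, reading it in 1 MB chunks.
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(data_dir, manifest_path="manifest.csv"):
    # Walk the data directory and record filename, path, and md5 per file.
    with open(manifest_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "path", "md5"])
        for path in sorted(Path(data_dir).rglob("*")):
            if path.is_file():
                writer.writerow([path.name, str(path), md5sum(path)])

# Example usage: write_manifest("results/")
```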

Rule 10.5: Share your data

Finally, you have the results that you need! It took a lot of work to get to this point, so it would be good if other researchers did not have to repeat everything you just did. In fact, not sharing research results with the publication would go against Rule 1 on using existing resources.

Simply sharing data, however, is not immediately useful on its own. Others who want to reuse the resources need detailed information about the data, including how it was processed and with what software. In bioinformatics, most software programs have several versions, and there is no guarantee that all versions are properly maintained. Rule 2 on documentation and Rule 6 on versioning therefore greatly affect the usefulness of data sharing.

A set of good practices around data sharing is worth a blog post of its own. There are several standards available today, such as the FAIR principles for making your data Findable, Accessible, Interoperable, and Reusable. Another recent publication, “Ten simple rules for improving research data discovery,” provides a full suite of recommendations on data sharing that covers important technical and ethical considerations.

Sharing data benefits the community, but it is also useful for building a strong research portfolio to support your career. First, having detailed documentation about your work in a shareable form is helpful for drafting publications, auditing, and expanding the work you have done. Second, it increases the impact of your work: many citations come not from the main conclusion of a paper but from auxiliary data points or resources provided as an addendum.

Conclusions

Working with big data does not have to be scary. The guidelines in this article and our previous paper provide the foundation to help you get started. As you launch your pipelines and begin analyzing data, you will identify the shortcomings in your process, refine your workflows, and develop your own set of rules that benefit the community. We look forward to reading about your experiences and suggestions.

--

Arkarachai (Chai) Fungtammasan, PhD
DNAnexus Science Frontiers

Genomics & Bioinformatics Researcher at a Silicon Valley startup. Special interest in emerging NGS technologies, ML, and large-scale data processing