Galaxy: An Open, Web-based Bioinformatics Platform

This was one of the first articles I wrote a couple of years ago, and I completely forgot to publish it. It’s a bit sparse on images, but it reminds me of my journey over the last four years. Enjoy!

I recently read a paper called “Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences.”

I thought the paper was fairly important, since it deals with an issue that I, as a new bioinformatics student, found important: how do we make computational analysis accessible to researchers who do not have a strong computer science background? Hopefully this post will convince you of the scope of this problem and show how Galaxy makes things easier.

The Galaxy bioinformatics home page

With the advent of next-generation sequencing (NGS), we are able to capture extremely large amounts of genetic information. Computers are the key to manipulating and understanding these large datasets, but we need increasingly complex computational analysis tools to make sense of it all. Many researchers in the life sciences have little to no background in computer science, so creating or even using these tools becomes a challenge. Furthermore, communication and reproducibility, the cornerstones of scientific research in general, become much more challenging as well. Galaxy aims to change that by allowing users to upload their datasets and run their own analyses, even with little to no programming experience. Let’s take a look at how it’s done.

The main benefit of using Galaxy is the set of tools it provides for data analysis. Galaxy allows more experienced users to create and upload their own tools so that others can use them. Additionally, it’s easy to set the parameters of each tool or combine multiple tools into an analysis chain, or workflow. While this is great for individual users, keep in mind that a successful experiment needs to be reproducible and transparent to other researchers as well. Unfortunately, the lack of standards, the extremely large datasets, and the very complex analysis tools involved make this hard to document. Furthermore, it is rare to run just one analysis, and integrative data analysis (analysis that draws on multiple sources) is becoming more common.

Galaxy provides a solution to the reproducibility problem by generating metadata during analysis. Metadata can be thought of as descriptive information about datasets, essentially the information needed to repeat the analysis in exactly the same way. Galaxy creates this metadata automatically for each step in the analysis and saves it in a history, so users can keep a copy or share it with other researchers. This alone is insufficient, since it does not tell you why a particular tool or analysis was used. Galaxy therefore allows the user to annotate every step of the analysis so that the intent of each step is captured. Galaxy also supports workflows (chains of analysis tools) that act as reusable templates, letting the user run the same tools with the same parameters on different datasets. Once created, a workflow can be uploaded to Galaxy’s tool shed so that other researchers can download it and try it themselves.
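
To make the idea a bit more concrete, here is a rough sketch in plain Python of what "record metadata for every step, attach an annotation, and reuse the same workflow on new data" could look like. This is not Galaxy's code or its API. The tool names (filter_min_length, to_upper), the Step and HistoryEntry classes, and the run_workflow function are all made up for illustration, and the "reads" are just short strings.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Step:
    """One step of a workflow: which tool, which parameters, and why."""
    tool: str                 # hypothetical tool name, not a real Galaxy tool
    params: Dict[str, Any]
    annotation: str = ""      # the human-written "why" for this step


@dataclass
class HistoryEntry:
    """Metadata recorded automatically each time a step runs."""
    tool: str
    params: Dict[str, Any]
    annotation: str
    input_size: int
    output_size: int


# Two toy "tools" that operate on a list of reads (plain strings here).
def filter_min_length(reads: List[str], min_length: int) -> List[str]:
    return [r for r in reads if len(r) >= min_length]


def to_upper(reads: List[str]) -> List[str]:
    return [r.upper() for r in reads]


TOOLS = {"filter_min_length": filter_min_length, "to_upper": to_upper}


def run_workflow(steps: List[Step], dataset: List[str]) -> Tuple[List[str], List[HistoryEntry]]:
    """Run each step in order and record what was done in a history."""
    history: List[HistoryEntry] = []
    data = list(dataset)
    for step in steps:
        result = TOOLS[step.tool](data, **step.params)
        history.append(
            HistoryEntry(step.tool, dict(step.params), step.annotation,
                         len(data), len(result))
        )
        data = result
    return data, history


# The workflow is a reusable template: same tools, same parameters.
workflow = [
    Step("filter_min_length", {"min_length": 4}, "Drop reads that are too short"),
    Step("to_upper", {}, "Normalise case before comparing reads"),
]

# Run the same workflow on two different datasets.
result_a, history_a = run_workflow(workflow, ["acgt", "ac", "ttgaca"])
result_b, history_b = run_workflow(workflow, ["gg", "catcat"])

for entry in history_a:
    print(entry.tool, entry.params, "-", entry.annotation)
```

The history answers "what was run, and with which settings", the annotations answer "why", and the workflow list is the part you would hand to another researcher so they can repeat the analysis on their own data.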

Finally, Galaxy addresses the issue of accessibility extremely well. All operations can be performed using nothing more than a web browser, with no need to download any extra software. That being said, there are limits on the amount of data that the web-based version of Galaxy can store. Users who require more storage space can download and install a local instance of Galaxy on their own machine.

The paper goes into more detail about Galaxy, so feel free to check it out if this has piqued your interest. Download a dataset from UCSC and try Galaxy out yourself!