Open-source bioinformatic solutions for ‘Big Data’ analysis

Drs Tim Griffin and Pratik Jagtap along with the Galaxy-P team from the University of Minnesota are working to develop workflows on an open source platform for the analysis of multi-omic data. They are currently focusing on using a Galaxy-based framework to investigate the integration of genomic datasets with mass spectrometry-based ‘omics’ data. But in the long term, they aim to expand the platform to cope with many other ‘Big Data’ domains.

Currently, a major limitation to what we can discover from complex datasets derived from next-generation technologies is our ability to analyse them. This is where the work of Dr Tim Griffin, Dr Pratik Jagtap and their research team will play an important role.

The ‘Big Data’ era
Moore’s Law predicts that the number of transistors on a chip, and with it computing power, will double approximately every two years, while the cost of high-powered machines falls. However, this cannot continue indefinitely, and 2017 may be the crunch point at which physical limitations intervene and the rate of progress begins to level off.

But what influence has this increase in computing power had on science? One of the major advances has been the ability to generate data using next-generation, high-throughput techniques, resulting in ‘Big Data’. Although the term has been applied to many kinds of dataset, it often refers to what are now commonly known as ‘omics’ datasets: genomics, metabolomics, proteomics, transcriptomics and epigenomics, to name but a few. In biomedical science, for example, large-scale, system-wide approaches are increasingly common. These include the 1000 Genomes Project, the emergence of personalised medicine (tailored to an individual’s needs) and systems biology, which examines multiple interacting pathways concurrently as one giant network.


However, analysing these large and complex datasets requires a platform that can cope with the intense informatics requirements and provide access to disparate software from different ‘omics’ domains. Many wet-bench researchers do not have this level of computing power or expertise locally, so remote (cloud-based), open-access platforms are increasingly being used to provide the bioinformatic tools needed to handle the complex results that researchers are obtaining.

One solution for all
At the University of Minnesota, Drs Tim Griffin and Pratik Jagtap and their team are working on solutions to analyse these complicated datasets. This is a multi-disciplinary, collaborative project between Dr Griffin’s lab and the Minnesota Supercomputing Institute, which involves software developers, data scientists and wet-bench biological researchers. Specifically, the team are focusing on mass spectrometry (MS)-based ‘omics’ data (metabolomics and proteomics) and on how an existing open-source framework, called Galaxy, can be harnessed to analyse them.

Put simply, mass spectrometry is a high-throughput technique that sorts ions based on their mass-to-charge ratio (m/z). Once the signatures of individual ions have been recorded, this information can be used, for example, to identify peptides, the short chains of amino acids from which proteins are built. Tandem mass spectrometry (MS/MS) expands on this by using at least two stages of mass analysis, typically with a fragmentation step in between.
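To make the mass-to-charge idea concrete, here is a minimal Python sketch that computes the m/z value expected for a peptide ion from standard monoisotopic residue masses. The function names and the example peptide are illustrative assumptions and are not part of Galaxy-P; real search engines also account for chemical modifications, isotopes and fragment ions.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added to the residue sum for an intact peptide
PROTON = 1.007276   # mass of each charge-carrying proton


def peptide_mass(sequence):
    """Neutral monoisotopic mass: the sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER


def mass_to_charge(sequence, charge=1):
    """m/z of a peptide ion carrying `charge` protons (hypothetical helper)."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (peptide_mass(sequence) + charge * PROTON) / charge


if __name__ == "__main__":
    # "PEPTIDE" is a toy sequence chosen purely for illustration.
    for z in (1, 2):
        print(f"charge {z}+: m/z = {mass_to_charge('PEPTIDE', z):.4f}")
```

The same peptide appears at a lower m/z when it carries more charge (roughly 800.37 for 1+ and 400.69 for 2+ in this toy case), which is why the instrument reports a ratio rather than a mass.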

Galactic platform
Galaxy was originally developed over a decade ago to solve problems in genomic informatics. It can be hosted on a scalable compute infrastructure, helping to cope with the problem of large data volume, and can be accessed remotely by researchers across the globe. Supported by a team of experts and software developers, Galaxy integrates many individual ‘omics’ tools in a single environment, and also has many features that promote workflow sharing and reproducibility. The latter is particularly important, as multiple research projects may be able to use one particular dataset or workflow. Data sharing and transparency also encourage collaboration, and increase the number of expert approaches that can be combined to maximise novel findings.

In particular, the Galaxy for proteomics (Galaxy-P) team investigates ways in which genomic and transcriptomic data can be integrated with MS-based proteomics data. From here, they aim to verify the expression of protein sequence variants that result from sequence variations at the DNA or RNA level. This approach, known as proteogenomics, commonly uses transcriptomic data translated in silico to produce a customised protein sequence database. The MS/MS data are then searched against this database to identify the peptides and proteins present in the sample. The major advantage of this approach is that identification is not limited to the sequences in an existing reference database, and so novel protein sequence variants, which may previously have gone undetected, can be identified. The analysis can also be extended to compare expression levels of genes and proteins.
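The idea of a customised database can be illustrated with a small Python sketch: transcript sequences are translated in silico in three forward reading frames, split at stop codons, and the resulting protein fragments stored for later lookup. This is a simplified illustration under stated assumptions, not the Galaxy-P workflow itself; the function names, the toy transcripts and the `min_length` threshold are hypothetical, and the final step matches peptide strings directly, whereas real tools match MS/MS spectra.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence, frame=0):
    """Translate a DNA/RNA sequence in one forward frame ('*' marks a stop)."""
    seq = sequence.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Codons containing ambiguous bases are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def build_database(transcripts, min_length=3):
    """Collect stop-free protein fragments from three forward frames.

    `min_length` is an illustrative cut-off; a realistic value would be larger.
    """
    database = {}
    for name, seq in transcripts.items():
        for frame in range(3):
            for index, fragment in enumerate(translate(seq, frame).split("*")):
                if len(fragment) >= min_length:
                    database[f"{name}|frame{frame}|{index}"] = fragment
    return database


def find_matches(peptide, database):
    """Return the database entries whose sequence contains the peptide."""
    return [key for key, protein in database.items() if peptide in protein]


if __name__ == "__main__":
    # Toy transcripts: the variant differs from the reference by one base.
    transcripts = {"ref": "ATGGCCAAGTAA", "variant": "ATGGCCGAGTAA"}
    db = build_database(transcripts)
    print(find_matches("MAE", db))
```

In this toy case the single-base change turns a lysine into a glutamate, so the peptide "MAE" is found only in the entry derived from the variant transcript. This is the kind of sequence variant a reference-only database would miss.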


Like proteogenomics, metaproteomics integrates sequence data with MS-derived proteomics data. Here, however, the sequence data are metagenomic, derived from bacterial communities (microbiomes). As before, the metagenomic data are translated in silico to create a protein sequence database, and MS/MS peak lists derived from the raw data are matched against it. Once peptides of interest have been identified, they are assigned to taxonomies and verified. Additional analysis using functional-analysis tools such as MEGAN provides information about the functional categories of microbial protein expression. Metaproteomics can therefore provide functional data to complement the taxonomic findings of a metagenomic approach. The main draw of this approach is that it can potentially be used to analyse data from diverse sample types, ranging from clinical to environmental samples.
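One common way to assign a peptide to a taxonomy is the lowest common ancestor (LCA) rule: if a peptide occurs in proteins from several organisms, it is assigned to the deepest taxonomic rank that all of those organisms share. The Python sketch below illustrates the rule on made-up data. The function names, protein sequences and lineages are assumptions for illustration only, and the source does not state which assignment method the Galaxy-P workflows use.

```python
def lowest_common_ancestor(lineages):
    """Return the shared prefix of several root-to-leaf lineages."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:   # lineages diverge at this rank
            break
        shared.append(ranks[0])
    return shared


def assign_taxonomy(peptides, database):
    """Map each peptide to the LCA of every organism whose protein contains it.

    `database` is a list of (protein_sequence, lineage) pairs. Peptides with
    no match are left out of the result.
    """
    assignments = {}
    for peptide in peptides:
        lineages = [lineage for protein, lineage in database if peptide in protein]
        if lineages:
            assignments[peptide] = lowest_common_ancestor(lineages)
    return assignments


if __name__ == "__main__":
    # Hypothetical protein database: sequences and lineages are invented.
    toy_database = [
        ("MKTAYIAKQR", ("Bacteria", "Firmicutes", "Lactobacillus")),
        ("GGTAYIAKLL", ("Bacteria", "Firmicutes", "Streptococcus")),
        ("MSEQNNTEMTFQ", ("Bacteria", "Proteobacteria", "Escherichia")),
    ]
    result = assign_taxonomy(["TAYIAK", "EQNNTEM", "WWWW"], toy_database)
    for peptide, lineage in result.items():
        print(peptide, "->", " > ".join(lineage))
```

A peptide shared by two genera is reported only at the rank they have in common, while a peptide unique to one organism keeps its full lineage. This conservative behaviour avoids over-claiming which species are present in a sample.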

One example of where Galaxy-P provides ideal tools could be in helping cancer researchers identify which protein sequences may have a functional role in causing a specific cancer. Not only does Galaxy-P provide the tools required for complex analyses, it can also potentially be used to train non-expert bench scientists through public Galaxy platforms. These provide small-scale data for users to access and use with already published workflows. Existing studies have already used the Galaxy-P platform successfully to look at a range of topics, from proteogenomic analysis of hibernating mammals to protein expression in the lungs of patients with acute respiratory distress syndrome.

To infinity and beyond
Drs Griffin and Jagtap hope their work will provide a novel environment to integrate multiple ‘omics’ datasets, and that this approach will provide unique opportunities for future discovery.
So far, the Galaxy-P team has extended Galaxy’s ability to cope with the many challenges of multi-omics informatics. An accessible, unified environment now exists to help non-experts navigate the analysis of MS-based proteomics and metabolomics data, alongside a platform with the potential to develop workflows for proteogenomic and metaproteomic analyses.

The next steps will continue to involve consulting biological researchers to help the team translate their informatics findings into basic biological contexts, and to aid projects that address human diseases. The team will also continue to develop visualisation tools that can help with the interpretation of output data.

There is also potential to add extra layers of omics to the analysis. So, for example, metabolomics could be included in the mix. Using this approach, the possibilities for new discoveries are endless.