Roshan Noronha
Algorithms For Life
Aug 22, 2015

As part of my DIY PCR series, I’ve been learning how to do some analyses of my own. Over the last couple of weeks I’ve been using a web-based platform called Galaxy Bioinformatics as a way to get started. This post covers my experience going through Galaxy’s introductory walkthrough, as well as my overall thoughts on using the platform as a beginner.

The question asked in the walkthrough was “Which coding exon has the highest number of single nucleotide polymorphisms (SNPs) on chromosome 22?” To answer it we need two sets of data. The first is the coding exon data for chromosome 22. The second is the SNP data for chromosome 22.

I started by creating a new history called “AlgorithmsForLife” to store data as well as the steps of the analysis.

A new history

The next step is to get the data for analysis. Galaxy has a built-in option, under the Get Data tool, to get data from several genome browsers. I got both the chromosome data and the SNP data from the UCSC Main table browser.

Getting data from chromosome 22

There are a couple of things to note in the picture above. The settings in the first row (clade, genome and assembly) simply mean that I am working with the current assembly of the human genome. The row labeled region is set to position, since the plan is to work with a specific chromosome, in this case chromosome 22. Lastly, the row labeled output format is set to BED, and the results will be sent to the “AlgorithmsForLife” history.
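For anyone who hasn’t met BED before, it is a tab-separated text format with one genomic interval per line. Here is a minimal sketch of reading one six-column line in Python. The `BedRecord` and `parse_bed_line` names and the example line are made up for illustration; they are not part of Galaxy or UCSC.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str   # chromosome name, e.g. "chr22"
    start: int   # 0-based start position
    end: int     # end position (not included in the interval)
    name: str    # feature name, e.g. an exon identifier
    score: str   # score column, kept as text here
    strand: str  # "+" or "-"


def parse_bed_line(line: str) -> BedRecord:
    """Split one tab-separated, six-column BED line into typed fields."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    return BedRecord(chrom, int(start), int(end), name, score, strand)


# Hypothetical line: the coordinates and name are illustrative, not real data.
example = "chr22\t1000\t1200\texample_exon\t0\t+\n"
record = parse_bed_line(example)
print(record.name, record.end - record.start)  # example_exon 200
```

Because the start is 0-based and the end is excluded, the length of an interval is simply `end - start`.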

After clicking get output, the result is the following:

Before this data is sent to Galaxy, we can choose whether we want all of it or just specific parts. Since our question deals specifically with coding exons, we can import just that data.

The same process needs to be repeated to get the SNP data; however, some of the parameters need to be tweaked.

Getting the SNP data

In the left image the track is set to Common SNPs and the position is set to chromosome 22. The SNP data is then exported to Galaxy, and the end result is this:

Galaxy history with the imported data

You’ll notice that the SNP data is in grey, since Galaxy hasn’t started working on the request yet. While the data is being processed the box turns yellow, and it turns green when the job is completed.

Now that the data has been imported, we can start answering the question “Which coding exon has the highest number of single nucleotide polymorphisms on chromosome 22?”

I started by renaming my datasets to SNPs and Exons so they were easier to keep track of. The next step was to join the two datasets together, using the Join tool under the Operate on Genomic Intervals option.

Parameters for the join tool

The key thing to note here is that the exons need to be first and the SNPs second. The result is a table with 12 columns, where the first six columns represent an exon and the last six represent a SNP that overlaps it. Each row pairs one exon with one SNP, so counting the number of times the same exon name occurs gives the number of SNPs that exon has. For example, the exon named “uc010gqp.2_cds_10_0_chr22_16287254_r” (inside the red box) has 16 SNPs.

Output from the join tool
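Conceptually, the join pairs every exon with every SNP that overlaps it. The sketch below shows that idea with a simple nested loop. It is not how Galaxy implements the tool, and the `join_overlapping` helper, the interval names and the coordinates are all hypothetical. It assumes BED-style intervals where the end position is excluded.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, included
    end: int    # excluded
    name: str
    score: str
    strand: str


def join_overlapping(exons, snps, min_overlap=1):
    """Return one 12-column row for each exon/SNP pair that overlaps."""
    joined = []
    for exon in exons:
        for snp in snps:
            if exon.chrom != snp.chrom:
                continue
            # Number of bases shared by the two intervals (<= 0 means none).
            overlap = min(exon.end, snp.end) - max(exon.start, snp.start)
            if overlap >= min_overlap:
                joined.append(tuple(exon) + tuple(snp))
    return joined


# Made-up intervals, for illustration only.
exons = [
    Interval("chr22", 100, 200, "exonA", "0", "+"),
    Interval("chr22", 300, 400, "exonB", "0", "-"),
]
snps = [
    Interval("chr22", 150, 151, "snp1", "0", "+"),
    Interval("chr22", 199, 200, "snp2", "0", "+"),
    Interval("chr22", 200, 201, "snp3", "0", "+"),  # just past the end of exonA
    Interval("chr22", 350, 351, "snp4", "0", "+"),
]

joined = join_overlapping(exons, snps)
for row in joined:
    print(row[3], row[9])
```

The exon goes first so that its six columns come first in each row, which matches the column layout described above. A nested loop is fine for a toy example, but it would be slow on real chromosome-sized data.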

Since I don’t want to manually count the SNPs for each exon, the Group tool under the Join, Subtract and Group option can be run on the output of the Join step. Together with the Count operation, it groups the rows by exon and counts the SNPs in each group.

Grouping and counting the SNPs for each exon

The rows are grouped on the exon names stored in column 4 of the Join output. Clicking execute results in the following output:

Results of the Group and Join steps
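The group-and-count step can be sketched in a few lines of Python. This is only an illustration of the idea, assuming each joined row is a tuple with the exon name in column 4. The `count_by_column` helper and the rows are hypothetical.

```python
from collections import Counter


def count_by_column(rows, column=4):
    """Count how many rows share each value in the given 1-based column."""
    return Counter(row[column - 1] for row in rows)


# Made-up joined rows, shortened to the exon columns plus the SNP name.
joined_rows = [
    ("chr22", 100, 200, "exonA", "0", "+", "snp1"),
    ("chr22", 100, 200, "exonA", "0", "+", "snp2"),
    ("chr22", 300, 400, "exonB", "0", "-", "snp4"),
]

snp_counts = count_by_column(joined_rows)
print(snp_counts["exonA"], snp_counts["exonB"])  # 2 1
```

Note that only the exon name and its count survive this step; the coordinate columns are gone.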

We now know the number of SNPs for each exon, but the data is not sorted. The Sort tool under the Filter and Sort option lets us sort the data in descending order of SNP count. The Select First tool under the Text Manipulation option then lets us keep just the top five exons with the largest number of SNPs.

Sorting SNPs in descending order
The top 5 exons with the largest number of SNPs
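Sorting and then selecting the first few rows amounts to a sort followed by a slice. Here is a small sketch with made-up counts; the `top_n` helper and the exon names are hypothetical.

```python
def top_n(counts, n=5):
    """Sort (name, count) pairs by count, largest first, and keep the first n."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ordered[:n]


# Made-up counts, for illustration only.
example_counts = {
    "exonA": 3, "exonB": 12, "exonC": 7,
    "exonD": 1, "exonE": 9, "exonF": 5,
}

top_five = top_n(example_counts)
print(top_five[0])  # ('exonB', 12)
```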

The result shows us that the exon with the largest number of SNPs is uc002zsw.2_cds_0_0_chr22_21044319_f, with 32 SNPs. For the sake of brevity I’ll refer to it as uc002zsw.2.

So we’ve managed to answer our question. We can take this a step further by finding some more information about uc002zsw.2 using a genome browser. Since the Group tool removed the coordinate information for all the exons, we’ll need to add it back before we can view uc002zsw.2 or any of the other exons. By going back to the Join, Subtract and Group option and selecting the Compare two Queries tool, we can match the top five exons against the exon data and get their coordinates.

Merged coordinate data
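Getting the coordinates back is essentially a lookup: keep the rows of the exon data whose name appears in the top-five list. The sketch below shows that idea only. The `matching_rows` helper and the data are hypothetical, and this is not a description of how the Compare two Queries tool works internally.

```python
def matching_rows(rows, names, column=4):
    """Keep the rows whose value in the given 1-based column is in names."""
    wanted = set(names)
    return [row for row in rows if row[column - 1] in wanted]


# Made-up exon rows in six-column BED order, for illustration only.
exon_rows = [
    ("chr22", 100, 200, "exonA", "0", "+"),
    ("chr22", 300, 400, "exonB", "0", "-"),
    ("chr22", 500, 650, "exonC", "0", "+"),
]
top_names = ["exonB", "exonC"]

with_coordinates = matching_rows(exon_rows, top_names)
for chrom, start, end, name, _score, _strand in with_coordinates:
    print(name, chrom, start, end)
```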

Galaxy has a built-in option to view genomic data in a variety of genome browsers, but I’m not a huge fan of using it. Instead I went to the UCSC main genome browser and entered the coordinates for uc002zsw.2. The result was the following:

uc002zsw.2 and its SNPs as seen in a genome browser

In addition to the exon uc002zsw.2, we can also see the 32 SNPs that correspond to it.

Not only were we able to answer our question, we were also able to visualize our findings using a genome browser. But what if I wanted to repeat this process using a different chromosome? Or find some different information from this data? Luckily, Galaxy has a feature that lets you convert the steps in a history into a reusable workflow. This means that any analysis can be repeated in a consistent manner. In my next post I’ll be converting the AlgorithmsForLife history that was created in this walkthrough into a functioning workflow.

So far I’ve been really pleased with how easy it is to use Galaxy. I found each step of the walkthrough easy to follow and fairly intuitive. One issue I can definitely imagine encountering is finding the right tools to complete an analysis. However, I feel that once I gain a bit more proficiency that shouldn’t be a problem. As I gain more experience down the road, I can definitely see myself writing my own custom tools. Overall, my experience using this platform has been positive and I’m looking forward to using its other features.
