Genomic hacker toolkit

What do we know about human genomes after the Human Genome Project?

During the project we obtained a very long string of DNA bases, written with the four letters A, C, G and T.

It is 3,554,996,726 bases long and is divided into 25 parts, called chromosomes. Over time, the reference genome has become more precise, and today scientists use its 38th version. This long string can be downloaded from a database. It is an averaged human genome, so no person on Earth has exactly this genome. Within that long string there are special regions called genes, which occupy about 2% of the entire genome.

What are genes?

Why is the 2% of the genome that codes for genes so special? DNA is just a carrier of information, and all actions in cells are carried out by proteins, which are encoded by genes. The other 98% of the genome consists of regulatory patterns and, in places, repeats. This does not mean that these regions are less important, but for much of them we do not yet know the function.

Individual genome variability — what is it and how do we measure it?

We are all genetically different. The most common source of human variation is the single nucleotide polymorphism (SNP): one letter in our genome string differs from the letter at the same position in the reference genome. The latest database lists 88,111,767 polymorphisms that are common in the population, which means that on average about one letter in 40 of your genome can differ from the reference. How do we know this? We can read individual genomes in sequencing projects and compare them to the reference. There are different techniques for obtaining information about an individual's genomic features. In this paper we focus on the most modern one, called next-generation sequencing.
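
The "one letter in 40" figure follows directly from the two numbers quoted above, and the idea of a polymorphism can be shown on a toy string. The sketch below is only an illustration: the function names and the eight-letter example sequences are made up here, and real comparisons are done on aligned sequencing data rather than on two equal-length strings.

```python
GENOME_LENGTH = 3_554_996_726  # bases, figure quoted in the text
COMMON_SNPS = 88_111_767       # common polymorphisms, figure quoted in the text


def bases_per_snp(genome_length: int, snp_count: int) -> float:
    """Average number of bases between two common polymorphisms."""
    return genome_length / snp_count


def find_snps(reference: str, individual: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, individual base) for each mismatch.

    Assumes both strings are already aligned and have the same length.
    """
    if len(reference) != len(individual):
        raise ValueError("sequences must have the same length")
    return [
        (index + 1, ref_base, ind_base)
        for index, (ref_base, ind_base) in enumerate(zip(reference, individual))
        if ref_base != ind_base
    ]


# Roughly 40 bases per common polymorphism.
print(round(bases_per_snp(GENOME_LENGTH, COMMON_SNPS), 1))

# Toy sequences (invented for illustration): one differing letter at position 5.
print(find_snps("ACGTTGCA", "ACGTAGCA"))
```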

Sequencing — what does it mean?

Sequencing is the process of reading DNA, a biopolymer, using physicochemical methods. We can get DNA from almost any human cell. As long as we keep the sample clean, the information we acquire is about that individual's DNA. Modern techniques do not allow us to read the whole genome at once due to technical limitations. Instead, we can only read many short lines (150–200 nucleotides) that belong to the individual's genome. Modern sequencing machines produce hundreds of millions of such short lines each time they sequence an individual. As output we get a text file that weighs several tens of gigabytes. This is where bioinformatics and high-performance computing come in.
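
A back-of-the-envelope calculation shows why the output reaches tens of gigabytes. The sketch below rests on assumptions that are not stated in the text: each base is taken to cost two bytes (one letter plus one quality character), and each line is taken to carry about 50 bytes of naming overhead. Both values and the function name are illustrative only.

```python
def estimate_output_size_gb(
    read_count: int,
    read_length: int,
    bytes_per_base: float = 2.0,   # assumption: one letter + one quality character
    overhead_per_read: int = 50,   # assumption: header and line breaks per read
) -> float:
    """Rough size of the uncompressed text output, in gigabytes (10**9 bytes)."""
    total_bytes = read_count * (read_length * bytes_per_base + overhead_per_read)
    return total_bytes / 1e9


# 200 million lines of 150 letters each, under the assumptions above.
print(estimate_output_size_gb(200_000_000, 150))
```

With these assumed values the estimate comes to 70 GB, which is in line with "several tens of gigabytes"; different overheads or compression would change the figure.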

How does bioinformatics work with sequencing data?

There are several steps:

  • Filter the short lines read from the individual's genome. Not all lines are high quality; some of them are simply junk, so we filter them out.
  • Map the short lines to the reference genome. This is the most computationally heavy task. Imagine: we need to find a 200-letter line in a book 3.5 billion letters long, and we have hundreds of millions of such lines. One more thing: because the sequenced individual has their own genome polymorphisms (remember those SNPs), a line can differ from the reference by several letters.
  • Call the individual's variants and create a file that contains only the differences from the reference genome. This file is the shortest representation of an individual's genomic uniqueness.
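
The three steps above can be sketched on a toy example. This is a deliberately naive illustration, not how production tools work: the mapping step below scans every position of the reference, whereas real aligners rely on prebuilt indexes to cope with billions of letters. All function names, thresholds and sequences are hypothetical and chosen only for the demonstration.

```python
from collections import Counter, defaultdict


def filter_reads(reads, min_mean_quality=20):
    """Step 1: keep sequences whose mean quality score meets the threshold."""
    return [
        seq for seq, quals in reads
        if quals and sum(quals) / len(quals) >= min_mean_quality
    ]


def map_read(reference, read, max_mismatches=2):
    """Step 2: return the 0-based offset with the fewest mismatches, or None.

    Brute-force scan; only viable for toy inputs.
    """
    best_pos, best_mismatches = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos


def call_variants(reference, reads, min_depth=2):
    """Step 3: report (1-based position, reference base, observed base)."""
    pileup = defaultdict(Counter)  # position -> counts of observed bases
    for read in reads:
        pos = map_read(reference, read)
        if pos is None:
            continue  # read could not be placed on the reference
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1
    variants = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if base != reference[pos] and count >= min_depth:
            variants.append((pos + 1, reference[pos], base))
    return variants


# Toy reference and reads (invented). The individual carries G instead of C
# at position 10; the last read is low-quality junk.
reference = "GATTACAGGCTTAACCGTAG"
raw_reads = [
    ("ACAGGGTT", [30] * 8),
    ("AGGGTTAA", [32] * 8),
    ("GGTTAACC", [31] * 8),
    ("NNNNNNNN", [2] * 8),
]

good_reads = filter_reads(raw_reads)
variants = call_variants(reference, good_reads)
print(variants)
```

The final list plays the role of the "differences only" file: one entry per position where the individual differs from the reference.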

Here you can see a real sample: a small part of a genome containing the CPM gene, shown in a genome browser. The grey lines are the short lines that were sequenced from the sample individual and aligned to the reference genome. In the centre we can see that one letter in this individual's genome differs from the reference: G in the individual vs. A in the reference.

What do we do with those differences?

When we know what is in an individual’s genome, we can do many interesting things using knowledge from scientific research:

  • Find out whether the individual is a carrier of inherited diseases.
  • Find out the individual's predisposition to widespread diseases. For example, will this individual be more likely than others in the same population to be affected by diabetes, macular degeneration or atherosclerosis?
  • Detect the histocompatibility complex type, which can be used for transplantation.
  • Find genomic features that are unique to that individual.
  • Get individual dietary and lifestyle recommendations.
  • Reconstruct the appearance of the individual without any photos.
  • Find the most likely country of origin of the individual.

If we want to do new research, we can collect data about the individual as well as other genomic data, and run an association analysis to gain new scientific or technological knowledge. For example, how likely is an individual with a given genome to like or dislike a certain food?
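
One very simple form of association analysis compares how often a trait appears among carriers and non-carriers of a variant. The sketch below computes an odds ratio from a two-by-two table. The counts are invented purely for illustration, the function name is hypothetical, and a real study would also need significance testing and correction for population structure.

```python
def odds_ratio(
    carriers_with_trait: int,
    carriers_without_trait: int,
    noncarriers_with_trait: int,
    noncarriers_without_trait: int,
) -> float:
    """Odds of the trait among variant carriers divided by odds among non-carriers."""
    if carriers_without_trait == 0 or noncarriers_with_trait == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (carriers_with_trait * noncarriers_without_trait) / (
        carriers_without_trait * noncarriers_with_trait
    )


# Invented counts: 30 of 100 carriers dislike a food, versus 15 of 100 non-carriers.
print(round(odds_ratio(30, 70, 15, 85), 2))
```

A value above 1 suggests the trait is more common among carriers; a value near 1 suggests no association in this simple comparison.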

What is the place of the Zenome Platform in this process?

Right now, everything described above can be done only by centralized services. While you can get an interesting report about your genetics, you also give away all the information about your genome to that company, which can then use it at its own discretion: build databases from it, or sell your data to big pharma companies or other consumer companies. The Zenome platform team is creating a market for genomic data in which you are the only owner of your genome. Raw data can be uploaded and processed quickly and securely using distributed networks, and after processing your genomic data is stored privately. You can then obtain any genetics-based services you wish, such as health reports, dating services, etc. If desired, you can also take part in research or profit from your genomic information. Scientific communities will be able to get more statistical genome information, which will help with studying population structures and with determining what the "norm" is for human genomes.