Genes are the fundamental units of inheritance in living organisms. Together, they hold all the information necessary to reproduce a given organism and to pass on genetic traits to its offspring.
Biologists have long debated what constitutes a gene in molecular terms, but one useful definition is a region of DNA that carries the code necessary to make a molecular chain called a polypeptide. These chains link together to form proteins and so are the bricks and mortar out of which all organisms are constructed.
Given this crucial role, it is no surprise that an ongoing goal in biology is to work out the total number of protein-coding genes necessary to construct a given organism. Biologists think the yeast genome contains about 5300 coding genes and a nematode worm genome contains about 20,470.
But the number for humans has been the subject of constant revision since biologists first began the task of estimating it in the 1960s. Then, they believed humans could have as many as 2 million protein-coding genes. But by the time the human genome project began in the late 1990s, the highest estimates put the number at 100,000 and the number has continued to shrink.
In 2001, the initial sequence of the human genome cut the figure dramatically. The International Human Genome Sequencing Consortium put it at 30,000 while a rival group led by Craig Venter estimated the number at 26,000.
In 2004, the final draft of the human genome reduced the figure even further to around 24,500 and in 2007 further analysis suggested that it was more like 20,500.
And that’s where the figure has sat. Until now.
Today the figure is shrinking yet again thanks to the work of Iakes Ezkurdia at the Centro Nacional de Investigaciones Cardiovasculares and Michael Tress at the Spanish National Cancer Research Centre, both in Madrid, along with a few pals.
In a paper submitted to Molecular Biology and Evolution, these guys say the true number of coding genes in the human genome is probably closer to 19,000.
The task of spotting protein-coding genes is by no means easy. The best method is to take a sample from a cell, ionise the proteins it contains and send them through a mass spectrometer. The proteins in the sample can then be determined by matching the measured masses to the predicted protein masses.
The idea is to identify all the proteins from which an organism is made and therefore all the genes that code for them. But this is only possible with samples from all possible cell types and in practice this is hard to gather.
As a result, many proteins—and their corresponding genes—can be missed because they are not present in the sample, are technically hard to spot or degrade very quickly. And just because nobody has found the protein for a given gene, that doesn’t mean it doesn’t exist.
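The mass-matching step described above can be sketched in a few lines of Python. This is a toy illustration only, not the pipeline used in the paper: the function name, the 10 parts-per-million tolerance and the protein masses are all assumptions invented for the example, and real proteomics software is considerably more involved.

```python
from bisect import bisect_left, bisect_right


def match_masses(measured, predicted, tolerance_ppm=10.0):
    """Map each measured mass to the predicted proteins within tolerance.

    `predicted` is a dict of protein name -> predicted mass (daltons).
    The tolerance is relative, in parts per million (an assumed value).
    """
    # Sort predicted entries by mass so each lookup is a binary search.
    entries = sorted((mass, name) for name, mass in predicted.items())
    masses = [mass for mass, _ in entries]

    matches = {}
    for m in measured:
        window = m * tolerance_ppm / 1e6
        lo = bisect_left(masses, m - window)
        hi = bisect_right(masses, m + window)
        matches[m] = [entries[i][1] for i in range(lo, hi)]
    return matches


# Hypothetical masses, made up for illustration.
predicted = {
    "protein_A": 15000.00,
    "protein_B": 22000.50,
    "protein_C": 22000.60,
}
measured = [15000.05, 22000.55, 30000.00]

result = match_masses(measured, predicted)
for mass, names in result.items():
    print(mass, names or "no match")
```

The toy data shows both problems in miniature: one measured mass fits two different predicted proteins, and another fits none at all, which says nothing about whether the corresponding gene exists.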
So Ezkurdia, Tress and co had to find another way to identify genes that probably don’t code for proteins. They began by combining the results from seven large-scale mass spectrometry analyses of human proteins from a wide range of cell types. They identified some 12,000 genes from these samples.
These guys then analysed these genes, looking for common factors that make them easy to spot. It turns out that the key factors are the age of the gene family and whether it is also found in other vertebrates such as mouse, chimpanzee or dog. “Ancient genes are generally widely expressed and often retain important housekeeping roles,” they say, which is why they are easily found.
That gave them an idea for spotting genes that are unlikely to code for proteins—just filter out the human genes that are not present in other species and do not have a structure likely to code for a protein.
So they analysed the remaining 8000 unaccounted-for genes and selected those that met these criteria. That produced a list of 2000 genes which they then studied in more detail.
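The filtering idea can be sketched as follows. This is a minimal illustration of the two criteria as described here, not the authors' actual method: the `Gene` record, its field names and the example genes are all hypothetical, and in practice each yes/no flag would come from a detailed comparative and structural analysis.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    conserved_in_other_species: bool   # hypothetical flag
    has_protein_like_structure: bool   # hypothetical flag


def flag_possibly_noncoding(genes):
    """Keep genes that fail both tests: absent from other species
    and lacking a structure likely to code for a protein."""
    return [
        g for g in genes
        if not g.conserved_in_other_species
        and not g.has_protein_like_structure
    ]


# Made-up genes with no protein evidence, for illustration only.
unaccounted = [
    Gene("gene_1", conserved_in_other_species=True, has_protein_like_structure=True),
    Gene("gene_2", conserved_in_other_species=False, has_protein_like_structure=True),
    Gene("gene_3", conserved_in_other_species=False, has_protein_like_structure=False),
]

candidates = flag_possibly_noncoding(unaccounted)
print([g.name for g in candidates])
```

Only genes that fail both tests are flagged, so a gene found only in humans survives the filter if it still looks like it could code for a protein.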
They found they had protein evidence for fewer than 6 per cent of these genes (compared with 60 per cent of the genome as a whole). That “suggested that many of these genes might not code for proteins under normal circumstances,” they say.
Consequently, many of these should be withdrawn from the list of protein-coding genes in the genome. “We believe that this evidence suggests that as many as 1,500 genes do not code for proteins.”
That’s an interesting result that is partly a reflection of the state of genomics. The human genome is by no means fully defined and biologists are still in the process of refining their gene models and withdrawing genes in the process.
Indeed, in the most recent update of the genome release, geneticists have withdrawn 328 of the 2000 genes that Ezkurdia, Tress and co identify as potentially non-coding.
And on this evidence, the human genome is set to get smaller still. “Our evidence suggests that the final number of true protein coding genes in the reference genome may lie closer to 19,000 than to 20,000.”
Which means that humans have fewer protein-coding genes even than nematode worms.
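For readers who want to check the sums, here is the arithmetic behind these figures as a short Python sketch. The numbers are the ones quoted in this article; treating the 2007 estimate of 20,500 as the baseline is an assumption about how they fit together, not something the paper spells out in these terms.

```python
# Figures quoted in the article.
baseline_2007 = 20_500        # estimate from the 2007 analysis
possibly_noncoding = 1_500    # "as many as 1,500 genes"
nematode_genes = 20_470       # nematode worm coding genes
detected = 12_000             # genes with mass spectrometry evidence
flagged = 2_000               # genes flagged as potentially non-coding
already_withdrawn = 328       # flagged genes withdrawn in the latest release

# Removing the suspect genes from the baseline (assumed reading).
revised = baseline_2007 - possibly_noncoding

# Share of the whole gene set with protein evidence.
share_detected = detected / baseline_2007

# Share of the flagged genes already withdrawn.
share_withdrawn = already_withdrawn / flagged

print(f"Revised estimate: {revised:,}")
print(f"Fewer than the nematode: {revised < nematode_genes}")
print(f"Genes with protein evidence: {share_detected:.1%}")
print(f"Flagged genes already withdrawn: {share_withdrawn:.1%}")
```

The share of genes with protein evidence comes out a little under 60 per cent, in line with the rounded figure quoted above.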
Geneticists long ago debunked the idea that more complex organisms require more genes. The water flea, for example, has 31,000 genes, the most in any animal, while the organism with the largest genome is thought to be Paris japonica, a rare flowering plant native to Japan.
The fact that the human genome is so parsimonious raises an interesting question. What exactly is it about the human genome that gives rise to our staggering complexity, in the brain for example, compared to other animals such as monkeys, worms or even water fleas?
A good answer to that question will win prizes!
Ref: arxiv.org/abs/1312.7111 : The Shrinking Human Protein Coding Complement: Are There Fewer Than 20,000