It is time to replace genotyping arrays with sequencing

tl;dr: for discovery of genetic variants associated with traits, sequencing outperforms genotyping arrays and costs less.

Over the last ten years, the field of human genomics has made remarkable progress in identifying genetic variants that influence disease susceptibility and other traits (see, e.g., this review). The technological advance that drove this progress was the development of genotyping microarrays: a technology for measuring hundreds of thousands to millions of genetic variants in a single individual.

The benefits and limitations of genetic studies using this technology (often confusingly called genome-wide association studies [1]) have been debated since before anyone even tried one (see examples from 2000, 2008, and 2017). But it’s generally believed that, if one accepts the basic goals and assumptions of a genome-wide association study, the most cost-effective and powerful technology to use is a genotyping array.

In this post, I argue that this is no longer the case: genotyping arrays are now less effective and more expensive than sequencing technologies.

The known limitations of genotyping arrays

Before getting into the details of this argument, it’s worth remembering the basic goals of most genetic association studies.

In general, if you’re running an association study, you are interested in identifying genetic variants that influence risk of a disease, and you have DNA from thousands of people with the disease and thousands of people without it. The goal is to scan across the entire 3.3 billion bases of the human genome to find sites that differ across these two groups.

The logic of genotyping arrays is as follows: it is obviously not cost-effective to sequence all 3.3 billion bases in thousands of people, since most of the genome will be identical in everyone. Instead, you can choose a set of perhaps 500,000 to 2 million sites (well under 0.1% of the genome) and measure just those. If you choose those sites well (for example, by focusing on sites known to vary across people), you get much of the benefit of looking at the whole genome for a fraction of the cost.
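To make the "fraction of the genome" claim concrete, here is a minimal sketch of the arithmetic, using only the figures quoted above (3.3 billion bases, 500,000 to 2 million array sites):

```python
# Back-of-the-envelope: what fraction of the genome does an array measure?
# All numbers are the ones quoted in the text above.
GENOME_BASES = 3.3e9      # total bases in the human genome
ARRAY_SITES_LOW = 5e5     # a sparse array
ARRAY_SITES_HIGH = 2e6    # a dense array

for sites in (ARRAY_SITES_LOW, ARRAY_SITES_HIGH):
    fraction = sites / GENOME_BASES
    print(f"{sites:,.0f} sites -> {fraction:.4%} of the genome")
# Even the dense array covers about 0.06% of the genome, i.e. well under 0.1%.
```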

The counterargument is that, presumably, the reason you're doing the study is that you don't know ahead of time which sites to look at. Though there are ways to mitigate this problem, it ultimately leads to the key limitations of genotyping arrays:

  1. Since you have to know that a genetic variant exists before you can measure it, it is difficult or impossible to detect rarer variants, which have turned out to be important for a number of diseases and traits.
  2. Array design is necessarily biased towards better-studied populations; for example, arrays based on genetic variants discovered in European populations don’t perform well in African populations.

The well-known solution to both of these problems is to use a technology that doesn't require deciding ahead of time which genetic variants to look at, and the natural choice is some form of whole-genome sequencing. At Gencove we've developed our own ultra-low-coverage sequencing assay, and as the schematic below shows, even very low-coverage genome sequencing gives a more complete picture of the genome than a genotyping array.

A comparison of the data generated by a genotyping array and ultra-low-coverage sequencing. The x-axis represents 6,000 bases of the human genome, and the y-axis represents 600 people, 300 assayed with a genotyping array and 300 with sequencing. Black represents the positions measured with the array, while blue represents positions measured with sequencing (see inset).

The conventional wisdom is that there is a tradeoff between genotyping arrays and sequencing: genotyping arrays let you inexpensively cover all of the known variation in the genome, while sequencing lets you identify new variants at a higher cost. But this conventional wisdom is wrong.

Sequencing approaches outperform genotyping arrays even at common variation

Genotyping arrays were not designed to measure rare or population-specific variants, so it’s hardly a surprise that they don’t have these features. What might be surprising is that many genotyping arrays don’t profile known variation particularly well either.

When we were developing the ultra-low-coverage sequencing assay that we use at Gencove, we performed a number of simulations comparing the power of imputation-based genetic studies using genotyping arrays to those using ultra-low-coverage sequencing [2].

This is exactly the situation genotyping arrays were designed for. Based on results from a few years ago from my colleague Bogdan Pasaniuc, we expected ultra-low-coverage sequencing to perform somewhat worse than genotyping arrays, but acceptably so given the added benefits of sequencing in other contexts.

Instead, the results looked more like those in the figures below: in a Nigerian population, for example, 0.2x coverage sequencing outperforms the Illumina CoreExome and Global Screening Array chips across the entire allele frequency spectrum. Importantly, this is not expensive: a quick back-of-the-envelope calculation [3] shows that the sequencing costs of 0.2x coverage will soon be under $10 (though other costs then start to dominate).

Comparison of average genotype accuracy after imputation in an African population, using genotyping arrays and ultra-low-coverage sequencing. On the x-axis are genetic variants in bins of different minor allele frequencies, and on the y-axis is the average r² after imputation to the true genotypes.
Comparison of average genotype accuracy after imputation in a European population, using genotyping arrays and ultra-low-coverage sequencing. On the x-axis are genetic variants in bins of different minor allele frequencies, and on the y-axis is the average r² after imputation to the true genotypes.

In the comparisons above, we used the genotyping arrays with the fewest markers. Recently, a paper from the Zeggini lab at the Sanger Institute appeared on bioRxiv comparing higher-coverage sequencing (about 1x) with denser genotyping arrays.

The take-home messages from this paper are in line with our calculations:

  1. Sequencing increases power compared to genotyping arrays for the discovery of associations between genetic variants and traits. In their application:

> Of the 54 association signals arising from genome-wide association analysis of 1x [whole genome sequencing] variants with 25 haematological traits, only 57% are recapitulated by the imputed [array] results in the same samples.

  2. Sequencing, the technology with more power, is less expensive than genotyping arrays:

> As of January 2017, 1x WGS on the HiSeq 4000 platform was approximately half of the cost of a dense GWAS array (e.g. Illumina Infinium Omni 2.5Exome-8 array) [and] 1.5 times the cost of a sparser chip such as the Illumina HumanCoreExome array

For the purposes of discovering new genetic associations to traits and diseases, there actually is no tradeoff: sequencing approaches are more powerful and less expensive.

What next?

The switch from genotyping technologies to sequencing technologies was inevitable, but methods for low-coverage and ultra-low-coverage sequencing have accelerated the timescale for this switch considerably. How might one plan for large-scale human genomics studies going forward?

  1. For association studies that want the most power at the lowest cost, ultra-low-coverage sequencing (around 0.2–0.4x) can take the place of sparse genotyping arrays.
  2. For association studies that have higher per-sample budgets, low-coverage sequencing (around 1–2x) can take the place of dense genotyping arrays.
  3. For studies where the goal is not to identify new associations, but rather to examine a set of known associations in a large number of people, genotyping arrays remain a cost-effective choice.


[1] Obviously a whole-genome sequencing study would be a truly genome-wide association study, but for historical reasons the term GWAS seems to be permanently linked to the specific technology of genotyping arrays.

[2] The simulations involve splitting the 1000 Genomes Phase 3 reference panel in half, extracting the SNPs on the arrays (or simulating reads at a given depth) for half of the samples, and then using the other half as a reference panel for imputation. Since the reference panel is half the size it would be in practice, the absolute imputation accuracy for each technology is underestimated, though this should affect all technologies equally.
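The accuracy metric plotted in the figures (average r² after imputation, binned by minor allele frequency) can be sketched with toy data. This is not the actual simulation pipeline (which uses the 1000 Genomes panel and a real imputation method); the genotype matrix and "imputed" dosages below are synthetic stand-ins just to show the metric:

```python
import numpy as np

rng = np.random.default_rng(0)

def r2_by_maf_bin(true_genotypes, imputed_dosages, bins):
    """Average squared correlation (r^2) between imputed dosages and true
    genotypes, computed within bins of minor allele frequency (MAF)."""
    freqs = true_genotypes.mean(axis=0) / 2.0     # allele frequency per site
    maf = np.minimum(freqs, 1.0 - freqs)          # fold to the minor allele
    results = {}
    for lo, hi in bins:
        r2s = []
        for j in np.where((maf >= lo) & (maf < hi))[0]:
            t, d = true_genotypes[:, j], imputed_dosages[:, j]
            if t.std() > 0 and d.std() > 0:       # skip monomorphic sites
                r2s.append(np.corrcoef(t, d)[0, 1] ** 2)
        results[(lo, hi)] = float(np.mean(r2s)) if r2s else float("nan")
    return results

# Toy data: 300 "people" genotyped (0/1/2 copies) at 1,000 sites;
# "imputed" dosages are the truth plus noise, standing in for a real imputer.
site_freqs = rng.uniform(0.05, 0.5, size=1000)
truth = rng.binomial(2, site_freqs, size=(300, 1000))
imputed = truth + rng.normal(0, 0.3, size=truth.shape)
print(r2_by_maf_bin(truth, imputed, bins=[(0.0, 0.1), (0.1, 0.5)]))
```

Rarer sites (the low-MAF bin) tend to show lower r², which is why the figures are binned by allele frequency in the first place.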

[3] Very back-of-the-envelope: if a high-quality 30x genome costs $1,000, then the sequencing costs of a 0.2x genome should be less than $7 (!). The costs of running the assay are dominated by other factors like library preparation and DNA extraction rather than sequencing per se.
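The footnote's arithmetic, spelled out (assuming, as the footnote does, that sequencing cost scales linearly with coverage):

```python
# Very back-of-the-envelope, using the figures from the footnote:
# a high-quality 30x genome costs $1,000 in sequencing.
COST_30X = 1000.0
cost_per_1x = COST_30X / 30        # roughly $33 per 1x of coverage
cost_0p2x = 0.2 * cost_per_1x      # the ultra-low-coverage assay
print(f"0.2x sequencing cost: ${cost_0p2x:.2f}")  # under $7
```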
