How Much Dirt is Too Much Dirt — Quality Metrics in Gene Expression Analysis

by: Aaron Daugherty, PhD
Manager, Discovery Software

At twoXAR we bring together a lot of disparate data to rapidly identify disease treatments, and it is from this variety that we gain our predictive power. However, more data isn’t always better, especially when the new data is of poor quality. Quantity doesn’t trump quality, as the common data science saying goes: bad data in = bad data out. For that reason, we check the quality of our input data at multiple levels; some of this is a manual process, but we automate as much as possible.

In July’s post, (ML)²: Myths and Legends of Machine Learning, I touched on the messiness of real-world data and mentioned quality control checks. Here, I will expand on that with an example of one of the checks we use for gene expression data (need a refresher on gene expression? Check out my post from a couple of years back).

Gene expression data is a powerful tool for investigating disease agnostically, but it’s also notoriously noisy. To make sure we work with only high-confidence data, we do two things. First, we always review the methods used to collect the data. Second, our pipeline includes a series of objective metrics as quality checks. We use our own custom software where necessary, but we also make use of existing methods. A good example, and a package that is really helpful for us, is arrayQualityMetrics (AQM).

Coming out of the Huber Group at EMBL, AQM is easy to use and plugs easily into our microarray-processing pipeline and, with it, our platform. But what does it actually do? Quoting the original abstract, AQM “provides powerful, automated, objective and comprehensive instruments on which to base a decision” on the quality of a microarray. It does this by measuring several aspects of a microarray’s quality (e.g. reproducibility and distance from the other microarrays in the dataset) and automatically detecting outliers using not one, but three different measures. Have we mentioned that we at twoXAR are big fans of overlapping evidence? We’re also really big fans of being able to drill down and understand every step in our process: no black boxes! This is what really sets AQM apart from other methods: it automatically produces detailed HTML reports. Here are a couple of images from running AQM on our platform:

AQM’s sound underlying metrics and analyses, together with its interpretable plots, make reviewing the quality of a microarray easy, fast, and consistent. Automatically detecting the samples that are too dirty (i.e. overwhelmed with technical variation) ensures our scientists aren’t wasting their time on unusable data. As a result, we quickly get ‘cleaner’ data, which ultimately helps us rapidly identify our best treatment candidates.
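
For readers who want a feel for what “distance from other microarrays” and automatic outlier detection can look like in practice, here is a minimal Python sketch. It is not AQM’s code (AQM is an R package) and it is not our pipeline. It is a simplified illustration of one idea: score each array by the sum of its mean absolute differences to every other array, then flag scores above a boxplot-style threshold. The function names, the toy intensity values, and the 1.5 × IQR cutoff are all assumptions made for the example.

```python
from statistics import quantiles


def mean_abs_distance(a, b):
    """Mean absolute difference between two arrays' log-intensity values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def distance_sums(arrays):
    """For each array, sum its distances to every other array in the dataset."""
    n = len(arrays)
    return [
        sum(mean_abs_distance(arrays[i], arrays[j]) for j in range(n) if j != i)
        for i in range(n)
    ]


def flag_outliers(scores, k=1.5):
    """Flag scores above Q3 + k * IQR (a common boxplot-style rule; k is an assumption)."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    threshold = q3 + k * (q3 - q1)
    return [s > threshold for s in scores]


# Hypothetical toy data: six arrays, five probes each, log2 intensities.
arrays = [
    [7.1, 8.0, 6.5, 9.2, 7.7],
    [7.0, 8.1, 6.4, 9.3, 7.8],
    [7.2, 7.9, 6.6, 9.1, 7.6],
    [7.1, 8.2, 6.5, 9.2, 7.9],
    [7.0, 8.0, 6.3, 9.4, 7.7],
    [9.5, 5.2, 8.8, 6.1, 10.3],  # deliberately "dirty" array
]

scores = distance_sums(arrays)
flags = flag_outliers(scores)

for idx, (score, is_outlier) in enumerate(zip(scores, flags)):
    label = "OUTLIER" if is_outlier else "ok"
    print(f"array {idx}: distance sum = {score:.2f} -> {label}")
```

On this toy dataset only the last array is flagged. A real check would run on thousands of probes per array and, in the spirit of overlapping evidence, would be combined with other independent measures rather than trusted on its own.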