How to run statistics inside BigQuery
Big data is hard. Statistics is even harder. Doing statistics on big data is mind-blowingly hard. We are going to share some tools that make statistics on big data, if not easy, at least possible.
Collaborators:
Ian Mathews, Redivis; Boris Aguilar, Institute for Systems Biology
Data scientists, and increasingly traditional scientists, are faced with enormous datasets. One of the most challenging parts of exploring large datasets is applying standard statistical tools at scale. Here we hope to provide tools to run statistics in situ on Google Cloud BigQuery (BQ) using User Defined Functions (UDFs) and Stored Procedures. The examples are shared as part of the Google Cloud Platform bigquery-utils repository, under the name Statslib.
Notebook
We illustrate this work in a Jupyter notebook hosted on the ISB-CGC GitHub repository here.
Introduction
The growth of big data has been exponential, with the amount of data created in the world doubling every two years. This growth is driven by a number of factors, including the increasing number of connected devices, the growing popularity of social media, and the expanding use of sensors and other technologies to generate data.
The increasing number of connected devices is one of the main drivers of big data growth. The number of connected devices was predicted to reach 50 billion by 2020, but the true number is in the trillions [1]. These devices generate a huge amount of data, which needs to be stored and analyzed.
The growing popularity of social media is another major driver of big data growth. Social media platforms like Facebook, Twitter, and Instagram generate a huge amount of data, which can be used to track user behavior and preferences. This data can be used to target advertising, to improve customer service, and to develop new products and services.
Situation
In a well-ordered system, this big data is captured in a database of some kind, ideally in a SQL-queryable data warehouse such as Google Cloud BigQuery [2]. This is a great start to managing massive datasets. But, depending on the type of analysis you want to do, you may be forced to extract data from the warehouse for detailed analysis. A typical process for statistical analysis is to query a representative dataset from your database, export it as CSV, and import it into RStudio (though you could substitute any popular data science platform, such as Python and NumPy).
This time-tested workflow breaks down when the size of the data exceeds what can usefully be pulled into R. In theory, a 64-bit executable can address up to 8 TB (if you have that much RAM available!), but in practice several GBs is the limit you will find in an interactive RStudio session. We can sometimes work around these limits by reducing the large dataset to a representative analytical cohort, or by breaking it into smaller pieces; see the sampling sketch below. These sometimes-viable (if labor-intensive) methods can often get us through a large dataset, and careful construction of queries can make them more efficient.
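One lightweight way to build that representative cohort without leaving the warehouse is BigQuery's TABLESAMPLE clause. Here is a minimal sketch against a large public table (the table and the one-percent rate are arbitrary choices for illustration):
# Pull an approximate 1% sample of a large table before export.
# TABLESAMPLE SYSTEM samples storage blocks, so the result is approximate.
SELECT *
FROM `bigquery-public-data.samples.natality` TABLESAMPLE SYSTEM (1 PERCENT)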
Better solution
A better approach, and the point of this post, is to run our statistical analysis directly where the data reside — in our case, making use of BQ UDFs to run statistical analysis directly in BigQuery, without exporting any data until the analysis is complete. Whoa, I can hear you say: “that sounds hard”. Well, here is an example running a t-test on the classic Iris dataset. You can run this query directly on BQ, and if you don’t already have a project on Google Cloud, you can set up a BigQuery sandbox to try it out.
SELECT bqutil.fn.t_test(
  ARRAY(SELECT sepal_length FROM `bigquery-public-data.ml_datasets.iris` WHERE species = "versicolor"),
  ARRAY(SELECT sepal_length FROM `bigquery-public-data.ml_datasets.iris` WHERE species = "virginica")
)
The query runs the bigquery-utils UDF t_test. The output of this query provides the t-value and the degrees of freedom (DOF).
If you recall your college stats course, these values will be familiar to you. The t-value is a measure of the dissimilarity of two distributions. If you review the chart below (assuming you are familiar with box plots), you can see that the distributions of sepal length for the versicolor and virginica species are quite different. As an exercise for the reader, try running the comparison between the versicolor and setosa species; in that case, you should see a t-value of over 10, which makes sense based on the chart below.
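If you want to try that exercise, the query is the same as above with one species name swapped:
SELECT bqutil.fn.t_test(
  ARRAY(SELECT sepal_length FROM `bigquery-public-data.ml_datasets.iris` WHERE species = "versicolor"),
  ARRAY(SELECT sepal_length FROM `bigquery-public-data.ml_datasets.iris` WHERE species = "setosa")
)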
A more complicated example
Admittedly, we could easily compute summary statistics on the Iris dataset using conventional, in-memory methods. However, the power of statistical tests in BigQuery really comes through when working with larger datasets.
The Cancer Genome Atlas (TCGA) is a cancer genomic program with the overarching goal of applying high-throughput genome analysis techniques to improve the ability to diagnose, treat, and prevent cancer through a better understanding of the molecular basis of the disease. The project molecularly characterized over 20,000 primary cancer samples spanning 33 cancer types. TCGA generated over 2.5 petabytes of genomic, epigenomic, transcriptomic, proteomic, and clinical data, a valuable resource for cancer research. Recently, ISB-CGC transformed and loaded the TCGA data into a publicly available set of BigQuery tables which will be used in the following examples.
For the purposes of this discussion, we focus on two elements of the TCGA dataset: patient clinical data and protein expression (an effective measure of how much protein is produced from a particular gene in a patient sample). We further focus on the breast cancer project within TCGA, named "TCGA-BRCA". For more information on the details of this data, please have a look at another publication here.
Our challenge here is to identify genes whose protein expression shows significant changes across clinical features. As in the iris example above, we are looking for populations (groups) whose distributions do not appear to be similar. Significance is measured by the probability that the groups are drawn from the same overall distribution. You can observe this in the plot below, where the orange box plot is clearly out of line with the other distributions. The statistical method used here (the Kruskal-Wallis test) reports p = 0.0, indicating it is vanishingly unlikely that the distributions are the same.
The Kruskal-Wallis test
The Kruskal-Wallis test is a non-parametric test of whether three or more groups come from the same distribution. It is a generalization of the Mann-Whitney U test, which compares two groups, and is used when the assumption of normality is not met or when the samples are small. Because the test is based on the ranks of the data rather than the raw values, it makes no assumptions about the shape of the underlying distributions. The null hypothesis is that the samples come from the same distribution; the alternative hypothesis is that they come from different distributions. Now that we have that cleared up, let's see how to use it.
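Before we turn to the real data, a toy call makes the interface clear. The UDF takes an array of (group label, value) pairs and returns a struct with fields such as the H statistic, the p-value, and the degrees of freedom; the literal values below are invented purely for illustration:
# Three toy groups ('a', 'b', 'c') with made-up measurements.
SELECT `bqutil.fn.kruskal_wallis`([
  ('a', 1.0), ('b', 2.0), ('c', 2.3),
  ('a', 1.4), ('b', 2.2), ('c', 5.5),
  ('a', 1.1), ('b', 2.1), ('c', 4.8)
]) AS reso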
The query is (admittedly) a little complicated:
SELECT protein.project_short_name AS study,
       gene_name AS gene_name,
       clinical.feature.key AS feature,
       `bqutil.fn.kruskal_wallis`(ARRAY_AGG((clinical.feature.value, protein.protein_expression))) AS reso
FROM `isb-cgc.TCGA_hg19_data_v0.Protein_Expression` protein
JOIN `isb-cgc-bq.supplementary_tables.Abdilleh_etal_ACM_BCB_2020_TCGA_bioclin_v0_Clinical_UNPIVOT` clinical
  ON clinical.case_barcode = SUBSTR(protein.sample_barcode, 0, 12)
WHERE clinical.feature.value != 'null'
  AND protein.project_short_name = 'TCGA-BRCA'
GROUP BY study, gene_name, feature
HAVING reso.DoF > 3 AND reso.DoF < 10 # AND reso.p <= 0.001
ORDER BY study, reso.p, feature
Again, you can run this directly in BigQuery (or in your BigQuery sandbox). Recently, ISB-CGC, a component of the National Cancer Institute Cancer Research Data Commons, made several cancer-related datasets publicly available in BQ, including the TCGA dataset, which allows us to make easy use of the data. The query JOINs a table of protein expression values, keyed by gene name, with a table of clinical data. It analyzes every combination of gene and clinical feature, running the data through the bqutil.fn.kruskal_wallis UDF to determine whether there is a significant relationship between protein expression and clinically relevant features.
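One detail worth calling out is the join key: TCGA sample barcodes embed the case barcode in their first twelve characters, which is why the query joins on SUBSTR(protein.sample_barcode, 0, 12). A quick look at a few rows confirms the mapping:
# Sanity-check the barcode-to-case mapping used by the JOIN.
SELECT DISTINCT sample_barcode,
       SUBSTR(sample_barcode, 0, 12) AS case_barcode
FROM `isb-cgc.TCGA_hg19_data_v0.Protein_Expression`
WHERE project_short_name = 'TCGA-BRCA'
LIMIT 5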
Although we don’t cover this in detail, as shown in the figure below, the BigQuery Execution Graph can provide insight into query performance. In particular, data bottlenecks due to joins and other operations can be visualized.
The key line of SQL is this one:
`bqutil.fn.kruskal_wallis`(ARRAY_AGG((clinical.feature.value, protein.protein_expression))) AS reso
The output provides each combination of gene name and clinical feature, along with its p-value, and the query sorts the results by p-value. The results show that the feature "histological_type" appears in most of the cases, but "country" (country of origin) is also a significant factor in the data for the gene "GATA6". None of this analysis is intended to be definitive research, but it points to a method for analyzing large amounts of data.
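As a next step, you can enable the commented-out predicate in the HAVING clause above to keep only the strongly significant results (the 0.001 cutoff comes from that comment and is not a universal threshold):
HAVING reso.DoF > 3 AND reso.DoF < 10 AND reso.p <= 0.001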
A question of size
It may be possible to extract all the data involved in this particular query and run the analysis in RStudio, but some datasets would simply be too large. It also takes only seconds to run this analysis on a large set of data in BigQuery. Using SQL UDFs to run statistical significance tests is a promising way to apply statistics to big data.
To look specifically at this query, the export required to get this data into RStudio would be over 6M rows, nearly 360MB in size. At present, the BigQuery UI will only allow you to export up to 10MB to a local file. We're well past that, so your only option is to export to Google Drive, and that export alone can take almost 30 seconds.
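If you'd like to verify that row count yourself, a simple count over the same join (the earlier query with the aggregation stripped out) shows the size of the would-be export:
# Count the rows that would have to be exported for offline analysis.
SELECT COUNT(*) AS rows_to_export
FROM `isb-cgc.TCGA_hg19_data_v0.Protein_Expression` protein
JOIN `isb-cgc-bq.supplementary_tables.Abdilleh_etal_ACM_BCB_2020_TCGA_bioclin_v0_Clinical_UNPIVOT` clinical
  ON clinical.case_barcode = SUBSTR(protein.sample_barcode, 0, 12)
WHERE clinical.feature.value != 'null'
  AND protein.project_short_name = 'TCGA-BRCA'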
It’s pretty clear that running the analysis where your data is (inside BigQuery) is the winning strategy.
Conclusion
We’ve demonstrated a bit of what you can do with the Statslib library, part of the BigQuery Utils package. Below is the list of tests and tools we have already implemented. We are looking for collaborators: if you are interested in contributing, either make a pull request or reach out directly to the team.