<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Nicholas Parker on Medium]]></title>
        <description><![CDATA[Stories by Nicholas Parker on Medium]]></description>
        <link>https://medium.com/@njparker_8716?source=rss-76a19323c6b6------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/0*0xx8HZhMpJYvA6a7</url>
            <title>Stories by Nicholas Parker on Medium</title>
            <link>https://medium.com/@njparker_8716?source=rss-76a19323c6b6------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Fri, 15 May 2026 15:52:32 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@njparker_8716/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Visualizing High-Dimensional Microbiome Data]]></title>
            <link>https://medium.com/data-science/visualizing-high-dimensional-microbiome-data-eacf02526c3a?source=rss-76a19323c6b6------2</link>
            <guid isPermaLink="false">https://medium.com/p/eacf02526c3a</guid>
            <category><![CDATA[genomics]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[bioinformatics]]></category>
            <category><![CDATA[microbiome]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Nicholas Parker]]></dc:creator>
            <pubDate>Thu, 25 Jun 2020 02:59:45 GMT</pubDate>
            <atom:updated>2020-06-25T18:45:04.117Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7WRL6Wuh4dzihtDLzGrmJA.jpeg" /><figcaption>Source: <a href="https://wallpapercave.com/dna-background">wallpapercave.com/dna-background</a></figcaption></figure><h4>Part 2 — Genomic Data Science Series</h4><p>This article is part of a tutorial series on applying machine learning to bioinformatics data:</p><p><a href="https://towardsdatascience.com/analyzing-microbiome-data-320728b56b8e">Part 1 — Data Acquisition &amp; Preprocessing</a></p><p>Part 2 — Dimensionality Reduction</p><p>To follow along, you can either download our Jupyter notebook <a href="https://github.com/njparker1993/Microbiome_Machine_Learning/blob/master/geography_data_clustering_visualizations.ipynb">here</a>, or continue reading and typing in the following code as you proceed through the walkthrough.</p><h3>Introduction</h3><p>Unsupervised machine learning methods allow us to understand and explore data in situations where we are not given explicit labels. One family of unsupervised methods is clustering. Getting a general idea of the groups or clusters of similar data points can reveal underlying structural patterns in our data, such as geography, functional similarities, or communities, when we otherwise would not know this information beforehand.</p><p>We will be applying our dimensionality reduction techniques to microbiome data acquired from UCSD’s Qiita platform. If you haven’t already done so, see <a href="https://towardsdatascience.com/analyzing-microbiome-data-320728b56b8e">Part 1</a> of this tutorial series for how to acquire and preprocess your data, or alternatively download our notebook for that section <a href="https://github.com/njparker1993/Microbiome_Machine_Learning/blob/master/Microbiome_Data_Acquisition.ipynb">here</a>. We will need this before moving on. In short, the columns of our microbiome dataset represent counts of the bacterial DNA sequences present, and the rows represent samples of individual communities of bacteria. This type of data table can be created from Illumina NGS sequencing data after a variety of bioinformatic cleaning and transformation steps. We expect samples from different environments to have different microbial signatures, since bacterial communities are shaped by their environment. The data we worked on for this article are samples taken from Toronto, Flagstaff, and San Diego, which should be distinct from one another. We hope to visualize this difference, which is hidden somewhere in their bacterial composition.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*T9gWWizCItrwHSId7CoW6A.png" /><figcaption>Bacterial communities are expected to be unique across the three different locations, and we hope to visualize that through the high-dimensional microbiome data. Image Source: Pexels, modified by user.</figcaption></figure><p>To visualize these complex, sparse, and high-dimensional metagenomic data as something our eyes can interpret on a two-dimensional computer screen, we will need to drastically reduce the number of dimensions, or in other words, the number of features in our data. Rather than the 25,000 columns of our dataset, which currently represent a portion of the genetic sequence of each organism and their counts in our microbiome, we instead would like a notion of “the most important features” to plot. 
This article explores 3 different dimensionality reduction and visualization techniques applied to microbiome data and explains what these visualizations can tell us about the structure inherent in our data.</p><p>All visualizations were produced in Python with the Matplotlib and Seaborn plotting packages, and pandas was used for data frame construction.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ba9a6f33c0cfc6e1e060b840855aeca3/href">https://medium.com/media/ba9a6f33c0cfc6e1e060b840855aeca3/href</a></iframe><p>For demonstration purposes, since we actually do have labels for this data set, we can confirm whether or not our data set produces nice visualizations by assigning a different color to each point corresponding to a different geographic location. In reality, you will often not have such labels if you are taking the unsupervised machine learning approach.</p><h3>PCA</h3><p>Our first dimensionality reduction technique, and one of the most commonly used, is Principal Component Analysis (PCA). PCA attempts to reduce the feature space down to representations of the variation found within the data. It does this by taking all of your data points and rotating them onto an axis that captures the maximum amount of variability. That axis is known as your ‘first principal component’. Mathematically speaking, this line passes through the centroid of your data while minimizing the squared distance of each point to the line; it is also the axis along which the data vary the most. After re-aligning our data, we collapse all data points onto that dimension. Once this step is done, we rinse and repeat, keeping in mind that each new principal component will always be perpendicular to the previous one. See here (<a href="https://setosa.io/ev/principal-component-analysis/">https://setosa.io/ev/principal-component-analysis/</a>) for a nice visual explanation of PCA.</p><p>To do PCA, we can run the following code on the feature table we built in Part 1 of our series.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/dada6de6e6ae73e85023059810064af9/href">https://medium.com/media/dada6de6e6ae73e85023059810064af9/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*65hmrUakY63T3TYK" /><figcaption>Source: Image by author</figcaption></figure><p>And if we are given labels to check how our dimensionality reduction went (reminder that in reality this is not guaranteed), we can replot our PCA with colors:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/791d4aa44ffcd99a87a2c61ec97826c5/href">https://medium.com/media/791d4aa44ffcd99a87a2c61ec97826c5/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EPcAPy0RtulBfeMt" /></figure><p>When we transform our data from 1,894 features down to 2, we can see two discernible dimensions onto which the data are projected. However, once the geographic origin of each sample is revealed, we see that this visualization technique <em>doesn’t</em> give us a good representation of the geographic structure.</p>
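<p>For reference, the PCA step above comes down to a few lines of scikit-learn and Seaborn. Below is a minimal sketch, assuming the feature table from Part 1 is loaded as a pandas DataFrame named <em>feature_table</em> and the known city labels as a Series named <em>labels</em> (placeholder names rather than the exact ones from our notebook):</p><pre># Minimal PCA sketch; placeholder names, see the linked notebook for the exact code
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

pca = PCA(n_components=2)                # keep only the first two principal components
pcs = pca.fit_transform(feature_table)   # rows are samples, columns are PC1 and PC2

# Unlabeled view first, then the same points colored by the known city labels
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1])
plt.show()
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1], hue=labels)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()</pre>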
<p>Unfortunately, this common technique falls apart when applied to the level of sparsity that microbiome data produce.</p><p>It is important to note that after applying PCA we could get a variety of shapes for our plots. To interpret the different shapes you can get, we recommend you check out this post <a href="http://www.nxn.se/valent/2017/6/12/how-to-read-pca-plots">here</a>.</p><h3>t-SNE</h3><p>Another technique used to explore high-dimensional data like ours is t-distributed stochastic neighbor embedding (<a href="http://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf">t-SNE</a>). Unlike PCA, which uses linear methods and tries to keep dissimilar data points far apart in the lower-dimensional representation, t-SNE handles data that lie on non-linear lower-dimensional manifolds by trying to keep similar data points close together.</p><p>t-SNE works by minimizing the divergence between two distributions: the first is built from the pairwise similarities of points in the original high-dimensional input space, and the second from the pairwise similarities of points in the corresponding lower-dimensional embedding. Making these two distributions agree is what keeps similar points close together in the embedding.</p><p>To run t-SNE, let’s use the scikit-learn implementation on our feature table from before:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a7ecfca23547243611b93962cb482459/href">https://medium.com/media/a7ecfca23547243611b93962cb482459/href</a></iframe><p>Before we move on, let’s bring up a few important points. One of the parameters for t-SNE is the metric used to calculate the distances between our samples. The default is <em>Euclidean</em> distance, but since our entries are counts, we will instead use a metric called the <em>Jaccard</em> distance. The Jaccard distance between two sets is 1 minus the size of their intersection divided by the size of their union; it measures the dissimilarity between sample sets.</p><p>Another important hyperparameter that we can adjust in t-SNE is <em>perplexity</em>. In essence, perplexity allows us to balance how much we would like to emphasize local vs. global relationships in our data.</p>
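<p>Putting the metric and perplexity choices together, the t-SNE call looks roughly like the sketch below (again using the placeholder name <em>feature_table</em>; the exact settings are in our linked notebook):</p><pre># Minimal t-SNE sketch; placeholder names, exact settings are in the linked notebook
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,    # embed down to two dimensions for plotting
    metric="jaccard",  # Jaccard distance on the presence/absence of sequences
    perplexity=30,     # default balance between local and global structure
    random_state=42,   # fix the seed so the embedding is reproducible
)
tsne_embedding = tsne.fit_transform(feature_table)  # shape: (n_samples, 2)</pre>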
<p>We chose to stick with the default setting of perplexity=30; however, we highly recommend this exploration of t-SNE and perplexity <a href="https://distill.pub/2016/misread-tsne/">here</a>.</p><p>Let us plot the results from our t-SNE embedding, showing both a plot without labels (again, as you would expect in a real unsupervised scenario) and a plot with our known labels:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a943ad658b21747626987a30927e0bec/href">https://medium.com/media/a943ad658b21747626987a30927e0bec/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xH3BvlbW_poZTUjQ" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*rfY_I8ICuDg6CGZb" /></figure><p>When the labels are revealed, we can see that this embedding is a decent representation of the underlying geographic structure present in our microbiome genetic data, as evidenced by the data points taken from the same geographic region landing in their own quadrants. Next, we will move on to UMAP, the technique we found to be most effective on this type of data.</p><h3>UMAP</h3><p>The last dimensionality reduction technique we will use to represent our high-dimensional microbiome metagenomic data is Uniform Manifold Approximation and Projection (UMAP). UMAP improves upon t-SNE not only by handling larger datasets in markedly less time but also by preserving much more of the original global structure of our data. For a nice in-depth comparison of t-SNE vs. UMAP, we recommend this tutorial <a href="https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668">here</a>.</p><p>The mathematical underpinnings of UMAP center around first building a weighted graph, where edge weights represent the likelihood of a connection between points. This is determined by a radius that expands out from each point, connecting points whose radii overlap. As each point’s radius grows, its likelihood of connection decreases.</p><p>Similar to t-SNE, we can tune UMAP’s hyperparameters to balance local and global structure in our lower-dimensional embedding. The <em>n_neighbors</em> parameter corresponds to the number of nearest neighbors used to construct the original graph, with low values emphasizing local structure and high values emphasizing global structure.</p><p>The second main parameter is <em>min_dist</em>. This parameter represents the minimum distance we would like between points in our lower-dimensional embedding, with low values giving us more closely packed groups of points and larger values giving more loosely packed groups of points.</p>
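<p>As a concrete illustration of these hyperparameters, a minimal sketch with the umap-learn package could look like the following (the values shown are the library defaults, not necessarily the exact settings behind the plots below):</p><pre># Minimal UMAP sketch; values shown are umap-learn defaults, not the exact settings used below
import umap

reducer = umap.UMAP(
    n_components=2,    # embed down to two dimensions for plotting
    n_neighbors=15,    # smaller values emphasize local structure, larger values global structure
    min_dist=0.1,      # smaller values pack points more tightly in the embedding
    metric="jaccard",  # keep the same dissimilarity measure we used for t-SNE
    random_state=42,
)
umap_embedding = reducer.fit_transform(feature_table)  # shape: (n_samples, 2)</pre>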
<p>We recommend playing around with the interactive visualizations <a href="https://pair-code.github.io/understanding-umap/">here</a> to get an intuitive feel for UMAP’s hyperparameters.</p><p>Let’s now create our new lower-dimensional embedding with UMAP:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/3b9e8caa99ddb1ed006f6fd23a35e0b5/href">https://medium.com/media/3b9e8caa99ddb1ed006f6fd23a35e0b5/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1-yBFX7kf7P4DqXA" /></figure><p>…and with labels:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ouIJbg9CDk9pzes2" /></figure><p>As we can see once we apply colors to reveal our labels, UMAP does a much better job of conveying the underlying geographic structure of our original high-dimensional metagenomic data set in the lower-dimensional visualization above, with each “spoke” or “petal” of the plot representing that area’s local community of microorganisms, that is, its microbiome.</p><h3>Conclusion</h3><p>Microbiome data present a unique challenge due to their inherently high-dimensional and sparse nature. To reduce dimensionality, we applied three techniques: PCA, t-SNE, and UMAP. In terms of grouping similar data, such as microbial samples with a similar geographic origin, UMAP performed the best.</p><p>Following up on the above, we can use these 2D embeddings combined with our favorite clustering algorithms to infer classes from the data. Instead of reducing down to two dimensions, we can reduce to any number of dimensions and then apply clustering on top of that. Classifying microbiome data using clustering methods is a topic we may talk about in our next article.</p><p>The two co-authors of this tutorial series are <a href="https://medium.com/u/76a19323c6b6">Nicholas Parker</a> &amp; <a href="https://medium.com/u/65f1193d859e">Mundy Reimer</a>, both of whom are graduates of the <a href="https://www.usfca.edu/arts-sciences/graduate-programs/data-science">Master’s in Data Science program at the University of San Francisco</a>.</p><p>If you would like to reach out, you can find us at:</p><p><strong>Mundy Reimer<br></strong><a href="https://www.linkedin.com/in/mundyreimer">LinkedIn</a><br><a href="https://mundyreimer.github.io/">Personal Blog</a><br><a href="https://twitter.com/MondayRhymer">Twitter</a></p><p><strong>Nick Parker<br></strong><a href="https://www.linkedin.com/in/nicholas-j-parker">LinkedIn</a><br><a href="https://medium.com/@njparker_8716">Personal Blog</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/480/0*50JNciHQlzylV8EF" /><figcaption>PCA, t-SNE, and UMAP. Image created by the user.</figcaption></figure><hr><p><a href="https://medium.com/data-science/visualizing-high-dimensional-microbiome-data-eacf02526c3a">Visualizing High-Dimensional Microbiome Data</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Predicting the Oscars using Preferential Machine Learning]]></title>
            <link>https://medium.com/data-science/predicting-the-oscars-using-preferential-machine-learning-32f06ffbf427?source=rss-76a19323c6b6------2</link>
            <guid isPermaLink="false">https://medium.com/p/32f06ffbf427</guid>
            <category><![CDATA[random-forest]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[oscars]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[film]]></category>
            <dc:creator><![CDATA[Nicholas Parker]]></dc:creator>
            <pubDate>Mon, 03 Feb 2020 04:46:58 GMT</pubDate>
            <atom:updated>2020-02-06T22:02:10.132Z</atom:updated>
            <content:encoded><![CDATA[<h4>The Oscars and their preferential balloting led me to create a novel machine learning approach to mimic this voting system</h4><p>Last year was a great year for film, and if you are like me, basking in the afterglow of the MoviePass craze and still seeing a lot of films in theaters, you know <em>Once Upon a Time in Hollywood</em>, <em>Parasite</em>, <em>1917</em>, and many more films delivered unique cinematic experiences. Every year on Oscar Sunday, Hollywood gets together and gives itself a big pat on the back. The biggest prize of the night is the award for Best Picture, which can cement a movie in the annals of film history. Unlike the other 23 awards given out on Oscar Night, the coveted Best Picture award is chosen using a method called preferential balloting, which is more complicated than a traditional vote. Preparing for this year’s Oscars and learning about preferential balloting led me to write some programs that mimic this voting system using machine learning.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xoFv-ch6tCF7JXIbxmU9ZA.png" /><figcaption>‘2020 Oscars’ Art by Reddit user u/Tillmann_S</figcaption></figure><p>In this article, I:</p><ul><li>Pick the data used to predict the Oscars</li><li>Explore how preferential balloting works from a data science perspective</li><li>Demonstrate a method of my own design I call a <strong>Preferential Balloting Random Forest</strong></li><li>Simulate what is happening behind the scenes of the Best Picture vote</li><li>Predict this year’s Best Picture winner</li></ul><p>I don’t include any of my code in this article, but here is <a href="https://github.com/njparker1993/oscars_predictions">the repository with my notebooks used in this analysis</a>.</p><h3><strong>How to Predict the Oscars: The Dataset</strong></h3><p>To predict anything using machine learning, we need a meaningful dataset to train our model on. In the case of the Best Picture race, we have the nine 2019 films nominated for the award. As reverential as I am toward the Oscars (I am interested enough to write this article, after all), I hold no illusions that the best movie of the year is necessarily the one that will win the Best Picture Oscar. The Academy is made up of thousands of members working throughout various areas of the film industry, and they each have biases that shape their votes. Because there are real people behind the votes, we can’t rely on numerical indicators of film quality like box-office profits or aggregate critic scores. But you know what correlates well with filmmakers’ votes? Other filmmakers’ votes.</p><p>There are numerous other awards shows that make up “Awards Season”, and the voters for events like the Screen Actors Guild Awards and the Directors Guild Awards are often the same people who make up the voting body of the Academy Awards. Using the results of earlier awards shows like the SAGs, DGAs, PGAs, Golden Globes, and BAFTAs, and combining those with Oscar info like nomination count, I can train a model on previous years’ Best Picture winners to predict this year’s.</p>
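<p>Concretely, each row of the training table is one nominated film and each column is a signal from awards season. The sketch below shows the general shape with illustrative column names (the real ones are produced by the scraping notebooks in my repository):</p><pre># Rough shape of the training data; column names are illustrative, not the exact ones from the repo
import pandas as pd

nominees = pd.DataFrame({
    "film":       ["1917", "Parasite", "Once Upon a Time in Hollywood"],
    "oscar_noms": [10, 6, 10],  # number of Oscar nominations
    "won_pga":    [1, 0, 0],    # Producers Guild winner
    "won_dga":    [1, 0, 0],    # Directors Guild winner
    "won_sag":    [0, 1, 0],    # SAG ensemble winner
    "won_globe":  [1, 0, 0],    # Golden Globe (Drama) winner
    "won_bafta":  [1, 0, 0],    # BAFTA Best Film winner
})
# For past years, the table also carries a 0/1 target column marking the eventual
# Best Picture winner, which is what the model is trained to predict.</pre>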
<p>To get consistent movie data and naming conventions, I scraped the data for each awards show’s nominees and winners from Wikipedia and merged it all together into one dataset in Python using the pandas and Beautiful Soup packages.</p><h3><strong>How Preferential Balloting Works</strong></h3><p>Preferential balloting, also called Instant Run-off Voting, is commonly used in situations where there are many candidates for only one winning spot. The Oscars have used this vote-tallying system to decide the Best Picture race since 2009, when the field expanded from five nominees to as many as ten. In preferential balloting, rather than voting for one film, voters submit a ballot with all the options ranked, and the #1 choices are tallied up as votes for each film. Then an iterative process begins in which the least popular film is eliminated and all ballots are re-ranked until a single film has greater than 50% of the #1 votes. After a film is eliminated from all ballots, the ballots that previously had the eliminated film in their #1 spot now have their #2 choice move to the top, which increases the vote counts for the remaining films. This process continues until one film has greater than 50% of the #1 votes, at which point it is declared the winner. A simulation of this elimination process is shown below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/458/0*2I8AoetIRbZrGajX" /><figcaption>Figure 1: Simulation of preferential balloting elimination. Generated by my Preferential Random Forest</figcaption></figure><p>Critics of the preferential ballot method claim that it rewards easy-to-like or non-controversial films: non-controversial films tend to sit around the middle of people’s rankings, while controversial films may be at the top of some folks’ ballots but at the bottom of others’, so they are prone to being eliminated. This effect was seen last year when the more artistic film <em>Roma</em> lost to the more broadly appealing film <em>Green Book</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/0*Vl5QFERMR_evSSQn" /><figcaption>Photo Credit: Left — Alfonso Cuarón/Netflix, Right — UNIVERSAL PICTURES/PARTICIPANT/DREAMWORKS</figcaption></figure><h3><strong>Preferential Balloting Random Forest</strong></h3><p>We’ve seen in the past that preferential balloting can change the result of the Best Picture race, and so I created a model that reflects this distinct vote-counting method. A Random Forest Classifier makes predictions by using a number of decorrelated Decision Tree Classifiers. Here is an article focusing more on the specifics of how a traditional Random Forest works. Generally, a Random Forest counts each tree’s ‘vote’ as a score based on leaf size and picks a final label by which class has the most ‘votes’ amongst all the trees. For this Preferential Balloting Random Forest, we instead use the ProbA (predicted probability) values for each film in the test set to create 1st through 9th place rankings of the films. ProbA values are the likelihood of an item being in the ‘Winner’ class and represent a softer prediction than the binary ‘Winner’ or ‘Loser’ classification. This softer prediction lets us turn a boolean classification into a ranking. Each Decision Tree produces one ballot, and once the entire Forest has created its ballots, the iterative process of preferential ballot elimination begins to determine the Forest’s choice for the winner.</p>
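<p>The elimination loop itself is simple to sketch. Below is a stripped-down version of the idea (not my exact implementation, which lives in the repository linked above), where each ballot is just a list of films ordered from first to last choice:</p><pre># Stripped-down preferential (instant run-off) tally. Each ballot is a list of films
# ordered from first to last choice; this shows the idea, not the exact code from the repo.
from collections import Counter

def preferential_winner(ballots):
    ballots = [list(ballot) for ballot in ballots]
    while True:
        # Tally every ballot's current #1 choice, counting all films still in the race
        tallies = Counter({film: 0 for film in ballots[0]})
        tallies.update(ballot[0] for ballot in ballots)
        leader, leader_votes = tallies.most_common(1)[0]
        if leader_votes * 2 > len(ballots):  # leader holds more than 50% of #1 votes
            return leader
        # Otherwise drop the film with the fewest #1 votes from every ballot and repeat
        loser = min(tallies, key=tallies.get)
        ballots = [[film for film in ballot if film != loser] for ballot in ballots]

# e.g. preferential_winner([["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"],
#                           ["C", "B", "A"], ["C", "B", "A"]]) returns "A"</pre>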
<p>By using rankings rather than picking one class, my Preferential Balloting Random Forest saves information that would otherwise be discarded by a traditional Random Forest and uses it again later in the elimination and re-ranking stage of preferential balloting.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*_Ad7TaqajnhQCYfw" /><figcaption>Figure 2: An Individual Decision Tree’s vote on the test set</figcaption></figure><h3><strong>Simulating the Oscars</strong></h3><p>Using my Preferential Balloting Random Forest, I simulated this year’s Best Picture race. To de-correlate the Decision Trees, I varied which awards shows each tree saw, similar to a Random Forest’s <em>max_features</em> hyperparameter. In this simulation, <em>max_features</em> represents what guild a voting Academy member may be in, or how closely they follow the other awards shows that season. I also included a random noise feature for each Decision Tree to train on, representing each voter’s innate bias towards certain films. The Academy is made up of around 7,000 unique voters, so I fired up my Forest, which soon produced 7,000 ballots. After 6 rounds of eliminating the last-place film, the top film had over 50% of the #1 votes, and my model had chosen the Best Picture winner…</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/968/0*v-wmTN-1PnovJRN9" /><figcaption>Figure 3: The final standings after 6 rounds of preferential balloting elimination. The process stopped once the film 1917 had greater than 50% of the vote.</figcaption></figure><h3><strong>Final Prediction</strong></h3><p>My Preferential Balloting Random Forest is a novel approach to simulating the Oscars, and I hope it helped you understand a bit about what goes into Best Picture voting and Random Forest Classifiers. But preferential balloting aside, let’s get down to business and really predict these bad boys. Using my scraped dataset of award-winning films, I used H2O’s powerful <a href="http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html">AutoML tool</a> to train 100 different Random Forest, XGBoost, and Deep Learning models with various parameters to predict this year’s Oscars. AutoML chose an XGBoost model that correctly predicted the Oscar outcomes of 147 out of 159 films on cross-validation. And which film did this maelstrom of models predict? Also <em>1917</em>! Things are looking good for this flick, since both the Preferential Balloting Random Forest and my AutoML model picked it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/0*M3wJBdWZ5OWQPJit" /><figcaption>Photo Credit: Universal Pictures, François Duhamel</figcaption></figure><h4>Links and shoutouts:</h4><p><a href="https://github.com/njparker1993/oscars_predictions">GitHub Repo For This Project</a></p><p>Scraping code inspired by GitHub user <a href="https://github.com/Buzdygan">Buzdygan</a></p><p>University of San Francisco MSDS</p><hr><p><a href="https://medium.com/data-science/predicting-the-oscars-using-preferential-machine-learning-32f06ffbf427">Predicting the Oscars using Preferential Machine Learning</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>