What is LIGER?

Taha Mohammadzadeh
8 min read · Aug 15, 2023


LIGER (Linked Inference of Genomic Experimental Relationships) is a method for jointly defining cell types from multiple single-cell datasets. It was introduced by Welch and colleagues in the journal Cell in 2019, and a step-by-step protocol was later published in Nature Protocols.

The approach uses integrative non-negative matrix factorization (iNMF) to integrate data from multiple single-cell RNA sequencing datasets and identify shared cell types across them. It separates the signal that is common to all datasets from the signal that is specific to each one, so it can identify cell types that are present in some datasets but not in others.

LIGER was tested on several datasets of different cell types, including human and mouse cells, and it was able to identify known cell types as well as new cell types that were not previously known. The researchers also showed that LIGER can be used to identify cell types that are specific to certain tissues or diseases, such as cancer.

One of the advantages of LIGER is that it can handle large datasets with thousands of cells and identify cell types that are present at low frequencies. It can also be used to identify cell types that are dynamic and change over time.

The authors of the study suggest that LIGER could be a valuable tool for researchers who study cellular heterogeneity and want to identify new cell types and understand their roles in different biological processes.

LIGER is a computational method that enables the integration of multiple single-cell RNA sequencing (scRNA-seq) datasets to identify shared cell types across different datasets. It is designed to overcome the challenges of analyzing single-cell data, such as low cell numbers, high cell-to-cell variability, and dataset-specific technical variability.

The method is built on integrative non-negative matrix factorization (iNMF). It approximates each dataset's expression matrix with a small number of non-negative factors, some shared across datasets and some specific to each dataset, and describes every cell by how strongly it loads on each factor. Cells are then grouped into clusters based on these factor loadings. The procedure is unsupervised: it does not require a set of known cell types, although prior biological knowledge is often used to interpret the resulting clusters.

LIGER has several advantages over other methods for analyzing single-cell data. First, it can handle large datasets with thousands of cells, which is important because single-cell RNA sequencing is becoming increasingly popular and datasets are getting larger. Second, it can identify cell types that are present at low frequencies, which is important because rare cell types can be biologically important but difficult to detect. Third, it can identify cell types that are dynamic and change over time, which is important because cellular heterogeneity can vary across different conditions, such as disease states or developmental stages.

The authors of the study demonstrated the power of LIGER by applying it to several datasets of different cell types, including human and mouse cells. They showed that LIGER can identify known cell types as well as new cell types that were not previously known. They also showed that LIGER can be used to identify cell types that are specific to certain tissues or diseases, such as cancer.

Overall, LIGER is a valuable tool for researchers who study cellular heterogeneity and want to identify new cell types and understand their roles in different biological processes. It can help to uncover novel biology and provide new insights into the complexity of cellular systems.

Data Preparation:

In this paper, the authors used a variety of data preparation techniques to prepare the data for their analysis. Here are some of the techniques they used:

Data Collection: The authors collected single-cell RNA sequencing (scRNA-seq) data from multiple datasets, including data from human and mouse cells. They obtained the data from publicly available repositories, such as the NCBI Gene Expression Omnibus (GEO) and the Allen Brain Atlas.

Data Preprocessing: The authors preprocessed the data to remove uninformative cells and genes. When a LIGER object is created with createLiger, cells with no detected genes and genes that are not expressed in any cell are dropped. Stricter quality-control filters, for example on library size, can be applied beforehand with other tools.

Normalization: The authors normalized the data to account for differences in sequencing depth and efficiency between cells. Each cell's counts are divided by that cell's total count, so that cells with different library sizes become comparable. The data are not log-transformed.

Feature Selection: The authors selected a subset of genes that were most informative for the analysis. Variable genes are identified within each dataset from their variance relative to their mean expression, and the selections are then combined across datasets. No cell-type labels are needed for this step.

Data Integration: The authors integrated the data from multiple datasets using the LIGER algorithm, which applies integrative non-negative matrix factorization to jointly analyze multiple scRNA-seq datasets and identify shared cell types. LIGER can handle dataset-specific technical variability and identify cell types that are present in multiple datasets.

Data Visualization: The authors visualized the data by projecting the integrated cell factors to two dimensions with t-distributed stochastic neighbor embedding (t-SNE) or uniform manifold approximation and projection (UMAP). The plots are drawn in R; matplotlib and seaborn are Python libraries and are not part of this workflow.

Clustering: The authors identified cell types by clustering cells on their factor loadings. Each cell is first assigned to the factor on which it loads most strongly, and the assignment is then refined using a k-nearest neighbors graph. Louvain community detection on the normalized cell factors is available as an alternative.

Model Architecture:

LIGER is built on integrative non-negative matrix factorization (iNMF). It assumes that each dataset's normalized and scaled expression matrix can be approximated by a small number of non-negative factors, often called metagenes, and that every cell can be described by how strongly it loads on each factor.

The LIGER model consists of the following components:

  1. Input Matrices: One cells-by-genes matrix E_i per dataset, restricted to a common set of variable genes and scaled but not centered, so that all values stay non-negative.
  2. Shared Factors: A single matrix W of k metagenes that is common to all datasets and captures the signal the datasets share.
  3. Dataset-Specific Factors: One matrix V_i per dataset that captures signal unique to that dataset, such as technical or biological differences.
  4. Cell Factor Loadings: One matrix H_i per dataset that holds each cell's loading on the k factors. Each dataset is approximated as E_i ≈ H_i (W + V_i), and the loadings serve as the low-dimensional representation used for clustering and visualization.
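For reference, the protocol code later in this article fits the model with optimizeALS, which performs integrative non-negative matrix factorization. The objective behind that step, as formulated in the original LIGER publication, can be written as follows. Here E_i is the scaled cells-by-genes matrix of dataset i, H_i holds the cell factor loadings, W the shared gene factors, V_i the dataset-specific gene factors, and λ a tuning parameter; these symbol names are introduced here for illustration.

```latex
\min_{H_i \ge 0,\; W \ge 0,\; V_i \ge 0} \;
\sum_{i} \left\lVert E_i - H_i \,(W + V_i) \right\rVert_F^2
\;+\; \lambda \sum_{i} \left\lVert H_i V_i \right\rVert_F^2
```

The first term is the reconstruction error for each dataset. The second term penalizes the dataset-specific factors, so a larger λ pushes more of the signal into the shared factors W.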

Model Training:

The LIGER model is fit by minimizing a reconstruction loss plus a penalty on the dataset-specific factors. The reconstruction loss measures how well the factors reproduce each dataset's scaled expression matrix, while the penalty, weighted by a tuning parameter lambda, limits how much signal the dataset-specific factors may absorb. Larger values of lambda push the datasets toward a more shared representation.

The authors use an alternating least squares scheme to fit the model, which is what the optimizeALS function in the protocol runs. It is an iterative algorithm with two steps: one in which the gene factors are held fixed and the cell factor loadings are updated, and one in which the cell factor loadings are held fixed and the gene factors are updated.

In the first step, the current shared and dataset-specific gene factors are used to solve a non-negative least squares problem for the loadings of every cell in each dataset.

In the second step, the updated loadings are used to solve non-negative least squares problems for the shared factors and for each set of dataset-specific factors.

The authors repeat these two steps until convergence, that is, until the objective stops decreasing by more than a small tolerance.

Because the optimization problem is non-convex, different random starting points can lead to different solutions, so it is common to run the factorization from more than one random initialization and keep the solution with the lowest objective.

The two main hyperparameters are k, the number of factors, and lambda, the penalty weight. The package includes helper functions (suggestK and suggestLambda) that offer heuristics for choosing them, but in practice users combine these heuristics with biological prior knowledge.

Code: Preprocessing and Normalization

For the first portion of this protocol, we will be integrating published data [11] from control and interferon-stimulated peripheral blood mononuclear cells (PBMCs). This dataset was originally in the form of output from the 10x Cell Ranger pipeline. Here we directly load downsampled versions of the control and stimulated digital gene expression (DGE) matrices.

# Load the downsampled control and stimulated count matrices
ctrl_dge <- readRDS("ctrl_dge.RDS")
stim_dge <- readRDS("stim_dge.RDS")

For 10x Cell Ranger output, we can instead use the `read10X` function, which generates a matrix or list of matrices directly from the output directories.

library(liger)
# Read both 10x output directories into a named list of matrices
matrix_list <- read10X(sample.dirs = c("10x_ctrl_outs", "10x_stim_outs"),
                       sample.names = c("ctrl", "stim"), merge = FALSE)

With the digital gene expression matrices for both datasets, we can initialize a LIGER object using the createLiger function.

ifnb_liger <- createLiger(list(ctrl = ctrl_dge, stim = stim_dge))

Before we can perform matrix factorization to integrate our datasets, we must run several preprocessing steps: normalize the expression data to account for differences in sequencing depth and efficiency between cells, identify variably expressed genes, and scale the data so that each gene has the same variance. Note that because non-negative matrix factorization requires non-negative values, we do not center the data by subtracting the mean. We also do not log-transform the data.

ifnb_liger <- normalize(ifnb_liger)
ifnb_liger <- selectGenes(ifnb_liger)
ifnb_liger <- scaleNotCenter(ifnb_liger)
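To make these three preprocessing steps concrete, here is a minimal base-R sketch on a made-up 4-gene by 3-cell count matrix. It does not call LIGER. The variance ranking is a simplified stand-in for selectGenes, and the matrix, the variable names, and the choice of keeping two genes are all assumptions for illustration.

```r
# Toy count matrix: 4 genes (rows) x 3 cells (columns). Values are made up.
counts <- matrix(c(8, 1, 1, 0,   # cell1
                   1, 8, 1, 0,   # cell2
                   4, 4, 1, 1),  # cell3
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:3)))

# 1. Normalize: divide each cell (column) by its total count.
norm <- sweep(counts, 2, colSums(counts), "/")

# 2. Select variable genes: here simply the two genes with the highest
#    variance across cells (a simplification of what selectGenes does).
gene_var <- apply(norm, 1, var)
keep <- names(sort(gene_var, decreasing = TRUE))[1:2]
selected <- norm[keep, , drop = FALSE]

# 3. Scale without centering: scale(center = FALSE) divides each column by
#    its root mean square, so transpose to scale genes, then transpose back.
#    No mean is subtracted, so every value stays non-negative.
scaled <- t(scale(t(selected), center = FALSE, scale = TRUE))

print(round(scaled, 3))
```

Because nothing is subtracted in step 3, zeros in the count matrix remain zeros, which is what keeps the input valid for non-negative matrix factorization.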

We are now able to run integrative non-negative matrix factorization on the normalized and scaled datasets. The key parameter for this analysis is k, the number of matrix factors (analogous to the number of principal components in PCA). In general, we find that a value of k between 20 and 40 is suitable for most analyses and that results are robust to the choice of k. Because LIGER is an unsupervised, exploratory approach, there is no single “right” value for k, and in practice, users choose k from a combination of biological prior knowledge and other information.

ifnb_liger <- optimizeALS(ifnb_liger, k = 20)
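For intuition about what the factorization step optimizes, the following base-R sketch fits shared factors W, dataset-specific factors V, and cell loadings H to two random toy matrices. It is not the algorithm inside optimizeALS: it swaps in simple multiplicative updates in place of alternating least squares, and the data, the dimensions, and the values of k and lambda are all assumptions chosen for illustration.

```r
set.seed(42)
n_genes <- 15; k <- 3; lambda <- 5

# Two toy non-negative "datasets" (cells x genes) measured on the same genes.
X <- list(ctrl = matrix(rexp(20 * n_genes), 20, n_genes),
          stim = matrix(rexp(25 * n_genes), 25, n_genes))

# Random non-negative starting values.
H <- lapply(X, function(x) matrix(runif(nrow(x) * k), nrow(x), k))
V <- lapply(X, function(x) matrix(runif(k * n_genes), k, n_genes))
W <- matrix(runif(k * n_genes), k, n_genes)

# Reconstruction error plus the penalty on the dataset-specific factors.
inmf_objective <- function(X, H, W, V, lambda) {
  sum(mapply(function(x, h, v) {
    sum((x - h %*% (W + v))^2) + lambda * sum((h %*% v)^2)
  }, X, H, V))
}

eps <- 1e-9  # guards against division by zero
obj_start <- inmf_objective(X, H, W, V, lambda)

for (iter in 1:100) {
  # Update cell loadings with the gene factors held fixed.
  for (i in names(X)) {
    WV <- W + V[[i]]
    H[[i]] <- H[[i]] * (X[[i]] %*% t(WV)) /
      (H[[i]] %*% (WV %*% t(WV) + lambda * V[[i]] %*% t(V[[i]])) + eps)
  }
  # Update the shared factors using all datasets.
  num <- Reduce(`+`, lapply(names(X), function(i) t(H[[i]]) %*% X[[i]]))
  den <- Reduce(`+`, lapply(names(X), function(i)
    t(H[[i]]) %*% H[[i]] %*% (W + V[[i]])))
  W <- W * num / (den + eps)
  # Update each dataset-specific factor matrix.
  for (i in names(X)) {
    HtH <- t(H[[i]]) %*% H[[i]]
    V[[i]] <- V[[i]] * (t(H[[i]]) %*% X[[i]]) /
      (HtH %*% (W + (1 + lambda) * V[[i]]) + eps)
  }
}

obj_end <- inmf_objective(X, H, W, V, lambda)
cat("objective before:", obj_start, " after:", obj_end, "\n")
```

Multiplicative updates were chosen only because they keep every matrix non-negative without an external solver; on real data the package function should be used.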

We can now use the resulting factors to jointly cluster cells and perform quantile normalization by dataset, factor, and cluster to fully integrate the datasets. All of this functionality is encapsulated within the quantile_norm function, which uses maximum factor assignment followed by refinement using a k-nearest neighbors graph.

ifnb_liger <- quantile_norm(ifnb_liger)
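The maximum factor assignment mentioned above can be illustrated in a few lines of base R. This sketch covers only that first step, on a made-up matrix of cell factor loadings; it leaves out the k-nearest neighbors refinement and the quantile normalization that quantile_norm also performs, and the matrix and names are assumptions for illustration.

```r
# Toy cell factor loadings: 5 cells (rows) x 3 factors (columns). Values are made up.
H <- matrix(c(0.9, 0.1, 0.0,
              0.2, 0.7, 0.1,
              0.8, 0.0, 0.2,
              0.1, 0.2, 0.7,
              0.0, 0.6, 0.4),
            ncol = 3, byrow = TRUE,
            dimnames = list(paste0("cell", 1:5), paste0("factor", 1:3)))

# Assign each cell to the factor on which it loads most strongly.
cluster <- apply(H, 1, which.max)

print(cluster)
print(table(cluster))
```

With these values, cells 1 and 3 fall in cluster 1, cells 2 and 5 in cluster 2, and cell 4 in cluster 3.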

The quantile_norm procedure produces joint clustering assignments and a low-dimensional representation that integrates the datasets together. These joint clusters directly from iNMF can be used for downstream analyses (see below). Alternatively, we can also run Louvain community detection, an algorithm commonly used for single-cell data, on the normalized cell factors. The Louvain algorithm excels at merging small clusters into broad cell classes and thus may be more desirable in some cases than the maximum factor assignments produced directly by iNMF.

ifnb_liger <- louvainCluster(ifnb_liger, resolution = 0.25)

To visualize the clustering of cells graphically, we can project the normalized cell factors to two or three dimensions. LIGER supports both t-SNE and UMAP for this purpose. Note that if both techniques are run, the object will only hold the results from the most recent run.

ifnb_liger <- runUMAP(ifnb_liger)
