Performing Single Cell RNA-seq analysis using 10X Genomic Data format in R language environment of Snowpark Container Service

Tatsuya Koreeda
Snowflake Engineering
8 min readMar 6, 2024

Introduction

Recently, the Snowpark Container Service (SPCS) has become available in all regions. SPCS can be used in various business scenes such as hosting locations for internal applications and batch processing servers. It is also applicable in the field of bioinformatics, which requires a large amount of computing resources for tasks like genome analysis and gene data analysis. Among these applications, single cell RNA-sequencing (scRNA-seq) is an advanced technology that profiles the expression of RNA at the individual cell level, requiring significant computing resources. In this article, we would like to upload 10X Genomics format files to the internal stage and perform cell type labeling using the R package Seurat. We will start with setting up Rstudio Server on SPCS.

Diagram of analyzing 10X Genomic Data format

Why R?

R is widely used in the fields of bioinformatics, epidemiology, statistical genetics, and the healthcare industry. It is frequently updated (version 4.3.3 was released). In a figure published by the Data Scientist Association, R is introduced alongside Python. The strengths of each language in data engineering are as follows:

Strengths of Python
- Compatible with AI systems, web applications, IoT devices, and cloud platforms
- Easy to build machine learning and deep learning models

Strengths of R
- Easy calculation of results based on statistical theory (e.g., confidence intervals, p-values)
- Graphs are visually appealing and simple

Strengths of Python and R

Personally, I like R for the following reasons:
- Abundance of specialized libraries for bioinformatics
- Beautiful plotting using ggplot2 (see below)
- Strength in statistical analysis
- Convenient data frame manipulation using dplyr

James Ding et al., 2020 Characterisation of CD4+ T-cell subtypes using single cell RNA sequencing and the impact of cell number and sequencing depth

Although Snowflake mainly uses Python in its technology, the use of R in Snowflake has been limited. However, with the introduction of Snowpark Container Service, the possibility of using R has expanded. The image provided by Snowflake for SPCS also features the “R” mark.

SPCS with R mark

Rstudio Server and Spec of SPCS

RStudio is an integrated development environment (IDE) managed by Posit for data analysts and researchers using the R programming language. It supports all aspects of R programs, including code editing, execution, debugging, and visualization. Additionally, a software called RStudio Server has been developed to run RStudio on a web server based on Linux, allowing the use of RStudio from a browser.

The Rocker project provides various Docker images centered around the R language. Each image is designed to meet specific needs and is structured as follows:

r-ver
- Contents: Image with R installed, tagged by version.
- Dockerfile: [Dockerfile for r-ver:latest](https://github.com/rocker-org/rocker-versioned/tree/master/r-ver)

rstudio
- Contents: Image with RStudio installed on top of `r-ver`.
- Dockerfile: [Dockerfile for rstudio:latest](https://github.com/rocker-org/rocker-versioned/tree/master/rstudio)

tidyverse
- Contents: Image with tidyverse package, devtools package, etc., installed on top of `rstudio`.
- Dockerfile: [Dockerfile for tidyverse:latest](https://github.com/rocker-org/rocker-versioned/tree/master/tidyverse)

verse
- Contents: Image with TinyTeX, etc., installed on top of `tidyverse`.
- Dockerfile: [Dockerfile for verse:latest](https://github.com/rocker-org/rocker-versioned/tree/master/verse)

geospatial
- Contents: Image with various R packages installed on top of `verse`, specialized for geospatial analysis.
- Dockerfile: [Dockerfile for geospatial:latest](https://github.com/rocker-org/geospatial/blob/master/Dockerfile)

For this project, we will be hosting on SPCS based on the images provided by the Rocker project.

Additionally, I have created an image pre-installed with the Seurat library for analyzing single cell RNA-seq data in 10X genomics format. You can use it by running.

docker pull kinngut/single-cell:latest

Link to the image on Docker Hub:

Setting up Package Installation in RStudio

To perform package installation from RStudio, SPCS needs to have external internet access. Therefore, defining network rules and performing operations like EXTERNAL ACCESS INTEGRATION are necessary. Let’s define the network rules and carry out the operation for EXTERNAL ACCESS INTEGRATION.

USE ROLE ACCOUNTADMIN;
CREATE NETWORK RULE allow_all_rule
TYPE = 'HOST_PORT'
MODE= 'EGRESS'
VALUE_LIST = ('0.0.0.0:443','0.0.0.0:80')
;
CREATE EXTERNAL ACCESS INTEGRATION all_access_integration
ALLOWED_NETWORK_RULES = (allow_all_rule)
ENABLED = true
;

Once the network rules and EXTERNAL ACCESS INTEGRATION are created, add them to the spec file and create the service. This will enable seamless package installation.

CREATE SERVICE singlecell
IN COMPUTE POOL tutorial_compute_pool
FROM SPECIFICATION $$
spec:
containers:
- name: singlecell
image: <registry_hostname>/tutorial_db/data_schema/tutorial_repository/kinngut/single-cell:latest
env:
DISABLE_AUTH: true
volumeMounts:
- name: rstudio
mountPath: /rstudio/data
endpoints:
- name: rstudioendpoint
port: 8787
public: true
volumes:
- name: rstudio
source: "@TUTORIAL_DB.DATA_SCHEMA.tutorial_STAGE"
uid: 0
gid: 0
$$
EXTERNAL_ACCESS_INTEGRATIONS = (ALL_ACCESS_INTEGRATION)
MIN_INSTANCES=1
MAX_INSTANCES=1;

By doing this, you can access RStudio by visiting the ingress_url.

RStudio startup screen

Explanation of 10X Genomics Format and Uploading Data to Internal Stage

10x Genomics is a company that provides single-cell RNA sequencing technology, allowing researchers to analyze a large number of cells at a low cost, making it widely adopted in the research community. Data from 10x Genomics is created in its proprietary format. To analyze the unique format of 10XGenome data, they provide a special library called cellranger. However, this data can also be analyzed using Seurat, so we will analyze it using Seurat this time.

Data in the proprietary format of 10x Genomics is structured as follows:

$ tree filtered_feature_bc_matrix
filtered_feature_bc_matrix
├── barcodes.tsv.gz
├── features.tsv.gz (or genes.tsv.gz)
└── matrix.mtx.gz
Stephanie Hicks “Welcome to the World of Single-Cell RNA-Sequencing”

Now, let’s upload the data to an external stage in Snowflake and analyze it in RStudio on SPCS. If the data is less than 250MB, you can upload it to the internal stage from Snowsight (for larger files, use the PUT command in SnowSQL). Therefore, we will try uploading the relatively lightweight files `barcodes.tsv.gz` and `features.tsv.gz` via Snowsight’s UI. If the size of `matrix.mtx.gz` exceeds 250MB, we will upload it using the PUT command in SnowSQL.

File Upload via Snowsight
snowsql -a [account_name] -u [user_name]
PUT file://[local_file_path] @[stage_path];
![SnowSQL PUT Command](https://storage.googleapis.com/zenn-user-upload/cd70825a1122-20240305.png)

After uploading the files to the stage previously set in the spec, you will be able to see the files in the location specified by the path set in the volume in RStudio’s terminal.

Code and Dataset Used for Analysis

The dataset used in this analysis is as follows:
Paper Title:
Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19
Dataset:
GSE149689

This paper utilizes Single Cell RNA-seq analysis to investigate the role of the type I interferon response in the progression of inflammation in severe COVID-19. Additionally, the dataset GSE149689 used in this paper has been reanalyzed in the following study:

Early peripheral blood MCEMP1 and HLA-DRA expression predicts COVID-19 prognosis

In this study, biomarkers for the prognosis of severe COVID-19 — MCEMP1 and HLA-DRA gene expression in CD14+ cells — were examined by reanalyzing seven datasets, including GSE149689. This time, let’s try to reproduce Figure 4b from Early peripheral blood MCEMP1 and HLA-DRA expression predicts COVID-19 prognosis by analyzing the GSE149689 dataset using Seurat. While it may be challenging to replicate it exactly, we can create a similar figure.

Figure 4b

Below is the code used for the analysis.

library(dplyr)
library(Seurat)
library(patchwork)
library(SingleR)
library(celldex)
library(SingleCellExperiment)

# Load the dataset
pbmc.data <- Read10X(data.dir = ".")

# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)

# Calculate percent mitochondrial genes
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

# Quality control
pbmc <- subset(pbmc, nFeature_RNA >= 200 & nFeature_RNA <= 5000 & percent.mt < 15)

pbmc <- pbmc %>%
NormalizeData() %>%
FindVariableFeatures(selection.method = "vst", nfeatures = 2000) %>%
ScaleData(features = VariableFeatures(object = pbmc)) %>%
RunPCA() %>%
RunUMAP(dims = 1:20) %>%
FindNeighbors(dims = 1:10) %>%
FindClusters(resolution = 0.5)

# Visualization
DimPlot(pbmc, reduction = "umap", label = T)

# Load reference data from celldex
ref <- celldex::HumanPrimaryCellAtlasData()

# Run SingleR to infer cell types of pbmc dataset using reference data
results <- SingleR(test = as.SingleCellExperiment(pbmc), ref = ref, labels = ref$label.main)

#Add inferred cell type labels to pbmc object
pbmc$singlr_labels <- results$labels

#Visualize cell types in a UMAP plot with labels
DimPlot(pbmc, reduction = 'umap', group.by = 'singlr_labels', label = TRUE)r

The code discusses preprocessing steps where low-quality cells are removed, followed by scaling and dimensionality reduction using PCA. The resulting data is then labeled based on a reference database for cell types, and visualized using UMAP.

You can visualize the results of UMAP using DimPlot.

UMAP Visualization

By performing cell labeling using the SingleR package, you can obtain a UMAP with immune cell type labels as shown below.

Immune Cell Type Labeling on UMAP: scRNA-seq data in 10X Genomics format analyzed on SPCS

Benefits of Bioinformatics Analyzing on Snowpark Container Service

Changing computing resources in Snowflake is very simple, as it can be easily done by changing the `INSTANCE_FAMILY` parameter when running the `CREATE COMPUTE` command. This allows individuals to quickly set up a cloud-based analysis environment, making it very convenient. This is particularly useful when requiring analysis environments that demand more resources.

CREATE COMPUTE POOL tutorial_compute_pool
MIN_NODES = 1
MAX_NODES = 1
INSTANCE_FAMILY = CPU_X64_XS;

Changing INSTANCE_FAMILY allows easy modification of computing resources.

Snowflake provides its own OAuth authentication, enabling management of access to containers on a per-user basis. When hosting simple applications, creating Snowflake users allows for easy setup of a login authentication system. This is highly effective not only for application development but also for data operations in various programming languages other than Python.

In this way, Snowflake offers users a flexible and powerful cloud analytics environment, catering to a wide range of needs from changing computing resources to building authentication systems and developing in multiple languages.

In Conclusion

This article explained the use of Snowpark Container Service for single-cell RNA sequencing in bioinformatics. While the analysis was performed on pre-counted 10X Genomics format reads, hosting Cell Ranger on SPCS enables end-to-end analysis on Snowflake from fastq file processing to data analysis. This will be a topic for future verification. The analysis environment using SPCS is expected to bring innovative analytical capabilities not only in bioinformatics but also in various research fields.

--

--

Tatsuya Koreeda
Snowflake Engineering

CREATIVE SURVEY Inc. Data Engineer - Snowflake Japan WEST UG Leader & Snowflake Squad 2024 - Sharing insights on the use of Snowflake in life sciences🧬