Efficient and Insightful TCRB Repertoire Analysis with LZGraphs: A Hands-On Guide (Part 1)

On the Path to Mastering TCRB Repertoire Analysis: Navigating LZGraphs for Enhanced Insights

Thomas Konstantinovsky
Computational Biology Insights
9 min read · Mar 20, 2024


Introduction

In the rapidly evolving field of immunology, the analysis of T-cell receptor beta chain (TCRB) repertoires presents both a challenge and an opportunity to delve into the complexities of adaptive immunity. The diversity of TCR sequences is a testament to the immune system’s incredible adaptability, but it also poses significant analytical hurdles. Traditional methods often fall short in capturing the nuanced landscape of TCR variability and its functional implications. This gap underscores the need for innovative analytical approaches that are not only efficient but also capable of extracting profound insights from TCRB repertoire data.

Enter LZGraphs, a cutting-edge Python library designed to revolutionize TCRB repertoire analysis. Rooted in the principles presented in our recent research, LZGraphs employs the Lempel-Ziv 76 algorithm to offer a novel and efficient method for encoding and analyzing TCR sequences. This tool stands out by providing a suite of metrics, such as K1000 Diversity, LZPgen, and LZCentrality, that illuminate the structural and functional dynamics of TCR repertoires.

In this post, we’ll embark on a hands-on exploration of LZGraphs, utilizing a publicly available dataset from Adaptive Biotech. Through practical examples, we’ll demonstrate how LZGraphs can be harnessed to uncover the rich diversity and intricate patterns within TCRB repertoires, offering valuable insights into immune system variability and its implications for health and disease.

Left: The Structure of an Amino Acid Positional LZ Substring. Right: The Basic Logic Behind the Creation of a “Naive” DNA LZGraph.

Data

For this practical guide, we will utilize a dataset publicly available from Adaptive Biotech, chosen for its relevance and accessibility to showcase LZGraphs’ capabilities. It’s important to note that while this specific dataset serves as our example, LZGraphs is designed to work with a wide range of TCRB repertoire datasets, making it a versatile tool for various research contexts.

The preprocessing applied to the data for this guide is minimal, aimed at maintaining the integrity of the dataset while making it compatible with LZGraphs. We will transform the dataset into a list of lists, where each sublist contains the CDR3 amino acid sequences from a given repertoire. A separate list will compile metadata associated with each repertoire, including details like age and sex. This metadata will enrich our analysis, allowing us to explore correlations and insights that extend beyond the sequences themselves, thereby demonstrating the comprehensive analytical power of LZGraphs in practical scenarios.
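The reshaping described above can be sketched with the standard library alone. This is an illustration under stated assumptions: the column names (`sample_id`, `cdr3_aa`, `age`, `sex`) are hypothetical and will differ from the names in your export, and the in-memory table is toy data standing in for real files.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for an exported table. Column names are hypothetical;
# rename them to match your own files.
raw = io.StringIO(
    "sample_id,cdr3_aa,age,sex\n"
    "S1,CASSLGQGYEQYF,34,F\n"
    "S1,CASSPDRGNTEAFF,34,F\n"
    "S2,CASSLAGGTDTQYF,51,M\n"
)

sequences_by_sample = defaultdict(list)
metadata_by_sample = {}

for row in csv.DictReader(raw):
    sample = row["sample_id"]
    sequences_by_sample[sample].append(row["cdr3_aa"])
    # Metadata is constant within a sample, so keep the first row seen.
    metadata_by_sample.setdefault(
        sample, {"age": int(row["age"]), "sex": row["sex"]}
    )

# Parallel lists: repertoires[i] is described by metadata[i].
sample_ids = sorted(sequences_by_sample)
repertoires = [sequences_by_sample[s] for s in sample_ids]
metadata = [metadata_by_sample[s] for s in sample_ids]
```

Keeping the two lists parallel means any per-repertoire result computed later can be joined to age or sex by index.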

The K1000 Index: Unveiling Deeper Repertoire Diversity

Traditional measures of TCRB repertoire diversity often focus on clonal abundance, missing the intricate diversity within unique TCR sequences. Our research introduces the K1000 Index, utilizing the LZGraph data structure to offer a nuanced view of repertoire diversity beyond mere clone counts.

The K1000 Index is computed by sampling 1000 unique sequences from a repertoire, constructing an LZGraph from that sample, and repeating the process K times (K defaults to 50). The index is the average node count across these K LZGraphs, revealing the “information capacity” of the repertoire: the richness of its sequence diversity, unaffected by clonal abundance.

This measure shines in comparing repertoires, showcasing the informational diversity within sequences. Where traditional metrics might see similarity, the K1000 Index discerns the subtle differences in the depth of unique sequence information, making it a powerful tool for in-depth repertoire analysis.

The Basic Logic Behind the K1000 Index—Out of the entire set of sequences (a single repertoire), 1000 sequences are sampled K times, and a graph is created out of the sampled sequences. The averaged node count is the K1000 index.
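To make the procedure concrete, here is a minimal, library-free sketch of the sample-and-count loop. It is not the LZGraphs implementation: `lz_subpatterns` is a simplified LZ-style parse, and labelling each node as a (subpattern, end position) pair is an assumption made for illustration. All function names here are hypothetical.

```python
import random

def lz_subpatterns(sequence):
    """Split a sequence into the shortest substrings not yet seen in it.

    Simplified LZ-style parse; the library's own encoder may differ.
    """
    seen, parts, start = set(), [], 0
    for end in range(1, len(sequence) + 1):
        piece = sequence[start:end]
        # Emit a new piece, or whatever is left at the end of the sequence.
        if piece not in seen or end == len(sequence):
            seen.add(piece)
            parts.append(piece)
            start = end
    return parts

def positional_nodes(sequence):
    """Label each subpattern with the position at which it ends (assumed labelling)."""
    nodes, position = [], 0
    for part in lz_subpatterns(sequence):
        position += len(part)
        nodes.append((part, position))
    return nodes

def k_index_sketch(repertoire, sample_size=1000, draws=50, seed=None):
    """Average number of distinct nodes over `draws` samples of unique sequences."""
    rng = random.Random(seed)
    unique = sorted(set(repertoire))
    if len(unique) < sample_size:
        raise ValueError("repertoire has fewer unique sequences than sample_size")
    node_counts = []
    for _ in range(draws):
        sample = rng.sample(unique, sample_size)
        nodes = {node for seq in sample for node in positional_nodes(seq)}
        node_counts.append(len(nodes))
    return sum(node_counts) / draws
```

Because sampling is done over unique sequences, a clone that appears thousands of times contributes exactly as much as one that appears once, which is what makes the measure independent of clonal abundance.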

Example

Once the data is read and the CDR3 amino acid sequences for each repertoire are stored in a list named repertoires, the code snippet below derives the K1000 index for every repertoire in that list.

from LZGraphs import K1000_Diversity, AAPLZGraph

k1000_values = []

# One K1000 value per repertoire, each averaged over 50 subsamples
for repertoire in repertoires:
    k1000 = K1000_Diversity(repertoire, AAPLZGraph.encode_sequence, draws=50)
    k1000_values.append(k1000)

This snippet systematically processes each TCRB sequence repertoire, applying the K1000_Diversity function to measure the diversity within the repertoire. The function uses the AAPLZGraph.encode_sequence method for encoding the sequences, and the diversity index is averaged over 50 random subsamples (draws) to ensure statistical robustness.

Scatter plot highlighting repertoires with K1000 values above the average plus one standard deviation in red, illustrating the diversity within the sample set independent of repertoire depth

After calculating the K1000 Index across all repertoires, we can identify those with significantly higher K1000 values, indicative of greater diversity. This allows for targeted exploration in repertoire or sequence-specific analyses, employing the K1000 Index as a filter to pinpoint repertoires of particular interest. In the visual representation, repertoires marked in red are those with a K1000 value exceeding the average by more than one standard deviation, highlighting their comparative diversity within this sample set.
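The filter described here is a one-line threshold. The sketch below is one way to write it; the function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which variant was used for the figure.

```python
from statistics import mean, stdev

def high_diversity_indices(k1000_values):
    """Return positions whose value exceeds the mean by more than one standard deviation.

    Uses the sample standard deviation (an assumption) and needs at least two values.
    """
    threshold = mean(k1000_values) + stdev(k1000_values)
    return [i for i, value in enumerate(k1000_values) if value > threshold]
```

Calling `high_diversity_indices(k1000_values)` returns indices into the repertoires list, so the flagged repertoires can be pulled out directly for follow-up analysis.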

It’s crucial to note the independence of K1000 values from repertoire depth — one of its notable strengths. This attribute ensures that the variability of repertoires can be assessed without the need for extensive depth in sampling. Such a feature is invaluable for researchers aiming to understand the breadth of diversity within their samples without being constrained by the volume of data collected.

Exploring the Extremes: Understanding K1000 Index Boundaries

To truly grasp the capabilities and limits of the K1000 Index, our investigation extends beyond conventional repertoire analysis to include experiments with both highly diverse and minimally diverse synthetic amino acid sequences. By constructing repertoires from entirely random sequences, we simulate an upper limit of diversity, akin to an ecosystem where every sequence is distinct, showcasing the index’s ability to capture maximal diversity. Conversely, by generating repertoires with sequences that differ by no more than five mutations from a single base sequence, we establish a lower diversity boundary, reflecting scenarios of significant sequence homogeneity.

This comparative analysis serves not only to validate the K1000 Index’s sensitivity to variations in sequence diversity but also to illustrate its robustness in differentiating between repertoires of varying complexity. Such an approach enables us to delineate the index’s operational range, ensuring that users can confidently employ it to identify repertoires of high interest, regardless of whether they exhibit extreme diversity or notable similarity.

import random

from LZGraphs import K1000_Diversity, AAPLZGraph

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'  # 20 standard amino acids

def generate_random_sequence(length):
    # Fully random sequence: the high-diversity extreme
    return ''.join(random.choice(AMINO_ACIDS) for _ in range(length))

def generate_low_diversity_sequence(base_sequence, changes=5):
    # Near-copy of base_sequence with at most `changes` substitutions
    sequence = list(base_sequence)
    for _ in range(changes):
        pos = random.randint(0, len(sequence) - 1)
        sequence[pos] = random.choice(AMINO_ACIDS)
    return ''.join(sequence)

# Random sequences matching the length of every original sequence
random_sequences = [[generate_random_sequence(len(seq)) for seq in rep] for rep in repertoires]

# Variants of one base sequence, as many as each original repertoire holds
base_sequence = repertoires[0][0]
low_diversity_sequences = [[generate_low_diversity_sequence(base_sequence) for _ in rep] for rep in repertoires]

original_k1000_values = [K1000_Diversity(rep, AAPLZGraph.encode_sequence, draws=50) for rep in repertoires]
random_k1000_values = [K1000_Diversity(rep, AAPLZGraph.encode_sequence, draws=50) for rep in random_sequences]
low_diversity_k1000_values = [K1000_Diversity(rep, AAPLZGraph.encode_sequence, draws=50) for rep in low_diversity_sequences]

This code snippet is designed to explore the diversity of TCRB sequences by generating random sequences and intentionally low-diversity sequences, and then calculating their diversity using the K1000 Diversity index. Here's a breakdown of the snippet:

1. Random Sequence Generation:

  • generate_random_sequence(length): This function creates a random sequence of amino acids of a specified length. The sequence is constructed by randomly selecting amino acids from the 20 standard options and concatenating them.

2. Low Diversity Sequence Generation:

  • generate_low_diversity_sequence(base_sequence, changes=5): This function takes a base sequence and applies a specified number of random substitutions (default is 5), producing a near-copy of the base. It randomly selects positions in the sequence and sets each to a random amino acid from the standard 20.

3. Application to Repertoires:

  • random_sequences: For each repertoire, this generates a list of random sequences matching the lengths of the original sequences in the repertoire.
  • low_diversity_sequences: For each repertoire, this generates a list of sequences derived from the first sequence of the first repertoire but with low diversity (5 random changes each).

4. Diversity Analysis:

  • original_k1000_values: Calculates the K1000 Diversity index for each original repertoire using the K1000_Diversity function, which measures the diversity of sequences by considering how they can be compressed.
  • random_k1000_values: Calculates the K1000 Diversity index for each list of randomly generated sequences, allowing for a comparison of diversity levels between original and purely random sequences.
  • low_diversity_k1000_values: Calculates the K1000 Diversity index for each list of low-diversity sequences, enabling an analysis of how intentional reductions in sequence variability impact diversity metrics.

Visualizing the Spectrum: K1000 Index Values Across High-Diversity Random Sequences and Low-Diversity Controlled Sequences Compared to Original Repertoires
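One detail of the low-diversity generator is worth checking: because positions are drawn independently, the same position can be hit twice, and a substitution can redraw the residue already present. Each variant therefore differs from the base in at most five positions, not exactly five. The sketch below verifies this with a Hamming distance; the base sequence is a toy placeholder and `hamming_distance` is a helper introduced here for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_low_diversity_sequence(base_sequence, changes=5):
    # Same logic as the generator in the snippet above.
    sequence = list(base_sequence)
    for _ in range(changes):
        pos = random.randint(0, len(sequence) - 1)
        sequence[pos] = random.choice(AMINO_ACIDS)
    return "".join(sequence)

def hamming_distance(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

base = "CASSLGQGYEQYF"  # toy base sequence, for illustration only
variants = [generate_low_diversity_sequence(base) for _ in range(1000)]
distances = [hamming_distance(base, v) for v in variants]

# The upper bound holds for every variant; the lower bound can be 0.
print(min(distances), max(distances))
```

A practical consequence is that the low-diversity set can contain exact duplicates of the base and of each other, so the number of unique sequences in it may be smaller than the number generated.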

While the K1000 Index adeptly captures the overarching diversity within entire sets of sequences or repertoires, offering a broad view of the informational landscape, it sets the stage for a more granular exploration of sequence significance. Enter LZCentrality — a complementary measure that zooms in on the individual sequences to gauge their influence within the complex tapestry of TCRB repertoires. This transition from the macroscopic diversity portrayed by the K1000 Index to the microscopic significance unearthed by LZCentrality enables us to not only appreciate the breadth of diversity at a repertoire level but also to identify and understand the pivotal roles played by specific sequences within that diversity.

LZCentrality: Gauging Sequence Influence in TCRB Repertoires

In the complex terrain of TCRB repertoires, LZCentrality stands out as a measure of the influence of individual sequences. It is rooted in the LZGraph’s architecture, where nodes are LZ subpatterns and edges connect subpatterns that follow one another in a sequence. LZCentrality offers a glimpse into a sequence’s centrality, akin to understanding the importance of a bustling intersection in a vast network of roads.

Through the lens of LZCentrality, sequences are akin to paths in a city, with their centrality determined by the traffic at each intersection (node). For example, a “common” sequence, resembling a major highway, shows dense node traffic due to its similarity to other sequences in the repertoire. This high node traffic, or out-degree, signals a sequence’s central role in the repertoire’s landscape.

LZCentrality, calculated as the average out-degree across a sequence’s nodes, reveals the sequence’s connectivity and position within the repertoire. It’s a measure that transcends mere frequency, illuminating the sequence’s role in the repertoire’s intricate web.
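The definition above, the average out-degree across a sequence’s nodes, can be sketched on a toy graph without the library. This is an illustration only: the parse in `lz_subpatterns`, the (subpattern, end position) node labels, and the choice to count a node missing from the graph as out-degree 0 are all assumptions, and the function names are hypothetical.

```python
def lz_subpatterns(sequence):
    """Simplified LZ-style parse; the library's own encoder may differ."""
    seen, parts, start = set(), [], 0
    for end in range(1, len(sequence) + 1):
        piece = sequence[start:end]
        if piece not in seen or end == len(sequence):
            seen.add(piece)
            parts.append(piece)
            start = end
    return parts

def positional_nodes(sequence):
    """Label each subpattern with the position at which it ends (assumed labelling)."""
    nodes, position = [], 0
    for part in lz_subpatterns(sequence):
        position += len(part)
        nodes.append((part, position))
    return nodes

def build_toy_graph(repertoire):
    """Map each node to the set of nodes that directly follow it in any sequence."""
    successors = {}
    for seq in repertoire:
        nodes = positional_nodes(seq)
        for node in nodes:
            successors.setdefault(node, set())
        for src, dst in zip(nodes, nodes[1:]):
            successors[src].add(dst)
    return successors

def centrality_sketch(graph, sequence):
    """Average out-degree over the sequence's nodes (missing nodes count as 0)."""
    nodes = positional_nodes(sequence)
    return sum(len(graph.get(node, ())) for node in nodes) / len(nodes)

toy_graph = build_toy_graph(["CASSLG", "CASSLA", "CASRT"])
shared_path = centrality_sketch(toy_graph, "CASSLG")  # walks through two branching nodes
side_path = centrality_sketch(toy_graph, "CASRT")     # walks through one branching node
```

In this toy repertoire the sequence that shares more of its path with other sequences passes through more branching nodes, so it scores higher, which is the “major highway” intuition from the analogy.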

Example

To illustrate the process of calculating LZCentrality values for each sequence in a given repertoire and identifying the 10 most central sequences, we embark on a practical journey through the code and its visual representation. This example serves as a hands-on guide to uncovering the sequences that play pivotal roles within their respective repertoires, akin to finding the most influential individuals within a vast network.

from LZGraphs import LZCentrality, AAPLZGraph

lz_centrality_values = {}

repertoire = repertoires[0]

# Create an LZGraph
graph = AAPLZGraph(repertoire)

# Calculate LZCentrality for each sequence
for sequence in repertoire:
    centrality = LZCentrality(graph, sequence)
    lz_centrality_values[sequence] = centrality

# Sort sequences by their LZCentrality values in descending order
sorted_sequences = sorted(lz_centrality_values, key=lz_centrality_values.get, reverse=True)

# Select the top 10 most central sequences
top_10_sequences = sorted_sequences[:10]

Building on our understanding of LZCentrality within individual repertoires, we now embark on a more expansive exploration. We’ll take the top 10 sequences, as identified by their centrality from the first repertoire, and investigate how these sequences manifest across the entire dataset. This involves calculating the centrality of these top 10 sequences within each repertoire in our collection, providing a unique lens through which to view their ubiquity and influence across different repertoires.

In parallel, to introduce a comparative perspective, we’ll also select 10 random sequences from the first repertoire and track their centrality across the same range of repertoires. This dual approach will not only highlight the pervasiveness of highly central sequences but also offer a baseline comparison to gauge the significance of centrality within this complex sequence landscape.
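The two-group comparison can be organised with a small helper that takes the graph constructor and the centrality function as arguments, so it does not depend on any particular library call. Both helper names are hypothetical, and treating a `KeyError` as “this sequence has no centrality in this repertoire” is an assumption about how a missing node would surface.

```python
import random

def centrality_across_repertoires(sequences, repertoires, build_graph, centrality):
    """Return {sequence: [centrality in repertoire 0, repertoire 1, ...]}.

    A KeyError from `centrality` is recorded as None (assumption: that is how
    a sequence with nodes missing from a graph would surface).
    """
    results = {seq: [] for seq in sequences}
    for repertoire in repertoires:
        graph = build_graph(repertoire)  # one graph per repertoire
        for seq in sequences:
            try:
                results[seq].append(centrality(graph, seq))
            except KeyError:
                results[seq].append(None)
    return results

def pick_random_baseline(repertoire, n=10, seed=0):
    """Draw n distinct sequences from a repertoire as a comparison group."""
    unique = sorted(set(repertoire))
    return random.Random(seed).sample(unique, n)
```

With the names from the snippet above, the two groups would be obtained as `centrality_across_repertoires(top_10_sequences, repertoires, AAPLZGraph, LZCentrality)` and the same call with `pick_random_baseline(repertoires[0])` as the first argument. Building one graph per repertoire and reusing it for every sequence avoids reconstructing the graph twenty times per repertoire.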

Comparative Distribution of LZCentrality Across Repertoires: Delineating the Influence of Top 10 Central Sequences Versus Randomly Selected Sequences

Conclusions

In our exploration, we’ve taken a foundational dive into the K1000 and LZCentrality indexes, unraveling their potential through basic analytical approaches. These measures, however, extend far beyond the preliminary analyses showcased here. They hold the power to significantly advance our understanding and identification of influential sequences across a myriad of contexts — be it contrasting pathologies, different species, or other distinctive groupings. By leveraging these indexes, we stand on the cusp of enhancing our capability to detect and highlight significant repertoires and sequences that are critical to our understanding of immune diversity and function.

As we move forward, our forthcoming posts will delve deeper into the application of the K1000 and LZCentrality indexes. We aim to shed light on the subtle yet profound differences between various groups, employing these sophisticated tools to dissect and discern the nuances of immune repertoire diversity. Stay tuned as we continue to navigate the intricate landscape of immune sequencing, unveiling insights that could redefine our approach to immunological research and its applications in health and disease.

References

A Novel Approach to T-Cell Receptor Beta Chain (TCRB) Repertoire Encoding Using Lossless String Compression, T Konstantinovsky, G Yaari 2023.
