Market Sizing the Immune System

Dhuvi Karthikeyan
16 min read · Jul 2, 2022

For my first content post, I wanted to pay homage to the question that first sparked my fascination with the immune system, a fascination that eventually drove me to pursue a career studying the complexities of the immune system at a systems level. In this post, we’ll take a first-pass look at one of the most fundamental phenomena underpinning the adaptive immune response: the balance between specificity and degeneracy in the Adaptive Immune Receptor Repertoire (AIRR). Specifically, we will use statistics and probability to ask which factors push the T-cell repertoire size to an order of magnitude of ~10⁶/10⁷, out of all the values in the range [0, ∞) it could have taken. This style of mathematical biology, market sizing the immune system, came about in the early ’90s with a hallmark 1993 paper titled “How diverse should the immune system be?” by De Boer and Perelson. They used a probabilistic argument to posit that the diversity of the immune system is driven not by how many potential pathogens the organism needs to detect, but rather by how many self antigens the host needs to avoid. In this investigation, we’ll work through the De Boer and Perelson paper to introduce the TCR repertoire diversity problem using their methods, with some commentary about what was known then and what is known now.

Edit: While I had intended for this to be a standalone post, it appears to have grown longer than I anticipated and will likely be an introduction to a rich topic with follow-ups including my own novel work in this matter down the line.

1. Background and Motivations

The term T-cell repertoire refers to the collective set of unique T-cell receptors within an individual. T-cells, one of the most important immune cells in our body, have many functions but are best known for their targeted killing of infected cells and tumor cells. They are able to distinguish self from non-self using their T-cell receptor (TCR), which scans molecules presented by antigen-presenting cells and epithelial cells on a Major Histocompatibility Complex (MHC) molecule, and they are activated upon finding a cognate antigen, as shown below. The specific peptide that is presented after the antigen is processed is called an epitope. The image shows two major types of T-cells: CD8 T-cells (cytotoxic T-cells) and CD4 T-cells (helper T-cells). Although they have different functions, the methods of activation are the same. You can find a pretty comprehensive review on T-cells here. Canonically speaking, each T-cell is covered by around 30,000 identical T-cell receptors, which are able to recognize a specific molecular signature. The image below shows one such TCR per cell for simplicity.

Image from NLRC5: a key regulator of MHC class I-dependent immune responses.

What we’ll be exploring in this post is the underlying question of why we all have on the order of ~10⁶/10⁷ unique TCRs at a given point. To some this might seem like a large number, to others a small one. As such, it’s important to contextualize specific numbers with respect to the scales that T-cells operate in. Let us assume a perfect immune system, defined here to mean that there exists a unique TCR for every possible target epitope. The theoretical upper bound for the epitope space for CD8 T-cells is defined by the size of peptides that bind to the MHC class I molecule (MHC-I). The MHC-I classically presents peptides of around 9 amino acids in length, but lengths can range from 8 to 11 and on occasion even higher (if you’re curious about the distribution of peptide lengths, see this). Combinatorially, if we enumerated each of the 20 conventional amino acids at each position on the presented peptide, we would get a maximum estimate of around 20¹¹ (≈ 2 × 10¹⁴, or roughly 10¹⁴) possible epitopes that a perfect immune system would encounter; 10⁶ is many orders of magnitude smaller, so much so that 10¹⁴ − 10⁶ ≈ 10¹⁴. If we assume that the 10⁶ TCRs can protect against all 10¹⁴ epitopes, does it mean that every TCR is able to bind ~10⁸ (10¹⁴/10⁶) epitopes? That doesn’t seem all that specific, and the answer is in fact a little more nuanced. A perfect immune system need not protect against every epitope, but rather against every pathogen, because evolutionary pressure and natural selection act only when a pathogen is able to slip past the immune system and kill a host before it reaches reproductive maturity. The number of human pathogens was estimated in 2006 to be just over 1,400, which is several orders of magnitude less than the TCR repertoire size. Somewhere between the number of human pathogens and the theoretical maximum number of epitopes lies a minimization problem solved by evolution, giving us ~10⁶/10⁷ TCRs in our repertoires.
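The back-of-the-envelope numbers above are easy to reproduce. Here is a minimal sketch that counts the possible peptides at each MHC-I length mentioned in the text and divides the 11-mer upper bound by a repertoire of 10⁶ (the constant and function names are mine, chosen for illustration):

```python
# Size of the theoretical CD8 epitope space for the peptide lengths quoted above.
AMINO_ACIDS = 20           # conventional amino acids
REPERTOIRE_SIZE = 10**6    # lower end of the ~10^6/10^7 range used in the text


def epitope_space(length):
    """Number of distinct peptides of the given length."""
    return AMINO_ACIDS ** length


for length in range(8, 12):
    print(f"{length}-mers: {epitope_space(length):.2e} possible peptides")

upper_bound = epitope_space(11)
epitopes_per_tcr = upper_bound / REPERTOIRE_SIZE
print(f"epitopes per TCR if 10^6 TCRs covered everything: {epitopes_per_tcr:.2e}")
```

The last line lands at roughly 2 × 10⁸, the same order of magnitude as the ~10⁸ figure in the text.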

[Note: For those who don’t enjoy math, feel free to skip to section 3 for a visual explanation of the following concepts.]

2. How diverse should the immune system be?

Rob J. De Boer and Alan S. Perelson, 1993

2.1 Background

In the paper, the authors set the scene by introducing the adaptive immune paradigm: each clone of the adaptive immune system (T cells or B cells) has a unique antigen receptor, and if two lymphocytes within the same individual have the same antigen receptor, they must be clones that share a common ancestor. They move on to question why the adaptive immune repertoire is so large, since by the process of MHC restriction only the subset of epitope space that can be loaded on a specific MHC allele product can be presented for recognition. Moreover, the epitope space is further reduced by self-tolerance, or the need for the immune system to avoid the body’s own proteins and their molecular motifs. Though in 1993 scientists didn’t have a good understanding of the mechanisms of immune tolerance (central vs. peripheral), De Boer and Perelson recognized that there were multiple mechanisms and processes of self-tolerance they would have to consider. Their model simplifies the process of tolerance by assuming random receptor generation, which on occasion gives rise to auto-reactive clones that are functionally deleted from the repertoire, a process we now understand as thymic selection. We now know more about the specifics of this process, which is mediated by specialized cell types like mTECs, cTECs, and Tregs.

2.2 Problem Formulation and Model Assumptions/Simplifications

The authors assume that a generic pathogen is recognizable by a specific number of antigens, a. The authors show that this simplification does not affect the downstream results. Either way, most pathogens have only a handful of antigens. This is because antigens need to be relatively large (with the smallest ones being 8,000–10,000 Daltons) and are made up of proteins and other macromolecules. They must also be structured, non-repeating, and dissimilar from self. While the terms are often used interchangeably in a casual setting (which I may have done somewhere in this post), antigens are simply substances that generate an immune response against them, and epitopes refer specifically to the amino acid sequences that are recognized by adaptive immune receptors. The relation goes: a pathogen can have multiple antigens, and each antigen can have many epitopes. A coarse upper bound for the number of epitopes for a given pathogen can be calculated by enumerating the contiguous k-mers in its proteome (counting the number of different amino acid “words” of size k by sliding a window of size k down the sequence). For a pathogen with 1,000 amino acids and a k of 9, there are 992 (N−k+1, or 1,000−9+1) k-mers. This upper bound is a relatively large overestimate for a couple of reasons. The first is tolerance, whereby k-mers that are found in self, or that approximate self k-mers, are excluded. The other is bias in antigen processing and peptide presentation, whereby certain cut site motifs are preferentially digested and specific molecular patterns are needed for docking on MHC molecules. Here is a great resource on antigen processing and presentation that goes into the mechanisms of how such biases may arise. The authors then put forth the notion of a probability of escape: a pathogen with a antigens goes undetected with probability PE^a, where PE is the probability that a single generic antigen escapes detection.
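The sliding-window count described above can be sketched in a few lines. This is an illustration only: the 1,000-residue sequence is randomly generated, not a real pathogen, and the function name `kmers` is my own.

```python
import random

AMINO_ACID_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 conventional amino acids


def kmers(protein, k=9):
    """Return every contiguous k-mer, sliding a window of size k along the sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]


# Hypothetical 1,000-residue "pathogen", generated at random for illustration.
rng = random.Random(0)
toy_pathogen = "".join(rng.choices(AMINO_ACID_ALPHABET, k=1000))

windows = kmers(toy_pathogen, k=9)
print(len(windows))        # number of windows: N - k + 1
print(len(set(windows)))   # distinct 9-mers, never more than the window count
```

The window count is the coarse upper bound from the text; tolerance and processing biases would only remove k-mers from this list.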

Note that the a is actually an exponent and not just a superscript. The authors are making the simplification that each antigen is equally likely to avoid detection, as given by the factorization of the total probability of escape into the probability of escape of a generic antigen raised to the a-th power.

This is, unsurprisingly, not true in vivo. As previously mentioned, due to biases in MHC processing and presentation, there is an increased propensity for a certain subset of a pathogen’s antigens to be presented and recognized. However, if we condition the probability of escape on the event of the antigen being presented in an individual, ℙ(escape|presented), then this new probability is approximately the same for all presented antigens. This is because the new value is the probability that a randomly generated TCR does not bind to a specific epitope, which is approximately equal for all presented epitopes. They then make the argument that evolution acts to minimize this probability such that the host organism is reliably able to reach reproductive maturity. As such, this minimizing pressure is expected to stop at a value of the escape probability where the organism reaches adulthood with near certainty. The authors note here that their results are not sensitive to their assumptions about a, as their model relies only on PE.

2.3 Model

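Table of terms used:

a: number of antigens per pathogen
PE: probability that a single antigen escapes detection
PR: probability that a given receptor recognizes a random epitope
R0: repertoire size before functional deletion of self-reactive receptors
R: functional repertoire size after deletion
f: fraction of receptors that survive functional deletion
n: number of self antigens
ε: number of epitopes per antigen (ε = ρPpm in the T-cell case)
ρ: average number of peptides a typical antigen is processed into
Pp: probability that a random digested peptide is presented by an MHC molecule
m: number of MHC molecule types within a class expressed by an individual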

The authors then flip the idea of escape probability to define PR, the probability of recognition between a given lymphocyte clone’s receptor and a random epitope. This functions as a proxy for cross-reactivity, or the TCR’s ability to bind to multiple epitopes. Given the specificity of adaptive immune receptors, we expect this probability to be quite low, with estimates at the time of around 10⁻⁵ (0.00001). Even today the order of magnitude holds. They define the repertoire size before and after functional deletion of self-reactive receptors as R0 and R, respectively. They also denote ε as the number of epitopes per antigen and n as the number of self antigens, so nε is the number of self epitopes. While the paper goes into both T- and B-cell repertoires, our focus here is on the T-cell compartment. As the first step, the authors determine the difference between the pre-tolerance repertoire size and the functional repertoire size after auto-reactive clones have been removed. Let f be the fraction of receptors that are not functionally deleted from the repertoire. f must be equal to the probability that a receptor does not recognize any of the nε self epitopes.

Derivation of fraction surviving functional deletion due to self-reactivity.
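A sketch of this derivation, reconstructed from the definitions in this section. It assumes that each of the nε self epitopes is recognized independently with probability PR; the paper’s own notation may differ.

```latex
\begin{aligned}
f &= (1 - P_R)^{\,n\varepsilon} \\
  &= \exp\!\big(n\varepsilon \,\ln(1 - P_R)\big) \\
  &\approx e^{-n\varepsilon P_R}
  \qquad \text{since } \ln(1 - P_R) \approx -P_R \text{ for } P_R \ll 1 .
\end{aligned}
```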

The authors calculate the probability of a generic receptor escaping nε self peptides here, and the argument is that this probability is equal to the fraction of receptor space that would survive negative selection. The authors use the linearization ln(1+x) ≈ x, which holds for small |x|, to approximate ln(1−PR) ≈ −PR. This approximation is used frequently in later derivations, so take some time to convince yourself of it now.
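One way to convince yourself is to compare the exact and linearized expressions numerically. In this minimal sketch, PR = 10⁻⁵ comes from the estimate quoted earlier, while the number of self epitopes (10⁵) is an assumed round number chosen purely for illustration.

```python
import math


def surviving_fraction_exact(p_r, n_self_epitopes):
    """Probability that a receptor recognizes none of the self epitopes."""
    return (1.0 - p_r) ** n_self_epitopes


def surviving_fraction_approx(p_r, n_self_epitopes):
    """Same quantity using ln(1 - p) ~ -p."""
    return math.exp(-p_r * n_self_epitopes)


P_R = 1e-5                 # order of magnitude quoted in the text
N_SELF_EPITOPES = 100_000  # assumed value of n * epsilon, for illustration only

exact = surviving_fraction_exact(P_R, N_SELF_EPITOPES)
approx = surviving_fraction_approx(P_R, N_SELF_EPITOPES)
print(f"ln(1 - P_R) = {math.log1p(-P_R):.12e}  vs  -P_R = {-P_R:.12e}")
print(f"exact f = {exact:.8f}, approximate f = {approx:.8f}")
```

For recognition probabilities this small, the two values agree to several decimal places, which is why the linearization is safe to reuse later.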

After modeling the result of functional deletion of the self-reactive cells, the authors look to the evolutionary pressure exerted by pathogen escape. In the simplified case that they use to describe B-cell recognition, a generic pathogen escapes only if none of its εa epitopes is recognized by any of the R receptors. This can be used to derive the functional repertoire size R that maintains an acceptable probability of escape, as shown below for the simplified version of this problem, which fits B-cell biology:

Derivation of repertoire size (B-Cell case).
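A sketch of the B-cell derivation, reconstructed from the definitions in this section rather than copied from the paper. It assumes independent recognition events and reuses f ≈ e^{−nεPR} from the functional deletion step.

```latex
\begin{aligned}
P_E^{\,a} &= (1 - P_R)^{R\,\varepsilon\,a}
  && \text{(none of the } \varepsilon a \text{ epitopes is recognized by any of } R \text{ receptors)} \\
P_E &= (1 - P_R)^{R\,\varepsilon} \approx e^{-P_R R\,\varepsilon} \\
R &= -\frac{\ln P_E}{\varepsilon\,P_R} \\
R_0 &= \frac{R}{f} = -\frac{\ln P_E}{\varepsilon\,P_R}\; e^{\,n\varepsilon P_R}
\end{aligned}
```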

However, in the T-cell case, there is increased complexity because antigen recognition is mediated through the presentation of the epitope on MHC molecules. To account for this, they define ε = ρPpm, where ρ is the number of peptides on average that a typical antigen can be processed or digested into; Pp is the probability of presentation of a random digested peptide by an MHC molecule; and m is the number of MHC molecule types within a certain class expressed by an individual. Pp is a function of the actual physicochemical properties of the peptide as well as the class and allele of the MHC molecule it is docking on, but it is kept constant as an averaged number in the model for simplicity. Furthermore, we also know that the number of different peptides that can be generated from an antigen is a function of the amino acid sequence, with specific amino acid motifs acting as cut sites. Escape from the T-cell repertoire can therefore occur via two potential mechanisms in their model: failure of presentation by an MHC and failure of recognition upon presentation.

We can iteratively derive the probability of a pathogen escaping the immune repertoire by first looking at the simpler probability of a single epitope with a single MHC. We can then collect over all the different peptides (epitopes) derived from an antigen, over the a antigens in a pathogen, and over the m MHCs:

Derivation of T-Cell Probability of Escape
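A sketch of how this derivation can go, reconstructed from the two escape mechanisms described above (failure of presentation, or failure of recognition upon presentation) and assuming all events are independent. Treat it as an illustration consistent with this post’s definitions, not a transcription of the paper.

```latex
\begin{aligned}
\Pr(\text{one peptide escapes, one MHC})
  &= \underbrace{(1 - P_p)}_{\text{not presented}}
   + \underbrace{P_p\,(1 - P_R)^{R}}_{\text{presented, not recognized}} \\
  &= 1 - P_p\left[\,1 - (1 - P_R)^{R}\right] \\[4pt]
P_E^{\,a}
  &= \Big(1 - P_p\left[\,1 - (1 - P_R)^{R}\right]\Big)^{\rho\, m\, a}
  && \text{(over } \rho \text{ peptides, } m \text{ MHCs, } a \text{ antigens)}
\end{aligned}
```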

However, since the number of antigens a is found on both sides of the equation, we can simplify terms by taking the a-th root of both sides of the equation.

We can then isolate the functional repertoire size R and use that to solve for R0 or the pre tolerance repertoire size:

Derivation of receptor repertoire size (pre and post tolerance).
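A sketch of these steps in this post’s notation, assuming the single-antigen escape probability PE = (1 − Pp[1 − (1 − PR)^R])^{ρm}, the linearization ln(1+x) ≈ x, and f ≈ e^{−nεPR} from the functional deletion step. It is a reconstruction under those assumptions and may differ in detail from the paper.

```latex
\begin{aligned}
\ln P_E &= \rho\, m \,\ln\!\Big(1 - P_p\left[\,1 - (1 - P_R)^{R}\right]\Big) \\
        &\approx -\rho\, m\, P_p \left(1 - e^{-P_R R}\right)
         = -\varepsilon \left(1 - e^{-P_R R}\right) \\[4pt]
e^{-P_R R} &= 1 + \frac{\ln P_E}{\varepsilon} \\[4pt]
R &= -\frac{1}{P_R}\,\ln\!\left(1 + \frac{\ln P_E}{\varepsilon}\right) \\[4pt]
R_0 &= \frac{R}{f}
     = -\frac{e^{\,n\varepsilon P_R}}{P_R}\,\ln\!\left(1 + \frac{\ln P_E}{\varepsilon}\right)
\end{aligned}
```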

The authors note a key restriction of the above equation: it is only defined if the operand of the first natural log, 1 + ln(PE)/ε, is positive. Since the log of a probability is never positive and has a range of (−∞, 0], this means that ln(PE)/ε must lie in the interval (−1, 0], or equivalently that |ln(PE)| must be smaller than ε. Is this a purely mathematical constraint, or does it have a biological significance? There are a number of implications that come with this, and I encourage the reader to see if they can identify a couple.
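To make the restriction concrete, here is a minimal sketch that evaluates the functional repertoire size R = −ln(1 + ln(PE)/ε)/PR and rejects parameter combinations where the log operand is not positive. The formula is my reading of the derivation in this post’s notation, and the numeric values are illustrative assumptions. One possible reading of the constraint: PE cannot be pushed below e^(−ε), roughly the chance that no peptide of an antigen is presented at all, no matter how large the repertoire is.

```python
import math


def min_escape_probability(eps):
    """Escape probability floor implied by 1 + ln(PE)/eps > 0."""
    return math.exp(-eps)


def functional_repertoire(p_r, p_e, eps):
    """R = -ln(1 + ln(PE)/eps) / PR, defined only when the log operand is positive."""
    operand = 1.0 + math.log(p_e) / eps
    if operand <= 0.0:
        raise ValueError(
            f"PE={p_e} is below the floor {min_escape_probability(eps):.3g} for eps={eps}"
        )
    return -math.log(operand) / p_r


# Illustrative values only: PR from the text, PE and eps assumed.
print(functional_repertoire(p_r=1e-5, p_e=0.01, eps=5.0))
print(min_escape_probability(3.0))  # PE = 0.01 is out of reach when eps = 3
```

Calling `functional_repertoire(1e-5, 0.01, 3.0)` raises a `ValueError`, because with only three presented epitopes per antigen the requested escape probability sits below the floor.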

2.4 Minimal Repertoire

The idea of a minimal repertoire is guided by an evolutionary way of thinking, which asks the following question: “What is the smallest (and therefore least resource consuming) repertoire with which the immune system can maintain a predictably low probability of pathogen escape?” We can answer that by differentiating the expression for repertoire size to find any local extrema. R0 is defined in terms of four parameters: PR, PE, ε, and n. ε is equal to ρPpm, so we can say that there are 6 parameters in total. n, ρ, and m are variables that can be estimated closely with in vitro methods, whereas the probabilities PE, PR, and Pp represent more complicated phenomena whose values are coarser estimates. While the paper only mentions the derivative with respect to PR, we will take the derivative of R0 with respect to each of the three probabilities to see what we can learn about the minimal repertoire size in various contexts.

Solving for optimal PR.

This in and of itself is a rather neat finding: the minimal repertoire size occurs at a probability of recognition of 1/#self_epitopes, that is, PR = 1/(nε) (more on this later). The next two derivative calculations, for PE and Pp, are not found in the paper. They were performed to further illustrate a point the paper makes downstream.
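The optimum can also be checked numerically. This sketch scans PR over a log-spaced grid and compares the minimizer of R0 with 1/(nε). It assumes the form R0 = −e^(nεPR)·ln(1 + ln(PE)/ε)/PR, which is my reading of the derivation in this post’s notation, and the values of n, ε, and PE are illustrative rather than taken from the paper.

```python
import math


def pre_tolerance_size(p_r, p_e, eps, n_self):
    """R0 = R / f, with R = -ln(1 + ln(PE)/eps)/PR and f = exp(-n*eps*PR)."""
    functional = -math.log(1.0 + math.log(p_e) / eps) / p_r
    surviving_fraction = math.exp(-n_self * eps * p_r)
    return functional / surviving_fraction


# Illustrative parameter values (assumptions, not the paper's estimates).
N_SELF, EPS, P_E = 10_000, 10.0, 0.01

# Scan PR from 1e-7 to 1e-3 on a log-spaced grid.
grid = [10 ** (-7 + 0.001 * i) for i in range(4001)]
best_p_r = min(grid, key=lambda p: pre_tolerance_size(p, P_E, EPS, N_SELF))
predicted_p_r = 1.0 / (N_SELF * EPS)

print(f"grid minimum at PR = {best_p_r:.3e}")
print(f"1 / (n * eps)      = {predicted_p_r:.3e}")
```

With these assumed values the predicted optimum is 10⁻⁵, the same order of magnitude as the cross-reactivity estimate quoted earlier.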

Solving for optimal PE.
Solving for optimal Pp.

Plugging in for the solved value of the PR that minimizes R0, we get:

Derivation of minimal repertoire size with respect to PR.
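A sketch of this substitution in this post’s notation, assuming R0 = −e^{nεPR}·ln(1 + ln(PE)/ε)/PR. The last line is a limiting case I have added for intuition; it is not the Pp-based simplification discussed in the text.

```latex
\begin{aligned}
\frac{\partial R_0}{\partial P_R}
  &\propto \frac{e^{\,n\varepsilon P_R}\,(n\varepsilon P_R - 1)}{P_R^{2}} = 0
  \;\Longrightarrow\; P_R^{*} = \frac{1}{n\varepsilon} \\[4pt]
R_0^{\min} &= R_0\big(P_R^{*}\big)
  = -\,e\,n\,\varepsilon\,\ln\!\left(1 + \frac{\ln P_E}{\varepsilon}\right) \\[4pt]
R_0^{\min} &\approx -\,e\,n\,\ln P_E
  \qquad \text{when } |\ln P_E| \ll \varepsilon
\end{aligned}
```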

The paper derives the above equation and shows that the minimal repertoire size, for a given PE and ε, scales with n. We can, however, go one step further to convince ourselves of this by substituting our optimized value of Pp in for ρPpm wherever we see ε:

Further simplification using optimized Pp.

Here we show that after substituting in the optimized Pp, we eliminate the epsilon (ε) term from the expression for the minimal repertoire, further underscoring the authors’ key takeaway: the minimal receptor repertoire size for a given probability of escape is a function of n, the number of self antigens.

3. Major Findings

The minimal receptor repertoire size for a given probability of escape is a function of n, the number of self antigens.

It might not be immediately obvious why this was such a hallmark finding. The notion that our receptor repertoire size has more to do with the self than it does with the number of foreign objects we must defend against is rather counterintuitive. To understand the richness of such a concept, it can be especially helpful to think of it in the context of evolution. If we could put ourselves in the shoes of Evolution, the perfect iterative design system, how would one go about designing an immune system? How would one create a system capable of identifying threats without knowing them a priori? A reasonable method might be random generation of different receptors. Through randomness one can ensure a roughly equal probability of protecting against any unseen threat. Since evolution must also prioritize frugal resource consumption, the goal would essentially be to cover as much of pathogen space as possible with the fewest receptors.

As a more concrete way of envisioning this, imagine a game where a blindfolded individual must throw special darts at a board (each dart has a ring around it, so that it covers a circular area of the board when thrown), while avoiding several marked X’s. If an ‘X’ is hit, the game doesn’t end, but the dart is removed from the board and still counts toward the total number of darts thrown. The object of the game is to cover a certain percentage of the board with the fewest throws. The board here represents epitope space projected onto two dimensions. Each dart thrown is a randomly generated receptor, and the area of the board that the dart covers is the cross-reactive area of that receptor (the set of different epitopes with a specific motif that a receptor is capable of binding). The X’s symbolize the self antigens, and the darts that hit them are the self-reactive receptors that are functionally deleted; the percentage of the board that must be covered is related to PE; and the number of darts thrown is the repertoire size that needs to be minimized. Below is a schematic picture of the epitope shape space (dart board) with the immune receptors (darts) and their areas of cross reactivity.

Image taken from arXiv

From the above it should be clear that there are two factors that govern the coverage of the epitope board: the number of darts thrown at it (pre-tolerance repertoire size), and the area covered by each dart (cross reactivity). Our claim here is that the number of darts thrown at the board somehow has to do with the number of X’s. To see why, let us consider three cases:

  1. Optimized Repertoire Size: In this case, we minimize the repertoire size as much as possible by increasing the cross reactivity of each receptor such that the smallest number of them are needed to cover the shape space. However, because the receptors are so non-specific, and their cross-reactive areas are so large, the majority of them will be self reactive and functionally deleted. This will mean that a large portion of the epitope space will be left untouched.
  2. Optimized Specificity: Here, we have highly specific receptors with low cross reactivity. The smaller areas mean that it takes many more receptors to fill up the epitope space, which is extremely resource intensive and thus evolutionarily unfavorable in the long term.
  3. Both Optimized: In the final case, we have a cross reactivity that is in-between the previous cases. Here we have specific receptors that have a moderate cross reactivity where the number of receptors is far less than in case two while still covering the shape space with enough density that the host is able to reach maturity.
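The three cases can be played out as a toy version of the dart game. This is a minimal Monte Carlo sketch under my own simplifying assumptions, not the paper’s model: epitope space is a list of discrete epitopes, each dart recognizes each epitope independently with probability `p_recognize`, and a dart that recognizes any self epitope is discarded but still counted. All sizes and names are hypothetical.

```python
import random


def darts_needed(p_recognize, n_self=20, n_foreign=500,
                 target_coverage=0.9, seed=0, max_darts=200_000):
    """Count darts thrown until the target share of foreign epitopes is covered."""
    rng = random.Random(seed)
    covered = [False] * n_foreign
    n_covered = 0
    for darts in range(1, max_darts + 1):
        # A dart that lands on any X (self epitope) is removed but still counted.
        if any(rng.random() < p_recognize for _ in range(n_self)):
            continue
        # A surviving dart covers each not-yet-covered foreign epitope at random.
        for i in range(n_foreign):
            if not covered[i] and rng.random() < p_recognize:
                covered[i] = True
                n_covered += 1
        if n_covered >= target_coverage * n_foreign:
            return darts
    raise RuntimeError("target coverage not reached within max_darts")


# Case 1: very cross-reactive; case 2: very specific; case 3: near 1 / n_self.
for label, p in [("high cross reactivity", 0.25),
                 ("high specificity", 0.005),
                 ("in between", 0.05)]:
    print(f"{label:>22}: {darts_needed(p)} darts")
```

In this toy setup the in-between case should need the fewest darts: the highly cross-reactive darts are mostly removed for hitting an X, while the highly specific ones each cover very little of the board.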

With these special cases, we can see how the cross-reactive area and the distribution of the self antigens result in varying levels of coverage of the epitope shape space. So, if we wanted to instead vary the number of self antigens and hold the level of cross-reactivity and the percentage of epitope space covered at fixed values, what would need to occur?

It would, in fact, take a lot of very similar receptors with a lot of epitope space overlap to cover the board to the same degree that a smaller number of receptors could manage given fewer self antigens. This is just another way of arriving at the key finding: the size of our immune repertoires is determined not by the magnitude of potential threats we may encounter in our lifespan, but rather by the scale of the self peptides we must avoid.
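The dependence on n can be made explicit with a short calculation. This sketch assumes the expressions R0 = −e^(nεPR)·ln(1 + ln(PE)/ε)/PR and R0_min = −e·n·ε·ln(1 + ln(PE)/ε), which are my reading of the derivation in this post’s notation; the parameter values are illustrative.

```python
import math


def pre_tolerance_size(p_r, p_e, eps, n_self):
    """R0 at a fixed cross-reactivity PR."""
    functional = -math.log(1.0 + math.log(p_e) / eps) / p_r
    return functional * math.exp(n_self * eps * p_r)


def minimal_pre_tolerance_size(p_e, eps, n_self):
    """R0 evaluated at the optimal PR = 1 / (n * eps)."""
    return -math.e * n_self * eps * math.log(1.0 + math.log(p_e) / eps)


P_E, EPS, P_R = 0.01, 10.0, 1e-5  # illustrative values (assumptions)

for n_self in (5_000, 10_000, 20_000):
    fixed = pre_tolerance_size(P_R, P_E, EPS, n_self)
    minimal = minimal_pre_tolerance_size(P_E, EPS, n_self)
    print(f"n = {n_self:>6}: R0 at fixed PR = {fixed:.3e}, minimal R0 = {minimal:.3e}")
```

With PR and PE held fixed, only the number of self antigens changes the answer: more X’s on the board means more darts are removed, so more must be thrown. When PR is re-optimized for each n, the minimal repertoire grows in direct proportion to n.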

4. Concluding Remarks

Although this paper was published in 1993, over a decade before the advent of RNA sequencing, let alone single-cell technology, De Boer and Perelson eloquently capture the dependence and contribution of the various moving parts involved in maintaining a specific repertoire size. They used empirical values generated by methods available then to show that, for reasonable parameters, their minimal repertoire sizes fell within the expected range. A number of papers in the late ’90s and early 2000s have corroborated their estimates using different setups, or even in vitro methods that yield better informed estimates. However, these different approaches all arrive at the same conclusion: while it would theoretically only take a few receptors to protect the host from a variety of targets, the diversity of the repertoire is driven by the distribution of self antigens. While their model makes reductionist assumptions about the given system, I think there is something to be said about their understanding of the underlying mechanisms that gave rise to their model, as evidenced by its veracity when held up to the test of time.

Dhuvi Karthikeyan

Microbio x Data Sci (Comp Bio) @UCBerkeley | Biophys and Comp Bio @UNC_BCBP 👨‍🏫👨‍🔬