Figures and data in Combining mutation and recombination statistics to infer clonal families in antibody repertoires

9 figures, 1 table and 1 additional file

Figures

Figure 1

Clonal families and $V J l$ classes.

(A) Variable region of the immunoglobulin heavy chain (IgH)-coding gene. (B) A clonal family is a lineage of related B cells stemming from the same VDJ recombination event. The partition of the B-cell receptor (BCR) repertoire into clonal families is a refinement of the partition into $V J l$ classes, defined by sequences with the same V and J usage and the same complementarity determining region 3 (CDR3) length $l$ . (**C–D**) Properties of $V J l$ classes in donor 326651 from Briney et al., 2019. (C) Distribution of $V J l$ class sizes exhibits power-law scaling. The total number of pairwise comparisons in the largest $V J l$ classes is $\sim (10^{5})^{2} = 10^{10}$ . (D) Distribution of the CDR3 length $l$ . The distribution is in yellow for in-frame CDR3 sequences ( $l$ multiple of 3), and in gray for out-of-frame sequences.
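
The partition into $V J l$ classes and the resulting pairwise-comparison counts can be sketched as follows. This is a minimal illustration, not the HILARy implementation; the record fields `v`, `j`, and `cdr3` and the toy sequences are assumptions made for the example.

```python
from collections import defaultdict

def vjl_classes(records):
    """Group records by (V gene, J gene, CDR3 length)."""
    classes = defaultdict(list)
    for rec in records:
        key = (rec["v"], rec["j"], len(rec["cdr3"]))
        classes[key].append(rec)
    return dict(classes)

def n_pairs(size):
    """Number of unordered pairs in a class of the given size."""
    return size * (size - 1) // 2

# Toy records (hypothetical, for illustration only).
records = [
    {"v": "IGHV3-9", "j": "IGHJ4", "cdr3": "TGTGCGAGAGAT"},
    {"v": "IGHV3-9", "j": "IGHJ4", "cdr3": "TGTGCGAGAGAC"},
    {"v": "IGHV3-9", "j": "IGHJ4", "cdr3": "TGTGCGAGAGATTAC"},
    {"v": "IGHV1-2", "j": "IGHJ6", "cdr3": "TGTGCGAGAGAT"},
]
classes = vjl_classes(records)
sizes = {key: len(members) for key, members in classes.items()}
```

For a class of $10^{5}$ sequences, `n_pairs(10**5)` gives about $5 \times 10^{9}$ unordered pairs, the same order of magnitude as the $10^{10}$ quoted in the caption.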

Figure 2 with 7 supplements

Complementarity determining region 3 (CDR3)-based inference method (HILARy-CDR3).

(A) Example distribution of normalized Hamming distances, $x = n / l$ , for one $V J l$ class with CDR3 length $l = 21$ , V gene IGHV3-9 and J gene IGHJ4 (black). We fit the distribution by a mixture of positive pairs (belonging to the same family, in blue) and negative pairs (belonging to different families, in red). See Figure 2—figure supplement 5 for example fit results across different CDR3 lengths. Inset: the prevalence is defined as the fraction of positive pairs and was estimated at $\hat{ρ} = 3.1 %$ . Data from donor 326651 of Briney et al., 2019. (B) Distribution of the maximum likelihood estimates of prevalence $\hat{ρ}$ across $V J l$ classes in donor 326651. (**C–F**) The choice of threshold $t$ on the normalized Hamming distance $x$ translates to the following a priori characteristics of inference (illustrated here for arbitrarily chosen $ρ$ and $μ$ ). (C) Fallout rate $\hat{p} (t) = \hat{F P} / (\hat{F P} + \hat{T N})$ . The null distribution of all negatives (N = FP + TN) is estimated using the soNNia sequence generation software. (D) Sensitivity $\hat{s} (t) = \hat{T P} / (\hat{T P} + \hat{F N})$ . (**E–F**) Precision $\hat{π} = \hat{T P} / (\hat{T P} + \hat{F P})$ . For the same choice of threshold $t$ , a low prevalence of $\hat{ρ} = 10^{- 3}$ (E) leads to lower precision than a high prevalence of $\hat{ρ} = 10^{- 1}$ (F). (G) Null distribution $P_{N} (x | l)$ of distances between unrelated sequences, for $l = 15, 30, 45, 60$ , computed by the soNNia software. (H) Precision $\hat{π}$ , computed a priori (i.e. before doing the inference) from the model with $\hat{μ} = 0.04$ , $\hat{ρ} = 0.1$ , and $l = 15, \ldots, 81$ (colors as in G), as a function of the threshold $t$ . For each $V J l$ class and its own inferred $\hat{ρ}$ and $\hat{μ}$ , the threshold $t$ is chosen to achieve a desired $π^{*}$ . (I) High-precision threshold $t^{*}$ ensuring $\hat{π} (t^{*}) = π^{*} = 99 %$ a priori, as a function of CDR3 length $l$ for different values of the prevalence $\hat{ρ}$ , and $\hat{μ} = 0.04$ , as predicted by the model. (J) Sensitivity $\hat{s} (t^{*})$ at the high-precision threshold $t^{*}$ , as a function of CDR3 length $l$ for different values of the prevalence $\hat{ρ}$ (colors as in I). Solid lines denote the a priori prediction for intermediate mean distance $μ = 4 %$ ; dashed lines denote the actual performance of HILARy-CDR3 in a synthetic dataset.
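
The dependence of precision on prevalence in panels E and F follows from the definitions alone: if a fraction $ρ$ of pairs is positive, then TP is proportional to $ρ s$ and FP to $(1 - ρ) p$ , so $π = ρ s / (ρ s + (1 - ρ) p)$ . A minimal sketch, with sensitivity and fallout values chosen arbitrarily for illustration:

```python
def precision(rho, s, p):
    """Precision from prevalence rho, sensitivity s, and fallout p.

    With M pairs in total, TP = rho * s * M and FP = (1 - rho) * p * M,
    so M cancels in TP / (TP + FP).
    """
    tp = rho * s
    fp = (1.0 - rho) * p
    return tp / (tp + fp)

# Illustrative values only (not taken from the paper's fits).
s, p = 0.9, 1e-3
low = precision(1e-3, s, p)    # low prevalence
high = precision(1e-1, s, p)   # high prevalence
```

At fixed $s$ and $p$ , the lower prevalence gives the lower precision, which is why the threshold has to be adapted to each class's estimated $\hat{ρ}$ .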

Figure 2—figure supplement 1

Mean intra-family distances.

Distribution of the maximum likelihood estimates of mean intra-family distance $μ$ across $V J l$ classes.

Figure 2—figure supplement 2

Null distribution $P_{N} (x | l)$ of CDR3 distances between unrelated sequences for $l \in [15, 81]$ , computed by soNNia software.

White line denotes a growing threshold ensuring a fallout rate $p < 10^{- 4}$ as determined by this distribution.
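
One way to read a fallout-controlled threshold off a null distribution is sketched below. It assumes that a pair is called positive when its Hamming distance is at most the threshold, so the fallout is the cumulative null probability up to that threshold. The binomial null used here is a stand-in for illustration only; the null distribution in the figure comes from soNNia.

```python
from math import comb

def binomial_null(l, q=0.75):
    """Stand-in null pmf over Hamming distances n = 0..l (assumption:
    each site differs independently with probability q)."""
    return [comb(l, n) * q**n * (1 - q)**(l - n) for n in range(l + 1)]

def fallout_threshold(pmf, max_fallout=1e-4):
    """Largest n such that P_N(distance <= n) < max_fallout, or None."""
    cumulative, best = 0.0, None
    for n, prob in enumerate(pmf):
        cumulative += prob
        if cumulative >= max_fallout:
            break
        best = n
    return best

lengths = (15, 30, 45, 60)
thresholds = {l: fallout_threshold(binomial_null(l)) for l in lengths}
normalized = {l: thresholds[l] / l for l in lengths if thresholds[l] is not None}
```

Even under this crude stand-in, the admissible normalized threshold grows with $l$ , in qualitative agreement with the growing white line described in the caption.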

Figure 2—figure supplement 3

Distribution of normalized Hamming distances.

$x = n / l$ , for the largest $V J l$ class for each CDR3 length $l = 81$ (black). We fit the distribution by a mixture of positive pairs ( $P_{T} (x | μ)$ , in blue) and negative pairs ( $P_{N} (x)$ , in red). For $l = 18$ the estimate $\hat{μ}$ is too large, which results in a large fitting error; for the sensitivity computation we used the global $μ = 4 %$ (in green).

Figure 2—figure supplement 4

Distribution of post-selection probabilities.

$P_{post}$ of CDR3 nucleotide sequences computed using soNNia across CDR3 lengths. Short junctions are on average more likely to be generated in VDJ recombination and to pass subsequent selection (Isacchini et al., 2021). This makes inference in low- $l$ classes more difficult, a feature reflected by the synthetic dataset constructed by sampling unmutated lineage progenitors from the soNNia model.

Figure 2—figure supplement 5

Site frequency spectra estimated for families identified using the high-precision CDR3-based inference method (HILARy-CDR3) in the subset of the data where this approach is highly reliable (large- $l$ and large- $\hat{ρ}$ regime).

The distributions are shown for families of varying family size, $z \in [10, 100]$ , and averaged over all families of a given size. Together with the exact configuration of sequences carrying a given substitution, synthetic datasets with the same signatures of mutations and clonal expansions can be generated.

Figure 2—figure supplement 6

Distribution of normalized Hamming distances $x = n / l$ , for $l$ classes, averaging over all $V J l$ classes.

We fit the distribution by a mixture of positive pairs using a geometric distribution ( $P_{T} (x | μ)$ , in blue) and negative pairs ( $P_{N} (x)$ , in red). The corresponding prevalence estimates $\hat{ρ}$ are used for small $V J l$ classes for which this parameter cannot be reliably estimated independently.

Figure 2—figure supplement 7

Prevalence and $V J l$ class size.

Dependence of prevalence estimates on $V J l$ class size $N$ for the largest classes in donor 326651 from Briney et al., 2019. 28% of the variation in prevalence estimates can be explained by variation in $V J l$ class sizes.

Figure 3 with 2 supplements

Full inference method with mutational information.

(A) For a pair of sequences, $n_{1}, n_{2}$ denote the numbers of mutations along the templated region (V and J), and $n_{0}$ is the number of shared mutations. For related sequences, $n_{0}$ corresponds to mutations on the initial branch of the tree, and is expected to be larger than for unrelated sequences, where $n_{0}$ corresponds to coincidental mutations. (B) Positive and negative pairs are called mutated if both sequences have mutations $n_{1}, n_{2} > 0$ . Among positive pairs in the synthetic datasets, more than 99% are mutated. (**C, D**) Distributions of the rescaled variables $x^{'}$ and $y$ (Equation 4), for pairs of synthetic sequences belonging to the same lineage (positive pairs) and sequences belonging to different lineages (negative pairs). The separatrix $x^{'} - y = t^{'}$ marks a high-precision (99%) threshold choice. (E) To limit the number of pairwise comparisons we make use of high-precision and high-sensitivity complementarity determining region 3 (CDR3)-based partitions. High precision corresponds to the choice $t = t^{*}$ . High sensitivity corresponds to a coarser partition where $t$ is set to achieve 90% sensitivity. When the two partitions disagree, mutational information can be used to break the coarse, high-sensitivity partition into smaller clonal families. (**F, G**) Mutations-based methods achieve high sensitivity across all CDR3 lengths $l$ in the synthetic dataset (G), extending the range of applicability with respect to the CDR3-based method (F).
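
The quantities $n_{1}$ , $n_{2}$ , and $n_{0}$ can be illustrated with a short sketch. It assumes each sequence's templated-region mutations are available as a set of (position, nucleotide) tuples relative to the germline; this representation and the toy values are assumptions for the example, not the paper's data format.

```python
def mutation_counts(muts_a, muts_b):
    """Return (n1, n2, n0): mutations in each sequence and shared ones."""
    n1 = len(muts_a)
    n2 = len(muts_b)
    n0 = len(muts_a & muts_b)  # same position and same substituted base
    return n1, n2, n0

def is_mutated_pair(muts_a, muts_b):
    """A pair is called mutated when both sequences carry mutations."""
    return len(muts_a) > 0 and len(muts_b) > 0

# Toy mutation sets (hypothetical).
seq_a = {(12, "A"), (57, "T"), (101, "G")}
seq_b = {(12, "A"), (57, "T"), (230, "C")}
seq_c = set()

counts = mutation_counts(seq_a, seq_b)
```

Here the pair `(seq_a, seq_b)` shares two of its three mutations, the pattern expected when substitutions were acquired on a common initial branch; a pair involving the unmutated `seq_c` carries no such information.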

Figure 3—figure supplement 1

Merging partitions.

Red circles represent clusters from the coarse (high-sensitivity) partition, while green clusters represent the fine (high-precision) partition. When the two partitions differ, HILARy-full merges precise clusters inside each sensitive cluster whenever there exists a pair of positive sequences linking them.
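
The merging rule can be sketched with a union-find structure. This is an illustrative reading of the caption, not HILARy's code: `fine` and `coarse` map each sequence ID to its cluster label in the high-precision and high-sensitivity partitions, and `positive_pairs` is a hypothetical list of sequence pairs classified as positive using mutational information.

```python
def merge_partitions(fine, coarse, positive_pairs):
    """Merge fine clusters that share a coarse cluster and are linked
    by at least one positive pair. Returns seq_id -> merged label."""
    parent = {label: label for label in set(fine.values())}

    def find(label):
        while parent[label] != label:
            parent[label] = parent[parent[label]]  # path halving
            label = parent[label]
        return label

    for a, b in positive_pairs:
        if coarse[a] != coarse[b]:
            continue  # never merge across coarse clusters
        root_a, root_b = find(fine[a]), find(fine[b])
        if root_a != root_b:
            parent[root_b] = root_a

    return {seq: find(label) for seq, label in fine.items()}

# Toy example (hypothetical labels).
fine = {"s1": "f1", "s2": "f1", "s3": "f2", "s4": "f3", "s5": "f4"}
coarse = {"s1": "c1", "s2": "c1", "s3": "c1", "s4": "c1", "s5": "c2"}
positive_pairs = [("s2", "s3"), ("s4", "s5")]
merged = merge_partitions(fine, coarse, positive_pairs)
```

In the toy example, `f1` and `f2` are merged through the pair `("s2", "s3")`, while the pair `("s4", "s5")` is ignored because it spans two coarse clusters.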

Figure 3—figure supplement 2

Error vs $V J l$ class size.

We plot the fitting error of *P(x)* by the mixture model, for each $V J l$ class in the synthetic dataset, as a function of their sizes. The error is computed as the squared difference between the model and data distributions of distances.

Figure 4 with 2 supplements

Benchmark of the alternative methods on synthetic heavy-chain repertoires.

(A) Comparison of inference time using subsamples from the largest $V J l$ class found in donor 326651 from Briney et al., 2019. Comparisons were done on a computer with 14 double-threaded 2.60 GHz CPUs (28 threads in total) and 62.7 Gb of RAM. (B) Clustering precision $π_{post}$ (post single-linkage clustering of positive pairs), (C) sensitivity $s_{post}$ , and (D) variation of information $v$ as a function of complementarity determining region 3 (CDR3) length $l$ in the realistic synthetic dataset generated for this study. Solid lines represent the mean value averaged over five synthetic datasets. (**E–G**) Same as (**B–D**) but for the synthetic dataset from Ralph and Matsen, 2022, designed for the development and testing of the partis software. The solid lines represent the mean over the three datasets.
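
For reference, the variation of information between two partitions $A$ and $B$ of the same set is $H(A) + H(B) - 2 I(A; B)$ , which equals $2 H(A, B) - H(A) - H(B)$ . A minimal sketch follows; the use of natural logarithms and the absence of any normalization are assumptions, since the caption does not specify them.

```python
from collections import Counter
from math import log

def entropy(counts, total):
    """Shannon entropy (in nats) of a distribution given by raw counts."""
    return -sum(c / total * log(c / total) for c in counts)

def variation_of_information(labels_a, labels_b):
    """VI between two partitions given as equal-length label lists."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same items")
    n = len(labels_a)
    h_a = entropy(Counter(labels_a).values(), n)
    h_b = entropy(Counter(labels_b).values(), n)
    h_ab = entropy(Counter(zip(labels_a, labels_b)).values(), n)
    return 2 * h_ab - h_a - h_b

truth = [0, 0, 1, 1]
same = ["x", "x", "y", "y"]   # same partition, different label names
split = [0, 1, 2, 3]          # every item in its own cluster
```

The measure is zero when the inferred partition matches the true one up to relabeling, and grows as clusters are wrongly split or merged, which makes it a convenient single-number summary alongside precision and sensitivity.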

Figure 4—figure supplement 1

Performance of spectral SCOPer using V gene mutations.

(A) Comparison of inference time using subsamples from the largest $V J l$ class found in donor 326651 from Briney et al., 2019. Comparisons were done on a computer with 14 double-threaded 2.60 GHz CPUs (28 threads in total) and 62.7 Gb of RAM. (B) Clustering precision $π_{post}$ (post single-linkage clustering of positive pairs), (C) sensitivity $s_{post}$ , and (D) variation of information $v$ as a function of CDR3 length $l$ in the realistic synthetic dataset generated for this study. Solid lines represent the mean value averaged over five synthetic datasets.

Figure 4—figure supplement 2

Performance of single-linkage clustering with fixed threshold.

We call this method VJCDR3-sim, where sim is the threshold on the normalized similarity between two CDR3s, equal to $1 - x$ , where $x$ is our normalized Hamming distance. (A) Clustering precision $π_{post}$ (post single-linkage clustering of positive pairs), (B) sensitivity $s_{post}$ , and (C) variation of information $v$ as a function of CDR3 length $l$ in the realistic synthetic dataset generated for this study. Solid lines represent the mean value averaged over five synthetic datasets. (**D–F**) Same as (**A–C**), using the synthetic data from Ralph and Matsen, 2022, across mutation rates.
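
A fixed-threshold single-linkage clustering of this kind can be sketched as follows, for CDR3s within one $V J l$ class (equal lengths). The quadratic loop and the function names are illustrative assumptions, not the benchmarked implementation.

```python
def normalized_hamming(a, b):
    """Normalized Hamming distance x = n / l between equal-length strings."""
    if len(a) != len(b):
        raise ValueError("CDR3s must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b)) / len(a)

def single_linkage(cdr3s, sim):
    """Cluster indices of cdr3s: link pairs whose similarity 1 - x >= sim."""
    parent = list(range(len(cdr3s)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(cdr3s)):
        for j in range(i + 1, len(cdr3s)):
            if 1 - normalized_hamming(cdr3s[i], cdr3s[j]) >= sim:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(cdr3s))]

# Toy CDR3s (hypothetical).
cdr3s = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAACC", "GGGGGGGGGG"]
labels = single_linkage(cdr3s, sim=0.85)
```

The first and third toy sequences have similarity 0.8, below the threshold, yet end up in the same cluster because both are linked to the second. This chaining is inherent to single linkage.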

Figure 5

Performance of HILARy as a function of mutation rate for heavy chains, on synthetic data from Ralph and Matsen, 2022, designed for the development and testing of the partis software.

(A) Clustering precision $π_{post}$ (post single-linkage clustering of positive pairs), (B) sensitivity $s_{post}$ , and (C) variation of information $v$ as a function of mutation rate, using the heavy chain only. Solid lines represent the mean value averaged over the three datasets.

Figure 6

Benchmark of HILARy with paired light and heavy chains.

(A) Clustering precision $π_{post}$ (post single-linkage clustering of positive pairs), (B) sensitivity $s_{post}$ , and (C) variation of information $v$ as a function of complementarity determining region 3 (CDR3) length $l$ , on the synthetic datasets from Ralph and Matsen, 2022, designed for the development and testing of the partis software. (**D–F**): Same as (**A–C**) but as a function of mutation rate.

Figure 7

Inference results across complementarity determining region 3 (CDR3) lengths.

Inference results for donor 326651 of Briney et al., 2019, are presented for nine quantiles of the CDR3 distribution, each containing between 8% and 12% of the total number of sequences (corresponding to nine colors in the inset of A). (A) Distributions of family size $z$ . All CDR3 length quantiles exhibit universal power-law scaling with exponent −2.3. (B) Site frequency spectra estimated for families of sizes $z = 100$ . Families of larger sizes were subsampled to $z = 100$ to subtract the influence of varying family sizes. (C) Distribution of lineage $d_{N} / d_{S}$ ratios computed for polymorphisms in CDR3 regions over all lineages within each of the nine quantiles.

Author response image 1

Author response image 2

Tables

Table 1

Summary of notations used throughout the paper.

Hats ˆ denote estimates from the fit of the mixture model. Stars ∗ denote estimates after imposing 99% precision. The ‘post’ subscript denotes quantities after applying single-linkage clustering to obtain a partition from positive pairs.

Symbol	Definition
$ρ$	Prevalence/fraction of positive pairs
$π$	Precision = TP/(TP+FP)
$s$	Sensitivity = TP/(TP+FN)
$p$	Fallout = FP/(FP+TN)
$t$	Threshold on CDR3 distance
$l$	CDR3 length
$n$	CDR3 Hamming distance of a pair
$x$	Normalized CDR3 Hamming distance $= n / l$
$x^{'}$	CDR3 Hamming distance, centered and scaled
$y^{'}$	Shared mutations on V segment, centered and scaled
$μ$	Mean $x$ between positive pairs
$P_{T}$	Model distribution for positive pairs
$P_{F}$	Model distribution for negative pairs

Additional files

MDAR checklist: https://cdn.elifesciences.org/articles/86181/elife-86181-mdarchecklist1-v2.pdf

Cite this article

Natanael Spisak
Gabriel Athènes
Thomas Dupic
Thierry Mora
Aleksandra M Walczak

(2024)

Combining mutation and recombination statistics to infer clonal families in antibody repertoires

eLife 13:e86181.

https://doi.org/10.7554/eLife.86181
