Large-scale network analysis reveals the sequence space architecture of antibody repertoires | Nature Communications

Mouse dataset

The dataset analyzed was produced as described by Greiff et al.21. Briefly, murine B-cell populations of pre-B cells (pBC, IgM, bone marrow, ≈7.5 × 105 cells/mouse, c-kit–CD19+IgM–CD25+PI–), naïve follicular B cells (nBC, IgM, spleen, ≈1 × 106 cells/mouse, CD138–CD19+IgD2+IgM+CD232+CD21+PI–), and memory plasma cells (PC, IgG, bone marrow, ≈9 × 104 cells/mouse, CD138+CD22–MHCII–CD19–IgM–PI–) were sorted using fluorescence-activated cell sorting (FACS) from C57BL/6 J mice unimmunized (n = 5) or prime-boost immunized with alum-precipitated antigens: nitrophenylacetyl-conjugated hen egg lysozyme (NP-HEL, n = 5), ovalbumin (OVA, n = 5) or Hepatitis B virus surface antigen (HBsAg, n = 4). Following total RNA extraction, full-length antibody variable heavy chain (VDJ) libraries were generated by a two-step PCR process, as described previously54. Libraries were sequenced using the Illumina MiSeq (2 × 300 bp) platform. Mean Phred-scores of raw data were ≥30. Approximate paired-end reads (full-length VDJ) were: pBC ≈ 5 × 106 reads (untreated n = 1,666,407, NP-HEL n = 2,306,769, OVA n = 2,337,876 and HBsAg n = 2 330 505 sequencing reads), nBC ≈ 10 × 106 reads (untreated n = 6 487 616, NP-HEL n = 4 157 887, OVA n = 4 245 486 and HBsAg n = 6,076,876 sequencing reads) and PC ≈ 4 × 106 reads (untreated n = 188 440, NP-HEL n = 125 118, OVA n = 194,003, and HBsAg n = 121,382 sequencing reads)21. The experimental design of the study minimized technological (sequence undersampling) and biological undersampling (cell undersampling) as explained in depth in a previous publication21.

Data preprocessing and CDR3 clonal analysis

Antibody read sequences have been preprocessed and VDJ annotated with MiXCR55 and further filtered to retain only those sequences that had CDR3 length ≥4 a.a. and occurred more than once in each CDR3 repertoire data set (Supplementary Fig. 1a). Clones were defined by 100% a.a. sequence identity of CDR3 regions. CDR3 regions were defined by MiXCR according to the nomenclature of the Immunogenetics Database (IMGT)56. Unique mean CDR3 a.a. clones analyzed for pBC cohorts were untreated n = 152,859, NP-HEL n = 185,128, OVA n = 188,971 and HBsAg n = 159,546; nBC untreated n = 424 940, NP-HEL n = 395 048, OVA n = 330 466 and HBsAg n = 440 834; PC untreated n = 143, NP-HEL n = 156, OVA n = 154 and HBsAg n = 132.

Human dataset

Sequencing data of naïve and memory B cells from three healthy human donors were published by DeWitt et al.20 and downloaded already preprocessed from http://datadryad.org/resource/doi:10.5061/dryad.35ks2. The dataset contains 2–4 × 107 naïve B-cells and 1.5–2 × 107 memory B-cells for each donor. Unique CDR3 a.a. clones analyzed were D1-M n = 2,305,669, D2-M n = 1,836,019, D3-M n = 3,127,059 for memory B cells and D1-Na n = 6 187 146, D1-Nb n = 5,716,124, D2-N n = 4,408,661, and D3-N n = 6,348,502 for naïve B cells (Supplementary Table 3). After alignment and preprocessing, we constructed large-scale networks from unique CDR3 of ≈6 million nodes.

Network construction

To construct networks (graphs), a sparse triangle matrix of pairwise Levenshtein distances (LD) between CDR3s must first be computed. For small samples (up to 100,000 unique CDR3 sequences) such a calculation is relatively fast on a single computer. However, due to the N2 complexity of required calculations, computing the pairwise matrix for samples of >100,000 unique CDR3 sequences becomes prohibitively expensive. To perform these computations, we developed software that utilizes the Apache Spark (2) distributed computing framework to partition the work among a cluster of many machines (Supplementary Fig. 1b). We chose specifically Apache Spark because (i) its deployment is very flexible with regard to underlying computing infrastructure and (ii) for similarity layers LD>1, the networks become extremely large and difficult to process. When two sequences were similar within a defined threshold (Levenshtein distance, LD = 1–12), they were connected in the repertoire network (i.e., similarities of 1 a.a. differences were captured in similarity layer 1, LD1, 2 a.a. in LD2 and so on). In these cases, our package can take advantage of the Spark Graph Frames distributed graph library57, which allows scaling to even larger samples with millions of sequences (Supplementary Fig. 1c). With this approach we were able to compute the distance matrices for large samples (>100,000 unique CDR3 sequences) within minutes (Supplementary Fig. 1b. c).

In addition to the computational complexity inherent in creating the distance matrix, the construction of networks for large LD is computationally expensive. We therefore avoided constructing networks altogether for calculating the node degrees and instead used a map-reduce distributed algorithm. For practical purposes, the construction of small networks was performed using the Networkx library58,59. For generating and outputting the largest graphs to disk in common network formats, we used the efficient graph-tool library (https://graph-tool.skewed.de/). For manipulating and analyzing the largest networks, our software package took advantage of the Spark GraphFrames distributed graph library as mentioned above57.

The software was developed in python (https://www.python.org/) using the Numpy/Scipy60 scientific libraries for matrix and array manipulation and Apache Spark17 as the distributed backend. Our software package for antibody repertoires imNet is available (https://github.com/rokroskar/imnet) and includes tutorials and demos, including scripts to set up the distributed computation environment on commonly-used compute cluster infrastructure The results shown in this work were obtained using 1–625 cores of the Euler parallel-computing cluster operated by ETH Zürich. In addition, imNet is a python library and can be used locally to work with both python 2 and 3.

Degree distribution fits

Degrees (number of similar CDR3 sequences to a specific CDR3 sequence) were calculated for each of the similarity layers LD1–12 for each CDR3 sequence in each sample. CDR3 with zero degrees that were not similar to any other CDR3 in the network were excluded in order to fit degree distributions. The power-law, exponential and Poisson distributions were fitted to the empirical degree distributions of the networks, constructed as described in Network construction, by estimating xmin (estimated lower degree threshold by minimizing the Kolmogorov-Smirnoff statistic61) and optimizing model parameters using the poweRlaw62 package. We first discriminated if the power-law distribution could describe the best fit to the degree distribution by bootstrapping 100 times the power-law p-value obtained from each sample after estimating xmin. Following the approach described by Virkar and Clauset63, a p-value ≥ 0.1 indicated that the power-law distribution described the degree distribution (Supplementary Fig. 1a). To determine the degree distribution in cases where the power law was not the best distribution fit (p-value < 0.1), we compared the exponential and the Poisson fits. Two-sided p-value ≈ 0 indicated that the fitted models could be discriminated, and one-sided p-value ≈ 1 (Wilcoxon test) indicated that the first (for example exponential) model was the best fit for the data62.

Robustness

Public clones were defined as clones shared among at least two subjects in a cohort (Supplementary Fig. 2). In order to assess the robustness of the architecture of antibody repertoire networks, we removed public clones from each sample. As controls, we performed repeated removal (20 times) of randomly selected clones in the size of public clones. The p-values (Wilcoxon test) for the power-law fit were calculated after 100x bootstrapping for each repertoire; one-sided and two-sided p-values were used for the comparison between the exponential and the Poisson fits (see Degree distribution fits).

Network analysis

Drawing from network theory64, we translated the concepts of network analysis23 to antibody repertoires. An antibody repertoire network is an undirected graph G = (V, E) described as a set of nodes (CDR3 vertices, V) together with a set of connections (similarity edges, E), representing the adjacency matrix A of pairwise Levenshtein distances (LD) between CDR3 a.a.

$${\mathrm{Sequences}}\,{\mathrm{A}} = \left( {\begin{array}{*{20}{c}} 0 & \cdots & {{\mathrm{LD}}_{1{\mathrm{n}}}} \\ \vdots & \ddots & \vdots \\ {{\mathrm{LD}}_{{\mathrm{n}}1}} & \cdots & {{\mathrm{LD}}_{{\mathrm{nn}}}} \end{array}} \right).$$

In the context of antibody repertoires, we let N = |V| and L = |E|. The order of a graph N represents the number of its unique CDR3 clones (nodes). The size of a graph L is the number of its CDR3 similarity connections (edges). The degree k, that represents the edges connected to a node, describes the count of all similar CDR3 clones to a CDR3 based on LD. Because the degree indicates how active a node is, it could be interpreted as a measure of how central a CDR3 clone is in the antibody repertoire network. In simpler terms, it quantifies the number of CDR3 clones that are similar to a certain CDR3, and thus the potential development or the evolutionary routes to this CDR3.

The average degree \(\left\langle k \right\rangle \equiv \frac{{\mathop {\sum }\nolimits_{{\mathrm{i}} = 0}^{\mathrm{n}} {\mathrm{k}}_{\mathrm{i}}}}{{\mathrm{N}}} = \frac{{2{\mathrm{L}}}}{{\mathrm{N}}}\) is the average number of similar CDR3 clones. The degree distribution P(k) = Nk/N, defined as the fraction of nodes with degree k (Nk) in total nodes, represents the fraction of CDR3 clones that have the same number of similar CDR3s. The cumulative degree distribution \({\mathrm{P}}_{\mathrm{k}} = \mathop {\sum}\nolimits_{{\mathrm{k}}\prime = {\mathrm{k}}}^\infty {{\mathrm{p}}_{{\mathrm{k}}\prime }}\) describes the fraction of nodes with degree greater than or equal to k‘. In Erdős–Rényi (ER) random graph models, degrees follow a Poisson distribution \({\mathrm{P}}\left( {\mathrm{k}} \right) = \frac{{\langle {\mathrm{k}}\rangle ^{\mathrm{k}}{\mathrm{e}}^{ – \langle {\mathrm{k}}\rangle }}}{{{\mathrm{k}}!}}\) in the limit of large numbers of nodes, while degree distributions have an exponential tail P(k) ~ e−αk in exponential networks65.

Global characterization23 described the network as a whole, such as degree distribution, centralization, largest component, diameter, clustering coefficient, assortativity and coreness. The centralization analysis indicates if the network is homogeneous (clones are connected in the same way) or is centered around certain nodes (highly connected clonal regions compared to less connected regions in the same network). The largest component is the largest cluster of connected CDR3 clones. The diameter (d) is the maximum distance (shortest path between two nodes) between any pair of CDR3 sequences. The clustering coefficient (C) represents the probability that neighbors of a node are also connected, which translates in antibody repertoires as the probability that CDR3 clones similar to a specific CDR3 are also similar among one another. Network density (D) is the ratio of the number of edges (CDR3 similarities) and the number of all possible edges in the network. The assortativity coefficient25 (r) indicates if nodes in a network connect to nodes with similar characteristics. It is positive if nodes tend to connect to nodes that are similar to them (i.e., highly connected CDR3 sequences are similar and connect to highly connected CDR3 sequences), and negative otherwise. Coreness is a measure of the network’s cohesion and allows one to understand the global network structure and is useful in comparing complex networks by analyzing the subsets of CDR3-cores that form layers in the antibody repertoire. K-core decomposition is a process that is performed by iteratively removing shells of all vertices of degree less than k (k < kmax) leaving the k-cores of a network (its connected component). The k-core of a graph is the maximal subgraph in which each node has at least degree k. We have computed the maximal k-core of antibody repertoire networks (the innermost core, kmax) and the core distribution along k degrees.

Clonal (local) characterization of antibody repertoires was performed by analyzing local properties of the networks23. The importance of CDR3 clones was measured by calculating the authority66, eigenvector27 and PageRank28 scores of each node in repertoire networks. In particular, the authority (a) of nodes is defined as the principal eigenvector of the transpose matrix t(A)*A, where A is the adjacency matrix of the network. Eigenvector centrality indicates the centrality of a CDR3 clone, not only dependent on the number of similar CDR3 (number of degrees, connections) but also on the quality of those connections: CDR3-nodes with high eigenvector values are connected to many other nodes which are, in turn, connected to many others (and so on). PageRank measures the importance of the similarity between two CDR3 clones within the network extending beyond the approximation of a CDR3 importance or quality. Closeness (centrality26) (c) was calculated to measure how many steps were required to access every other CDR3 from a given CDR3 clone in antibody repertoire networks. We calculated the normalized closeness by multiplying the raw closeness by n-1, where n was the number of nodes in the network. Clique analysis identified maximally-connected subgraphs (a subset of nodes) in which every CDR3 was similar to every other CDR3 sequence and the largest clique was the maximal completed subgraph which had more nodes than any other clique in the network. The node betweenness (b) is the number of geodesics (shortest paths) going through a node and indicates the “bridge” function of a CDR3 sequence. Network properties were calculated using the igraph67 R package.

Network properties

Units are numeric and dimensionless:

Network size is represented by the number of nodes (vertices) and/or number of edges (links, connections, degree).

Largest component size is the number of nodes in the largest component, calculated as the subgraph in which any two vertices are connected.

Diameter (numeric) is the largest number of vertices which must be traversed in order to travel from one vertex to another and is calculated by using a breadth-first search like method.

Assortativity coefficient25 r is a preference for a network’s nodes to attach to others that are similar in some way, e.g., the tendency of the nodes to connect with other nodes with similar degree values. The assortativity coefficient (r) is the Pearson correlation coefficient of the degrees at either ends of an edge and lies in the range −1 ≤ r ≤ 1: \(r = \frac{1}{{\sigma _q^2}}\mathop {\sum }\limits_{jk} jk(e_{jk} – q_jq_k)\) where ejk is the joint probability distribution of the remaining degrees of the two vertices at either end of a randomly chosen edge, symmetric in its indices on an undirected graph ejk = ekj obeying the rules (i) \(\mathop {\sum}\nolimits_{jk} {e_{jk} = 1}\) and (ii) \(\mathop {\sum}\nolimits_j {e_{jk} = q_k}\) (given that pk is the probability that a randomly chosen node on the graph will have degree k and qk is the normalized distribution of the remaining degree—the number of edges leaving the node other than the one selected \(q_k = \frac{{(k + 1)_{p_{k + 1}}}}{{\mathop {\sum }\nolimits_j pj_j}}\)). \(\sigma _q^2\) is the variance (standard deviation) of the distribution qk and it is useful when comparing networks in order to normalize.

Clusters are connected components of a graph. The cluster size is the number of connected nodes in a cluster. The cluster number is the number of clusters in a graph.

Clustering coefficient/transitivity28,67 is the the ratio of the triangles and the connected triples in an undirected graph. Let ej(j) denote the number of edges that connect the immediate neighbors of a node j and let kj denote the node degree of j, that is, its number of immediate neighbors, the clustering coefficient is \(C_j = \frac{{2 \ast e_j(j)}}{{k_j(k_j – 1)}}\). The clustering coefficient for the whole graph is the average of the local values: \(C = \frac{1}{n}\mathop {\sum}\nolimits_{j = 1}^n {C_j}\)

Density is the ratio of the number of edges and the number of possible edges.

Centralization is the network centrality indices which characterize each vertex/edge with respect to their position within the network.

Average degree is the average number of connected vertices.

Neighborhood of a vertex v is the number of vertices adjacent to v, the subgraph induced by all vertices adjacent to v: \(N\left( G \right) = \mathop {\bigcup}\nolimits_{v \in G} {N(v)}\)

Centrality measures the influence of a node in a network:

Eigenvector centrality score is the values of the first eigenvector of the graph adjacency matrix; the score is the result of a process in which the centrality of each vertex is proportional to the sum of the centralities of those vertices to which it is connected. In general, vertices with high eigenvector centralities are the ones connected to many other vertices which are, in turn, connected to many others and so on67: \(x_v = \frac{1}{\lambda }\mathop {\sum }\limits_{t \in M(v)} x_t\) where M(v) is a set of the neighbors of and λ is a constant.

Authority is the centrality of each vertex proportional to the sum of the centralities of those connected to it.67 a = t(A)*A where A is the adjacency of the graph.

PageRank is a technique that identifies important nodes based on the link structure of the graph. Every node of the graph (v) is represented by a numerical score between 0 and 1, known as its PageRank28, π(v), which depends on the structure of the graph, i.e., the probability to reach any node from a given node, and on the value of α that expresses the teleport operation probability to jump from a node to any other node in the graph (fixed parameter chosen in advance). PageRank is the principal left eigenvector of the transition probability matrix P = NxN, characterizing a Markov chain of states, where Pij is the probability that the state at the next time-step is j, conditioned on the current state i. The left eigenvectors of the transition probability matrix P are N-vectors \(\vec \pi\) such that \(\vec \pi P = \lambda \vec \pi\). The \(N\) entries in the principal eigenvector \(\vec \pi\) are the PageRank values for the corresponding nodes.

Closeness centrality of a vertex measures how easily other vertices can be reached from it. It is defined as \(C\left( v \right) = \frac{1}{{\mathop {\sum }\nolimits_w d(v,w)}}\) where d(v, w) is the distance between vertices v and w.

Betweenness centrality for each vertex is the number of the shortest paths that pass through it defined as \(B\left( v \right) = \mathop {\sum}\nolimits_{s \ne v \ne t} {\frac{{\sigma _{st}(v)}}{{\sigma _{st}}}}\) where σst is the total number of shortest paths from node s to node t and σst(v) is the number of those paths that pass through v.

Quantifying the predictive performance (Q
2) of linear regression models

The predictive performance (Q2) of each linear regression model (Y =  + ε) was calculated using leave-one-out cross-validation (LOOCV): \(Q^2 = \left( {1 – \frac{{{\mathrm{PRESS}}}}{{{\mathrm{TSS}}}}} \right) \cdot 100\), where PRESS is the predictive error sum of squares \(\left( {\mathop {\sum}\nolimits_{j = 1}^n {\left( {Y_j – \hat Y_{[j]}} \right)^2} } \right.\) with \(\hat Y_{[j]}\) denoting the prediction of the model when the j-th case is deleted from the training set and TSS is the total sum of squares \(\left( {\mathop {\sum}\nolimits_{i = 1}^n {\left( {Y_j – \bar Y} \right)^2} } \right)\) (Greiff et al., 2012). X and Y are CDR3 degree vectors of repertoires at each LD1–12. LOOCV was performed using the forecast R package68. Cross-validation was used because, in contrast to regular regression analysis, it enables the quantification of the predictive performance of each regression model.

Simulated networks

Networks (nodes V = 102–105) were simulated with the ER, exponential and power-law models using base R69 and igraph67. Random networks were simulated according to the ER model, exponential networks were simulated setting a probability of a connection between two nodes p = 0.5 and scale-free networks were simulated using the Barabási-Albert model (Barabási and Albert, 1999).

Graphics

Graphic representations were produced using base R69 and the ggplot2 R package70. Heatmaps were produced using the NMF package71. Networks and network clusters visualization were performed using igraph67 employing the Fruchterman–Reingold force-directed and Kamada–Kawai layout algorithms. Large-scale networks (Fig. 1a) were visualized using Gephi (version 0.9.1)72; node size was scaled 10–100 proportional to the degree of a node and a blue to gray color gradient was applied to nodes from high to low degrees.

Statistical significance

Statistical significance was tested using the Wilcoxon rank-sum test. Results were considered significant for p < 0.05.

Reporting summary

Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.