Large-scale network analysis captures biological features of bacterial plasmids | Nature Communications

A dataset of complete bacterial plasmids

A dataset of complete bacterial plasmids was assembled comprising 10,696 sequences found in bacteria from 22 phyla and over 400 genera (Supplementary Data 1, Fig. 1a and Supplementary Fig. 1). The composition of plasmid hosts reflects current research interests, with the Proteobacteria and Firmicutes phyla together representing over 84% of plasmid sequences. The dataset includes plasmids from a diversity of bacterial hosts, with 66 plasmids from unknown bacterial families, 14 from uncultured bacteria and 37 samples from candidatus species (Supplementary Data 1). In total, 510,463 different coding sequences (CDSs) were identified in the plasmid dataset. In all, 66.01% of the CDSs were predicted to encode a hypothetical protein, 27.9% had a known product with Gene Ontology (GO) biological process annotation, with the remaining 6.09% encoding a known protein product with unknown biological function (Fig. 1b). There are 3,328,916 bacterial genes available in the RefSeq database (NCBI Gene Statistics accessed on 19 June 2019), meaning that roughly 15% (about 1 in 7) of the currently known bacterial genes are plasmid borne. The GO biological processes associated with plasmid CDSs are diverse. After accounting for multiple occurrences of annotated CDSs in the dataset, the dominant associated GO terms relate to catabolic and biosynthetic processes (20.64% relative to total number of annotated CDSs), transposon mobility (17.09%) and positive and negative regulation of transcription (7.70%). Replicon-based typing classified 27.66% of the plasmids into 163 different replicon types (Fig. 1c and Supplementary Fig. 2). However, 31.67% of these classified plasmids were assigned to multiple replicon types. MOB typing was more comprehensive, successfully classifying 32.63% of the plasmids into six MOB types, of which 9.48% were assigned to multiple types (Fig. 1c). Unsurprisingly, classification by these two methods performed best for well-studied plasmids of the phyla Proteobacteria and Firmicutes.

Fig. 1: Summary of the dataset of complete bacterial plasmids.

a The distribution of host phylum represented in the plasmid dataset. b Functional annotation of plasmid-borne genes. The pie chart shows the proportion of unique CDSs with hypothetical function as predicted by Prokka48, and CDSs (genes) with known/unknown biological function based on GO annotation. The bar chart provides the most common biological functions associated with plasmid-borne genes also considering the respective frequency of these genes on plasmid genomes. c The percentage of plasmids covered by the three classification methods: replicon and MOB-typing schemes, and clique assignment. d The distribution of pairwise plasmid similarities (Jaccard index).

Uncovering the population structure of plasmids

We constructed a network based on the plasmid pairwise sequence similarities. This represents a weighted, undirected network with plasmids (vertices) connected by edges indicating similarity (Supplementary Fig. 3). Similarity was scored using the exact Jaccard index (JI), defined as the size of the intersection divided by the size of the union of two sets of k-mers. Plasmid pairs that shared <100 k-mers were considered to have a JI equal to zero. This cut-off value was implemented since the majority of CDSs found on plasmids have lengths >100 bp, thus only a fraction of the functional genome is common between plasmids with low shared k-mer count (Supplementary Figs. 4 and 5). The majority of plasmid pairs shared little to no similarity (Fig. 1d). In all, 6.14% (657) of the plasmids were singletons, while 3.31% (354) were connected to only one other plasmid, illustrating the high levels of diversity across bacterial plasmid genomes. Plasmids with more k-mers in common are expected to be more likely to share the same functional genetic elements and hence to participate in similar biological processes within the same host niche (Supplementary Fig. 5). Such plasmids are presumed to form cliques within the network with higher internal JI score. The objective is then to identify cliques that contain plasmids with markedly higher similarity between themselves, relative to their immediate network neighbourhood.
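
The similarity score described above can be illustrated with a minimal Python sketch. The k-mer length (`K = 21`) and the function names are assumptions made for illustration only; this section does not state the k-mer size or the software used, and the sketch ignores reverse complements and plasmid circularity.

```python
from typing import Set

K = 21            # assumed k-mer length (not stated in this section)
MIN_SHARED = 100  # pairs sharing fewer k-mers are given JI = 0


def kmers(seq: str, k: int = K) -> Set[str]:
    """Return the set of all substrings of length k in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str], min_shared: int = MIN_SHARED) -> float:
    """Exact Jaccard index |A n B| / |A u B| with a shared-k-mer cut-off."""
    shared = len(a & b)
    union = len(a | b)
    if shared < min_shared or union == 0:
        return 0.0
    return shared / union
```

Calling `jaccard(kmers(seq_a), kmers(seq_b))` for every plasmid pair would give the edge weights of the weighted, undirected network; pairs below the cut-off simply receive no edge.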

Listing all cliques of our large plasmid network and assessing their internal similarity is computationally intractable with current tools29. A solution for a single clique can be quickly verified, but the time required to process all possible cliques scales rapidly as the size of the network increases. As an alternative solution, a stochastic community detection algorithm OSLOM (Ordered Statistics Local Optimization Method) was implemented30. OSLOM detects communities (i.e. densely interconnected subgraphs) with statistical significance, meaning that they have a low probability of being encountered by chance in a random network with similar features to the plasmid network. OSLOM is well suited for this task since it can be used to analyse undirected networks with overlapping communities. In addition, OSLOM shows similar performance to other widely used methods such as Infomap or Louvain30,31, which, unlike OSLOM, were unable to analyse this dataset due to computational limitations. To validate the results from the stochastic clique assignment, all communities of size three or more detected by OSLOM were assessed for their completeness (i.e. whether they form cliques) against the original plasmid network (Supplementary Fig. 3).
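
The validation step, checking whether a detected community is complete in the original network, can be sketched as follows. The edge representation (a dictionary keyed by unordered plasmid pairs) and the function name are hypothetical choices for illustration, not the authors' implementation.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable

# Hypothetical edge store: {frozenset({plasmid_u, plasmid_v}): JI}
Edges = Dict[FrozenSet[str], float]


def is_clique(members: Iterable[str], edges: Edges) -> bool:
    """True if every pair of members is joined by an edge with JI > 0."""
    unique = sorted(set(members))
    return all(
        edges.get(frozenset(pair), 0.0) > 0.0
        for pair in combinations(unique, 2)
    )
```

Verifying one candidate requires only one lookup per member pair, which is why a single clique can be checked quickly even though enumerating all cliques is intractable.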

Despite the notable dissimilarity among plasmids, the original network was too dense (network density = 0.0438) to yield a consistent performance for every OSLOM run (Fig. 2 and Supplementary Figs. 3 and 6). Furthermore, a large proportion of communities detected did not form cliques and would have to be disregarded (Fig. 2a). A JI threshold was introduced to increase the sparsity of the network and to upweight more similar plasmids, thus optimizing the performance of OSLOM. A range of thresholds were assessed based on the following criteria: (i) the clique to community ratio (Fig. 2a), (ii) the proportion of plasmids assigned to cliques (Fig. 2b), (iii) the congruence with replicon-based typing (Fig. 2c) and (iv) the consistency of OSLOM performance (Fig. 2 and Supplementary Fig. 6). The optimum threshold was consistently obtained at a JI of 0.3. The resulting sparse network is shown in Fig. 3 (network density = 0.00128).
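
As a rough illustration of the sparsification step, the sketch below computes the density of an undirected network and removes edges below a JI threshold. The data structure and function names are assumptions for illustration; the density formula is the standard one for simple undirected graphs.

```python
from typing import Dict, FrozenSet

# Hypothetical edge store: {frozenset({plasmid_u, plasmid_v}): JI}
Edges = Dict[FrozenSet[str], float]


def density(n_nodes: int, n_edges: int) -> float:
    """Density of a simple undirected graph: 2E / (N * (N - 1))."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


def apply_threshold(edges: Edges, threshold: float) -> Edges:
    """Keep only edges whose JI is at or above the threshold."""
    return {pair: ji for pair, ji in edges.items() if ji >= threshold}
```

Re-running community detection on `apply_threshold(edges, t)` for a range of `t` values, and recording the criteria listed above for each, would reproduce the general shape of the optimization.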

Fig. 2: Optimization of OSLOM performance.

A range of Jaccard index (JI) thresholds were applied to the original plasmid network (Supplementary Fig. 3) with edges below a particular threshold being removed prior to OSLOM analysis. During the process, several criteria were considered: a clique to community ratio; b percentage of plasmids covered by the cliques; c the congruence with replicon typing measured by NMI score. NMI was calculated for all cliques containing plasmids assigned to a single or multiple replicon types (legend: All) and just to a single replicon type (legend: Single). Error bars (a, b) and light-coloured shading (c) provide ±2 standard deviations (SDs) of uncertainty. Standard deviation around every value on the y-axis across all JI thresholds assessed (points and bars) was calculated based on results of n = 5 iterations of the OSLOM software (see ‘Methods’). The dashed vertical line indicates the selected optimal JI threshold of 0.3.

Fig. 3: Sparse network of plasmids assigned to cliques by OSLOM algorithm (network density = 0.00128).

The network includes 5371 plasmids (nodes) assigned into 561 cliques (complete subgraphs). The completeness of identified cliques was evaluated based on the original network (Supplementary Fig. 3). 5008 unassigned plasmids, which formed disjoint singletons and pairs, were removed from the network. Coloured nodes indicate plasmids assigned to a single clique.

The OSLOM-guided clique detection algorithm offers flexibility and identifies cliques of plasmids with a wide range of internal similarity scores (Supplementary Fig. 7). We assessed the importance of considering pairwise JI distances as a continuous variable by reanalysing the dataset with the Bron–Kerbosch Max-clique algorithm32, implemented in the graph-tool Python library33. The Bron–Kerbosch algorithm is computationally highly effective, but the pairwise distances between plasmids are treated as binary values defined by the given threshold. Applied across a range of JI thresholds, the Max-clique approach systematically identifies a very large number of cliques (Supplementary Fig. 8A), assigns a large proportion of plasmids to multiple cliques (Supplementary Fig. 8B) and leads to a low correlation between resulting cliques and plasmid replicon types (Supplementary Fig. 8C).
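
To make the contrast concrete, here is a minimal pure-Python sketch of Bron–Kerbosch maximal-clique enumeration (with pivoting) on a binarized network. It is not the graph-tool implementation used in the study; the helper names and the edge format are assumptions for illustration.

```python
from typing import Dict, FrozenSet, Iterator, Optional, Set, Tuple

Graph = Dict[str, Set[str]]


def binarize(weighted: Dict[Tuple[str, str], float], threshold: float) -> Graph:
    """Turn JI-weighted edges into an unweighted adjacency map.

    Every plasmid is kept as a node; an edge survives only if JI >= threshold.
    """
    graph: Graph = {}
    for (u, v), ji in weighted.items():
        graph.setdefault(u, set())
        graph.setdefault(v, set())
        if u != v and ji >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def bron_kerbosch(
    graph: Graph,
    r: Optional[Set[str]] = None,
    p: Optional[Set[str]] = None,
    x: Optional[Set[str]] = None,
) -> Iterator[FrozenSet[str]]:
    """Yield every maximal clique of an undirected graph."""
    r = set() if r is None else r
    p = set(graph) if p is None else p
    x = set() if x is None else x
    if not p and not x:
        yield frozenset(r)  # r cannot be extended, so it is maximal
        return
    # Pivot on the vertex with most neighbours in p to prune branches.
    pivot = max(p | x, key=lambda v: len(graph[v] & p))
    for v in list(p - graph[pivot]):
        yield from bron_kerbosch(graph, r | {v}, p & graph[v], x & graph[v])
        p.remove(v)
        x.add(v)
```

Because the weights are discarded once the threshold is applied, a plasmid sitting between two dense groups is reported in every maximal clique it belongs to, which is consistent with the large number of overlapping cliques described above.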

Plasmid cliques agree with current typing schemes

Analysis of the sparse network with OSLOM successfully assigned 50.21% (5371) of the plasmids into 561 cliques of size three or more (Figs. 1c and 3 and Supplementary Fig. 14). Only 1.64% (88) of these plasmids were assigned to multiple cliques, and these were found in the densest regions of the network and at the interfaces between cliques indicating the presence of ‘chimeric plasmids’ (i.e. hybrid plasmids generated through merging of two different plasmids), large-scale transposition or recombination events, or extensive repeated transposition/recombination (Figs. 1c and 3). In addition, this approach covered 564 plasmids from phyla other than the Proteobacteria and Firmicutes, namely from Spirochaetes, Chlamydiae, Actinobacteria, Tenericutes, Bacteroidetes, Cyanobacteria and Fusobacteria. Interestingly, after applying the 0.3 JI threshold, 38.01% (4066/10696) of plasmids that could not be assigned to cliques of size three or more were separated from the network as singletons, while 10.10% (1080) shared an edge with a single plasmid. Therefore, only 1.67% (179) of plasmids were effectively left unassigned. Nonetheless, due to the apparent lack of shared genetic signal, plasmid singletons and pairs were not considered in subsequent analyses. To assess the extent to which ‘mobile elements’ shared between plasmids affect the classification into cliques, we repeated the clique assignment analyses after having removed all accessory CDSs (29,913) associated with transposition, pathogenesis, or resistance (Supplementary Fig. 9). Pruning these genes did not markedly affect the assignment of plasmids into cliques, which gives support to the genetic signal being driven by the genetic similarity of plasmid backbones rather than shared mobile genetic elements.

Clique purity and normalized mutual information (NMI) were used to assess the quality of clique-based classification (see ‘Methods’). These metrics were calculated for cliques comprising plasmids with identified replicon type, plasmids carrying a single identified replicon type, or plasmids with assigned MOB type. Untyped plasmids were disregarded. The observed purity scores were high (>85%), indicating the homogeneity of cliques for a particular plasmid type (Supplementary Fig. 10). This was particularly the case for MOB types (purity = 0.9887) and plasmids assigned to a single replicon type (purity = 0.9522). NMI provides an entropy-based measure of the similarity between two classification systems where a score equal to one indicates identical partitioning into classes, while zero means independent classification. NMI penalizes differences in the number of assignment classes, which explains the low score observed when assessing clique-based versus MOB-based typing (NMI = 0.5223). Nevertheless, high NMI scores were obtained when considering a replicon-based classification scheme (NMI = 0.9044 for all types, and NMI = 0.9336 for single replicon types). It follows that plasmids with the same replicon type often fall together within the same clique. This is also supported by the high correlation between the clique membership size and the number of plasmids assigned to the corresponding replicon class (Supplementary Fig. 11, R2 = 0.862 for plasmids assigned to a single replicon type).
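
The two metrics can be sketched in a few lines of Python. This is an illustrative version under two assumptions that are not confirmed by this section: each plasmid has exactly one clique label and one type label, and NMI is normalized by the arithmetic mean of the two entropies (other normalizations exist; the exact definition is given in the ‘Methods’).

```python
from collections import Counter
from math import log
from typing import Dict, Hashable, Sequence


def purity(cliques: Sequence[Hashable], types: Sequence[Hashable]) -> float:
    """Fraction of plasmids carrying the most common type of their clique."""
    if not cliques or len(cliques) != len(types):
        raise ValueError("need two non-empty label lists of equal length")
    per_clique: Dict[Hashable, Counter] = {}
    for c, t in zip(cliques, types):
        per_clique.setdefault(c, Counter())[t] += 1
    return sum(max(cnt.values()) for cnt in per_clique.values()) / len(cliques)


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of a label distribution."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(cliques: Sequence[Hashable], types: Sequence[Hashable]) -> float:
    """Mutual information normalized as 2 * I(X;Y) / (H(X) + H(Y))."""
    if not cliques or len(cliques) != len(types):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(cliques)
    n_c, n_t = Counter(cliques), Counter(types)
    mi = sum(
        (n_ct / n) * log(n_ct * n / (n_c[c] * n_t[t]))
        for (c, t), n_ct in Counter(zip(cliques, types)).items()
    )
    denom = entropy(cliques) + entropy(types)
    return 1.0 if denom == 0 else 2 * mi / denom
```

Under this definition, merging several types into one clique keeps purity low but also lowers NMI, whereas splitting one type across many cliques leaves purity at one and lowers only NMI, which is how a classification with few classes (such as MOB typing) can score high purity but modest NMI.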

There are exceptions where plasmids from larger replicon classes are further resolved into a few smaller evolutionary-related cliques. One such example is provided by the 22 ‘broad-host-range’ IncP plasmids, which have been split into three cliques (14, 118 and 332) (Supplementary Fig. 12, Supplementary Data 1). While plasmids within these cliques share notably high JI similarity, the similarities between cliques remain low. This is especially true for cliques 332 and 14, for which between-clique similarity is zero. Interestingly, plasmids from clique 332 have been exclusively associated with Gammaproteobacteria, while the ones from cliques 118 and 14 are mostly found in hosts from the Betaproteobacteria class. This arrangement of IncP into multiple cliques with a more constrained host range is in line with previous findings of weaker incompatibilities in IncP34 and the existence of multiple genetically distinct IncP sub-lineages whose backbone is coadapted to their host35. Another example of a genetically heterogeneous replicon type is provided by IncY and p0111 plasmids collected from Escherichia coli strains, which fall into three cliques (119, 230 and 372) (Supplementary Fig. 13). Cliques 119 and 372 cluster IncY and p0111 plasmids, respectively, with a single, possibly misplaced IncFIB plasmid. Conversely, clique 230 comprises both IncY and p0111 plasmids, with a remarkably related genetic backbone. The latter result raises questions on the distinctiveness of IncY and p0111 plasmid types.

Candidate replicon genes recovered from untyped plasmids

The majority of plasmids with unknown replicon types formed small cliques (Supplementary Fig. 14). In fact, 81.02% of the smallest cliques (carrying three to five plasmids) contain exclusively untyped plasmids. Together with the aforementioned singletons and lone plasmid pairs, this trend highlights the many understudied and underrepresented plasmids in sequence databases. Accordingly, the next objective was to investigate the genetic content of untyped cliques to determine candidate replicon genes and further traits of biological relevance.

In total, there are 388 cliques with no assigned replicon types. As the cliques tend to be homogeneous for a replicon type, only the core genes (i.e. genes occurring on all plasmids of a particular clique) found on untyped cliques were considered. Core genes were translated into protein sequences and screened against the translated PlasmidFinder database using TBLASTN36. A range of e values were assessed to determine the threshold maximizing the discovery of replicon candidates while minimizing false positives (Supplementary Fig. 15). The majority of plasmids were assigned to one replicon type with some plasmids having hits to a maximum of three to four different types. Accordingly, the optimal e value threshold was selected when the total number of core gene hits started to diverge from the number of untyped cliques covered. A conservative e value threshold of 0.001 was chosen, which resulted in the identification of 105 candidate genes from 106 plasmid cliques. The accession numbers and positions of candidate genes are listed in the Supplementary Data 1 (Candidate_replicon_gene column) for all carrier plasmids.

To verify the plausibility of the identified gene candidates, HMMER (version 3.2.1) was used to scan amino acid sequences for known protein domain families found in the Pfam database (version 32.0)37. One hundred and sixty-six families, with e values lower than 0.001, were identified on 97 protein sequences and were most commonly associated with replication initiation (Supplementary Fig. 16). Moreover, the majority of functions associated with the discovered protein families relate to plasmid replicon proteins. For example, domains with helix–turn–helix motifs are important for DNA binding of replicon proteins and allow some proteins to regulate their own transcription38. Other examples of transcriptional regulators also exist in plasmid replicon regions, while DNA primase activity has been found on the RepB replicon protein38. Interestingly, replicon proteins involved in rolling-circle replication (a mechanism of plasmid replication) share some of their motifs with proteins involved in plasmid transfer and mobilization38. This could explain why some of the discovered domain families are linked to plasmid mobilization. On the whole, the candidate replicon genes are highly specific to a particular clique of plasmids and should assist description of new incompatibility types.

Cliques exhibit common GC content and bacterial hosts

The unprocessed plasmid network exhibited a pronounced structure in terms of the plasmid nucleotide composition, measured by GC content (Supplementary Fig. 3). This trend was also reflected in the clique composition (Supplementary Fig. 17A). Within a clique, the standard deviation of GC content rarely exceeds 0.02 and is essentially uncorrelated with the clique size (R2 = 0.0155) (Supplementary Fig. 17B). Moreover, a significant difference in GC content is often found between cliques. Analysis of variance, followed by a Tukey’s test, found that 85.3% of the time the GC content between two cliques differs significantly (adjusted p value < 0.001). In contrast, the sequence lengths of plasmids within a clique are more variable, but are also not strongly correlated with clique size (R2 = 0.029) (Supplementary Fig. 17C, D). For plasmid length, a Tukey’s test showed that a significant difference between cliques is observed <34% of the time (adjusted p value < 0.001).
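
The within-clique spread of GC content can be computed along these lines. The sketch assumes GC content is expressed as a fraction of unambiguous bases and uses the sample standard deviation; neither choice is specified in this section, and the function names are hypothetical.

```python
from statistics import stdev
from typing import Dict, List


def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc_sd_per_clique(
    cliques: Dict[str, List[str]], sequences: Dict[str, str]
) -> Dict[str, float]:
    """Sample standard deviation of GC content for each clique."""
    return {
        clique_id: stdev([gc_content(sequences[p]) for p in members])
        for clique_id, members in cliques.items()
        if len(members) >= 2  # stdev needs at least two values
    }
```

On this fractional scale, a standard deviation of 0.02 corresponds to two percentage points of GC content.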

Plasmid GC content has been shown to be strongly correlated to the base composition of the bacterial host’s chromosome39. Indeed, the cliques showed a very high homogeneity (purity) relative to their hosts (Supplementary Fig. 18), a trend that has been identified in other plasmid network reconstruction efforts21. At higher taxonomic levels, cliques have near-perfect purity scores (>0.99). The purity score slightly decreases at the level of the plasmid host family, reaching a value of 0.807 at the species level. Therefore, plasmids with high genetic similarity rarely transcend the level of the bacterial genus, which suggests a limited host range for the vast majority of plasmids. However, these results need to be carefully considered due to inherent biases in the dataset, especially in terms of the predominance of well-studied taxa. Overall, the plasmid cliques show a strong intrinsic propensity towards confined GC content and are found in a limited range of bacterial hosts.

Plasmids within cliques have uniform gene content

The gene content of cliques was assessed for all genes occurring five or more times in the dataset. This threshold was chosen to facilitate computation, and to adequately characterize more prevalent genes. In total, 15,851 out of 35,883 (44.17%) of the assessed genes were ‘core’ genes, meaning they had a within-clique frequency equal to one, suggesting an overall uniformity of gene content in cliques (Supplementary Fig. 19). Furthermore, 6577 (18.33%) of the genes were ‘private’. Private genes are those found in only one clique, with a frequency of one, and their relatively high abundance in the dataset suggests the uniqueness of some cliques with respect to their gene content. However, there is an inherent bias. Plasmids within larger cliques tend to be more dissimilar and share proportionally fewer genes (Supplementary Fig. 20). This pattern can, in part, be explained by the broader gene content of large cliques and the high sequence similarity required for same-gene clustering (95%) within the default implementation of the Prokka–Roary annotation pipeline. In all, 31.94% of cliques containing five or more plasmids were found to have 1 to 10 core genes. However, cliques exhibited a wide range in the number of core genes with 7.74% of cliques carrying over 100 shared genes. Interestingly, 13.55% (42) of cliques had no core genes, which could also be an artefact of the gene annotation pipeline sensitivity or poor-quality assemblies. For instance, plasmids from 19 cliques carried no recognized genes from the pool of 35,883 assessed genes. Functionally, core genes were found to be more often associated with various metabolic processes, transcription regulation and transmembrane transport (Supplementary Fig. 21) when compared to the overall distribution of GO terms, shown in Fig. 1b. Similarly, fewer core genes were involved in transposon movement, pathogenesis and resistance.
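
The definitions of ‘core’ and ‘private’ genes can be expressed as a short sketch. The input layout (a map from clique to member plasmids and a map from plasmid to its gene set) is a hypothetical one chosen for illustration, and ‘private’ is interpreted here as a core gene that is absent from every other clique.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def core_and_private(
    cliques: Dict[str, List[str]], gene_content: Dict[str, Set[str]]
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Return (core genes per clique, private genes per clique)."""
    core: Dict[str, Set[str]] = {}
    cliques_with_gene: Counter = Counter()
    for clique_id, members in cliques.items():
        counts = Counter(
            gene for p in members for gene in gene_content.get(p, set())
        )
        # Core: carried by every plasmid, i.e. within-clique frequency of one.
        core[clique_id] = {g for g, c in counts.items() if c == len(members)}
        cliques_with_gene.update(counts.keys())
    # Private: core in this clique and seen in no other clique.
    private = {
        clique_id: {g for g in genes if cliques_with_gene[g] == 1}
        for clique_id, genes in core.items()
    }
    return core, private
```

A clique with an empty core set under this definition may simply contain one poorly annotated plasmid, since a single missing annotation is enough to drop a gene from the core.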

Inferring HGT through clique interactions

Gene content was also considered in the context of clique structure and interconnectedness. To do so, the original network of plasmids (Supplementary Fig. 3) was rearranged such that: (i) plasmids assigned to the same clique were clustered under a single vertex; (ii) plasmids assigned to multiple cliques were left as solitary vertices anchoring the cliques; (iii) unassigned plasmids were removed. The resulting network is shown in Fig. 4. As highlighted earlier, large cliques generally show lower internal similarity compared to the smaller ones. It is important to note that an arbitrary JI threshold of 0.01 was introduced in Fig. 4 to assist visual interpretation, but the unfiltered version of the network is provided in Supplementary Fig. 22.
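
The three rearrangement rules can be sketched as follows. This is an illustrative reading, not the authors' code: the vertex labels, the input layout and, in particular, the choice to average JI over all cross pairs (counting absent edges as zero) are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, List, Optional, Tuple

# Hypothetical inputs: edges keyed by unordered plasmid pair, and a map
# from plasmid to the list of clique ids it was assigned to.
Edges = Dict[FrozenSet[str], float]


def vertex_of(plasmid: str, membership: Dict[str, List[int]]) -> Optional[str]:
    """Map a plasmid to its vertex in the collapsed network."""
    cliques = membership.get(plasmid, [])
    if not cliques:
        return None                      # (iii) unassigned: removed
    if len(cliques) == 1:
        return f"clique:{cliques[0]}"    # (i) merged into its clique
    return f"plasmid:{plasmid}"          # (ii) multi-clique: kept solitary


def clique_network(
    edges: Edges, membership: Dict[str, List[int]], min_avg_ji: float = 0.01
) -> Dict[FrozenSet[str], Tuple[float, int]]:
    """Return {vertex pair: (average JI, number of plasmid-level links)}."""
    sizes: Counter = Counter()
    for plasmid in membership:
        v = vertex_of(plasmid, membership)
        if v is not None:
            sizes[v] += 1
    ji_sum: Dict[FrozenSet[str], float] = defaultdict(float)
    n_links: Counter = Counter()
    for pair, ji in edges.items():
        if len(pair) != 2:
            continue
        p, q = tuple(pair)
        u, v = vertex_of(p, membership), vertex_of(q, membership)
        if u is None or v is None or u == v:
            continue
        key = frozenset((u, v))
        ji_sum[key] += ji
        n_links[key] += 1
    network = {}
    for key, total in ji_sum.items():
        u, v = tuple(key)
        average = total / (sizes[u] * sizes[v])  # absent pairs count as 0
        if average > min_avg_ji:
            network[key] = (average, n_links[key])
    return network
```

Setting `min_avg_ji=0.0` would correspond to the unfiltered view, while the default mirrors the 0.01 threshold used for visual clarity.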

Fig. 4: The network of cliques.

Cliques, represented as vertices, are connected with an edge if the average Jaccard index (JI) between plasmids of two cliques is >0.01. The colour of the edges indicates the average JI, while the width is proportional to the number of connections between a pair of cliques. The shape and colour of the cliques indicates the phylum of the predominant bacterial host. The size and the transparency are proportional to the clique size and the internal JI, respectively. The cliques form multiple clusters, which have been named based on the genus of the bacterial host characteristic for a particular cluster. There are two exceptions—the Proteobacteria and the Dairy (Lactic) cluster whose respective genera distributions have been provided. The most common GO biological functions of the genes found on plasmids of Proteobacteria, Staphylococcus, Enterococcus and Dairy clusters were further assessed. During the assessment, the respective frequencies of the genes were considered. In case of Proteobacteria, the bar chart distribution of the biological functions is provided. The shared and core gene content of Staphylococcus, Enterococcus and Dairy clusters is presented in the Venn diagram with the numbers in the diagram indicating the number of core and shared genes.

The clustering of cliques in Fig. 4 shows high concordance with the phylogenetic hierarchy of the bacterial hosts. On a global scale, there are four large interconnected clusters (three corresponding to cliques from the phylum Firmicutes and one from the Proteobacteria), eight disjointed clusters and a dozen singled-out triplets and pairs. The clique clusters mostly contain plasmids from a specific genus with some minor deviations, hence the cluster naming. The only two exceptions are the large and diverse Proteobacteria cluster, which harbours plasmids mainly from the genera Escherichia, Klebsiella and Salmonella, and the Dairy bacteria cluster. The most common genes identified in these four large clusters were those functionally involved in transposition. Specifically, 26.4% of the genes in the Proteobacteria cluster were transposition related. In addition, 9.66% of the genes in the Proteobacteria were involved in some form of AMR or metal resistance, and 7.38% in pathogenesis, which may reflect the high number of pathogens found in this phylum40.

The core and shared gene content of the three Firmicutes clusters (Staphylococcus, Enterococcus and Dairy) was also assessed (Fig. 4, Venn diagram). Gene sharing was most common between the clusters associated with Staphylococcus and Enterococcus, potentially indicating a high frequency of HGT between them, and least common between the Staphylococcus and Dairy bacteria clusters. Analysing the content of these shared genes provides insight into both plasmid function and dynamics, such as the identification of HGT events. For example, the same lactose metabolism genes were found in both Staphylococcus and Dairy bacteria plasmids. Also, the trpF gene, involved in tryptophan biosynthesis and previously associated with the Tn3000 and Tn125 transposable elements41,42, was found on plasmids in all three clusters. In contrast to these, the more disjoint clusters of plasmid cliques observed for other genera may be driven by the species’ ecology and life history, which may lead to limited opportunities for contact between lineages. Such an explanation seems plausible for strict pathogens with restricted host range, such as Xanthomonas or Borrelia. Conversely, for lineages with a wider environmental niche like Bacillus, the lower connectivity between cliques may be due to intrinsic genetic factors leading to lower between-plasmid recombination and/or transposition rates.