Mapping road network communities for guiding disease surveillance and control strategies | Scientific Reports

Detecting communities in the African road network

We modeled the ARN as a’primal’ road network, where roads are links and road junctions are nodes33. Spatial road networks have, as any network embedded in two dimensions, physical spatial constraints that impose on them a grid-like structure. In fact, the ARN primal network is composed of 300, 306 road segments that account for a total length of 2, 304, 700 km, with an average road length of 7.6 km ± 13.2 km. Such large standard deviations, as already observed elsewhere23,24,34, are due to the long tailed distribution of road lengths, as illustrated in Fig. 1c. Another property of road network structure is the frequency distribution of the degree of nodes, defined as the number of links connected to each node. Most networks in nature and society have a long tail distribution of node degree, implying the existence of hubs (nodes that connect to a large amount of other nodes)21, with the majority of nodes connecting to very few others. For road networks, however, the degree distribution strongly peaks around 3, indicating that most of the roads are connected with two other roads. The long tail distribution of the length of road segments, coupled with the peaked degree distribution, indicates the presence of translational invariant grid-like structure, in which road density smoothly varies among regions while their connectivity and structure does not. Within such grid-like structures it is very difficult to identify clustered communities, i.e. groups of roads that are more connected within themselves than to other groups. This observation is confirmed by the spatial distribution of betweenness centrality (Bc), which measures the amount of time the shortest paths between each couple of nodes pass through a road. The probability distribution of Bc is long tailed (Fig. 1d), while its spatial distribution spreads across the entire network, with a structural backbone form, as shown in Fig. 1b. Again, under such conditions and because of the absence of bottlenecks, any strategy to detect communities that employs pruning on Bc values35, will be minimally effective.

To detect communities in road networks we follow the observation that human displacement in urban networks is guided by straight lines36. Therefore, geometry can be used to detect communities of roads by assuming that people tend to move more along streets than between between streets. We developed a community detection pipeline that converts a primal road network, where roads are links and roads junction are nodes33, to a dual network representation, where link are nodes and street junction link between nodes37, by mean of straightness and contiguity of roads. It is important to note here that the units of analysis are road segments, which here are typically short and straight between intersections, making the straightness assumption valid. Community detection in the dual network is then performed using a modularity optimization algorithm38. The communities found in the dual network are then mapped back to the original primal road network. These communities encode information about the geometry of road pattern but can also incorporate weights associated with a particular disease to guide the process of community detection.

Nodes in the dual network represent lines in the primal network. The conversion from primal to dual is done by using a modified version of the algorithm known as continuity negotiation37. In brief, we assume that a pair of adjacent edges belongs to the same street if the angle θ between these edges is smaller than θc = 30°. We also assume that the angle between two adjacent edges (i, j) and (j, p) is given by the dot product cos (θ) = ri, j r j,p/ri, jrj,p, where ri, j = r j ri. Under these assumptions, the angle between two edges belonging to a perfect straight line is zero, while it assumes a value of 90° for perpendicular edges.

Our algorithm starts searching for the edge that generates the longest road in the primal space, as can be seen in Fig. 2a. Then, a node is created in the dual space and assigned to this road. Next, we search for the edge that generates the second longest road, and a new node is created in the dual space and assigned to this road. If there is at least one interception between the new road and the previous one, we connect the respective nodes in the dual space. The algorithm continues until all the edges in the primal space are assigned to a node in the dual space, as shown in Fig. 2b. Note that the conversion from primal to the dual road network has been used extensively to estimate human perception and movement along road networks (Space syntax, see36), which also supports our use of road geometry to detect communities.

Figure 2figure 2

Visualization of the community detection method developed. (a) The primal road network with a single segment highlighted in black. (b) The dual representation of the same network with the single segment highlighted in black. (c) The result of community detection on the dual network representation. (d) Conversion back to the primal road network with the identified communities mapped. Figure produced using R v3.4 (www.r-project.org).

Full size image

Despite the regular structure of the network in the primal space, the topology of these networks in the dual space is very rich. For instance the degree distribution in dual space follows the power-law P(k) k−γ. This property has been previously identified in urban networks33 and it is strongly related to the long tailed distribution of road lengths in these networks (see Fig. 1c). Since most of the roads are short, most of the nodes in dual space will have a small number of connections. On the other hand, there are a few long roads (Fig. 2a) that originate at hubs in the dual space (Fig. 2b). Our approach for detecting communities in road networks consists then in performing classical community detection in the dual representation (Fig. 2c) and then bringing the result back to the primal representation, as shown in Fig. 2d. The algorithm used to detect the communities is the modularity-based algorithm by Clauset and Newman35.

The hierarchical mapping of communities on the African road network, with outputs for 10, 20, 30 and 40 sets of communities, is shown in Fig. 3. The maps highlight how connectivity rarely aligns with national borders, with the areas most strongly connected through dense road networks typically straddling two or more countries. The hierarchical nature of the approach is illustrated through the breakdown of the 10 large regions in Fig. 3a into further sub-regions in b, c and d, emphasizing the main structural divides within each region in mapped in 3a. Some large regions appear consistently in each map, for example, a single community spans the entire north African coast, extending south into the Sahara. South Africa appears as wholly contained within a single community, while the horn of Africa containing Somalia and much of Ethiopia and Kenya in consistently mapped as one community. The four maps shown are example outputs, but any number of communities can be identified. The clustering that maximises modularity produces 104 communities, and these are mapped in Fig. 4.

Figure 3figure 3

Example outputs of community detection on the unweighted Africa road network, constrained to (a) 10 communities; (b) 20 communities; (c) 30 communities and (d) 40 communities. Figure produced using ArcGIS v10.5 (www.arcgis.com).

Full size image

Figure 4figure 4

Africa road network community mapping for the strongest configuration of communities, with bridge areas of low connectivity identified, for (a) Africa, with bridge areas in white, (b) East Africa with bridge roads between communities mapped in red, (c) West Africa with bridge roads between communities mapped in red, and (d) Southern Africa with bridge roads between communities mapped in red. Figure produced using ArcGIS v10.5 (www.arcgis.com).

Full size image

Even with division into 104 communities, the north Africa region remains as a single community, strongly separated from sub-Saharan Africa by large bridge regions. South Africa also remains as almost wholly within its own community, with Somalia and Namibia showing similar patterns. The countries with the largest numbers of communities tend to be those with the least dense infrastructure equating to poor connectivity, such as DRC and Angola, though West Africa also shows many distinct clusters, especially within Nigeria. Apart from the Sahara, the largest bridge regions of poor connectivity are located across the central belt of sub-Saharan Africa, where population densities are low and transport infrastructure is both sparse and often poor. The communities mapped in Figs 3 and 4 align in many cases with recorded population and pathogen movements. For example, the broad southern and eastern community divides match well those seen in HIV-1 subtype analyses12 and community detection analyses based on migration data27. At more regional scales, there also exist similarities with prior analyses based on human and pathogen movement patterns. For example, the western, coastal and northern communities within Kenya in Fig. 4b, identified previously through mobile phone and census derived movement data39,40. Further, Guinea, Liberia and Sierra Leone typically remain mostly within a single community in Fig. 3, with some divides evident in Fig. 4c. This shows some strong similarities with the spread of Ebola virus through genome analysis15, particularly the multiple links between rural Guinea and Sierra Leone, though Fig. 4c highlights a divide between the regions containing Conakry and Freetown when Africa is broken into the 104 communities. Figure 3 highlights the connections between Kinshasa in western DRC and Angola, with the recent yellow fever outbreak spreading within the communities mapped. Figure 4d shows the’best’ communities map for an area of southern Africa, and the strong cross-border links between Swaziland, southern Mozambique and western South Africa are mapped within a single community, as well as wider links highlighted in Fig. 3, matching the travel patterns found from Swaziland malaria surveillance data41.

Integrating P. falciparum malaria prevalence and population data with road networks for weighted community detection

The previous section outlined methods for community detection on unweighted road networks. To integrate disease occurrence, prevalence or incidence data for the identification of areas of likely elevated movement of infections or for guiding the identification of operational control units, an adaptation to weighted networks is required. We demonstrate this through the integration of the data on estimated numbers of P. falciparum infections per 5 × 5 km grid square into the community detection pipeline. The final pipeline for community detection calculated a trade-off between form and function of roads in order to obtain a network partition.

The form is related to the topology of the road network and is taken into account during the primal-dual conversion. The topological component guarantees that only neighbor and well connected locations could belong to the same community. The functional part, on the other hand, is calculated by the combination of estimated P. falciparum malaria prevalence multiplied by population to obtain estimated numbers of infections, as outlined above.

The two factors were combined to form a weight to each edge of our primal network. The weight wi, j of edge (i, j) is defined as

$${w}_{i,j}=\frac{1}{/{{\bf{r}}}_{i,j}/}\begin{array}{c}{{\bf{r}}}_{{{\bf{r}}}_{j}}\\ {{\bf{r}}}_{i}\end{array}m({\bf{r}})p({\bf{r}})d({\bf{r}})$$

(1)

where m(r) is the P. falciparum malaria prevalence and p(r) is the population count, both at coordinate r. These values are obtained directly from the data. When the primal representation is converted into its dual version, the weights of primal edges, given by Eq. 1, are converted into weights of dual nodes, which are defined as

$${\lambda }_{\bar{i}}=\,{\rm{\max }}({w}_{i,j}),\quad (i,j)\in {{\rm{\Omega }}}_{\bar{i}},$$

(2)

where \(\bar{i}\) represents the i th dual node and \({{\rm{\Omega }}}_{\bar{i}}\) represents the set of all the primal edges that were combined together to form the dual node \(\bar{i}\) (see Fig. 2a,b). Finally, weights for the dual edges are created from the weights of dual nodes, by simply assuming

$${\lambda }_{\bar{i},j}=\,{\rm{\max }}({\lambda }_{\bar{i}},{\lambda }_{\bar{j}}).$$

(3)

The dual network weighted by values of λij was used as input for a weighted community detection algorithm. Ultimately, when the communities detected in the dual space are translated back to primal space, we have that neighbor locations with similar values of estimated P. falciparum infections belong to the same communities. For the example of P. falciparum malaria used here, the max function was used, representing maximum numbers of infections on each road segment in 2015. This was chosen to identify connectivity to the highest burden areas. Areas with large numbers of infections are often ‘sources’, with infected populations moving back and forward from them spreading parasites elsewhere6,42. Therefore, mapping which regions are most strongly connected to them is of value. Alternative metrics can be used however, depending on the aims of the analyses.

The integration of P. falciparum malaria prevalence and population (Fig. 5a) through weighting road links by the maximum values across them produces a different pattern of communities (Fig. 5b) to those based solely on network structure (Fig. 3). The mapping of 20 communities is shown here, as it identifies key regions of known malaria connectivity, as outlined below. The mapping shows areas of key interest in malaria elimination efforts connected across national borders, such as much of Namibia linked to southern Angola43, but the Zambezi region of Namibia more strongly linked to the community encompassing neighbouring Zambia, Zimbabwe and Botswana44. In Namibia, malaria movement communities identified through the integration of mobile phone-based movement data and case-based risk mapping26 show correspondence in mapping a northeast community. Moreover, Swaziland is shown as being central to a community covering, southern Mozambique and the malaria endemic regions of South Africa, matching closely the origin locations of the majority of internationally imported cases to Swaziland and South Africa41,45,46. The movements of people and malaria between the highlands and southern and western regions of Uganda, and into Rwanda47, also aligns with the community patterns shown in Fig. 5b. Finally, though quantifying different factors, the analyses show a similar east-west split to that found in analyses of malaria drug resistance mutations6,48 and malaria movement community mapping27.

Figure 5figure 5

Data and example outputs for Plasmodium falciparum malaria weighted road network analyses. (a) Africa road network with each road segment coloured by its maximum value of P. falciparum prevalence multiplied by population. (b) Output of community detection on the data in (a), showing the result for 20 communities. Figure produced using ArcGIS v10.5 (www.arcgis.com).

Full size image