Social network analysis in Telecom data – Journal of Big Data

This section describes the data sets used in this work, how we built social network and the feature extraction methods.

Solution architecture

We have chosen Hortonworks Data Platform (HDP)Footnote 3 as a big data platform to install and use in the study. HDP is a free open-source framework under the Apache 2.0 LicenseFootnote 4 designed to deal with data from different sources and formats. It has a variety of open source systems and tools such as:

  • Hadoop Distributed File System (HDFS)Footnote 5: it is a Java-based, file system for storing large volumes of data; it provides scalable and reliable data storage.

  • Apache YARNFootnote 6: it represents the processing layer for managing distributed applications that run on multiple machines in a network, it allows using various data processing engines for batch, interactive and real-time stream processing of data stored in HDFS, so YARN provides resource management while HDFS provides storage.

  • Apache SparkFootnote 7: A distributed, in-memory data processing engine designed for large-scale data processing and analysis.

  • Apache ZeppelinFootnote 8: A web-based notebook which supports interactive data exploration, visualization and collaboration.

We stored Data in HDFS as a spark DataFrameFootnote 9 format which is a Dataset organized into named columns; it is similar to data frame in R/Python or a table in a relational database, but with more optimizations. We used Spark tools for processing data, building the telecom social network and calculating SNA features.

Data description and preparation

Four data sources were selected for our work:

  1. 1.

    Customer data: The customer data has been collected from CRM system, it contains customer contract information (subscriber GSMs, subscription type, age, gender, location …). In addition to the above, the customer data also contains all services, offers, packages that were subscribed by the customer.

  2. 2.

    Mobile IMEI information: It contains all information about the customer mobile devise such model, brand and if the device is dual-SIM or not.

  3. 3.

    Call details records (CDRs): It contains all transactions and actions that were taken by the customer; we selected for our work only data of calls. This data source is generated as text files.

  4. 4.

    Cells and towers information: it contains the information of actions location like longitude and latitude coordinates, sub-area, area and city.

We have exerted great effort in the process of collecting and clearing previous sources of information due to the large volume of data and the variety of its sources. In addition to collecting and clearing data, we had to understand and link all types of data so that they can be used for our research. In the end, 3 months of data sets were prepared which contained more than 10 million customers with their data and about one billion records of calls between customers.

An ETL on CDRs were performed in a duration of 3 months and prepared using big data platform HDP. Before the storage operation, the data were cleansed by eliminating the irrelevant numbers from CDRs. These numbers were classified in three types: first type is call center numbers which are numbers that receive a lot of calls from massive amount of other numbers but do not themselves make any calls. Second type is Telesales numbers (the opposite behavior of call center numbers) which are numbers make a lot of calls to a lot of numbers and don’t receive many calls back. The last type is the wrong calls numbers which are numbers that received one call with short duration and didn’t make any calls at all.

After cleansing operation, we stored two types data in HDFS:

  • Detailed data: contains all calls in each day with time of the call and the duration. This type of data is used to extract Multi-SIM subscribers model features. Table 2 shows a sample of detailed data.

  • Summarized data: contains the aggregation of detailed data over 3 months considering direction of calls, each record contains calling and called part with number of all their calls and total calls duration in the whole period. This type of data is used to build the social network. Table 3 shows sample of summarized data.

Table 2 Sample of detailed data

Full size table

Table 3 Sample of summarized data

Full size table

Building social network

Summarized data was used as mentioned before to build the social network for 3 months where nodes represent GSM numbers of subscribers and edges represent any interactions between subscribers (calls in our case). The result was a direct graph which contains about 16 million nodes and about 300 million edges. Figure 2 visualizes a sample of our social network where size and color of nodes express ranking degrees and lines between the nodes express ranking weights. We used Fruchterman-Reingold algorithm [27] which is a force-directed layout algorithm for drawing the graph.

Fig. 2figure 2

Visualization for a sample of the Telecom social network

Full size image

Weights were added to our graph where each edge carries a different weight. The calculated weights depend on the count and duration of all calls between each side of edge. In order to unify weight over all the graph we normalized the duration of calls for each edge by dividing to the max value of duration in the network and we did the same thing for the count of calls. Finally, edges weight was calculated using the following equation:

$$\begin{aligned} W=\ \alpha \ .\ D_{Norm}\ +\left( 1-\alpha \right) \ .\ N_{Norm} \end{aligned}$$

(1)

where W represents edge weight, \(D_{Norm}\) represents normalized duration of calls, \(N_{Norm}\) represents normalized number of calls and \(\alpha\) represents the importance factor where \(0<\alpha <1\). By choosing \(\alpha >0.5\) we give duration of calls more importance than number of calls and vice versa when \(\alpha <0.5\). In our network, calls duration was considered a little bit more important than calls number so we selected \(\alpha =0.6\) and the calculated weight must be \(0<W<1\). Table 4 shows sample of Telecom social network data.

Table 4 Sample of Telecom social network data

Full size table

Social network analysis features

With the help of social network analysis, Telecom companies can recognize the customer’s behavior and predict the strength of relations among customers. We analyzed our social network and calculated centrality measures. First, degree centrality (In-Degree, Out-Degree, Degree) was calculated for each node. Neighborhood degree (ND) [28] which is the node degree plus the sum degrees of node neighbors was also calculated, ND is given by the equation:

$$\begin{aligned} ND\left( v\right) =\ d\left( v\right) +\ \sum _{u\in N(v)}{d(u)} \end{aligned}$$

(2)

where \(d\left( v\right)\) is the degree of the node v and \(N\left( v\right)\) represents neighbors of node v. Another calculated centrality measure is local clustering coefficient (LCC) [29], which indicates how close the node’s neighbors are to be a clique (complete graph). This measure is given by the equation:

$$\begin{aligned} LCC\left( v\right) =\ \sum _{u\in N(v)}{\frac{\left| N\left( v\right) \ \cap N(u)\right| }{\left| N(v)\right| *(\left| N\left( v\right) \right| -1)}} \end{aligned}$$

(3)

All calculated measures are normalized by dividing to the max value of each measure over all the graph. Table 5 shows a sample of calculated SNA centrality measures.

Table 5 Sample of SNA centrality measures

Full size table

The calculated SNA features (in-degree, out-degree, degree, ND, LCC) were used to enhance the churn prediction models that used in the Telecom company by adding social network features on top of the traditional churn predictors.

Detecting influencers

Previous measures can help find out the most influenced customers in the network but they are not enough, so more measures were calculated to come up with an influence score which expresses the importance of each node in the graph. This importance is based on the strength of links with other nodes presented by eigenvector centrality (EV), and the global location of the node within the graph presented by influence capability of node (IC).

Eigenvector centrality measures a node’s importance while considering the importance of its neighbors; the main idea is that links from important nodes (as measured by previous centrality measures) are more valuable than links from unimportant nodes. All nodes start with equal EV value, but as the computation progresses, nodes with more degree start gaining importance. This importance propagates out to the nodes to which they are connected. After a number of computing iterations, the EV values stabilize and give the final values for eigenvector centrality for all nodes. EV centrality is calculated by using the equation:

$$\begin{aligned} EV\left( v\right) =\ \frac{1}{\lambda }\sum ^n_{j=1}{A_{ij}v_j\ \ \ } \end{aligned}$$

(4)

Where \(\lambda\) is a constant scalar value and A is the adjacency matrix [30] which represents the network mathematically and has values:

$$\begin{aligned} A_{ij}=\ \left\{ \begin{array}{*{20}l} 1 & \hbox {If there is an edge between vertices } v_i \hbox { and } v_j \\ 0 & Otherwise\end{array} \right. \end{aligned}$$

(5)

Edges weight was included in the calculations of EV by replacing the one values in adjacency matrix A with the edge weight.

Wang et al. [18] proposed an Influence Capability measure based on k-shell values and the iteration information in the decomposition process to distinguish nodes with the same \(k_s\) values. As we see in Table 1 many nodes have the same \(k_s\) values but different locations in graph so their influence capability differs and nodes with higher iteration values are closer to the core nodes. First, they proposed a k-shell iteration factor:

$$\begin{aligned} {\delta }_v=k_s\ .\ \ \left( 1+\frac{n}{m}\right) \end{aligned}$$

(6)

where, \(k_s\) is the k-shell value for node v, m is the total iteration number and v is the removed node in the nth iteration of the k-degree process. Then they proposed an influence capability factor defined as follows:

$$\begin{aligned} IC_v=\ {\delta }_v\ .\ D_v+\ \ \sum _{u\ \epsilon \ N(v)}{{\delta }_u\ .\ D_u} \end{aligned}$$

(7)

where \(IC_v\) represents the influence capability of node v, \({\delta }_v\) is the k-shell iteration factor of node v, \(D_v\) is the degree of node v and \(N\left( v\right) \mathrm {\ }\)represents neighbors of node v. The Influence Capability factor IC contains the degree which is a local measure and the k-shell iteration factor which is a global measure. Therefor, IC takes into consideration the local and global influence capabilities of the node which help to distinguish between nodes more accurately. We calculated EV and IC for all nodes in network and normalized values. Table 6 shows a sample of calculated EV and IC measures.

Table 6 Sample of EV and IC measures

Full size table

Multi-SIM subscribers model

There are only two major Telecom companies in our country, we have named the first operator to which the data set belongs to as “Original operator”, the other operator has been named as “Second operator”. We used the social network and detailed data to build Multi-SIM subscribers’ model. First, we benefited from graph proprieties to find nodes (subscribers) that share mutual nodes (neighbors) because the subscriber often calls the same people from his different SIMs. Finding nodes with mutual neighbors is very complex and expensive, so we simplified it by distributing calculations over periods of 10 days throughout the 3 months duration, and removing node pairs that share less than three neighbors in each period. After finding node pairs that share neighbors, two filters were added to the resulted data. The first filter removes pairs that have edges between them because if the subscriber has two SIMs, it is rare that he calls from one of his SIMs to the other SIM (calling himself). The second filter removes pairs that have a number of shared nodes less than the selected threshold. We selected 10 shared nodes as a threshold. The next step in our Multi-SIM subscribers’ model was calculating two types of measures: SNA similarity measures and SNA behavioral measures for each pair of nodes after filtration step. SNA similarity [31] measures include the following:

  • Jaccard measure: in this measure we normalize the number of shared neighbors between two nodes based on the size of union of its two neighborhoods’. This measure is given by the equation:

    $$\begin{aligned} {Sim}_{Jacc}\left( v\ ,\ u\right) =\ \frac{\left| N\left( v\right) \ \cap N(u)\right| }{\left| N\left( v\right) \ \cup N(u)\right| } \end{aligned}$$

    (8)

    where \(N\left( v\right) \mathrm {\ }\)represents neighbors of node v.

  • Cosine measure: it is the cosine of the angle between the characteristic vectors of the neighborhoods of two nodes. This measure is given by the equation:

    $$\begin{aligned} {Sim}_{Cos}\left( v\ ,\ u\right) =\ \frac{\left| N\left( v\right) \ \cap N(u)\right| }{\sqrt{\left| N\left( v\right) \ \right| *\left| N\left( u\right) \ \right| }\ } \end{aligned}$$

    (9)

We calculated a similarity score, which is the average between the two previous similarity measures. The similarity score plays a main role to detect pairs that have high probability to be similar and exclude ones with low probability by filtering on a threshold (we selected 0.1 threshold). Table 7 shows a sample of calculated similarity SNA measures.

Table 7 Sample of calculated similarity SNA measures

Full size table

SNA behavioral measures focus more on the behavior of each node of the pair with the shared neighbors. We used detailed data to extract these measures and use them in addition to similarity score to increase the probability. SNA behavioral measures include the following:

  • Common duration range: it represents the most common duration range of voice calls with each node of the shared neighbors. Duration ranges are clustered into 6 groups:

    • 0 to 4 min

    • 5 to10 min

    • 11 to 20 min

    • 21 to 40 min

    • 41 to 60 min

    • More than 60 min

  • Common time of day: it represents the most common period of the day where a voice call occurs with each node of the shared neighbors. The day is divided into 4 periods; and each period spans 6 h:

    • Period 1: 00:00 a.m. to 06:00 a.m.

    • Period 2: 06:00 a.m. to 12:00 p.m.

    • Period 3: 12:00 p.m. to 06:00 p.m.

    • Period 4: 06:00 p.m. to 00:00 a.m.

  • Common day of week: it represents the most common day of week where a voice call occurs with each node of the shared neighbors.

A friend is considered as a mutual behavioral friend when he has at least two of the behavioral measures with the same values with each side of the pair.

In addition to previous SNA measures, two other measures were used: subscriber location and IMEI. The operator can only records location and IMEI for its subscribers, as a result we can use these information only for detecting similarities between subscribers in the same operator. Subscribers often switch between their SIMs using the same mobile device, so IMEI can be used as an indicator of similarity between subscribers who have the same IMEI value. Subscriber location can help also as IMEI by identifying the most common places where the subscriber resides (home and job). We extracted these measures for pairs whose their two nodes are from the same operator using customer data, mobile IMEI information and towers information to increase accuracy in detecting similar subscribers.