Social Network Analytics
Mục Lục
Social Network Analytics
Social Network Analytics (with a Case Study in Python)
Introduction
Have you ever wondered how Social Media websites recommend you the friends or connections? In fact, some social media websites like linkedin also notify you with the jobs you may be interested to apply. Any guesses about how do they do that?
They study the Social Networks (A network is a set of objects(nodes) with interconnections (edges)) where people act as nodes of the network and determine how connected the two profiles are by various parameters. And this helps them in recommending the jobs, connections etc.
Let us take up an example to understand this better.
Consider a political party “ABC” where the people within the party are having disputes among themselves. If the party splits into two parties, do you think we will be able to identify which member would join which party? Think about it. Based on the Social Network of all the party members, one can answer the question “Which member will join which group?”. Interesting, right?
Social Network Analysis can also help prevent spreading of rumours on Social Media by detecting the important nodes in the Social Network which when turned off will minimize the spread of a particular rumour.
In this article we will first study about the various Node and Edge attributes, different types of Graphs and also about different parameters that describe the Social network with a case study in Python. Let’s get started!
Table of Contents
- Node and Edge Attributes of Graphs
- Bipartite Graphs
- Clustering Coefficient and Distance Measures
- Centrality Measures
- Case study in Python
Node and Edge Attributes of Graphs
Edges connecting different nodes can be weighted to determine the strength of relation between nodes.And also they can be given different relations like ‘family’, ‘friend’, ‘enemy’.
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib notebook
G = nx.Graph()
G.add_edge(‘A’,’B’,weight=13,relation=’friend’)
G.add_edge(‘B’,’C’,weight=9,relation=’family’)
G.add_edge(‘B’,’D’,weight=7,relation=’friend’)
G.add_edge(‘E’,’B’,weight=10,relation=’friend’)
G.add_edge(‘E’,’A’,weight=1,relation=’enemy’)
G.add_edge(‘F’,’B’,weight=13,relation=’family’)
G.edges(data=True)
Output:
[(‘C’, ‘B’, {‘relation’: ‘family’, ‘weight’: 9}),
(‘E’, ‘B’, {‘relation’: ‘friend’, ‘weight’: 10}),
(‘E’, ‘A’, {‘relation’: ‘enemy’, ‘weight’: 1}),
(‘B’, ‘F’, {‘relation’: ‘family’, ‘weight’: 13}),
(‘B’, ‘A’, {‘relation’: ‘friend’, ‘weight’: 13}),
(‘B’, ‘D’, {‘relation’: ‘friend’, ‘weight’: 7})]
Undirected graphs have edges that do not have a direction. The edges indicate a two-way relationship, in that each edge can be traversed in both directions.Directed graphs have edges with direction. The edges indicate a one-way relationship, in that each edge can only be traversed in a single direction.
Node attributes can be any information about a person,company (node) for example we can have node attributes as ‘Manager’, ‘Analyst’ , ‘Engineer’ etc.
Degree of Nodes : It is the number of nodes that the given node is connected to. There are two types of degree ‘In-degree’ and ‘Out-degree’ which comes into picture in directed graph.
G.add_edge(‘A’,’B’,weight=13,relation=’friend’)
G.add_edge(‘B’,’C’,weight=9,relation=’family’)
G.add_node(‘A’,role=’Trader’)
G.add_node(‘B’,role=’Analyst’)
G.add_node(‘C’,role=’Manager’)
G.nodes(data=True)
Output:
[(‘C’, {‘role’: ‘Manager’}),
(‘B’, {‘role’: ‘Analyst’}),
(‘A’, {‘role’: ‘Trader’})]
Bipartite Graphs
Suppose we are working in a company where the users earn money or prizes if the team or players selected by them win in a particular match.And we are given the responsibility to find the important players or teams which the users will be targeting.This network between the users and the players selected by them is a Bipartite graph.
Bipartite Graph is a graph whose nodes can be split into two groups G1 and G2 and each edge connects the node from G1 to node from G2.
Consider here there are 5 users (blue nodes) and 4 Players or Teams (green nodes) as this graph is split into two groups of nodes this is a Bipartite Graph.
from networkx.algorithms import bipartite
B = nx.Graph()
B.add_nodes_from([‘A’,’B’,’C’,’D’,’E’],bipartite=0)
B.add_nodes_from([1,2,3,4],bipartite=1)
G.add_edges_from([(‘A’,1),(‘B’,1),(‘C’,1),(‘C’,3),(‘D’,4),(‘E’,1),(‘A’,2),(‘E’,2)])
bipartite.is_bipartite(B)
Output:
True
Bipartite Graphs have two groups one of users and other of the products that the users have purchased in the past. So now we need to create the Graph between users which share the common products and this would be a weighted graph where weights would be proportional to how many products they have in common.
This Graph is called as PROJECTED Graph and so if a customer buys some new product we can recommend that product to the users which share an edge in the Projected Graph with that user.
Clustering Coefficient and Distance Measures
Triadic closure : It is the tendency of the nodes (people) who share common connections in a social network to get connected.
Local Clustering Coefficient : Fraction of pair of Node’s friend that are friends with each other.
Consider node ‘E’ it has 4 friends.So number of pairs of friends are 4C2 = 6.
Pairs of ‘E’ friend’s which are friend’s with each other = 2.
Clustering Coefficient =No of pairs of ‘E’ friend’s who are friend’s with each otherNo of pairs of ‘E’ friends
Local Clustering Coefficient in our case = 26= 0.33
Global Clustering Coefficient : It measures the degree to which the nodes in the whole network tend to cluster or form triangles.It can be calculated using two approaches using Average Local Clustering Coefficient and other method is to calculate Transitivity (It is measure of number of open triads in the network).
Distance Measures
Path : It is a sequence of nodes connected by an edge. Consider a path from node 8 to node 1. One of the path is 8–2–5–1 and there are many other paths.
Distance between two nodes will be the length of the shortest path between them.
To compute the distance of all the nodes from a given node Breadth First Search algorithm is used.In Networkx we can use nx.bfs_tree(G,’A’) command to get the tree to get distances from node ‘A’.
Suppose in a company we need to study the Performance of 5 different groups based on certain characteristics. So we need to compare the closeness of different nodes in a network and compare these 5 groups based on degree of closeness between their members.This can be computed by Average distance between a pair of nodes in a network.
nx.average_shortest_path_length(G)
Diameter : It is the maximum distance between any pair of nodes in a network.
Eccentricity : Eccentricity of a node n is the largest distance between node n and all other nodes.
Periphery of Graph : It is the set of nodes that have eccentricity equal to the diameter of the graph.
Centrality Measures
Many times in real world networks we need to answer the question — “ Which are the important nodes in the network ??”.Its main application is while studying social network to avoid spreading of rumours on network.To identify key locations of airports Hubs in the flight connectivity network.And this can also be useful to identify important web pages for Customer churn prediction model in Marketing Analytics.
Important nodes in a network are identified by Network Centrality measures.
Degree Centrality : This is based on assumption that Important nodes have many connections.
Cdeg of node ‘a’ = Degree of node ‘a’No. of nodes in the network — 1
In directed networks we have the choice of using In-degree centrality or out-degree centrality of node based on the application.
Closeness Centrality : This is based on the assumption that important nodes are close to other nodes.
Closeness centrality of node ‘a’ = No of nodes in the network — 1 d(a,u)
Where d(a,u) represents the shortest distance between node a and node u.
Betweenness Centrality : Distance between two nodes is the shortest path between them. It measures how many times a given node lies in the shortest path between the two nodes.It will be an important parameter in the Transportation network so as to create hubs for maintenance purpose in case of aircraft.
Case Study in Python
Consider a Social Network where people are connected to each other and we have the information about the network based on current Timestamp and we need to predict future links that will be created in the network.
import networkx as nx
import pandas as pd
import numpy as np
import pickle
G = nx.read_gpickle(‘Social_Network.txt’)
print(nx.info(G))
Output :
Type: Graph
Number of nodes: 1005
Number of edges: 16706
Average degree: 33.2458
This graph consists of 1005 users on Social network and according to current timestamp there are 16706 edges or connections.
So now for Link Prediction Model we have different measures or indexes like-
Measure 1 — Common Neighbors
Measure 2 — Jaccard Coefficient : Number of common neighbours normalized by total no. of neighbours.
Jaccard_coeff(X,Y) = N(X) N(Y)N(X) N(Y)
Where N(X) represents no. of neighbours of X.
Measure 3 : Resource Allocation
Measure 4 : Adamic-Adar Index
Measure 5 : Preferential Attachment
future_connections = pd.read_csv(‘Future_Connections.csv’, index_col=0, converters={0: eval})
So now we would create the features for predicting the links between different users.
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScalerfor node in G.nodes():
G.node[node][‘community’] = G.node[node][‘Department’]
preferential_attachment = list(nx.preferential_attachment(G))df = pd.DataFrame(index=[(x[0], x[1]) for x in preferential_attachment])
df[‘preferential_attachment’] = [x[2] for x in preferential_attachment]
SH = list(nx.cn_soundarajan_hopcroft(G))
df_SH = pd.DataFrame(index=[(x[0], x[1]) for x in SH])
df_SH[‘soundarajan_hopcroft’] = [x[2] for x in SH]
df = df.join(df_SH,how=’outer’)
M = mean(df[‘soundarajan_hopcroft’] )
df[‘soundarajan_hopcroft’] = df[‘soundarajan_hopcroft’].fillna(value=M)
df[‘resource_allocation_index’] = [x[2] for x in list(nx.resource_allocation_index(G))]
df[‘jaccard_coefficient’] = [x[2] for x in list(nx.jaccard_coefficient(G))]
df = future_connections.join(df,how=’outer’)
df_train = df[~pd.isnull(df[‘Future Connection’])]
df_test = df[pd.isnull(df[‘Future Connection’])]
features = [‘cn_soundarajan_hopcroft’, ‘preferential_attachment’, ‘resource_allocation_index’, ‘jaccard_coefficient’]
%%Creating an MLPClassifier Model for predicting the future links
X_train = df_train[features]
Y_train = df_train[‘Future Connection’]
X_test = df_test[features]
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
clf = MLPClassifier(hidden_layer_sizes = [10, 5], alpha = 5,random_state = 0, solver=’lbfgs’, verbose=0)
clf.fit(X_train_scaled, Y_train)
test_proba = clf.predict_proba(X_test_scaled)[:, 1]
predictions = pd.Series(test_proba,X_test.index)
target = future_connections[pd.isnull(future_connections[‘Future Connection’])]
target[‘prob’] = [predictions[x] for x in target.index]
End Notes
Social Network Analytics have great application in the data Science Industry and it can be used in a wide range of applications such as transportation network.
Online vegetable delivery companies can identify their important nodes in the network to keep their reserves at important hubs because it is important for them as some of the fresh vegetables get deteriorated after some time. Graphs are also used to solve various problems in Operations and Supply Chain Industry for example in solving Traveling salesman problem, Vehicle routing problem etc.
Shreyansh Nanawati
I am currently in the fourth year of my graduation in Mechanical Eng from NIT Nagpur. Being a data Science enthusiast and a person who loves to play with maths, I am looking forward to explore my knowledge in this domain.