Network Analysis from Social Media Data with NetworkX
Social media, and Reddit in particular, is a channel for communication, information, and entertainment. Data from social media can be viewed as a network of friends and followers, and these platforms play a key role in our lives, from conveying and spreading information to influencing people. Being able to analyze and visualize the networks formed by these relationships and influences is therefore important. Let’s learn how to extract, visualize, and analyze networks from Reddit data in Python using NetworkX.
Based on an example of Covid-19-related Reddit data, the following sections will help you answer several questions, such as:
- Where does the data come from? How do we collect it? How do we build a graph from the collected data?
- What is the overall structure of the social graph?
- Who are the important people (the hubs), and what communities exist in the network?
Some exploratory data analysis and discussion are included in the last part of the post. Let’s begin!
Dataset Preparation
Creating a Reddit application
To collect Reddit data, we first need to register an application on Reddit at https://www.reddit.com/prefs/apps. Fill out the application information, including the name, description, URL (optional), and redirect URL. The developer name can be changed later. The string shown below “web app” (the client ID) and the one beside “secret” (the client secret) will be used in the next step to collect the data.
Collecting Covid-19-related Reddit data with PRAW
First, make sure PRAW is installed with !pip install praw, then import praw to collect Reddit data. Next, paste the “web app” and “secret” strings into client_id and client_secret, respectively.
Finally, given a subreddit, we can collect its posts. For example, to collect Covid Reddit data, I used “covid” as the subreddit. Note that the subreddit is customizable, so we can collect other data by simply changing it.
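A minimal collection sketch; the credential placeholders, the user agent string, and the limit of 50 posts are assumptions you should replace with your own values:

```python
import praw

# Placeholder credentials: paste in the values from your own Reddit app.
reddit = praw.Reddit(
    client_id="YOUR_WEB_APP_ID",      # the string below "web app"
    client_secret="YOUR_SECRET",      # the string beside "secret"
    user_agent="covid-network-demo",  # any descriptive string
)

subreddit = reddit.subreddit("covid")          # customizable subreddit
submissions = list(subreddit.hot(limit=50))    # 50 is an arbitrary choice

for submission in submissions:
    print(f"{submission.author} - {submission.title}")
```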
The collected data for Covid-19 posts looks like:
ADROBLES2024 - /r/COVID has it's own chatroom! (If link doesn't work, check the sidebar.
mark1241 - Donald Trump Recovers From Covid-19, "Don't be afraid of COVID"
how_when_why_where - The White House’s Version of Contact Tracing
MarsupialElectrical8 - Coffin Dancers
The50centTourist - Does Anyone Else Think Trump Is Faking Covid?
Miniskrik - Trump says will leave hospital on Monday, "Don't be afraid of Covid."
Julie_Roys - Pastor Greg Laurie becomes the 12th person at the Rose Garden ceremony on Sept. 26 to test positive for COVID-19. Laurie also attended a prayer march the same day with the Vice President, Franklin Graham, and thousands of others.
White_Mlungu_Capital - What special treatment did Trump get to recover from covid so fast?
MickGhee - Trump Claims COVID Vaccine to be Released "Momentarily" in Latest Video
MrDrProfessorScience - Has anyone caught covid from food?
aeb526 - What were your first covid symptoms?
kaushiksridhar83 - “Lives or Livelihoods” – The social cost of COVID-19
I collected Covid-19-related data from Reddit for two reasons: 1) Reddit is an American social news aggregation site where every member can submit content such as text, images, and links, which other members then vote on. Reddit is also a kind of free-speech platform where we can hear others’ thoughts without many constraints. This makes it well suited to Covid-19, since we may want to hear what people think about the pandemic and then discuss it; 2) In a previous post, I collected Covid-19 data from Twitter, so it is worthwhile to collect data from another platform, and probably from a different audience.
NetworkX Installation
It is time to begin working with the collected data in Python. First, install NetworkX by typing this into your command line: pip install networkx. If you run into any problems, try updating the package with pip install networkx --upgrade.
Getting Started
Creating and Populating the Graph
To create a graph, we use g = nx.Graph().
To populate the graph with the collected data, we need to define what the nodes and edges are. For each Reddit post we collected the author, the comments, and the title, so we can use authors as nodes and the connections/links between them as edges. The following implementation populates the graph and prints out post information.
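A sketch of one way to do this, reusing the submissions list from the collection step; linking each post’s author to its commenters, and skipping deleted accounts, are assumptions rather than the post’s exact scheme:

```python
import networkx as nx

g = nx.Graph()

for submission in submissions:
    if submission.author is None:  # deleted accounts come back as None
        continue
    g.add_node(submission.author)
    print(f"{submission.author} - {submission.title}")
    # Flatten "load more comments" placeholders before traversing.
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        if comment.author is not None:
            g.add_edge(submission.author, comment.author)
```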
Getting Graph Information
Basic Information
Here’s some basic information about the network we will be working with.
Graph information
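A quick sketch for reproducing these summary numbers from the graph g built earlier:

```python
# Basic size statistics for the graph.
n, m = g.number_of_nodes(), g.number_of_edges()
print(f"Nodes: {n}, Edges: {m}")
print(f"Average degree: {2 * m / n:.4f}")
```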
The graph represents a social network, so the nodes represent people, and an edge between two nodes signifies that those two individuals are connected. There are 52 nodes and 47 edges in the network, and the average degree is 1.8. The degree of a node is the number of connections it has. Since the number of edges is smaller than the number of nodes and the average degree is close to 1, the network is sparse. In fact, sparse networks are typical of social, computer, and biological networks, since most real networks are large and sparse [1].
Nodes and Edges information
To get node and edge information, we can print it out from the graph g.
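A small sketch that produces the listing below:

```python
# Print every node and edge in the graph.
for node in g.nodes():
    print("Node:", node)
for edge in g.edges():
    print("Edge:", edge)
```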
Nodes:
Node: ADROBLES2024
Node: mbizzer
Node: CyberBunnyHugger
Node: HnTPixelStudio
Node: OliverJones611
Node: Nata2211
Node: ShipAnchorMooClergy
Node: StricklyM3
Edges:
Edge: (Redditor(name='ADROBLES2024'), Redditor(name='mbizzer'))
Edge: (Redditor(name='ADROBLES2024'), Redditor(name='HnTPixelStudio'))
Edge: (Redditor(name='ADROBLES2024'), Redditor(name='OliverJones611'))
Edge: (Redditor(name='ADROBLES2024'), Redditor(name='Nata2211'))
Edge: (Redditor(name='ADROBLES2024'), Redditor(name='ShipAnchorMooClergy'))
Edge: (Redditor(name='ADROBLES2024'), Redditor(name='StricklyM3'))
We can also print out nodes and edges with or without attribute data, showing the node names, the edges, and the links among them, as well as a summary of the overall network, as follows.
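A short sketch of both; note that the one-line summary API has changed across NetworkX versions:

```python
# Nodes and edges together with any attached attribute dictionaries.
print(list(g.nodes(data=True))[:5])
print(list(g.edges(data=True))[:5])

# Overall summary: recent NetworkX versions support print(g);
# older releases used nx.info(g) instead.
print(g)
```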
Degree of each node in the graph
The degree of a node is the number of connections it has. To get the degree of each node in NetworkX, we can use g.degree(). The output shows that “ADROBLES2024” is the individual with the largest number of connections in the network.
DegreeView({Redditor(name='ADROBLES2024'): 9, Redditor(name='mbizzer'): 2, Redditor(name='CyberBunnyHugger'): 2, Redditor(name='HnTPixelStudio'): 1, Redditor(name='OliverJones611'): 1, Redditor(name='Nata2211'): 1, Redditor(name='ShipAnchorMooClergy'): 1, Redditor(name='StricklyM3'): 1, Redditor(name='Spore2012'): 1, Redditor(name='ejk8799'): 1, Redditor(name='mark1241'): 8,
... Redditor(name='routineawkward'): 2, Redditor(name='LaSage'): 1, Redditor(name='beelll'): 1, Redditor(name='BizKitten'): 1})
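To rank users by connectivity, the DegreeView can be sorted; a small sketch:

```python
# Sort (node, degree) pairs by degree to find the best-connected users.
top_degrees = sorted(g.degree(), key=lambda pair: pair[1], reverse=True)
for user, deg in top_degrees[:5]:
    print(user, deg)
```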
Density of the graph
The density of a graph is simply the ratio of actual edges to all possible edges in the network. The density of this network is approximately 0.0354. On a scale of 0 to 1, it is not a very dense network, a fact we will also see in the visualization later.
Transitivity of the graph
Transitivity is the structural network measure of closure in a graph: the fraction of all possible triangles that are actually present. Like density, transitivity expresses how interconnected a graph is as a ratio of actual to possible connections, and it is likewise scaled from 0 to 1. The network’s transitivity is about 0.1935, higher than its 0.0354 density. Because the network is sparse, there are fewer possible connections, which can result in somewhat higher transitivity, and nodes with many connections are likely to be part of closed triangles.
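Both measures are one-liners in NetworkX; a quick sketch, assuming the graph g built earlier:

```python
# For this dataset the values were roughly 0.0354 and 0.1935.
print("Density:", nx.density(g))
print("Transitivity:", nx.transitivity(g))
```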
Degree centrality
Nodes with the highest degree have the highest degree centrality. Here, ADROBLES2024 has the highest degree and can immediately reach the most neighbors.
Closeness centrality
Nodes whose shortest paths to all other nodes are, on average, the shortest have the highest closeness centrality. Again, ADROBLES2024 has the highest closeness centrality and can reach the other nodes most quickly.
Betweenness centrality
Nodes that appear most often on shortest paths have the highest betweenness centrality; many paths must flow through such nodes. The user Laogama has the highest betweenness centrality.
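A minimal sketch computing the three centrality measures above, reusing the graph g from earlier:

```python
# Each measure is a built-in NetworkX function returning {node: score}.
for name, cent in [
    ("degree", nx.degree_centrality(g)),
    ("closeness", nx.closeness_centrality(g)),
    ("betweenness", nx.betweenness_centrality(g)),
]:
    top_node = max(cent, key=cent.get)
    print(f"Highest {name} centrality: {top_node} ({cent[top_node]:.4f})")
```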
Node attributes
NetworkX allows us to add attributes to both nodes and edges, providing more information about each of them.
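A small illustrative sketch; storing each node’s degree as an attribute is a hypothetical choice, not necessarily what the original code does:

```python
# Hypothetical example: attach each node's degree as a "degree" attribute.
nx.set_node_attributes(g, dict(g.degree()), "degree")
# Edge attributes work the same way via nx.set_edge_attributes.
for node, data in list(g.nodes(data=True))[:3]:
    print(node, data)
```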
Graph Visualization with NetworkX
Now, we move on to the most interesting part of this post… VISUALIZATION! Let’s see what NetworkX can do to interpret the network and provide qualitative information about the data.
We can use the spring layout algorithm [2] to visualize the graph with node labels.
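A minimal drawing sketch; the seed value is an assumption added to make the layout reproducible between runs:

```python
import matplotlib.pyplot as plt

# Force-directed (spring) layout with node labels drawn from str(node).
pos = nx.spring_layout(g, seed=42)
nx.draw(g, pos, with_labels=True, node_size=300, font_size=8)
plt.show()
```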
There are different ways we can visualize the graph using NetworkX, for example:
- Visualization with random nodes and edges
- Visualization with nx.draw_circular
- Visualization of nodes with random colors
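For instance, the circular variant with random node colors from the list above might look like this sketch:

```python
import random
import matplotlib.pyplot as plt

# Circular layout with a random RGB color per node.
colors = [(random.random(), random.random(), random.random()) for _ in g.nodes()]
nx.draw_circular(g, node_color=colors, node_size=100)
plt.show()
```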
Parallel Betweenness Centrality
In a social network, it is crucial to determine who the most “important” individuals are, and an individual’s betweenness is one way to define “importance”. Betweenness centrality measures the number of shortest paths that pass through a vertex: the more shortest paths pass through it, the more central the vertex is to the network. Pool from the multiprocessing library and the itertools library are used for this task.
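A sketch adapted from the NetworkX examples gallery, assuming the graph’s nodes are picklable (PRAW Redditor objects may not be; relabeling nodes to strings first with nx.relabel_nodes(g, str) is one workaround):

```python
import itertools
from multiprocessing import Pool

import networkx as nx

def chunks(nodes, n):
    """Split an iterable of nodes into tuples of length at most n."""
    it = iter(nodes)
    while True:
        chunk = tuple(itertools.islice(it, n))
        if not chunk:
            return
        yield chunk

def betweenness_centrality_parallel(G, processes=None):
    """Compute betweenness from disjoint chunks of source nodes in
    parallel, then sum the partial results."""
    with Pool(processes=processes) as p:
        node_divisor = len(p._pool) * 4
        chunk_size = max(1, G.order() // node_divisor)
        node_chunks = list(chunks(G.nodes(), chunk_size))
        num_chunks = len(node_chunks)
        bt_sc = p.starmap(
            nx.betweenness_centrality_subset,
            zip(
                [G] * num_chunks,        # graph
                node_chunks,             # sources
                [list(G)] * num_chunks,  # targets
                [True] * num_chunks,     # normalized
                [None] * num_chunks,     # weight
            ),
        )
    # Reduce: sum the partial betweenness values per node.
    bt_c = bt_sc[0]
    for bt in bt_sc[1:]:
        for node in bt:
            bt_c[node] += bt[node]
    return bt_c
```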
As the figure shows, vertices at the center of a hub, or acting as a bridge between two hubs, have higher betweenness centrality. In this sparse network, since connections between users are few, there are not many bridge vertices. The hub-center vertices have high betweenness because all intra-hub paths pass through them.
Community Detection
In social networks, our friends often come from different parts of our lives, such as college, class, or shared housing. Because of group structure like this, we want to identify the different communities in the social network. Community detection algorithms can break a social network down into communities, which may overlap depending on the algorithm. The criterion is to maximize intra-community edges while minimizing inter-community edges: good communities have many edges inside them and few edges between them.
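The post does not name the exact algorithm used, so as one possibility, here is a sketch using NetworkX’s built-in greedy modularity maximization:

```python
from networkx.algorithms import community

# Greedy modularity maximization: returns a list of node sets,
# one per detected (non-overlapping) community.
communities = community.greedy_modularity_communities(g)
for i, comm in enumerate(communities):
    print(f"Community {i}: {len(comm)} members")
```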
Community Detection
As shown in the figure, the network is sparse and the communities align closely with the vertex hubs.
Export Graph
To export the graph, we can use the GraphML format.
nx.write_graphml(g, "output.reddit.graphml", prettyprint=False)
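Note that GraphML supports only primitive node types, so if the nodes are PRAW Redditor objects, one workaround is to relabel them by name first; a sketch:

```python
# Relabel each node to its string form before writing GraphML.
g_str = nx.relabel_nodes(g, str)
nx.write_graphml(g_str, "output.reddit.graphml", prettyprint=False)
```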
Discussion and Conclusion
NetworkX is used to visualize and analyze the social network in this post because it is easy to use, enables rapid development, and has the flexibility to represent networks found in many different fields. Compared with many other tools, NetworkX handles data at a scale relevant to modern problems, offers fast and highly flexible graph implementations, and supports an extensive set of readable and writable formats. As noted in [3], however, NetworkX should not be used for large-scale problems (i.e., massive networks with 100M/1B edges). For this small dataset, therefore, NetworkX is well suited.
In this post, Reddit was used to collect Covid-19-related posts, for the reasons given earlier: its open submission-and-voting model makes it a good place to hear what people think about the pandemic and to discuss it, and since a previous post collected Covid-19 data from Twitter, it was worthwhile to sample another platform and probably another audience. One issue apparent in the analysis and visualization is that the collected social data forms a sparse network, meaning interactions among the users are few.
A possible ethical issue is that under Reddit’s model, where everyone can submit content, comments, and discussion, there are few constraints or policies bringing its communities under an ethical umbrella. Because Reddit is a kind of free-speech platform, ethics and data-correctness issues can be a problem when using its data for research purposes.
You can find the full implementation here.
If you like this post, please give me a clap. Thank you for reading!
References
[1] https://en.wikipedia.org/wiki/Sparse_network
[2] https://networkx.github.io/documentation/stable/reference/generated/networkx.drawing.layout.spring_layout.html
[3] https://www.cl.cam.ac.uk/~cm542/teaching/2011/stna-pdfs/stna-lecture11.pdf
[4] https://programminghistorian.org/en/lessons/exploring-and-analyzing-network-data-with-python
[5] https://www.datacamp.com/community/tutorials/social-network-analysis-python