Social Media Analysis with Cloud Graph Database — Part 1

Introduction

Aris Budi Santoso
9 min read · Mar 10, 2024

In the past, organizations had to pay heavily for storage and computing power as their own assets. Research and innovation involving ICT required large investments at that time. Considering the cost of ICT, organizations captured, processed, and stored only the most important events. In that phase, most organizations managed their data in a relational database management system as part of their enterprise information systems. Investing more in ICT mattered and conferred competitive advantage in that era.

Today is very different: we live in a world where vast amounts of data are produced and transmitted every second, as steady as a heartbeat. This is driven by increases in computing power and internet bandwidth, which have become affordable for all organizations. Hence, the size of an ICT investment no longer creates competitive advantage on its own. ICT has become like electricity, within reach of anybody without a huge investment. The challenge is that only organizations with research and innovation capabilities will be able to turn it into competitiveness.

Big data is a simple term that has become popular today. It is associated with the 5Vs: volume, velocity, variety, veracity, and value. The first four Vs represent the characteristics of the data we face in this era, and we need methods, techniques, and tools suited to them. The last V, value, represents the final goal of big data analytics. Big data brings not only big opportunities but also big challenges for organizations in creating big value from data with the 4V characteristics mentioned above.

High velocity and volume of data are effects of Industry 4.0, which digitizes almost every aspect of our lives. Every day, people spend much of their time interacting with digital media, in ways ranging from casual fun to very serious business. Those interactions are represented in the form of text and symbols, and all of them are recorded, managed, and stored by the digital media platforms.

Seeing opportunities in those data, digital media platforms implemented data mining to create value. They also involve other parties in creating innovation from data by sharing it through an API. People who are interested in creating value from data can register on a developer portal, subscribe to the API, and obtain data for their analytics projects.

Social media is one of the digital platforms that enable people to share their thoughts and interact. These platforms also expose data APIs that application developers can use to create innovative products or analytical insight. This article demonstrates how we can mine usable information from the semi-structured data produced by social media platforms, using Social Network Analysis (SNA) on a cloud graph database.

SNA is a process for finding insight in graph data. It uses metrics, structures, methods, and techniques inherited from graph theory. If data can be formatted as a graph, we can apply SNA as our analytical method; extracting nodes and edges from unstructured data is the challenging part.

This article uses data from Twitter/X that represents interactions between users involved in discussions about the 2024 Indonesian presidential election. It aims to provide an overview of the process of transforming data into a graph and then performing network analysis using both centrality measures and pattern search. Neo4j Sandbox is used as the graph database, and the node and edge data have been prepared and stored on GitHub.

1. Prepare Neo4j Sandbox

Neo4j offers a sandbox environment that can be accessed over the internet at https://sandbox.neo4j.com/. It provides several authentication mechanisms; we can simply sign in with Google. After authenticating, you will be redirected to the main page to choose the sandbox type. The sandbox landing page is as follows.

We can create a graph database using sample data for several case studies representing particular business domains. Alternatively, if we have our own data sources that need to be analyzed as a network, we can use the sandbox as an experimental environment, although an instance only lasts for three days. We will use a blank sandbox with Graph Data Science for this tutorial.

2. Create a Blank Database on Neo4j Sandbox

The sandbox provides a blank instance with data science capabilities. Simply choose Blank Sandbox — Graph Data Science, click the blue "Create" button, and wait a moment while the sandbox is being created.

3. Access the Sandbox via Browser

The sandbox instance, or project, will be visible on the landing page. The instance can be accessed in several ways; we find "Open with Browser" the easiest way to access the Neo4j Sandbox, so we will use it for this tutorial.

After clicking Open with Browser, we are redirected to a page that looks like a workbench. The database information can be found in the sidebar; it consists of node labels, relationship types, and property keys. The main panel is similar to a notebook that lets us run commands interactively.

4. Create Graph Database

This section demonstrates how to construct a multigraph on the Neo4j Sandbox. It uses a social media dataset that has been collected and formed into node and edge lists.

a. Create user node

Each node of our graph is a Twitter user, identified by a unique user id. Properties can be attached to the node: id is the identifier and name is a property. If needed, we can attach more properties to each user node.

:auto LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/digital-budisantoso/graphdata/main/presiden24-node-list1.csv" AS nodes
CALL {
  WITH nodes
  CREATE (n:USR { id: toInteger(nodes.id), name: nodes.name })
} IN TRANSACTIONS OF 500 ROWS

After the USR nodes have been created, we add an index on the id property, which speeds up the MATCH queries that follow.

CREATE INDEX usr_id IF NOT EXISTS FOR (usr:USR) ON (usr.id)

The nodes with label USR are now ready to use; next we need to create relationships to complete our graph.
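Before moving on, we can check that the load succeeded by counting the nodes we just created (a simple sanity-check query, not part of the original walkthrough):

```cypher
// Count the USR nodes created by the LOAD CSV step
MATCH (n:USR)
RETURN count(n) AS userCount
```

If the count matches the number of rows in the node-list CSV, the import completed successfully.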

b. Create retweet relationship

Retweeting is one form of interaction between users on Twitter/X. We construct pairs of user ids to represent this interaction as relationships in our graph. The edges are directed, and we could assign a value as the weight of each relationship, but in this example we don't assign any weight. The Cypher for creating the retweet relationships from the CSV file is as follows.

:auto LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/digital-budisantoso/graphdata/main/presiden24-edge-list1.csv" AS edges
CALL {
  WITH edges
  MATCH (a:USR { id: toInteger(edges.srcid) })
  MATCH (b:USR { id: toInteger(edges.dstid) })
  CREATE (a)-[r:RETWEET]->(b)
} IN TRANSACTIONS OF 500 ROWS

The retweet graph is now complete, and we can see it in the browser by clicking RETWEET in the sidebar. That click automatically writes a graph query in Cypher and runs it in the browser. The result can be viewed as a graph, a table, or text.
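The same sample can also be requested manually; a query along the lines of what the browser generates for a relationship type looks like this:

```cypher
// Fetch a small sample of RETWEET paths for visualization
MATCH p = ()-[r:RETWEET]->()
RETURN p
LIMIT 25
```

The LIMIT keeps the visualization readable; increase it to explore a larger portion of the graph.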

The next step is to create another relationship from the same dataset.

c. Create mention relationship

Another interaction between users on Twitter is the mention. Sometimes people need to notify another user by mentioning them in a post. A user can mention many other users in one post, so a single post can contain more than one relationship.

:auto LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/digital-budisantoso/graphdata/main/menpresiden24-edge-list4.csv" AS edges
CALL {
  WITH edges
  MATCH (a:USR { id: toInteger(edges.srcid) })
  MATCH (b:USR { id: toInteger(edges.dstid) })
  CREATE (a)-[r:MENTION]->(b)
} IN TRANSACTIONS OF 500 ROWS

After the mention relationships are successfully created, MENTION will be visible in the sidebar as a relationship type. Clicking the relationship label executes Cypher that shows a sample of nodes with their relationships, as shown below.
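With both relationship types loaded, a quick aggregate query (added here as a sanity check, not part of the original steps) confirms how many edges of each type the multigraph contains:

```cypher
// Count relationships grouped by type (RETWEET vs MENTION)
MATCH ()-[r]->()
RETURN type(r) AS relType, count(r) AS total
ORDER BY total DESC
```

The totals should match the row counts of the two edge-list CSV files.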

5. Analysis

Now we have a multigraph in our Neo4j Sandbox. The graphs represent retweet and mention interactions on social media. We can analyze the graph in several ways; this part demonstrates graph analysis with the SNA method, computing basic network centrality metrics. The steps for calculating centrality metrics are as follows:

a. Make a graph projection

Calculating centrality metrics with Graph Data Science (GDS) requires the graph to be projected into the graph catalog, where it is prepared for algorithm execution. The Cypher for projecting the graph is as follows.

CALL gds.graph.project('rtGraph','USR', {RETWEET:{} })

b. Calculate degree centrality

Degree centrality is a metric that represents the importance of a user based on the number of incoming and outgoing links. Before executing the degree centrality calculation, it is better to estimate the memory required for the process.

CALL gds.degree.write.estimate('rtGraph', { writeProperty: 'degree' })
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory

After making sure that the resources are sufficient for executing the algorithm, we can safely run the degree calculation. The degree centrality can be indegree, outdegree, or both, depending on the orientation we set: NATURAL for outdegree, REVERSE for indegree, and UNDIRECTED for both.

CALL gds.degree.stream('rtGraph', { orientation: 'REVERSE' })
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score AS retweet
ORDER BY retweet DESC, name DESC
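The query above uses REVERSE orientation to count how often each user is retweeted (indegree). For outdegree — how often a user retweets others — the same call can be run with NATURAL orientation:

```cypher
// Outdegree: number of outgoing RETWEET edges per user
CALL gds.degree.stream('rtGraph', { orientation: 'NATURAL' })
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score AS retweetsMade
ORDER BY retweetsMade DESC
```

Comparing the two rankings separates influential accounts (high indegree) from highly active amplifiers (high outdegree).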

The result of the indegree calculation is as follows.

The degree centrality score can also be written as a new property on each node. Here is the Cypher to write it to the nodes.

CALL gds.degree.write('rtGraph', { writeProperty: 'degree' })
YIELD centralityDistribution, nodePropertiesWritten
RETURN centralityDistribution.min AS minimumScore, centralityDistribution.mean AS meanScore, nodePropertiesWritten
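Once the scores have been written back, they can be queried like any other node property; for example, a plain Cypher query (an illustration added here, not from the original walkthrough) can list the top users by stored degree:

```cypher
// Read back the degree scores written by gds.degree.write
MATCH (u:USR)
WHERE u.degree IS NOT NULL
RETURN u.name AS name, u.degree AS degree
ORDER BY degree DESC
LIMIT 10
```

Persisting scores this way lets later queries filter or sort by centrality without re-running the algorithm.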

c. Closeness Centrality

The closeness centrality of a node measures the average shortest-path distance between that node and all other nodes. The node with the highest closeness centrality is the one closest to the other nodes; in other words, it is the node that is easiest to reach from the others. The Cypher to calculate closeness centrality is as follows.

CALL gds.closeness.stream('rtGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).id AS id, gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC

d. Betweenness centrality

The betweenness centrality of a node measures the number of shortest paths between other nodes that pass through that node. A node that acts as a broker or intermediary can be identified by its betweenness centrality score.

CALL gds.betweenness.stream('rtGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
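When the analysis is finished, the in-memory projection can be removed from the graph catalog to free sandbox memory (a cleanup step, optional in this tutorial):

```cypher
// Remove the named projection from the GDS graph catalog
CALL gds.graph.drop('rtGraph')
YIELD graphName
```

Dropping the projection does not affect the underlying database; the USR nodes and their relationships remain intact.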

Recap

This article has demonstrated how to construct graphs from a social media dataset in the Neo4j Sandbox environment. We created two graphs in a single Neo4j instance, representing retweet and mention interactions between users around a particular topic on Twitter/X. Network centrality analysis was also conducted on those graphs to gain insight into the user network. The next part of this article will demonstrate community detection and pattern recognition on the graph data.
