How I visualised my Instagram Network and what I learned from it

Maxim Piessen
16 min readMay 13, 2019

--

I’m currently doing an advanced Master in Artificial Intelligence and Big Data Analytics at the KU Leuven (Belgium). One of my elective courses goes by the name “Analysis of large scale social networks”. In this course, we learn how to work with and how to interpret network data. A report dedicated to research in a network related field contributes to part of our final grade. I chose to visualise and investigate my Instagram network.

Section 0 — The final visualisation

To get your attention straight away, I’ll start by showing the result. In subsequent sections, I’ll explain in more detail how I got to this visualisation. If you’re not interested in the coding-part, you can skip to section 4 where we’ll dive deeper into some network theory and apply this theory to get more insights about the network.

The final interactive result can be viewed here: https://bl.ocks.org/MaximPiessen/raw/6c4637faeebbdc3fd12ecfd7b67cdd27/

If you hover over a node, you can see the username. If you click on a node, you only see nodes connected to the clicked node. If you click somewhere next to a node, the whole network shows again. Lastly, you can zoom and drag the whole network. If you’re on mobile, I strongly suggest you to grab your laptop and start playing with the graph. For those of you who resist to do so, here you have a non-interactive screenshot:

Figure 1: My instagram network

Section 1 — Getting that data

Visualising networks is very satisfying. The only problem is that you need data in order to make a visualisation. For my idea, I first want to get a list of profiles that I follow, and also follow me. Using these profiles (let’s call it my network), I want to get an overview of who follows whom. The problem with my visualisation idea is that Instagram has very few API endpoints. (An API-endpoint can be used in your program to request data from the Instagram servers). One approach to solve this problem would be to manually check each of your followers and write down who they follow. This approach could take weeks or months..

Luckily, some geniuses created Selenium Webdriver. Selenium lets you create “internet bots” that do the boring, repetitive work for you. Before I explain how I used Selenium, I suggest you first install the following (Mac-users. Windows users can surely find their equivalents using Google):

$ brew install python
$ brew install chromedriver
$ pip install selenium

I built a bot class and two scripts that use this class. The first script creates a txt-file with all the profiles that follow you and that you follow. The second script makes use of this file to check every one of these profiles and outputs a txt file with all the connections. The source code can be found here: https://github.com/MaximPiessen/instagram_network_analysis
- Note: I’m sure the code can be improved a lot. If you’re the person to do so, don’t hesitate to create a pull request on GitHub!

In your terminal, navigate to the “01 scraping” folder that contains the scripts. You can run the first script using this command:

$ python get_my_followers.py --username your_IG_username\ 
--password your_IG_password

Once the first script finishes running, you can run the second script using:

$ python get_relations.py --username your_IG_username\ 
--password your_IG_password\
--relations_file relations.txt

The second script makes use of the output of the first script and saves the relations in relations.txt. Depending on the amount of followers you have, it will take a while to get all the relations (a while like several days). Sometimes, Instagram will block your requests. In this case the program will automatically terminate. A start_profile.txt file will be created on the first run. If you rerun the get_relations.py, this file is used to start scraping from the last profile that was not fully scraped on the last run.

Some tips:

  • If you run the program, a new chrome window will appear. It’s important that a small part of the window is in view (otherwise IG blocks your requests). I always drag the window to one of the lower corners of my screen. (example) This way I can continue working on other things.
  • If the program tries to scrape someone that follows thousands of people, it will fail. In this case, you can terminate the program and manually (using a text editor) increment the number in start_profile.txt with 1. If you now rerun the program, it will skip this profile.

Eventually, the program will have scraped all the profiles and you’ll end up with a relations.txt file containing all the relations. Up to this point, this file is not that useful. In order to visualise our network, we need a .json formatted file. I wrote a script that does this conversion for you. You can run it using:

$ python relations_to_json.py --username your_IG_username\
--input_txt_file relations.txt\
--output_json_file relations.json\
--include_me False

If you set the include_me flag to “True”, your profile and all the relations containing it will be included in the json-file. In the resulting visualisation, every node will have an edge leading to you. For me, this behaviour is unwanted. I thus set this flag to “False”.

Section 2— Playing with d3.js

Equipped with our relations.json file, we are ready to create our visualisation.
When it came to the features of my ideal visualisation, I needed the following:

  1. Force directed graph (learn more about them in this excellent post)
  2. Different styles for bi-directional edges (an edge between two nodes who follow each other) and non bi-directional edges
  3. The username for a specific node to show on hover
  4. On ‘node-click’, show all connected nodes

Important note: d3.js only works in firefox when running locally.

2.1 Force directed graph

The Javascript library d3.js offers a lot of very powerful visualisation tools. There’s also a very broad and lively community that shares creations. One of those formed the basis of my creation. This gist is called “D3v4 Selectable, Draggable, Zoomable Force Directed Graph” and does exactly what’s in its name.

2.2 Different styles for different edges

As stated above, I want a different style for A → B then for A ← → B (where the arrow indicates who follows who). One possibility was to work with actual arrows in the graph, but as I have 300+ nodes and 1900+ edges, this soon becomes messy. A more neat approach is to provide the non bi-directional edges with a gradient. I chose for a red → green approach, so red follows green. The following figure illustrates the choice of colours:

Figure 2: The middle node is followed by and follows the left node (grey edge). The middle node follows the right node, but the right node does not follow the middle node.

Using this jsfiddle and this stackoverflow post I managed to get gradients on the edges. The main takeaway is (see gist below) that you define a gradient with a specific ID for every edge and store it in a <defs></defs> html element (lines 3–20). When you draw the edges, you set the stroke style to the gradient belonging to this specific edge (lines 31–33). When the nodes and edges move, the coordinates of the gradients are updated (lines 77–80). I also assign different classes for the different kind of edges (lines 34–42). For class ‘link-bi’ (bi-directional edge), using CSS, I set:

stroke: #999 !important;

2.3 The username for a specific node to show on hover

Using this gist as inspiration, adding the name on hover was quite straightforward. The styling of the tooltip can be altered in the .tooltip css class.

2.4 On ‘node-click’, show all connected nodes

Lastly, I wanted to be able to show only the nodes connected to a selected node. This enables you to more easily gather information for specific nodes in this dense network. The code I wrote to achieve this makes use of jquery DOM manipulations based on the css classes and id’s assigned to the nodes and edges. hidden is a css class containing the attribute opacity: 0.2;

Section 3— Building your own visualisation

This section is a short one. Just follow these steps:

  1. Log in / create a GitHub account
  2. Fork my gist: https://gist.github.com/MaximPiessen/6c4637faeebbdc3fd12ecfd7b67cdd27
  3. Copy and paste your network data into the relations.json section
  4. Save gist
  5. Copy the link to your gist
  6. Change gist.github.com by bl.ocks.org such that you get something like https://bl.ocks.org/MaximPiessen/6c4637faeebbdc3fd12ecfd7b67cdd27
  7. Share that link with all your friends (for full view, click on the “open” link under the visualisation).
  8. Share your links in the comments. I’m very curious about your networks!

Section 4— Network properties

In this section, we’ll put some theoretical network concepts into practice. In network theory, you can have directed or undirected graphs. For directed graphs, edges have a direction, e.g. (A → B), meaning that you can go from A to B, but not from B to A. Undirected graphs don’t have this restriction. We will make use of a directed graph, as “following” is a directed property.

We can study a network on a global or local level. From my course notes:
Global level analysis: This type of analysis focusses on the large scale structure of the network or large subgraphs.
Local level analysis: This discusses the role or position of individual
nodes or small subgraphs within the network.

Lastly, we’ll try out different clustering / community detection algorithms to see if our program can distil different groups of profiles from our network. I will then compare these results to my knowledge of the network to test the validity.

For the following subsections, you’ll need to install matplotlib, networkX and scipy.

$ pip install matplotlib
$ pip install networkX
$ pip install scipy

4.1 Global level analysis

In your terminal, navigate to the “03 analysis” folder and run:

$ python global_analysis.py --username your_IG_username\ 
--input_txt_file path_to_relations.txt\
--include_me False

You will get the following statistics about your network:

Density. As the name suggests, this metric tells you how dense the network is. The maximal number of edges for a directed network with N nodes is N(N-1), if we count A → B and B → A separately. For my network, this would be 314*313 = 98282. In most networks, the actual number of edges is far smaller than this maximum number. Density tells you exactly how much smaller the actual number is by dividing the actual number by the maximal number of edges. For our graph, we get: Density = 0.033

Degree. This property tells us how many edges are connected to a node. In our case (directed graph), we distinguish in_degree and out_degree. The former tells us how many profiles follow a specific profile, the latter tells us how many other profiles a specific profile follows. We suspect the average in_degree to equal the average out_degree as every edge connects two nodes in our graph. The average in_degree and out_degree = 10.3
The degree distribution is visualised in figure 3:

Figure 3: Degree distribution

We see that the distributions mostly overlap. This means that, on average, profiles have as many followers as they are following. Furthermore, we see that the counts follow a power law with power -0.56 & -0.55 for in and out degree. This means that there are more profiles who follow / are followed by a small number of profiles than there are profiles who follow / are followed by a large number of profiles.

The fact that the degree distribution of our network follows a power law means that we are dealing with a scale free network. So, what exactly is a scale free network and a power law? Let me use the explanation and figures from this great article, I cite:

“A crucial difference between the normal and power-law distribution is that the number of nodes with really high numbers of edges is much higher in the power-law distribution than in the normal distribution. But generally well connected nodes are more common in a normal distribution. This means that in networks you will often find a small number of very highly connected nodes. They have a number of connections that would not occur if the distribution would be normal:

Figure 4: The difference between a normal distribution and a power law distribution

A network is called scale-free if the characteristics of the network are independent of the size of the network, i.e. the number of nodes. That means that when the network grows, the underlying structure remains the same.”

Figure 5: An example of a scale free network

Average shortest path length. If we take a random profile A and a random profile B. What is the minimum amount of profiles you have to visit in order to get from profile A to profile B? This metric is called shortest path length. If we calculate this length for every possible combination of nodes (profiles), we can get the average shortest path length. For our network, this average shortest path length = 3.62 As a comparison, the average shortest path length between all Facebook users is 4.74 It makes sense that our number is lower, as we only take into account profiles of my network; my friends (even from different groups) are more connected than let’s say you and Mark Zuckerberg.

4.2 Local level analysis

Now run:

$ python local_analysis.py --username your_IG_username\ 
--input_txt_file path_to_relations.txt\
--include_me False

Before analysing my own results, I will briefly explain the different local level centrality measures we’re about to put into practice:

Betweenness centrality: As explained before, between every two nodes there exists a shortest path. If you follow such a shortest path, you traverse different nodes. The betweenness centrality of a certain node is the number of shortest paths that pass through that specific node. As this number scales with the total number of nodes in the network, betweenness centrality is often rescaled by dividing by the number of pairs for which the shortest path doesn’t contain the node in question. This result is then further normalised without loss of precision. Betweenness centrality thus identifies nodes that act as bridges in our network.

Closeness centrality: For a specific node, this measure is calculated by taking the reciprocal of the sum of all shortest path lengths between this node and every other node. In other words, it tells you which nodes you should target to reach all the other nodes most quickly.

Degree centrality: Because we are dealing with a directed graph, both in-degree centrality as well as out-degree centrality can be calculated. These centrality measures tell us the number of nodes a specific node is connected to (out-degree centrality) or how many nodes are connected to the specific node (in-degree centrality). Both measures are normalised by dividing by the maximal possible degree. Degree centrality thus tells you which nodes are able to influence the highest number of other nodes directly (one step connections).

Eigenvector centrality & PageRank: This centrality measure extends degree centrality by not only taking into account the number of direct connections (degree), but also considering the degree of the connected nodes, the degree of the connected nodes of these connected nodes and so on. E.g. I only know one person. My degree would be 1 and I wouldn’t be considered an important node in the network. Let’s say this person is Filip Leopold Lodewijk Maria, aka King of the Belgians. Through my friend Filip, I’ll be able to have great influence in the network. Eigenvector centrality takes these kinds of connections into account. A variant of eigenvector centrality is PageRank. From this excellent post, I cite:

Designed for ranking webpages, PageRank uses links between pages as a measure of importance. Each webpage is treated as a node in a network, and is assigned a score based upon its number of in-coming links (its ‘in-degree’). These links are also weighted depending on the relative score of its originating node. The result is that nodes with many in-coming links are influential, and nodes to which they are connected share some of that influence. PageRank is famously one of the ranking algorithms behind the original Google search engine (the ‘Page’ part of its name comes from creator and Google founder, Larry Page)

Results and interpretation for my network: In the following table I present the top 3 profiles per local centrality measure.

Table 1: Centrality measures for my Instagram network

Let’s try to interpret these results. juliepiessen is my sister and connects different groups of friends. This results in her having the highest score for betweenness and the fifth highest score for PageRank. chaimfes, a good friend of mine scores high in all centrality measures. I got to know him trough a student association (SINC), so he’s connected to all my other friends at SINC. Furthermore, he’s tightly connected to two other groups of friends. stereyou_box only appears at the top of the out-degree centrality measure. This is an account for a speaker that you can rent for events (created by chaimfes and some others at SINC). This account thus follows a lot of people, but is not followed back as often. maxschoepen is a friend that I’ve known for a long time and who has the highest pagerank. This can be attributed to the fact that I’ve introduced him to the most important people in my other friend groups. Some of them started following him resulting in a high PageRank. The PageRank top 5 is completed by other people that connect groups of friends + photrea_com, the instagram page for a photography platform that I created last year.

4.3 Clustering / Community detection

From my course notes I cite:

Clustering or Cluster Analysis is a general term used in analytics. It refers to a broad set of techniques that share a common goal: to create set or group (so called clusters) of items or objects which are similar or share similar properties but where the structure of the data is hidden or unknown.”

In other words: We want to identify profiles that group together in my network. As I have a lot of knowledge about my own Instagram network, I plotted my manual grouping in figure 6. The meaning of the different groups can be found in the figure’s caption. I will use this figure for a qualitative comparison with the results of the automated clustering algorithms.

Figure 6: Manual labeling of different groups. A = South-Africa, B = high school, C = coding bootcamp, D = A group of friends, E = student association, F = startup school, G = travel group, H = South African roomies, I = another group of friends.

We can distinguish 2 kinds of hierarchical clustering methods: Agglomerative and divisive. Agglomerative methods assign each node to its own cluster and iteratively merge these clusters. Divisive methods assign all nodes to the same cluster and then iteratively split the cluster into new clusters. We will investigate one method per kind. For networkx to work, we have to treat the network as undirected in this section.

Louvain method (agglomerative): This method maximises the modularity of the network. Modularity is the fraction of edges within the clusters minus the expected fraction if edges were distributed randomly. A higher modularity thus yields better clustering. The Louvain method consists out of two main steps that are repeated multiple times:

1) It starts by assigning each node to its own cluster. In a next step, we iterate over each node, assign it to a cluster of one of its neighbours and calculate the change in modularity. This calculation is done for every neighbour of the node. If all changes in modularity are negative, the node stays in its current cluster. If one ore more changes are positive, the node moves to the cluster with the highest change in modularity. This process is repeated until no more changes happen.
2) A new, coarse-grained, network is built from the clustering obtained in 1). The newly discovered clusters thus form the nodes.

By repeating step 1) and 2), a clustering on different levels is obtained. The algorithm stops when no more modularity-optimising changes happen. (For a more in-depth explanation, click here)

Girvan-Newman method (divisive): Just like the Louvain method, Girvan-Newman is an iterative method and it is based on edge betweenness. Just as with node betweenness, we can calculate edge betweenness as the number of shortest paths between two nodes that traverse a specific edge. Girvan-Newman starts by calculating the edge betweenness for every edge (1). It then removes the edge with the highest edge betweenness (2). After this step, the betweenness of all edges affected by the removal is recalculated (3). Step (2) and (3) are repeated until no more changes happen. The result is a dendrogram (like a family tree). The deeper you go in the tree, the more clusters you will have. To determine the depth of the Girvan-Newman method, I calculated the modularity for the clustering at each depth. If the newly calculated modularity was higher than the previous one, we go one step deeper. This process is repeated until the modularity stops increasing.

Results:

$ pip install python-community $ python community_detection.py --username your_IG_username\ 
--input_txt_file path_to_relations.txt\
--input_json_file path_to_relations.json\
--include_me False

The Louvain method terminated at a modularity score of 0.63, the Girvan-Newman reached 0.58. A possible reason for the difference is that the Louvain method optimises with regard to the modularity score directly. For both methods, I plotted the graph together with the red circles from figure 6.

Figure 7: The final clustering resulting from the Louvain method.

The Louvain method (figure 7) seems to identify most of the groups I had in mind. It considers E and D as one group. This could be accounted to the fact that a lot of people of my student association (E) are befriended with people from friend group (D).

Figure 8: The final clustering resulting from the Girvan-Newman method.

Although the Girvan-Newman method (figure 8) considers D, E, F and G as one group, it also captures some interesting smaller clusters that the Louvain method leaves unnoticed. One of these examples are the 4 red nodes (3 inside B, one under B): these represent my girlfriend, her brother, her mother and a guy that is followed by her brother. The Louvain method assigns these profiles to I. A second example are the dark green nodes to the right of D. This cluster represents a group of photographers. The Louvain method assigns them to cluster F.

From an objective point of view (modularity score), the Louvain method is a clear winner. However, with my knowledge of the network, I would pick the clustering of the Girvan-Newman method as most accurate.

Section 5— The end

I congratulate you for having reached the end of this introduction to network analysis by example. I hope you learned something and I’m very curious to see what your Instagram network looks like!

If you’re a coder and you’re willing to make improvements / adjustments to my code, feel free to submit a pull request here!

--

--

Maxim Piessen

CTO @ Credix —Building the future of global credit markets | DeFi — Blockchain — AI — Photography | Twitter: @PiessenMaxim