Which websites have the most influential links pointing to them?

Jasmi Kevadia
INST414: Data Science Techniques
Feb 27, 2023

Introduction:

In this post, I will analyze the structure of a web-based network and identify important nodes within the network. The network that I will analyze is a web graph that consists of edges between domains. Specifically, I will focus on the top 100 websites on the internet, as determined by Alexa Internet.

Data Collection:

The question driving this analysis is: which websites have the most influential links pointing to them? The answer could inform marketing decisions by showing which websites have the most potential to drive traffic to a specific target website. I collected the data with the Python requests and BeautifulSoup libraries. I used requests to fetch the Alexa Top Sites list, which ranks the top 500 websites on the internet (this analysis focuses on the top 100), and BeautifulSoup to parse the HTML and extract the domain name of each website.
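As a rough illustration of the collection step, here is a minimal sketch using requests and BeautifulSoup. It is not the code from the original notebook: the function names, the assumption that domains appear as ordinary anchor links, and the sample HTML are all hypothetical, and the list URL is left as a parameter.

```python
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text (url is supplied by the caller)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_domains(html):
    """Return the unique hostnames linked from the page, in order of first appearance."""
    soup = BeautifulSoup(html, "html.parser")
    domains = []
    for anchor in soup.find_all("a", href=True):
        # Relative links such as "/about" have no hostname and are skipped.
        host = urlparse(anchor["href"]).hostname
        if host and host not in domains:
            domains.append(host)
    return domains


# Hypothetical stand-in for a downloaded page, so the example runs offline.
sample_html = (
    '<ul>'
    '<li><a href="https://example.com/">Example</a></li>'
    '<li><a href="https://example.org/page">Org</a></li>'
    '<li><a href="/about">About</a></li>'
    '<li><a href="https://example.com/other">Again</a></li>'
    '</ul>'
)
print(extract_domains(sample_html))
```

In a real run, the string passed to extract_domains() would come from fetch_html() rather than from the sample above.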

Calculating Node Importance:

To build the network graph, I used the NetworkX library in Python. I started with each domain as a node and then added edges between nodes where a link existed from one domain to another. For example, if website A had a link to website B, then there would be an edge between the nodes representing A and B. To calculate the degree centrality for each node, I used the degree_centrality() function from the NetworkX library. Degree centrality is the fraction of the other nodes in the network that a node is connected to, so nodes with higher values are considered more important because they have more connections. The three nodes with the highest degree centrality were Google, Facebook, and YouTube, in that order.
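The graph-building and centrality steps can be sketched as follows. This is only an illustration under stated assumptions: the link pairs are made-up placeholder domains rather than the scraped data, the helper names are hypothetical, and the graph is modeled as directed (nx.DiGraph), which the post does not specify. Because the guiding question is about links pointing to a site, the sketch also prints in_degree_centrality() for comparison.

```python
import networkx as nx

# Hypothetical (source, target) link pairs; the real pairs would come from the scraped pages.
links = [
    ("a.example", "hub.example"),
    ("b.example", "hub.example"),
    ("c.example", "hub.example"),
    ("hub.example", "a.example"),
]


def build_graph(link_pairs):
    """Create a directed graph with one node per domain and one edge per link."""
    graph = nx.DiGraph()
    graph.add_edges_from(link_pairs)
    return graph


def top_nodes(centrality, k=3):
    """Return the k (node, score) pairs with the highest centrality."""
    return sorted(centrality.items(), key=lambda item: item[1], reverse=True)[:k]


graph = build_graph(links)

# degree_centrality counts incoming plus outgoing edges, divided by (n - 1).
degree = nx.degree_centrality(graph)
# in_degree_centrality counts only incoming edges, divided by (n - 1).
in_degree = nx.in_degree_centrality(graph)

print(top_nodes(degree))
print(top_nodes(in_degree))
```

On a directed graph, degree centrality can exceed 1 because a node's incoming and outgoing edges are both counted, which is one reason to look at in-degree separately when the interest is in inbound links.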

The main issue I encountered while collecting the data was the presence of subdomains. For example, there were separate nodes for www.google.com and google.com, even though they represent the same website. To address this, I removed all subdomains and only kept the main domain for each website. Additionally, I removed any nodes with no edges, as they did not add any value to the analysis. The following table summarizes the degree centrality of the top 10 nodes in the network.
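As for the subdomain cleanup, a minimal sketch might look like this. It assumes that keeping the last two labels of a hostname is good enough, which is a simplification: it would mishandle suffixes such as co.uk. The function names are hypothetical, and dropping links that collapse into self-links after merging is my own assumption, not something stated above.

```python
from urllib.parse import urlparse


def main_domain(url_or_host):
    """Reduce a URL or hostname to its last two labels, e.g. www.google.com -> google.com."""
    if "://" in url_or_host:
        host = urlparse(url_or_host).hostname or ""
    else:
        host = url_or_host.split("/")[0]
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])


def normalize_edges(link_pairs):
    """Map both ends of every link to its main domain and drop duplicates."""
    cleaned = set()
    for source, target in link_pairs:
        s, t = main_domain(source), main_domain(target)
        # Assumption: a link from a domain to itself carries no information here.
        if s and t and s != t:
            cleaned.add((s, t))
    return sorted(cleaned)


raw_links = [
    ("www.google.com", "youtube.com"),
    ("google.com", "m.youtube.com"),
    ("www.google.com", "google.com"),
]
print(normalize_edges(raw_links))
```

Once the cleaned edges are loaded into a NetworkX graph, nodes with no edges can be dropped with graph.remove_nodes_from(list(nx.isolates(graph))).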

Results:

One non-obvious insight that can be drawn from the analysis of the web-based network is the potential impact of the network’s design on users. By identifying the most important nodes, it may be possible to predict which pages users are most likely to visit and engage with. This information can be used to design websites that are more user-friendly and engaging. For example, by placing links to important pages in more prominent locations, website designers may be able to increase user engagement and overall traffic to the site.

Limitations:

One limitation of this analysis is that it only considers one measure of node importance, degree centrality. There may be other measures of importance that are more relevant in different contexts. Additionally, the data only includes the top 100 websites on the internet, so it may not be representative of the entire web graph. Finally, the data is limited to a snapshot in time and may not reflect changes in the web graph over time.

Conclusion:

In conclusion, analyzing the structure of a web-based network can provide valuable insights into the connections between different nodes and their importance within the network. In this analysis, I used Python libraries such as requests, BeautifulSoup, and NetworkX to collect data, build the network graph, and calculate node importance. I found that Google, Facebook, and YouTube were the most important nodes in the web graph based on their degree centrality values. This information could be useful for businesses and organizations that rely on the internet to reach their audience, and for researchers interested in the structure of the web graph and its impact on society. However, this analysis has limitations, including the focus on only one measure of node importance and a limited dataset. Further research could expand on it by exploring different measures of node importance or by using a larger dataset that includes a more representative sample of the web graph.

Github: https://github.com/jasmi01/INST414Exercises/blob/main/assignment2
