Analyzing the Structure of a Web-Based Network: Wikipedia Cities Graph

Danny Pham
INST414: Data Science Techniques
4 min read · Oct 2, 2024

Introduction

As the internet grows larger and larger, web graphs provide a rich source for data analysis. In this post, I analyze a network derived from the Wikipedia API, focusing on article titles associated with major US cities.

Question and Stakeholders

This data can be used by stakeholders such as urban planners and researchers studying urban influence, who may ask: "Which cities are the most influential within the Wikipedia article network, and how does that compare to these cities' populations?" The number of articles in the network associated with a city could indicate the city's digital footprint and how that information plays into urban influence. Digital prominence could be used to plan city branding or to leverage online data to boost a city's presence.

In this graph, vertices represent cities and articles about cities, and edges represent the connection between an article's title and the corresponding city. These edges are unweighted, and articles are unlikely to link to multiple cities. The seed vertices are the ten largest US cities by population according to Investopedia: New York, Los Angeles, Chicago, Houston, Phoenix, Philadelphia, San Antonio, San Diego, Dallas, and San Jose.

Data and Graph Structure

  • Nodes/Vertices: Cities and city-related articles.
  • Edges: Connections between articles linked by city names, forming an unweighted network.
  • Example Cities: New York, Los Angeles, Chicago, Houston, Phoenix, Philadelphia, San Antonio, San Diego, Dallas, and San Jose.
  • Structuring my graph this way allows me to identify a city's importance based on how often it is mentioned.

Data Collection and Cleaning

I collected my data through the Wikipedia API (following link redirects) using the requests library, assembled it into a graph with networkx, and visualized it with matplotlib. I was able to build my graph by identifying keywords in article titles and finding further related articles by parsing the API responses; this automatically cleans the data by including only articles relevant to the given cities. A common bug with this method is the repetition of articles when the build_network function is called recursively multiple times. I experimented with modifying the function to include multiple recursive calls, which led to highly unrelated nodes; this could be addressed with sets and string-similarity checks. Recursive calls were removed for the scope of this project, as they required 30+ minutes to compute. Below is a sample of the two functions that handled the majority of the work.
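The original code block did not carry over into this post, so the following is a minimal sketch of what those two functions might look like. build_network is named in the text above; the helper name get_city_links, the exact MediaWiki query parameters, and the keyword filter are my assumptions based on the description, not the original implementation.

```python
import requests
import networkx as nx

API_URL = "https://en.wikipedia.org/w/api.php"

def get_city_links(title):
    """Fetch titles linked from `title` via the MediaWiki API (following
    redirects), keeping only titles that mention the city name.
    Hypothetical reconstruction of the post's first function."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "links",
        "pllimit": "max",
        "redirects": 1,
        "format": "json",
    }
    data = requests.get(API_URL, params=params).json()
    links = []
    for page in data["query"]["pages"].values():
        for link in page.get("links", []):
            # keyword filter: keep only articles whose title mentions the city
            if title.lower() in link["title"].lower():
                links.append(link["title"])
    return links

def build_network(cities, fetch_links=get_city_links):
    """Build an unweighted graph connecting each city to its related articles.
    No recursion: one level of links per city, matching the project's scope."""
    G = nx.Graph()
    for city in cities:
        G.add_node(city)
        for article in fetch_links(city):
            G.add_edge(city, article)
    return G
```

Passing the link-fetching function as a parameter also makes build_network easy to test with canned data instead of live API calls.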

Using degree centrality, I determined that the most influential cities were Chicago, Dallas, and Houston. This contrasts with the three largest cities by population: New York, Los Angeles, and Chicago. Degree centrality measures the number of direct edges a city has to Wikipedia articles, indicating that these cities are well connected on Wikipedia. This discrepancy between degree centrality and population size highlights the distinction between physical and digital prominence, and could indicate a younger population or a larger share of content creators in these cities compared to other major cities.
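As an illustration, degree centrality can be computed directly with networkx. The edge list below is a toy stand-in, not the actual collected data:

```python
import networkx as nx

# Toy edge list standing in for the real city-article graph
edges = [
    ("Chicago", "Chicago Loop"), ("Chicago", "Chicago River"),
    ("Chicago", "Chicago Cubs"), ("Dallas", "Dallas Cowboys"),
    ("Dallas", "Downtown Dallas"), ("New York", "New York Harbor"),
]
G = nx.Graph()
G.add_edges_from(edges)

# degree_centrality(v) = degree(v) / (n - 1), where n is the node count
centrality = nx.degree_centrality(G)
ranked = sorted(centrality, key=centrality.get, reverse=True)
# In this toy graph, Chicago ranks first because it has the most edges
```

On the full graph, the same ranking step surfaces the cities with the most associated articles.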

Graph

  • It’s difficult to visualize a graph with so many edges, but it is clear that particular nodes have a higher degree than others.
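The visualization step described above can be sketched as follows. The stand-in graph, layout seed, and styling parameters are illustrative choices, not the originals:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the figure saves to a file
import matplotlib.pyplot as plt
import networkx as nx

# Small stand-in for the full city-article graph
G = nx.Graph()
G.add_edges_from([("Chicago", "Chicago Loop"), ("Chicago", "Chicago River"),
                  ("Dallas", "Dallas Cowboys")])

pos = nx.spring_layout(G, seed=42)  # fixed seed for a reproducible layout
nx.draw(G, pos, with_labels=True, node_size=400, font_size=8)
plt.savefig("cities_graph.png", dpi=150)
```

A spring layout tends to pull high-degree city nodes toward the center of their article clusters, which is what makes the degree differences visible despite the edge clutter.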

Limitation and Biases

My data collection process included several challenges, mainly my limited use of Wikipedia data due to computational complexity. I constructed these graphs based on keywords in article titles and whether they contained the city name. Ideally, I would recursively find all associated articles, but due to limited computing power and time I wasn't able to do so. Additionally, with this method I would eventually run into unassociated articles that have nothing to do with the city, as well as repeated articles, so relevance may become a problem when expanding my data. There could also be biases, such as cultural bias and link bias, that cause over- or underrepresentation due to lack of interest in, or overall pride in, a city.

Conclusion

In this analysis, I demonstrated how the structure of a web-based network can reveal insights into the digital influence of cities. The results highlight how digital connectivity can both mirror and differ from real-world influence and relationships.

Github Repository

You can find the code used for this analysis in my GitHub Repository here.
