Software Superheroes: Sourcing Startup Developers from GitHub Networks

Wasif Pervez
INST414: Data Science Techniques
May 1, 2024
Photo by Mohammad Rahmani on Unsplash

The world of open-source software (OSS) is immense. From browsers like Chromium to data engineering tools like Airbyte, from computer graphics software like Blender to entire operating systems like Linux, the world is becoming increasingly dependent on code that is written and maintained to a high standard by software engineers all around the world. Unfortunately, open-source developers are all too often undercompensated for their work. Most projects don’t have the resources to pay their developers and engineers; the exceptions are projects whose development is sustained by funding from large companies.

A related problem has emerged on the hiring side. As a shrinking job market intensifies competition for open positions, especially in oversaturated tech fields like software development and engineering, recruiters find it harder and harder to sift through a constantly growing number of applicants. The wide-scale adoption of remote work during the pandemic introduced more people to the advantages of tech job lifestyles, and recent layoffs at large companies across multiple industries have made the playing field even less level for the average job hunter, further increasing the load on talent acquisition specialists.

Flipping back to developers: many folks who want to stand out from the myriad programmers who also hold bachelor’s degrees or bootcamp certifications have taken up contributing to open-source software as a means of bolstering their resumes. While this trend certainly has the potential to grow the open-source software ecosystem, it is also undermined by those who make only minor contributions (like small edits to a GitHub repository’s README file) for the sake of stat padding, which cheapens the culture as a whole. These individuals miss out on the valuable skills, like version control and SOLID software development principles, that such an experience should provide.

This analysis stands in response to these converging issues and the important question that they raise: can GitHub social networks be leveraged to unearth employable software engineering talent? Establishing an open-source to enterprise software pipeline could potentially both incentivize volunteer developers to make more meaningful contributions and facilitate the hiring of experienced, professional programmers.

Introducing the Dataset

The ideal data for this task would include detailed information on both developers (what languages they have experience working in, how long they have been programming, how they learned to program, how many contributions they have made to open-source software projects, etc.) and GitHub repositories (how many stars and forks they have, what languages they contain, how many unresolved issues they have, how many pull requests they receive regularly, etc.).

The data actually used for this analysis is much simpler: one list of GitHub usernames and one list of connections between GitHub users. It was obtained from Kaggle as a number of CSV files published by Kaggle user Ben Rozemberczki, who originally collected the data by scraping GitHub’s public API in 2019.

In this dataset, entities (nodes) represent GitHub developers, or more precisely their accounts, and relationships (edges) represent mutual follower relationships on the GitHub platform. The goal of this analysis is to identify important GitHub users in the network, with emphasis on two centrality metrics as measures of influence: degree centrality (how many connections a user has) and betweenness centrality (how often a user lies on the shortest paths between other users).
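To make the two metrics concrete, here is a minimal sketch using NetworkX on a toy graph. The usernames and edges are invented for illustration and are not taken from the dataset, and the helper `top_users` is a hypothetical name, not part of the original analysis code.

```python
import networkx as nx

# Toy undirected graph: each edge stands in for a mutual-follow relationship.
# These usernames are made up for illustration only.
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"),
    ("alice", "carol"),
    ("alice", "dave"),
    ("dave", "erin"),
    ("erin", "frank"),
])

# Degree centrality: fraction of the other nodes each node is connected to.
degree = nx.degree_centrality(G)

# Betweenness centrality: normalized share of shortest paths between
# other node pairs that pass through each node.
betweenness = nx.betweenness_centrality(G)


def top_users(scores, n=3):
    """Return the n highest-scoring (user, score) pairs (hypothetical helper)."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


print(top_users(degree))
print(top_users(betweenness))
```

In this toy graph, "dave" has fewer connections than "alice" but still scores high on betweenness, because every shortest path from "erin" or "frank" to the rest of the graph runs through him.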

The final dataset used to build a graph in NetworkX was assembled by joining the list of GitHub usernames with the list of connections between GitHub users. Columns were then renamed to clarify which IDs corresponded to which usernames.
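The join described above might look something like the following sketch with pandas. The column names (`id`, `name`, `id_1`, `id_2`, `source_user`, `target_user`) and the tiny in-memory tables are assumptions made for illustration. The real CSV files may use different headers, and this is not necessarily how the original code performed the join.

```python
import pandas as pd
import networkx as nx

# Hypothetical stand-ins for the two CSV files (assumed column names).
users = pd.DataFrame({"id": [0, 1, 2], "name": ["alice", "bob", "carol"]})
edges = pd.DataFrame({"id_1": [0, 0], "id_2": [1, 2]})

# Join the username table onto each endpoint of every edge, renaming the
# columns so it is clear which ID corresponds to which username.
named = (
    edges
    .merge(users.rename(columns={"id": "id_1", "name": "source_user"}),
           on="id_1", how="left")
    .merge(users.rename(columns={"id": "id_2", "name": "target_user"}),
           on="id_2", how="left")
)

# Build an undirected graph keyed by username rather than numeric ID.
G = nx.from_pandas_edgelist(named, source="source_user", target="target_user")

print(named)
print(sorted(G.nodes()))
```

Keying the graph by username rather than numeric ID makes the centrality rankings directly readable, at the cost of assuming usernames are unique.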

Figure 1: A visualization of the entire GitHub developers dataset produced in Gephi.

Most Influential GitHub Users

Figure 2: GitHub users ranked by degree centrality (how many connections a user has) and betweenness centrality (how often a user lies on the shortest paths between other users).

According to the two chosen metrics, the most influential users are dalinhuang99, nfultz, Bunlong, and addyosmani. Notably, dalinhuang99 and nfultz score two to five times higher than the next user in the rankings. Bunlong has higher betweenness centrality than addyosmani, while the opposite holds in the ranking by degree centrality. Any of these four users would be a prime choice for software engineering teams looking to expand, although dalinhuang99 and nfultz are clearly the top choices.

Figure 3: A visualization of an ego graph centered on user dalinhuang99 produced in Gephi.
Figure 4: A visualization of an ego graph centered on user nfultz produced in Gephi.
Figure 5: A visualization of an ego graph centered on user Bunlong produced in Gephi.
Figure 6: A visualization of an ego graph centered on user addyosmani produced in Gephi.
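For readers unfamiliar with ego graphs: an ego graph is the subgraph made up of one node (the "ego"), its neighbors, and the edges among them. The sketch below extracts one with NetworkX from a toy graph. The usernames are invented, and a radius of 1 is an assumption, since the article does not state how the figures above were generated.

```python
import networkx as nx

# Toy undirected graph with made-up usernames (not from the dataset).
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"),
    ("alice", "carol"),
    ("alice", "dave"),
    ("bob", "carol"),
    ("dave", "erin"),
    ("erin", "frank"),
])

# Ego graph of "alice": alice, her direct neighbors (radius=1),
# and every edge of G that connects two of those nodes.
ego = nx.ego_graph(G, "alice", radius=1)

print(sorted(ego.nodes()))
print(sorted(tuple(sorted(edge)) for edge in ego.edges()))
```

A subgraph like this can be saved with `nx.write_gexf` and opened in Gephi for visualization.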

Limitations and Reflections

Although these preliminary results are intriguing, more data is required to better ascertain the programming ability of developers on GitHub. It is plausible that users with many social connections on a website dedicated to writing and sharing software are also strong programmers, but additional data would help contextualize these results and reduce the likelihood of mistaking a popular programmer for a talented one.

In addition, the dataset is now nearly five years old and potentially outdated. More recent data would give a more current picture of the network, especially given the recent rise in artificial intelligence and data science.

Reference

The code used to perform data transformation, cleaning, and manipulation of the GitHub social network dataset can be found at the link below.
