Using GitHub Data to Construct a Software Development Activity Map of the World
Have you ever wondered:
- How many software developers are there in each country of the world?
- Are there more software developers in China than in the US?
- What programming languages are developers using to write code?
- Is Python more popular than JavaScript in the US or in the world overall?
- What software topics are trending in each country of the world?
- Is Cybersecurity a hotter topic than Deep Learning?
- …
If you are curious about questions like these, then this post is for you!
In this post we present a map of the world based on the software development activity happening in different parts of the world. We call it MapSD.
The map is constructed from software repositories that developers around the world have published on GitHub, the most popular online platform for open-source software. GitHub has over 40 million software developers (as of January 2020) contributing to open-source software development.
To the best of our knowledge, a detailed map such as the one presented in this article is not available in the software engineering literature.
The developers using the GitHub platform have the option to include their location information in their public profile page. We, therefore, scrape the data of millions of developers on GitHub and construct a map using the location tag information associated with the accounts of the developers.
So, the key idea is that we collect the location of developers around the world from the location tag on each developer's GitHub profile page. Many thanks to GitHub for making the data public and open for analysis!
For example, my GitHub account profile page has the location tag enabled as “United States”. It looks like the following:
We scrape the data of more than 2.5 million developers to construct the map. To perform this scraping we wrote a Python script which automatically downloads HTML files from GitHub. We process these files to extract the location information along with other useful things which we will discuss later in this post.
Scraping all this data from GitHub and processing it took more than 2 months to complete!
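To give a feel for the extraction step, here is a minimal sketch that pulls a location string out of a profile page's HTML using only Python's standard library. It is not the script used for this post. The `itemprop="homeLocation"` attribute and the sample markup are assumptions about how a profile page might mark up the location tag, and the real pages may differ or change over time.

```python
from html.parser import HTMLParser


class LocationParser(HTMLParser):
    """Collect the text of the first element tagged itemprop="homeLocation".

    The attribute name is an assumption about the page markup, not a documented API.
    """

    def __init__(self):
        super().__init__()
        self._depth = 0       # > 0 while we are inside the location element
        self._chunks = []     # text fragments seen inside that element
        self.location = None  # first non-empty location found, or None

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1  # nested tag inside the location element
        elif self.location is None and dict(attrs).get("itemprop") == "homeLocation":
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace; treat an empty tag as "no location"
                text = " ".join("".join(self._chunks).split())
                self.location = text or None

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_location(html):
    """Return the location text from a profile page, or None if absent."""
    parser = LocationParser()
    parser.feed(html)
    return parser.location


# Hypothetical fragment of a downloaded profile page
sample = '<li itemprop="homeLocation"><svg></svg><span>United States</span></li>'
location = extract_location(sample)
```

Profiles without a location tag simply yield `None`, so they can be skipped when building the map.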
Disclaimer: The author of this post is not associated with GitHub and provides the statistics in this post as a third-party entity. GitHub provides statistics regarding some of the issues that we have discussed in this article. For statistics provided by GitHub we refer the reader to Octoverse. The statistics provided by GitHub, however, cannot be used to construct a detailed map of the world.
HOW ARE DEVELOPERS DISTRIBUTED ACROSS THE PLANET?
Using the location data extracted from the HTML files downloaded from GitHub, we cluster the developers together based on the country that they belong to. The distribution of software developers across the countries of the world is shown below:
The map above is constructed using 2,407,950 (around 2.4 million) developers on GitHub.
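Location tags are free text, so grouping developers by country needs some normalization first. The sketch below shows one simple way to do it, assuming a small hand-made alias table. The table, the function names, and the matching rule (try the whole string, then each comma-separated part from right to left) are illustrative assumptions, not the method used for this post.

```python
from collections import Counter

# Hypothetical alias table; a real one would need far more entries
COUNTRY_ALIASES = {
    "united states": "US", "usa": "US", "us": "US", "san francisco": "US",
    "china": "China", "beijing": "China",
    "india": "India", "bangalore": "India",
}


def to_country(location):
    """Map a free-text location to a country name, or None if unrecognized."""
    if not location:
        return None
    cleaned = location.strip().lower()
    # Whole string first, then parts from right to left ("City, State, Country")
    candidates = [cleaned] + [part.strip() for part in reversed(cleaned.split(","))]
    for candidate in candidates:
        if candidate in COUNTRY_ALIASES:
            return COUNTRY_ALIASES[candidate]
    return None


def count_by_country(locations):
    """Count developers per country, skipping locations we cannot resolve."""
    countries = (to_country(loc) for loc in locations)
    return Counter(c for c in countries if c is not None)


# Hypothetical location tags
tags = ["San Francisco, CA, USA", "Beijing", "united states", "Mars", None, "Bangalore, India"]
counts = count_by_country(tags)
```

Unresolvable tags are dropped rather than guessed.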
From the map we can observe that the US has by far the most software developers. The top 10 countries by number of developers are:
- US: 799384
- China: 176872
- India: 116026
- Canada: 95851
- Germany: 92459
- Brazil: 90037
- France: 56310
- Russia: 53872
- UK: 52465
- Ukraine: 47954
We have data for 255 countries of the world. Some key insights that we can draw from this map are as follows:
- Canada is 4th in the list of countries with the most developers. It also has the smallest population of all the countries in the list above.
- Ukraine ranks 10th by number of developers. Pretty cool for a relatively small country!
- Brazil is the only country from South America in the top 10 list.
- None of the African or Oceanian countries made the list of top 10.
- The United Arab Emirates ranks 15th by number of developers (not shown in the top 10 list above).
- The GDP of a country correlates with the number of software developers present in the country.
Yes! The GDP (Gross Domestic Product), which measures the economic performance of a country, does correlate with the number of developers present in the country.
Countries with more developers tend to perform better economically, as the correlation between GDP and the number of developers shows.
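For readers who want to check a relationship like this themselves, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The post does not say which correlation measure was used, so Pearson is an assumption, and the numbers in the usage lines are placeholders rather than real GDP or developer figures.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only, not real measurements
developers = [120, 45, 30, 8]
gdp = [21.0, 14.0, 3.0, 0.5]
r = pearson(developers, gdp)
```

A value of `r` close to 1 means the two quantities rise together. It does not by itself say which one drives the other.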
Below we show the chart comparing GDP with number of developers:
Looking at the continent-wise distribution of software developers, we see that North America has by far the most developers in the world. The developers are distributed across continents as follows:
- North America: 931312
- Asia: 625705
- Europe: 567832
- South America: 153284
- Africa: 63993
- Oceania: 62652
- Antarctica: 230
Some key insights we can draw are:
- Antarctica, a continent that has no permanent residents and only gets a few thousand visitors annually, has around 230 developers.
- North America, with only 8% of the world's population, contributes more to software development than any other continent.
- Africa, with 16% of the world's population, and Oceania, with 0.5%, have a similar number of developers.
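The comparison between developer counts and population can be made concrete with a little arithmetic. The sketch below uses the continent counts and the three population shares quoted above. The "representation" ratio (developer share divided by population share) is just one illustrative way to compare them, not a metric from the original analysis.

```python
# Developer counts per continent, as listed above
DEVELOPERS_BY_CONTINENT = {
    "North America": 931312, "Asia": 625705, "Europe": 567832,
    "South America": 153284, "Africa": 63993, "Oceania": 62652,
    "Antarctica": 230,
}

# Population shares quoted in the observations above (percent of world population)
POPULATION_SHARE = {"North America": 8.0, "Africa": 16.0, "Oceania": 0.5}


def developer_share(continent):
    """Percentage of all mapped developers located on the given continent."""
    total = sum(DEVELOPERS_BY_CONTINENT.values())
    return 100.0 * DEVELOPERS_BY_CONTINENT[continent] / total


def representation(continent):
    """Developer share divided by population share; 1.0 would be proportional."""
    return developer_share(continent) / POPULATION_SHARE[continent]


for name in POPULATION_SHARE:
    print(f"{name}: {developer_share(name):.1f}% of developers, "
          f"representation {representation(name):.2f}")
```

A ratio above 1 means a continent has more developers on the map than its population alone would suggest, and a ratio below 1 means fewer.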
WHICH COUNTRIES GENERATE THE MOST TRAFFIC ON GITHUB?
In addition to the number of software developers in the different countries of the world we also present a map of how active the software developers are in each country of the world. The data to generate this map is obtained from the GitHub archive. This map is shown below:
The map of commit-based activity of software developers around the world shows a picture similar to the map of developer counts shown earlier in the post. The top 10 countries in terms of commit traffic generated on GitHub are:
- US: 120235750
- China: 19562661
- Hungary: 16331608
- Germany: 15182239
- Canada: 11504023
- UK: 9016289
- India: 8563189
- France: 8310565
- Brazil: 7796723
- Japan: 7049143
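Dividing these commit totals by the developer counts reported in this post gives a rough commits-per-developer figure for each country. The sketch below does that for a few countries; Hungary's developer count is the one quoted in the observations in this section. Since the two numbers come from different sources (scraped profiles and the GitHub archive), treat the ratio as a back-of-the-envelope estimate only.

```python
# Developer counts reported in this post
DEVELOPERS_BY_COUNTRY = {
    "US": 799384, "China": 176872, "India": 116026,
    "Germany": 92459, "Hungary": 6761,
}

# Commit totals from the list above
COMMITS_BY_COUNTRY = {
    "US": 120235750, "China": 19562661, "India": 8563189,
    "Germany": 15182239, "Hungary": 16331608,
}


def commits_per_developer(country):
    """Rough average number of commits per mapped developer."""
    return COMMITS_BY_COUNTRY[country] / DEVELOPERS_BY_COUNTRY[country]


# Countries ordered from most to fewest commits per developer
ranked = sorted(COMMITS_BY_COUNTRY, key=commits_per_developer, reverse=True)

for country in ranked:
    print(f"{country}: {commits_per_developer(country):.0f} commits per developer")
```

Hungary stands far apart from the other countries on this measure, while India sits at the low end.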
We make the following key observations from the map of the number of commits activity on GitHub:
- The most commit traffic on GitHub is generated by the US.
- Developers from the South Asian countries (India, Pakistan, and Bangladesh) commit their code less frequently than the rest of the world.
- Hungary, with only 6761 developers active on GitHub, generated a surprisingly large amount of traffic on GitHub. This is still an unresolved mystery!
- There was a DDoS attack on GitHub in 2018 which is evident in the data.
Yes! The commit traffic data reflects the famous 2018 DDoS attack on GitHub. When we analysed the GitHub traffic data in an exploratory study, we noticed some accounts that generated a surprisingly large amount of traffic during 2018. The names of these accounts ended with the string “bot” or “[bot]”. When we opened the profile pages of these accounts, most were empty accounts with no repositories, or the account had been deleted. We, therefore, discard the accounts with more than 100,000 commits.
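The filtering rule described above is simple to express in code. Here is a minimal sketch, assuming the commit data has already been aggregated into a mapping from account name to yearly commit count. The function names and the sample accounts are hypothetical; only the 100,000 threshold and the name suffixes come from the text.

```python
MAX_COMMITS = 100_000  # threshold stated in the text


def looks_like_bot(login):
    """True if the account name ends with 'bot' or '[bot]' (case-insensitive)."""
    return login.lower().endswith(("bot", "[bot]"))


def filter_accounts(commit_counts):
    """Split a {login: commits} mapping into (kept, discarded) dicts.

    Accounts with more than MAX_COMMITS commits are discarded.
    """
    kept, discarded = {}, {}
    for login, commits in commit_counts.items():
        target = discarded if commits > MAX_COMMITS else kept
        target[login] = commits
    return kept, discarded


# Hypothetical yearly commit counts
accounts = {"alice": 412, "build-helper[bot]": 2_500_000, "syncbot": 180_000, "bob": 100_000}
kept, discarded = filter_accounts(accounts)

# Check how many of the discarded accounts also match the naming pattern
bot_named = [login for login in discarded if looks_like_bot(login)]
```

The name check is kept separate from the threshold on purpose: the threshold decides what is discarded, and the suffix test is only a way to inspect what was removed.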
We plot a histogram of all the users based on the commits that they made on GitHub during the year 2018 shown below:
The following key observations can be made from the plot above:
- Most developers on GitHub make fewer than 1000 commits annually.
- The average number of commits a developer makes annually on GitHub is 129, while the median is 52. That works out to about 2.5 commits per week for the average developer (129 / 52) and one commit per week for the median developer.
- Many accounts on GitHub are dormant accounts, i.e. accounts that generate almost no traffic.
While GitHub has many software developers on its platform, most of these developers are dormant and commit very infrequently.
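The gap between the mean (129) and the median (52) is typical of skewed data, where a few very active accounts pull the average up. The sketch below converts the reported annual figures to weekly rates and reproduces the same effect on a small made-up sample. The sample values are hypothetical and chosen only to illustrate the skew.

```python
from statistics import mean, median

WEEKS_PER_YEAR = 52


def per_week(annual_commits):
    """Convert an annual commit count to an average weekly rate."""
    return annual_commits / WEEKS_PER_YEAR


# Figures reported in the text
mean_weekly = per_week(129)   # roughly 2.5 commits per week
median_weekly = per_week(52)  # one commit per week

# Hypothetical skewed sample: a few heavy committers raise the mean
sample = [0, 3, 10, 40, 52, 60, 90, 200, 700]
sample_mean = mean(sample)
sample_median = median(sample)
```

In a distribution like this, the median is the better description of a typical developer.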
WHAT PROGRAMMING LANGUAGES ARE POPULAR AROUND THE GLOBE?
When we process the HTML profile pages of developers downloaded from GitHub, we not only extract the location information but also extract the information about the software repositories that are publicly posted by the developers on GitHub.
In particular, we extract the names of the software repositories, the short descriptions usually attached to the name, as well as the programming language used to write code in the software repositories.
We notice that there are 369 programming languages that developers have used to write their code on GitHub. The ten most popular programming languages along with the number of software repositories on GitHub that use these languages are given below:
- JavaScript: 6474958
- Python: 3275478
- Java: 2796633
- HTML: 2165206
- PHP: 1317723
- Ruby: 1217100
- C++: 1189791
- CSS: 1113338
- C#: 1009865
- C: 856036
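Counts like these, broken down by country, are what let us answer questions such as whether Python is more popular than JavaScript in a given country. Here is a minimal sketch of that grouping, assuming each repository has already been reduced to a (country, language) pair. The records and function names are hypothetical.

```python
from collections import Counter, defaultdict


def languages_by_country(repos):
    """Count repositories per language within each country.

    repos: iterable of (country, language) pairs; language may be None
    for repositories with no detected language.
    """
    counts = defaultdict(Counter)
    for country, language in repos:
        if language:
            counts[country][language] += 1
    return counts


def more_popular(counts, country, first, second):
    """True if `first` has more repositories than `second` in `country`."""
    by_language = counts[country]
    return by_language[first] > by_language[second]


# Hypothetical records
records = [
    ("US", "JavaScript"), ("US", "Python"), ("US", "JavaScript"),
    ("India", "Python"), ("India", "Python"), ("India", "Java"),
    ("US", None),
]
counts = languages_by_country(records)
top_us = counts["US"].most_common(1)
```

`Counter.most_common` gives the ranking directly, so a top 10 list per country is one call away once the counts exist.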
Some of the key insights drawn from this study are as follows:
- Go, a relatively new language, ranks 13th among the most popular languages with 707638 repositories.
- Kotlin, another fairly new language, ranks 18th with 159164 repositories.
- We still see quite a few repositories that use Lisp and its dialects.
- FORTRAN still has 14933 software repositories on GitHub.
WHICH TOPICS ARE TRENDING IN THE WORLD OF SOFTWARE DEVELOPMENT?
Remember that we extract the repository names and the short descriptions usually attached to them from the HTML files of the developers' GitHub profile pages. This is where we use them!
We apply a popular topic modeling technique called Latent Dirichlet Allocation (LDA) to extract important topics from the data of around 40 million GitHub repositories.
Topic modeling works in the following manner. We provide a large amount of text data to the algorithm. The algorithm parses the text and generates the topics that are dominant in the text corpus. We can then assign each document in the corpus its most likely topic. In this way, we can obtain the number of documents that belong to a certain topic.
The documents in our case are the repository names and the short descriptions attached to them. We used around 40 million repositories to generate the topic model. The total number of unique tokens in these repositories is 50745.
We provide the repository data as input to the LDA algorithm to generate a topic model of the software repositories. We use the multiprocessing implementation of LDA provided by the popular Gensim library. The algorithm took more than a day to finish execution on a quad-core Intel machine with 20 workers.
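Here is a rough sketch of what such a pipeline can look like: a tokenizer that turns a repository name and description into tokens, followed by training with Gensim's `LdaMulticore`. The tokenization rules, the stopword list, and the function names are assumptions made for illustration. Only the use of Gensim's multiprocessing LDA, the 20 workers, and the 200 topics mentioned later in this post come from the text.

```python
import re

TOKEN_RE = re.compile(r"[a-z][a-z0-9]*")
# Tiny illustrative stopword list; a real pipeline would use a fuller one
STOPWORDS = {"a", "an", "the", "of", "for", "and", "in", "to", "my"}


def tokenize(name, description=""):
    """Turn a repository name and description into lowercase tokens."""
    text = f"{name} {description or ''}"
    # Split camelCase so "reactNative" becomes "react Native"
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


def train_lda(token_lists, num_topics=200, workers=20):
    """Train an LDA model on tokenized documents with Gensim.

    Imported here so the tokenizer above works without Gensim installed.
    """
    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    dictionary = Dictionary(token_lists)  # maps each token to an integer id
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    model = LdaMulticore(corpus=corpus, id2word=dictionary,
                         num_topics=num_topics, workers=workers)
    return model, dictionary


# Hypothetical repository
tokens = tokenize("react-native-todo", "A simple todo app for React Native")
```

Because `LdaMulticore` starts worker processes, calls to `train_lda` are best placed under an `if __name__ == "__main__":` guard when run as a script.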
We have generated 200 topics using the LDA model. These 200 topics are divided into the following 8 categories after manual examination.
- Programming Languages: 17
- Educational Material: 12
- Technologies: 69
- Web Application Frameworks: 23
- Computer Network and Security: 20
- Signal Processing and Data Science: 31
- Miscellaneous: 20
- Can’t really tell: 7
This division is inspired by the topic categorization described in the paper by Markovtsev et al. However, we adopt a slightly different categorization than the one mentioned in the paper.
The Programming Languages category contains the topics about the different programming languages. We noticed that the LDA topic model, after processing the repository data, created 17 topics about programming languages. Some topics in this category are: Swift, JavaScript, and Kotlin. These topics don’t really tell us much, since we already know what programming languages are used by the repositories on GitHub (see the previous section).
We notice that there are several topics — 12 to be exact — which are related to educational material. These topics include: Final project, Course code, Udacity, LeetCode, and Interview Questions. These repositories are mainly written by students who are either preparing for their interviews or have posted their course project on GitHub.
The Technologies category covers the topics about technologies in general. This is the largest topic category in our model. There are 69 topics under this category. Some of the prominent topics are: GraphQL, Ansible, Arduino Robot App, MongoDB, and NPM.
The Web Application Frameworks category could be placed inside the Technologies category, but we wanted to give it a separate status of its own. This is because Web development is a world unto itself. We see 23 topics under this category. Some of them are: REST API, React Native, Frontend, HTML, and CSS.
The 5th category is Computer Network and Security, where we have topics related to computer network systems and the security aspects of software engineering. There are 20 topics in this category. Some of them are: Mobile bot, Blockchain, Cryptocurrency, Cryptography, Email, and TCP IP.
In the 6th category, called Signal Processing and Data Science, we include topics related to data science, signal processing, and machine and deep learning. There are 31 topics in this category. Some of them are: Deep Learning, RNN, NLP, Neural Network, Pytorch, and Tensorflow.
The Miscellaneous category is the one in which the topics are not related to software engineering and development. Rather, the topics are more general purpose and belong to everyday life. Some of the topics include: Personal Portfolio, Content, Font, Theme, Style Guide, Love, Career, Exercise, and Blog Post.
The 8th and final category is Can’t really tell. We include in this category those topics which we could not place in any of the above-mentioned categories. These topics are mostly in languages other than English. We don’t consider them as valid topics.
The above-mentioned 8 super categories of topics, in order of the number of GitHub repositories they cover, are as follows:
- Technologies: 1369206
- Signal Processing and Data Science: 563734
- Miscellaneous: 517936
- Web Application Frameworks: 482558
- Programming Languages: 330026
- Educational Material: 291549
- Computer Network and Security: 281693
- Can’t really tell: 163014
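Counts like these can be obtained by giving every repository its most probable topic and then mapping topics to the manually assigned categories. The sketch below shows that bookkeeping in plain Python. The per-document input is a list of (topic_id, probability) pairs, which is the shape Gensim's `get_document_topics` returns. The topic ids, probabilities, and the "Unlabeled" fallback are made up for illustration.

```python
from collections import Counter


def dominant_topic(topic_probs):
    """Return the id of the most probable topic, or None for an empty list.

    topic_probs: list of (topic_id, probability) pairs for one document.
    """
    if not topic_probs:
        return None
    return max(topic_probs, key=lambda pair: pair[1])[0]


def count_categories(doc_topics, topic_to_category):
    """Count documents per category using each document's dominant topic."""
    counts = Counter()
    for topic_probs in doc_topics:
        topic = dominant_topic(topic_probs)
        if topic is not None:
            counts[topic_to_category.get(topic, "Unlabeled")] += 1
    return counts


# Hypothetical manual mapping from topic id to category
topic_to_category = {0: "Technologies", 1: "Technologies", 2: "Educational Material"}

# Hypothetical per-document topic distributions
doc_topics = [
    [(0, 0.7), (2, 0.3)],
    [(1, 0.55), (0, 0.45)],
    [(2, 0.9)],
    [],
]
category_counts = count_categories(doc_topics, topic_to_category)
```

Documents with no topic assignment are skipped, so the category totals can add up to less than the number of input repositories.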
Notice the following important points drawn from the discussion above:
- The Technologies category leads all others on GitHub with 1369206 repositories.
- The Signal Processing and Data Science category, which includes topics on deep and machine learning, has more repositories than Computer Network and Security. This suggests that developers in deep and machine learning related domains are more active in open-source software development on GitHub than Cybersecurity folks.
- However, that does not conclusively establish that Deep Learning is a hotter area than Cybersecurity. Maybe, due to the discreet nature of software development in Cybersecurity, its developers don’t like to publish code on open-source platforms.
- There are a lot of repositories in the Miscellaneous category. That means a lot of people like to share repositories on GitHub that don’t have anything to do with software development.
As far as open-source repositories on GitHub are concerned, Deep Learning is a much hotter area than Cybersecurity.
WHAT LIES AHEAD FOR THIS PROJECT?
Recently, there have been quite a few advancements in the field of topic modeling, especially in topic modeling of short texts. It would be good to see the results obtained using these novel models.
GitHub is obviously not the only platform with active open-source software developers. It would be interesting to see if more data sources like GitHub are out there which can be explored to construct a richer map of the world.
It would be good to see a collaboration map showing how software developers from different countries of the world collaborate with each other.
The granularity of this map is country-level. It would be interesting to see how developers are distributed across cities of the world.
If you are interested in extending this project please let me know!
WHERE TO ACCESS THE DATASET AND CODE?
The dataset used to generate the map and the code used to scrape GitHub and process the scraped data are provided at the link below:
About the author:
I have several years of experience in data mining, machine learning, computer vision, and information retrieval. More recently, I have been working on Mining Software Repositories. Find more about me here: Webpage, LinkedIn, Scholar.