Open Source Words — Part 1

I set out to understand words used to describe software, and in particular, open source software. What words form the collective developer lexicon, and what were they in response to?

I turned to the most popular code repository, GitHub, to find the answer. I wrote Python code to scrape and analyze the data, and used d3-cloud to visualize it as word clouds.

For the results, see Open Source Words — Part 2


Data Collection

TL;DR

  • I wanted a list of highly rated (starred) repositories
  • I wanted the README for each repository
  • I foolishly used Scrapy instead of the GitHub API (V3)

Getting started

These were the hypothetical APIs I wanted from GitHub:

GET https://github.com/all_the_repos
GET https://github.com/all_the_readmes

Scraping

Thankfully the GitHub Search URL scheme is easy to work with. The query parameter p (short for page I presume) acts as the only variable here:

/search?o=desc&q=is:public+state:open&s=stars&type=Repositories&p=0

However, I quickly came upon an issue:

GitHub Search Pagination

As it turns out, GitHub Search shows 10 results per page, and limits searches to 100 pages of results. That’s right, there’s no page 101 or 102 (though the data exists). Nonetheless, that’s still 1000 repositories per search.
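As a rough sketch, the page URLs for a single search could be generated like this. The names `search_url`, `search_urls`, and `MAX_PAGES` are illustrative rather than taken from the project, and starting the page count at 1 is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "https://github.com/search"
MAX_PAGES = 100        # GitHub Search serves no results past this many pages
RESULTS_PER_PAGE = 10  # results shown on each search page


def search_url(page, query="is:public state:open"):
    """Build one search results URL, sorted by stars in descending order."""
    params = {"o": "desc", "q": query, "s": "stars", "type": "Repositories", "p": page}
    # Leave ":" unescaped so the query stays readable; spaces become "+".
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


def search_urls(query="is:public state:open", first_page=1):
    """Return every page URL available for a single query (assumes pages start at first_page)."""
    return [search_url(page, query) for page in range(first_page, first_page + MAX_PAGES)]
```

With 10 results on each of 100 pages, one query yields at most `MAX_PAGES * RESULTS_PER_PAGE` repositories.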

One way around the limit is to segment advanced searches by selecting a range of stars:

?q=stars:10..1000

Then you could conceivably scrape repositories in chunks of 1000.
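A minimal sketch of that segmentation, assuming the `stars:low..high` range is inclusive on both ends. The function name and the boundary values in the test are hypothetical. In practice each range would need to be narrow enough to match no more than 1000 repositories.

```python
def star_range_queries(boundaries):
    """Turn sorted star-count boundaries into non-overlapping range queries."""
    queries = []
    for low, high in zip(boundaries, boundaries[1:]):
        # Assuming inclusive ranges, stop one short of the next boundary
        # so that no repository is matched by two queries.
        queries.append(f"stars:{low}..{high - 1}")
    return queries
```

Each resulting query string can then be substituted for the `q` parameter of the search URL.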

That’s the approach I went with… until I decided that scraping almost 2000 repositories was probably enough. It was time to get some READMEs.

Downloading READMEs

There are more places than /README to put a README file on GitHub. There’s also /docs/README and /.github/README, and not all READMEs are markdown files.

GitHub file header

When you browse files in a repository, there’s a toolbar at the top. The option “Raw” lets you download the file directly. This URL scheme is very simple:

https://raw.githubusercontent.com/:user/:repo/:branch/:path/:file

So given a user and repository name, the potential readme could be found at:

https://raw.githubusercontent.com/:user/:repo/master/README.md

There were several caveats though:

  • Not all README files are named README.md
  • Not all default branches are named master (the GitHub default at the time)
  • Not all README files are in the root folder

That said, using a list of possible README file names and assuming all README files were in the root directory of the master branch was sufficient to get most files.
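The candidate URLs could be built along these lines. The list of file names is an assumption, loosely based on the extensions that show up later in this post, and `candidate_readme_urls` is an illustrative name.

```python
RAW_HOST = "https://raw.githubusercontent.com"

# Assumed list of common README file names, most likely first.
README_NAMES = [
    "README.md",
    "README.markdown",
    "README.mdown",
    "README.rst",
    "README.txt",
    "README.html",
    "README",
]


def candidate_readme_urls(user, repo, branch="master"):
    """Return raw-file URLs to try, assuming the README sits in the root folder."""
    return [f"{RAW_HOST}/{user}/{repo}/{branch}/{name}" for name in README_NAMES]
```

For example, `candidate_readme_urls("apple", "swift")` starts with the `README.md` URL on the master branch.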

To minimize the impact on githubusercontent.com I used the HTTP HEAD request method to check if the file exists (status code 200).

HEAD https://raw.githubusercontent.com/:user/:repo/:branch/README.md
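Here is a sketch of such a check using only the Python standard library. This is not necessarily how the original script did it, and `url_exists` and `first_existing` are hypothetical helper names.

```python
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request for url answers with status 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError (including HTTP errors such as 404) and socket
        # timeouts are all subclasses of OSError.
        return False


def first_existing(urls, exists=url_exists):
    """Return the first URL for which exists(url) is true, or None."""
    for url in urls:
        if exists(url):
            return url
    return None
```

Because a HEAD response carries no body, only the headers cross the wire until a README is actually found and downloaded.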

After running this script, I ended up with 1919 READMEs out of 1949 repos, not bad. Next step: data processing.

Data Processing

TL;DR

  • I needed to simplify READMEs into plain text files
  • I needed a uniform format without punctuation
  • This process was complicated but successful: MD > HTML > TXT

From nearly 2000 READMEs I ended up with this file extension distribution:

{
'': 9,
'md': 1803,
'markdown': 30,
'mdown': 4,
'html': 1,
'rst': 53,
'txt': 15
}

So the vast majority were markdown (mostly .md with some .markdown and .mdown). I also ended up with a few .rst files, a format I’d never heard of. It stands for reStructuredText and is common among Python libraries.
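A distribution like the one above can be tallied in a few lines. This is a sketch, with `extension_counts` as an illustrative name; it treats a file without an extension as the empty string.

```python
from collections import Counter
from pathlib import PurePosixPath


def extension_counts(filenames):
    """Count files by lowercase extension ('' when a file has none)."""
    return Counter(
        PurePosixPath(name).suffix.lstrip(".").lower() for name in filenames
    )
```

Lowercasing the suffix means `README.MD` and `README.md` land in the same bucket.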

Using libraries like docutils, markdown, and BeautifulSoup, I converted these documents to HTML, then extracted the plain text. I applied a few more filters (no punctuation, unicode characters, or multiple spaces). The results look something like this, from Apple’s Swift repository:

swift programming language architecture master package macos x86 64 ubuntu 14 04 x86 64 ubuntu 16 04 x86 64 ubuntu 16 10 x86 64 swift community hosted ci platforms os architecture build debian 9 1 raspberry pi armv7 fedora 27 x86 64 ubuntu 16 04 x86 64 ubuntu 16 04 tensorflow x86 64 ubuntu 16 04 ppc64le ubuntu 16 04 aarch64 ubuntu 16 04 android x86 64 welcome to swift swift is a high performance system programming language...
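The last two steps, HTML to text and then cleanup, can be sketched with the standard library alone. This swaps in Python's built-in `html.parser` for BeautifulSoup and assumes the Markdown or reStructuredText has already been rendered to HTML. The exact filters are a guess based on the sample output above.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.parts)


def normalize(text):
    """Lowercase, keep only ASCII letters and digits, collapse whitespace."""
    # Any run of other characters (punctuation, unicode, spaces) becomes one space.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
```

Replacing punctuation with a space rather than deleting it is what turns `x86_64` into `x86 64` and `16.04` into `16 04`.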

Not perfect, but at least it’s uniform and easy to parse. The last step to constructing frequency counts was to read these text files, split on spaces, and increment word counts in a dictionary.
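That counting step might look like the following sketch. It assumes the cleaned READMEs are stored as `.txt` files in a single directory, and `count_words` is an illustrative name.

```python
from pathlib import Path


def count_words(directory):
    """Tally how often each word appears across all .txt files in directory."""
    counts = {}
    for path in sorted(Path(directory).glob("*.txt")):
        # split() with no arguments also swallows stray newlines and repeated spaces.
        for word in path.read_text(encoding="utf-8").split():
            counts[word] = counts.get(word, 0) + 1
    return counts
```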

I chose to count both unique and total words. Finally, I used nltk (Natural Language Toolkit) to filter out stop words and tag words with their parts of speech, since the focus was on adjectives and adverbs as project descriptors.
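One way to keep both tallies is sketched below. It assumes "unique" means the number of READMEs a word appears in, which is an interpretation rather than something stated here. The stop words are passed in as a plain set, so the nltk list or any other could be supplied, and part-of-speech tagging is left out.

```python
from collections import Counter


def total_and_unique_counts(texts, stop_words=frozenset()):
    """Return (total, unique) word counters for a collection of documents.

    total counts every occurrence of a word; unique counts each word
    at most once per document (an assumed reading of "unique").
    """
    total = Counter()
    unique = Counter()
    for text in texts:
        words = [word for word in text.split() if word not in stop_words]
        total.update(words)
        unique.update(set(words))
    return total, unique
```

Counting a word once per document keeps a single README that repeats the same term many times from dominating the results.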

Conclusion

I used to think I knew all the answers. Then I thought I knew maybe a few of the answers. Now I’m not even sure I understand the questions. Nobody knows anything.
– Pete Nelson

GitHub Repositories > 100 stars, by Language

In the end, I couldn’t fully address my question with the words from READMEs alone. To look at trends in when certain buzzwords emerged, I would need time series data. Similarly, if I wanted to know which adjectives correlated with high ratings, I’d need to decouple a few features including programming language popularity. Words like React should show up more often than Cocoa, simply because there are many more JavaScript repositories than Objective-C or Swift ones. I’d also expect repositories in more popular languages to have more stars, as there is more of a developer community available to make use of them.

This project still proved interesting in many ways. I had no idea so many of the top repositories on GitHub pertained to education. Books, interview questions, technical challenges, and language learning resources show up often in the Top 2000 list. In fact, the most starred repository is educational: freeCodeCamp, with nearly 300k stars at the time of writing.


Code for this project is available on GitHub: Tombarr/open-source-words

Results and visualizations available at Open Source Words — Part 2