Open Source Words — Part 1
I set out to understand words used to describe software, and in particular, open source software. What words form the collective developer lexicon, and what were they in response to?
For the results, see Open Source Words — Part 2
- I wanted a list of highly rated (starred) repositories
- I wanted the README for each repository
- I foolishly used Scrapy instead of the GitHub API (V3)
These were the hypothetical APIs I wanted from GitHub:
Thankfully, the GitHub Search URL scheme is easy to work with. The query parameter p (short for page, I presume) acts as the only variable here:
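A minimal sketch of how such page URLs could be generated. The base URL, the `type` parameter, and the example query are assumptions for illustration; only the `p` parameter comes from the text above.

```python
from urllib.parse import urlencode

# Assumed search endpoint; the exact URL used in the post is not shown.
SEARCH_URL = "https://github.com/search"


def search_page_url(query: str, page: int) -> str:
    """Build the URL for one page of repository search results."""
    params = {"q": query, "type": "Repositories", "p": page}
    return f"{SEARCH_URL}?{urlencode(params)}"


def search_page_urls(query: str, max_pages: int):
    """Yield one URL per results page, from page 1 up to max_pages."""
    for page in range(1, max_pages + 1):
        yield search_page_url(query, page)


if __name__ == "__main__":
    # Hypothetical query: repositories with more than 1000 stars.
    for url in search_page_urls("stars:>1000", max_pages=3):
        print(url)
```

A scraper would then request each URL in turn and pull the repository names out of the returned HTML.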
However, I quickly came upon an issue:
As it turns out, GitHub Search shows 10 results per page, and limits searches to 100 pages of results. That’s right, there’s no page 101 or 102 (though the data exists). Nonetheless, that’s still 1000 repositories per search.
Another option is to segment advanced searches by selecting a range of stars:
Then you could conceivably scrape repositories in chunks of 1000.
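As a rough sketch, the segmentation could be expressed as a list of star-range queries. The boundary values below are hypothetical, and the `stars:LOW..HIGH` range qualifier is treated here as inclusive on both ends; each range would need to contain at most 1000 repositories for a chunk to be complete.

```python
def star_range_queries(boundaries):
    """Turn ascending star-count boundaries into non-overlapping search queries."""
    queries = []
    for low, high in zip(boundaries, boundaries[1:]):
        # Stop one short of the next boundary so adjacent ranges do not overlap.
        queries.append(f"stars:{low}..{high - 1}")
    # Open-ended query for everything at or above the last boundary.
    queries.append(f"stars:>={boundaries[-1]}")
    return queries


if __name__ == "__main__":
    # Hypothetical boundaries; real ones would be tuned to the result counts.
    for query in star_range_queries([5000, 10000, 20000]):
        print(query)
```

Each query string can then be paged through like any other search.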
That’s the approach I went with… until I decided that scraping almost 2000 repositories was probably enough. It was time to get some READMEs.
When you browse files in a repository, there’s a toolbar at the top. The option “Raw” lets you download the file directly. This URL scheme is very simple:
So given a user and repository name, the potential readme could be found at:
There were several caveats though:
- Not all README files share the same name and extension
- Not all master branches are named master (the GitHub default)
- Not all README files are in the root folder
That said, using a list of possible README file names and assuming all README files were in the root directory of the master branch was sufficient to get most files.
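The approach described above could be sketched roughly as follows. The raw-file host, the list of candidate names, and the helper names are assumptions for illustration, not the script used in the post.

```python
import urllib.error
import urllib.request

# Assumed raw-file host and candidate file names; the post's own list is not shown.
RAW_BASE = "https://raw.githubusercontent.com"
README_NAMES = ["README.md", "readme.md", "README.mdown", "README.rst", "README.txt", "README"]


def readme_candidates(user, repo, branch="master", names=README_NAMES):
    """List possible raw README URLs in the root folder of the given branch."""
    return [f"{RAW_BASE}/{user}/{repo}/{branch}/{name}" for name in names]


def fetch_text(url, timeout=10):
    """Download a URL as text; return None on an HTTP error such as 404."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError:
        return None


def first_readme(urls, fetch=fetch_text):
    """Return (url, text) for the first candidate that can be fetched, else None."""
    for url in urls:
        text = fetch(url)
        if text is not None:
            return url, text
    return None
```

Passing `fetch` as a parameter keeps the lookup logic testable without touching the network. Repositories whose README lives elsewhere, or whose default branch has another name, simply return `None`.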
After running this script, I ended up with 1919 READMEs out of 1949 repos, not bad. Next step: data processing.
- I needed to simplify READMEs into plain text files
- I needed a uniform format without punctuation
- This process was complicated but successful: MD > HTML > TXT
From nearly 2000 READMEs I ended up with this file extension distribution:
So the vast majority were markdown (mostly .md with some .mdown). I also ended up with a few .rst files, a format I’d never heard of. It stands for reStructuredText and is common among Python libraries.
Using libraries like BeautifulSoup, I converted these documents to HTML, then extracted the plain text. I applied a few more filters (stripping punctuation, Unicode characters, and repeated spaces). The results look something like this, from Apple’s Swift repository:
swift programming language architecture master package macos x86 64 ubuntu 14 04 x86 64 ubuntu 16 04 x86 64 ubuntu 16 10 x86 64 swift community hosted ci platforms os architecture build debian 9 1 raspberry pi armv7 fedora 27 x86 64 ubuntu 16 04 x86 64 ubuntu 16 04 tensorflow x86 64 ubuntu 16 04 ppc64le ubuntu 16 04 aarch64 ubuntu 16 04 android x86 64 welcome to swift swift is a high performance system programming language...
Not perfect, but at least it’s uniform and easy to parse. The last step in constructing frequency counts was to read these text files, split on spaces, and increment word counts in a dictionary.
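For reference, here is a minimal sketch of the HTML-to-text cleanup step. It swaps in the standard library's `html.parser` in place of BeautifulSoup so that it runs without extra dependencies. The exact filters (lowercasing, dropping non-ASCII characters, turning punctuation into spaces) are assumptions inferred from the sample output above.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes from HTML, skipping script and style contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Strip tags and return the remaining text joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def normalize(text):
    """Lowercase, drop non-ASCII characters, and replace punctuation with spaces."""
    text = text.lower()
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Replacing punctuation with a space, instead of deleting it, is what would turn a token like x86_64 into "x86 64" in the sample.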
I chose to count both unique and total words. Finally, I used nltk (Natural Language Toolkit) to filter out stop words and map words to parts of speech, since the focus was on adjectives and adverbs as project descriptors.
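A small sketch of the counting step, under two stated assumptions: "total" is taken to mean every occurrence of a word, and "unique" the number of READMEs that contain it at least once. The tiny stop-word set is a stand-in for nltk's list, and part-of-speech tagging is left out.

```python
from collections import Counter

# Tiny illustrative stop-word set; a stand-in for nltk's much longer list.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def count_words(documents, stop_words=STOP_WORDS):
    """Return (total, unique) counters over an iterable of cleaned text documents."""
    total = Counter()   # every occurrence across all documents
    unique = Counter()  # number of documents containing the word
    for text in documents:
        words = [word for word in text.split() if word not in stop_words]
        total.update(words)
        unique.update(set(words))
    return total, unique


if __name__ == "__main__":
    docs = ["swift is a fast fast language", "a fast parser"]
    total, unique = count_words(docs)
    print(total.most_common(3))
    print(unique.most_common(3))
```

Counting per-document presence separately keeps one very repetitive README from dominating the rankings.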
I used to think I knew all the answers. Then I thought I knew maybe a few of the answers. Now I’m not even sure I understand the questions. Nobody knows anything.
– Pete Nelson
In the end, I couldn’t fully address my question with the words from READMEs alone. To look at trends in when certain buzzwords emerged, I would need time series data. Similarly, if I wanted to know which adjectives correlated with high ratings, I’d need to decouple a few features including programming language popularity. Words like React should show up more often than
This project still proved interesting in many ways. I had no idea so many of the top repositories on GitHub pertained to education. Books, interview questions, technical challenges, and language learning resources show up often in the Top 2000 list. In fact, the most starred repository is educational: freeCodeCamp, with nearly 300k stars at the time of writing.