Quine
Published in

Quine

Machine Learning / Open Source / GitHub

A taxonomy of open source software

or how we’re using ML to automatically classify GitHub repos 🤖

Quine’s Open Source Topic Map
Our map of open source software; see interactive version at https://quine.sh/repoverse

Introduction

Encountering a new project in open source can be a daunting experience. A key problem is contextualization. For example: what problem is this project solving? What area does it belong to? Do these keywords help my understanding, or are they false friends? Is it an isolated repository, or are there thousands like it? Is there a popular alternative with better documentation and maintenance? The manual solution (searching, Q&A sites) to this problem doesn’t scale well when considering the millions of projects in the OSS space. A practical option for many is to find an appropriate high-level label, such as devops or security, and throw the project onto a mental heap of other loosely associated projects. This ‘problem’ of new projects extends even to specialists: keeping on top of progress in a domain can be hard work, a problem that is dramatically worsened when returning after a hiatus.

Mental heaps of engineering tools
Photography credit: Pixabay/vkingxl
Pie chart of labels given to ‘deep learning’ projects on GitHub
The labels provided by GitHub owners of projects our taxonomy deems as “deep-learning”. Where multiple relevant labels are present, we choose the most popular; “<other>” encompasses more than 450 relevant tags.

Our taxonomy

To summarize, we are looking to provide a tool that promotes accessibility of OSS: to find relevant resources and projects, to aid those wanting to contribute, to help developers understand the contributions of others, and to find other devs with whom to collaborate. This section describes some high-level properties of the taxonomy before we discuss some applications.

Properties

At a high level our taxonomy achieves:

  • a comprehensive coverage of all significant areas of OSS (we consider all labels on GitHub with more than 5 tagged repos; c. 40,000 labels)
  • a reduction in the number of unique labels by a factor of ~400 (from over 240,000 to 600)
  • unambiguous and distinct topic names which can be rolled up into higher level categories
  • automatic inference of topic labels for any open source project¹
  • a data-driven framework to add to or modify branches²

The taxonomy

The figure below shows a high-level tree view of the taxonomy, which has approximately 600 topics. The topic areas are a result of analyzing and clustering GitHub topic labels, and are typically either a specialism (e.g. osdev), the ecosystem of a single technology (e.g. react), or an application area (e.g. sport). Certain areas are excluded from the scope — for example information about programming language or runtime which is freely available elsewhere.⁴ Choosing the number of topics is a trade-off between precise topic specification (more topics) and user experience (less topics). Similar topics are recursively grouped via semantic embeddings and positioned within a hierarchical taxonomy; we also applied some fine tuning.

Our taxonomy tree (high level)
Our taxonomy; take a closer look using this visualization.
Coverage of GitHub topic labels vs our inferred topics: number of labels per repository.

Examples of use

There are a wide variety of uses of this taxonomy: we provide four here, but we’re exploring a few more within Quine.

Information retrieval for projects

An immediate application of this taxonomy is for use in thematic browsing of open source projects. While some organizations are currently looking at recommendation for OSS (including ourselves), we believe targeted browsing can be more useful. To demonstrate what our index allows, let’s look at the example of molecular-biology. It’s hard to find a reliable list of tools in this space at present, and if one were to search relevant tags on GitHub, one would currently find perhaps only 35 / 100 of our top listed tools. For our top 20 ranked projects, shown below, any project with a “❌” in the “labelled” column would not be retrieved (at the time of writing) when browsing projects under GitHub’s folksonomy for relevant tags. For example, AlphaFold has no labels at all on GitHub (at the time of writing) and so could not be found under a relevant topic search there.

Top 20 ‘molecular biology’ projects as scored by our algorithm
Molecular-biology: top 20 projects according to our automated taxonomy inference
Screengrab of Quine’s Explore feature
Browsing GitHub projects at https://quine.sh/d/find-tools

Browsing / finding developers

Another information retrieval application is in browsing top developers for a topic. This might be useful for finding developers to follow, or to work with. This is not facilitated by GitHub, and would not be advisable anyway, since most work that a developer does is not tagged, or not tagged in a standardized way. Using the repos scored against our taxonomy, we can create a score for a (developer, topic) tuple, for example, as a function of the number of contributions to relevant projects, the topic score we assign, and the popularity of that project (via GitHub stars). See examples below for testing and circuit-design. These results consider contributions to any repository with ≥ 10 stars; in the table ‘#commits’ is the total number of commits made; ‘#repos’ is the number of distinct projects.

Top 12 developers for ‘testing’ as scored by our algorithm
Testing: top 12 developers according to our automated taxonomy inference
Top 12 devs for ‘circuit-design’ as scored by our algorithm
Circuit-design: top 12 developers according to our automated taxonomy inference.

Visual profiles: developers

Developers can be described by their skills (as demonstrated through contributions) but also their interests and learning aspirations. Our taxonomy allows us to combine and locate these aspects on a standardized 2D topic map, creating a visual summary. Examples are given below for some well known OSS developers; here we focus just on open source contributions.

Developer profiles: Linus Torvalds (Linux); Kelsey Hightower (Kubernetes); Jeffrey Wilcke (Ethereum)
Developer profiles: Sophie Alpert (React); Soumith Chintala (PyTorch); Mike Bostock (D3)
Developer profiles: Max Howell (Homebrew); André Staltz (RxJS); Paul Sokolovsky (MicroPython)

Visual profiles: programming languages

Different domains tend to favour different programming languages. These are largely known within the community (such as frontend — Javascript; embedded — C/C++; iOS — Swift; data science — Python), but we can use our topic map to provide a more comprehensive visual overview.

Language profiles: JavaScript; Python; Julia
Language profiles: C/C++; Rust; Go
Language profiles: Haskell; Scala; C#

Methods

This section elaborates on how we derived the above assets, and is more technical in nature than the discussion so far. At a high-level this consisted of a data-informed, machine-learning-assisted manual effort in annotating and categorizing the space. Creating the taxonomy is a sensitive activity, and can be viewed as a kind of ‘meta-labelling’: the presence of a label within a topic can result in the automated tagging of many thousands of additional OSS projects downstream. Here we provide more details about that journey: our data, approach, and reasoning for the task of constructing the taxonomy; as well as an overview of how we created the topic map.

GitHub topic label counts — scatter plot
GitHub labels have a long tail, approximately following a Zipf law

Automatic label organization

Our first question was whether the number of topics can be substantially reduced by standard lexicographical analysis. E.g. tokenization, spellchecking etc. We chose to answer this question with a ‘no’. For instance wsus (Windows Server Update Services) is very close to asus in edit distance, but very different in meaning. There are many compound tokens (e.g. machine-learning, haskell-learning) that may be broken down to consistuent parts, but we were unable to reliably tell when this was permissible. As another example, stemming proved problematic (e.g. for Snowball stemming, we have visualizevisual, or worse, iosio).

Definition of pointwise mutual information
Visualization of weighted PMI graph for subset of labels
A small subset of the weighted graph of candidate labels visualized via d3-force: edge weight differences can be inferred from their width.
Two examples of automatic taxonomy construction
Examples of automatic topic extraction and organization. (Interior nodes are named automatically as the medoid labels of its leaves, and may not be optimal.) LHS - suboptimal: Clusters contain lots of related concepts, but are not entirely coherent. RHS — reasonable: largely coherent and well-positioned clusters.

Manual review

With the automated organization done, it is now onto the most demanding stage of manually verifying the location of 40,000 labels: a large endeavour. In order to aid this, we used a metric to score the ambiguity of a label based on its mutual information with our clustered topic labels. (This was normalized by the label entropy in order to remove differences of scale.) Very infrequent labels which comprised a negligible proportion of clusters were also dropped to reduce the size of the problem. For the verification, we sense-checked results with subject matter experts within our organization where possible, but the work was mostly performed via large numbers of queries to a search engine.

Taxonomy map

We conclude this section with a brief overview of how we created the topic map in the previous section. The goal was to embed each topic in our taxonomy into a circle via t-SNE (Van der Maaten & Hinton, 2008) while avoiding graph crossings. More explicitly, we use a t-SNE objective for the embedding of the topics, respecting the following constraints :

  • Each embedding lies within the unit circle
  • Each embedding is as far as possible from its closest neighbor
  • Each embedding is ‘close’ to its parent node in the taxonomy
  • The resulting tree graph embedding of the taxonomy has no edges which cross
Visualization of final topic embedding at convergence
Optimized embedding of the topic nodes of the taxonomy, with tree structure overlaid.

Final thoughts

To the best of our knowledge, this work provides the first systematic organization of open source software. This is a significant milestone, and unlocks great potential for the visibility of many millions of open source projects. This can underpin many important future services, such as feeds, subscriptions, search, finding collaborators, stimulating contributions, and organizing / collecting achievements of developers. The taxonomy may further prove useful as a mental map of the OSS ecosystem.

Endmatter

Footnotes

  1. Our topic inference is currently only available for GitHub, but the approach extends directly to other platforms such as GitLab or Bitbucket.
  2. Details of our method for adding new topics into the taxonomy are omitted for brevity, but the method we use for constructing the taxonomy can be extended without much difficulty into an online approach.
  3. English Language projects on GitHub with ≥ 10 stars as of January 2022, c. 1.23m projects.
  4. See list of exclusions in the section below.
  5. Robust taxonomy creation may well be an AI-complete problem, and perhaps even then remains somewhat “subjective” — there are many possible self-consistent taxonomies. As such, our taxonomy cannot help be opinionated, even after all flagrant errors are fixed, and some devs may prefer to organize or catgorize things differently. In this respect we are grateful that the developer community is known to be fairly mild in its opinions and forgiving of those with whom they disagree.
  6. The data we used to construct the taxonomy was from an earlier datapull in 2021, hence the difference in size to footnote 3. README files were scored for language using lid.176 from fastText.
  7. We considered a number of ideas to extract clusters of topics from this graph. Perhaps most obviously: spectral clustering. However this suffered from extracting “giant component”-type clusters from the graph. This is probably warranted — there is often a continuity of concepts rather than clear and defined boundaries — but this is not useful for our purposes. Our next idea was to find a set of prototypes from the label set which maximized a mutual information criterion with the label population. This is somewhat similar to a p-hub location problem. While this performed well for popular topics, it failed to pick up smaller topic clusters for which the labels had low support. The approach that worked best for us was the simplest: factor the PMI matrix (we used 250 components) and cluster the latent factors via weighted k-means.

Exclusions

Many labels have been omitted explicitly from topic clusters when constructing the taxonomy. Reasons include:

  • a relatively small support within the GitHub folksonomy, and insufficiently many similar terms to cluster with (adversarial-attacks, aeronautical-engineering, affective-computing, ajax, ambisonics, artistic-animation, bloatware, bloom-filters, broadcasting, busybox, civil-engineering, clipboard-manager, cncf, codepush, cookies, coreutils, deep-linking, differentiable-programming, digital-humanities, digital-preservation, economics, edge-computing, emotion-detection, error-correcting-code, expert-systems, feature-flags, fog-computing, game-ai, gender-detection, hdmi, hooking, human-computer-interaction, introspection, issue-tracking, keyboard-emulation, kiosk, literate-programming, mechanical-engineering, multi-tenancy, mvi, pastebin, phonology, product-design, psychology, random-number-generation, reflection, restart, resume, sd-card, session-management, shutdown, sketching, sleep, spatial-audio, stack-overflow, stdio, libc, steering-behaviors, systems-programming, transactions, tunnel, usenet, uuid, web-portals, web-workers, wsl)
  • intrinsic ambiguity or lack of a concrete referent (e.g. automation, cache, community, data-analytics, devops, devsecops, low-level, management, persistence, pipelines, privacy, remote-control, secops, simulation, sre, transaction)
  • a practical ambiguity or overloaded term (e.g. crud, decoding, delay, encoding, map-reduce, photography, socket, timing, transaction)
  • ‘real-world’ or complete projects, especially personal web-pages or open-source organizational sites (e.g. personal-website, real-world, resume)

References

  • Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
  • Van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research.
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems.
  • Wang, C., He, X., & Zhou, A. (2017). A short survey on taxonomy learning from text corpora: Issues, resources and recent advances. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

Additional Writing Credits

Dimitrios Athanasakis, Rodrigo Mendoza-Smith. Thanks also to David Butler for proof-reading.

--

--

Quantifying dev experience from code and OSS data ✨

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store