Maintaining a healthy open source project can entail a huge amount of toil. Popular projects often have orders of magnitude more users and episodic contributors opening issues and PRs than core maintainers capable of handling these issues.
Consider this graphic prepared by the NumFOCUS foundation showing the number of maintainers for three widely used scientific computing projects:
We can see that across these three projects, there is a very low ratio of maintainers to users. Fixing this problem is not an easy task and likely requires innovative solutions to address the economics as well as tools.
Due to its recent momentum and popularity, Kubeflow suffers a similar fate, as illustrated by the growth in newly opened issues:
Coincidentally, while building out end-to-end machine learning examples for Kubeflow, we built two examples using publicly available GitHub data: GitHub Issue Summarization and Code Search. While these tutorials were useful for demonstrating components of Kubeflow, we realized that we could take this a step further and build concrete data products that reduce toil for maintainers.
This is why we started the project kubeflow/code-intelligence, with the goal of increasing project velocity and health using data-driven tools. Below are two projects we are currently experimenting with:
- Issue Label Bot: This is a bot that automatically labels GitHub issues using machine learning. This bot is a GitHub App that was originally built for Kubeflow but is now also used by several large open-source projects. The current version of this bot applies only a very limited set of labels; however, we are currently A/B testing new models that allow personalized labels. Here is a blog post discussing this project in more detail.
- Issue Triage GitHub Action: To complement the Issue Label Bot, we created a GitHub Action that automatically adds issues to, and removes them from, the Kubeflow project board that tracks issues needing triage.
Together these projects allow us to reduce the toil of triaging issues. The GitHub Action makes it much easier for the Kubeflow maintainers to track issues needing triage. With Issue Label Bot, we have taken the first steps in using ML to replace human intervention. We plan on using features extracted by ML to automate more steps in the triage process to further reduce toil.
Building Solutions with GitHub Actions
One of the premises of Kubeflow is that a barrier to building data-driven, ML-powered solutions is getting models into production and integrated into a solution. In the case of building models to improve OSS project health, that often means integrating with GitHub where the project is hosted.
We are really excited by GitHub’s newly released feature GitHub Actions because we think it will make integrating ML with GitHub much easier.
For simple scripts, like the issue triage script, GitHub Actions makes it easy to run the script automatically in response to GitHub events without having to build and host a GitHub App.
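To make this concrete, here is a minimal sketch of how such a script might read the triggering event inside a GitHub Actions run. It is not the actual Kubeflow triage script. GitHub Actions exposes the event payload as a JSON file whose path is in the GITHUB_EVENT_PATH environment variable; the rule for what counts as "needing triage" (the label prefixes below) and the function names are hypothetical, chosen only for illustration.

```python
import json
import os

# Hypothetical rule (an assumption for this sketch): an open issue needs
# triage until it has at least one label from each of these prefixes.
REQUIRED_LABEL_PREFIXES = ("kind/", "priority/", "area/")


def needs_triage(issue):
    """Return True if an open issue is missing any required label group."""
    if issue.get("state") == "closed":
        return False
    names = [label["name"] for label in issue.get("labels", [])]
    return not all(
        any(name.startswith(prefix) for name in names)
        for prefix in REQUIRED_LABEL_PREFIXES
    )


def load_issue_from_event(event_path):
    """Read the issue object from the event payload written by GitHub Actions."""
    with open(event_path, encoding="utf-8") as f:
        event = json.load(f)
    return event["issue"]


if __name__ == "__main__":
    # GITHUB_EVENT_PATH is set by GitHub Actions; outside a workflow run
    # it is absent and this sketch does nothing.
    event_path = os.environ.get("GITHUB_EVENT_PATH")
    if event_path:
        issue = load_issue_from_event(event_path)
        action = "add to" if needs_triage(issue) else "remove from"
        print(f"Issue #{issue['number']}: {action} the triage board")
```

The sketch only decides what should happen to the issue. Actually moving the card on the project board would be a separate call to the GitHub API, which is omitted here.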
To automate adding issues that need triage to a Kanban board and removing them again, we wrote a simple Python script that uses GitHub’s GraphQL API to modify issues. Using GitHub Actions we can automate executing this script in response to issue events by including the below YAML file in your repo’s
As we continue to iterate on ML models to further reduce toil, GitHub Actions will make it easy to leverage Kubeflow to put our models into production faster. A number of prebuilt GitHub Actions make it easy to create Kubernetes resources in response to GitHub events. For example, we have created GitHub Actions to launch Argo Workflows. This means that once we have a Kubernetes job or workflow to perform inference, we can easily integrate the model with GitHub and have the full power of Kubeflow and Kubernetes (e.g., GPUs). We expect this will allow us to iterate much faster compared to building and maintaining GitHub Apps.
Call To Action
We have a lot more work to do in order to achieve our goal of reducing the amount of toil involved in maintaining OSS projects. If you are interested in helping out, here are a couple of issues to get started:
- Help us create reports that pull and visualize key performance indicators (KPIs): https://github.com/kubeflow/code-intelligence/issues/71. We have defined our KPIs in issue #19.
- Ensemble repo-specific and non-repo-specific label predictions: https://github.com/kubeflow/code-intelligence/issues/70
In addition to the aforementioned issues, we welcome contributions to these other issues in our repo.