Analyzing and Improving the Open-Source Health of your Repository

Royal Jain
DeepAffects
5 min read · Feb 25, 2018

Traditionally, software companies relied on their code for competitive advantage, so source code was kept under tight secrecy. We have observed a radical shift in philosophy in recent years: many companies and organisations now open source their codebases. Facebook’s React, Microsoft’s Visual Studio Code, and Google’s TensorFlow library are some of the most popular open source repositories. In this blog, we analyse the benefits of open sourcing, provide metrics to measure its success, and suggest ways to improve it.

Why open-source?

Intuitively, sharing the things that earn you money seems bizarre. However, many companies today rely on their data and the scale of their supporting infrastructure for competitive advantage, rather than on the source code itself. Thus, where sharing the code doesn’t create competition or security issues, the pros of open sourcing outweigh the cons, as it provides a lot of compelling advantages for a business. One of them is security.

Given enough eyeballs, all bugs are shallow. (Linus’s Law, formulated by Eric S. Raymond and named after Linus Torvalds)

With enough testers and developers, even complex vulnerabilities can be detected quickly. Bug fixes also tend to arrive faster in open source repositories than in proprietary software. Knowing that their code will be read and discussed by hundreds of developers pushes developers to put more effort into maintaining and improving code quality. The article “10 reasons for open-sourcing” lists further reasons why open sourcing is a good idea, even for large companies.

Life cycle and role of contributors in a repository

In this section, we analyse the role of contributors at various stages in the lifecycle of a repository. From the figure, we observe that most issue reports are created by contributors. This means that a lot of bugs and defects, especially once the repository reaches some maturity, are detected by open source developers, which saves a lot of time and resources in testing and QA. It also means that many new features are requested by the users of the repository rather than by its developers, which helps bridge the developer-user gap.

We now focus on the amount of code contributed by the open source community. We take the number of commits as the measure of contribution. Admittedly, commit counts are not very accurate if the goal is to measure the exact percentage of contribution. However, they work well for comparing different repositories, and for comparing the same repository at different times. We observe that, barring some spikes, most repositories follow this pattern:
1. In the nascent stage, contributor commits are very low.
2. As the project grows in popularity, contributor commits increase rapidly.
3. As the project matures, contributor commits stabilize.

Moreover, contributions from open source members form a majority of commits in the later stages of the product, which reduces support costs once the original team members have moved on to newer things.
We obtained commit, issue, and employee information using the GitHub API.
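As a rough illustration of this measurement, the sketch below computes the share of commits made by non-employees for each month. It is not the exact pipeline we used. It assumes the public GitHub REST endpoint for listing commits, and it assumes that employees are available as a set of GitHub logins (for example, the members of the owning organisation). The function names are hypothetical.

```python
import json
import urllib.request
from collections import defaultdict


def fetch_commits(owner, repo, page=1):
    """Fetch one page of commits from the GitHub REST API (unauthenticated).

    Unauthenticated requests are heavily rate limited, so a real crawl
    would need a token and pagination over all pages.
    """
    url = (f"https://api.github.com/repos/{owner}/{repo}/commits"
           f"?per_page=100&page={page}")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def simplify_commit(item):
    """Reduce one API commit object to the two fields used below."""
    # "author" is null when the commit e-mail is not linked to a GitHub
    # account, so fall back to the name stored in the git commit itself.
    login = (item.get("author") or {}).get("login")
    login = login or item["commit"]["author"]["name"]
    return {"login": login, "date": item["commit"]["author"]["date"]}


def contributor_share_by_month(commits, employee_logins):
    """Map "YYYY-MM" to the fraction of commits made by non-employees.

    `commits` is an iterable of {"login", "date"} dicts with ISO 8601
    dates; `employee_logins` is an assumed, externally supplied set.
    """
    totals = defaultdict(int)
    external = defaultdict(int)
    for c in commits:
        month = c["date"][:7]  # "2017-03-14T09:00:00Z" -> "2017-03"
        totals[month] += 1
        if c["login"] not in employee_logins:
            external[month] += 1
    return {m: external[m] / totals[m] for m in sorted(totals)}
```

Plotting the returned fractions over time for a repository should make the three stages described above visible, provided the employee list is reasonably complete.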

Lifecycle of some open source repositories

Reasons for contributor attrition?

We list the number of developers whose commit count exceeds a given threshold in various repositories (see table below). We see that a huge fraction of the people who start contributing don’t continue for long. Why is that?

Attrition rate in various repositories
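A count like the one in the table can be produced with a few lines of code. The sketch below is only an illustration: the thresholds (1, 10, 50) are hypothetical, and it counts developers with at least that many commits, which may differ slightly from the cut-offs used for the table.

```python
from collections import Counter


def developers_above_thresholds(commit_logins, thresholds=(1, 10, 50)):
    """Count developers who have at least `t` commits, for each threshold t.

    `commit_logins` holds one author login per commit. The default
    thresholds are illustrative assumptions, not values from the table.
    """
    per_developer = Counter(commit_logins)
    return {
        t: sum(1 for n in per_developer.values() if n >= t)
        for t in thresholds
    }


def retained_fraction(counts, base, threshold):
    """Fraction of developers at `base` commits who also reached `threshold`."""
    return counts[threshold] / counts[base] if counts[base] else 0.0
```

For a repository where the count drops sharply between the first and second threshold, `retained_fraction(counts, 1, 10)` gives a single number that can be compared across repositories.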

Every person is different, with different motivations and skill levels, so it is almost impossible to accurately predict all the reasons that lead to developer attrition. However, a few reasons are more responsible than others, such as:

  1. Entry barrier: The process to start contributing is often either unclear or cumbersome.
  2. Slow and unresponsive community/owners: Developers who have to wait a long time to get their queries answered are less likely to contribute further. Research from Mozilla suggests that maintainer responsiveness is a critical factor in encouraging repeat contributions.
  3. Rejected work: One thing a developer hates more than anything else is watching their work go to waste. Developers who raise a PR that doesn’t get merged into the master branch often lose interest. As a concrete example, we look at the Elasticsearch repository: out of 523 contributors whose PR was closed without merging, 410 (about 78%) didn’t submit code to that repository again.
  4. Absence of active engagement: Some contributors put significant effort into a repository but, after a prolonged absence of engagement from the repository owners, move on to new projects. These people are the most important to retain, as they are experienced with the code base and procedures and are likely to be more productive. They also possess unique knowledge and perspective about the repository, which might be very useful.
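The rejected-work figure in the list above can be reproduced for any repository from its pull request history. The sketch below is one possible way to do it, not necessarily how the Elasticsearch numbers were computed. It assumes each PR has already been reduced to a dict with the hypothetical keys `author`, `created_at` (ISO 8601, UTC), `state`, and `merged`, and it treats a contributor as lost when their first rejected PR is also their last PR.

```python
from collections import defaultdict


def non_returning_after_rejection(pulls):
    """Return (rejected, lost) sets of author logins.

    rejected: authors with at least one PR closed without merging.
    lost:     rejected authors who opened no further PR after their
              first rejected one (an assumed definition of "didn't
              submit code again").
    """
    by_author = defaultdict(list)
    for pr in pulls:
        by_author[pr["author"]].append(pr)

    rejected, lost = set(), set()
    for author, prs in by_author.items():
        # ISO 8601 UTC timestamps in one format sort correctly as strings.
        prs.sort(key=lambda pr: pr["created_at"])
        for i, pr in enumerate(prs):
            if pr["state"] == "closed" and not pr["merged"]:
                rejected.add(author)
                if i == len(prs) - 1:  # nothing was submitted afterwards
                    lost.add(author)
                break
    return rejected, lost
```

With the Elasticsearch numbers quoted above, `len(lost) / len(rejected)` would be 410 / 523, or roughly 0.78.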

Increasing contributor retention

The first step in this direction is to make contributing easier. This involves having comprehensive guidelines, an effortless installation process, and an uncomplicated submission process. Owners then have to take up the mantle of answering queries quickly and satisfactorily. Owners often reject a PR outright if they think it doesn’t adhere to the guidelines or won’t add much value. Instead, the attitude should be one of guiding the developer to rectify the work so that it adds to the repository. This might be time-consuming, but it creates a better community. Repositories follow these practices to varying degrees.

Open source is more than just code. Successful open source projects include code and documentation contributions together with conversations about these changes. — @arfon, “The Shape of Open Source”

With the advancements in machine learning and NLP, we can strive for something higher. Some ways these technologies can be applied to improve community engagement are:

  1. Identifying queries among the comments in PRs/issues: Owners can be quickly notified when there are unanswered questions, which saves the time of checking every issue.
  2. Identifying indicators of negative emotions: When a developer is unhappy with something, their comments might reflect it. Identifying such comments can be very helpful in prioritizing things.
  3. Engaging advanced contributors: Based on the issues, labels, and code data present in the repository, we can suggest to the owners open source contributors who are familiar with a given type of issue. External validation of their expertise makes developers feel more welcome and appreciated, and owners/managers also get a much larger workforce to solve issues.
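To make the first idea in the list above concrete, here is a deliberately simple sketch. It uses a punctuation heuristic as a stand-in for a trained NLP model, so it will miss questions phrased without a question mark. The comment fields `author` and `body` and the set of maintainer logins are assumptions made for the example.

```python
import re

# Fenced code blocks are stripped so that operators such as "a ? b : c"
# are not mistaken for questions. The fence marker is written as a
# repeated backtick to keep this snippet easy to embed.
CODE_FENCE = re.compile("`{3}.*?`{3}", re.S)
# A question mark followed by whitespace or end of text; this skips
# URL query strings such as "search?q=1".
QUESTION_MARK = re.compile(r"\?(?=\s|$)")


def looks_like_question(body):
    """Heuristic: does the comment contain a sentence ending in '?'"""
    text = CODE_FENCE.sub(" ", body)
    # Drop quoted reply lines ("> ...") so quoted questions don't count.
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith(">")]
    return bool(QUESTION_MARK.search("\n".join(lines)))


def has_unanswered_question(comments, maintainers):
    """True if a non-maintainer question has no maintainer reply after it.

    `comments` is assumed to be ordered oldest first.
    """
    pending = False
    for comment in comments:
        if comment["author"] in maintainers:
            pending = False  # any maintainer reply counts as an answer
        elif looks_like_question(comment["body"]):
            pending = True
    return pending
```

Running `has_unanswered_question` over every open issue and PR gives owners a short list of threads to look at first. A real system would replace `looks_like_question` with a classifier and would also check whether the reply actually addresses the question.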

At DeepAffects, we use these and many other useful insights to improve collaboration and productivity in software projects. Do check it out if you found this blog interesting.

