Popular Java projects on GitHub that could use some help (analyzed using BigQuery and Dataflow)
Many Java programmers would like to contribute to open-source projects, but don’t know where to start. There are so many projects on GitHub. Which ones actually need your help? Ideally, you want to start out with small, localized commits that are likely to get accepted.
Many Java programmers add tagged comments like FIXME and TODO to their code. If we look for these tags in code, then we will have a pretty good idea of which projects need help.
Of course, lots of projects need help, but perhaps you’d like to spend your time working on a project that has wide adoption. You can find which projects are widely used by counting how often classes from that project are imported into other projects.
Google recently announced that open-source GitHub code (more than 2.8 million open source GitHub repositories and over 2 billion files) is available for analysis in BigQuery and published several examples of SQL statements to analyze the data. But SQL is inherently limited: you can’t do programmatic things in it, and the more user-defined functions and regular expressions you add to the SQL, the harder it becomes to understand what’s going on.
Dataflow (an implementation of the Apache Beam API) allows you to write simple, straightforward data pipelines in Java or Python. Usefully, Dataflow lets you build a pipeline that starts from a BigQuery query and then processes the results step by step in Java code.
My pipeline (full code here) starts with a simple query:
String javaQuery = "SELECT content FROM [fh-bigquery:github_extracts.contents_java_2016]";
and then goes through these steps:
Once I get the Java content using the BigQuery query, I split the content field into lines, and then do two things with those lines:
- Figure out which packages have calls for help. To do this, I have to first figure out which packages the Java file belongs to. One simple way is to look for the package statement at the top of the Java file. If a class is in com.google.training.data-analyst.flights, I associate it with the full package name, but also with com.google.training.data-analyst, com.google.training, com.google and com. Then, I count the number of times FIXME or TODO appears in the file — this gets added to the number of calls for help on the packages this file belongs to.
- Figure out which packages this file makes use of. This is very similar, except that I look for import statements and maintain the total use per package.
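The two steps above can be sketched in plain Java, without any Dataflow classes. This is not the code from the linked pipeline, only a minimal illustration under some assumptions: the class and method names (PackageParsing, prefixes, nameAfterKeyword, importedPackage, countHelpTags) are hypothetical, a tag is counted once per line that contains FIXME or TODO, and package and import statements are assumed to sit on a single line that ends with a semicolon.

```java
import java.util.ArrayList;
import java.util.List;

class PackageParsing {
    // Returns the package itself plus every parent prefix,
    // e.g. "com.google.training" -> [com.google.training, com.google, com].
    static List<String> prefixes(String packageName) {
        List<String> result = new ArrayList<>();
        String current = packageName;
        while (!current.isEmpty()) {
            result.add(current);
            int lastDot = current.lastIndexOf('.');
            current = (lastDot < 0) ? "" : current.substring(0, lastDot);
        }
        return result;
    }

    // Extracts the dotted name that follows "package" or "import" on a line,
    // or returns null if the line is not such a statement.
    // Assumption: the statement ends the line with ';' (no trailing comment).
    static String nameAfterKeyword(String line, String keyword) {
        String trimmed = line.trim();
        if (!trimmed.startsWith(keyword + " ") || !trimmed.endsWith(";")) {
            return null;
        }
        String name = trimmed.substring(keyword.length(), trimmed.length() - 1).trim();
        if (name.startsWith("static ")) {
            // Static imports name a member; stripping the last segment later
            // leaves a class name rather than a package (a known imprecision).
            name = name.substring("static ".length()).trim();
        }
        return name;
    }

    // Drops the last segment of an imported name (the class name or "*").
    static String importedPackage(String importedName) {
        int lastDot = importedName.lastIndexOf('.');
        return (lastDot < 0) ? importedName : importedName.substring(0, lastDot);
    }

    // Counts the lines that contain a FIXME or TODO tag.
    static int countHelpTags(String[] lines) {
        int count = 0;
        for (String line : lines) {
            if (line.contains("FIXME") || line.contains("TODO")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] lines = {
            "package com.example.flights;",
            "import java.util.List;",
            "// TODO: handle cancelled flights",
            "class Flight { /* FIXME units */ }"
        };
        System.out.println(prefixes(nameAfterKeyword(lines[0], "package")));
        System.out.println(importedPackage(nameAfterKeyword(lines[1], "import")));
        System.out.println(countHelpTags(lines));
    }
}
```

In a pipeline, the tag count for a file would be added to every prefix returned by prefixes() for that file's package, and each import would add one use to every prefix of the imported package.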
I then derive a composite score of how popular a package is and how much help it needs as log(numUses)*log(numHelpNeeded). The top 1000 packages as measured by this composite score are written to the output file. I take logarithms because both counts have long-tailed distributions, and log() compresses the values so that neither count dominates the score.
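To make the scoring concrete, here is a small sketch of the composite score and the top-N selection in plain Java. It is an illustration rather than the pipeline's actual code: the names (CompositeScore, score, topPackages) are hypothetical, the natural logarithm is assumed (the base only rescales every score by the same constant, so the ranking does not change), and packages with a zero count are given a score of 0.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

class CompositeScore {
    // log(numUses) * log(numHelpNeeded); note that a count of 1 yields a score of 0.
    static double score(long numUses, long numHelpNeeded) {
        if (numUses <= 0 || numHelpNeeded <= 0) {
            return 0.0; // assumption: treat missing counts as "no score"
        }
        return Math.log(numUses) * Math.log(numHelpNeeded);
    }

    // Returns up to n package names that appear in both maps, highest score first.
    static List<String> topPackages(Map<String, Long> uses, Map<String, Long> help, int n) {
        List<String> candidates = new ArrayList<>();
        for (String pkg : uses.keySet()) {
            if (help.containsKey(pkg)) {
                candidates.add(pkg);
            }
        }
        candidates.sort(
            Comparator.comparingDouble((String p) -> score(uses.get(p), help.get(p))).reversed());
        return candidates.subList(0, Math.min(n, candidates.size()));
    }

    public static void main(String[] args) {
        // Made-up counts, purely for illustration.
        Map<String, Long> uses = Map.of("a", 100L, "b", 10000L, "c", 5L);
        Map<String, Long> help = Map.of("a", 100L, "b", 2L);
        System.out.println(topPackages(uses, help, 5));
    }
}
```

With these made-up counts, package a (100 uses, 100 tags) outranks package b (10000 uses, 2 tags): the product of logs rewards packages that do reasonably well on both measures over packages that are extreme on only one.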
So, which popular Java projects on GitHub could use your help? Discarding top-level domains like com, uk, and fi and large organizations like org.apache and com.google (that may have multiple projects), I end up with:
The code to run the pipeline on Dataflow on the Google Cloud Platform is here (change the output bucket name to be something that you have write permissions on).