There will definitely be an IBM undercount when counting by domain, because a lot of active contributors use their own addresses due to challenges using the company mail infrastructure with open source mailing lists. The same is true of a lot of active open source contributors I know in other organizations (especially if they’ve moved between companies before).
In years of keeping an eye on statistics in OpenStack, we found this also tends to be true of Chinese organizations, whose contributors nearly always commit from gmail.com, 163.com, or 126.com. While it’s totally possible to commit with a different address than the one you use for email discussions, people and tools tend to key off the discussion address, so it’s a pain if the two aren’t the same.
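To make the undercount concrete, here is a minimal sketch of domain counting over commit author emails. It is not the query from the post; the function name, the bucket label, and the list of free-mail domains are all my own assumptions for illustration.

```python
from collections import Counter

# Assumed set of free-mail domains (taken from the examples above); not exhaustive.
PERSONAL_DOMAINS = {"gmail.com", "163.com", "126.com"}


def count_by_domain(author_emails):
    """Tally commits per email domain, lumping free-mail domains into one bucket.

    Anything in the personal bucket cannot be attributed to an employer
    by domain alone, which is where the undercount comes from.
    """
    counts = Counter()
    for email in author_emails:
        # Take everything after the last "@" and normalize case.
        domain = email.rsplit("@", 1)[-1].strip().lower()
        if domain in PERSONAL_DOMAINS:
            counts["(personal address)"] += 1
        else:
            counts[domain] += 1
    return counts


# Hypothetical sample: three of these four commits might be from one company,
# but only one is attributable by domain.
sample = ["alice@ibm.com", "bob@gmail.com", "carol@163.com", "dave@126.com"]
result = count_by_domain(sample)
```

With a sample like this, the size of the personal-address bucket gives a rough upper bound on how many commits a pure domain count could be misattributing.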
There is something else curious in here. When I play with your bigtable queries until I drop the star value to 5, nodejs doesn’t show up in the list. That seems like a very big project to be missing. I wonder what else is missing in the 5–20 space? Was there a strong reason for picking 20?
Also, is there a reason you picked number of repositories committed to as the key metric, rather than commits, PRs merged, LOC, or issues closed?