Analyzing Go code with BigQuery

Recently my colleague Felipe Hoffa told me about a new public dataset in BigQuery: ALL THE PUBLIC GITHUB CODE!

Counting Go files

As a gopher, my first reaction was to check how many Go files are in that dataset. My SQL is not amazing, but I’m able to do that!

SELECT COUNT(*)
FROM [bigquery-public-data:github_repos.files]
WHERE RIGHT(path, 3) = ‘.go’

Running that query I see that there are more than 12 million files with a .go extension in the dataset. That’s a lot! But wait … I just ran that query on TWO BILLION ROWS and it finished in 6 seconds? Wow! 😮

Counting all the files with .go extension on GitHub

Ok, so that’s awesome! But I also processed 105GB, and since I’m the cost of the query is proportional to the size of the data queried (even though the first TB per month is free) it’s probably a good idea to create a new dataset and a new table containing just the files with a .go extension to minimize the cost.

Done! In 21 seconds I created a new table containing only the Go files. Counting how many Go files there are is now simpler:

SELECT COUNT(*)
FROM [go_files:files]

And the best part is that it processes zero bytes, since metadata doesn’t count.

The most popular file name

To figure that out we need to extract the file name from the path. BigQuery has some amazing text processing functions that makes this easy. Using SPLIT and LAST we can easily get the name for each file.

SELECT LAST(SPLIT(path, ‘/’)) as filename
FROM [go_files.files]

Now that we have the filenames, I’ll just count how many times each different filename appears and order the results by descending count. Or in SQL:

SELECT filename, COUNT(*) as n
FROM (
SELECT LAST(SPLIT(path, ‘/’)) as filename
FROM [go_files.files]
)
GROUP BY filename
ORDER BY n DESC
LIMIT 10

And here’s the result:

Top ten names for Go files on GitHub

Clearly gophers like documentation.

The most popular package name

Next, I’d like to determine the most popular package name. To do so, we need to read the contents of the files. To avoid going over all the files on GitHub (there’s many of them, apparently) I’ll create a table with the contents of only the Go files.

WARNING: this is an expensive query (around $7) and might take minutes to complete, instead of creating your own just use this table right here.

SELECT *
FROM [bigquery-public-data:github_repos.contents]
WHERE id IN (SELECT id FROM go_files.files)

Great, now I can simply find all the lines in those files that contain “package xxx” and group by it … easy!

SELECT SPLIT(content, ‘\n’) as line
FROM go_files.contents
HAVING LEFT(line, 8) = ‘package ‘

And now let’s extract the package name from those lines and count them.

SELECT SUBSTR(line, 8) as name, count(*) as n
FROM (
SELECT SPLIT(content, ‘\n’) as line
FROM go_files.contents
HAVING LEFT(line, 8) = ‘package ‘
)
GROUP BY name
ORDER BY n DESC
LIMIT 10

And here we have the results:

Top ten names for Go packages on GitHub

The most imported package

Ok, so this should be easy right? Just find all the lines that start with “import” as we did before?

SELECT line, count(*) as n
FROM (
SELECT SPLIT(content, ‘\n’) as line
FROM go_files.contents
HAVING LEFT(line, 7) = ‘import ‘
)
GROUP BY line
ORDER BY n DESC
LIMIT 10

We get:

Ok, so let’s analyze the results:

  1. import statements can span multiple lines, and people love that apparently.
  2. some people use package aliases … even when it’s not needed?

Something weird is going on: there are two results that begin with “import (“, why is that? Let’s look at the result of this query in JSON:

[
{ “line”: “import (“, “n”: “745978” },
{ “line”: “import \”fmt\””, “n”: “14673” },
{ “line”: “import \”C\””, “n”: “10235” },
{ “line”: “import \”testing\””, “n”: “8682” },
{ “line”: “import math \”math\””, “n”: “4298” },
{ “line”: “import (\r”, “n”: “4027” },
{ “line”: “import \”unsafe\””, “n”: “3833” },
{ “line”: “import \”time\””, “n”: “3709” },
{ “line”: “import fmt \”fmt\””, “n”: “3054” },
{ “line”: “import \”os\””, “n”: “2654” }
]

We have “import (“ and “import (\r”. Ah, line breaks are awesome. We should also take that into account.

At this point an episode from XKCD comes to mind.

http://xkcd.com/208/

I can extract the list of imported packages in a multiline import statement using the following regular expression:

r"(?s)import \(([^\]*)\)"

The r at the beginning of the string is to let BigQuery know this is a regular expression. The (?s) is used to allow the regular expression to match over multiple lines, which we need. The rest is capturing all the text in between two parentheses.

Using that regular expression we can extract the import statements using this query:

SELECT SPLIT(
REGEXP_EXTRACT(
content,
r'(?s)import \(([^\)]*)\)'
), '\n') as pkg
FROM go_files.contents
HAVING pkg IS NOT NULL

We extract the list of imported packages, then we split those by line. Nice! We can then extract the import path (ignoring possible aliases) by wrapping this query like this:

SELECT REGEXP_EXTRACT(line, '.*\"(.*)\".*') as pkg
FROM (
SELECT SPLIT(
REGEXP_EXTRACT(
content,
r'(?s)import \(([^\)]*)\)'
), '\n') as line
FROM go_files.contents
HAVING line IS NOT NULL
)

What about the import statements with only one package? Well, we can use a different regular expression:

r'import.*\"(.*)\"'

In the following query we divide the content of the file into lines, and apply the regular expression on each one of them.

SELECT REGEXP_EXTRACT(line, r'import.*\"(.*)\"') as pkg
FROM (
SELECT SPLIT(content, '\n') as line
FROM go_files.contents
)
HAVING pkg is not null

What’s left? Well, we are going to join both sets of import paths, count them, and order them by popularity!

SELECT pkg, count(*) as n
FROM
(
SELECT REGEXP_EXTRACT(line, '.*\"(.*)\".*') as pkg
FROM (
SELECT SPLIT(
REGEXP_EXTRACT(
content,
r'(?s)import \(([^\)]*)\)'
), '\n') as line
FROM go_files.contents
HAVING line IS NOT NULL
)
), (
SELECT REGEXP_EXTRACT(line, r'import.*\"(.*)\"') as pkg
FROM (
SELECT SPLIT(content, '\n') as line
FROM go_files.contents
)
HAVING pkg is not null
)
GROUP BY pkg
ORDER BY n DESC
LIMIT 10

And the result, after only 7.2 seconds and 5.96 GB processed is:

fmt FTW!

What about the most popular packages with an import path starting with github? Just add the line in bold:

SELECT pkg, count(*) as n
FROM
(
SELECT REGEXP_EXTRACT(line, '.*\"(.*)\".*') as pkg
FROM (
SELECT SPLIT(
REGEXP_EXTRACT(
content,
r'(?s)import \(([^\)]*)\)'
), '\n') as line
FROM go_files.contents
HAVING line IS NOT NULL
)
), (
SELECT REGEXP_EXTRACT(line, r'import.*\"(.*)\"') as pkg
FROM (
SELECT SPLIT(content, '\n') as line
FROM go_files.contents
)
HAVING pkg is not null
)
WHERE LEFT(pkg, 6) = 'github'
GROUP BY pkg
ORDER BY n DESC
LIMIT 10

And the result is:

Top 10 of the most imported packages from GitHub in GitHub

UPDATE: I modified this analysis shortly after talking with a friend, see the new results at the end of post.

Exported functions returning unexported types?

Interestingly enough, while finishing up the details of this post I had an interesting conversation on twitter. Start here:

I wondered, does anyone do that? Returning a private type from a public function seems counterintuitive …

How hard can it be to figure it out? With BigQuery, not much 🎉 Just find all the functions with an uppercase name that returns a type that starts with lowercase and is not one of the predeclared types in Go.

I also joined the files table to obtain the repo_name and path. Kind of a git blame, if you wish.

SELECT line, repo_name, path
FROM
(
SELECT line, id
FROM (
SELECT SPLIT(content, "\n") as line, id
FROM [go_files.contents]
HAVING length(line) > 2
)
WHERE REGEXP_MATCH(line, r"func [A-Z].*\(.*\) [a-z][a-zA-Z]* {")
AND REGEXP_EXTRACT(line, r"func [A-Z].*\(.*\) ([a-z][a-zA-Z]*) {") NOT IN
(
"bool", "byte", "complex64", "complex128", "error",
"float32", "float64", "int", "int8", "int16", "int32",
"int64", "rune", "string", "uint", "uint8", "uint16",
"uint32", "uint64", "uintptr", "interface"
)
) AS lines
JOIN [go_files.files] as files
ON files.id = lines.id
LIMIT 100

Interestingly enough, it seems like Kubernetes uses this style often!

Want more?

There are many things one could analyze, but soon you’ll realize that using regular expressions to parse Go is pretty hard. Fortunately BigQuery supports more than SQL!

With BigQuery User Defined Functions you can execute arbitrary JavaScript! Which makes me think … could we transpile “go/types” to JavaScript with GopherJS and run it on BigQuery?

Go check this post where Felipe tells us more about what you can do with this amazing dataset and the power of BigQuery.

Update on the updates

Chatting with my friend Matt Aimonetti we realized that if a big organization was using a package then it would be very high on my ranking. And that probably it’d be interesting to know how many different GitHub usernames have imports to each package.

I agree, so I created a new query:

SELECT pkg, COUNT(distinct REGEXP_EXTRACT(repo_name, "(.*)/.*")) as n
FROM (
SELECT pkg, id
FROM
(
SELECT REGEXP_EXTRACT(line, '.*\"(.*)\".*') as pkg, id
FROM (
SELECT SPLIT(
REGEXP_EXTRACT(
content,
r'(?s)import \(([^\)]*)\)'
), '\n') as line, id
FROM go_files.contents
HAVING line IS NOT NULL
)
), (
SELECT REGEXP_EXTRACT(line, r'import.*\"(.*)\"') as pkg, id
FROM (
SELECT SPLIT(content, '\n') as line, id
FROM go_files.contents
)
HAVING pkg is not null
)
WHERE REGEXP_MATCH(pkg, "github.com/.*")
) as imports JOIN [go_files.files] as files
ON files.id = imports.id
GROUP BY pkg
ORDER BY n desc
LIMIT 10

Nothing really fancy, just extracting the GitHub user name from the repository and using those to count without repetitions (hence the DISTINCT).

The results are interesting too!

Most popular packages counting only once every user name!