Analyzing Go code with BigQuery

Francesc Campoy
Jun 29, 2016 · 8 min read

Recently my colleague Felipe Hoffa told me about a new public dataset in BigQuery: ALL THE PUBLIC GITHUB CODE!

Counting Go files

As a gopher, my first reaction was to check how many Go files are in that dataset. My SQL is not amazing, but I’m able to do that!

SELECT COUNT(*)
FROM [bigquery-public-data:github_repos.files]
WHERE RIGHT(path, 3) = ‘.go’

Running that query I see that there are more than 12 million files with a .go extension in the dataset. That’s a lot! But wait … I just ran that query on TWO BILLION ROWS and it finished in 6 seconds? Wow! 😮

Image for post
Counting all the files with .go extension on GitHub

Ok, so that’s awesome! But I also processed 105GB, and since I’m the cost of the query is proportional to the size of the data queried (even though the first TB per month is free) it’s probably a good idea to create a new dataset and a new table containing just the files with a .go extension to minimize the cost.

Image for post

Done! In 21 seconds I created a new table containing only the Go files. Counting how many Go files there are is now simpler:

SELECT COUNT(*)
FROM [go_files:files]

And the best part is that it processes zero bytes, since metadata doesn’t count.

The most popular file name

To figure that out we need to extract the file name from the path. BigQuery has some amazing text processing functions that makes this easy. Using SPLIT and LAST we can easily get the name for each file.

SELECT LAST(SPLIT(path, ‘/’)) as filename
FROM [go_files.files]

Now that we have the filenames, I’ll just count how many times each different filename appears and order the results by descending count. Or in SQL:

SELECT filename, COUNT(*) as n
FROM (
SELECT LAST(SPLIT(path, ‘/’)) as filename
FROM [go_files.files]
)
GROUP BY filename
ORDER BY n DESC
LIMIT 10

And here’s the result:

Image for post
Top ten names for Go files on GitHub

Clearly gophers like documentation.

The most popular package name

Next, I’d like to determine the most popular package name. To do so, we need to read the contents of the files. To avoid going over all the files on GitHub (there’s many of them, apparently) I’ll create a table with the contents of only the Go files.

WARNING: this is an expensive query (around $7) and might take minutes to complete, instead of creating your own just use this table right here.

SELECT *
FROM [bigquery-public-data:github_repos.contents]
WHERE id IN (SELECT id FROM go_files.files)

Great, now I can simply find all the lines in those files that contain “package xxx” and group by it … easy!

SELECT SPLIT(content, ‘\n’) as line
FROM go_files.contents
HAVING LEFT(line, 8) = ‘package ‘

And now let’s extract the package name from those lines and count them.

SELECT SUBSTR(line, 8) as name, count(*) as n
FROM (
SELECT SPLIT(content, ‘\n’) as line
FROM go_files.contents
HAVING LEFT(line, 8) = ‘package ‘
)
GROUP BY name
ORDER BY n DESC
LIMIT 10

And here we have the results:

Image for post
Top ten names for Go packages on GitHub

The most imported package

Ok, so this should be easy right? Just find all the lines that start with “import” as we did before?

SELECT line, count(*) as n
FROM (
SELECT SPLIT(content, ‘\n’) as line
FROM go_files.contents
HAVING LEFT(line, 7) = ‘import ‘
)
GROUP BY line
ORDER BY n DESC
LIMIT 10

We get:

Image for post

Ok, so let’s analyze the results:

  1. import statements can span multiple lines, and people love that apparently.
  2. some people use package aliases … even when it’s not needed?

Something weird is going on: there are two results that begin with “import (“, why is that? Let’s look at the result of this query in JSON:

[
{ “line”: “import (“, “n”: “745978” },
{ “line”: “import \”fmt\””, “n”: “14673” },
{ “line”: “import \”C\””, “n”: “10235” },
{ “line”: “import \”testing\””, “n”: “8682” },
{ “line”: “import math \”math\””, “n”: “4298” },
{ “line”: “import (\r”, “n”: “4027” },
{ “line”: “import \”unsafe\””, “n”: “3833” },
{ “line”: “import \”time\””, “n”: “3709” },
{ “line”: “import fmt \”fmt\””, “n”: “3054” },
{ “line”: “import \”os\””, “n”: “2654” }
]

We have “import (“ and “import (\r”. Ah, line breaks are awesome. We should also take that into account.

At this point an episode from XKCD comes to mind.

Image for post
http://xkcd.com/208/

I can extract the list of imported packages in a multiline import statement using the following regular expression:

r"(?s)import \(([^\]*)\)"

The r at the beginning of the string is to let BigQuery know this is a regular expression. The (?s) is used to allow the regular expression to match over multiple lines, which we need. The rest is capturing all the text in between two parentheses.

Using that regular expression we can extract the import statements using this query:

SELECT SPLIT(
REGEXP_EXTRACT(
content,
r'(?s)import \(([^\)]*)\)'
), '\n') as pkg
FROM go_files.contents
HAVING pkg IS NOT NULL

We extract the list of imported packages, then we split those by line. Nice! We can then extract the import path (ignoring possible aliases) by wrapping this query like this:

SELECT REGEXP_EXTRACT(line, '.*\"(.*)\".*') as pkg
FROM (
SELECT SPLIT(
REGEXP_EXTRACT(
content,
r'(?s)import \(([^\)]*)\)'
), '\n') as line
FROM go_files.contents
HAVING line IS NOT NULL
)

What about the import statements with only one package? Well, we can use a different regular expression:

r'import.*\"(.*)\"'

In the following query we divide the content of the file into lines, and apply the regular expression on each one of them.

SELECT REGEXP_EXTRACT(line, r'import.*\"(.*)\"') as pkg
FROM (
SELECT SPLIT(content, '\n') as line
FROM go_files.contents
)
HAVING pkg is not null

What’s left? Well, we are going to join both sets of import paths, count them, and order them by popularity!

SELECT pkg, count(*) as n
FROM
(
SELECT REGEXP_EXTRACT(line, '.*\"(.*)\".*') as pkg
FROM (
SELECT SPLIT(
REGEXP_EXTRACT(
content,
r'(?s)import \(([^\)]*)\)'
), '\n') as line
FROM go_files.contents
HAVING line IS NOT NULL
)
), (
SELECT REGEXP_EXTRACT(line, r'import.*\"(.*)\"') as pkg
FROM (
SELECT SPLIT(content, '\n') as line
FROM go_files.contents
)
HAVING pkg is not null
)
GROUP BY pkg
ORDER BY n DESC
LIMIT 10

And the result, after only 7.2 seconds and 5.96 GB processed is:

Image for post
fmt FTW!

What about the most popular packages with an import path starting with github? Just add the line in bold:

SELECT pkg, count(*) as n
FROM
(
SELECT REGEXP_EXTRACT(line, '.*\"(.*)\".*') as pkg
FROM (
SELECT SPLIT(
REGEXP_EXTRACT(
content,
r'(?s)import \(([^\)]*)\)'
), '\n') as line
FROM go_files.contents
HAVING line IS NOT NULL
)
), (
SELECT REGEXP_EXTRACT(line, r'import.*\"(.*)\"') as pkg
FROM (
SELECT SPLIT(content, '\n') as line
FROM go_files.contents
)
HAVING pkg is not null
)
WHERE LEFT(pkg, 6) = 'github'
GROUP BY pkg
ORDER BY n DESC
LIMIT 10

And the result is:

Image for post
Top 10 of the most imported packages from GitHub in GitHub

UPDATE: I modified this analysis shortly after talking with a friend, see the new results at the end of post.

Exported functions returning unexported types?

Interestingly enough, while finishing up the details of this post I had an interesting conversation on twitter. Start here:

I wondered, does anyone do that? Returning a private type from a public function seems counterintuitive …

How hard can it be to figure it out? With BigQuery, not much 🎉 Just find all the functions with an uppercase name that returns a type that starts with lowercase and is not one of the predeclared types in Go.

I also joined the files table to obtain the repo_name and path. Kind of a git blame, if you wish.

SELECT line, repo_name, path
FROM
(
SELECT line, id
FROM (
SELECT SPLIT(content, "\n") as line, id
FROM [go_files.contents]
HAVING length(line) > 2
)
WHERE REGEXP_MATCH(line, r"func [A-Z].*\(.*\) [a-z][a-zA-Z]* {")
AND REGEXP_EXTRACT(line, r"func [A-Z].*\(.*\) ([a-z][a-zA-Z]*) {") NOT IN
(
"bool", "byte", "complex64", "complex128", "error",
"float32", "float64", "int", "int8", "int16", "int32",
"int64", "rune", "string", "uint", "uint8", "uint16",
"uint32", "uint64", "uintptr", "interface"
)
) AS lines
JOIN [go_files.files] as files
ON files.id = lines.id
LIMIT 100

Interestingly enough, it seems like Kubernetes uses this style often!

Image for post

Want more?

There are many things one could analyze, but soon you’ll realize that using regular expressions to parse Go is pretty hard. Fortunately BigQuery supports more than SQL!

With BigQuery User Defined Functions you can execute arbitrary JavaScript! Which makes me think … could we transpile “go/types” to JavaScript with GopherJS and run it on BigQuery?

Go check this post where Felipe tells us more about what you can do with this amazing dataset and the power of BigQuery.

Update on the updates

Chatting with my friend Matt Aimonetti we realized that if a big organization was using a package then it would be very high on my ranking. And that probably it’d be interesting to know how many different GitHub usernames have imports to each package.

I agree, so I created a new query:

SELECT pkg, COUNT(distinct REGEXP_EXTRACT(repo_name, "(.*)/.*")) as n
FROM (
SELECT pkg, id
FROM
(
SELECT REGEXP_EXTRACT(line, '.*\"(.*)\".*') as pkg, id
FROM (
SELECT SPLIT(
REGEXP_EXTRACT(
content,
r'(?s)import \(([^\)]*)\)'
), '\n') as line, id
FROM go_files.contents
HAVING line IS NOT NULL
)
), (
SELECT REGEXP_EXTRACT(line, r'import.*\"(.*)\"') as pkg, id
FROM (
SELECT SPLIT(content, '\n') as line, id
FROM go_files.contents
)
HAVING pkg is not null
)
WHERE REGEXP_MATCH(pkg, "github.com/.*")
) as imports JOIN [go_files.files] as files
ON files.id = imports.id
GROUP BY pkg
ORDER BY n desc
LIMIT 10

Nothing really fancy, just extracting the GitHub user name from the repository and using those to count without repetitions (hence the DISTINCT).

The results are interesting too!

Image for post
Most popular packages counting only once every user name!

Google Cloud - Community

Google Cloud community articles and blogs

Thanks to Sandeep Dinesh, Jack Wilber, and Felipe Hoffa

Francesc Campoy

Written by

VP of Product at Dgraph 🏳️‍🌈 Catalan

Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Francesc Campoy

Written by

VP of Product at Dgraph 🏳️‍🌈 Catalan

Google Cloud - Community

A collection of technical articles and blogs published or curated by Google Cloud Developer Advocates. The views expressed are those of the authors and don't necessarily reflect those of Google.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store