YES! I Compiled 1,000,000 TypeScript files in Under 40 Seconds. This is How.

I’ll Give You a Hint: You Can Do Some Pretty Amazing Things With BigQuery…

A few months ago, I created TypeWiz, a tool that helps migrate large code bases to TypeScript by trying to add missing type annotations automatically. Now that I’ve got it working, I am constantly looking for ways to improve it and make it even more powerful and useful.

For example, TypeWiz can figure out the shape of complex objects (i.e., what properties they have), but if you pass a complex object in multiple places in the code, TypeWiz currently repeats the same object definition each time. In the case of nested objects with lots of properties, this becomes long, verbose, repetitive… and mainly just a pain to read and maintain.

The obvious solution would be to declare an interface for these objects, and then use this interface for the type instead of repeating the object declaration every time. But how would we name this interface? Of course, we could go with something like IAutomaticallyGenerated1, increment the number, and let the user refactor and name the interfaces later — but can we do better?

Harder, Better, Faster… Strong(er) Type!

By using a few of my favorite tools, I was able to figure out a way to quickly find names for interfaces, and discovered along the way how to make this process fast: and I mean fast! And sure, I’m still working out a way to automatically suggest names for TS interfaces, but doing all of this code-crunching really got me thinking of some other neat uses for such quick compiling.

But to get to the fun part, this is how I compiled 1,000,000 TypeScript files in under 40 seconds.

Automatic Interface Naming

So to return to our original problem, we have the list of properties that comprise an interface, and we want to name that interface. What if we could look at a massive database of interfaces that cover many use-cases and find something similar to use as a guide?

For instance, if some interface contains username and password fields, both of the string type, a name like ICredential or IUserCredentials would be a good guess for this interface.
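To make the idea concrete, here is a minimal sketch of how such a lookup might work, assuming we already have a corpus of named interfaces to compare against. The `KnownInterface` shape, the `similarity` and `suggestName` helpers, and the tiny corpus are all hypothetical and only for illustration; a real system would likely need something smarter than a plain overlap score.

```typescript
// Hypothetical shape for a known interface: a name plus its property names and types.
interface KnownInterface {
  name: string;
  props: { [prop: string]: string };
}

// Jaccard similarity between two property maps: shared "name:type" pairs
// divided by the total number of distinct pairs.
function similarity(a: { [prop: string]: string }, b: { [prop: string]: string }): number {
  const pairsA = Object.keys(a).map((k) => k + ':' + a[k]);
  const pairsB = Object.keys(b).map((k) => k + ':' + b[k]);
  const shared = pairsA.filter((p) => pairsB.indexOf(p) !== -1).length;
  const union = pairsA.length + pairsB.length - shared;
  return union === 0 ? 0 : shared / union;
}

// Return the name of the most similar known interface, or null when nothing overlaps.
function suggestName(target: { [prop: string]: string }, known: KnownInterface[]): string | null {
  let best: string | null = null;
  let bestScore = 0;
  for (const candidate of known) {
    const score = similarity(target, candidate.props);
    if (score > bestScore) {
      bestScore = score;
      best = candidate.name;
    }
  }
  return best;
}

// Made-up corpus, for illustration only.
const knownInterfaces: KnownInterface[] = [
  { name: 'IUserCredentials', props: { username: 'string', password: 'string' } },
  { name: 'IPoint', props: { x: 'number', y: 'number' } },
];

const suggestion = suggestName({ username: 'string', password: 'string' }, knownInterfaces);
```

With this corpus, an object with string `username` and `password` fields matches `IUserCredentials` exactly, while an object that shares no properties with any known interface gets no suggestion at all.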

But this leads us to another question: where would we find such a database? For large projects that already have some type info, we may find the relevant interface within the project. Still, this doesn’t solve the problem in general.

Fortunately, we have all the data we need in the open, in a very large database: GitHub! And there is (apparently) an easy way to search it: no, it’s not the GitHub API…it’s Google’s BigQuery! But with a twist ;)

All Your Open Sources Are Belong to BigQuery

BigQuery is a managed, highly scalable data platform which can perform queries over a vast amount of data in a matter of seconds. Last month I demonstrated how BigQuery can even mine Bitcoins (or should I call them BigCoins?) as part of a research project. This time, we are actually going to use BigQuery the way it was intended (well, almost…).

BigQuery offers a collection of public data sets you can query, including the entire GitHub data set, which includes information about all the files hosted on GitHub, their content, and every public commit ever made to a GitHub project.

Writing a query that searches for all the TypeScript files on GitHub is as simple as:

SELECT
  id, repo_name, path, ref
FROM
  `bigquery-public-data.github_repos.files`
WHERE
  path LIKE '%.ts'

(Tip: When running the above query, make sure you uncheck the Use Legacy SQL option, or just add a line that says #standardSQL at the top.)

We can extend this query further to return the contents of these files, by also looking at the contents table:

SELECT
  *
FROM
  `bigquery-public-data.github_repos.contents`
WHERE
  id IN (
    SELECT
      id
    FROM
      `bigquery-public-data.github_repos.files`
    WHERE
      path LIKE '%.ts')
  AND binary = FALSE
  AND NOT STARTS_WITH(content, '<')

Note that the above query processes approximately 2.4TB, and since BigQuery bills you $5 per TB (though the first 1TB is free), running this query can get quite expensive! To avoid this, I opted to save the result as a new table in order to make the next queries cheaper. I called this table tscontents, and it ended up being “just” 5.3GB, or about 400 times smaller than the original table.
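As a rough back-of-the-envelope check, the billing arithmetic can be sketched as follows. The price and free-tier constants are simply the figures quoted above (current BigQuery pricing may differ), and `queryCostUsd` is a hypothetical helper, not part of any BigQuery API.

```typescript
// Figures quoted in the paragraph above; actual BigQuery pricing may differ today.
const PRICE_PER_TB_USD = 5;
const FREE_TB_PER_MONTH = 1;

// Estimated on-demand cost of a query that scans `scannedTb` terabytes,
// given how much of the monthly free tier is still unused.
function queryCostUsd(scannedTb: number, freeTbLeft: number = FREE_TB_PER_MONTH): number {
  const billableTb = Math.max(0, scannedTb - freeTbLeft);
  return billableTb * PRICE_PER_TB_USD;
}

// Scanning the full contents table once, with the whole free tier available.
const fullScanCost = queryCostUsd(2.4);

// Scanning the 5.3GB tscontents table after the free tier is used up.
const smallTableCost = queryCostUsd(5.3 / 1000, 0);
```

Under these assumptions, a single full scan costs about $7 even with the free tier, while each query against the smaller table costs less than three cents, which is why saving the intermediate result pays off almost immediately.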

Searching the Data Set

Now, how do we extract all the interface declarations from this massive collection of TypeScript files? While a simple text search for the word interface would probably suffice to find the relevant files, that wouldn’t necessarily get us the information we’re looking for. Extracting the complete interface definition, and distinguishing between real interface declarations and the word “interface” appearing in a string (and other such cases), requires a much more sophisticated solution.

I’ve seen many developers opt for regular expressions when parsing source code (like in this project), but regular expressions are not the best tool for the job, as they are limited in their ability to handle nested constructs and complex grammars. Or as some clever guy put it: it quickly becomes a mess.
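A small illustration of the problem, using a deliberately naive pattern. The `findInterfacesWithRegex` helper and the sample source are made up for this sketch; more careful regular expressions exist, but each fix tends to uncover another edge case.

```typescript
// Naive approach: treat every "interface <word>" occurrence as a declaration.
function findInterfacesWithRegex(source: string): string[] {
  const pattern = /interface\s+(\w+)/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    names.push(match[1]);
  }
  return names;
}

// Three lines: one real declaration, one string literal, one comment.
const sampleSource = [
  'interface User { name: string; }',
  'const help = "an interface is a contract";',
  '// interface Legacy was removed in v2',
].join('\n');

// Yields ['User', 'is', 'Legacy']: only the first entry is a real interface.
const regexHits = findInterfacesWithRegex(sampleSource);
```

The pattern has no notion of strings or comments, so two of the three matches are false positives. A parser tokenizes the source first and never sees those occurrences as declarations.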

A better approach would be to just use the parser built into TypeScript for parsing the source file, which would give us the Abstract Syntax Tree (AST), an easy to work with representation of the original source code (you can learn more about the TypeScript AST in my TypeWiz Internals blog post).

So in order to find the interface declarations, my game plan became to download the entire data set of TypeScript code onto my computer, and then write a short program that would use the TypeScript compiler to generate the AST and search for InterfaceDeclaration nodes.

But then it hit me: BigQuery is not just a regular database, it also has vast processing power. Could I run the TypeScript compiler inside BigQuery and just get the results? Could I eventually turn this into a generic code search engine that would be able to run complex AST queries over every TypeScript file ever published to GitHub?

Running TypeScript Inside BigQuery

When I tried to mine Bitcoins with BigQuery, I stumbled upon a feature called UDF (User Defined Functions). UDFs allow you to write custom functions in JavaScript and then use these functions inside your queries.

The TypeScript compiler is written in JavaScript, so it sounded like a good match. The documentation said that it is even possible to include third-party libraries and use them within the UDF, so I quickly grabbed a fresh build of TypeScript 2.8.3, uploaded it to Google Cloud Storage, and ran the following query to check if this worked:

This SQL query creates a UDF called testTypescript that loads the TypeScript library from Google Cloud Storage, then calls the compiler with the given source code to create the AST, and returns a JSON representation of it. We invoke this function with a simple SELECT statement in lines 9–10, giving it some one-line TypeScript source code to work on.

The Folly in First Attempts…

However, this attempt failed miserably with an error message, “User-defined function resource(s) exceed file size limit of 1048576 bytes”. Seems like TypeScript was too big for BigQuery. Even a much older version was too big — about 1.4MB for the minified 2.4.2 version (and I didn’t want an older version anyway).

Giving up is never my thing, and as I actually only needed the parser part of TypeScript, I decided to try compiling TypeScript itself and removing all the parts I didn’t need (type checking, for instance). I had a very good experience with the TypeScript code base when I replaced all the semicolons with dragons, so I decided to just clone their repository and try to remove parts of TypeScript until I was left with just the bare minimum.

An hour later, I ended up with a 690KB build of the latest version of TypeScript, built by the following script:

node lib/tsc.js --pretty --preserveConstEnums --out built/local/tsc-small.js --sourcemap --newLine LF --target es5 --lib es5 src/compiler/types.ts src/compiler/performance.ts src/compiler/core.ts src/compiler/sys.ts src/compiler/diagnosticInformationMap.generated.ts src/compiler/scanner.ts src/compiler/utilities.ts src/compiler/parser.ts src/compiler/binder.ts src/compiler/symbolWalker.ts src/compiler/moduleNameResolver.ts src/compiler/factory.ts src/compiler/visitor.ts src/compiler/sourcemap.ts src/compiler/comments.ts src/compiler/builderState.ts src/compiler/commandLineParser.ts
uglifyjs -c -m -o built/local/tsc-small.min.js built/local/tsc-small.js

The first line just runs the TypeScript compiler to compile the TypeScript compiler into JavaScript (yes, I know…), and since I was able to get rid of over half of the source files, there are now only 17 files left to compile!

From here, we minify the result using uglifyjs to strip comments, whitespace, etc., and then like magic, we’ve got our 690KB build of TypeScript!

OMG! It worked!!!

I uploaded the result to Google Cloud Storage again, ran my SQL query, and it worked like a charm. On the first try, too! (Okay, I had to recompile TypeScript, so technically the second try, but still…)

Query ran successfully and returned the AST

Running the TypeScript Compiler on 1 Million Files!

Finally, I was ready to give BigQuery the final challenge — writing a function that would extract the interfaces, and running it on the files in my data set.

The query defines a UDF named getInterfaces() that compiles the given source code, traverses the AST looking for InterfaceDeclaration nodes, and returns an array with the names of all the interfaces found. You could also return the actual interface declaration by changing line 8 to read result.push(ts.getTextOfNode(node));.
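The traversal itself is a plain depth-first walk. The sketch below shows that logic over a simplified, hypothetical `AstNode` shape with a hand-built tree, so it runs without the compiler; it is not the query from this post. With the real compiler, the tree would come from `ts.createSourceFile` and the children would be visited with `ts.forEachChild`.

```typescript
// Simplified, hypothetical node shape; the real parser produces ts.Node objects.
interface AstNode {
  kind: string;
  name?: string;
  children: AstNode[];
}

// Depth-first walk that collects the name of every InterfaceDeclaration node.
function getInterfaceNames(root: AstNode): string[] {
  const result: string[] = [];
  function visit(node: AstNode): void {
    if (node.kind === 'InterfaceDeclaration' && node.name !== undefined) {
      result.push(node.name);
    }
    for (const child of node.children) {
      visit(child);
    }
  }
  visit(root);
  return result;
}

// Hand-built tree standing in for:
//   interface A {}  namespace N { interface B {} }  class C {}
const sampleTree: AstNode = {
  kind: 'SourceFile',
  children: [
    { kind: 'InterfaceDeclaration', name: 'A', children: [] },
    {
      kind: 'ModuleDeclaration',
      name: 'N',
      children: [
        {
          kind: 'ModuleBlock',
          children: [{ kind: 'InterfaceDeclaration', name: 'B', children: [] }],
        },
      ],
    },
    { kind: 'ClassDeclaration', name: 'C', children: [] },
  ],
};

// Collects 'A' and the nested 'B', and skips the namespace and the class.
const foundInterfaces = getInterfaceNames(sampleTree);
```

Because the walk recurses into every child, interfaces nested inside namespaces are found as well, which a top-level-only scan would miss.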

I hit the red “RUN QUERY” button and waited impatiently. How long would it take BigQuery to compile more than 1 million TypeScript files and search for the interfaces? Would the query even succeed, or would it die with some obscure “Out of memory” error?

10 minutes later the nerve-racking wait was over —

618.8 seconds to compile and query 1 million files. It worked!

A sigh of relief: it worked. My efforts had paid off, and I was finally able to run complex queries over the entire open-source TypeScript repertoire.

My next thought was — okay, I did it, is there a way to make it faster? I ran another query to check the distribution of file sizes in the data set:

The graph shows the cumulative number of files smaller than a given size. We can see that the majority of the files are between 128 bytes and 16KB, where the bar touches the 1 million line, which means that there are about a million files that are 16KB or smaller.

Given this new information, I ran the query again, this time limiting it to files under 16KB, on the assumption that the few larger files could have slowed down the query. It seems my intuition was right: the query time went down to 140 seconds, just by adding WHERE size < 16384 at the end of the query.
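The kind of cumulative count used to pick that cutoff can be sketched in a few lines. The file sizes below are made up for illustration, and `cumulativeCounts` is a hypothetical helper; the real numbers came from a query over the data set.

```typescript
// For each threshold, count how many files are strictly smaller than it.
function cumulativeCounts(sizes: number[], limits: number[]): number[] {
  return limits.map((limit) => sizes.filter((size) => size < limit).length);
}

// Made-up file sizes in bytes, for illustration only.
const fileSizes = [100, 300, 900, 5000, 12000, 20000, 150000];
const sizeLimits = [128, 1024, 16384, 262144];

// Yields [1, 3, 5, 7] for the sizes above.
const sizeCounts = cumulativeCounts(fileSizes, sizeLimits);

// Share of files that a `size < 16384` filter would keep (5 of 7 here).
const keptShare = sizeCounts[2] / fileSizes.length;
```

If the share kept at a given threshold is close to 1, the filter drops very few files while removing the largest, slowest ones to parse.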

From 140 to (Less Than) 40 Seconds

I kept experimenting with different query variations trying to get the number even lower, when at some point I looked at the query execution plan (a nifty tool that shows how BigQuery actually executed your query) and a small detail caught my eye:

That Parallel Inputs column on the right also appeared while the query was running, and I remembered seeing larger values when I worked with larger tables. I started experimenting with different table sizes (by just duplicating the data) and noticed a pattern: the number of parallel inputs was roughly two times the size of the table in GB, and the more parallel inputs, the faster the queries ran.

Duplicating the data was not, however, a good strategy, as BigQuery bills you by the amount of data you read. So if I had 10 times the data, the queries would go faster, but would also be 10 times more expensive. But then I remembered that if you only query specific columns of a table, BigQuery only bills for the data in these columns.

So I had this crazy idea: what if I added a dummy column with a lot of junk data to the table, just to bloat its size, and then queried the table without referencing this column? Would I still get a bigger number of parallel inputs without being billed for the larger data size?

I ran the following query and saved the result as a new table:

This created an 80GB table. I ran the original query against the new table and…

To my surprise, my plan worked! Just by bloating the table with data I didn’t include in my query, I was able to increase the number of Parallel Inputs from 12 to 167, which made the query finish in merely 35 seconds!
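The "two parallel inputs per GB" rule of thumb from earlier can be checked against both observations. This is only a sketch of an empirical pattern seen in this experiment, not documented BigQuery behavior, and the helper names are hypothetical.

```typescript
// Rule of thumb observed in this experiment: roughly two parallel inputs per GB.
// This is an empirical observation, not documented BigQuery behavior.
const INPUTS_PER_GB = 2;

function estimateParallelInputs(tableSizeGb: number): number {
  return Math.round(tableSizeGb * INPUTS_PER_GB);
}

// Relative error between an estimate and the observed value.
function relativeError(estimated: number, observed: number): number {
  return Math.abs(estimated - observed) / observed;
}

const smallEstimate = estimateParallelInputs(5.3); // 11, against 12 observed
const bloatedEstimate = estimateParallelInputs(80); // 160, against 167 observed

const smallError = relativeError(smallEstimate, 12);
const bloatedError = relativeError(bloatedEstimate, 167);
```

Both estimates land within 10% of the observed values, so the pattern holds for the original 5.3GB table and for the bloated 80GB one.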

And there we have it: I compiled (and queried) 1 million TypeScript files in less than 40 seconds!

Taking It to the Next Level

This year at ng-conf I met craig., who told me about his new pet project, TSQuery. It is a query language, similar to CSS selectors, that allows powerful queries over the AST of both TypeScript and JavaScript. So instead of having to write custom code to search for all the InterfaceDeclaration nodes as we did above, we could just query using the selector InterfaceDeclaration, and TSQuery will find all these nodes for you. It’s that simple.

TSQuery is built on top of the TypeScript compiler, and since I already had the compiler running inside BigQuery, I decided to try to bring TSQuery along too. I created a custom build of TSQuery bundled with all its dependencies using rollup, and after some tinkering, managed to run it on BigQuery as well.

I spent the weekend creating a proof of concept project called BigTSQuery, a web application where you can query the TypeScript data set using TSQuery:

BigTSQuery

BigTSQuery is built using Angular and Angular Material for the frontend, along with Firebase Hosting and Firebase Functions for the backend, and runs all the queries in BigQuery, using Firestore as a caching layer for repeated queries.

Try it yourself: https://bigtsquery.firebaseapp.com/

And did I mention it’s also open-sourced?

So, We Made a Fast Query. But What Can We Do With It?

At the beginning of this post, I presented a use case for doing all of this: extracting all the interfaces I could find in open-source TypeScript code in order to build a system that could guess the names of interfaces. I’ve yet to experiment with building it, but in the meantime, Shai Reznik and I brainstormed a few more use cases for this new technology.

For example, it could be used to find out how popular certain language features are, e.g., if we wanted to see how developers use Intersection Types in the wild, we’d simply have to query for IntersectionType.

Similarly, if we wanted to find all usages of the “keyof” operator, we could query for TypeOperator. Seems like some people have been doing some crazy stuff with this operator (I especially like the comment there — “TSLINT REALLY HATES THIS INTERFACE… but it works!”. Famous last words 😉).

Framework/library authors could see how common certain features are, or how they are used in the wild. For instance, the following query finds all Angular Components that use “OnPush” Change Detection Strategy:

Decorator>CallExpression[expression.name=Component] PropertyAssignment[name.name=changeDetection] Identifier[name=OnPush]

And of course, you can use it as an educational tool — to learn from what others have been doing. Say you are an RxJS Developer looking to get into Marble Testing, you can run the following query which looks for all the it() methods that contain calls to either hot() or cold():

CallExpression[expression.name=it] CallExpression[expression.name=/^(hot|cold)$/]

Apparently, running this query and filtering out RxJS’s own tests, I get fewer than 100 results. Ben Lesh, we need to do a better job educating developers about Marble Testing!

I love it when my experiments end up being useful :)

Now it’s Your Turn to Have Some Fun!

I love being able to tinker and do things like turn BigQuery into a powerful search engine for TypeScript code, and it’s always fun to stretch my creative muscles a little and create a nice front-end for running queries and navigating through the results.

Still, it’s even more fun when you get involved! Hopefully I’ve explained my thinking through this process and made it clear how to use BigQuery alongside your TypeScript projects and experiments, and so I encourage you to go play with BigTSQuery and see what neat things you can come up with!

Think you can make a faster query? Do you have some ideas for use cases for this technology? Want to help make BigTSQuery better? Leave a comment below, or open an issue 🤗

I can’t wait to see what we’ll come up with next!