Refactoring Thrift schemas at Pinterest
Suman Karumuri | Pinterest tech lead, Infrastructure
The Pinterest tech stack consists of hundreds of microservices written in a variety of languages that use Apache Thrift to communicate with each other. Thrift schemas are used to define our service interfaces, constants and serialized data across all levels of the stack, including mobile apps, front-end services, back-end services, big data pipelines and machine learning models. Over time, our Thrift repository has organically grown to contain thousands of Thrift schemas spread across hundreds of tightly coupled Thrift files. In all, it’s about 1,500 files with over 100,000 lines of code. The tight coupling of the files has not only created a web of tangled dependencies among the Thrift files, but also among the applications that use them. The result is a slow, error-prone development cycle that reduces developer velocity across the company.
To address this problem, we recently refactored our Thrift schemas, leading to a huge reduction in code build times while improving code quality and developer velocity. In this blog post, we’ll share our motivation for the project, our approach to solving the problem and the wins we’ve observed.
Pinterest’s core business schemas are spread across 16 large (3–10k LoC) tightly coupled Thrift files. Since they include other files in the repo, taking a dependency on any of these files, directly or indirectly, would pull in ~450 Thrift dependencies (>50 percent of all Thrift schemas), as shown in the dependency graph in Figure 1. Since every project at Pinterest relies on a core business struct, directly or indirectly, every project at the company was dependent on half of our Thrift schemas repo. This tangled dependency graph caused several seemingly unrelated problems.
To compile the schemas locally, we needed larger development machines. Even on these high-end machines, the Thrift compilers would run out of memory and linters would time out when working on our Thrift repo. Even when everything worked, the builds were slow, consumed a lot of resources and generated huge binaries. For example, our generated C++ Thrift dependencies would compile to 500 MB even though the application code was only 20 MB. These binaries were slow to deploy and would consume a great deal of memory on our production machines.
Since any large Thrift file pulled in more than half of the Thrift schemas, any change in the Thrift repo triggered a full build of all of our code. Our Thrift repo changes several times per day, which meant we rebuilt our entire code base multiple times per day. This also slowed down our developers, because any Thrift change blocked them from merging new changes to our monorepo until the full build landed. Each build could take a couple of hours, which compounded the problem.
Slow and error-prone development and testing
It often took a long time to build and test local changes. Validating, testing and consuming a new Thrift version in our major repos took more than two hours, because the tangled dependencies forced a complete run of all unit and integration tests in all repos. Since we integrate Thrift into all of our language-specific monorepos, any bug in the Thrift integration would break builds across that entire repo, blocking hundreds of developers from committing their changes to it. We had Bazel installed, but couldn’t take advantage of its incremental build capabilities.
Changes had ripple effects, causing excessive builds and unnecessary releases. This made it harder to verify all the deploys and easier to introduce production errors. The large and tangled schema files served several business uses, so breaking a build in one part of the schema blocked unrelated changes from being pushed to production, and fixing it required cross-team coordination.
Since the code is tightly coupled, developers often resort to copying code to avoid taking circular dependencies (Figure 2) or large dependencies (Figure 3).
To understand the cause of the tangled dependencies, we looked at the Thrift code and realized that we had a “skinny include / fat dependency” problem: to access a small amount of functionality, the code takes on a large dependency. This is a common issue in large codebases and is usually fixed by refactoring the included schema into a new file and taking a dependency on the new file instead of on the large Thrift file.
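To make the “skinny include / fat dependency” problem concrete, here is a minimal Python sketch that counts the files a target pulls in transitively. The include graphs and every file name in them (service_j.thrift, core.thrift, pin_type.thrift and so on) are hypothetical and are not taken from the Pinterest repo.

```python
from collections import deque


def transitive_includes(graph, start):
    """Return every file reachable from `start` by following include edges."""
    seen = set()
    queue = deque(graph.get(start, ()))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph.get(current, ()))
    seen.discard(start)  # ignore a cycle that leads back to the start file
    return seen


# Hypothetical "before" graph: one enum is needed, but it lives in a fat file.
before = {
    "service_j.thrift": ["core.thrift"],
    "core.thrift": ["ads.thrift", "users.thrift"],
    "ads.thrift": ["billing.thrift"],
    "users.thrift": [],
    "billing.thrift": [],
}

# Hypothetical "after" graph: the enum has been moved into its own small file.
after = {
    "service_j.thrift": ["pin_type.thrift"],
    "pin_type.thrift": [],
}

deps_before = transitive_includes(before, "service_j.thrift")
deps_after = transitive_includes(after, "service_j.thrift")
print(len(deps_before), len(deps_after))  # prints: 4 1
```

The include statement in the target is a single line in both graphs; only the size of what that line drags in changes.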
However, refactoring Thrift schemas is complicated. Doing this incorrectly can break backward compatibility. And in some cases, refactoring can require extensive code changes, which most developers wish to avoid.
To solve this problem, we had to refactor the Thrift files with the following goals in mind:
- Change the narrow and deep tree into a wider and shallower one.
- Minimize impact on production and developer workflow.
Wide and shallow dependency tree
The tangled Thrift dependencies caused our dependency tree to become narrow and deep. To get a wider tree, we refactored large files into smaller ones. To get a shallower tree, we minimized the number of “include” declarations in a Thrift file. A wider and shallower build tree solves the skinny include / fat dependency problem, leading to faster incremental and parallel builds that use fewer computing resources (since we are compiling smaller files). By allowing targets to depend only on the needed schemas, we also reduce unnecessary builds. The smaller dependencies will also lead to smaller binaries, yielding faster build, release and deployment workflows.
To break down large files into smaller files we performed the following refactoring steps:
1. Moved enums into their own files.
2. Constants come in various flavors like (a) one-line constant definitions such as strings; (b) typedefs; (c) simple constant data structures consisting of primitive types; (d) constant data structures with enums; and (e) constant maps consisting of multiple types. We handle each of these flavors differently:
- Any constants that are generally applicable or too small for their own files, like those described in (a), (b) or (c), were reorganized into different constants files based on their business logic, like pinterest_constants, ads_constants, etc.
- Constant data structures consisting of enums (d) became enums in the same file.
- Any constants that depend on multiple types (e) were treated as structs.
3. Since structs can be nested, their refactoring is a bit more subjective and complicated than simply putting each struct in its own file. The following guidelines helped keep the number of files and their dependencies to a minimum:
- All structs that logically belong together were kept in the same file to help keep nested structs together.
- We preferred that any struct consumed by multiple Thrift files via includes be in its own file.
- If a struct needs a schema from a different Thrift file, we preferred that it be declared in a separate file. This way only the code that needs this struct will take the dependency on the included Thrift file.
4. All service interfaces were placed in their own files with any associated structs, like request/response structs.
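The steps above can be sketched as a small planning script. This is an illustration only, not the tooling described in this post: it scans a Thrift IDL string with regular expressions and proposes a target file for each top-level definition. The sample IDL, the one-file-per-enum naming and the "REVIEW" bucket for structs (which need human judgment under guideline 3) are all assumptions.

```python
import re

# Simplified patterns for top-level Thrift definitions (single-line headers only).
PATTERNS = [
    ("enum", re.compile(r"^\s*enum\s+(\w+)", re.M)),
    ("struct", re.compile(r"^\s*struct\s+(\w+)", re.M)),
    ("service", re.compile(r"^\s*service\s+(\w+)", re.M)),
    ("const", re.compile(r"^\s*const[ \t]+.+?[ \t]+(\w+)[ \t]*=", re.M)),
    ("typedef", re.compile(r"^\s*typedef[ \t]+.+[ \t]+(\w+)[ \t]*$", re.M)),
]


def snake(name):
    """Convert CamelCase to snake_case, e.g. PinType -> pin_type."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def plan_files(idl, domain):
    """Map each proposed target file to the definitions that would move into it."""
    plan = {}
    for kind, pattern in PATTERNS:
        for name in pattern.findall(idl):
            if kind in ("enum", "service"):
                target = f"{snake(name)}.thrift"       # own file (steps 1 and 4)
            elif kind in ("const", "typedef"):
                target = f"{domain}_constants.thrift"  # shared constants file (step 2)
            else:
                target = "REVIEW"                      # structs: decide by hand (step 3)
            plan.setdefault(target, []).append(name)
    return plan


# Hypothetical fat file containing one definition of each kind.
SAMPLE_IDL = """
enum PinType { IMAGE = 1, VIDEO = 2 }
const string DEFAULT_BOARD = "home"
typedef i64 UserId
struct Pin { 1: UserId owner, 2: PinType type }
service PinService { Pin getPin(1: i64 id) }
"""

plan = plan_files(SAMPLE_IDL, "pinterest")
for target, names in sorted(plan.items()):
    print(target, names)
```

A regex scan like this ignores comments, multi-line headers and nested types, so a production tool would more likely work from a real Thrift parser.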
Minimizing production impact
When we started refactoring the code based on the guidelines above, we realized that refactoring some schemas would break backward compatibility or require large application modifications, because we had relied heavily on non-standard Thrift features, like reusing namespaces across Thrift files, including language-specific code into Thrift files, etc. Because of this, we added one more goal: to perform this refactoring with minimal production impact.
To minimize production impact, our refactoring had to work with the following constraints:
A. Maintain backward compatibility so new code works with old code, and vice versa.
B. The code changes should not have any runtime overhead.
C. Allow incremental code rollouts, since it’s impossible to synchronize code deployments across thousands of services.
D. Minimize the code changes needed, so the code is easier to review, test and deploy.
E. Avoid data migration, since these serialized core data structures can be contained in petabytes of data.
F. Make code changes in which we have high confidence.
G. Empower other developers in the company to refactor code.
Since we wanted zero runtime overhead, we couldn’t employ techniques like schema translation to maintain backward compatibility. To understand the impact of refactoring on code while satisfying constraints A-E, we refactored two enums in our codebase. During this refactor, we found that the Thrift code generators were the main cause of breaks in backward code and data compatibility, because the code generators for different languages use the declared Thrift namespace differently. Since the namespace was used differently across languages, the serialized structure of the code also differed. For example, the Python and Go code generators use the namespace as the filename of the generated code, whereas the Java and C++ generators map the namespace to the Java package name and the C++ namespace, respectively.
After looking at our code, we realized we could selectively refactor a large portion of the Thrift schemas in our codebase without breaking backward compatibility just by making namespace changes in Python and Go. Because Python and Go didn’t use the namespace of the package in the serialized Thrift struct, a namespace change was backward compatible as long as the code compiled. The only change we had to make to our code was to change the namespace of the package. Some refactorings required the refactored code to be added to a new namespace; in such cases, the import paths in the code needed to be updated to use the new namespace.
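To illustrate why a namespace change can leave serialized data untouched, the sketch below hand-encodes a struct of i32 fields in the layout used by Thrift's binary protocol: a one-byte type ID, a two-byte field ID, the value, and a final stop byte. It assumes the binary protocol, does not use the Thrift library, and says nothing about other serialization mechanisms that may embed class or package names; the function name and the field values are hypothetical.

```python
import struct

T_STOP = 0  # marks the end of a struct in the binary protocol
T_I32 = 8   # Thrift type ID for a 32-bit signed integer


def encode_i32_struct(fields):
    """Encode {field_id: int32_value} as a Thrift binary-protocol struct body."""
    out = b""
    for field_id, value in sorted(fields.items()):
        # Field header (type byte + big-endian i16 field ID), then the i32 value.
        out += struct.pack(">bhi", T_I32, field_id, value)
    return out + struct.pack(">b", T_STOP)


payload = encode_i32_struct({1: 42})
print(payload.hex())  # prints: 0800010000002a00
```

Neither the struct name nor its namespace is an input to the encoder, so in this layout moving a definition to another file or namespace yields the same bytes as long as field IDs and types stay the same.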
Table 1 below shows the Thrift refactor change and how it impacted every language. To minimize the changes and avoid breaking any serialized Thrift structs, we kept the namespaces for Java and C++ the same across files. For the remaining languages, we decided to add the new files into the new namespace and update the namespace in the code. The changes to Go/Python/JS didn’t change the serialized data, which meant the changes could be incrementally rolled out.
To ensure the code worked once deployed, we needed a solution that only required compile-time modifications. These changes are a huge effort (changing one enum can lead to changing 6,000 references in our Go code alone), and a single team can’t take on this project by itself, so we wanted to automate the process as much as possible so that other developers across the company could pitch in. To satisfy these constraints, we implemented tools to automate the renaming across code repos. This kind of tooling makes the code changes mechanical, which in turn makes them easier to review and deploy. Furthermore, the tooling can be used by other developers in the company to perform their own refactoring.
Based on this analysis, we developed tools to refactor our Thrift code in several languages automatically. Each of the refactorings performed by our tools touched code across all our services, including the mobile application, the back end and our data jobs. We performed several refactorings with a focus on improving build time for our projects. The results are shown below.
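As an illustration of what such a mechanical rename can look like, here is a minimal Python sketch that moves one imported symbol to a new module path. It is not the tooling described in this post. The module names (core.ttypes, pin_type.ttypes) and the PinType symbol are hypothetical, and the sketch only handles single-line `from X import a, b` statements without aliases.

```python
import re


def move_symbol(source, symbol, old_module, new_module):
    """Rewrite `from old_module import ...` so `symbol` comes from `new_module`."""
    pattern = re.compile(rf"^from {re.escape(old_module)} import (.+)$", re.M)

    def fix(match):
        names = [name.strip() for name in match.group(1).split(",")]
        if symbol not in names:
            return match.group(0)  # leave unrelated imports untouched
        kept = [name for name in names if name != symbol]
        lines = []
        if kept:
            lines.append(f"from {old_module} import {', '.join(kept)}")
        lines.append(f"from {new_module} import {symbol}")
        return "\n".join(lines)

    return pattern.sub(fix, source)


original = "from core.ttypes import Pin, PinType\n\nkind = PinType.IMAGE\n"
rewritten = move_symbol(original, "PinType", "core.ttypes", "pin_type.ttypes")
print(rewritten)
```

Only the import line changes; every use of `PinType.IMAGE` in the body stays as it was, which keeps the resulting diff small and easy to review. A more robust tool would work on the syntax tree rather than on text.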
Broad and shallow dependency tree
After the refactor, the services can depend on a specific Thrift struct instead of depending on the entire file. As shown in Figures 4 and 5, depending on a specific Thrift schema reduces the number of Thrift targets for service J.
After the refactor, the dependency tree for our services became broad and shallow instead of deep and narrow, as shown in Figures 6 and 7. This change is being rolled out into production in a piecemeal fashion.
Build time improvements
Since refactoring, build time has improved for several services, because the build tree is now shallow and broad, and a service can depend on a single struct instead of a large file containing thousands of structs. We have seen overall build time improvements of 15–35 percent for several projects.
Unnecessary builds reduced by 600x
After refactoring, we updated some services to only depend on their Thrift structs instead of the entire file. Before the refactor, we were building all the projects in the repo for every Thrift schema change. Now that we’re only depending on a specific Thrift struct that rarely changes, we’ve reduced the number of unnecessary builds for a target by a massive 600x.
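The effect on unnecessary builds can be sketched with a toy commit log. The file names and commits below are hypothetical, and the resulting counts illustrate the mechanism only; they are not the measurements reported in this post.

```python
def builds_triggered(target_deps, commits):
    """Count commits that touch at least one file the target depends on."""
    return sum(1 for changed_files in commits if target_deps & changed_files)


# Hypothetical commit log: each entry is the set of Thrift files a commit changed.
commits = [
    {"ads.thrift"},
    {"users.thrift"},
    {"billing.thrift"},
    {"core.thrift"},
    {"pin_type.thrift"},
    {"ads.thrift", "users.thrift"},
]

# Before: the target transitively depends on every file through a fat include.
fat_deps = {"core.thrift", "ads.thrift", "users.thrift", "billing.thrift", "pin_type.thrift"}
# After: the target depends only on the one small file it actually uses.
narrow_deps = {"pin_type.thrift"}

print(builds_triggered(fat_deps, commits))     # prints: 6
print(builds_triggered(narrow_deps, commits))  # prints: 1
```

The less often the narrow dependency changes relative to the rest of the repo, the larger the reduction in rebuilds.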
Binary size reduced by 22–35%
These changes also resulted in smaller binary sizes, because we only include the needed schemas instead of all the schemas in the entire file. After some refactorings we saw reductions in binary sizes of 22–35 percent. The reduced binary sizes result in faster application startup time, smaller resident size in memory and reduced infrastructure needed to build, store and deploy the generated application binaries.
We started the Thrift refactor project to untangle our narrow and deep dependency graph of Thrift schemas, because it was causing a variety of inefficiencies, including oversized binaries, unnecessary builds, a need for more computing resources, slowed-down development, difficult testing and an inability to do incremental builds. Once we realized that our dependency tree was tangled because of the “skinny include / fat dependency” problem, we picked the right abstractions to refactor our schemas. The implementation of the project was challenging, because we had to maintain backward compatibility while incurring zero runtime overhead across all projects in our repo. Refactoring the Thrift schemas and taking dependencies on specific structs not only enabled incremental builds, but also improved our build time by 15–35 percent for several projects while reducing unnecessary builds by 600x. We’re gradually rolling out our changes into production.
Acknowledgements: The contributors to this project are Baogang Song, Qi Zhou and Nick Zheng.