Building a Python monorepo for fast, reliable development

Published in Pinterest Engineering Blog, Oct 20, 2017

Suman Karumuri | Pinterest technical lead, Visibility & Ruth Grace Wong | Pinterest engineer, Core Site Reliability

More than 200 million people discover and do what they love on Pinterest every month. We rely on several hundred Python services and tools to power these experiences. The code for these services lives in 100+ Git repositories (except for our Python frontend monolith). Over time, we found that developing Python applications across a growing number of repos was causing friction and slowing down our developers. We built Python commons to provide a seamless experience for our Python developers. In this post, we’ll share a few challenges we encountered managing Python code at scale, and how Python commons provides a fast and reliable code development environment.

Challenges managing Python code at scale

While Python tools work great for managing code in a single repo, they aren’t designed for managing code across repos. Even in a single repo, there’s a steep learning curve to correctly set up and use tools and utilities, like requirements files, setup.py and tox, for a reproducible build and test environment. Given the complexity involved, few developers take the time to do it right. Below, we’ll explain a few issues our developers face when building, testing and deploying Python code across 100+ repos.

Managing virtual environments: Each Python project has its own virtualenv, and the developer needs to be mindful of using the correct virtualenv while working in a project and branch. Using the wrong virtualenv leads to hard-to-trace errors in the development, build and deploy process.
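As a small illustration of the virtualenv problem, a script can at least report which environment it is running in before doing any work. The sketch below is not part of Python commons. It uses only the standard library, and the function names are made up for this example.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv or virtualenv, sys.prefix points at the environment,
    # while sys.base_prefix (or the older sys.real_prefix set by legacy
    # virtualenv releases) points at the base Python installation.
    base_prefix = getattr(sys, "real_prefix", None) or getattr(
        sys, "base_prefix", sys.prefix
    )
    return sys.prefix != base_prefix


def describe_environment():
    """Summarize which interpreter and environment this process uses."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "in_virtualenv": in_virtualenv(),
    }


if __name__ == "__main__":
    print(describe_environment())
```

A check like this only tells a developer that some virtualenv is active. It cannot tell whether it is the right one for the current project and branch, which is the harder part of the problem.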

Running unit tests with tox: For test integrity, developers are advised to run their tests in a virtualenv using tox. Given the complexity of managing virtual envs and setting up tox correctly, few projects do this in practice. (Some developers skip writing unit tests entirely.)

Package pinning: If packages aren’t pinned to specific versions, they might break in production when their dependencies are upgraded. Even if each repo pins its package versions, reusing code across repos leads to conflicting package versions and breaks the package during deployment.
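To make the cross-repo conflict concrete, here is a minimal sketch that compares the pinned requirements of two repos and reports packages pinned to different versions. The repo names, package versions and helper functions are illustrative assumptions, not taken from our codebase.

```python
def parse_pins(requirements_text):
    """Parse 'name==version' lines into a dict, skipping comments and blanks."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_conflicts(pins_by_repo):
    """Return {package: {repo: version}} for packages pinned inconsistently."""
    versions = {}
    for repo, pins in pins_by_repo.items():
        for name, version in pins.items():
            versions.setdefault(name, {})[repo] = version
    return {
        name: by_repo
        for name, by_repo in versions.items()
        if len(set(by_repo.values())) > 1
    }


if __name__ == "__main__":
    # Hypothetical requirements files from two separate repos.
    repo_a = "requests==2.18.4\nsix==1.10.0\n"
    repo_b = "requests==2.9.1\nsix==1.10.0\n"
    conflicts = find_conflicts(
        {"repo_a": parse_pins(repo_a), "repo_b": parse_pins(repo_b)}
    )
    print(conflicts)
```

Each repo is internally consistent here, yet code from one cannot be reused in the other without choosing a single version of requests.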

Deploying security fixes: Upgrading packages to fix a security issue across hundreds of repos is a hard, boring and tedious process.

Pip install: Most of our developers deploy Python packages using pip install. In practice, we found pip install isn’t a robust deployment mechanism for the following reasons:

Pip install isn’t atomic. A failed pip install may leave some packages upgraded and others at an old version. This occasionally causes deployment outages.
Pip can fail silently on production machines, which leads to production outages.
Pip’s command line options are inconsistent across minor version changes, which can cause a pip install to fail when pip is upgraded along with new OS versions.
Pip downloads each dependency recursively. While this is harmless at small scale, doing it across tens of thousands of machines several times every day is inefficient.
Pip install wasn’t ideal for deploying internal tools because inconsistent dev environments were becoming hard to support. Most tools came with custom scripts that set up virtual envs and deployed the tool there. While this worked, it was a tedious and error-prone process.

Consistent development environment: Since developers set up their own repos, over time there’s little consistency in development, build, test and deployment setups. Several projects didn’t have continuous integration set up for their build process, and coding conventions and quality varied across repos. Even minor issues, like failing to correctly namespace a package, led to namespace clobbering when the code was reused, resulting in complicated workarounds. This additional complexity discouraged code reuse across repos.

Our takeaway is that the standard Python toolchain needs a lot of work upfront to create a consistent and reproducible build environment in a single repo. Even if we set up the tooling carefully, the standard tools can’t ensure a consistent build and deploy pipeline across repos.

Python commons

We had one primary goal as we designed our new solution: we wanted it to be easy to do the right thing while enabling developers to quickly ship code. So we built a monorepo called Python commons using the Pants build tool. To streamline our release process, we use a Python EXecutable (PEX) file as our release primitive.

Python commons monorepo

The first decision was to start using a monorepo for all our tools’ code. This provides a single place for all code and lets us enforce healthy development practices more easily than a multi-repo solution would. A consistent development, build and test environment also encourages modular code and code reuse. A monorepo is a more natural workflow for us since we have several language-specific monorepos, and it’s common for several tools to share the same repo.

Since we already have a Python monorepo for our frontend application code, our first instinct was to move the tools’ code into that repo to create a single repo for all Python code. However, that didn’t work, because the development workflow was heavily customized for building our monolithic Python web frontend. So, we decided to build a separate monorepo called “Python commons” for our tools and services.

Pants

While deciding on the monorepo was easy, the hard part was setting up a development workflow suitable for a wide range of Python applications, from web apps to services, libraries and command line tools. To make managing and using the monorepo easier, we use Pants as our build tool. Pants helps enforce a uniform development workflow for building, testing and packaging apps while keeping our configuration DRY.

Figure 2: A snippet of our Python requirements file that lists external dependencies. The versions of packages are pinned for the entire repo, and the package versions are conflict-free, so all code in the repo can be reused.

Figure 3: Each project in our repo has its own folder. Each folder has a BUILD file, which lists the internal and external dependencies for the project. In this BUILD file, the _bot target depends on the argparse project. The srebot binary depends on the _bot target.
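The dependency chain described in Figure 3 can be modeled in a few lines of plain Python. This is a toy model only: the two functions below are stand-ins named after Pants target types, not the Pants API, and real BUILD files refer to targets by address rather than by bare name. The sketch only shows how declared targets form a graph that a build tool can walk.

```python
# Registry of declared targets, keyed by target name.
TARGETS = {}


def python_library(name, dependencies=()):
    """Stand-in for a library target: records its direct dependencies."""
    TARGETS[name] = {"type": "library", "dependencies": list(dependencies)}


def python_binary(name, dependencies=()):
    """Stand-in for a binary target: records its direct dependencies."""
    TARGETS[name] = {"type": "binary", "dependencies": list(dependencies)}


# Declarations in the spirit of the BUILD file described in Figure 3.
python_library(name="argparse")
python_library(name="_bot", dependencies=["argparse"])
python_binary(name="srebot", dependencies=["_bot"])


def transitive_dependencies(name):
    """Return every target reachable from `name`, in depth-first order."""
    seen = []

    def visit(target):
        for dep in TARGETS[target]["dependencies"]:
            if dep not in seen:
                seen.append(dep)
                visit(dep)

    visit(name)
    return seen


if __name__ == "__main__":
    print(transitive_dependencies("srebot"))
```

Because the srebot binary only names _bot, it picks up argparse indirectly. Each project states its direct dependencies once, and the build tool works out the rest.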

Figure 4: A user can run predefined Pants goals on the targets.

The code layout we used in the repo provides a consistent development workflow for every project in the repo.

The folder structure shown in Figure 1 ensures source and tests are separated, and all internal code is in the Pinterest namespace. This separation safeguards us from shipping tests or their dependencies into production.
Pants comes with a built-in Python linter that enforces code style for the repo.
Standard build targets provide an intuitive and consistent development workflow to build, test, run and release packages (as shown in Figure 4).
The pants repl option provides an interactive REPL to play with the code.
Pants creates a virtualenv for every run based on the dependencies in the BUILD file. If the dependencies change between Git branches, developers don’t have to switch virtualenvs to make sure their code works correctly, which makes virtual env management seamless.
Since tests are run in a virtualenv, developers don’t have to learn or use tox.
The Pants test target automatically creates a test runner, so there’s no need for a separate script to run tests.
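One way to picture an environment that follows the declared dependencies is to derive its identity from the dependency list itself. The sketch below is not how Pants is implemented. It is an illustrative assumption with a made-up function name, showing why a developer never has to pick an environment by hand when the dependencies change between branches.

```python
import hashlib


def environment_key(requirements):
    """Derive a stable key from a set of pinned requirements."""
    # Normalize so that ordering and casing differences do not change the key.
    normalized = sorted(req.strip().lower() for req in requirements if req.strip())
    digest = hashlib.sha256("\n".join(normalized).encode("utf-8")).hexdigest()
    return digest[:12]


if __name__ == "__main__":
    # Hypothetical dependency sets on two Git branches.
    main_branch = ["requests==2.18.4", "six==1.10.0"]
    feature_branch = ["requests==2.18.4", "six==1.11.0"]
    print(environment_key(main_branch))
    print(environment_key(feature_branch))
```

The same dependencies always map to the same key, and any change in a pinned version maps to a different one, so the right environment is selected as a side effect of checking out a branch.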

Pants simplifies dependency management across projects using repo and version pinning.

Pants controls which external repos we download our packages from. When our access to the PyPI repo was blocked, we pointed the repo to an internal mirror with a one-line configuration change to the pants.ini file.
We use the same set of pinned dependencies for the entire repo (as shown in Figure 2). This is the only place in the repo for defining our external dependencies and simplifies our dependency management. Pants builds a virtual environment for every build, so any dependency conflicts are detected right away.
A single place for pinned dependencies allows us to upgrade a package for all the projects in the repo at once. This greatly simplifies security audits and package version upgrades.
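With one requirements file for the whole repo, both checks described above become small text operations. The following sketch assumes plain `name==version` lines. The helper names are hypothetical, and this is not a script from our repo.

```python
import re

# A requirement counts as pinned only when it uses an exact '==' version.
PINNED = re.compile(r"^[A-Za-z0-9._-]+==[A-Za-z0-9.*+!_-]+$")


def unpinned_requirements(requirements_text):
    """Return requirement lines that are not pinned to an exact version."""
    problems = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if line and not PINNED.match(line):
            problems.append(line)
    return problems


def bump_pin(requirements_text, package, new_version):
    """Return the requirements text with `package` pinned to `new_version`."""
    updated = []
    for raw_line in requirements_text.splitlines():
        name = raw_line.split("==", 1)[0].strip()
        if "==" in raw_line and name.lower() == package.lower():
            updated.append("{}=={}".format(name, new_version))
        else:
            updated.append(raw_line)
    return "\n".join(updated) + "\n"


if __name__ == "__main__":
    shared = "requests==2.18.4\nsix==1.10.0\n"
    print(unpinned_requirements(shared + "flask\n"))
    print(bump_pin(shared, "six", "1.11.0"))
```

A security upgrade then amounts to one edited line followed by a build and test run of every project in the repo.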

By enabling fast reproducible builds, Pants simplifies build and release management.

The Pants run target in a BUILD file can be used for running the program locally, eliminating the need for scripts.
Pants provides fast, reproducible builds for our packages. Pants performs incremental builds on its targets, so only changed modules are rebuilt, which speeds up the build process. Running all the build targets in virtual envs ensures builds are reproducible.
The Pants python_library target can include a setup.py definition (as shown in Figure 3). By using this target, developers don’t have to learn setup.py to publish Python eggs.
The Pants binary target generates a standalone PEX binary for the project.

PEX

A monorepo with Pants streamlined our development and test process. However, we observed that our developers preferred their own repos because separate repos gave them control over how their code was distributed: as a Debian package, Docker container, Python egg or script. To cater to these use cases and streamline our package release and deployment process, we needed a mechanism to easily export packages into various formats. Exporting an egg was easy since Pants natively supports it. To package our code into other formats, we used PEX as a basic packaging primitive for our code. A PEX is a self-contained, cross-platform Python executable format with packaged dependencies, so it only needs a Python interpreter on the machine it’s running on. A PEX can be packaged into a Debian package, packaged into a Docker container or uploaded to S3. The last option, uploading to S3, is great for shipping internal tools, which are the hardest to deploy and manage.
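The idea of shipping a single executable file can be tried with nothing but the standard library. The sketch below uses Python’s zipapp module rather than PEX itself. Both produce a zip archive with a `__main__.py` that the interpreter can run directly, but zipapp does not resolve or bundle third-party dependencies the way PEX does. The file names are made up for the example.

```python
import os
import subprocess
import sys
import tempfile
import zipapp


def build_and_run_archive():
    """Build a single-file executable archive and run it, returning its output."""
    with tempfile.TemporaryDirectory() as workdir:
        # A tiny application: a directory containing only __main__.py.
        source_dir = os.path.join(workdir, "hello_app")
        os.mkdir(source_dir)
        with open(os.path.join(source_dir, "__main__.py"), "w") as handle:
            handle.write('print("hello from a single-file archive")\n')

        # Package the directory into one file that Python can execute.
        archive_path = os.path.join(workdir, "hello_app.pyz")
        zipapp.create_archive(source_dir, target=archive_path)

        # Only an interpreter is needed on the machine that runs the file.
        result = subprocess.run(
            [sys.executable, archive_path],
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
        return result.stdout.strip()


if __name__ == "__main__":
    print(build_and_run_archive())
```

Because the artifact is one file, it can be copied into a Docker image, wrapped in a Debian package or uploaded to object storage without any install step on the target machine.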

Our multi-format package release process is powered by a Jenkins script (as shown in Figure 5). It uses the project name and release type to generate the necessary files (Dockerfile, Debian package, Python egg, PEX binary) and makes the builds available for deployment by uploading them to their respective repos. The release process not only relieves our developers of having to understand Docker, Debian package management or the Python egg format, but it also enforces hygienic and secure package management practices.

Figure 5: The Jenkins release workflow takes the package name and release type as input and generates a Docker container, a Debian package, a PyPI package or a PEX binary.
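The shape of such a release step, with a project name and a release type going in and a list of artifacts coming out, can be sketched as a simple dispatch table. Everything in this sketch is hypothetical: the release type names, the artifact file names and the function are illustrations, not the contents of our Jenkins script.

```python
# Hypothetical mapping from release type to the artifacts that get generated.
RELEASE_ARTIFACTS = {
    "pex": ["{project}.pex"],
    "docker": ["{project}.pex", "Dockerfile"],
    "debian": ["{project}.pex", "{project}.deb"],
    "egg": ["{project}.egg"],
}


def plan_release(project, release_type):
    """Return the artifact names a release of `release_type` would produce."""
    try:
        templates = RELEASE_ARTIFACTS[release_type]
    except KeyError:
        raise ValueError(
            "unknown release type {!r}; expected one of {}".format(
                release_type, sorted(RELEASE_ARTIFACTS)
            )
        )
    return [template.format(project=project) for template in templates]


if __name__ == "__main__":
    print(plan_release("srebot", "docker"))
```

Keeping the mapping in one table means developers only choose a release type, while the packaging details stay in one shared place.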

Conclusion

With this development setup, we take care of all the boilerplate a developer would otherwise write before starting work on a project. This helps our developers focus on code without having to worry about setup.py, tox or virtualenv. It also eliminates the need to create scripts to set up and run the project locally, scripts to release Docker or Debian packages, or scripts to test code locally or in Jenkins. We rolled out Python commons almost a year ago and have already migrated 35 projects to it.

Acknowledgements: We’d like to thank Evan Jones, Yongwen Xu and Nick Zheng for their help and feedback on the project. We’d also like to thank the Pants community for their support.