Type annotation via automated refactoring

Jimmy Lai
Published in Building Carta · Oct 11, 2022

Python is a powerful language. It is easy to code and has broad community support, so many companies, large and small, use it extensively. But as a dynamically typed language, Python code can suffer from type errors when an operation fails because a value is not of the expected type.

Carta’s Python codebase has grown steadily since we launched in 2012, but our 2+ million lines have suffered more and more type errors in production: things like Python’s TypeError, AttributeError, and ValueError. We hit more than 12,000 of them in just one seven-day period in November 2021. Type errors not only degrade our customers’ experience, but also distract developers and hinder their progress.

Of course, we wanted to catch type errors as early in the development cycle as possible. Carta made the mypy type checker an option for individual teams in 2018, but the burdensome manual work of adding missing type annotations meant that few adopted it. Mypy would have generated an unmanageable number of errors if we had required it universally.

Carta’s infrastructure team saw this as an opportunity to improve the quality of life for our engineers and enhance our overall code quality and ownership. We built an automated refactoring framework to add those missing types, particularly useful for repaying tech debt in a large codebase. We estimate that it has saved over four years of manual work and greatly helped engineers focus on product development.

Measuring the problem

To get mypy working, we needed to provide type annotations in each function definition. We began by measuring how much work was needed: a weekly ETL job, running as an Airflow DAG, used LibCST to analyze every function in every commit across our more than 100 code repositories, recording data for every parameter and return type. The historical data was available in a Looker dashboard.
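
Our production job uses LibCST, but the core measurement can be sketched with the standard library’s ast module. This is a simplified illustration, not our actual ETL code; for instance, it ignores positional-only arguments and treats self/cls as exempt:

```python
import ast

def type_coverage(source: str) -> float:
    """Fraction of annotation slots (parameters + returns) that are filled."""
    tree = ast.parse(source)
    total = annotated = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for arg in node.args.args + node.args.kwonlyargs:
                if arg.arg in ("self", "cls"):
                    continue  # conventionally left unannotated
                total += 1
                annotated += arg.annotation is not None
            total += 1  # the return-type slot
            annotated += node.returns is not None
    return annotated / total if total else 1.0
```

Running this over every file in every commit, then aggregating per repository, yields the trend lines we charted in Looker.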

We suspected that the most challenging of our more than 100 Python codebases would be our monolith. It was (and remains) our most active GitHub repository, with over 1,000 commits per week from more than 200 engineers. We found that out of its 120,000 Python functions, only 14% were fully typed. The scale of the problem was massive. If we could add 30 annotations a day, it would take over 10 years to add them all.

The automation

We identified several common scenarios our automation could easily correct:

  • Functions with no return statement (such as __init__) should have a return type of None.
  • Functions implementing built-in interfaces should always return the same type (__str__ should return str; __hash__ should return int).
  • Functions that return simple values should have an appropriate return type (for example, a function that returns only True or False should return bool).

In 2021, we built a tool to enforce Black formatting in our codebase. It let us go from 5% Black coverage to 100% in just four months, quickly giving us consistent formatting and better readability. Its CircleCI pipeline fully automated integration tests, added code reviewers, merged pull requests after approval, and closed pull requests when it detected conflicts.

Implementing Black taught us the importance of making minimally disruptive changes and accounting for shared code ownership. At first, the automation opened large pull requests, but they were hard to review; making them smaller generated too many of them. Through experimentation, we found pull requests of 100–200 lines to be a good compromise. We also accounted for GitHub’s CODEOWNERS classification and opened pull requests for single teams whenever possible. Small, targeted pull requests were easier to review and less likely to cause conflicts, and we could open them at an adjustable but steady cadence. We were confident we could adapt this tool to identify and add missing types.

Our automated refactoring handles the steps in red, shepherding the pull request until it can ultimately merge it.

Our automation integrates several open source tools. It uses GitPython to check out new working branches, make commits with templated messages, and rebase branches. It interacts with GitHub through PyGithub to create pull requests with templated content, add reviewers, and merge pull requests. We also use LibCST for static analysis and for refactoring Python code via its concrete syntax tree: we collect information in the visit_ functions for different CSTNode types in a CSTTransformer, then apply refactorings in the leave_ functions.

The automation pipeline consists of small periodic jobs that each perform a simple task to move pull requests from one state to the next. The life cycle of a code change starts when a refactoring application runs on a code path and creates a pull request. One task adds reviewers based on CODEOWNERS or commit history. Another sends Slack notifications or adds more reviewers when a review has been pending for too long. These independent periodic jobs work like a production line that creates and merges automated refactoring changes in our codebase.
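
A minimal sketch of this production-line idea follows; the states, field names, and transition rules are hypothetical simplifications, not our actual job code:

```python
from enum import Enum, auto

class PRState(Enum):
    CREATED = auto()
    TESTS_PASSED = auto()
    IN_REVIEW = auto()
    APPROVED = auto()
    MERGED = auto()
    ABANDONED = auto()

def advance(pr: dict) -> dict:
    """One periodic-job pass: move the pull request at most one state forward."""
    state = pr["state"]
    if state is PRState.CREATED and pr.get("ci_green"):
        pr["state"] = PRState.TESTS_PASSED
    elif state is PRState.TESTS_PASSED:
        # In the real pipeline, reviewers come from CODEOWNERS or commit history.
        pr["reviewers"] = pr.get("reviewers") or ["codeowner"]
        pr["state"] = PRState.IN_REVIEW
    elif state is PRState.IN_REVIEW and pr.get("approved"):
        pr["state"] = PRState.APPROVED
    elif state is PRState.APPROVED:
        pr["state"] = PRState.ABANDONED if pr.get("conflict") else PRState.MERGED
    return pr
```

Because each pass is small and idempotent, the jobs can run on independent schedules and the pipeline tolerates any one of them failing and retrying later.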

Automating a discrete code change is relatively straightforward, but that was only a small part of our overall work. Our goal was to make it more generic and more broadly useful as a framework. A new refactoring application can simply implement the provided APIs and reuse the automated full pull request life cycle — including running continuous integration tests, adding code reviewers, merging pull requests after approval, or abandoning them when conflicts occur. Our system automatically promotes a pull request for review after it passes all tests and adds reviewers based on CODEOWNERS or recent commit history.

We knew that some tests could be affected by temporary problems, so our system automatically retries failed tests after 24 hours. For pull requests that remain unreviewed for too long, the system adds additional reviewers; if a pull request is still pending review after a set length of time, the system sends automated reminders to the code owner’s Slack channel.

Our automation also accounts for pull requests that need to be rebased on master to rerun tests. When it detects a merge conflict, it abandons the pull request and generates a new one based on the latest code. As engineers approve the automation’s pull requests and their files become fully typed, they can manually enroll them in mypy (and we’re working on automating that, too). Initially, we processed our code paths randomly, knowing that most of the codebase was untyped. As we approached 50% coverage, we began using recurring Airflow jobs to prioritize files with the most missing types.

How we built our refactoring apps

We used static analysis to build our initial set of refactoring applications, then iteratively built more sophisticated ones. For the __init__ case, we used LibCST’s matchers to find all __init__ functions and added None as their return annotation. For the simple-return case, we analyzed all return statements in a function to make sure their return values all belonged to the same type, then added that single type as the return annotation. In the example of get_val, we could add str as the return type. In the example of get_test_foo, we could add TestFoo as the return type and also add the future annotations import for the forward reference.
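
The embedded code examples from the original post aren’t reproduced here, so the get_val body below is hypothetical. Our real tool uses LibCST, which preserves formatting and comments; this stdlib-ast sketch (which does not) just illustrates the mechanics of attaching an inferred annotation:

```python
import ast

class SimpleReturnAnnotator(ast.NodeTransformer):
    """Add `-> str` when every return statement returns a str literal."""
    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        self.generic_visit(node)
        returns = [n for n in ast.walk(node) if isinstance(n, ast.Return)]
        if node.returns is None and returns and all(
                isinstance(r.value, ast.Constant) and isinstance(r.value.value, str)
                for r in returns):
            node.returns = ast.Name(id="str")
        return node

# Hypothetical stand-in for the post's get_val example.
source = '''
def get_val(flag):
    if flag:
        return "on"
    return "off"
'''
tree = SimpleReturnAnnotator().visit(ast.parse(source))
annotated = ast.unparse(ast.fix_missing_locations(tree))
```

After the transform, the function signature reads `def get_val(flag) -> str:`.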

When their types were easy to determine, we could also add the missing parameter types. For example, we use the pytest mocker parameter frequently in our codebase, so we could add MockerFixture as a type and also add the import statement if needed.

For more complex functions that couldn’t be handled by static analysis, we used MonkeyType to collect types from running tests and applied the collected types, such as the create_profile example below.

We also made some enhancements to MonkeyType. Our version rewrites module names to the full package name when the name is incomplete, and removes duplicate names when MonkeyType adds an import that already exists. With those enhancements, we fix common linter errors like Flake8 F401 (imported but unused), F811 (redefinition of unused name), and F821 (undefined name), so the automated refactoring changes don’t require further manual work before review.

Giving our developers a framework

As we enhanced our automation to add those 100,000 missing types, we refactored the framework, modularized it for reusability, and added more unit tests. We added APIs that our developers can use for their own automation, and added support for other languages.

Our ink team, responsible for our design system, uses our automation to replace outdated legacy components with their modern counterparts. They run JavaScript codemods when replacing legacy components and when updating the API of a component. Previously, they ran the codemods locally, split up changes by team domain, created pull requests, determined who the appropriate code owners were, and closely coordinated reviews by those people. Automated refactoring eliminated all of that manual effort. (For more about ink, see Jessica Appelbaum’s post about headless APIs in design systems.)

Encouraging participation with gamification

The automation’s pull requests are extra work for our developers, so we knew we needed to solicit their help in order to get this initiative going.

We’re heavy Slack users here at Carta, and it can be challenging to sift through so many existing channels and message threads. We believed that it was important to come up with something more eye-catching than your standard automatic notification. We built a weekly leaderboard to highlight and recognize those who added missing types and reviewed automated pull requests, giving wide visibility into how much progress everyone is making (and what still remains).

Weekly automated Slack messages summarize progress and recognize top contributors.

Our impact so far

Since August 2021, the automation has handled more than 2,000 merged pull requests, adding missing types to over 50,000 functions. We have drastically reduced our overall type errors, from a weekly average of about 8,000 to fewer than a thousand now. We have increased type coverage from a mere 14% to 67%, and it is steadily growing.

Type annotation coverage in our monolith codebase since 2018

Given how complex our code is, we may never reach full coverage, but being relentless is one of Carta’s identity traits: we will move as close as possible to 100% coverage by the end of 2022 and continue to improve our code’s readability and ownership. The strong relationships we have built with our colleagues will help us improve the automation as we move into its second phase. We want to increase mypy enrollment to 100% and, demonstrating another Carta identity trait (being helpful), find more tech debt the automation can help pay down, like API migrations and removing deprecated code.

If you’d like to learn more about our refactoring automation, EuroPython has posted the video of a presentation I gave at their 2022 convention. And if you’d like to help us solve large engineering problems, we’re hiring!
