Automating refactoring across teams and projects

Conquering breaking changes chaos: Leveraging automatic refactoring tools across teams and projects.

Moenes Bensoussia
datamindedbe
7 min read · Jul 12, 2023


Dependencies need updating when new features are released, when security fixes become available, or simply to keep future upgrades manageable. Generally, the longer an update is delayed, the harder it becomes, because more changes pile up between versions. Even after a small release, you have probably had to fix breaking changes and adjust configurations. This work is time-consuming and repetitive, and it can take up a significant portion of the maintenance budget.

As software evolves, breaking changes emerge. This applies to your own software as well as to popular, well-known libraries. At some point, you may have had to manually replace calls to DataFrame.append with pandas.concat across multiple repositories, after DataFrame.append was deprecated in pandas 1.4.0 and later removed in pandas 2.0. Or you may have needed every team in your organization to move to the latest version of an internal shared library so that you could stop supporting older versions.

Your favorite IDE can help you refactor a single project, after which you still have to build, test and deploy the new artifact. That is already a lot of manual and repetitive work for one project. Imagine how tedious it becomes to repeat the same process across tens or even hundreds of projects.

Wouldn’t it be useful to have a tool that lets you define specific update rules and applies them automatically across all your projects and repositories? It would save you from the tedious, manual work of updating each repository by hand.

At Data Minded, we prioritize automation of various tasks, including automated testing, continuous deployment, and automatic dependency updates with tools like Dependabot. We have observed that, at our clients, updates are currently carried out ad hoc on multiple projects, which requires a significant amount of manual and repetitive effort. As a result, we are working on an R&D project to build a tool that helps our Data Engineers automate codebase updates whenever they are needed.

This blog post explores a few powerful libraries for automating Python code updates and showcases some of the work we have been doing.

To compare the refactoring libraries, we need a demo Python file in which we introduce some small breaking changes, so we can see how each library helps automate the code update.

Who doesn’t enjoy working on pet store apps?

from typing import Optional

from fastapi import FastAPI
from petstore_lib.manage_pets import ManageCats
from pydantic import BaseModel

app: FastAPI = FastAPI()
cats = ManageCats()

class RegisterPet(BaseModel):
    pet_name: str
    age: int
    pet_type: Optional[str] = "cat"

@app.post("/register_pet")
async def register_pet(register_data: RegisterPet):
    cats.register_cat(register_data)
    return f"{register_data.pet_name} added to the app"

Imagine your team is working on a pet store application built as microservices, which relies heavily on a library called petstore_lib to manage the pet store. Initially, the library only supported cats. However, a new version has been released 🎉, extending its support to all kinds of pets.
Unfortunately, the new release introduced some breaking changes, and your codebase needs an update.

In the code example of the pet registration microservice, the ManageCats class must be renamed to ManagePets and the register_cat method to register_pet. However, the end goal is not to update only this Python file or this microservice, but to make the same changes across all microservices.
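To make the "across all microservices" part concrete, here is a minimal sketch of the outer loop such a tool needs: walk several repository checkouts and apply one transformation to every Python file. The names apply_to_repos and naive_rename are hypothetical and only serve this illustration. The transform shown is a plain whole-word text replacement, a placeholder for the syntax-aware libraries discussed below, since it would also rewrite matches inside comments and strings.

```python
import re
from pathlib import Path
from typing import Callable, Iterable, List


def apply_to_repos(repo_roots: Iterable[Path], transform: Callable[[str], str]) -> List[Path]:
    """Apply `transform` to every .py file under each root; return the files that changed."""
    changed = []
    for root in repo_roots:
        for path in sorted(Path(root).rglob("*.py")):
            source = path.read_text(encoding="utf-8")
            updated = transform(source)
            # Only rewrite files whose content actually changed.
            if updated != source:
                path.write_text(updated, encoding="utf-8")
                changed.append(path)
    return changed


def naive_rename(source: str) -> str:
    """Placeholder transform: whole-word textual rename, unaware of syntax or scope."""
    renames = {"ManageCats": "ManagePets", "register_cat": "register_pet"}
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, renames)) + r")\b")
    return pattern.sub(lambda match: renames[match.group(1)], source)
```

Calling apply_to_repos with a list of checkout directories and naive_rename rewrites the files in place and returns the paths it touched. Swapping naive_rename for a transformation built on one of the libraries below keeps the loop unchanged.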

To minimize manual and repetitive work, how could the maintainers of petstore_lib automate the migration and make it a better experience for their users?

Let’s check a few candidates that could be useful for this use case.

Possible tools

1. Bowler

Bowler is built on top of “fissix” (a fork of “lib2to3”) and allows developers to define transformation rules and automate repetitive tasks like renaming variables and applying consistent formatting across the codebase. With Bowler, developers can perform large-scale renaming, ensuring that changes are propagated consistently across files.

from bowler import Query

(
    Query("path/to/python_file")
    .select_var("ManageCats")
    .rename("ManagePets")
    .execute()
)
(
    Query("path/to/python_file")
    .select_var("register_cat")
    .rename("register_pet")
    .execute()
)

Using Bowler, you only need to specify the path of your Python files, the target class/method name, and the new name using the Query object.

Bowler stands out as one of the most user-friendly refactoring libraries available in Python. Its API is intuitive and straightforward, yet powerful.

Unfortunately, lib2to3 is deprecated in recent Python versions, and Bowler is currently regarded as an “inactive package” in terms of development and maintenance. The future of Bowler is uncertain, and it is unclear whether the library will receive further updates or support. Be aware of this status and consider carefully before relying on it in the long term.

2. Rope

Another popular refactoring library, Rope, enables code navigation, refactoring, and generation. It provides advanced features such as method extraction, identifier renaming, and fixing imports.

By integrating with IDEs, Rope enhances code analysis and refactoring capabilities, allowing developers to make changes more efficiently.

from rope.base.libutils import path_to_resource
from rope.base.project import Project
from rope.refactor.rename import Rename

rename_project = Project("path/to/module")
resource = path_to_resource(rename_project, "path/to/python_file")

# Rope identifies the symbol to rename by its character offset in the file.
class_offset = resource.read().index("ManageCats")
class_refactor = Rename(rename_project, resource, class_offset)
class_changes = class_refactor.get_changes("ManagePets")
rename_project.do(class_changes)  # apply the computed change set

# Re-read the file so the offset reflects its current contents.
# In the demo file the method is called as "cats.register_cat".
method_offset = resource.read().index("cats.register_cat") + len("cats.")
method_refactor = Rename(rename_project, resource, method_offset)
method_changes = method_refactor.get_changes("register_pet")
rename_project.do(method_changes)

rename_project.close()

To rename a class or a method using Rope, you need to specify the module and the Python file. The challenging part is finding the correct offset of the piece of code you want to update. Once it is determined, you can use the Rename object from Rope.

Working with Rope is definitely a bit more complicated than working with Bowler. Rope is primarily designed to be used as a plugin for editors, and its refactorings are based on file offsets that represent the user’s edit cursor. For programmatic refactoring, you will need something more robust than str.index() and a bunch of offset math.
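One possible way to get more robust offsets is to let Python’s own ast module locate the identifier and then convert its line and column into a character offset. The sketch below is an illustration under stated assumptions, not part of Rope: identifier_offsets is a hypothetical helper, and it assumes ASCII source with "\n" line endings, because ast reports columns as UTF-8 byte offsets.

```python
import ast
from typing import List


def identifier_offsets(source: str, identifier: str) -> List[int]:
    """Return character offsets of every name or attribute access called `identifier`."""
    # Absolute start of each line, so (lineno, column) can be turned into one offset.
    line_starts = [0]
    for line in source.split("\n"):
        line_starts.append(line_starts[-1] + len(line) + 1)

    offsets = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id == identifier:
            offsets.append(line_starts[node.lineno - 1] + node.col_offset)
        elif isinstance(node, ast.Attribute) and node.attr == identifier:
            # The attribute name is the last token of the node, so count back from its end.
            end = line_starts[node.end_lineno - 1] + node.end_col_offset
            offsets.append(end - len(identifier))
    return sorted(offsets)
```

Unlike str.index(), this skips occurrences inside comments and string literals, and any returned offset can be passed as the third argument to Rename. Names that appear only in import statements are not reported, since ast stores them as alias nodes rather than Name nodes.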

3. LibCST

LibCST is a compromise between an Abstract Syntax Tree (AST) and a traditional Concrete Syntax Tree (CST). If you are not familiar with syntax trees, I recommend referring to the provided document for a detailed comparison.
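To see why a plain AST is not enough for refactoring, here is a small sketch that uses only the standard library (ast.unparse requires Python 3.9 or later). It parses a snippet and generates code again without changing anything. The snippet itself is made up for this illustration.

```python
import ast

source = (
    "cats = ManageCats()  # legacy name, rename after the upgrade\n"
    "pets = { 'cat' : 1 }\n"
)

# Parse to an AST and turn it straight back into source code.
round_tripped = ast.unparse(ast.parse(source))
print(round_tripped)
```

The comment is gone and the spacing inside the dictionary literal has been normalized, even though no transformation was applied. A tool built on the AST alone would introduce this kind of noise into every file it touches, which is the problem that a tree retaining formatting details is meant to avoid.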

It provides a robust framework for parsing, analyzing, and modifying Python source code while enabling precise and fine-grained manipulation of code structures. Its APIs allow developers to filter, modify nodes, and generate modified code.

from typing import Union

from libcst import (
    Attribute,
    BaseExpression,
    Call,
    CSTTransformer,
    FlattenSentinel,
    ImportAlias,
    Name,
    RemovalSentinel,
    parse_module,
)


class RenameTransformer(CSTTransformer):
    def __init__(self, changes_dict: dict):
        super().__init__()
        self.changes_dict = changes_dict

    def leave_ImportAlias(
        self, original_node: ImportAlias, updated_node: ImportAlias
    ) -> Union[ImportAlias, FlattenSentinel[ImportAlias], RemovalSentinel]:
        # Rename imported names, e.g. "... import ManageCats".
        if original_node.name.value in self.changes_dict:
            return updated_node.with_changes(
                name=Name(self.changes_dict[original_node.name.value])
            )
        return updated_node

    def leave_Attribute(
        self, original_node: Attribute, updated_node: Attribute
    ) -> BaseExpression:
        # Rename attribute accesses, e.g. "cats.register_cat".
        if original_node.attr.value in self.changes_dict:
            return updated_node.with_changes(
                attr=Name(self.changes_dict[original_node.attr.value])
            )
        return updated_node

    def leave_Call(self, original_node: Call, updated_node: Call) -> BaseExpression:
        # Rename direct calls such as "ManageCats()", which are neither imports nor attributes.
        func = updated_node.func
        if isinstance(func, Name) and func.value in self.changes_dict:
            return updated_node.with_changes(func=Name(self.changes_dict[func.value]))
        return updated_node


with open("path/to/python_file", "r") as source_file:
    module = parse_module(source_file.read())

new_code = module.visit(
    RenameTransformer({"ManageCats": "ManagePets", "register_cat": "register_pet"})
).code

To use LibCST effectively, you need a class that extends CSTTransformer and defines the logic of the code transformation. This means specifying which classes and methods require refactoring and which kinds of nodes the transformation should visit.

Compared to Bowler and Rope, LibCST is more complicated to work with. It takes some time to understand the API and its concepts. If you are new to the library or to syntax-tree-based code manipulation, expect a learning curve before you can use its features effectively.

Time for a real world example

Let’s have a look at a real-world use case in our industry. Apache Airflow is a Python-based workflow manager heavily used in the data engineering landscape. A while back Apache Airflow went from version 1 to version 2.

At Conveyor, we provide a managed Airflow solution that follows an evergreen strategy: our customers always have access to the latest version of Airflow while we take care of the migration process. We wanted to be ready for this Airflow upgrade and make it easy for ourselves to define automated updates across multiple customers, teams and repositories. So we created a tool that harnesses the capabilities of LibCST and is more user-friendly.

Let’s check how our tool behaves when we need to migrate a dummy Airflow 1 DAG to Airflow 2. To do so, we need to update the import paths of the BashOperator and the DummyOperator, and also rename the schedule_interval argument.

from datetime import datetime, timedelta

from airflow.operators import BashOperator
from airflow.operators.dummy_operator import DummyOperator

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 6, 13),
    'retries': 3,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG(
    'example_dag',
    default_args=default_args,
    schedule_interval='@daily'
)

dummy_task = DummyOperator(
    task_id='dummy_task',
    dag=dag
)

bash_task = BashOperator(
    task_id='bash_task',
    bash_command='echo "Hello World!"',
    dag=dag
)

dummy_task >> bash_task

LibCST is a powerful tool, but a bit complicated. When an update is needed, you would have to spend some time getting the hang of the API and then write hundreds of lines of LibCST code for the update. We found that a tool that abstracts LibCST could help, by letting our Data Engineers focus on defining the changes for the update rather than working with syntax trees and nodes, or making the changes manually.

Here is a brief demonstration of the tool.

(
    Refactor()
    .rename_param("schedule_interval", 'schedule')
    .rename_import_attribute("dummy_operator", "dummy")
    .add_import_attribute("BashOperator", "bash_operator")
    .describe_changes()
)

And voilà!

-schedule_interval='@daily'
+schedule='@daily'
-from airflow.operators.dummy_operator import DummyOperator
+from airflow.operators.dummy import DummyOperator
-from airflow.operators import BashOperator
+from airflow.operators.bash_operator import BashOperator
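A dry-run report like the one above can be produced with the standard library alone, once a transformation has returned the rewritten source. The sketch below is not the implementation of our tool. render_diff is a hypothetical helper that only shows how difflib can describe a proposed change without writing anything to disk.

```python
import difflib


def render_diff(before: str, after: str, filename: str) -> str:
    """Render a unified diff of a proposed rewrite without touching the file."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    # unified_diff yields nothing when both versions are identical.
    return "".join(diff)
```

Printing such a diff for every file before applying the changes gives reviewers a chance to inspect the migration, for example in a pull request, before anything is modified.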

If you were to update DAGs manually across many customers, teams and projects, it would become a time-consuming and tedious task, especially when dealing with hundreds of DAGs spread across multiple projects. The time needed to update each DAG by hand could also be significant, and would depend on factors such as the complexity of the DAGs, the number of occurrences to change and the overall size of the codebase.

Conclusion

This blog post highlights the challenges of adjusting to updates when dealing with multiple repositories and how that can be time-consuming and frustrating.

Additionally, I explored several popular Python refactoring libraries. In the current Python landscape, I would personally continue exploring LibCST, since it provides a robust and stable framework with fine-grained control over your syntax tree.

Finally, I presented some of the work our team has accomplished so far in this field. The open-source tool is just around the corner, ready to revolutionize the way you manage and maintain your projects.
