Welcome to pre-commit heaven

Raphaël Hoogvliets
Published in Marvelous MLOps
9 min read · Jun 28, 2023

Stay tuned for the best pre-commit hooks for Python and MLOps below.

If you’re applying CI and deployment best practices, chances are you’re using a linter or two. Linters were originally used to check source code for programmatic and stylistic errors. When the code adhered to all the rules in the linter, it would pass; otherwise the check would fail. Over the years linters got more and more complex and, to be honest, they got a little out of hand, in a good way. At some point developers figured out that a useful way of using linters is to run them before you actually commit code. This ensures that “bad” code never makes it to the repository, and voilà: the pre-commit hook was born.

Nowadays pre-commit hooks are used in a variety of ways, but most notably in two ways:

  1. Running checks before code is allowed in local and/or remote repositories
  2. Running checks on code that passes through a pipeline

In both cases, if not all the hooks (code checks) pass successfully, the code is not allowed in the repo or through the pipeline. Of course there are many different use cases and exceptions to think of, but the general idea is: if our tests fail, we halt the process. And what could be nicer than having a vast open-source ecosystem of thousands of hooks and code protocols automatically checking our code, making sure it is consistent, readable, secure, and up to standards and best practices? Personally I could not think of anything. I love me some pre-commit hooks! And boy are we lucky to live in this day and age where there seems to be a pre-commit hook out there for everything.

Well, not everything, but the offering is amazingly rich and growing. On top of that, every self-respecting hook has plenty of configuration options, allowing it to be tailored to every need and allowing for hundreds of thousands of combinations. With so many options out there, what hooks should you be using? And where to start? To help you upgrade your CI practices we have made a selection of some of the best and easiest-to-use hooks for Python and MLOps out there. To make sense of the types of hooks available we have divided them into five categories: guard rails, formatters, code checkers, code correctors and git helpers.

Note that in the code examples below the hooks and their arguments are presented as part of the .pre-commit-config.yaml file, which the pre-commit package uses to run all hooks automatically. This is not applicable to every use case, but in our experience it makes life easier in most. The configuration file ensures that different hooks, from different suppliers, can easily be managed and configured in one central place as part of the project and repository.

Guard rails

Guard Rails hooks focus on enforcing certain rules and safeguards to ensure code quality, security, and adherence to best practices. They include checks such as verifying file formats, syntax validation, and executable script validations. These hooks help catch potential issues early on and provide safeguarding to prevent common mistakes or vulnerabilities. Some of our favourite guard rail hooks are:

  • check-ast: Checks the syntax and structure of Python code using the Abstract Syntax Tree (AST).
  • check-added-large-files: Ensures that large files are not accidentally added to the repository.
  • check-json: Validates JSON files for syntax errors.
  • check-toml: Validates TOML files for syntax errors.
  • check-yaml: Validates YAML files for syntax errors.
  • check-shebang-scripts-are-executable: Verifies that shebang scripts (scripts starting with #!) are executable.
  • bandit: Checks for common security issues in Python code.

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.3.0
    hooks:
      - id: check-ast
      - id: check-added-large-files
      - id: check-json
      - id: check-toml
      - id: check-yaml
      - id: check-shebang-scripts-are-executable
  - repo: https://github.com/PyCQA/bandit
    rev: 1.7.4
    hooks:
      - id: bandit
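
To give a feel for what a guard rail does, here is a minimal sketch of the idea behind check-ast and check-json. It is not the hooks' actual implementation: the function names are made up for illustration, and the real hooks operate on the files staged for commit rather than on strings.

```python
import ast
import json


def python_syntax_ok(source):
    """Return True if the source parses into an AST (the check-ast idea)."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def json_syntax_ok(text):
    """Return True if the text is valid JSON (the check-json idea)."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True
```

A guard rail only answers yes or no. It never edits the file, which is what sets it apart from the formatters in the next section.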

Formatters

Formatters are hooks that automatically format code according to specific style guidelines. They help maintain consistent code formatting within a project, ensuring readability and adherence to the chosen coding conventions. These hooks can fix issues such as trailing whitespace, mixed line endings, and missing end-of-file markers. Popular formatters like black and isort are commonly used in this category.

  • end-of-file-fixer: Adds an end-of-file marker at the end of files if missing.
  • mixed-line-ending: Corrects inconsistent line endings in files.
  • trailing-whitespace: Removes trailing whitespace at the end of lines.
  • black: Automatically formats Python code according to the Black code style guidelines. Many of the guidelines can be adjusted by arguments such as line length in the example below.
  • black-jupyter: Formats Jupyter notebooks using the Black code style guidelines.
  • isort: Sorts Python imports according to the defined style, in this case the "black" profile, to make it compatible with the black hook above.

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.3.0
    hooks:
      - id: end-of-file-fixer
      - id: mixed-line-ending
      - id: trailing-whitespace
  - repo: https://github.com/psf/black
    rev: 22.10.0
    hooks:
      - id: black
        language_version: python3.11
        args:
          - --line-length=128
      - id: black-jupyter
        language_version: python3.11
  - repo: https://github.com/pycqa/isort
    rev: 5.11.5
    hooks:
      - id: isort
        args: [ "--profile", "black" ]
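
As an illustration of what the simpler formatters do, the sketch below mimics trailing-whitespace and end-of-file-fixer on a string. It is a simplified stand-in with hypothetical function names, and it assumes LF line endings; the real hooks rewrite files on disk and handle more edge cases.

```python
def strip_trailing_whitespace(text):
    """Remove spaces and tabs at the end of every line."""
    return "\n".join(line.rstrip(" \t") for line in text.split("\n"))


def fix_end_of_file(text):
    """Make non-empty text end with exactly one newline."""
    stripped = text.rstrip("\n")
    return stripped + "\n" if stripped else ""
```

Both functions are idempotent: running them on already formatted text changes nothing. That is the property that lets a formatter hook pass on the second attempt after it has modified your files on the first.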

Code checkers

Code Checkers hooks are focused on analyzing code for potential issues, such as code smells, anti-patterns, and common mistakes. They perform static analysis on the codebase to catch problems like missing docstrings, debug statements, or violations of code style guidelines. Tools like Flake8 and MyPy are commonly used in this category to provide comprehensive code checking capabilities. And for good reason! Flake8 and MyPy are the heavy hitters of code checking, combining for hundreds of rules when used with plugins. We recommend extending Flake8 with Bugbear, Comprehensions and Simplify as a best practice.

  • check-docstring-first: Catches the common error of placing code before the module docstring.
  • debug-statements: Identifies leftover debugger imports (pdb and the like) and breakpoint() calls.
  • flake8: Performs various code checks and style enforcement. The wonderful thing about Flake8 is that it returns the code and description of the violated rule, along with the file path, line and column where it was found, which many editors and terminals turn into a clickable link.
  • mypy: Type checks Python code using the MyPy static type checker. We recommend ignoring missing imports in MyPy (the --ignore-missing-imports flag).

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.3.0
    hooks:
      - id: check-docstring-first
      - id: debug-statements
  - repo: https://github.com/pycqa/flake8
    rev: 5.0.4
    hooks:
      - id: flake8
        args:
          - "--max-line-length=128"
        additional_dependencies:
          - flake8-bugbear
          - flake8-comprehensions
          - flake8-simplify
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v0.991
    hooks:
      - id: mypy
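
To make the value of a plugin like Bugbear concrete, here is a small example of a pattern it reports: a mutable default argument (rule B006). The function names are invented for this illustration.

```python
def collect_bad(item, bucket=[]):
    # flake8-bugbear reports B006 here: the default list is created once,
    # at definition time, and shared between all calls.
    bucket.append(item)
    return bucket


def collect_good(item, bucket=None):
    # A None default plus a fresh list per call avoids the shared state.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket
```

Calling collect_bad(1) and then collect_bad(2) returns [1, 2] the second time, because both calls appended to the same list. The code runs without any error, which is exactly why a static checker is the right tool to catch it.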

Code Correctors

Code Correctors hooks aim to automatically correct or upgrade code to improve code quality, maintainability, and compatibility. These hooks can perform automated refactoring, apply modern syntax improvements, and fix common code issues. They help streamline the codebase by automatically applying changes that would otherwise require manual intervention.

  • pyupgrade: Upgrades Python code to newer syntax versions. In the example below specifically targeting Python 3.11 and above.
  • yesqa: Automatically removes unnecessary # noqa comments. These comments can be used to ignore lines of code that fail linting. When the lines are passing the comment is removed.
  • pycln: Removes all unused import statements. Be wild, live a little!

repos:
  - repo: https://github.com/asottile/pyupgrade
    rev: v3.7.0
    hooks:
      - id: pyupgrade
        args: [--py311-plus]
  - repo: https://github.com/asottile/yesqa
    rev: v1.4.0
    hooks:
      - id: yesqa
        additional_dependencies: &flake8_deps
          - flake8-bugbear==22.8.23
          - flake8-comprehensions==3.10.0
          - flake8-docstrings==1.6.0
  - repo: https://github.com/hadialqattan/pycln
    rev: v2.1.1
    hooks:
      - id: pycln
        args: [--all]
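
The example below sketches the kind of rewrites pyupgrade performs. The "before" version is shown in comments and the "after" version is the runnable code. The Report class is made up for this illustration, and the exact output may differ between pyupgrade versions and target flags.

```python
# Before: older idioms
#
#   from typing import Dict, List
#
#   class Report(object):
#       def __init__(self, name):
#           super(Report, self).__init__()
#           self.name = name
#
#       def summary(self, scores: List[int]) -> Dict[str, int]:
#           return {"total": sum(scores)}
#
#       def title(self):
#           return "Report: {}".format(self.name)


# After: the same class with modern syntax
class Report:
    def __init__(self, name):
        super().__init__()
        self.name = name

    def summary(self, scores: list[int]) -> dict[str, int]:
        return {"total": sum(scores)}

    def title(self):
        return f"Report: {self.name}"
```

An import that is no longer needed after such a rewrite, like the typing import here, is the kind of leftover that pycln cleans up, which is one reason the two hooks pair well.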

Git helpers

Git Helpers hooks provide assistance and enforce certain rules related to the Git version control system. They help ensure that commits follow specific guidelines, such as commit message formats, and perform validations during various stages of the Git workflow. These hooks can enhance collaboration, code review processes, and maintain a consistent history within a project. Hooks like Commitizen and No-Commit-To-Branch are examples of Git Helpers.

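  • commitizen: Enforces the usage of the Commitizen commit message format for consistent and standardized commits.
  • commitizen-branch: Performs commit message validation. With the argument below it works specifically for branch pushes, but other options are available.
  • check-merge-conflict: Checks for files that still contain merge conflict markers.
  • no-commit-to-branch: Blocks direct commits to protected branches (main and master by default).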

repos:
  - repo: https://github.com/commitizen-tools/commitizen
    rev: v2.35.0
    hooks:
      - id: commitizen
      - id: commitizen-branch
        stages: [push]
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.3.0
    hooks:
      - id: check-merge-conflict
      - id: no-commit-to-branch
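
Commitizen follows the Conventional Commits style by default. The sketch below shows roughly what validating such a message involves. The pattern is a simplified, hypothetical one written for this article, not Commitizen's actual rule set, which is more complete and configurable.

```python
import re

# Simplified pattern for "type(scope): subject" on the first line.
COMMIT_PATTERN = re.compile(
    r"^(build|ci|docs|feat|fix|perf|refactor|style|test|chore|revert)"
    r"(\([\w\-]+\))?!?: \S.*"
)


def is_conventional(message):
    """Check only the first line of the commit message."""
    first_line = message.splitlines()[0] if message else ""
    return COMMIT_PATTERN.match(first_line) is not None
```

Keep in mind that commit message hooks run at a different Git stage than the regular checks, so they need to be installed for that stage as well, for example with pre-commit install --hook-type commit-msg.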

Find the code in https://github.com/marvelousmlops/precommit-heaven and delight us with a PR if you think we should add something to the collection!

Configuring and ignoring

Sometimes you might want a hook to skip a specific line of code, because you believe it is a false positive or because you have a valid reason for not following the recommended style or convention. Most hooks offer several ways of configuring this. You can, for example, configure whole files to be ignored in your setup files or configuration files, or ignore code directly inline. If you want specific lines to be passed over by hooks, you can add a comment on the line (to the right of your code).

Different hooks have different comments for ignoring, but a generic comment recognized by multiple hooks is # noqa. There are hook-specific ignore comments as well, though. In the example below we configure just the pycln hook to ignore an import statement with its own hook-specific comment, # nopycln: import.

from pandas import (  # nopycln: import
    read_csv,
    DataFrame,
    concat
)

For the same effect, the generic # noqa could be used.

from pandas import (  # noqa
    read_csv,
    DataFrame,
    concat
)

When the generic # noqa is used, multiple hooks might ignore the line. For example, flake8 is also receptive to # noqa, so tread cautiously.
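
One way to tread cautiously is to scope the comment to a single rule. The snippet below is a small illustration using flake8's F401 code ("imported but unused"); the imported modules are arbitrary picks for the example.

```python
# Blanket form: flake8 stops reporting anything on this line.
import json  # noqa

# Scoped form: only F401 is silenced, so any other violation
# on the same line is still reported.
import math  # noqa: F401
```

The scoped form documents why the line is exempt and keeps every other check active, which makes it the safer default of the two.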

War of the hooks

Different hooks might try to do the same or similar things in different ways. When this happens you can make sure your hooks are working together by unifying them with arguments. Below is a classic example of two popular hooks with different default settings for line length: black defaults to 88 characters and flake8 to 79. Note that the two arguments configuring the same thing have different names across the hooks. You could try to be clever and let the code checker run before the code formatter, but the next developer (or pipeline) using the exact same hooks would run into a problem. So make sure your hooks work as a team.

repos:
  - repo: https://github.com/psf/black
    rev: 22.10.0
    hooks:
      - id: black
        language_version: python3.11
        args:
          - --line-length=128
  - repo: https://github.com/pycqa/flake8
    rev: 5.0.4
    hooks:
      - id: flake8
        args:
          - "--max-line-length=128"

Many hooks nowadays even give a nod to each other. For example isort has a specific setting that adheres to the style of black.

repos:
  - repo: https://github.com/pycqa/isort
    rev: 5.11.5
    hooks:
      - id: isort
        args: [ "--profile", "black" ]

For configuration we again recommend using the pre-commit package and configuring arguments in the .pre-commit-config.yaml ensuring clarity, oversight and version control.
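
If you want to guard against the line-length settings drifting apart again, a tiny consistency check is one option. The sketch below is a hypothetical helper, not part of pre-commit: it reads the values from the hook arguments as plain lists, copied by hand from the configuration in this section.

```python
def line_length(args, flag):
    """Return the integer N from '--flag=N' in a hook's args, or None."""
    prefix = flag + "="
    for arg in args:
        if arg.startswith(prefix):
            return int(arg[len(prefix):])
    return None


# Hook arguments as configured for black and flake8 in this section.
black_args = ["--line-length=128"]
flake8_args = ["--max-line-length=128"]

lengths_match = (
    line_length(black_args, "--line-length")
    == line_length(flake8_args, "--max-line-length")
)
```

When one of the two arguments is missing, the helper returns None for that hook and the comparison fails, which is the desired outcome because the hook would then fall back to its own default.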

When to use hooks (and when not to)

So when should you use them? Each and every hook is a story unto its own. Use cases can also greatly differ. However, we recommend using hooks in your pipelines, starting with the CI. Using hooks in your pipeline jobs at an early stage can ensure the pipeline fails early, saving valuable time and money. Ultimately, deployment costs can vary from fractions of a cent to millions, so money is not always the argument. What might be even more important is that using the right hooks will free up valuable time and headspace for developers to focus on other, more important and less obvious tasks: tasks that cannot be automated by a protocol in a hook. For example: why should a human have to check if imports across all scripts line up with the rest of the code? Such tasks can be very important, but can become tedious and underuse our cognitive capacity.
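
A fail-early CI step can be as small as the sketch below. It assumes the pre-commit package is installed in the CI environment. The run_gate and main names are made up for illustration, and most CI systems would let you call the command directly instead of wrapping it in Python.

```python
import subprocess
import sys


def run_gate(command):
    """Run a check command and return its exit code (0 means all passed)."""
    completed = subprocess.run(command)
    return completed.returncode


# The command a CI job would typically run against the whole repository.
PRE_COMMIT_COMMAND = ["pre-commit", "run", "--all-files"]


def main():
    # Propagate the exit code so a failing hook fails the pipeline job.
    sys.exit(run_gate(PRE_COMMIT_COMMAND))
```

Placing this step before the expensive jobs (builds, training runs, deployments) is what makes the pipeline fail early and cheaply.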

Stay in control

While folks out there are worried about AI taking over the world, you as a developer should worry about hooks taking over your code. Some hooks are opinionated, which makes them a trusty and enjoyable advisor. However, when those views are not aligned with your own, opinionated can quickly turn into sassy! You’ll be adding # noqa’s all over the place. Sometimes the hook can be configured efficiently in a central way with the right arguments. When it becomes too much though, you have to ask yourself if this hook is working for you or against you. Remember that less can be more. Happy linting!
