How to review ETL pySpark pipelines

Defining a code review process to comply with python standards, guarantee data quality and keep your code extensible.

Published in

inganalytics.com/inganalytics

6 min readMar 23, 2020

The WBAA tribe is composed of several agile self-organized teams were code reviews are used to comply with: the security four-eyes principle, facilitate knowledge sharing across the team and guarantee the extensibility /consistency of the codebase in our products. To make this process easier across teams, we standardized a general process divided into four different stages that allow validating our pipelines efficiently: pre-commit automation, CI/CD automation, business logic validation, and data quality confirmation.

Pre-commit automation

In the direction of making review tasks efficient, a local machine automated workflow is configured for each developer in the team. This implies a standardized template for project configurations and a general pre-commit workflow that validates code format and python standards, removing the human-in-the-loop and allowing the focus on business logic and architecture validation. Thus, the following steps are followed:

a. Create a general project template

For those cases when a new pipeline line is created, we share a general project template using cookiecutter. Therefore, we avoid concerns due to project structure.

b. Automate a pre-commit workflow

As part of the project template, files that configure python linters are defined. Hence, in our case, we use black and flake8 to check and enforce the code-style, format, and pep8, among others. Following you will see the template defined for our tox.ini file,

[flake8]
exclude = .git
max-line-length = 120
per-file-ignores =
    */__init__.py: D104

and .pre-commit-config.yaml

repos:
-   repo: https://github.com/ambv/black
    rev: stable
    hooks:
    - id: black
      language_version: python3.7
-   repo: https://gitlab.com/pycqa/flake8
    rev: 3.7.9
    hooks:
    - id: flake8
      args: ['--config=tox.ini']
      additional_dependencies: [
          'flake8-deprecated',
          'flake8-docstrings',
          'flake8-tidy-imports'
      ]

Then, developers working on git feature branches should pass the black format check and the flake8 general validations before the code is accepted. Also, some extra minimal flake8 validations are added to validate at least quality on the doc-strings, not deprecates usage and import cleanness. Following this structure, we guarantee a good code rate over 9.5 always while running the pylint report.

pre-commit workflow passed

pylint report

CI/CD automation

After the commit is pushed one extra stage is configured using GitLab CI/CD pipelines, the idea is to apply validations to our feature branches before this could be reviewed and merged into our master/acceptance branches. Then, the .gitlab-ci.yml, looks like this:

.my_template: &regular_job_template
  before_script:
    ...
  image: ...
  tags:
    # set runner
    - docker-tagstages:
  - lint
  - test
  - publishlint:
  <<: *regular_job_template
  stage: lint
  script:
    - flake8test:
  <<: *regular_job_template
  stage: test
  script:
    - cd lib
    - pip install -r requirements.txt --quiet
    - export PYTHONPATH="$PWD/src:$PWD/test:$PYTHONPATH"
    - coverage run
    - coverage report --fail-under=95
...

As a result, before a pipeline passes to be reviewed and merged should always comply with the lint validation, this time applied to the code into the repository as a double-check, and all the tests should pass with coverage higher than a minimum threshold of 95%.

Business logic validation

Finally, after these two previous steps passes we are ready to start validating the pipeline business logic and architecture. So, the following process is followed:

a. Business validation

As a company focused on products the first step is always to validate if the code is still solving a business need and whether in general there is a benefit added to the product. Second, during this step, we validate missing requirements and add them to a further sprint iteration. Besides, also documentation changes are validated in our case on the readme.md, answering the question: In case you join tomorrow, could you understand how to install and maintain this project?.

b. Dependencies and CI/CD structure

The first code files that are validated are those related to the python packaging configuration:

The requirements file/conda configuration is validated to confirm the correct versions of libraries are used.
The CI/CD pipeline files are validated in case of changes to confirm that they are still complying with our rules and bash commands are well defined.

c. Pipelines architecture

The pipeline architecture is validated before to start seeing the internal part of the code. Thus, we guarantee that the best approach for the use cases is always followed. Then, we validate the testability and extensibility of the pipeline by checking that patterns like the SOLID and DRY are fulfilled. In general, for python, we inspect that all the steps defined on our guide for scala pipelines are also followed. Consequently, we are validating:

Readability (even when we have black above formatting our code) and maintainability. For this, pythonic approaches are always preferred.
Testability, every important transformation should be testable in isolation.
we are not repeating ourselves when an abstraction could be applied.
Big-O complexities of algorithms are appropriate.
we are commenting only when is necessary, preferring always meaningful names for functions and variables. However, when we have comments because a further explanation is needed, they are clear and well defined.
spark good practices are used, like for instance: avoid UDFs when column operation could be applied, shuffling and joining operations are efficient.

d. Tests completeness

Even though the coverage threshold defined above > 95 % is a good warranty. The unit tests structure should be validated to see:

Every important transformation is validated against different edge cases. For this complete mock data is used on a local spark unit test.
Asserts are reliable enough to guarantee the expected behaviors.

Thus, we implement small functions using spark and chain them to make the code testable and extensible. For a real work example, you can follow the scala guide, but here a small (dummy) example using python is shown.

class GeneralDFValidator(ABC):
    """Class with general handlers functions."""

    def transform(self, f):
        """Wrap the transform spark function non available for python."""
        return f(self)

    @staticmethod
    def rename_cols(df, transformation_map):
        """Rename a set of spark columns within a df using a transformation_map dictionary.

        Example:
        df:
            +--------+--------+
            |   col1 |   col2 |
            |--------+--------+
            |     15 |     76 |
            |     30 |     97 |
            +--------+--------+
        transformation_map :
            {col1: id, col2: code}
        return:
            +--------+--------+
            |   id   |   code |
            |--------+--------+
            |     15 |     76 |
            |     30 |     97 |
            +--------+--------+
        """
        return reduce(
            lambda internal_df, col_name: internal_df.withColumnRenamed(
                col_name, transformation_map[col_name]
            ),
            transformation_map.keys(),
            df,
        )DataFrame.transform = GeneralDFValidator.transform

a small unit test for the previous function could be:

def test_rename_cols(self):
    """Test the GeneralDFHandler.rename_cols function.

    1.Assert count after applying the filter.
    2.Assert schema correctness.
    """
    df = self.df.select(Constants.SELECT_COLUMNS)
    df = df.transform(
        lambda inner_df: GeneralDFHandler.rename_cols(
            inner_df, Constants.COLUMNS_TRANSFORMATIONS_MAP
        )
    )
    expected_rows = [
        Row("samplea1", "samplea2"),
        Row("sampleb2", "sampleb2")
    ]
    expected_schema = StructType(
        [
            StructField("ROW_EXAMPLE_1", StringType(), nullable=True),
            StructField("ROW_EXAMPLE_2", StringType(), nullable=True),
        ]
    )

    self.assertEqual(df.schema, expected_schema)
    self.assertEqual(df.collect(), expected_rows)

As a result, our jobs show reports with proper test coverage.

Data quality confirmation

After a pipeline passes all the previous steps, it is merged first to an acceptance branch. Then, an external pipeline applies business rules validation to check data completeness (counts are OK) and correctness (schemas and data are OK). For this, an airflow dag runs a job that compares checksums of production data (expected) vs staging data (modified) and validates the correctness of every column in a dataset. As a result, a dashboard in superset shows data quality reports available for the team.

After this step, the code review process is finished and the code is ready to be merged and deployed in production.

Conclusion

As a very short conclusion, a good process to review spark pipelines could improve the productivity for your teams, allow them to focus on product/business capabilities and guarantee data quality.

Therefore define a multistep process that suits your team, save time and keep focused on building great products.