Transferring software into open-source space

Benjamin Gutzmann
Otto Group data.works
Jul 5, 2023

Introduction

BQuest has been our internal solution for unit-testing Google BigQuery queries for quite some time. It builds on well-known libraries such as pandas to provide an end-to-end testing experience: users define input tables (as pandas DataFrames), a BigQuery query and the expected output tables (again as pandas DataFrames), and BQuest manages all intermediate steps, including pushing the input tables to BigQuery, executing the query against them and extracting the output tables. The only remaining effort is running pandas' assert_frame_equal function against the received output. Recently we decided to open-source bquest, allowing anybody to install it, e.g. via PyPI. Now you simply run

pip install bquest

and you’re ready to write your first BigQuery tests.
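The final comparison step can be sketched with pandas alone. In the minimal example below, `actual` is a hand-written stand-in for the table a test run would return; the column names, the values and the helper `frames_match` are made up for illustration, and no bquest API is shown.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Expected output table, defined by the test author.
expected = pd.DataFrame({"customer_id": [1, 2], "revenue": [10.0, 25.5]})

# Stand-in for the query result (assumption: in a real test this
# DataFrame would be extracted from BigQuery).
actual = pd.DataFrame({"revenue": [10.0, 25.5], "customer_id": [1, 2]})


def frames_match(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Return True if both frames hold the same data, ignoring column order."""
    try:
        # check_like=True ignores the order of index and columns.
        assert_frame_equal(left, right, check_like=True)
    except AssertionError:
        return False
    return True


print(frames_match(expected, actual))
```

Inside a test you would typically call assert_frame_equal directly, so that a failure reports which values differ.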

The open-sourcing was part of our internal TechInno days, which are scoped to two days. The following sections describe what we learned along the way.

Project/Repository Structure

After a short inspection of the repository, we applied the following changes:

  • The repository was cluttered with old configuration files, e.g. some bash scripts that were no longer necessary. We removed all of those files to make a clean cut.
  • The actual Python-related code used to live in a subfolder “python”. We moved all code to the top level, with the “bquest” folder containing the actual code and the “tests” folder containing test-related code.
  • The README was updated, and library-relevant badges were added to give the user a glimpse of important repository properties. We also added a warning that the library is in beta and that breaking changes may occur until a future version 1.0.
  • A changelog was added to document important version changes.
  • A contributors file was added to track extraordinary contributions of people who are not part of the core maintainers but should be acknowledged.
  • A license was added to the repository to make it usable for others.
  • The pyproject.toml configuration file was updated to comply with the latest standards. The package was given matching trove classifiers and keywords to make it findable among the vast number of PyPI packages.

Sensitive information

To make sure the git history doesn’t contain sensitive information, we reviewed the most recent file versions and several earlier stages of the git history. We used BFG Repo-Cleaner to remove files with sensitive content from the entire git history. Although this may already have been enough, we finally decided to drop the entire git history and push the latest version as a single commit. All historic contributors were added to the CONTRIBUTORS file accordingly.
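Reviewing the latest file versions for secrets can be partly automated. The following is a minimal sketch, not the procedure we used: the patterns and the names `SUSPICIOUS_PATTERNS`, `scan_text` and `scan_tree` are hypothetical, and the script only inspects the working tree, not the git history.

```python
import re
from pathlib import Path

# Hypothetical patterns; a real audit would use a broader, project-specific list.
SUSPICIOUS_PATTERNS = {
    "private key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "GCP service account": re.compile(r'"type"\s*:\s*"service_account"'),
    "password assignment": re.compile(r"(?i)\bpassword\s*[=:]\s*\S+"),
}


def scan_text(text: str) -> list[str]:
    """Return the sorted labels of all patterns that match the given text."""
    return sorted(
        label
        for label, pattern in SUSPICIOUS_PATTERNS.items()
        if pattern.search(text)
    )


def scan_tree(root: str) -> dict[str, list[str]]:
    """Map each file under root (skipping .git) to the labels found in it."""
    findings = {}
    for path in Path(root).rglob("*"):
        if not path.is_file() or ".git" in path.parts:
            continue
        labels = scan_text(path.read_text(errors="ignore"))
        if labels:
            findings[str(path)] = labels
    return findings


if __name__ == "__main__":
    for file_name, labels in scan_tree(".").items():
        print(f"{file_name}: {', '.join(labels)}")
```

A scan like this only flags candidates; every hit still needs a manual look, and a clean result does not prove that the history is clean.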

State of the art

We updated our internal tooling to modern standards:

  • Pylint was replaced by ruff, which is configured with a sensible ruleset.
  • The Black code formatter is configured to accept a line length of 120 characters.
  • Poethepoet is used to provide easily executable development tasks.

License

For the library to be usable by others, an OSS license is required. Two frequently used licenses are the MIT License and the Apache Software License. We opted for the latter, which allows others to use, modify and redistribute the library as long as the license text and attribution notices are retained.

Documentation

Previously there was no real documentation. We decided to go with GitHub Pages and MkDocs, which generate static documentation into a dedicated gh-pages branch of the same repository. Each new release of the library creates a new version of the documentation. It is written as Markdown files whose folder structure is mirrored in the generated site. Part of the documentation is generated automatically from the Python source code using the MkDocs plugin “mkdocstrings”.
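To give an idea of what mkdocstrings works with, here is a function documented with a Google-style docstring, which is the style the mkdocstrings Python handler parses by default. The function `qualified_table_name` is a hypothetical example and not part of bquest.

```python
def qualified_table_name(project: str, dataset: str, table: str) -> str:
    """Build a fully qualified BigQuery table name.

    Args:
        project: Google Cloud project ID.
        dataset: BigQuery dataset name.
        table: Table name within the dataset.

    Returns:
        The name in the form ``project.dataset.table``.

    Raises:
        ValueError: If any part is empty.
    """
    parts = (project, dataset, table)
    if not all(parts):
        raise ValueError("project, dataset and table must be non-empty")
    return ".".join(parts)


print(qualified_table_name("my-project", "my_dataset", "orders"))
```

In a Markdown page, a line of the form `::: package.module` tells mkdocstrings to render the docstrings of that module in place.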

Testing

Integration testing of bquest requires access to a Google Cloud project with BigQuery enabled. We use a dedicated Google Cloud test project for bquest whose sole purpose is testing, together with an identity provider that lets our CI pipelines authenticate against Google Cloud.
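Because such tests need cloud access, it helps to skip them wherever no test project is configured, for example on a contributor's laptop. The sketch below shows one way to do this with the standard unittest module; the environment variable name `BQUEST_TEST_PROJECT` and the test class are hypothetical and do not reflect the actual bquest test suite.

```python
import os
import unittest

# Hypothetical variable name; a CI pipeline would set it after authenticating.
PROJECT_ENV_VAR = "BQUEST_TEST_PROJECT"


def integration_enabled(environ=os.environ) -> bool:
    """Return True if a test project is configured in the environment."""
    return bool(environ.get(PROJECT_ENV_VAR))


@unittest.skipUnless(integration_enabled(), f"{PROJECT_ENV_VAR} is not set")
class BigQueryIntegrationTest(unittest.TestCase):
    def test_project_is_configured(self):
        # Placeholder: a real test would run a query against the test project.
        self.assertTrue(os.environ[PROJECT_ENV_VAR])


if __name__ == "__main__":
    unittest.main()
```

Without the variable the test class is reported as skipped rather than failed, so the unit tests stay usable without credentials.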

Conclusion

Two days are enough to lay the foundation for open-sourcing a small library. The bigger task, obviously, comes afterwards: working with the community and continuing development with a sustainable workload. In any case, it was a lot of fun discussing all the meta topics such as licensing and documentation.

Credits

Special thanks go to @MikeCzech, who came up with the idea of bquest!

Authors: @Felix Theodor @Nils Christian Weisbach @Benjamin Gutzmann
