Testing migration to Google Datastore with pytest-bdd and testcontainers

Vitalii Karniushin
Qodea Google Cloud Tech Blog
6 min read · Oct 5, 2022

In the scope of a digital modernisation project for one of our clients, we migrated data from one persistence model to another, namely from Microsoft SQL Server to Google Cloud Datastore. In general, when the nature of the source and target storage models differs (e.g., SQL -> NoSQL), you usually have to build some logic for the mapping between them, and testing with a fast development feedback loop becomes very important. To test our solution, we decided to adopt practices such as Behaviour Driven Development (BDD) and dependency virtualisation with containers. Here we will explore how to organise testing for such a solution with the help of pytest-bdd and testcontainers in Python. All sources are available here.

Infrastructure

A high-level overview of the client environment is as follows:

As we can see from the above diagram, the data source lives in Azure. The migration involves copying the relevant data from the MSSQL database to Google Cloud Datastore. The API is deployed via Cloud Build as a Cloud Run service.

Data

While analysing the schema, we identified a set of domains to be migrated into the new storage. Each domain is represented by a group of tables. The data from these tables shouldn’t be migrated as a whole: sometimes it’s a slice of the rows, and sometimes a slice of specific columns. In our example, we are going to use the Posts domain. Information about posts lives in the following tables:

  • post (id, title, summary) — common data about posts
  • category (id, title) — common data about categories
  • post_category (id, postId, categoryId) — categories linked to posts
  • post_meta (id, postId, key, content) — meta information related to a post

In Datastore, we store all the information in a single kind, Posts (a kind is the unique string identifier of a type of entity, and an entity is an individual record stored in and retrieved from Datastore). Note that categories and meta information may be stored across multiple rows in the original data source, but in Datastore they should be saved in dedicated properties of a single entity, categories and meta:
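
For illustration, a migrated entity could carry properties shaped roughly like this (a sketch; the exact property layout is an assumption based on the tables above):

```python
# Roughly how a single migrated "Posts" entity could look in Datastore:
post_entity = {
    "title": "First post",
    "summary": "A short summary",
    # several post_category rows collapse into one repeated property...
    "categories": ["Tech", "Cloud"],
    # ...and several post_meta (key, content) rows into one embedded map
    "meta": {"author": "vitalii"},
}
```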

We can see that developing such a solution requires careful design of SQL queries of varying complexity, some transformations (the post_meta case), and correct data mapping.
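
To give a flavour of those queries, here is a sketch of how posts could be fetched together with their categories (table and column names come from the schema above; the exact SQL is illustrative):

```python
# Illustrative source query for the Posts domain. Rows sharing the same post id
# are later folded into a single entity, collecting the category titles into
# the repeated "categories" property.
POSTS_WITH_CATEGORIES = """
    SELECT p.id, p.title, p.summary, c.title AS category
    FROM post AS p
    LEFT JOIN post_category AS pc ON pc.postId = p.id
    LEFT JOIN category AS c ON c.id = pc.categoryId
"""
```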

How can we test that?

One possible approach to consider is smoke tests. In that case, for every SQL query or other logic update, we have to redeploy the API to trigger the smoke tests and, if they fail, work out the reason from the logs. We could decouple our setup from the client infrastructure (in our case, Azure VPN + MSSQL) by using Cloud SQL, which would get rid of the multi-cloud environment and leave us with a single cloud. However, this feedback loop is still not what we are looking for, because staying attached to a cloud provider leads to:

  • timing costs for deployment
  • additional resources to support infra code
  • manual log scanning

How can we decouple from the cloud completely? With a local setup, of course. Here we have two key components, MSSQL and Datastore. Let’s see how we can manage such a setup locally.

Local setup

We can use testcontainers, a thin Docker-based framework that is very convenient for mocking all kinds of external dependencies when writing integration tests. It includes a predefined set of classes that wrap popular solutions, including MSSQL, but it also has quite a convenient API for building your own. Let’s have a look at the following code:
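
A minimal sketch of such a fixture, assuming testcontainers’ SqlServerContainer wrapper and SQLAlchemy with the pymssql driver for connections (the fixture name and schema DDL are illustrative):

```python
import pytest
import sqlalchemy
from testcontainers.mssql import SqlServerContainer

# DDL for the slice of the source schema our tests need (illustrative).
SCHEMA_DDL = """
CREATE TABLE post (id INT PRIMARY KEY, title VARCHAR(255), summary VARCHAR(1024));
CREATE TABLE category (id INT PRIMARY KEY, title VARCHAR(255));
CREATE TABLE post_category (id INT PRIMARY KEY, postId INT, categoryId INT);
CREATE TABLE post_meta (id INT PRIMARY KEY, postId INT, [key] VARCHAR(255), content VARCHAR(1024));
"""

@pytest.fixture(scope="session")
def mssql_engine():
    # One disposable MSSQL instance for the whole test session
    with SqlServerContainer("mcr.microsoft.com/mssql/server:2019-latest") as container:
        engine = sqlalchemy.create_engine(container.get_connection_url())
        # apply the schema the migration will read from
        with engine.begin() as conn:
            conn.execute(sqlalchemy.text(SCHEMA_DDL))
        yield engine
```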

Here, we prepare the MSSQL container. Note that we use a fixture with scope=session. By default, pytest uses scope=function, which means the fixture is executed before each test. We could use that scope, and test data isolation would come for free, since every test would get a brand-new container without any data. But running a new container for every test is quite expensive, and we want to minimise test execution time: the more tests we have, the more it costs. That’s why we use session scope, so the container is created once before the tests run. In exchange, we have to take care of test data isolation ourselves. Since we don’t need to keep data after a test, we can solve this with a cleanup: let’s track which tables have been updated by the test and clean them afterwards.
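
A minimal sketch of that cleanup, reusing the mssql_engine fixture from above (fixture names are illustrative; step implementations are expected to register every table they insert into):

```python
import pytest
import sqlalchemy

@pytest.fixture()
def touched_tables():
    # step implementations add the name of every table they insert into
    return set()

@pytest.fixture(autouse=True)
def clean_mssql(mssql_engine, touched_tables):
    yield  # run the scenario first
    # remove rows only from the tables the scenario actually touched
    with mssql_engine.begin() as conn:
        for table in touched_tables:
            conn.execute(sqlalchemy.text(f"DELETE FROM [{table}]"))
```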

To containerise Datastore, let’s write a custom class. Google provides a set of emulators for some of its services, which is why our class is based on an official Google image that includes the gcloud SDK with all the required dependencies. Let’s prepare the container:
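
A sketch of such a class, assuming the cloud-sdk image (the exact tag and the readiness log line are assumptions) and the standard DATASTORE_EMULATOR_HOST convention that the google-cloud-datastore client honours:

```python
import os
import pytest
from testcontainers.core.container import DockerContainer
from testcontainers.core.waiting_utils import wait_for_logs

class DatastoreContainer(DockerContainer):
    """Runs the Datastore emulator bundled with the Google Cloud SDK image."""

    def __init__(self, image="gcr.io/google.com/cloudsdktool/cloud-sdk:emulators",
                 project="test-project", port=8081):
        super().__init__(image)
        self.project = project
        self.port = port
        self.with_exposed_ports(port)
        self.with_command(
            "gcloud beta emulators datastore start "
            f"--project={project} --host-port=0.0.0.0:{port} --no-store-on-disk"
        )

    def get_emulator_host(self):
        return f"{self.get_container_host_ip()}:{self.get_exposed_port(self.port)}"

@pytest.fixture(scope="session")
def datastore_emulator():
    with DatastoreContainer() as container:
        # block until the emulator reports readiness in its logs
        wait_for_logs(container, "Dev App Server is now running", timeout=60)
        # the google-cloud-datastore client picks the emulator up via env vars
        os.environ["DATASTORE_EMULATOR_HOST"] = container.get_emulator_host()
        os.environ["DATASTORE_PROJECT_ID"] = container.project
        yield container
```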

Thus, we have two fully operable containers before running our test suite. Now we can move to testing.

Testing

Let’s explore our approach to testing. BDD is an evolution of the TDD practice that introduces a new concept: behaviour. Let’s highlight two exciting things about BDD:

  • We still follow the Test First approach
  • Tests become documentation that non-tech people can use

Cool! Two birds with one stone: testing code and writing documentation. (If you want to know more about the hows and whys of BDD, check out some of the docs here).

How behaviours are designed depends on the framework of your choice. The Python ecosystem provides some alternatives, but in general the approaches are pretty similar. Our choice is pytest-bdd, a pytest plugin that implements a subset of the Gherkin language. The plugin allows us to describe tests as features in a human-friendly language with the help of the Given/When/Then syntax, and to use the pytest runner to launch them.

The testing algorithm can be decomposed into the following basic steps:

  • Presence of data in source storage (Given data in MSSQL)
  • Running migration service (When migration triggered)
  • Validating new data in Datastore (Then check data exists in Datastore)

Let’s have a look at our feature file:
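
An illustrative version of it, using the tables and data model described above (the step phrasing and the sample values are assumptions):

```gherkin
Feature: Posts migration

  Scenario: Migrate a post with assigned categories
    Given mssql table "post" contains {"id": 1, "title": "First post", "summary": "A short summary"}
    And mssql table "category" contains {"id": 10, "title": "Tech"}
    And mssql table "category" contains {"id": 11, "title": "Cloud"}
    And mssql table "post_category" contains {"id": 100, "postId": 1, "categoryId": 10}
    And mssql table "post_category" contains {"id": 101, "postId": 1, "categoryId": 11}
    When migration is triggered
    Then datastore kind "Posts" contains {"title": "First post", "summary": "A short summary", "categories": ["Tech", "Cloud"]}
    And the number of migrated entities is 1

  Scenario: Migrate a post with meta information
    Given mssql table "post" contains {"id": 2, "title": "Second post", "summary": "Another summary"}
    And mssql table "post_meta" contains {"id": 200, "postId": 2, "key": "author", "content": "vitalii"}
    When migration is triggered
    Then datastore kind "Posts" contains {"title": "Second post", "summary": "Another summary", "meta": {"author": "vitalii"}}
    And the number of migrated entities is 1
```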

We have two scenarios here:

In the first scenario, we migrate a post that has two categories assigned. This data is captured in three different tables. In the “Given” phrase, we specify the table’s name and use JSON to describe the row data. This information is enough to insert the required entry into the table:
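
A sketch of the corresponding step implementation, assuming the fixtures defined earlier (the step wording has to match the feature file):

```python
import json

import sqlalchemy
from pytest_bdd import given, parsers

@given(parsers.parse('mssql table "{table}" contains {row}'))
def insert_row(mssql_engine, touched_tables, table, row):
    # the JSON payload from the Gherkin phrase maps columns to values
    data = json.loads(row)
    columns = ", ".join(f"[{column}]" for column in data)
    placeholders = ", ".join(f":{column}" for column in data)
    with mssql_engine.begin() as conn:
        conn.execute(
            sqlalchemy.text(f"INSERT INTO [{table}] ({columns}) VALUES ({placeholders})"),
            data,
        )
    # register the table so it gets cleaned after the scenario
    touched_tables.add(table)
```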

We fill the tables and trigger the migration. After that, we expect a new entity with the specified data to exist in Datastore; again, we use JSON to describe the data:
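
A sketch of the When/Then steps; migrate_posts stands in for the real migration entry point and is hypothetical:

```python
import json

from google.cloud import datastore
from pytest_bdd import parsers, then, when

from migration import migrate_posts  # hypothetical entry point of the service

@when("migration is triggered")
def run_migration(mssql_engine, datastore_emulator):
    # the Datastore client inside reads DATASTORE_EMULATOR_HOST from the env
    migrate_posts(source=mssql_engine)

@then(parsers.parse('datastore kind "{kind}" contains {expected}'))
def assert_entity_exists(datastore_emulator, kind, expected):
    expected = json.loads(expected)
    client = datastore.Client(project=datastore_emulator.project)
    entities = list(client.query(kind=kind).fetch())
    # at least one migrated entity must carry all the expected properties
    assert any(
        all(entity.get(prop) == value for prop, value in expected.items())
        for entity in entities
    )
```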

As noted above, we should clean up test data after every execution. For that, we can use the mechanism of pytest hooks: the pytest-bdd plugin introduces several hooks of its own, one of which is pytest_bdd_after_scenario. Another option is to add a final Gherkin phrase to the test scenario. It’s a matter of taste; we’ve chosen the second option, where we check the number of migrated entities before the data cleanup. In our example, we use only one entity, but there could be several, which is why the final phrase is parameterized with the expected number of entities:
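
A sketch of that final phrase, counting and then deleting the migrated entities with google-cloud-datastore’s delete_multi (the step wording is illustrative):

```python
from google.cloud import datastore
from pytest_bdd import parsers, then

@then(parsers.parse("the number of migrated entities is {count:d}"))
def check_count_and_clean_up(datastore_emulator, count):
    client = datastore.Client(project=datastore_emulator.project)
    entities = list(client.query(kind="Posts").fetch())
    assert len(entities) == count
    # wipe the migrated entities so the next scenario starts from a clean slate
    client.delete_multi([entity.key for entity in entities])
```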

In the second scenario, we migrate a post that has some metadata. In SQL, this is captured by the post_meta table; in Datastore, by the meta property.

Final thoughts

This approach allows us to test the data migration effectively and, at the same time, helps to organise the documentation. We managed to implement our test cases without leveraging actual cloud services directly, which significantly reduces the time cost of a test run and helps to save resources on infrastructure. I hope you find this post helpful in addressing similar challenges. Happy coding!

About CTS:

CTS is the largest dedicated Google Cloud practice in Europe and one of the world’s leading Google Cloud experts, winning 2020 Google Partner of the Year Awards for both Workspace and GCP.

We offer a unique full-stack Google Cloud solution for businesses, encompassing cloud migration and infrastructure modernisation. Our data practice focuses on analysis and visualisation, providing industry-specific solutions for Retail, Financial Services, and Media and Entertainment.

We’re building talented teams ready to change the world using Google technologies. So if you’re passionate, curious and keen to get stuck in — take a look at our Careers Page and join us for the ride!

Disclaimer: This is to inform readers that the views, thoughts, and opinions expressed in the text belong solely to the author.

