Speeding up our monolith onboarding from 1 day to 15 mins by improving local data setup.

Anton Safonov · Published in HeyJobs Tech · 3 min read · Jan 30, 2023

The Issue: Not enough local data after project setup to start development right away.

Onboarding engineers to an existing project is an important process: you want to bring them up to speed ASAP with the least effort. In this post, I’d like to talk about the tech side of the onboarding process, specifically some approaches we use here at HeyJobs to automate the project’s local data setup.

As a new joiner, you’ve just set up a new backend project, and the tests are passing (hopefully). Then you launch the project and realize that you don’t have enough data in your local DB to start working on your feature, or the bug that you need to fix requires specific data (e.g. from production) to be reliably reproduced.

To solve this problem, you need to generate the missing data, which can be pretty time-consuming, especially when you’re new to the project or it has tons of relations and data constraints.

In simple use cases, when you don’t have much data, you can use seeds to populate the DB. That kinda works. But when you have a lot of data, and especially when parts of it are generated and maintained in production by other teams (e.g. operations), the seed file becomes less effective and its maintenance becomes problematic.
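To illustrate the maintenance cost, here’s a minimal sketch of what a seed script can look like; it assumes Postgres with psycopg2, and the table names, columns, and connection string are made up for this example:

```python
# seed.py: hypothetical minimal seed script (all names are illustrative)
import psycopg2

conn = psycopg2.connect("dbname=app_development")
with conn, conn.cursor() as cur:  # commits on success, rolls back on error
    cur.execute(
        "INSERT INTO companies (name, city) VALUES (%s, %s)",
        ("ACME GmbH", "Berlin"),
    )
    cur.execute(
        "INSERT INTO jobs (company_id, title) VALUES (%s, %s)",
        (1, "Backend Engineer"),
    )
# Every new table, relation, or constraint means more hand-written blocks
# like these, which is exactly where seed maintenance starts to hurt.
```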

The Solution: Safely populate your local DB with a subset of anonymized production data

We’ve figured out an alternative approach that not only solves the problems above but also gives us some additional perks, like guaranteed anonymization/encryption of PII and data integrity across development environments. Internally we call it PADD (Production Anonymized Database Dump). Essentially, it’s a subset of production data that we anonymize and use to set up local projects and disposable environments.

PADD is generated in two steps: first we subset the production data, then we anonymize/obfuscate it so we can safely use it locally. We’ve built a small Python app that performs both steps: subsetting with the help of the condenser tool, and anonymization with our own scripts. The schematic below roughly represents the generation process.
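In code form, the pipeline could be orchestrated roughly like the sketch below. This is not our actual app: the condenser invocation is assumed from the tool’s README (it reads its settings from a JSON config), and the anonymization SQL, database names, and columns are purely illustrative.

```python
# padd_pipeline.py: rough sketch of the two-step PADD generation.
import subprocess
import psycopg2

# Step 1: subset production data with condenser (invocation assumed;
# the tool picks up its connection and subsetting settings from a JSON config).
subprocess.run(["python", "direct_subset.py"], cwd="condenser", check=True)

# Step 2: anonymize PII in the subsetted copy before the dump is published.
# Illustrative example: overwrite emails and names deterministically.
conn = psycopg2.connect("dbname=padd_subset")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        UPDATE users
        SET email = 'user_' || id || '@example.com',
            full_name = 'User ' || id
        """
    )

# Finally, dump the anonymized subset so environments can restore it later.
subprocess.run(
    ["pg_dump", "--format=custom", "--file=latest_padd.dump", "padd_subset"],
    check=True,
)
```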

PADD generation is scheduled to run daily but can also be triggered on demand. It has a configuration file where engineers can specify subsetting instructions, e.g. for a given table take only 1% of production data, or copy the table fully. Referential integrity is preserved automatically during the subsetting process, so the configuration file is pretty lightweight and does not need as much maintenance as seeds. We’re also adding tests that check the configuration file and ensure that PADD generation is working as expected.
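For a sense of what those instructions look like, here is a hypothetical config snippet, loosely modeled on condenser’s JSON format (field names may differ between versions, so treat this as a shape, not a spec):

```json
{
  "initial_targets": [
    { "table": "public.jobs", "percent": 1 },
    { "table": "public.job_types", "percent": 100 }
  ]
}
```

Only the root tables need to be listed; everything they reference gets pulled in automatically to keep the subset consistent.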

When the application is built with Docker, locally or in a disposable environment, PADD is populated automatically as part of the build process. We also have additional scripts that let us reset the DB to the latest PADD version on demand with a single command.
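As an illustration, that reset helper can be as small as a script that recreates the local DB and restores the latest dump. A minimal sketch, assuming Postgres client tools on the PATH and a custom-format dump (all names are hypothetical):

```python
# reset_db.py: hypothetical "one command" reset to the latest PADD.
import subprocess

# Drop and recreate the local database, then restore the newest dump.
subprocess.run(["dropdb", "--if-exists", "app_development"], check=True)
subprocess.run(["createdb", "app_development"], check=True)
subprocess.run(
    ["pg_restore", "--no-owner", "--dbname=app_development", "latest_padd.dump"],
    check=True,
)
```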

No need to manually generate missing data locally anymore.

After integrating PADD with our Docker setup, we have observed a few improvements:

  • Faster and smoother onboarding experience, as we’ve fully automated project setup and reduced its time from ~1 day to 15 mins
  • Improved developer happiness and engagement
  • Reduced operational load: no need to manually fix/generate data across all dev/qa environments; data quality, integrity, and freshness are now handled automatically
  • New possibilities to reproduce tricky production data-related bugs
  • Improved security: to make PADD usage safe, we had to ensure that sensitive data is encrypted/obfuscated/anonymized, so we’ve added more checks, tests, and internal processes to achieve that

Interested in joining our team? Browse our open positions or check out what we do at HeyJobs.
