Revolutionize Your Cloud Architecture with a Self Service Platform

Gena Kartashevskyy
MyHeritage Engineering
6 min read · Jun 28, 2024

My precious

- We now trigger an offline flow: we send a message containing the key of an object in an AWS S3 bucket to AWS SQS. An AWS ECS cluster of CPU instances consumes and processes it, and the response is returned to our application via AWS SNS with an HTTP subscription.

- (sigh) It will take weeks to implement this infrastructure, and DevOps already have their entire Sprint planned.

This was a typical dialog in the tech design of any new feature at MyHeritage back in 2021. MyHeritage wasn’t born in the Cloud; historically, DevOps were in charge of all infrastructure, and they stayed in charge of it even after the migration to the Cloud. Although the Cloud brought Infrastructure-as-Code, DevOps were almost exclusively the ones writing and manually applying it. Sharing this power with all developers was also complicated by the nature of our business: sensitive information (PII, photos, millions of DNA samples) and precious content (20 billion historical records) that must not be accidentally exposed.

But the more we embraced Cloud capabilities, the more obvious the disadvantages of this approach became:

  • Lack of standardization
  • We can’t scale the DevOps team at the same pace as the developers
  • The “handshake” protocol of DevOps-Developers ping-pong over new infra (JIRA tickets and a triage process are involved) creates a lot of overhead and latency, and leaves developers idle
  • Developers are reluctant to write HCL, CloudFormation templates, or any other non-code statements
  • DevOps can’t plan their roadmaps properly, as they are flooded with R&D requests
  • All of the above led to the crucial outcomes: velocity issues and slow rollout of features

Sharing is caring

It was obvious that we should start sharing “applicative” infrastructure management with developers; the next question was “How?” While our infrastructure provisioning was mostly written in Terraform, we wanted to explore other options, bearing in mind the following concerns:

  • Ease of development & testing. Developers should be able to create the most typical Cloud infrastructure elements with little to no friction, while retaining the ability to customize these primitives if necessary. Preferably, it should all be done in a language most developers are familiar with.
  • Ease of deployment. How can we deploy new infrastructure? How fast? How can we revert/roll it back?
  • Maturity for production use cases
  • Best practices: security, FinOps, golden paths of resource creation (e.g. AWS S3 lifecycle rules, AWS SQS dead letter queues and so on)
  • Integration with existing Terraform codebase — 100s of modules and many more existing stack states
  • Ecosystem: supported cloud and non-cloud resources, multi-cloud, community
  • Pricing
  • Multi-environment (Production, Staging, Development) support

The group round of this competition started with Pulumi, Terraform CDK, AWS CDK, and AWS Proton. Terraform CDK took the winner’s trophy. Keep in mind, dear reader, that it was 2021, and it was specific to our use case, and your mileage may vary. Here’s what we got with Terraform CDK:

  • Supports any Cloud and non-Cloud (e.g. Vault, Kafka) resources
  • Fully compatible with Terraform — smooth migration path for DevOps and integration with existing modules we do not want to invest in replacing
  • “Object-oriented Terraform” that is friendlier for developers
  • Generated code from Terraform providers
  • Multiple language bindings (we use TypeScript) with code assistance from IDEs
  • Creates Terraform HCL/JSON from code (synth)
  • Works with existing modules
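
To make the "synth" bullet more concrete, here is a minimal sketch of the idea, not the real Terraform CDK implementation: resources declared in code are collected by a stack and serialized into the nesting used by Terraform's JSON configuration syntax (resource, then type, then name). `MiniStack`, `MiniResource`, and the example attribute values are hypothetical names introduced only for illustration.

```typescript
// Hypothetical, simplified model of "synth": code objects in, Terraform JSON out.
// MiniStack and MiniResource are illustrative names, not part of Terraform CDK.

type ResourceAttributes = Record<string, unknown>;

interface MiniResource {
  type: string; // Terraform resource type, e.g. "aws_s3_bucket"
  name: string; // local name inside the configuration
  attributes: ResourceAttributes;
}

class MiniStack {
  private readonly resources: MiniResource[] = [];

  // Register a resource; (type, name) pairs must be unique, as in Terraform.
  add(type: string, name: string, attributes: ResourceAttributes): this {
    if (this.resources.some(r => r.type === type && r.name === name)) {
      throw new Error(`Duplicate resource ${type}.${name}`);
    }
    this.resources.push({ type, name, attributes });
    return this;
  }

  // Nest resources as { resource: { <type>: { <name>: attributes } } }.
  synth(): { resource: Record<string, Record<string, ResourceAttributes>> } {
    const resource: Record<string, Record<string, ResourceAttributes>> = {};
    for (const r of this.resources) {
      const byName = resource[r.type] ?? {};
      byName[r.name] = r.attributes;
      resource[r.type] = byName;
    }
    return { resource };
  }
}

const stack = new MiniStack()
  .add("aws_s3_bucket", "uploads", { bucket: "example-uploads" })
  .add("aws_sqs_queue", "requests", { name: "example-requests" });

const synthesized = stack.synth();
console.log(JSON.stringify(synthesized, null, 2));
```

Because the output is plain Terraform configuration, existing modules and state can keep working alongside code written this way, which is the compatibility point made in the list above.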

Hide(ous) game

On its own, CDK for Terraform still has the main disadvantage of pure Terraform: low-level, extremely verbose abstractions that expose resource properties as-is. They are hard to work with without consulting the Terraform documentation, and they are very error-prone. For example, an IAM policy document:

new DataAwsIamPolicyDocument(scope, `policyDoc`, {
  statement: [
    {
      actions: ["SomeAction"],
      effect: "SomeEffect"
    }
  ]
})

“actions” and “effect” are just some strings.

The tool doesn’t make infrastructure creation radically different compared to pure Terraform.
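
One way to narrow stringly-typed fields such as "actions" and "effect" is to model them with TypeScript literal union types, so that typos fail at compile time rather than at apply time. The sketch below only illustrates that idea; `PolicyEffect`, `S3Action`, `TypedStatement`, and `statement` are hypothetical names, not part of Terraform CDK or of our toolkit, and the bucket ARN is a placeholder.

```typescript
// Hypothetical sketch: constrain "effect" and "actions" with literal types
// instead of free-form strings. These names are illustrative only.

type PolicyEffect = "Allow" | "Deny";
type S3Action = "s3:GetObject" | "s3:PutObject" | "s3:ListBucket" | "s3:DeleteObject";

interface TypedStatement {
  effect: PolicyEffect;
  actions: S3Action[];
  resources: string[];
}

// Build a statement, rejecting empty action or resource lists early.
function statement(effect: PolicyEffect, actions: S3Action[], resources: string[]): TypedStatement {
  if (actions.length === 0) throw new Error("A statement needs at least one action");
  if (resources.length === 0) throw new Error("A statement needs at least one resource");
  return { effect, actions: [...actions], resources: [...resources] };
}

const readOnly = statement("Allow", ["s3:GetObject", "s3:ListBucket"], [
  "arn:aws:s3:::example-bucket",   // placeholder bucket ARN
  "arn:aws:s3:::example-bucket/*", // placeholder object ARN pattern
]);

// A call such as statement("Alow", ["s3:GetObjectt"], []) would not compile,
// because neither string belongs to the declared unions.
console.log(readOnly);
```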

We would prefer an API similar to what AWS CDK provides: high-level, developer-friendly TypeScript constructs that hide all this complexity and are easy to use and understand. At the same time, it should take into account our knowledge, specifics, and integrations with existing modules, and avoid limiting us to a single provider. Thus, the SelfService toolkit was born.

We decided that each resource should have an opinionated implementation that follows best practices, enforces org standards, and maintains strong security and access control.
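
As a rough illustration of what "opinionated" can mean in code, the sketch below shows a builder that applies safe defaults automatically, exposes only a few deliberate opt-outs, and validates an org standard when the resource is built. Everything here is an assumption made for the example: `BucketBuilder`, `BucketSpec`, the required tag names, and the chosen defaults are hypothetical and do not describe the toolkit's actual internals.

```typescript
// Hypothetical sketch of an opinionated builder. None of these names,
// defaults, or tag requirements come from the actual toolkit.

interface BucketSpec {
  name: string;
  versioning: boolean;
  encrypted: boolean;
  blockPublicAccess: boolean;
  expirationDays: number | undefined;
  tags: Record<string, string>;
}

const REQUIRED_TAGS = ["team", "cost-center"]; // assumed org standard

class BucketBuilder {
  private readonly name: string;
  private readonly tags: Record<string, string>;
  private versioning = true; // default: keep object versions
  private expirationDays: number | undefined = undefined;

  constructor(name: string, tags: Record<string, string>) {
    this.name = name;
    this.tags = tags;
  }

  withoutVersioning(): this {
    this.versioning = false;
    return this;
  }

  withExpirationDays(days: number): this {
    if (!Number.isInteger(days) || days < 1) throw new Error("days must be a positive integer");
    this.expirationDays = days;
    return this;
  }

  // Enforce the tagging standard and fill in non-negotiable security settings.
  build(): BucketSpec {
    const missing = REQUIRED_TAGS.filter(tag => !(tag in this.tags));
    if (missing.length > 0) throw new Error(`Missing required tags: ${missing.join(", ")}`);
    return {
      name: this.name,
      versioning: this.versioning,
      encrypted: true,         // deliberately not configurable
      blockPublicAccess: true, // deliberately not configurable
      expirationDays: this.expirationDays,
      tags: { ...this.tags },
    };
  }
}

const spec = new BucketBuilder("example-uploads", { team: "photos", "cost-center": "1234" })
  .withoutVersioning()
  .withExpirationDays(1)
  .build();
console.log(spec);
```

The design choice worth noting is that the risky settings have no setter at all: a developer can tune versioning or expiration, but cannot accidentally produce a public or unencrypted bucket.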

We analyzed which Cloud infrastructure elements were most requested at MyHeritage, and started implementing them in order of descending frequency. From the beginning, the emphasis was on two principles:

  1. It’s “open-source”. All developers are welcome to contribute.
  2. We don’t plan to implement every element of cloud infrastructure or every use case. The toolkit should be just a starter, inception of a new feature, and developers can build out from it using regular Terraform CDK by either extending the toolkit, or using TFCDK resources directly in their own stack.

Here is an example of Toolkit construct usage:

const myRole = new MhIamRole(this, someRoleName, someRoleName)
const bucket = new MhS3BucketBuilder(this, bucketId, bucketTags).build()
bucket.allowAccess(myRole, {permissions: S3Permission.READ})

These 3 lines of code do the following:

  1. Create a new AWS IAM Role
  2. Configure its assume-role policy
  3. Create a new AWS S3 bucket
  4. Configure the Role’s policy with Read access to the bucket
  5. Apply proper tags for FinOps to everything
  6. Configure both the Bucket and the Role in accordance with security best practices
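
To show how a single `allowAccess` call could cover step 4, here is a minimal sketch of permission expansion: a coarse permission name is mapped to a list of IAM actions, and a statement scoped to the bucket is attached to the role. The `Role` and `Bucket` classes, the `S3_PERMISSIONS` table, and the specific action lists are assumptions for illustration; the toolkit's real mapping may differ.

```typescript
// Hypothetical sketch of permission expansion behind an allowAccess() helper.
// The action lists below are an assumed mapping, not the toolkit's real one.

const S3_PERMISSIONS = {
  READ: ["s3:GetObject", "s3:ListBucket"],
  READ_WRITE: ["s3:GetObject", "s3:ListBucket", "s3:PutObject", "s3:DeleteObject"],
} as const;

type S3PermissionName = keyof typeof S3_PERMISSIONS;

interface PolicyStatement {
  effect: "Allow";
  actions: string[];
  resources: string[];
}

class Role {
  readonly roleName: string;
  readonly statements: PolicyStatement[] = [];

  constructor(roleName: string) {
    this.roleName = roleName;
  }
}

class Bucket {
  readonly bucketName: string;

  constructor(bucketName: string) {
    this.bucketName = bucketName;
  }

  get arn(): string {
    return `arn:aws:s3:::${this.bucketName}`;
  }

  // Attach an Allow statement for this bucket to the given role.
  allowAccess(role: Role, options: { permissions: S3PermissionName }): void {
    role.statements.push({
      effect: "Allow",
      actions: [...S3_PERMISSIONS[options.permissions]],
      // Bucket-level actions target the bucket ARN, object-level ones need "/*".
      resources: [this.arn, `${this.arn}/*`],
    });
  }
}

const exampleRole = new Role("example-role");
const exampleBucket = new Bucket("example-bucket");
exampleBucket.allowAccess(exampleRole, { permissions: "READ" });
console.log(JSON.stringify(exampleRole.statements, null, 2));
```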

The Toolkit is not limited to wrappers around primitive elements. If we wanted to provision the infrastructure from the dialog at the beginning of this article with the Toolkit, it would look like this:

// IAM role assumed by the ECS task
const taskRole = new MhIamRoleBuilder(this, "feature-ecs-task-role")
  .build()

// S3 bucket with an expiration lifecycle rule
const expirationRuleBuilder = new MhExpirationLifeCycleRuleBuilder(1)
const bucket = new MhS3BucketBuilder(this, "some-bucket")
  .withoutVersioning()
  .withLifecycleRule(expirationRuleBuilder)
  .build()
bucket.allowAccess(taskRole, {permissions: S3Permission.READ_WRITE})

// SQS queue for incoming requests, with a dead letter queue
const requestsQueue = new MhSqsBuilder(this, "requests")
  .withDeadLetterQueue(2)
  .withMessageRetentionSeconds(10800)
  .withVisibilityTimeoutSeconds(10)
  .build()
requestsQueue.allowAccess(taskRole, {permissions: SqsPermission.RECEIVE})

// SNS topic for responses, delivered to the application over HTTP
const responsesTopic = new MhSnsBuilder(this, "responses")
  .addHttpEndPointSubscription({
    endpoint: "https://myheritage.com/someendpoint"
  })
  .build()
responsesTopic.allowAccess(taskRole, {permissions: SnsPermission.PUBLISH})

// Fargate service that scales on the requests queue
new MhFargateWithSqsScalingEcsBuilder(this, {}, config)
  .withServiceName("feature-service")
  .withSqsForAutoScalingTriggers(requestsQueue)
  .withRole(taskRole)
  .withTaskContainerImage("12345.dkr.ecr.us-east-1.amazonaws.com/feature:1234")
  .withTaskEnvVars(new Map<string, string>([
    ["SOME_ENV", "value"]
  ]))
  .withTaskCpuRequestUnits(2048)
  .withTaskMemoryRequestMiB(4096)
  .withTaskContainerHealthCheck({
    command: ["CMD-SHELL", "curl --fail http://127.0.0.1:$HEALTH_CHECK_PORT/health || exit 1"],
    interval: 60,
    retries: 3,
    startPeriod: 180,
    timeout: 60,
  })
  .build()

The code above creates 48(!) Terraform resources and can be written in a few hours.
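
Part of the reason a short listing fans out into so many resources is that one builder call can stand for several of them. The sketch below illustrates this with a queue builder whose dead letter option yields two queue definitions plus a redrive policy linking them. It is a simplified model under stated assumptions: `QueueBuilder` and `QueueSpec` are hypothetical, the numeric argument is assumed to be the SQS `maxReceiveCount`, and the ARN prefix is a placeholder account and region.

```typescript
// Hypothetical sketch: one builder call expanding into several resources.
// Assumes the numeric argument is SQS's maxReceiveCount; the toolkit may differ.

interface QueueSpec {
  name: string;
  messageRetentionSeconds: number;
  visibilityTimeoutSeconds: number;
  redrivePolicy: string | undefined; // JSON string, the form SQS expects
}

const PLACEHOLDER_ARN_PREFIX = "arn:aws:sqs:us-east-1:123456789012:"; // fake account

class QueueBuilder {
  private readonly name: string;
  private maxReceiveCount: number | undefined = undefined;
  private retentionSeconds = 345600; // SQS default retention: 4 days
  private visibilitySeconds = 30;    // SQS default visibility timeout

  constructor(name: string) {
    this.name = name;
  }

  withDeadLetterQueue(maxReceiveCount: number): this {
    if (!Number.isInteger(maxReceiveCount) || maxReceiveCount < 1) {
      throw new Error("maxReceiveCount must be a positive integer");
    }
    this.maxReceiveCount = maxReceiveCount;
    return this;
  }

  withMessageRetentionSeconds(seconds: number): this {
    this.retentionSeconds = seconds;
    return this;
  }

  withVisibilityTimeoutSeconds(seconds: number): this {
    this.visibilitySeconds = seconds;
    return this;
  }

  // Returns the dead letter queue first (if requested), then the main queue.
  build(): QueueSpec[] {
    const specs: QueueSpec[] = [];
    let redrivePolicy: string | undefined = undefined;
    if (this.maxReceiveCount !== undefined) {
      const dlqName = `${this.name}-dlq`;
      specs.push({
        name: dlqName,
        messageRetentionSeconds: 1209600, // keep failed messages for 14 days (SQS maximum)
        visibilityTimeoutSeconds: this.visibilitySeconds,
        redrivePolicy: undefined,
      });
      redrivePolicy = JSON.stringify({
        deadLetterTargetArn: `${PLACEHOLDER_ARN_PREFIX}${dlqName}`,
        maxReceiveCount: this.maxReceiveCount,
      });
    }
    specs.push({
      name: this.name,
      messageRetentionSeconds: this.retentionSeconds,
      visibilityTimeoutSeconds: this.visibilitySeconds,
      redrivePolicy,
    });
    return specs;
  }
}

const queues = new QueueBuilder("requests")
  .withDeadLetterQueue(2)
  .withMessageRetentionSeconds(10800)
  .withVisibilityTimeoutSeconds(10)
  .build();
console.log(queues.map(q => q.name));
```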

Bon Voyage

Now developers can write infrastructure code quickly, but how can they actually ship it to production? We don’t want developers to run cdktf deploy from their laptops, nor from a central bastion server in the production environment. We want infrastructure to become a first-class citizen: properly reviewed, tested, and checked.

In parallel with SelfService tool evaluation, we started evaluating CI/CD tools for infrastructure, choosing between AWS Proton, Terraform Cloud, Atlantis, Spacelift, Env0, and AWS CDK pipelines. Our final choice was Spacelift. Now, the developer’s workflow looks like this:

GitHub integration allows us to do a sanity check without leaving the PR (useful both for reviewers and the author):

[Figure: Example of a summary generated by Spacelift]

And the notification policy gives an opportunity to keep an eye on the flow:

[Figure: Spacelift notifications]

We’re sailing

We released the SelfService Toolkit together with Spacelift CI/CD a year ago, and we already have clear results:

  1. 52 IaC repositories created by 16 different teams
  2. In most projects, zero dependency on the infra team or DevOps
  3. Resources are written in human-readable code
  4. Full GitOps process, standardisation, auditing, PRs, Wiz scanning, etc.
  5. Internal survey of developers’ experience shows 92% satisfaction

If you want to accelerate your processes, keep your developers satisfied, and free DevOps for much more meaningful work, self-service is the way :)
