AWS Platform engineering at NS

An in-depth look at how we handle AWS account provisioning

Jelle Pelgrims
NS-Techblog
Sep 18, 2023

--

This blog post is meant as accompanying material for the talk “Platform engineering at NS” which will be presented at DevOpsDays 2023 Eindhoven.

There is more IT to NS than meets the eye. You may have come into contact with the NS-app, the NS international app, the NS website, the ticket vending machines, the platform signs, the advertising screens or the gates in the stations. That’s quite a list, and it only contains the publicly visible IT projects of NS. Behind the scenes there are dozens, if not hundreds, of separate IT applications that enable NS to drive trains from point A to point B in a timely fashion. Without colliding with other trains, of course.

A big part of these IT projects runs in the cloud. Not just one cloud, but three. NS uses AWS, Azure and self-managed data centers to run its IT systems. In this post we will focus on the AWS side of things. There are dozens of teams at NS working on AWS or looking to move their projects from a self-managed data center to the cloud.

Enabling other teams

At NS, we want every team to have a streamlined AWS experience, starting from the initial migration up to the point of actual usage. Teams using AWS shouldn’t have to worry about things that aren’t directly related to their mission. To facilitate this, we provide functionality that is shared across AWS teams such as networking, automation, access management and policy enforcement.

You could call this practice “Platform Engineering”. This is the mission of the team I work in. Our main product is the AWS landing zone. We basically pave the way for other teams to come in and “land” safely on AWS.

Organizing accounts

In this blog post we will focus on how we do platform engineering at NS. Specifically, we will take an in-depth look at how we approach account provisioning. By “account provisioning” we mean the creation of AWS accounts and the process of installing and configuring landing zone functionality in those accounts.

The fact that we need an account provisioning process is a result of the decisions made at NS, as well as the size of the company. If we only had one team working on AWS, we might be fine with just one account. But because we have a large number of teams that each need at least one AWS account, we need to automate the process of creating the accounts.

There are a variety of ways to organize AWS accounts, depending on your use case. You could use only one account for all your projects. Within that same approach, you could also give each project its own IAM user. Or you could give each team member a separate IAM user. Things get even more complex when you start considering AWS SSO users. You could have one account per environment, or one account per team. The possibilities are virtually endless.

Possible account organizations

Note: The validity and applicability of each of these methods depends on your use case. Most of these methods work perfectly in one scenario and would be a horrible idea in another.

At NS we have chosen to give each team three accounts to begin with — one per environment (development, staging and production). Often teams have more than just one large project. In that case, a team might end up with more than three accounts. NS has a large and growing number of teams using AWS. As a result, we have a large volume of accounts.

Provisioning AWS accounts

How do we manage all of these accounts in an automated way within a reasonable amount of time? The answer is some CloudFormation, some AWS Step Functions and a bunch of Lambda functions. Each part of the account lifecycle is automated using these tools. A typical account lifecycle is exactly what you would expect: an account is created, subsequently updated and finally deleted.

The account lifecycle

For context, it might be a good idea to take a closer look at the account lifecycle. The lifecycle starts with a “customer team”. A customer team is a team within NS that we provide AWS services for. They may want a new AWS account for any number of reasons. Often the team wants to migrate to AWS, or they are already running on AWS but have started a new project. Either way, they want a new account and will submit a formal request through our internal web portal.

Once that request reaches us through our API, we automatically start our account provisioning system with parameters taken from the request. We call this system the “Account Vending Machine” (or AVM), for obvious reasons. It takes requests as input and turns those requests into fully deployed accounts.
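To make the hand-off from request to AVM concrete, here is a minimal Python sketch of how request parameters could be turned into the input of a Step Functions execution. The `AccountRequest` fields, the helper names and the state machine ARN are assumptions made for illustration; they are not taken from the actual AVM.

```python
import json
from dataclasses import asdict, dataclass

ENVIRONMENTS = ("development", "staging", "production")


@dataclass(frozen=True)
class AccountRequest:
    """Hypothetical shape of a request arriving through the portal API."""
    team: str
    project: str
    environment: str
    owner_email: str


def build_execution(request: AccountRequest, state_machine_arn: str) -> dict:
    """Turn a request into keyword arguments for Step Functions StartExecution."""
    if request.environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {request.environment}")
    return {
        "stateMachineArn": state_machine_arn,
        "input": json.dumps(asdict(request), sort_keys=True),
    }


def start_vending(request: AccountRequest, state_machine_arn: str) -> dict:
    """Start the state machine. Needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so build_execution works without boto3

    client = boto3.client("stepfunctions")
    return client.start_execution(**build_execution(request, state_machine_arn))
```

Keeping the validation and payload construction in a pure function means that part can be unit-tested without touching AWS, which matters given how slow a full deployment cycle is.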

Once the account is created, changes are rare. During the account creation, we install certain automation systems on each account. Those are updated sometimes, resulting in updates being rolled out to all managed accounts. New automation systems may also be installed when needed. Another possible change is the application of new service control policies. These updates are all done through AWS Control Tower and AWS CloudFormation StackSets.

When the customer team disbands, or the account is no longer needed for other reasons, we will receive a request through our API that is then forwarded to the “Account Shredding Machine” (or ASM), the counterpart of the AVM. This machine checks to see if the account is still in use, and if not, it systematically removes everything from the account before closing it. It is our policy to only remove an account if it is empty (to avoid accidents), so we can easily check if it is still in use by seeing if there are any recent charges associated with the account.
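The "recent charges" check can be sketched as a small pure function. This is only an illustration: the 30-day window, the zero threshold and the idea of passing in a per-day cost mapping (for example, one built from billing data) are assumptions, not details of the real ASM.

```python
from datetime import date, timedelta
from typing import Mapping


def is_account_in_use(
    daily_charges: Mapping[date, float],
    today: date,
    window_days: int = 30,   # assumed look-back window
    threshold: float = 0.0,  # assumed: any charge at all counts as usage
) -> bool:
    """Return True if the account accrued charges within the look-back window."""
    cutoff = today - timedelta(days=window_days)
    recent_total = sum(
        amount for day, amount in daily_charges.items() if cutoff < day <= today
    )
    return recent_total > threshold


# Example: one old charge and nothing recent means the account looks unused.
charges = {date(2023, 6, 1): 4.20}
print(is_account_in_use(charges, today=date(2023, 9, 18)))
```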

Figure: The account lifecycle

Automating the account lifecycle

To summarize, our account lifecycle has three parts: provisioning, updating and removal. The first two parts of the lifecycle are handled by the AVM, while the last part is done by the ASM.

Both systems are completely serverless solutions. They are both primarily built on top of AWS Step Functions as a state machine. In these state machines we use a combination of lambdas and direct AWS API calls. The lambdas are only used when we need some more complex actions that cannot be done by simple API calls, or when the logic is too difficult to express in the state machine language.
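As an illustration of mixing direct API calls with lambdas, the sketch below builds a tiny state machine definition in Amazon States Language as a Python dictionary. The state names, the input fields and the choice of the Organizations `CreateAccount` call are assumptions; the real AVM has more steps and may create accounts differently (for example through Control Tower). `CreateAccount` is also asynchronous, so a real workflow would poll for completion, which is left out here.

```python
import json


def build_definition(configure_lambda_arn: str) -> dict:
    """Build a two-step state machine: a direct SDK call, then a Lambda task."""
    return {
        "Comment": "Sketch: one direct AWS API call followed by one Lambda task",
        "StartAt": "CreateAccount",
        "States": {
            # Simple action: call the AWS API directly, no Lambda needed.
            "CreateAccount": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:organizations:createAccount",
                "Parameters": {
                    "AccountName.$": "$.accountName",
                    "Email.$": "$.ownerEmail",
                },
                "ResultPath": "$.creation",
                "Next": "ConfigureAccount",
            },
            # Complex action: logic that is awkward to express in the state
            # machine language lives in a (hypothetical) Lambda function.
            "ConfigureAccount": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": configure_lambda_arn,
                    "Payload.$": "$",
                },
                "End": True,
            },
        },
    }


definition = build_definition(
    "arn:aws:lambda:eu-west-1:123456789012:function:configure-account"
)
print(json.dumps(definition, indent=2))
```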

This solution has worked out quite well for us. Because it is completely serverless, we have no infrastructure to manage. Apart from the initial hours required to build it and the occasional upgrade of third-party packages, the solution has almost no ongoing maintenance costs.

However, there are still some annoying aspects, such as the long execution time. Unfortunately, this issue is inherent to AWS, and to clouds in general. We cannot speed up the provisioning of cloud resources. Because we deploy quite a lot of resources, one run of a state machine can easily take 10 to 15 minutes. This also makes the development process a bit of a hassle. The standard write-compile-test (actually write-deploy-test) cycle takes about 30 minutes to complete.

The state machines are fully deployed in an “Infrastructure as Code” manner using the AWS Cloud Development Kit (CDK).

Testing the account lifecycle

So far, we have covered the “raison d’être” of the account provisioning system, the mechanics behind it and the deployment method. But how do we make sure that the system works?

One option would be to create accounts on a schedule to test the system we built — every night for example. This is a good idea, but it comes with an issue that needs to be solved to make it the right idea. The issue here is that every AWS account costs us a small amount of money. Creating accounts on a schedule would leave us with a lot of accounts in our environment. You may be wondering why we didn’t just use the account removal feature we explicitly mentioned earlier to get rid of the accounts. Simple solution, right? Not quite, as we later found out.

It turns out that AWS actually has a hard limit on how many accounts you can delete within a given period of time:

You can only close 10% of member accounts within a rolling 30 day period [with a maximum of 200 accounts]. This quota is not bound by a calendar month, but starts when you close an account. Within 30 days of that initial account closure, you can’t exceed the 10% account closure limit.

We could try to be “smart” and create 2000 accounts to have a 30-day rolling limit of 200 accounts. But if we try that, we will run into the issue that each provisioned account costs a small amount of money, and all those small amounts (× 2000) together cost quite a lot. In addition, every new account also uses up limited resources such as IP addresses. Most importantly, it is also quite clear that AWS is strongly discouraging frequent account removals with this quota, so we decided against it.
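The arithmetic behind that trade-off is easy to check. The sketch below models only the rule as quoted above (10% of member accounts, capped at 200 per rolling 30 days); the function names are made up for this example, and the current AWS quota may differ from the quoted text.

```python
def closure_quota(member_accounts: int, percent: int = 10, cap: int = 200) -> int:
    """Accounts that may be closed per rolling 30 days under the quoted rule."""
    if member_accounts < 0:
        raise ValueError("member_accounts must be non-negative")
    return min(cap, member_accounts * percent // 100)


def accounts_needed(closures_per_30_days: int, percent: int = 10, cap: int = 200) -> int:
    """Smallest number of member accounts that allows the wanted closure rate."""
    if closures_per_30_days > cap:
        raise ValueError("the wanted closure rate exceeds the hard cap")
    return -(-closures_per_30_days * 100 // percent)  # ceiling division


print(closure_quota(2000))   # 200: the "smart" approach from the text
print(accounts_needed(30))   # 300: accounts needed just for one nightly test
```

Even a single create-and-close test per night would need roughly 300 member accounts in the organization before the quota stops getting in the way.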

Instead, we solved our problem by reusing the same account, over and over again.

A depiction of the dry-run functionality

We modified the ASM so that, based on a flag, it pauses right before calling the AWS “CloseAccount” API. After adding some functionality to remove left-over resources, we could now empty AWS accounts instead of just closing them.

By adding a flag to the AVM that started the ASM after an account was provisioned, we now finally had a way to test our system. Instead of having to create a new account every test, we could simply reuse the same account, thereby avoiding the whole issue with the AWS account closing quota.
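One way to express such a dry-run flag in a state machine is a Choice state placed just before the closing step. The sketch below is an assumption about how this could look, not the actual ASM: the `dryRun` and `accountId` input fields and the state names are hypothetical, and the dry run here simply ends the execution rather than pausing it.

```python
def teardown_tail(dry_run_path: str = "$.dryRun") -> dict:
    """Final states of a shredding workflow, with a dry-run escape hatch."""
    return {
        "CheckDryRun": {
            "Type": "Choice",
            "Choices": [
                {
                    # Only treat the run as a dry run when the flag is present
                    # and true; a missing flag falls through to the default.
                    "And": [
                        {"Variable": dry_run_path, "IsPresent": True},
                        {"Variable": dry_run_path, "BooleanEquals": True},
                    ],
                    "Next": "KeepAccount",
                }
            ],
            "Default": "CloseAccount",
        },
        # Dry run: the account has been emptied but stays open for reuse.
        "KeepAccount": {"Type": "Succeed"},
        # Normal run: close the account through the Organizations API.
        "CloseAccount": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:organizations:closeAccount",
            "Parameters": {"AccountId.$": "$.accountId"},
            "End": True,
        },
    }
```

Making the real closure the default branch keeps production behaviour unchanged when the flag is absent; only an explicit `true` skips the `CloseAccount` call.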

We did not want to do these tests manually, so we created a scheduled CloudWatch Events rule that runs the test every night at midnight. Right now, if something goes wrong with our account provisioning system, we know about it within a day. We are able to fix it before the issue delays an incoming account creation request, ensuring that we reach our goal of provisioning accounts within the same business day.
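A nightly schedule of this kind could be set up roughly as follows. This is a sketch using boto3 rather than the CDK setup described earlier; the rule name, target ID, role ARN and input fields are placeholders, and schedule expressions are evaluated in UTC.

```python
import json

MIDNIGHT_UTC = "cron(0 0 * * ? *)"  # minute hour day-of-month month day-of-week year


def nightly_rule(name: str) -> dict:
    """Keyword arguments for the CloudWatch Events / EventBridge PutRule call."""
    return {"Name": name, "ScheduleExpression": MIDNIGHT_UTC, "State": "ENABLED"}


def nightly_target(rule_name: str, state_machine_arn: str, role_arn: str,
                   test_account_id: str) -> dict:
    """Keyword arguments for PutTargets: start the state machine in dry-run mode."""
    return {
        "Rule": rule_name,
        "Targets": [
            {
                "Id": "avm-dry-run",  # hypothetical target ID
                "Arn": state_machine_arn,
                "RoleArn": role_arn,  # role allowed to start the execution
                "Input": json.dumps({"accountId": test_account_id, "dryRun": True}),
            }
        ],
    }


def schedule_nightly_test(rule_name: str, state_machine_arn: str, role_arn: str,
                          test_account_id: str) -> None:
    """Create the rule and its target. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without boto3

    events = boto3.client("events")
    events.put_rule(**nightly_rule(rule_name))
    events.put_targets(
        **nightly_target(rule_name, state_machine_arn, role_arn, test_account_id)
    )
```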

With the implementation of our so-called “Account Vending Machine” and the “Account Shredding Machine” we were able to automate the AWS account lifecycle within our company. Together with the recently implemented dry-run testing functionality we can now guarantee that the time required to provision a new account stays within one business day, as was our goal.

We hope that reading about our experience with account provisioning in AWS was helpful to you. We found it to be a deceptively broad subject with a lot of nuances and no one-size-fits-all solutions. With this article we aimed to make the subject more approachable for other teams in our situation. Let us know your thoughts on the matter in the comments!
