Designing APIs to be Modular, Deployable, and Secure

Sarah Purhar
Expedia Group Technology
11 min read · Sep 6, 2018

This is the second part in a five-part series of blog posts.

We covered how our application is architected in the previous post, Using AWS Managed Services in an Enterprise Environment. In this post, we’ll take a closer look at the applications we built.

CloudFormation Templates

When we first started this project, we only had a lab environment that all of the developers had access to. Since we are a relatively small team, we relied heavily on AWS Managed Services so that we didn’t have to worry about maintaining our own servers, databases, etc. Because of this, we configured everything we needed through the console. This was nice because it made spinning up new resources fast and easy.

Eventually, we set up our production environment. Looking at our current way of building everything out via the AWS Console, we realized our solution was not easy to replicate across AWS accounts. In order to deploy everything in production, we would have to explain to our production support team how to manually configure everything the same way we did in lab. And if any changes were made in lab, we would need to track those changes so they could be replicated in production.

Once we realized the problem we were facing, AWS CloudFormation became the obvious solution. CloudFormation allows you to write your infrastructure as code in the form of CloudFormation Templates. These templates can then be easily deployed using the AWS Console or CLI.

One added bonus of CloudFormation is that once your infrastructure is written as code, it can be stored in source control and tracked like any other project. This allows you to see who made what changes to the infrastructure and when. It also makes it easy to revert templates to a previous version if needed.

At this point, we created CloudFormation Templates for all of the AWS resources we were using. Once these were complete, spinning up new resources went from hours of manual configuration to a single CLI command that finishes in a few minutes. These templates also allowed us to easily deploy our applications to production: our production support team only needed to know how to run the CloudFormation templates and didn’t need to be concerned with how the application itself worked.

Doesn’t it take a long time to learn CloudFormation and write templates? Is it worth it? Yes, it’s worth it. If the amount of time it takes prevents you from creating CloudFormation templates, then you’re just being a lazy developer. Everything takes time to learn, and honestly, CloudFormation probably takes less time than most. Yes, it takes more time up front than spinning up a resource in the AWS Console, but not by much. If you take the time to do it well in a template, not only can you spin up the resource again if it breaks or you need a duplicate, you can also spin up any additional resources like it. Take advantage of features like parameters in your templates and spinning up similar resources becomes really fast and easy, as the sketch below shows.
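As a hedged illustration (the resource and parameter names here are made up, not from our actual templates), a single parameterized template can serve both lab and production:

AWSTemplateFormatVersion: '2010-09-09'
Description: Illustrative parameterized template
Parameters:
  Environment:
    Type: String
    AllowedValues: [lab, prod]
Resources:
  ConfigBucket:
    Type: AWS::S3::Bucket
    Properties:
      # The parameter keeps one template reusable across environments
      BucketName: !Sub 'my-app-config-${Environment}'

Deploying it is then a single CLI command, e.g. aws cloudformation deploy --template-file bucket.yml --stack-name my-app-prod --parameter-overrides Environment=prod.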

Services

NodeJS API

Our API is one of the biggest pieces of our application. It handles the actual getting and saving of data to our people management software. The API is written in NodeJS and runs in Docker containers on an AWS ECS Fargate cluster. In front of the API, we have an AWS API Gateway that allows us to add authorization and authentication to lock down who can access it. We detailed how we set up authorization and authentication on the API in another post in this series, Using API Gateway for Authorization and Authentication.

We created the API to abstract away calls to our people management software because of their complexity. Most information in our people management software can only be read and updated using a SOAP API; however, some information can only be read and updated using a REST API.
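To keep that split out of the rest of the codebase, the API can expose one interface per resource and decide internally which protocol to use. A rough sketch of the idea; the client modules and which fields live behind which protocol are assumptions for illustration, not our actual code:

// Hide the SOAP/REST split behind a single interface
const soapClient = require('./clients/soap'); // hypothetical wrapper around the SOAP API
const restClient = require('./clients/rest'); // hypothetical wrapper around the REST API

// Callers just ask for an employee; the protocol choice stays in here
function getEmployee(employeeId) {
  return Promise.all([
    soapClient.getPerson(employeeId),    // core person data: SOAP only
    restClient.getDocuments(employeeId)  // documents: REST only
  ]).then(function (results) {
    return Object.assign({}, results[0], { documents: results[1] });
  });
}

module.exports = { getEmployee };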

One major limitation we faced is that our application cannot retain any data. Since we are dealing with people data, most of the data is considered sensitive or PII. Storing any of this data would have required extra approvals from our internal security team as well as implementing processes to comply with GDPR. This limitation also extends to how we upload documents.

Normally, the easiest solution would be to upload and store documents in AWS S3. However, to avoid storing PII data and having to comply with GDPR, we chose to use our people management software to store documents. To upload documents to our people management software, the documents must be passed in as a Base64-encoded string. This led to additional complexity, especially when it came to document size.
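The encoding itself is simple in Node; the catch is that a Base64 string is roughly a third larger than the original file, which is where the size complexity comes from. A minimal sketch (the file path is illustrative):

// Base64-encode a document before sending it to the people management software
const fs = require('fs');

function encodeDocument(filePath) {
  // Buffers encode straight to Base64; note the ~33% size inflation
  return fs.readFileSync(filePath).toString('base64');
}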

Fargate

We’ve been working on our application for about a year and have tried a number of different container orchestration systems. At first, we went the route of spinning up our own Docker Swarm cluster on EC2 instances. Let’s just say this did not go well. While it did work for a while, eventually it spectacularly crashed and burned, mostly due to memory issues.

Around this time, Fargate was announced, and we intended to switch to it when it became available in our region. Until then, we opted for ECS, since it would make the transition to Fargate easier. We were able to quickly spin up three ECS clusters: two in our lab environment and one in production.

This worked really well, and honestly, there was nothing wrong with it. But we’re a small team, and running 15 EC2 instances for one application didn’t really seem worth it. Once Fargate was released in our region, we decided to make the switch. It gave us the ability to run containerized services without having to build out and maintain a standing ECS cluster (or three). Making the switch from a standard ECS cluster to Fargate was fairly easy, and we were able to quickly adapt our ECS CloudFormation templates to deploy to the new Fargate clusters.

Securely Passing Configuration Values

One of the biggest issues we ran into was how to securely pass our service configuration values. This is actually something that took a few iterations to get right. We tried using Consul, but it wasn’t secure enough for production. We looked into HashiCorp Vault, but Vault handles encryption and access control rather than storage; it relies on a separate storage backend such as Consul. This meant we would need to set up, run, and maintain both a Vault and a Consul instance, which was too much overhead. And we weren’t able to use Parameter Store due to requirements from our internal security team. Most documentation suggests passing the configuration to the container as environment variables on the ECS instance. This poses two big issues, though: 1) the environment variables are listed in plain text in the AWS Console, and 2) any container running on that ECS instance will have access to the environment variables. While this may be fine for some use cases, it was not something we wanted to do.

Instead, we implemented a solution that passes the configuration as environment variables to the container itself, and it was actually fairly simple. This means the environment variables cannot be found in the AWS Console and are only available to the container running the API.

The first step was to create a KMS key. We created a customer managed CMK and added the application’s IAM role as a user of the key.
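For illustration, the key can also be declared in a CloudFormation template. A sketch mirroring what the console’s “key user” setting grants (the account ID and role name are placeholders):

ConfigKey:
  Type: AWS::KMS::Key
  Properties:
    Description: Encrypts the service configuration file
    KeyPolicy:
      Version: '2012-10-17'
      Statement:
        # Keep the account root as key administrator so the key stays manageable
        - Sid: EnableRootAdministration
          Effect: Allow
          Principal:
            AWS: arn:aws:iam::<ACCOUNT_ID>:root
          Action: kms:*
          Resource: '*'
        # Grant the application role the permissions a console "key user" gets
        - Sid: AllowUseOfTheKey
          Effect: Allow
          Principal:
            AWS: arn:aws:iam::<ACCOUNT_ID>:role/<APP_ROLE_NAME>
          Action:
            - kms:Encrypt
            - kms:Decrypt
            - kms:ReEncrypt*
            - kms:GenerateDataKey*
            - kms:DescribeKey
          Resource: '*'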

Next, we created an S3 bucket to store the configuration file. The bucket has AWS-KMS encryption enabled, using the KMS key we created. It also has a bucket policy that denies unencrypted uploads and unencrypted (non-SSL) in-flight operations.

Bucket Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUnEncryptedObjectUploads",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<BUCKET_NAME>/*",
      "Condition": {
        "StringNotEquals": {
          "s3:x-amz-server-side-encryption-aws-kms-key-id": "<KEY_ARN>"
        }
      }
    },
    {
      "Sid": "DenyUnEncryptedInflightOperations",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::<BUCKET_NAME>/*",
      "Condition": {
        "Bool": {
          "aws:SecureTransport": "false"
        }
      }
    }
  ]
}

Then, we updated our Dockerfile and added an ENTRYPOINT script that reaches out to S3, grabs the configuration file, and exports the values as environment variables within the container. The script needs to know the name of the S3 bucket where the configuration file is stored and the ARN of the key used to encrypt it. In order to run the same script in lab and production, we pass these two values in as environment variables in the ECS task definition. The other key piece of this script is to make sure you can still call your Dockerfile’s CMD. This is accomplished by ending the ENTRYPOINT script with exec "$@".

secrets-entrypoint.sh

#!/bin/bash
# Check that the required environment variables have been set
if [ -z "$BUCKET_NAME" ]; then
  echo >&2 'error: missing BUCKET_NAME environment variable'
  exit 1
fi
if [ -z "$KEY_ARN" ]; then
  echo >&2 'error: missing KEY_ARN environment variable'
  exit 1
fi

echo 'Loading Secrets:'
# Load the S3 secrets file contents into environment variables
eval "$(~/bin/aws s3 cp "s3://${BUCKET_NAME}/settings.txt" - --sse-kms-key-id "${KEY_ARN}" | sed 's/^/export /')"

echo 'Starting App:'
# Hand off to the Dockerfile's CMD
exec "$@"
settings.txt

KEY1=<value1>
KEY2=<value2>
Dockerfile

FROM node:8.1.4

# Install the AWS CLI
RUN apt-get update && \
    apt-get install -y python-dev && \
    apt-get install -y unzip
RUN curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
RUN unzip awscli-bundle.zip
RUN ./awscli-bundle/install -b ~/bin/aws
RUN ~/bin/aws --version

WORKDIR /app
EXPOSE 80
ADD . /app

ENTRYPOINT ["./secrets-entrypoint.sh"]
CMD NODE_ENV=$ENV npm start

With all this work done, our application was then able to read in the configuration values from the container’s environment variables and use them accordingly.
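On the Node side, reading those values is then just a matter of pulling them off process.env. A minimal sketch, assuming the KEY1/KEY2 names from the settings.txt example above:

// config.js: read the values loaded by the entrypoint script
const config = {
  key1: process.env.KEY1,
  key2: process.env.KEY2,
};

// Fail fast if the entrypoint script did not load a required value
if (!config.key1) {
  throw new Error('Missing required configuration value: KEY1');
}

module.exports = config;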

Data Loader Lambda

While the Data Loader Lambda is a very small piece of our project, it plays a pretty important role. In order to save data into our people management software, we need some very system-specific identifiers. A good example of this is gender. When our application started, the only genders supported by our people management software were ‘male’ and ‘female’. Now there’s work underway to support ‘non-binary’. As soon as this option is available in our people management software, we also want to make it available in our application without needing to redeploy. We want to ensure our app is always up to date and using the most recent values from our people management software; the Data Loader Lambda is how we do that. In order to get these identifiers, we export CSV files from our people management software and push them to an S3 bucket. A lambda function is triggered by file uploads to that S3 bucket; it parses the data and adds it to a DynamoDB table so it is easily available for our application to use.
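A minimal sketch of what such an S3-triggered handler might look like; the table name and the two-column CSV layout are assumptions for illustration, not our actual schema:

// Hypothetical S3-triggered loader (aws-sdk v2, callback-style handler)
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const dynamo = new AWS.DynamoDB.DocumentClient();

exports.handler = (event, context, callback) => {
  // The S3 event identifies the uploaded CSV file
  const record = event.Records[0].s3;
  const params = { Bucket: record.bucket.name, Key: record.object.key };

  s3.getObject(params, (err, data) => {
    if (err) return callback(err);

    // Assume one "label,identifier" pair per line
    const rows = data.Body.toString('utf-8').trim().split('\n');
    const writes = rows.map((row) => {
      const parts = row.split(',');
      return dynamo.put({
        TableName: 'lookup-values', // hypothetical table name
        Item: { label: parts[0], identifier: parts[1] },
      }).promise();
    });

    Promise.all(writes)
      .then(() => callback(null, 'Loaded ' + rows.length + ' row(s)'))
      .catch(callback);
  });
};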

Employee Loader Lambda

The Employee Loader Lambda is the service that loads employees into our application. It’s made up of two pieces: an SQS queue and a lambda function. The queue receives messages from our people management software, and the lambda function processes those messages, adding the employees to our system.

Due to the nature of our application, we needed to find a way to restrict access to the SQS queue. We didn’t want just anyone sending messages to the queue; we wanted to restrict it to our people management software. Normally this would be fairly easy to handle: SQS supports signed POST requests to a queue, and signed requests include an IAM role that can grant permissions to the user. This wasn’t feasible for us because our people management software has a very limited ability to make REST requests, which means it cannot make signed POST requests. It can, however, make GET requests. Because of this, we initially decided to use a queue policy to whitelist IP addresses that had permission to send messages to the queue (a sketch of such a policy is below). This was fairly easy to set up and let us whitelist entire CIDR blocks. While it did work for a while, we soon realized it was a difficult way to restrict traffic and wasn’t scalable long term; we weren’t always able to predict what IP address our people management software would use to send requests to the queue.
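A sketch of the kind of queue policy we mean; the queue ARN is a placeholder and the CIDR block is just an example range:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSendFromWhitelistedIPs",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "sqs:SendMessage",
      "Resource": "arn:aws:sqs:us-west-2:<ACCOUNT_ID>:<QUEUE_NAME>",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": "203.0.113.0/24"
        }
      }
    }
  ]
}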

In order to keep the queue secure, we opted to put API Gateway in front of it. This allowed us to do two things. First, we gave the API Gateway an IAM role with permissions to send and delete messages on the SQS queue. Second, we protected the API Gateway with an API key. Now any user who needs to send messages to the queue can do so securely. We detailed how we set up authorization and authentication on the API in another post in this series, Using API Gateway for Authorization and Authentication.
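A caller then only needs the endpoint URL and the key, which API Gateway reads from the x-amz standard x-api-key header. Roughly, with a hypothetical stage, resource path, and query parameter:

curl -H "x-api-key: <API_KEY>" \
  "https://<API_ID>.execute-api.us-west-2.amazonaws.com/prod/messages?body=<MESSAGE>"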

The second piece of the Employee Loader is the serverless lambda function. The lambda function is scheduled to run every 15 minutes and query the queue for messages. If any messages are found, it looks up additional information about each user, adds them to our DynamoDB table, and sends them an email with access to the application.
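A rough sketch of that queue-draining handler; the queue URL variable and the elided processing steps are assumptions for illustration:

// Hypothetical scheduled handler (aws-sdk v2, callback-style for nodejs6.10)
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// Queue URL assumed to be provided through the environment
const QUEUE_URL = process.env.QUEUE_URL;

exports.handler = (event, context, callback) => {
  // Pull a batch of messages off the queue
  sqs.receiveMessage({
    QueueUrl: QUEUE_URL,
    MaxNumberOfMessages: 10,
  }, (err, data) => {
    if (err) return callback(err);

    const messages = data.Messages || [];
    messages.forEach((message) => {
      const employee = JSON.parse(message.Body);
      // ... look up additional information, add to DynamoDB, send the email ...

      // Delete the message only after it has been processed
      sqs.deleteMessage({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle,
      }, () => {});
    });

    callback(null, 'Processed ' + messages.length + ' message(s)');
  });
};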

One aspect of serverless that we really started to take advantage of while working on this service was the ability to declare things like the schedule and IAM permissions as part of the serverless.yml file. The serverless.yml acts like a much simpler form of a CloudFormation template (really, it’s building one for you). Declaring IAM permissions in this file allowed us to track what permissions the lambda function needed and helped ensure it had the same permissions in any environment it was deployed to. Since we were deploying this lambda function across AWS accounts, the only tricky part was making sure not to use environment-specific values in the serverless.yml. To handle this, we created a command line parameter to specify the environment and then pulled in any variables we needed from an environment-specific settings file.

serverless.yml

provider:
  name: aws
  runtime: nodejs6.10
  timeout: 20
  stage: ${opt:env} # command line parameter
  region: us-west-2
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "sqs:ReceiveMessage"
        - "sqs:DeleteMessage"
      Resource: "${file(settings/${opt:env}.json):sqs.queueArn}" # variable from the env-specific settings file
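The 15-minute schedule mentioned above is declared in the same file, alongside the handler. A sketch, where the function and handler names are hypothetical:

functions:
  employeeLoader:
    handler: handler.loadEmployees # hypothetical handler name
    events:
      - schedule: rate(15 minutes) # trigger the lambda every 15 minutes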

Why isn’t the Employee Loader just a route in the API? Originally, that was our plan, but there were two things we needed to take into account. The first was timeouts. If the Employee Loader were a route in the API, our people management software would have to sit and wait for a response from our application. Since our application runs through a number of steps to add employees to our system, and it can process multiple employees at a time, the lambda function takes a fair amount of time to run. Had we gone with an API route, API Gateway would time out at 29 seconds, whereas a lambda function can run for up to 5 minutes.

The second thing we had to factor in was a way to decouple our application from our people management software. If our application were down when our people management software wanted to send it a request, the request would fail. By splitting this service into two pieces, we decoupled the two systems: the people management software can stack up as many messages as it needs in the queue, and when our application is running, it will process them.

Conclusion

As you can see, we put a lot of effort into using AWS Managed Services in a resourceful way that supports our applications well. If you decide to do something similar, we cannot stress enough the importance and strength of using CloudFormation templates to help build out your infrastructure.

In our next couple of blog posts, we’ll talk more about our UI and adding Authorization and Authentication to secure all of our applications.
