moovel backend legacy infrastructure

Nelson Oliveira
Published in reachnow-tech · Jul 3, 2019 · 5 min read

In 2015, moovel and car2go were integrated into the same monolithic system. The year after, the moovel team decided to redesign its backend architecture and carved it out into a system of its own. This is the first post in a series that will detail how our infrastructure has evolved over the last four years, while sharing some of the lessons we learned along the way.

lots of “containers” to orchestrate

Back in 2015…

Our main goal was to reflect the fact that each squad (a service team within moovel) has its own goals and concerns, including architecture. Additionally, we wanted the infrastructure to be scalable, decoupled, secure and easy to operate. To do so, we chose Amazon Web Services (AWS) as our infrastructure provider.

Our AWS-based architecture

If you are new to the AWS world, it's useful to clarify some terminology first. We'll focus on the services that are important for us.

The initial AWS services we started to adopt were:

  • Amazon Elastic Compute Cloud (EC2);
  • EC2 Container Service (ECS);
  • CloudFormation (CFN);

Our idea was to package each service as a Docker container, upload it to Docker Hub and deploy it on an ECS cluster using CloudFormation.
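
As a rough sketch of that packaging step (the image name and tag here are made up for illustration), a squad would build its service image and push it to Docker Hub before any deployment:

# Build the service image from its Dockerfile and push it to Docker Hub
docker build -t moovel/user-api:1.0.0 .
docker push moovel/user-api:1.0.0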

EC2 is the name that Amazon uses for their virtual machine offering. There’s a variety of instance types available to choose from. They differ based on their CPU, memory, storage and networking capabilities.

ECS forms computing clusters by grouping a set of EC2 instances. These clusters can then be used as if they were a single machine. ECS is comparable to other container orchestration technologies such as Kubernetes and Mesosphere.

Each service that ECS runs needs to be packaged as a Docker image. At its core, ECS consists of a task scheduler that ensures the different services get deployed to suitable EC2 instances. To create a service, a Docker image needs to be specified alongside its CPU and memory requirements and the desired number of service instances, allowing ECS to schedule the containers within the constraints of the available resources.

We decided to create an ECS cluster per squad to decouple resources so that services operated by different squads don’t contend for the same pool of resources.

To formalize the architecture described above, we decided to use AWS CloudFormation templates. CloudFormation is the Amazon service that creates and manages resources in a reproducible and reusable way. Resources (EC2 image, ECS cluster, ECS services and others) are described in a template file, either in JSON or YAML (our preference being YAML). Using these files, CFN then creates the described resources.
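
To make this more concrete, here is a heavily trimmed sketch of what such a template could look like for a single service. The names and values are illustrative, and a real template also needs networking, IAM and scaling details that are omitted here:

Resources:
  SquadCluster:
    Type: AWS::ECS::Cluster

  UserApiTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      # Docker image plus CPU and memory requirements, as described above
      ContainerDefinitions:
        - Name: user-api
          Image: moovel/user-api:1.0.0
          Cpu: 256
          Memory: 512
          PortMappings:
            - ContainerPort: 8080
              HostPort: 8080

  UserApiService:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref SquadCluster
      TaskDefinition: !Ref UserApiTaskDefinition
      # The number of service instances ECS should keep running
      DesiredCount: 2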

Beyond AWS

We also consider tools from other providers to be part of our infrastructure.

One question we wanted to solve was: How do different services talk to each other? For instance, how does the payment system get the up-to-date postal address of a user, so it can render a PDF invoice?

The two services run in different ECS clusters. To exchange information, each of them must know on which EC2 instance the other is located. To solve this, we chose Consul as a service directory. Every running service is registered with Consul. If a service wants to know where another service is located, it'll ask Consul.
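
As an illustration of such a lookup (not necessarily how our services did it; more on that below), Consul's catalog can be queried over its HTTP API on the local agent, which listens on port 8500 by default:

# Ask the local Consul agent where instances of "user-api" are currently running
curl http://localhost:8500/v1/catalog/service/user-api
# The response is a JSON list containing, among other things, the node address and the service port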

To make it as easy as possible for each service to register with Consul, we decided to use Registrator, a Docker container that monitors all events from the Docker containers running on a given host. When a new container starts or stops, Registrator will register it with or deregister it from Consul accordingly.
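
For reference, running Registrator on a host looks roughly like this (assuming a Consul agent is reachable on localhost):

# Registrator watches the local Docker socket and syncs container starts/stops to Consul
docker run -d \
  --name=registrator \
  --net=host \
  --volume=/var/run/docker.sock:/tmp/docker.sock \
  gliderlabs/registrator:latest \
    consul://localhost:8500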

Lessons learned

Our ECS clusters waste too many resources

Consul has different ways of looking up all running instances of a service. It offers a REST API, as well as a DNS interface. We decided to use the latter approach because it freed us from integrating Consul into every service — one service simply uses ordinary DNS to reach another. To follow the earlier example, the payment system that wants to get the address of a user fetches it by doing a REST call:

http://user-api.service.consul:8080/api/internal/v1/users/36833f10-a628-49c4-930b-a89071f4c1ac

The .service.consul domain gets resolved by the Consul DNS interface. We faced a challenge with the port number in the URL: 8080. These port numbers were assigned statically, and every service used its own. There are two problems with this approach:

  1. Avoidance of port collisions must be coordinated between different service teams using some sort of manual registry, which scales poorly.
  2. Static port numbers limit the number of Docker containers that can run on the same EC2 instance. As the container has to expose the assigned port number, two instances of the same service can't run on the same EC2 instance, i.e. a service with N replicas needs N EC2 instances. We initially thought scaling container instances along with services wouldn't be a big issue, as we could just use cheap EC2 instances, but read on…
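
To make the DNS-based lookup described above concrete: the Consul agent answers DNS queries itself, by default on port 8600 (the service name below is illustrative):

# Resolve instance addresses for user-api via Consul's DNS interface
dig @127.0.0.1 -p 8600 user-api.service.consul +short
# An SRV query additionally returns the port each instance registered with
dig @127.0.0.1 -p 8600 user-api.service.consul SRV +short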

Considerations for EC2 instance types

We started by running every service on the t2.small instance type, which is cheap and has 2 GB of memory. This seemed to be the perfect option for our ECS clusters, until we learned some painful lessons about Burstable Performance Instances.

Using these burstable instance types means that whenever an instance uses more than a certain baseline amount of CPU, it draws from a pool of CPU credits. When an instance's CPU credit balance reaches zero, it is throttled down to its baseline performance, which bit us when traffic to some services ramped up.

As an example, with instances having a “Low to Moderate” network performance rating, it sometimes took more than a second to reach some of our Redis databases. Increasing the instance size solved both the CPU credit and the network latency problems for us.

Deployments with downtimes

When ECS deploys a new version of a Docker container, it will start enough instances of the newer version and then shut down the older ones. The way Registrator handled the docker stop signal proved unsatisfactory for us, because it would wait for the container shutdown to complete before informing Consul, leading to requests being routed to containers that could no longer service them. Fortunately, we were able to debug and report this problem.

Looking ahead, even in 2015

We then decided to test a different approach to the way we set up our ECS services by experimenting with the reference architecture for the then newly available Application Load Balancers. As this was new technology, we decided to approach it carefully.

As mentioned at the start of the post, this is just the beginning of a series of posts detailing the evolution of our infrastructure. Expect more to come soon!
