Creating our piece of cloud in AWS

In 2018, all our production infrastructure was running in an established French hosting facility, but we were still provisioning servers the traditional way. Creating new VMware VMs took time, and configuring them took a few days. When we needed extra resources ahead of known traffic spikes, we had to order servers, wait for delivery, and keep paying for them even if they sat unused after the peak. Traffic was already reaching 27M+ monthly visits, yet we often faced outages during periods of high load. For example, when the Marketing team ran TV ads, we had to handle 500% of our usual traffic within a few minutes, and the existing infrastructure wasn’t dynamically scalable in any way. So at the end of 2018, we decided to move from this host to the cloud to gain flexibility and scalability. We naturally chose AWS: we were already using it for other needs (storage in S3, data warehouse, etc.), all the services we needed were readily available, and some team members were already trained or certified.

Apart from capacity and scaling, another benefit of the migration was the opportunity to start from a blank page and implement best practices for everything from the beginning, especially security! In this article, we present the choices we made to build our infrastructure.

Where to put our piece of cloud?

TheFork headquarters are based in Paris, and we had the choice between Ireland, where we were already using some resources and where new services in Europe often ship first, and the brand new Paris region (launched in December 2017), which had fewer services available but already offered the ones we needed. We finally decided to use the Paris (eu-west-3) region. First, it was the best compromise, with lower latencies* for the majority of our consumers (France, Italy, Spain). Second, we knew that we had to put a network link (AWS Direct Connect) in place between our French host and AWS. Using this region made the integration with their data centers (also in the Paris area) easier for the migration, but that’s another story.

* We are also using Fastly and CloudFront CDNs to reduce latencies.

Split the environments

One of our first major choices was to split our environments into separate AWS accounts and Organizational Units (OUs). This way, we ensure that resources are never shared across environments. We can also apply a generic policy to most of our accounts while restricting sensitive accounts further, always authorizing only the strict minimum of needed services (thanks to Service Control Policies (SCPs)).
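To illustrate the "strict minimum of needed services" idea, here is a minimal sketch of what such an SCP document could look like, built in Python. The list of allowed services is purely hypothetical (the real list per account is not given here); the sketch only shows the common pattern of denying every action that is not on an allow-list.

```python
import json

# Hypothetical allow-list: the services one account is permitted to use.
ALLOWED_SERVICE_ACTIONS = ["ec2:*", "eks:*", "s3:*", "rds:*"]


def build_service_allowlist_scp(allowed_actions):
    """Return an SCP document that denies every action outside the allow-list."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyEverythingOutsideAllowList",
                "Effect": "Deny",
                # NotAction + Deny: anything not listed here is refused.
                "NotAction": sorted(allowed_actions),
                "Resource": "*",
            }
        ],
    }


scp = build_service_allowlist_scp(ALLOWED_SERVICE_ACTIONS)
print(json.dumps(scp, indent=2))
```

A more sensitive account would simply get a shorter allow-list, attached at the level of its OU.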

So we currently have these accounts (one per OU):

  • Root: Mainly where the IAM users are managed
  • Tools: Where all the transversal tools given to developers reside (Log stack (ELK), Grafana, Jenkins, Vault, …)
  • Staging: The Staging/Preprod environment (last stage before Production)
  • Production: Where the production runs
  • External: A second production environment shared with external contributors (mostly blogs)
  • External-staging: The Staging/Preprod environment of external account

At the networking level, we also chose to “peer” accounts (= put a network link between two accounts) only between the Tools account and the other ones. “Tools” needs access to the other accounts, for example to collect metrics from the monitoring servers (Prometheus) of each environment and make them available in Grafana (running in the Tools environment), or to deploy the applications from the deployment tool, which also runs in Tools. However, other environments such as Production and Staging should not be able to communicate with each other, so we will never create a link between them, to ensure a real separation at the network level.
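The peering rule can be written down as a small hub-and-spoke model. This is only a sketch: the account names come from the list above, the Root account is left out on the assumption that it hosts no workloads to peer with, and the helper functions are hypothetical.

```python
from itertools import combinations

# Account names from the list above; "tools" is the hub.
# Assumption: the Root account is omitted because it only manages IAM users.
ACCOUNTS = ["tools", "staging", "production", "external", "external-staging"]
HUB = "tools"


def allowed_peerings(accounts, hub):
    """Return the permitted peering links: hub <-> every other account."""
    return {frozenset((hub, account)) for account in accounts if account != hub}


def forbidden_peerings(accounts, hub):
    """Return every account pair that must never be peered."""
    every_pair = {frozenset(pair) for pair in combinations(accounts, 2)}
    return every_pair - allowed_peerings(accounts, hub)


links = allowed_peerings(ACCOUNTS, HUB)
blocked = forbidden_peerings(ACCOUNTS, HUB)

for link in sorted(sorted(pair) for pair in links):
    print("peered:", " <-> ".join(link))
```

A check like this could run against the list of declared peering connections to catch an accidental Production to Staging link.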

A very high-level overview could be:

High level view of the infrastructure

Find a good network infrastructure and use it everywhere

The second choice was to find the best possible network architecture and reproduce it in every account. First, as you can see in the diagram above, we decided to use one VPC per account (and not the default one). I’m spoiling the story a bit, but we knew that we wanted to use Kubernetes and EKS to run all our workloads. However, even without this choice, these fundamental network choices would have been the same.

Each VPC is a /16 network, split like this:

  • 3 Availability zones (AZ)
  • 1 public subnet on each AZ
  • 1 private subnet on each AZ
  • 1 NAT Gateway (NGW) with 1 Elastic IP (EIP) per AZ: to provide internet access to the private subnets
  • 1 Internet Gateway (IGW) per VPC: to provide internet access to the public subnet, and to the NGWs

This gives us something like (Production example):

The public subnets will never contain anything other than the Internet Gateway, the NAT Gateways, the public ALBs, and a Bastion.

We are finally ready to create everything we need on top of it as we now have steady and durable network foundations.

Note: Be careful when sizing private subnets if you intend to use EKS, as it needs a LOT of IPs. Prefer large subnets (for example a /20, which provides 4,096 addresses per subnet).
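To get a feel for the numbers, here is a small sketch that carves a /16 into one private /20 and one public subnet per Availability Zone using Python's standard `ipaddress` module. The `10.0.0.0/16` range, the /24 size for public subnets, and the `plan_subnets` helper are assumptions made for the example, not our actual addressing plan.

```python
import ipaddress

# Hypothetical VPC range; the article only states that each VPC is a /16.
VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")
AZS = ["eu-west-3a", "eu-west-3b", "eu-west-3c"]
AWS_RESERVED_PER_SUBNET = 5  # AWS reserves 5 addresses in every subnet


def plan_subnets(vpc, azs, private_prefix=20, public_prefix=24):
    """Carve one private and one public subnet per AZ out of the VPC range."""
    private_blocks = vpc.subnets(new_prefix=private_prefix)
    plan = {az: {"private": next(private_blocks)} for az in azs}
    # Public subnets come from the next unused private-sized block.
    public_blocks = next(private_blocks).subnets(new_prefix=public_prefix)
    for az in azs:
        plan[az]["public"] = next(public_blocks)
    return plan


def usable_addresses(subnet):
    """Number of addresses left once the AWS-reserved ones are removed."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


plan = plan_subnets(VPC_CIDR, AZS)
for az, subnets in plan.items():
    for kind, subnet in subnets.items():
        print(f"{az} {kind:<7} {subnet} ({usable_addresses(subnet)} usable IPs)")
```

With this layout, most of the /16 stays unallocated, which leaves room to add subnets later without renumbering.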

Use Kubernetes for almost everything

Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. We decided to use it to gain scalability, flexibility, and security. We will not describe in detail how we are using it here (another article will go deeper, so stay tuned and subscribe to be notified when it’s released!). What is important here is that we are using the managed Amazon Elastic Kubernetes Service (Amazon EKS). We know that managing a Kubernetes cluster is a day-to-day job that we cannot take on, so EKS gives us a highly available cluster with little maintenance time for the team, along with all the features and security we need.

We are using AWS EKS, but we are managing our workers ourselves. The “Managed Nodes” feature was not available when we created our first clusters, and at the moment we don’t believe that switching to it would bring us great value, but we will certainly re-assess it in the future as the number of clusters grows with new business needs.

As you may know, when using EKS, AWS manages the Control Plane in its own VPC and communicates with the workers using specific Elastic Network Interfaces (ENIs). The production traffic comes through a different path, which we built like this:

  1. Facing the internet, you will find at least one Application Load Balancer (ALB) (in most cases two: one for restricted access only, the other for customer traffic). We are using Fastly as a Content Delivery Network (CDN), so one of the ALBs is the backend for our domains, whose DNS records point to Fastly services.
  2. Behind the ALB, the traffic is sent to a Target Group where you will find the first EKS worker type: the Ingress. We are using Istio as a service mesh in our clusters, and the role of these workers is to host the Istio ingress gateway pods only, so we are confident that, in case of high traffic, these pods will not have to share resources with unrelated pods. We have at least 1 Ingress worker per Availability Zone to ensure High Availability, thanks to our autoscaling groups management (see below), nodeSelector (which ensures that the pods are scheduled on these specific workers), and podAntiAffinity (which ensures that 2 ingress gateway pods are not scheduled on the same worker).
  3. After the traffic goes through the Ingress, Istio routes it to the proper Kubernetes service and pods. These pods run on the second type of worker: the (Generic) Worker. These are generic workers where any non-specific application can run. This is our main and largest type of node, as we want things to be as generic as possible.
  4. Beside these generic workers, you will find the third type of worker: the Dedicated Workers. They host the pods that need non-shared or specific resources (for example an Elastic Block Store (EBS) volume or a specific instance type). We use them to host Solr, for example, which is our current search middleware and needs dedicated resources, or Prometheus, which needs a huge amount of RAM (especially in production).

All these workers are spread across the three Availability Zones inside Autoscaling Groups (ASG).

Finally, you will find a private ALB, which also has the ingress workers in its target group, so that we can reach the cluster internally from other environments, for example from the Tools environment.
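The scheduling constraints mentioned in step 2 (nodeSelector and podAntiAffinity) can be sketched as the relevant fragment of a pod spec, built here as a Python dictionary. The label names and values are hypothetical; only the Kubernetes field names (`nodeSelector`, `podAntiAffinity`, `requiredDuringSchedulingIgnoredDuringExecution`, `topologyKey`) are standard.

```python
import json

# Hypothetical label names and values, chosen for the example only.
INGRESS_NODE_LABEL = {"node-type": "ingress"}
INGRESS_POD_LABEL = {"app": "istio-ingressgateway"}


def ingress_gateway_scheduling(node_label, pod_label):
    """Build the scheduling part of a pod spec for the ingress gateway pods."""
    return {
        # Only schedule on nodes carrying the ingress label.
        "nodeSelector": dict(node_label),
        "affinity": {
            "podAntiAffinity": {
                # Hard rule: never place two gateway pods on the same node.
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {
                        "labelSelector": {"matchLabels": dict(pod_label)},
                        "topologyKey": "kubernetes.io/hostname",
                    }
                ]
            }
        },
    }


spec = ingress_gateway_scheduling(INGRESS_NODE_LABEL, INGRESS_POD_LABEL)
print(json.dumps(spec, indent=2))
```

Using the hostname as the topology key spreads the pods across workers; spreading across zones then follows from having at least one ingress worker per Availability Zone.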

The production environment example with EKS:

Use an Autoscaling group per Availability Zone

Something you may have noticed in the previous diagram is that we are using one Autoscaling Group per Availability Zone, and not a single global one for the three zones. It’s another choice we made to always have the desired number of workers in each zone.

It was done mainly after feedback from a talk at the KubeCon + CloudNativeCon conference (link to the slides here), where the speakers explained that they were running 1000+ nodes in a single ASG and observed that, by the end of each day, almost all the nodes had ended up in the same AZ, so they needed to rebalance them during the night.

Of course, this is something we definitely don’t want for the ingress nodes, for example, but the same is true for almost every node type, so we decided to apply this rule to every ASG.
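As a rough sketch of this rule, the following Python snippet generates one single-zone ASG definition per Availability Zone for a given worker type. The `AsgDefinition` class, the naming scheme, and the sizes are hypothetical; the zone names are those of the Paris region.

```python
from dataclasses import dataclass

AZS = ["eu-west-3a", "eu-west-3b", "eu-west-3c"]


@dataclass(frozen=True)
class AsgDefinition:
    """Hypothetical, simplified description of one Autoscaling Group."""
    name: str
    availability_zone: str
    min_size: int
    max_size: int


def asgs_per_zone(worker_type, azs, min_per_zone, max_per_zone):
    """Return one single-AZ ASG definition per zone for a worker type."""
    return [
        AsgDefinition(
            name=f"{worker_type}-{az}",
            availability_zone=az,
            min_size=min_per_zone,
            max_size=max_per_zone,
        )
        for az in azs
    ]


ingress_asgs = asgs_per_zone("ingress", AZS, min_per_zone=1, max_per_zone=4)
for asg in ingress_asgs:
    print(asg)
```

Because each group is pinned to one zone with its own minimum, the "at least 1 worker per zone" guarantee does not depend on how a multi-zone group happens to balance itself.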

Everything as code, especially the infrastructure

“Everything as code” has been our motto from the first day. For infrastructure, it was natural to create a git repository and put everything related to the resources we create in it. We chose Terraform, as it’s easy to use and proven, and the providers are usually updated quickly. It’s also a tool that we can use for other needs like GitHub configuration, database management, Fastly service configuration, etc.

The repository is organized in folders, one per environment (tools, staging, production, …), in which you will find all the .tf files that contain the Terraform code for every resource. We are not using modules (for now), as we wanted to keep this repository simple to use.

To deploy a change in the infrastructure, you just need to create a Merge Request from your new branch. This triggers a Jenkins job that launches the CI. The CI checks the format, validates the resources, runs some Conftest checks (Conftest is a utility that helps to write tests against structured configuration data, take a look here if you don’t know it) to ensure proper security options are set, and gives you the “terraform plan” so you can check that the changes are good.

Failing pipeline on a change on staging and production files
Green pipeline on a change on production files only
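To give an idea of the kind of security check that runs in such a pipeline, here is a plain Python stand-in (not Conftest itself, whose policies are written in Rego). It inspects the JSON form of a plan, as produced by `terraform show -json`, and reports planned RDS instances without storage encryption. The sample plan, the resource names, and the choice of rule are made up for the example.

```python
import json

# Trimmed, made-up sample of the JSON produced by `terraform show -json <planfile>`.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {"address": "aws_db_instance.orders", "type": "aws_db_instance",
     "change": {"actions": ["create"], "after": {"storage_encrypted": true}}},
    {"address": "aws_db_instance.legacy", "type": "aws_db_instance",
     "change": {"actions": ["create"], "after": {"storage_encrypted": false}}},
    {"address": "aws_db_instance.retired", "type": "aws_db_instance",
     "change": {"actions": ["delete"], "after": null}}
  ]
}
""")


def unencrypted_databases(plan):
    """Return the addresses of planned RDS instances without storage encryption."""
    violations = []
    for resource in plan.get("resource_changes", []):
        if resource.get("type") != "aws_db_instance":
            continue
        after = resource.get("change", {}).get("after")
        if after is None:
            continue  # the resource is being deleted, nothing to check
        if not after.get("storage_encrypted", False):
            violations.append(resource["address"])
    return violations


violations = unencrypted_databases(SAMPLE_PLAN)
for address in violations:
    print(f"FAIL: {address} must set storage_encrypted = true")
```

In a CI job, a non-empty list of violations would fail the pipeline before anyone gets to review the plan.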

If the pipeline is green and you have the minimum number of approvals for your merge request, you are good to merge your branch into master. A new Jenkins job is triggered, all the tests run one more time for every environment, and you can check the “plan” again before manually approving and running the “apply”.

Applying a change on staging

This way, it’s pretty easy to make a change in the infrastructure (create a new bucket, a new RDS instance, modify a policy, …) while keeping all the security needed (auditability, for example). We avoid using the Console to make changes in 99% of cases; the exceptions are mainly on-call (duty) interventions or instance terminations.

We can’t manage everything

Besides scalability, flexibility, and high availability, the cloud provides us with something else that is crucial: a huge number of AWS-managed resources. Our team can’t manage everything (database servers + a Kafka cluster and its servers + …). AWS provides a ton of services that allow us to focus on providing good tools to developers (instead of debugging a Kubernetes etcd issue, for example). It’s very important for us to focus on subjects that ease the developer experience and improve developers’ ability to ship more features. It’s also doable because, thanks to everything we put in place, issues coming from the infrastructure are now rare.

Here is a non-exhaustive list of the managed resources we are using in AWS:

  • Amazon Elastic Kubernetes Service (EKS): We already talked about it earlier.
  • Amazon Relational Database Service (RDS): To host all our PostgreSQL and MySQL databases in every environment.
  • Amazon Managed Streaming for Apache Kafka (Amazon MSK): Used as a broker between Filebeat and Logstash in our logging infrastructure (Elastic Stack).
  • Amazon ElastiCache for Redis: For the application to store some keys, sessions, …
  • Amazon ElastiCache for Memcached: Same thing but with Memcached.
  • Amazon Elasticsearch Service: To progressively replace Solr, our search middleware in the applications.
  • Amazon S3: To store everything we need.
  • Amazon Elastic File System (EFS): Some old internal applications are using NFS to share some files.

Conclusion

Our Production can sustain 4k+ requests per second (8k+ including traffic served from the Fastly cache), and we have already had 3000+ pods running at the same time. As you can see in the graph above, in the last month we had a maximum of 250+ workers running at the same time in our main cluster, and 120+ in the production one alone, with a 2x difference in capacity between low- and high-traffic periods (autoscaled). In Tools, we are running a 20-node Elasticsearch cluster (13 data nodes, 21TB of used space) in Kubernetes (this one is historically self-managed), etc. All of this was only possible thanks to the choices we made at the beginning of the infrastructure creation, from which we never deviated.

After 2 years of migration, with around 150 microservices to dockerize and move to the cloud (another story again), we can now fully benefit from all that the cloud and AWS can give us. We are now very confident when the Marketing team is planning TV ads. We also have fewer and fewer on-call interventions. We started from a blank page and we know that we have built a very stable infrastructure. Starting from this foundation, we are now able to provide more features to developers. In the end, it’s our customers who will benefit from this infrastructure, as we are now better able to deliver more and more features to them.

Follow TheFork Engineering Blog to stay up to date, and if you want to be part of our adventures in the cloud, check our opportunities here.

OPS at TheFork since 3y+