How to manage the FINN.no infrastructure where 170 developers deploy 1000 times a week
FINN.no is the largest marketplace in Norway, and one of Norway's busiest web destinations. We need stable, on-demand infrastructure services to run and innovate our product, and to ensure finn.no keeps evolving.
This post is about how we do infrastructure at FINN, including some information about the process leading up to where we are.
At FINN, developers commit code to Github thousands of times a day. A code commit may trigger a build, leading our pipeline to build and test a new deployment in the Dev environment. An automated deploy may be triggered as a result. This way, we release both small and large changes to production 1000 times per week. Deployments are performed by FIAAS: an open-sourced FINN infrastructure-as-a-service deployment daemon that we developed in-house. FIAAS — being of a declarative nature — schedules a deployment of the verified artifact as one or more pods and containers in a Kubernetes production cluster. Deployment and verification of a new application version typically happen alongside the current production version. This makes it easy to point to and verify the new version, with the ability to point back to the previous one in case of failure.
Here is a short summary of the technologies in use at FINN:
Application development and storage
Developer teams are free to choose whichever programming languages fit the job. The majority of our existing code is written in Java; new code is a mix of Java, Node, Kotlin, Scala, Haskell, React, Ruby, Python, and Go.
Most persistent structured data is stored in PostgreSQL database clusters and in a legacy Sybase database. Kafka messages are used to synchronize the 550 microservices. Solr is the main search technology, while Redis is a popular in-memory store for fast storage and retrieval of e.g. session data. Kibana is the favoured analytics and visualization platform our teams use to see how their applications are performing, and Redshift is the data warehouse platform.
We use Github and Github Enterprise for all code revision control.
Artifactory and Nexus help us store the various artifacts produced in the application development process.
Maven, Bamboo and Travis are heavily used to achieve Continuous Integration and Continuous Deployment throughout the company.
Orchestration and Management
Kubernetes and the FIAAS deployment daemon are used to schedule and deploy the container workload in our dev and prod environments.
All distributed configuration is stored in etcd, secrets are stored in Vault, and A/B testing and feature toggle roll-out is performed using Unleash.
Log collection is performed using Fluentd and Logstash, and log aggregation is handled by Elasticsearch. Time series are stored in InfluxDB.
RabbitMQ is used by Sensu as it monitors all our services.
FINN load balancers are implemented using HAProxy. All finn.no images are served by Fastly (Varnish technology) and stored in the YAMS image storage service.
Storage required outside Application development is solved by using Ceph block storage and Amazon S3 buckets.
We use Docker as our container runtime, with Flannel providing the virtual networks that attach IP addresses to containers.
Terraform is used to provision infrastructure resources (e.g. compute, storage, network, identity). Puppet and Helm are used for orchestrating and configuring these provisioned resources. All secrets are stored in Vault.
Most FINN services operate from on-premise servers located close to Oslo. VMs are created in Mirantis OpenStack clusters, while the Kubernetes clusters run directly on hardware. We run our dev environment in IBM Cloud (Softlayer), and we run data intelligence workloads in Google Cloud. In addition, we use several Schibsted-provided services running in AWS.
The whole picture
Infrastructure as code
FINN has a need for efficient infrastructure management to satisfy the needs of 175 developers.
To limit wait times and provide self-service infrastructure, we are walking down the "Cloud Native" trail. This means we provision and manage infrastructure through machine-readable definition files rather than manual hardware configuration or interactive configuration tools.
FINN has two infrastructure as code implementations:
- Terraform & Puppet — legacy
- Kubernetes & FIAAS — the cool stuff
Our use of Terraform and Puppet to provision infrastructure is similar to what you will find in many other places. You edit config files and infrastructure changes happen after one or more Puppet/Chef runs, or after invoking Ansible or Salt. New VMs are created, configured and secured as specified, and applications are installed programmatically. We actively maintain around 350 VMs with Terraform and Puppet, 1/3 being 20–30 PostgreSQL clusters. The ambition is to reduce the number of these legacy servers by 50% year over year.
Using FIAAS to run 500 applications in Kubernetes as 1,500 pods is quite cool! FIAAS provides a declarative interface enabling developers to get something new into production in minutes. All they have to do is add one simple YAML file to their code repository, specifying ports, health and readiness checks, required memory and CPU, replicas, etc.
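A minimal sketch of such a file, based on the open-sourced FIAAS config format (the field values here are illustrative, not taken from a real FINN application):

```yaml
# Illustrative fiaas.yml; values are made up for this example.
version: 3
replicas:
  minimum: 2
  maximum: 5
resources:
  requests:
    cpu: 200m
    memory: 256Mi
ports:
  - protocol: http
    target_port: 8080
healthchecks:
  liveness:
    http:
      path: /healthz
```

Anything not specified falls back to sensible defaults, so most applications only declare the handful of settings where they differ from the norm.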
Kubernetes pods are created using the specified requirements and sensible defaults. Autoscaling, logs, metrics and alerts are automatically available at well defined, predictable addresses.
While adopting Domain-Driven Design and Event-Driven Architecture, FINN evaluated several cluster orchestration tools in May 2015. The candidates were:
- Mesos + Marathon / Cisco Cloud / Mantl: judged to have lacking component integrations.
- OpenShift: viewed as expensive, without much improvement over plain Kubernetes.
- Kubernetes: provided the desired level of flexibility, was open source, and the project had excellent traction.
Our Kubernetes setup includes a FINN-opinionated deployment daemon (FIAAS) to make everyone's world easier.
The declarative nature of Kubernetes has proven a great benefit. You create resources (Deployment, Service, Ingress, and e.g. HorizontalPodAutoscaler) that describe the desired state of your application. Once the objects are created, the Kubernetes control plane reads the resource configuration, starts containers, and configures network resources and load balancers to achieve the deployment. All resources are configured as YAML via the Kubernetes API.
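For comparison, a hand-written Deployment for a hypothetical service might look like this (the name, labels, and image are illustrative):

```yaml
# A minimal Kubernetes Deployment declaring the desired state:
# two replicas of an example container image.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: registry.example.com/example-app:1.0.0
          ports:
            - containerPort: 8080
```

A full production setup also needs a Service, an Ingress, probe and autoscaler definitions, and so on, which is exactly the boilerplate FIAAS generates on developers' behalf.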
Configuration as Code
Configuration for development, testing and production is part of our code. Since Configuration as Code requires proper handling of secrets, we decided to use Vault with persistent storage in Consul. It's worth mentioning that etcd is extremely important at FINN, where it functions as a shared configuration store and service discovery mechanism for all containers.
Pipeline as Code
FINN uses an in-house developed Pipeline that builds, tests and triggers deployments either on our legacy servers or as containers in Kubernetes.
Everything as Code
FIAAS sits in front of the Kubernetes APIs. It reads a fiaas.yml input and creates Kubernetes resources, which in turn end up triggering the Kubernetes control plane to deploy your application. Fiaas.yml is described in the FIAAS config format, a sort of abstraction over the Kubernetes API.
Configuring Kubernetes resources requires a lot of YAML: deploying an example application takes 303 lines of Kubernetes resource configuration. The corresponding FIAAS configuration requires only 25 lines to deploy the same application, saving developers 92% of the lines, and a lot of hassle.
The configuration you don't set in the fiaas.yml file is generated by the same mechanism for every application, giving us room to change the underlying configuration and infrastructure across all applications without altering any of them.
Contracts are an important part of why we at FINN are able to move so quickly. Contracts enforce standards that make it easier for our developers to deploy their applications. Developers are empowered and can focus on the important decisions, knowing that their logs, metrics, and applications will appear in a standard format, under a standard URL, and so on. At the same time, the Infrastructure team only has to maintain one logging solution, or one monitoring solution.
We enforce a contract between Developers and the Infrastructure:
- Traffic ingress
- Service discovery
- Metrics -> Prometheus
- Aggregated logs -> Fluentd
- Feature toggled migration with Unleash + custom ingress controller
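As an illustration of the metrics part of the contract, one common Kubernetes pattern (assumed here as an example, not necessarily FINN's exact setup) is to annotate the pod template so Prometheus can discover where to scrape:

```yaml
# Hypothetical pod-template annotations for Prometheus discovery.
# The annotation keys must match the scrape configuration of the
# Prometheus server in use; path and port are illustrative.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8080"
```

Because the deployment daemon fills in this kind of wiring uniformly, every application's metrics show up in the same place without per-team setup.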
We organize for de-coupled communication
FINN's organization into functional domains is inspired by and derived from:
- Self Contained Systems — separation of functionality into many independent systems,
- 12-factor model — declarative setup automation, lower cost of adding employees, etc.
The goal is de-coupled, event-driven communication between domain services, where business logic resides entirely inside the domain service; everything is strung together by calling domains in parallel to construct customer-facing views.
FINN technology is conceptually divided in 3 tiers: Business area services, Platform services, and Infrastructure.
Infrastructure is further divided into Development support, Observability, Common services, the application platforms, some Custom services and Legacy services:
As shown by data from FINN's pipeline, it took years (and a lot of effort) for FINN to become able to release to production at a high frequency. In the same period we reduced the build time for an artifact from several hours to minutes, making it easy to release a fix to production instead of rolling back to a previous version.
In the period from 2010 to 2015, the number of developers increased to ~120, while the number of services grew from a handful to several hundred. One deploy every 3 months became 1000 deploys every week. In the same period FINN went from tens of managed physical and virtual servers to hundreds of self-served virtual servers, and went on to deploy everything new as autoscaling Kubernetes deployments.
We used to run a single Puppet run for all clusters, where failures in one team could block deployments for other teams. Developer teams would ask for more machines, and infrastructure people were a bottleneck struggling to create enough server resources. Our current container-driven infrastructure solves these challenges and has greatly improved deployment velocity. Today, there are no interdependencies between deployments, and developers are mostly self-served with simple, YAML-declared infrastructure concepts. Developer teams can run whatever runtime gets the job done without Ops or other developers needing to worry.
We are ready to move all our application workloads to the cloud, or anywhere that can run containers.
Designed for failure!
At FINN, we allow and celebrate failure because we need it to be able to move fast! The infrastructure, the release pipeline, the microservice architecture are all about being able to move more quickly, and to be able to release new features relatively easily and safely with limited impact should a change or update fail.
At FINN, we strive to measure everything. We collect around 2 million time series related to production services and 800,000 time series related to development environments. They provide us with all kinds of real-time information on infrastructure, services, applications, customer behaviour, and finances. In addition, we collect, analyze and aggregate tons of logs in near real-time. Together with metrics, they are used to generate alarms in such a way that most problems that arise are small enough to be handled during normal work hours.
FINN in a nutshell
FINN is one of Norway’s largest in-house tech environments, making one of Norway’s largest Web destinations. We have a great working culture focusing on removing dependencies. We organize as autonomous teams with all decision makers as part of the team. We strive to have processes and ways of working that minimize waste and waiting time. Loosely coupled architecture allows teams to move fast without fear of breaking other people’s stuff.
FINN has a strong tradition of creating bold strategies and delivering on them, and in our current infrastructure strategy we aim to move finn.no to the cloud. We will be blogging more about this later.