How to manage the FINN.no infrastructure where 170 developers deploy 1000 times a week
FINN.no is the largest marketplace in Norway, and one of Norway’s busiest web destinations. We need stable, real-time consumable infrastructure services to run, innovate on, and evolve finn.no.
This post is about how we do infrastructure at FINN, and a bit about how we got where we are.
Introduction
Continuous Deployment
Our developers commit code to Github thousands of times a day. A code commit may trigger a build. The Pipeline build creates and tests a new deployment artifact in a Dev environment. This may result in an automated deploy to production. We release both small and large changes to production 1000 times per week this way. The actual deployments are performed by FIAAS, an open-sourced FINN infrastructure-as-a-service deployment mechanism. It has a declarative nature, scheduling verified artifact deployments as one or more Kubernetes pods. Application updates are typically deployed and verified alongside the running production versions. The systematic verification, version switching, and auto-rollback make releasing to production quite safe.
To get a more complete picture of how this works, let’s summarise the technologies in use:
Application development and storage
Developer teams are free to use whichever programming languages suit the job. The majority of existing code is written in Java. New code is a mix of Java, Node, Kotlin, Scala, Haskell, React, Ruby, Python, and Go.
Most persistent structured data is stored in PostgreSQL database clusters. Ads are stored in a legacy Sybase database. Kafka messages are used to synchronize the 700 microservices. Solr is our main search technology. Redis is a popular in-memory store for fast handling of e.g. session data. Kibana is the favoured analytics and visualization platform; our teams use it to get feedback on how applications are performing. Redshift is our data warehouse platform.
We use Github and Github Enterprise for all code revision control. Maven, Bamboo, Gradle and Travis are heavily used to achieve Continuous Integration and Continuous Deployment throughout the company. Artifactory and Nexus help us store the various CI/CD artifacts.
Orchestration and Management
The FIAAS deployment daemon schedules and deploys our container workloads in Kubernetes dev and prod environments.
Distributed configuration is stored in etcd. Secrets are stored in Vault, with Consul as its storage backend. A/B testing and feature toggle roll-out is performed using Unleash.
We collect logs using Fluentd and Logstash, and aggregate using Elasticsearch. We store millions of time series in Prometheus.
We have some limited use of RabbitMQ in connection with Sensu monitoring of all our services.
All FINN load balancers are implemented using HAProxy. Some parts of finn.no traffic, like images, are served by Fastly (Varnish technology). The storage of the images is handled by a YAMS image storage service.
Runtime
We solve storage requirements with Ceph block storage and S3 buckets/cloud storage. We use Docker networking and runtimes, and Flannel for virtual networks and for attaching IP addresses to containers.
Provisioning
We use Terraform to provision infrastructure resources like compute, storage, network, and identity. Orchestration and configuration of the provisioned resources is done with Puppet and Helm. All secrets are stored in Vault.
Infrastructure
Most FINN services operate from on-premise servers located close to Oslo. VMs are created in a Mirantis OpenStack cluster, and pods are scheduled in Kubernetes clusters. We run dev environments in IBM Cloud (Softlayer). Data intelligence workloads run in Google Cloud, and the data warehouse runs in Amazon Redshift. We also use additional Schibsted-provided services running in AWS.
The whole picture
Infrastructure as code
FINN needs efficient infrastructure management to let 170 developers develop at good speed.
We are walking the ”Cloud Native” Trail to provide efficient self-serve infrastructure services. FINN has two infrastructure as code implementations:
- Terraform & Puppet — legacy
- Kubernetes & FIAAS — the cool stuff
1. Our use of Terraform and Puppet to provision infrastructure is similar to what you will find in many places: you edit config files, and infrastructure changes happen after one or more Puppet runs (or, in other shops, Chef, Ansible, or Salt runs). New VMs are created, configured, and secured. Applications are installed programmatically. We actively maintain around 350 VMs with Terraform and Puppet this way. One third of these VMs make up our 20–30 PostgreSQL clusters. The ambition is to reduce the number of these legacy servers by 50% year over year. Manual or interactive server configuration is a no-go.
2. Using FIAAS to run 700 applications in Kubernetes as 1,500 pods is quite cool! Developers use the FIAAS declarative interface to get new stuff into production in minutes. All they have to do is add a simple yaml file to their code repository, specifying ports, health and readiness checks, required memory and CPU, replicas, etc.
FIAAS creates Kubernetes pods using the specified requirements and cluster defaults. Autoscaling, logs, metrics, and alerts are automatically available, at well-defined, predictable addresses.
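To make this concrete, a fiaas.yml along these lines is all a team needs to commit. This is a sketch, not a real FINN application: the service name, paths, and values are made up, and the field names assume the open-sourced FIAAS v3 config format.

```yaml
# Hypothetical fiaas.yml for an imaginary "ad-stats" service.
# Fields left out fall back to cluster-wide defaults.
version: 3
replicas:
  minimum: 2
  maximum: 5
resources:
  requests:
    cpu: 200m
    memory: 256Mi
ports:
  - protocol: http
    target_port: 8080
healthchecks:
  liveness:
    http:
      path: /_/health
```

From a file like this, the deployment daemon derives the full set of Kubernetes resources, wiring in metrics scraping, log collection, and ingress according to the platform contract.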
Read more about this in Øyvind Ingebrigtsen Øvergaard’s excellent article on the open sourcing of FIAAS.
Around 2015, as FINN adopted Domain Driven Design and Event Driven Architecture, we evaluated several cluster orchestration tools. The candidates were:
- Mesos + Marathon/Cisco cloud/Mantl
- OpenShift
- Kubernetes
Mesos was judged to have inadequate component integrations.
OpenShift was viewed as expensive, with not much improvement over plain Kubernetes.
Kubernetes provided the desired level of flexibility. It was Open Source and the project had excellent traction.
The use of the FINN-opinionated FIAAS deployment daemon makes everyone’s world easier. Developers do not have to deal with the Kubernetes setup directly.
The declarative nature of Kubernetes has proven a great benefit. Kubernetes creates resources like Deployment, Service, Ingress, and HorizontalPodAutoscaler approximating the desired state of your application. The Kubernetes control plane starts containers, configures network resources, and instructs load balancers to achieve a deployment that corresponds to each app’s resource configuration. All the resources are configured with YAML via the Kubernetes API.
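For readers unfamiliar with what these resources look like, here is a pared-down sketch of a Deployment. The name and image are illustrative, not from an actual FINN application; the point is that you declare the desired state, and the control plane works to make the cluster match it.

```yaml
# Illustrative Deployment: declares "three replicas of this image",
# and the control plane reconciles the cluster toward that state.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: registry.example.com/example-app:1.0.0
          ports:
            - containerPort: 8080
```

A Service, Ingress, and HorizontalPodAutoscaler for the same app each need a similar block of YAML, which is exactly the volume FIAAS abstracts away.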
Configuration as Code
All configuration for development, testing, and production is part of the application code. Configuration as Code requires proper handling of secrets. We use Vault with persistent storage in Consul to provide on-demand, granular secret management. etcd is an extremely important component, as it provides shared configuration and service discovery for all containers.
Pipeline as Code
FINN uses an in-house developed Pipeline that builds, tests, and triggers deployments, both on legacy servers and as containers in Kubernetes.
Everything as Code
FIAAS sits in front of the Kubernetes APIs. It reads a fiaas.yml input, creates Kubernetes resources, and triggers the Kubernetes control plane to deploy all our applications. Fiaas.yml follows the “FIAAS config format”, an abstraction of the Kubernetes API.
Less Code
Configuring Kubernetes resources usually requires a lot of YAML. Deploying an example application requires 303 lines of resource configuration. The corresponding FIAAS configuration requires only 25 lines, saving developers roughly 90% of the configuration, and much of the hassle.
Configuration not specified in the FIAAS yaml file is generated by a common mechanism, the same for every application. This gives us the flexibility to change the underlying configuration: we can change infrastructure across all applications without altering any of the applications themselves.
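Taken to the extreme, a hypothetical team could ship with little more than the sketch below (field names again assume the FIAAS v3 format), and the common mechanism would generate replica counts, resource requests, health checks, ingress, and everything else from cluster-wide defaults:

```yaml
# Hypothetical minimal fiaas.yml: only the overrides are declared.
# Every unspecified field is generated from cluster-wide defaults,
# which the infrastructure team can change for all apps at once.
version: 3
ports:
  - protocol: http
    target_port: 5000
```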
Contracts
Contracts are an important part of why we move so quickly at FINN. Contracts enforce standards. Standards make it easier to deploy applications. Developers are empowered and can focus on the important decisions. They know that logs, metrics, and applications will appear in a standard format, under standard URLs. It helps make the infrastructure easier to manage, as communication around features, changes, and expectations is clear.
Developer contract
We enforce a contract between Developers and the Infrastructure:
- Deployment
- Traffic ingress
- Service discovery
- Observability
Platform Contract
- Kubernetes
- Metrics -> Prometheus
- Aggregated logs -> Fluentd
- Feature toggled migration with Unleash + custom ingress controller
We organise for de-coupled communication
FINN’s organisation into functional domains is inspired by and derived from:
- Self Contained Systems — separation of functionality into many independent systems
- 12 factor model — declarative setup automation, lower cost of adding employees
The goal is De-coupled Event Driven Communication between Domain Services. Business logic should reside entirely inside the domain service. Domain functionalities are called in parallel to construct customer-facing views.
Technology tiers
FINN technology is conceptually divided into 3 tiers: Business area services, Platform services, and Infrastructure.
Infrastructure is further divided into: Development support, Observability, Common services, Application platforms, Custom services and Legacy services.
It took years and a lot of effort for FINN to become able to release to production at a high frequency. See the data from FINN’s Pipeline below. In the same period we reduced the build time for an artifact from several hours to minutes. This changed our way of working: it became easy to release a fix to production instead of rolling back to a previous version.
In the period from 2010 to 2015, the number of developers increased to ~120. The number of services grew from a handful to several hundred. One deploy every 3 months became 1000 deploys every week. The tens of managed physical and virtual servers became hundreds of self-served virtual servers. We could provide the required compute resources, but developers did not have efficient and safe enough tooling, so we did not achieve the required developer efficiency. The move to containers and Kubernetes deployments, combined with the FIAAS abstraction and out-of-the-box metrics, logging, alerts, and notifications, fixed this.
We used to run a single Puppet run for all clusters. Failures in one team could easily block deployments for other teams. Developer teams would ask for machines. Infrastructure people struggled to create enough server resources. Our current container-driven infrastructure has removed these constraints. The velocity of development and deployments has improved a lot. We no longer have interdependencies between deployments. Developers are mostly self-served, declaring simple infrastructure concepts in yaml. Developer teams are free to run whatever runtime gets the job done.
As a bonus we are free to move our considerable application payload anywhere that can run containers.
Designed for failure!
We allow and celebrate failure. We cannot be fast if we cannot risk failures! Our infrastructure, the release pipeline, the microservice architecture: it is all about being able to move more quickly. We need to be able to release new features easily and safely. We need to release often to reduce the impact when a change fails.
Measure everything
We strive to measure everything. We collect millions of time series related to production services. The data provides real-time information on infrastructure, services, applications, customer behaviour, and business performance. We also collect, analyse, and aggregate tons of logs in near real-time. All this information is used to generate alarms early, so that we can handle most problems during normal work hours.
FINN in a nutshell
FINN is one of Norway’s largest in-house tech environments, and we make one of Norway’s largest web destinations. We have a working culture focused on removing dependencies. We organise as autonomous teams with all decision makers part of the team. We strive for processes and ways of working that minimise waste and waiting time. The loosely coupled architecture allows teams to move fast without fear of breaking other people’s stuff.
FINN has a strong tradition of creating bold strategies and delivering on them. Read about why Polycloud is a central part of our technology strategy. Look out for more posts about how we are moving finn.no from on-premise to the cloud as the work proceeds.