How to manage the FINN.no infrastructure where 170 developers deploy 1000 times a week
FINN.no is the largest marketplace in Norway, and one of Norway’s busiest web destinations. We need stable, real-time consumable infrastructure services to run, innovate on, and evolve finn.no.
This post is about how we do infrastructure at FINN, and a bit about how we got where we are.
Introduction
Continuous Deployment
Our developers commit code to Github thousands of times a day. A code commit may trigger a build. The Pipeline build creates and tests a new deployment artifact in a Dev environment. This may result in an automated deploy to production. We release both small and large changes to production 1000 times per week this way. The actual deployments are performed by FIAAS, an open-sourced FINN infrastructure-as-a-service deployment mechanism. It has a declarative nature, scheduling verified artifact deployments as one or more Kubernetes pods. Application updates are typically deployed and verified alongside the running production versions. The systematic verification, version switching, and auto-rollback make releasing to production quite safe.
To get a more complete picture of how this works, let’s summarise the technologies in use:
Application development and storage
Developer teams are free to use whichever programming languages suit the job. The majority of existing code is written in Java. New code is a mix of Java, Node, Kotlin, Scala, Haskell, React, Ruby, Python, and Go.
Most persistent structured data is stored in PostgreSQL database clusters. Ads are stored in a legacy Sybase database. Kafka messages are used to synchronize the 700 microservices. Solr is our main search technology. Redis is a popular in-memory store for fast handling of e.g. session data. Kibana is the favoured analytics and visualization platform; our teams use it to get feedback on how applications are performing. Redshift is our data warehouse platform.
We use Github and Github Enterprise for all code revision control. Maven, Bamboo, Gradle and Travis are heavily used to achieve Continuous Integration and Continuous Deployment throughout the company. Artifactory and Nexus help us store the various CI/CD artifacts.
Orchestration and Management
The FIAAS deployment daemon schedules and deploys our container workloads in Kubernetes dev and prod environments.
Distributed configuration is stored in etcd. Secrets are stored in Vault, with Consul as its storage backend. A/B testing and feature toggle roll-out is performed using Unleash.
We collect logs using Fluentd and Logstash, and aggregate using Elasticsearch. We store millions of time series in Prometheus.
We have some limited use of RabbitMQ in connection with Sensu monitoring of all our services.
All FINN load balancers are implemented using HAProxy. Some parts of finn.no traffic, like images, are served by Fastly (Varnish technology). The storage of the images is handled by a YAMS image storage service.
Runtime
We solve storage requirements with Ceph block storage and S3 buckets/cloud storage. We use Docker networking and runtimes, and Flannel for virtual networks and for attaching IP addresses to containers.
Provisioning
We use Terraform to provision infrastructure resources like compute, storage, network, and identity. Orchestration and configuration of the provisioned resources is done with Puppet and Helm. All secrets are stored in Vault.
Infrastructure
Most FINN services operate from on-premise servers located close to Oslo. VMs are created in a Mirantis OpenStack cluster, and pods are scheduled in Kubernetes clusters. We run dev environments in IBM Cloud (Softlayer). Data intelligence workloads run in Google Cloud, and the data warehouse runs in Amazon Redshift. We also use additional Schibsted-provided services running in AWS.
The whole picture
Infrastructure as code
FINN needs efficient infrastructure management to let 170 developers develop at good speed.
We are walking the ”Cloud Native” Trail to provide efficient self-serve infrastructure services. FINN has two infrastructure as code implementations:
- Terraform & Puppet — legacy
- Kubernetes & FIAAS — the cool stuff
1. Our use of Terraform and Puppet to provision infrastructure is similar to what you will find in many places: you edit config files, and infrastructure changes happen after one or more Puppet runs (or, in other shops, Chef, Ansible, or Salt runs). New VMs are created, configured, and secured. Applications are installed programmatically. We actively maintain around 350 VMs with Terraform and Puppet this way. One third of these VMs make up our 20–30 PostgreSQL clusters. The ambition is to reduce the number of these legacy servers by 50% year over year. Manual or interactive server configuration is a no-go.
2. Using FIAAS to run 700 applications in Kubernetes as 1,500 pods is quite cool! Developers use the FIAAS declarative interface to get new stuff into production in minutes. All they have to do is add a simple yaml file to their code repository, specifying ports, health and readiness checks, required memory and CPU, replicas, etc.
FIAAS creates Kubernetes pods using the specified requirements and cluster defaults. Autoscaling, logs, metrics, and alerts are automatically available, at well-defined, predictable addresses.
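To make this concrete, a fiaas.yml along these lines is all a team needs to commit. This is a sketch, not a real FINN application: the service name, paths, and values are made up, and the field names assume the open-sourced FIAAS v3 config format.

```yaml
# Hypothetical fiaas.yml for an imaginary "ad-stats" service.
# Fields left out fall back to cluster-wide defaults.
version: 3
replicas:
  minimum: 2
  maximum: 5
resources:
  requests:
    cpu: 200m
    memory: 256Mi
ports:
  - protocol: http
    target_port: 8080
healthchecks:
  liveness:
    http:
      path: /_/health
```

From a file like this, the deployment daemon derives the full set of Kubernetes resources, wiring in metrics scraping, log collection, and ingress according to the platform contract.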
Read more about this in Øyvind Ingebrigtsen Øvergaard’s excellent article on the open sourcing of FIAAS.
Around 2015, as FINN adopted Domain Driven Design and Event Driven Architecture, we evaluated several cluster orchestration tools. The candidates were:
- Mesos + Marathon/Cisco cloud/Mantl
- OpenShift
- Kubernetes
Mesos was judged to have inadequate component integrations.
OpenShift was viewed as expensive, with not much improvement over plain Kubernetes.
Kubernetes provided the desired level of flexibility. It was Open Source and the project had excellent traction.
The use of the FINN-opinionated FIAAS deployment daemon makes everyone’s world easier. Developers do not have to deal with the Kubernetes setup directly.
The declarative nature of Kubernetes has proven a great benefit. Kubernetes creates resources like Deployment, Service, Ingress, and HorizontalPodAutoscaler approximating the desired state of your application. The Kubernetes control plane starts containers, configures network resources, and instructs load balancers to achieve a deployment that corresponds to each app’s resource configuration. All the resources are configured with YAML via the Kubernetes API.
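For readers unfamiliar with what these resources look like, here is a pared-down sketch of a Deployment. The name and image are illustrative, not from an actual FINN application; the point is that you declare the desired state, and the control plane works to make the cluster match it.

```yaml
# Illustrative Deployment: declares "three replicas of this image",
# and the control plane reconciles the cluster toward that state.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: registry.example.com/example-app:1.0.0
          ports:
            - containerPort: 8080
```

A Service, Ingress, and HorizontalPodAutoscaler for the same app each need a similar block of YAML, which is exactly the volume FIAAS abstracts away.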
Configuration as Code
All configuration for development, testing, and production is part of the application code. Configuration as Code requires proper handling of secrets. We use Vault with persistent storage in Consul to provide on-demand, granular secret management. etcd is an extremely important component, as it provides shared configuration and service discovery for all containers.
Pipeline as Code
FINN uses an in-house developed Pipeline that builds, tests, and triggers deployments, both on legacy servers and as containers in Kubernetes.
Everything as Code
FIAAS sits in front of the Kubernetes APIs. It reads a fiaas.yml input, creates Kubernetes resources, and triggers the Kubernetes control plane to deploy all our applications. Fiaas.yml follows the “FIAAS config format”, an abstraction of the Kubernetes API.
Less Code
Configuring Kubernetes resources usually requires a lot of YAML. Deploying an example application requires 303 lines of resource configuration. The corresponding FIAAS configuration requires only 25 lines, saving developers roughly 90% of the configuration, and much of the hassle.
Configuration not specified in the FIAAS yaml file is generated by a common mechanism, the same for every application. This gives us the flexibility to change the underlying configuration: we can change infrastructure across all applications without altering any of the applications themselves.
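Taken to the extreme, a hypothetical team could ship with little more than the sketch below (field names again assume the FIAAS v3 format), and the common mechanism would generate replica counts, resource requests, health checks, ingress, and everything else from cluster-wide defaults:

```yaml
# Hypothetical minimal fiaas.yml: only the overrides are declared.
# Every unspecified field is generated from cluster-wide defaults,
# which the infrastructure team can change for all apps at once.
version: 3
ports:
  - protocol: http
    target_port: 5000
```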
Contracts
Contracts are an important part of why we move so quickly at FINN. Contracts enforce standards. Standards make it easier to deploy applications. Developers are empowered and can focus on the important decisions. They know that logs, metrics, and applications will appear in a standard format, under standard URLs. It helps make the infrastructure easier to manage, as communication around features, changes, and expectations is clear.
Developer contract
We enforce a contract between Developers and the Infrastructure:
- Deployment
- Traffic ingress
- Service discovery
- Observability
Platform Contract
- Kubernetes
- Metrics -> Prometheus
- Aggregated logs -> Fluentd
- Feature toggled migration with Unleash + custom ingress controller
We organise for de-coupled communication
FINN’s organisation into functional domains is inspired by and derived from:
- Self Contained Systems — separation of functionality into many independent systems
- 12 factor model — declarative setup automation, lower cost of adding employees
The goal is De-coupled Event Driven Communication between Domain Services. Business logic should reside entirely inside the domain service. Domain functionalities are called in parallel to construct customer-facing views.
Technology tiers
FINN technology is conceptually divided into 3 tiers: Business area services, Platform services, and Infrastructure.
Infrastructure is further divided into: Development support, Observability, Common services, Application platforms, Custom services and Legacy services.
It took years and a lot of effort for FINN to become able to release to production at a high frequency. See the data from FINN’s Pipeline below. In the same period we reduced the build time for an artifact from several hours to minutes. This changed our way of working: it became easy to release a fix to production instead of rolling back to a previous version.
In the period from 2010 to 2015, the number of developers increased to ~120. The number of services grew from a handful to several hundred. One deploy every 3 months became 1000 deploys every week. The tens of managed physical and virtual servers became hundreds of self-served virtual servers. We could provide the required compute resources, but developers did not have efficient and safe enough tooling, so we did not achieve the required developer efficiency. The move to containers and Kubernetes deployments, combined with the FIAAS abstraction and out-of-the-box metrics, logging, alerts, and notifications, fixed this.
We used to run a single Puppet run for all clusters. Failures in one team could easily block deployments for other teams. Developer teams would ask for machines. Infrastructure people struggled to create enough server resources. Our current container-driven infrastructure has removed these constraints. The velocity of development and deployments has improved a lot. We no longer have interdependencies between deployments. Developers are mostly self-served, declaring simple infrastructure concepts in yaml. Developer teams are free to run whatever runtime gets the job done.
As a bonus we are free to move our considerable application payload anywhere that can run containers.
Designed for failure!
We allow and celebrate failure. We cannot be fast if we cannot risk failures! Our infrastructure, the release pipeline, the microservice architecture: it is all about being able to move more quickly. We need to be able to release new features easily and safely. We need to release often to reduce the impact when a change fails.
Measure everything
We strive to measure everything. We collect millions of time series related to production services. The data provides real-time information on infrastructure, services, applications, customer behaviour, and business performance. We also collect, analyse, and aggregate tons of logs in near real-time. All this information is used to generate alarms early, so that we can handle most problems during normal work hours.
FINN in a nutshell
FINN is one of Norway’s largest in-house tech environments, and we make one of Norway’s largest web destinations. We have a working culture focused on removing dependencies. We organise as autonomous teams with all decision makers part of the team. We strive for processes and ways of working that minimise waste and waiting time. The loosely coupled architecture allows teams to move fast without fear of breaking other people’s stuff.
FINN has a strong tradition of creating bold strategies and delivering on them. Read about why Polycloud is a central part of our technology strategy. Look out for more posts about how we are moving finn.no from on-premise to the cloud as the work proceeds.