4 Engineers, 7000 Servers, And One Global Pandemic

Inside one of our data centers (tiny COVIDs by CDC on Unsplash).

Who we are

We are a team of 4 penguins who enjoy writing code and tinkering with hardware. In our spare time, we are in charge of deploying, maintaining and operating a fleet of over 7000 physical servers running Linux, spread across 3 different DCs located in the US.


Challenges of Scale

While it may make sense for a start-up to host its infrastructure in a public cloud due to the relatively small initial investment, we at Outbrain choose to host our own servers. We do so because, once a certain scale is reached, the ongoing costs of public cloud infrastructure far surpass the costs of running our own hardware in colocated data centers, and owning the hardware offers an unparalleled degree of control and fault recovery.

Master your Domains

To solve many of these challenges, we’ve broken down a server’s life cycle at Outbrain into its primary components, which we call domains. For instance, one domain encompasses hardware demand, another the logistics of the inventory life cycle, while a third is communication with the staff onsite. There is another one concerning hardware observability, but we won’t go into all of them at this time. The purpose is to examine and define each domain so it can be abstracted away through code. Once a working abstraction is developed, it is first translated into a manual process that is deployed, tested and improved, in preparation for a coded algorithm. Finally, each domain is set up to integrate with the other domains via APIs, forming a coherent, dynamic, and ever-evolving hardware life cycle system that is deployable, testable and observable, just like all of our other production systems.
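
To make the idea concrete, here is a minimal sketch (hypothetical names, not our production code) of how a domain can be modelled: every domain hides its internals behind the same small API, so it can be deployed, tested and wired to other domains like any other service.

```python
# A minimal sketch (hypothetical names, not our production code) of the domain idea:
# every domain exposes the same small API and can be tested and wired to others.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class WorkItem:
    """A unit of work flowing between domains, e.g. 'add two DIMMs to server X'."""
    server: str
    action: str
    details: dict = field(default_factory=dict)


class Domain(ABC):
    """The contract every domain implements."""

    @abstractmethod
    def submit(self, item: WorkItem) -> str:
        """Accept a work item and return a tracking id."""

    @abstractmethod
    def status(self, tracking_id: str) -> str:
        """Report the current state of a previously submitted item."""


class DemandDomain(Domain):
    """Example: the demand domain, fronted in practice by a ticketing system."""

    def __init__(self) -> None:
        self._items: dict[str, WorkItem] = {}

    def submit(self, item: WorkItem) -> str:
        tracking_id = f"DEMAND-{len(self._items) + 1}"
        self._items[tracking_id] = item
        return tracking_id

    def status(self, tracking_id: str) -> str:
        return "open" if tracking_id in self._items else "unknown"
```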

The Demand Domain

While emails and spreadsheets were an acceptable way to handle demand in the early days, they were no longer sustainable once the number of servers and the volume of incoming hardware requests reached a certain point. In order to better organize and prioritize incoming requests in an environment of rapid expansion, we had to rely on a ticketing system that:

  • Could be customised to present only relevant fields (simple)
  • Exposed APIs (extendable; see the sketch after this list)
  • Was familiar to the team (sane)
  • Integrated with our existing workflows (unified)
JIRA Kanban board.
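
As an illustration of the “exposed APIs” requirement, here is a hedged sketch of opening a hardware-demand ticket through Jira’s REST API. The URL, project key, issue type and credentials are placeholders, not our actual configuration.

```python
# A hedged sketch: creating a hardware-demand ticket via Jira's REST API.
# The URL, project key, issue type and credentials below are placeholders.
import requests

JIRA_URL = "https://jira.example.com"     # placeholder
AUTH = ("svc-hardware", "api-token")      # placeholder credentials


def open_demand_ticket(summary: str, description: str) -> str:
    """Create a demand ticket and return its key (e.g. 'HW-123')."""
    payload = {
        "fields": {
            "project": {"key": "HW"},     # placeholder project key
            "issuetype": {"name": "Task"},
            "summary": summary,
            "description": description,
        }
    }
    resp = requests.post(f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH, timeout=30)
    resp.raise_for_status()
    return resp.json()["key"]


# Example: a request for more capacity in one of the DCs.
# print(open_demand_ticket("Add 10 worker nodes", "Needed for upcoming traffic growth"))
```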

The Inventory Lifecycle Domain

You can only imagine the complexity of managing the parts that go into each server. To make things worse, many parts (memory, disks) travel back and forth between inventory and different servers, according to requirements. Lastly, when parts fail, they are either decommissioned and swapped, or returned to the vendor for RMA. All of this, of course, needs to be communicated to the colocation staff onsite who do the actual physical labor. To tackle these challenges we created an internal tool called Floppy. Its job is to:

  • Abstract away the complexities of adding/removing parts whenever a change is introduced to a server
  • Manage communication with the onsite staff including all required information — via email / iPad (more on that later)
  • Update the inventory once work has been completed and verified
Disk inventory management Grafana dashboard.

The RMA flow is automated as well. When a failed part has to go back to the vendor, the tooling (a minimal sketch follows below):

  • Collects the system logs
  • Generates a report in the vendor’s preferred format
  • Opens a claim with the vendor via API
  • Returns a claim ID for tracking purposes
Jenkins console output.
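
Here is a minimal sketch of that RMA flow. The ipmitool invocation and the vendor claim API are illustrative assumptions, not the actual integration behind our tooling.

```python
# A minimal sketch of the automated RMA flow. The ipmitool invocation and the
# vendor claim API are illustrative assumptions, not the actual integration.
import subprocess
import requests

VENDOR_API = "https://support.vendor.example.com/api/claims"   # placeholder endpoint


def collect_system_logs(bmc_host: str) -> str:
    """Pull the hardware event log from the failed server's BMC."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", "admin", "-P", "secret",
         "sel", "elist"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def open_rma_claim(bmc_host: str, part_serial: str, failure_summary: str) -> str:
    """Generate a report in the vendor's format, open a claim, return the claim id."""
    report = {
        "serial": part_serial,
        "failure_summary": failure_summary,
        "logs": collect_system_logs(bmc_host),
    }
    resp = requests.post(VENDOR_API, json=report, timeout=30)
    resp.raise_for_status()
    return resp.json()["claim_id"]                              # placeholder schema


# Example (placeholders): open_rma_claim("bmc-r12-u07.example.net", "SN12345", "DIMM ECC errors")
```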

The Communication Domain

In order to support the rapid increases in capacity, we’ve had to adapt the way we work with the onsite data center technicians. At first, growing in scale meant buying new servers; after a large consolidation project (powered by a move to Kubernetes), it became something entirely different. Our growth transformed from “placing racks” to “repurposing servers”: instead of adding capacity, we started opening up servers and replacing their parts. To successfully make this paradigm shift, we had to stop thinking of Outbrain’s colocation providers as vendors, and begin seeing them as our clients. Adopting this approach meant designing and building the right toolbox to make the work of data center technicians:

  • Simple
  • Autonomous
  • Efficient
  • Reliable

The resulting flow for every piece of onsite work looks like this (a minimal sketch follows the list):

  • The mechanism validates that this is indeed a server that needs to be worked on
  • The application running on the server is shut down (where needed)
  • A set of work instructions is posted to a Slack channel, explaining the required steps
  • Once the work is completed, the tool validates that the final state is correct
  • And, if needed, restarts the application
An iPad rig at one of our data centers.
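
A hedged sketch of that flow is below. The Slack webhook URL, the maintenance queue and the stubbed steps are placeholders standing in for our internal systems.

```python
# A hedged sketch of the flow triggered from the iPad: validate the target, shut
# down the workload if needed, post instructions to Slack, then verify and restart.
# The webhook URL, maintenance queue and stubbed steps are placeholders.
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"   # placeholder
MAINTENANCE_QUEUE = {"srv-1234"}                                    # placeholder source of truth


def post_instructions(server: str, steps: list[str]) -> None:
    """Publish the work instructions to the onsite Slack channel."""
    text = f"*Work order for {server}*\n" + "\n".join(
        f"{i}. {step}" for i, step in enumerate(steps, start=1)
    )
    requests.post(SLACK_WEBHOOK, json={"text": text}, timeout=10).raise_for_status()


def run_work_order(server: str, steps: list[str], serving_traffic: bool) -> None:
    # 1. Validate that this is indeed a server that needs to be worked on.
    if server not in MAINTENANCE_QUEUE:
        raise RuntimeError(f"{server} is not scheduled for maintenance")

    # 2. Shut down the application running on the server (where needed).
    if serving_traffic:
        print(f"draining and stopping the workload on {server}")    # stub

    # 3. Post the required steps to the onsite Slack channel.
    post_instructions(server, steps)

    # 4. Once the technician reports the work as done, validate the final state
    #    and, if needed, restart the application (both stubbed here).
    print(f"verifying {server} and restarting the workload if required")
```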

The Hardware Observability Domain

Scaling our data center infrastructure reliably requires good visibility into every component of the chain, for instance:

  • Hardware fault detection
  • Server states (active, allocatable, zombie, etc.)
  • Power consumption (one way to collect this is sketched after the list)
  • Firmware levels
  • Analytics on top of it all
Grafana dashboard of rack power utilization.
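
As one example of how such a signal can be collected, here is a hedged sketch that reads per-server power draw over the BMC’s standard Redfish API; the hostnames, credentials and chassis id are placeholders.

```python
# A hedged sketch of one observability signal: per-server power draw read over the
# BMC's Redfish API. Hostnames, credentials and chassis id are placeholders;
# /redfish/v1/Chassis/<id>/Power is part of the standard Redfish schema.
import requests


def read_power_watts(bmc_host: str, chassis_id: str = "1") -> float:
    """Return the instantaneous power draw reported by the BMC, in watts."""
    url = f"https://{bmc_host}/redfish/v1/Chassis/{chassis_id}/Power"
    resp = requests.get(url, auth=("admin", "password"), verify=False, timeout=10)
    resp.raise_for_status()
    return float(resp.json()["PowerControl"][0]["PowerConsumedWatts"])


if __name__ == "__main__":
    # Example: sum a rack's draw and compare it to the PDU budget (placeholder hosts).
    rack = [f"bmc-r12-u{unit:02d}.example.net" for unit in range(1, 5)]
    total = sum(read_power_watts(host) for host in rack)
    print(f"rack draw: {total:.0f} W")
```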

And then came COVID-19…

At Outbrain, we build technologies that empower media companies and publishers across the open web by helping visitors discover relevant content, products and services that may be of interest to them. Our infrastructure is therefore designed to absorb the traffic spikes generated when major news events unfold. So when COVID-19 hit and physical access to data centers became restricted, the way we had already been operating meant that very little had to change:

  • The hardware in our data centers, for the most part, is not physically accessible to us
  • We rely on remote hands for almost all of the physical work
  • Remote hands work is done asynchronously, autonomously and in large volumes
  • We meet hardware demand by “building from parts” rather than “buying ready kits”
  • We keep inventory so we can build, not just fix
