How Udemy does Build Engineering
An overview of Build Engineering at Udemy and the way we provide continuous integration at scale.
At Udemy our engineering team is growing at a steady rate, with hundreds of developers that are spread across different teams and many time zones. This diversity has resulted in varying technology stacks depending on the team’s purpose, with some teams possibly having no overlap in technologies. There’s one area that they all have in common, however — the need to test and prove our code. We use Jenkins for all of our CI, but we built a custom frontend and backend for developers to interact with in order to remove all the pain points we all have experienced with Jenkins. Let’s dive into how, and why.
Build Engineering
I work as part of the “Build [Engineering] team” at Udemy — a team with the core mission to empower engineers with the right tools and automation in place to ensure the best developer user experience possible. Our team is small, but we all have different career backgrounds that enhance our empathy towards frustrations developers may experience. With backgrounds from SRE at Microsoft to a GNU contributor, we cover many bases. I come from a feature development, frontend-heavy background, but I’ve always liked to dive into tooling, infrastructure and operational work as much as I can. I’m also a huge open source fan and contribute to webpack and their webpack-contrib projects where possible.
Our primary focus is the management of some development environments that deal with the core Udemy application (there are a few teams that manage their own) and to provide the infrastructure for all teams to run continuous integration no matter their needs.
When it comes to continuous integration, it goes without saying that each team, each project, and each service requires a different scale of complexity when testing their code. For example, our main application has 20,000+ tests running across varying test runners (E2E, frontend, unit, integration, etc.). At the time of writing, the average duration of a CI run for our main application is around 21 minutes — a number we’re always looking to beat. Improvements are made in places, but then more features and tests are added.
This has caveats — through no fault of their own, developers rely heavily on CI to know whether their code has introduced regressions, as it’s not viable to run the test suites locally. Developer time is a precious commodity that shouldn’t be wasted — not only should we support every team being able to test their code, we should also enable them to do it as quickly as possible. Having to wait 45–60 minutes for CI to run on every commit is frustrating, and every minute we can shave off that experience stacks up, increases productivity and, more importantly, saves developers’ time.
Jenkins
Ah, Jenkins. We all love to hate it, but it’s without a doubt the most established open-source CI platform around. We rolled out our own Jenkins cluster running in AWS, as a third-party CI provider would be too expensive for our use case. For example, to be able to run CI for our main application in a reasonable amount of time, we use a machine with 72 cores and 144GiB of RAM. This allows us enough parallelisation and compute power for each stage. We run around 5,000 of these per month (our developers deploy changes to Udemy.com around 30–40 times per day).
The other pipelines we run account for another 8,000 builds a month, meaning we run around 13,000 builds per month, or nearly 700 per day, excluding weekends.
We don’t require branches to be up to date with master, as this would be extremely annoying when queuing for a release. Instead, we fast-forward merge any build into master, to make sure it’s working with the latest code. As the majority of developers work on widely different aspects of the codebase at any given time, conflicts and regressions not caught by CI are very rare. To reduce the risk, we expire any build that is older than 2 days (excluding weekends, so you don’t have to re-run CI on a Monday morning, eurgh). This means that a pull request is guaranteed to be working against a master that is at most 2 days old. Any large, important changes to the codebase involve broadcast communication to developers that they will need to rebase onto master, and we have the ability to mark all builds as failed, prompting developers to push a new commit (therefore fast-forward merging it into master on CI) or to rebase onto master for the build to pass.
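A minimal sketch of that weekday-aware expiry rule (illustrative, not our production code) — a build is considered stale once it is more than two weekdays old, so Friday's builds are still valid on Monday morning:

```typescript
// Count the weekdays (Mon–Fri) strictly between two dates.
function weekdaysBetween(from: Date, to: Date): number {
  let count = 0;
  const cursor = new Date(from);
  cursor.setHours(0, 0, 0, 0);
  const end = new Date(to);
  end.setHours(0, 0, 0, 0);
  while (cursor < end) {
    cursor.setDate(cursor.getDate() + 1);
    const day = cursor.getDay();
    if (day !== 0 && day !== 6) count++; // skip Sunday (0) and Saturday (6)
  }
  return count;
}

// A build expires when it is more than 2 weekdays old.
function isBuildExpired(builtAt: Date, now: Date): boolean {
  return weekdaysBetween(builtAt, now) > 2;
}
```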
In our experience, there is one major limitation to Jenkins — the UI. As a feature developer at previous companies, the experience in debugging my builds was frustrating and confusing. As an engineer on the Build team at Udemy, the experience in configuring pipelines didn’t differ — it too was frustrating and confusing. At the time of writing this post, we have 63 pipelines defined in Jenkins and managing them all through the Jenkins UI wouldn’t be a pleasant experience.
Note — outside of our main application’s build configuration (which is massively complex, with parallel steps, steps that rely on other parallel steps to finish, etc.) we offer a feature we call “generic pipelines” to any development team.
We try to make as few assumptions about what these pipelines do as possible, and only require there to be a Makefile with `build`, `test` and `deploy` targets (the `deploy` target is only run on merges into the default branch, such as master). Our agents are very barebones, so we encourage the usage of Docker to provide repeatable environments no matter where they’re run. We scale these agents automatically depending on demand and usage, so they’re not idle for too long before being destroyed. This also enables us to ensure that the host machines aren’t affecting builds (nothing is perfect, and artifacts or processes that we can’t account for may have been left behind by previous builds, possibly affecting future runs) — all machines are terminated and replaced within 24 hours.
We didn’t always have 63 pipelines. When I joined Udemy in January 2019, we had 18. These were defined in a Job DSL file, with a pipeline specific array that we’d have to update every time a team wanted a new pipeline created. It looked something like this:
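Roughly (this is an illustrative reconstruction — the real file's names and fields differed):

```groovy
// Illustrative Job DSL sketch: a hand-maintained array of pipelines,
// edited every time a team wanted a new one.
def pipelines = [
    [name: 'payments-service', repo: 'udemy/payments-service'],
    [name: 'search-indexer',   repo: 'udemy/search-indexer'],
    // ...one entry per team
]

pipelines.each { p ->
    multibranchPipelineJob(p.name) {
        branchSources {
            git {
                remote("git@github.com:${p.repo}.git")
            }
        }
    }
}
```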
This wasn’t scalable, both in terms of adding new configurable properties of a pipeline, and adding new pipelines. We ended up blocking teams that had to wait for us to create their pipeline and run the Job DSL for them. We hate blocking teams — we consider our work best done when they don’t need to think about us at all.
Like most engineers, we love to automate anything that’s repetitive. Building on the groundwork from my team, and my experience with frontend development, we decided to create a portal that allows developers to self-serve when it came to creating pipelines.
Build Portal
We created the Build Portal, an internal service where all of our developers can go as a reference point for anything related to our team and what we offer within Udemy.
The original purpose of the Build Portal was to allow anyone to create and manage pipelines. Through many iterations, it’s evolved into a lot more than that — we will explore how it not only enables developers at Udemy to manage pipelines, but also gives them a better UI around their builds, with the goal that they never have to look at Jenkins again.
Pipelines
In addition to allowing developers to create pipelines, we enhance their functionality through configurable options on top of them. We found the GitHub plugin for Jenkins to be very difficult to work with, especially with large repositories, so we removed it and now kick off builds ourselves, using GitHub webhooks instead.
Looking at the screenshot above, there are a few features we offer that enhance our Jenkins experience.
- We allow rules for kicking off builds if there are changes inside or outside certain directories in a repository (useful for mono-repo builds)
- We can control how long Jenkins keeps the logs for the builds
- We allow a different “deploy branch”, such as `release`, that will run the deploy stage of the generic pipeline
- Different jobs can run on different node labels, which map in our automated Jenkins configuration to a different type of machine (we have all of our infrastructure as code, and can destroy and recreate Jenkins as many times as we want)
- The messages for builds passing or failing can go into a Slack channel if desired
- Although not shown here, pipelines can define parallelisation for their tests
- We can also control which pipelines run at the same time. We can configure it so if two or more pipelines try to run at the same time, the build will automatically fail. This helps with our mono-repo — we don’t allow changes to multiple services at once, and each service has its own pipeline, therefore we fail the build if a pull request is out of scope and edits two services at once
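The path-based trigger rules and the “one service per pull request” check both reduce to inspecting the list of changed files from the webhook payload. A simplified sketch (the function names, rule shape and service layout are ours for illustration, not our actual code):

```typescript
interface PathRules {
  include?: string[]; // build only if a change touches one of these prefixes
  exclude?: string[]; // ignore changes under these prefixes
}

// Decide whether a pipeline should build, given the files changed in a push.
function shouldTriggerBuild(changedFiles: string[], rules: PathRules): boolean {
  const relevant = changedFiles.filter(
    (f) => !(rules.exclude ?? []).some((p) => f.startsWith(p))
  );
  if (!rules.include || rules.include.length === 0) return relevant.length > 0;
  return relevant.some((f) => rules.include!.some((p) => f.startsWith(p)));
}

// Mono-repo guard: list which services a pull request touches, so CI can
// fail the build when it edits more than one.
function touchedServices(changedFiles: string[], serviceDirs: string[]): string[] {
  const hit = new Set<string>();
  for (const f of changedFiles) {
    const dir = serviceDirs.find((d) => f.startsWith(d + "/"));
    if (dir) hit.add(dir);
  }
  return [...hit];
}
```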
Builds
When opening the Build Portal, we get straight to the point and show users their own builds. We include the status of each build in this view, so it’s easy for them to see at a glance how their runs are doing without having to dig down into them.
There’s also the option to see all the builds that are currently running, and a brief history of builds over the last few days. You can also refine by the specific pipeline you’re looking for. We archive builds after three days, so developers can still view historical builds when clicking on the details link in GitHub, but we don’t pollute the main view with the thousands of archived builds.
Improving developer user experience
We’re huge believers that developers shouldn’t have to browse through logs to figure out why a build has failed, so we are constantly creating deeper integrations to interpret results from the builds, to then be displayed in the Build Portal.
For example, when an integration test fails, instead of having to browse through the logs (and there are a lot of them), we include a test failures tab in the Build Portal.
We understand that tests can be flakey, and it’s annoying to have your build fail because of a flakey test related to code that you haven’t touched in your pull request. We offer the ability to only retry the failed steps of a pipeline, or retry all of it.
How does it all work?
Although we tried, we weren’t able to use the Jenkins API to provide the same level of data and real-time updates that we wanted to implement in the Build Portal. We created our own backend with a database that keeps track of builds and their statuses. From the Jenkins side, we interact with it only where needed, e.g. grabbing the logs to display to developers and to kick off builds.
When it comes to updating our own database to reflect the status of a build, we have configured Jenkins to push status updates into a RabbitMQ queue; a worker then processes them, saving the results to the database and publishing an event for any page that’s currently subscribed to the results. We chose this route to provide redundancy — if for any reason the Build Portal’s API went down, we wouldn’t lose status data and potentially present incomplete/incorrect build results to our developers. If the worker goes down, the messages are simply processed by its replacement.
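In outline, the worker's core is "persist first, then publish" — the interfaces and names below are illustrative, not our actual code:

```typescript
interface BuildStatusMessage {
  buildId: string;
  status: "running" | "passed" | "failed";
}

interface Database {
  saveBuildStatus(msg: BuildStatusMessage): Promise<void>;
}

interface Publisher {
  publish(channel: string, payload: unknown): Promise<void>;
}

// Persist the status, then notify subscribers. Writing to the database
// first keeps it the source of truth; if this worker dies mid-queue, a
// replacement picks up the unacknowledged messages, which is the
// redundancy described above.
async function handleStatusMessage(
  msg: BuildStatusMessage,
  db: Database,
  pub: Publisher
): Promise<void> {
  await db.saveBuildStatus(msg);
  await pub.publish(`build:${msg.buildId}`, msg);
}
```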
The frontend is built with React, and the backend API is a custom built framework on top of Koa, utilising GraphQL. We use Apollo’s server core to run the query (no need to reinvent every wheel) and then our framework takes it from there.
GraphQL subscriptions allow us to disconnect the worker processing messages from the server. Whenever an update to a build happens, the worker publishes the result into Redis, which notifies the server, passing on any relevant data to the browser through a WebSocket connection that has an active subscription. This means that the UI updates in real-time, removing any need to refresh the page to see the latest information.
GraphQL allows for faster frontend development, as the ability to query what you need may already exist within the schema and require no server changes. Therefore, client and server changes can be rolled out at different times. We also provide the ability for developers to create API access tokens, so they can query historical build data, or pull their build statuses into their editor, or terminal, etc.
The perk of it being an internal tool is being able to use more bleeding edge technology. We use the latest experimental builds of React and Relay, meaning we can test out React’s new concurrent mode, and provide feedback to our main frontend team about the direction React’s heading towards. There isn’t a single React class component — everything is a functional component using the React hooks API, all written in TypeScript.
We provide a Relay-compliant GraphQL API through a custom backend framework that takes our GraphQL schema (defined in TypeScript, utilising helper functions to auto-create the `Payload`, `Input` and `Node` generated types for the mutations and subscriptions) and maps the resolver functions to a dependency-injected, controller-based approach.
We also provide caching during queries and remove any inefficient queries that may have to grab either the same record or different records from a table multiple times, resulting instead in one query to the database to retrieve all the records needed.
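As an illustration (the field names here are assumptions, not our exact schema), a builds query of this shape drives the main page:

```graphql
# Illustrative sketch of the main-page builds query.
query BuildsPage {
  builds(first: 20) {
    edges {
      node {
        id
        status
        startedAt
        pipeline {
          displayName
        }
      }
    }
  }
}
```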
This query fetches the builds for the main page we saw above. However, for each build, we need to query the pipeline to get the nice display name to present to the user.
Notice how we’re selecting the `pipeline` for every item in the list. This could be extremely inefficient, as we’d be firing off a query to fetch the pipeline for each build. To counter this, we figure out exactly which pipelines are being requested, create an array of their IDs, remove any duplicates, and fire off only one query to the database for this connection — doing a `SELECT ... WHERE pipeline.id IN (1, 2)` where `1` and `2` are the IDs of the pipelines being requested. We also take advantage of GraphQL’s ability to select the fields you want in your query, and map that to our database queries — only selecting the necessary data and reducing query times massively.
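The deduplicate-and-batch step can be sketched like this (a simplified illustration; our framework's actual code differs):

```typescript
interface Build { id: number; pipelineId: number; }
interface Pipeline { id: number; displayName: string; }

// Given the builds on a page, fetch every referenced pipeline with a
// single `SELECT ... WHERE id IN (...)` instead of one query per build.
async function pipelinesForBuilds(
  builds: Build[],
  queryByIds: (ids: number[]) => Promise<Pipeline[]>
): Promise<Map<number, Pipeline>> {
  const ids = [...new Set(builds.map((b) => b.pipelineId))]; // dedupe
  const rows = ids.length > 0 ? await queryByIds(ids) : [];
  return new Map(rows.map((p) => [p.id, p]));
}
```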
We try to make the Build Portal as fast as we can. We wrote a custom router that prefetches the route chunk when a user hovers over a link (we code split every route out, so the initial page load is quick). We then utilise the preloaded queries that Relay experimental offers and preload the data for the route on `mousedown`. As a result, the data is loaded before the browser starts to navigate to the next page, and is almost always available at the first render of the route components.
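The prefetch idea boils down to a cached loader that each event handler can fire eagerly (a minimal sketch; our router is more involved than this):

```typescript
// Start loading a route's chunk/data ahead of navigation, caching the
// promise so each resource loads at most once no matter how many hover
// or mousedown events fire.
function createPrefetcher<T>(load: (route: string) => Promise<T>) {
  const cache = new Map<string, Promise<T>>();
  return (route: string): Promise<T> => {
    let p = cache.get(route);
    if (!p) {
      p = load(route);
      cache.set(route, p);
    }
    return p;
  };
}

// Usage sketch: onMouseOver={() => prefetchChunk(route)} starts the code
// chunk, onMouseDown={() => prefetchData(route)} starts the query, so both
// are usually resolved by the time the click's navigation happens.
```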
Oh, and it goes without saying, we offer both a dark and light mode, depending on system preference, but it can be overridden if you prefer to always use one.
Tracking build metrics
As I mentioned above, build times are extremely important to our team. We create custom APM traces for our builds, so we can monitor 95th-percentile build times in Datadog.
We also have an extensive dashboard so we can see at a glance how well our CI platform is doing, with some custom metrics, and pulling in metrics from our custom APM traces.
Slack integrations
We also provide automated rich Slack comments for users with their build results. From these messages, you’re also able to retry builds from within Slack. This way you don’t have to open up the Build Portal if you’re confident your build failed because of a flakey test.
`@here` — the enemy of most people using Slack, second only to `@channel`. We realise that, but we notify the relevant channel for the pipeline of any deployment branch failures. It’s important to keep master healthy, as other developers’ builds will fail because of it (due to the fast-forward merging).
Final notes
There are a lot more features to the Build Portal that I haven’t gone over in this article. It’s constantly evolving and is one of the most used internal tools at Udemy, and listing out all the features would make this article too long. By creating the Build Portal, we’ve improved productivity within our engineering teams, allowing them to debug builds faster, create pipelines themselves and improve their overall developer experience. We have a lot more planned for the Build Portal in the future and will share our journey along the way.
We constantly strive to make the developer’s lives at Udemy easier in any way that we can, and remove any dev-related pain points or blockers that they may have during their normal work day.
Author
Ryan Clark is a Senior Software Engineer on the Build team at Udemy. He works on a variety of different tools such as the Build Portal, CI, and development environments.