How Udemy does Build Engineering
An overview of Build Engineering at Udemy and the way we provide continuous integration at scale.
At Udemy our engineering team is growing at a steady rate, with hundreds of developers that are spread across different teams and many time zones. This diversity has resulted in varying technology stacks depending on the team’s purpose, with some teams possibly having no overlap in technologies. There’s one area that they all have in common, however — the need to test and prove our code. We use Jenkins for all of our CI, but we built a custom frontend and backend for developers to interact with in order to remove all the pain points we all have experienced with Jenkins. Let’s dive into how, and why.
Build Engineering
I work as part of the “Build [Engineering] team” at Udemy — a team with the core mission to empower engineers with the right tools and automation in place to ensure the best developer user experience possible. Our team is small, but we all have different career backgrounds that enhance our empathy towards frustrations developers may experience. With backgrounds from SRE at Microsoft to a GNU contributor, we cover many bases. I come from a feature development, frontend-heavy background, but I’ve always liked to dive into tooling, infrastructure and operational work as much as I can. I’m also a huge open source fan and contribute to webpack and their webpack-contrib projects where possible.
Our primary focus is the management of some development environments that deal with the core Udemy application (there are a few teams that manage their own) and to provide the infrastructure for all teams to run continuous integration no matter their needs.
When it comes to continuous integration, it goes without saying that each team, each project, and each service requires a different scale of complexity when testing their code. For example, our main application has 20,000+ tests running across varying test runners (E2E, frontend, unit, integration, etc.). At the time of writing, the average duration of a CI run for our main application is around 21 minutes — a number we’re always looking to beat. Improvements are made in places, but then more features and tests are added.
This has caveats — through no fault of their own, developers rely heavily on CI to know whether their code has introduced regressions, as it’s not viable to run the test suites locally. Developer time is a precious commodity that shouldn’t be wasted — not only should we support every team being able to test their code, we should also enable them to do it as quickly as possible. Having to wait 45–60 minutes for CI to run on every commit is frustrating, and every minute we can shave off that experience stacks up, increases productivity and, more importantly, saves developers’ time.
Jenkins
Ah, Jenkins. We all love to hate it, but it’s without a doubt the most established open-source CI platform around. We rolled out our own Jenkins cluster running in AWS, as a third-party CI provider would be too expensive for our use case. For example, to be able to run CI for our main application in a reasonable amount of time, we use a machine with 72 cores and 144GiB of RAM. This allows us enough parallelisation and compute power for each stage. We run around 5,000 of these per month (our developers deploy changes to Udemy.com around 30–40 times per day).
The other pipelines we run account for another 8,000 builds a month, meaning we run around 13,000 builds per month, or nearly 700 per day, excluding weekends.
We don’t require branches to be up to date with master, as this would be extremely annoying when queuing for a release. Instead, we fast-forward merge any build into master, to make sure it’s working with the latest code. As the majority of developers work on widely different aspects of the codebase at any given time, conflicts and regressions not caught by CI are very rare. To reduce the risk, we expire any build that is older than 2 days (excluding weekends, so you don’t have to re-run CI on a Monday morning, eurgh). This means that a pull request is guaranteed to be working against a master that is at most 2 days old. Any large, important changes to the codebase involve broadcast communication to developers that they will need to rebase onto master, and we have the ability to mark all builds as failed, prompting developers to push a new commit (therefore fast-forward merging it into master on CI) or to rebase onto master for the build to pass.
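A minimal sketch of that weekday-aware expiry rule (illustrative, not our production code) — a build is considered stale once it is more than two weekdays old, so Friday's builds are still valid on Monday morning:

```typescript
// Count the weekdays (Mon–Fri) strictly between two dates.
function weekdaysBetween(from: Date, to: Date): number {
  let count = 0;
  const cursor = new Date(from);
  cursor.setHours(0, 0, 0, 0);
  const end = new Date(to);
  end.setHours(0, 0, 0, 0);
  while (cursor < end) {
    cursor.setDate(cursor.getDate() + 1);
    const day = cursor.getDay();
    if (day !== 0 && day !== 6) count++; // skip Sunday (0) and Saturday (6)
  }
  return count;
}

// A build expires when it is more than 2 weekdays old.
function isBuildExpired(builtAt: Date, now: Date): boolean {
  return weekdaysBetween(builtAt, now) > 2;
}
```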
In our experience, there is one major limitation to Jenkins — the UI. As a feature developer at previous companies, the experience in debugging my builds was frustrating and confusing. As an engineer on the Build team at Udemy, the experience in configuring pipelines didn’t differ — it too was frustrating and confusing. At the time of writing this post, we have 63 pipelines defined in Jenkins and managing them all through the Jenkins UI wouldn’t be a pleasant experience.
Note — outside of our main application’s build configuration (which is massively complex, with parallel steps, steps that rely on other parallel steps to finish, etc.) we offer a feature we call “generic pipelines” to any development team.
We try to make as few assumptions about what these pipelines do as possible, and only require there to be a Makefile with `build`, `test` and `deploy` targets (the `deploy` target is only run on merges into the default branch, such as master). Our agents are very barebones, so we encourage the usage of Docker to provide repeatable environments no matter where they’re run. We scale these agents automatically depending on demand and usage, so they’re not idle for too long before being destroyed. This also enables us to ensure that the host machines aren’t affecting builds (nothing is perfect, and artifacts or processes that we can’t account for may have been left behind by previous builds, possibly affecting future runs) — all machines are terminated and replaced within 24 hours.
We didn’t always have 63 pipelines. When I joined Udemy in January 2019, we had 18. These were defined in a Job DSL file, with a pipeline specific array that we’d have to update every time a team wanted a new pipeline created. It looked something like this:
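Roughly (this is an illustrative reconstruction — the real file's names and fields differed):

```groovy
// Illustrative Job DSL sketch: a hand-maintained array of pipelines,
// edited every time a team wanted a new one.
def pipelines = [
    [name: 'payments-service', repo: 'udemy/payments-service'],
    [name: 'search-indexer',   repo: 'udemy/search-indexer'],
    // ...one entry per team
]

pipelines.each { p ->
    multibranchPipelineJob(p.name) {
        branchSources {
            git {
                remote("git@github.com:${p.repo}.git")
            }
        }
    }
}
```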
This wasn’t scalable, both in terms of adding new configurable properties of a pipeline, and adding new pipelines. We ended up blocking teams that had to wait for us to create their pipeline and run the Job DSL for them. We hate blocking teams — we consider our work best done when they don’t need to think about us at all.
Like most engineers, we love to automate anything that’s repetitive. Building on the groundwork from my team, and my experience with frontend development, we decided to create a portal that allows developers to self-serve when it came to creating pipelines.
Build Portal
We created the Build Portal, an internal service where all of our developers can go as a reference point for anything related to our team and what we offer within Udemy.
The original purpose of the Build Portal was to allow anyone to create and manage pipelines. Through many iterations, it’s evolved into a lot more than that — we will explore how it not only enables developers at Udemy to manage pipelines, but also gives them a better UI around their builds, with the goal that they never have to look at Jenkins again.
Pipelines
In addition to allowing developers to create pipelines, we enhance their functionality through configurable options on top of them. We found the GitHub plugin for Jenkins to be very difficult to work with, especially with large repositories, so we removed it and now kick off builds ourselves, using GitHub webhooks instead.
Looking at the screenshot above, there are a few features we offer that enhance our Jenkins experience.
- We allow rules for kicking off builds if there are changes inside or outside certain directories in a repository (useful for mono-repo builds)
- We can control how long Jenkins keeps the logs for the builds
- We allow a different “deploy branch”, such as `release`, that will run the deploy stage of the generic pipeline
- Different jobs can run on different node labels, which map in our automated Jenkins configuration to a different type of machine (we have all of our infrastructure as code, and can destroy and recreate Jenkins as many times as we want)
- The messages for builds passing or failing can go into a Slack channel if desired
- Although not shown here, pipelines can define parallelisation for their tests
- We can also control which pipelines run at the same time. We can configure it so if two or more pipelines try to run at the same time, the build will automatically fail. This helps with our mono-repo — we don’t allow changes to multiple services at once, and each service has its own pipeline, therefore we fail the build if a pull request is out of scope and edits two services at once
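The path-based trigger rules and the “one service per pull request” check both reduce to inspecting the list of changed files from the webhook payload. A simplified sketch (the function names, rule shape and service layout are ours for illustration, not our actual code):

```typescript
interface PathRules {
  include?: string[]; // build only if a change touches one of these prefixes
  exclude?: string[]; // ignore changes under these prefixes
}

// Decide whether a pipeline should build, given the files changed in a push.
function shouldTriggerBuild(changedFiles: string[], rules: PathRules): boolean {
  const relevant = changedFiles.filter(
    (f) => !(rules.exclude ?? []).some((p) => f.startsWith(p))
  );
  if (!rules.include || rules.include.length === 0) return relevant.length > 0;
  return relevant.some((f) => rules.include!.some((p) => f.startsWith(p)));
}

// Mono-repo guard: list which services a pull request touches, so CI can
// fail the build when it edits more than one.
function touchedServices(changedFiles: string[], serviceDirs: string[]): string[] {
  const hit = new Set<string>();
  for (const f of changedFiles) {
    const dir = serviceDirs.find((d) => f.startsWith(d + "/"));
    if (dir) hit.add(dir);
  }
  return [...hit];
}
```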
Builds
When opening the Build Portal, we get straight to the point and show users their own builds. We include the status of each build in this view, so it’s easy for them to see at a glance how their runs are doing without having to dig down into them.
There’s also the option to see all the builds that are currently running, and a brief history of builds over the last few days. You can also refine by the specific pipeline you’re looking for. We archive builds after three days, so developers can still view historical builds when clicking on the details link in GitHub, but we don’t pollute the main view with the thousands of archived builds.
Improving developer user experience
We’re huge believers that developers shouldn’t have to browse through logs to figure out why a build has failed, so we are constantly creating deeper integrations to interpret results from the builds, to then be displayed in the Build Portal.
For example, when an integration test fails, instead of having to browse through the logs (and there are a lot of them), we include a test failures tab in the Build Portal.
We understand that tests can be flakey, and it’s annoying to have your build fail because of a flakey test related to code that you haven’t touched in your pull request. We offer the ability to only retry the failed steps of a pipeline, or retry all of it.
How does it all work?
Although we tried, we weren’t able to use the Jenkins API to provide the same level of data and real-time updates that we wanted to implement in the Build Portal. We created our own backend with a database that keeps track of builds and their statuses. From the Jenkins side, we interact with it only where needed, e.g. grabbing the logs to display to developers and to kick off builds.
When it comes to updating our own database to reflect the status of a build, we have configured Jenkins to push status updates into a RabbitMQ queue; a worker then processes them, saving the results to the database and publishing an event for any page that’s currently subscribed to the results. We chose this route to provide redundancy — if for any reason the Build Portal’s API went down, we wouldn’t lose status data and potentially present incomplete/incorrect build results to our developers. If the worker goes down, the messages are simply processed by its replacement.
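In outline, the worker's core is "persist first, then publish" — the interfaces and names below are illustrative, not our actual code:

```typescript
interface BuildStatusMessage {
  buildId: string;
  status: "running" | "passed" | "failed";
}

interface Database {
  saveBuildStatus(msg: BuildStatusMessage): Promise<void>;
}

interface Publisher {
  publish(channel: string, payload: unknown): Promise<void>;
}

// Persist the status, then notify subscribers. Writing to the database
// first keeps it the source of truth; if this worker dies mid-queue, a
// replacement picks up the unacknowledged messages, which is the
// redundancy described above.
async function handleStatusMessage(
  msg: BuildStatusMessage,
  db: Database,
  pub: Publisher
): Promise<void> {
  await db.saveBuildStatus(msg);
  await pub.publish(`build:${msg.buildId}`, msg);
}
```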
The frontend is built with React, and the backend API is a custom built framework on top of Koa, utilising GraphQL. We use Apollo’s server core to run the query (no need to reinvent every wheel) and then our framework takes it from there.
GraphQL subscriptions allow us to disconnect the worker processing messages from the server. Whenever an update to a build happens, the worker publishes the result into Redis, which notifies the server, passing on any relevant data to the browser through a WebSocket connection that has an active subscription. This means that the UI updates in real-time, removing any need to refresh the page to see the latest information.
GraphQL allows for faster frontend development, as the ability to query what you need may already exist within the schema and require no server changes. Therefore, client and server changes can be rolled out at different times. We also provide the ability for developers to create API access tokens, so they can query historical build data, or pull their build statuses into their editor, or terminal, etc.
The perk of it being an internal tool is being able to use more bleeding edge technology. We use the latest experimental builds of React and Relay, meaning we can test out React’s new concurrent mode, and provide feedback to our main frontend team about the direction React’s heading towards. There isn’t a single React class component — everything is a functional component using the React hooks API, all written in TypeScript.
We provide a Relay-compliant GraphQL API through a custom backend framework that takes our GraphQL schema (defined in TypeScript, utilising helper functions to auto-create the `Payload`, `Input` and `Node` generated types for the mutations and subscriptions) and maps the resolver functions to a dependency-injected, controller-based approach.
We also provide caching during queries and remove any inefficient queries that may have to grab either the same record or different records from a table multiple times, resulting instead in one query to the database to retrieve all the records needed.
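As an illustration (the field names here are assumptions, not our exact schema), a builds query of this shape drives the main page:

```graphql
# Illustrative sketch of the main-page builds query.
query BuildsPage {
  builds(first: 20) {
    edges {
      node {
        id
        status
        startedAt
        pipeline {
          displayName
        }
      }
    }
  }
}
```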
This query fetches the builds for the main page we saw above. However, for each build, we need to query the pipeline to get the nice display name to present to the user.
Notice how we’re selecting the `pipeline` for every item in the list. This could be extremely inefficient, as we’d be firing off a query to fetch the pipeline for each build. To counter this, we figure out exactly which pipelines are being requested, create an array of their IDs, remove any duplicates, and fire off only one query to the database for this connection — doing a `SELECT ... WHERE pipeline.id IN (1, 2)` where `1` and `2` are the IDs of the pipelines being requested. We also take advantage of GraphQL’s ability to select the fields you want in your query, and map that to our database queries — only selecting the necessary data and reducing query times massively.
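The deduplicate-and-batch step can be sketched like this (a simplified illustration; our framework's actual code differs):

```typescript
interface Build { id: number; pipelineId: number; }
interface Pipeline { id: number; displayName: string; }

// Given the builds on a page, fetch every referenced pipeline with a
// single `SELECT ... WHERE id IN (...)` instead of one query per build.
async function pipelinesForBuilds(
  builds: Build[],
  queryByIds: (ids: number[]) => Promise<Pipeline[]>
): Promise<Map<number, Pipeline>> {
  const ids = [...new Set(builds.map((b) => b.pipelineId))]; // dedupe
  const rows = ids.length > 0 ? await queryByIds(ids) : [];
  return new Map(rows.map((p) => [p.id, p]));
}
```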
We try to make the Build Portal as fast as we can. We wrote a custom router that prefetches the route chunk when a user hovers over a link (we code split every route out, so the initial page load is quick). We then utilise the preloaded queries that Relay experimental offers and preload the data for the route on `mousedown`. As a result, the data is loaded before the browser starts to navigate to the next page, and is almost always available at the first render of the route components.
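The prefetch idea boils down to a cached loader that each event handler can fire eagerly (a minimal sketch; our router is more involved than this):

```typescript
// Start loading a route's chunk/data ahead of navigation, caching the
// promise so each resource loads at most once no matter how many hover
// or mousedown events fire.
function createPrefetcher<T>(load: (route: string) => Promise<T>) {
  const cache = new Map<string, Promise<T>>();
  return (route: string): Promise<T> => {
    let p = cache.get(route);
    if (!p) {
      p = load(route);
      cache.set(route, p);
    }
    return p;
  };
}

// Usage sketch: onMouseOver={() => prefetchChunk(route)} starts the code
// chunk, onMouseDown={() => prefetchData(route)} starts the query, so both
// are usually resolved by the time the click's navigation happens.
```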
Oh, and it goes without saying, we offer both a dark and light mode, depending on system preference, but it can be overridden if you prefer to always use one.
Tracking build metrics
As I mentioned above, build times are extremely important to our team. We create custom APM traces for our builds, so we can monitor 95th-percentile build times in Datadog.
We also have an extensive dashboard so we can see at a glance how well our CI platform is doing, with some custom metrics, and pulling in metrics from our custom APM traces.
Slack integrations
We also provide automated rich Slack comments for users with their build results. From these messages, you’re also able to retry builds from within Slack. This way you don’t have to open up the Build Portal if you’re confident your build failed because of a flakey test.
`@here` — the enemy of most people using Slack, second only to `@channel`. We realise that, but we notify the relevant channel for the pipeline of any deployment branch failures. It’s important to keep master healthy, as other developers’ builds will fail because of it (due to the fast-forward merging).
Final notes
There are a lot more features to the Build Portal that I haven’t gone over in this article. It’s constantly evolving and is one of the most used internal tools at Udemy, and listing out all the features would make this article too long. By creating the Build Portal, we’ve improved productivity within our engineering teams, allowing them to debug builds faster, create pipelines themselves and improve their overall developer experience. We have a lot more planned for the Build Portal in the future and will share our journey along the way.
We constantly strive to make the developer’s lives at Udemy easier in any way that we can, and remove any dev-related pain points or blockers that they may have during their normal work day.
Author
Ryan Clark is a Senior Software Engineer on the Build team at Udemy. He works on a variety of different tools such as the Build Portal, CI, and development environments.