After years of heavy Jenkins use, we started a gradual and coordinated effort to move our CI/CD tasks to GitHub Actions. In this article, we present the rationale behind this decision and what we faced while moving some of our complex pipelines.
We needed a more distributed, modern and scalable platform to level up our CI/CD game. GitHub Actions is not perfect, but it enables us to work as truly independent and agile teams, while sharing common workflow pieces and coordinating communication between repositories. We know the DevOps culture movement goes way deeper than the tool we use to orchestrate pipelines. Still, we see GitHub Actions as a platform that, from the point of view of infrastructure, gives us much more than we had with Jenkins.
As people's and businesses' needs for digital services increase, organizations need shorter and shorter workflows to deliver more frequently, or they risk losing competitive advantage. One way to release new features in a cyclical process of added value is the Continuous Integration/Continuous Delivery (CI/CD) approach: the work is divided into small deliverable batches, and most of the manual work, such as approvals, tests and deploys, is replaced by automated tasks [1,2]. In this context, DevOps practices emerge as important enablers of integration between teams, goal-oriented work, fewer failures during delivery and fast verification of the value of new features. The workflow is now composed of several cycles of continuous improvement, reinforcing a culture of collaborative incident investigation, knowledge sharing and the search for better solutions. In a broader view, those concepts align with Continuous Software Engineering, which involves short production cycles and data-driven prioritization and validation of new features.
Here at Passei Direto we have dozens of automated pipelines, supporting several teams and products. We are still transitioning from a traditional unified engineering team to a more distributed, goal-oriented squad approach, where small multidisciplinary teams work on specific features or products.
Our Jenkins is the biggest symbol of the old team organization. Before this migration started, we had hundreds of pipelines, with more than 55 hours of execution and 1800 triggers daily. There are all kinds of pipelines, from simple PR checkers to complex end-to-end tests, Machine Learning jobs and deploys to a myriad of cloud infrastructure services, such as ECS, AWS Lambda and CloudFront.
What we miss with Jenkins
We are aware of the many challenges in nourishing a long-lasting, general DevOps culture. But from the point of view of infrastructure, we saw many issues that Jenkins could not help us solve:
- Platform Management Effort
- Single Point of Failure (SPOF)
- Pipeline Step Sharing
- Non-ephemeral containers
The first issue is that our new work model does not have a clear owner for Jenkins maintenance. Sure, we have an awesome SRE team, but they have other important work to do regarding environment isolation, observability and our deploy workflow. It means that the effort to maintain our Jenkins is scattered, and it is not a priority on any team's roadmap. A vital tool for our daily workflow does not get the attention it deserves. This is the main point here: an expert Jenkins administrator could think of solutions for the other three issues, but we are too short-handed to implement them. Besides that, we don’t have hands-on experience with Java.
Another critical point is that our Jenkins is the Single Point of Failure of our operation. Whenever Jenkins stops (when an upgrade goes bad, the disk fills up or the network goes down), all our pipelines stop too, and no team can check PRs, run tests or deploy. This is unacceptable when we have several teams working totally independently that still end up not delivering because of the same problem.
Scaling Jenkins in and out is hard, and requires a lot of team-specific work. We end up with several (and not small) EC2 instances provisioned 24/7. They cannot be shut down on quiet days, such as weekends and holidays, nor can they handle several teams needing to build projects at the same time. The result is a queue that grows to a few dozen waiting jobs.
We have several steps that are very common in different pipelines: ECR build/tag/push, ECS task definition update, SonarQube analysis and much more. In Jenkins, the most common way of sharing steps is using plugins or shared Groovy libraries. Many Jenkins plugins are outdated, and we had several cases of upgrades breaking pipelines. We also don’t have the Java (or Groovy, for that matter) expertise to write and maintain our own plugins.
The last issue we can point out with Jenkins is that a job can affect a later or concurrent job. Jobs race for the same computational resources, but not only that: configuration files can be disputed, poorly performed cleanups can affect the next job, and hard crashes (due to network or disk issues, for instance) can leave the whole agent inoperative. We want ephemeral runs and deterministic pipeline results, so that when something breaks we can be sure why.
As an attentive reader may notice, our issues do not lie with Jenkins alone: it is in our distributed, agile context that they become deal breakers. We know there are a lot of solutions for what we face, especially if you are a Jenkins expert. We just prefer to move towards a more compatible ecosystem, hoping for a better platform, tools and community to bring our CI/CD game to the next level.
About GitHub Actions (GHA)
GitHub Actions (or GHA) is a native platform within the GitHub ecosystem. Launched in 2018, it offers a set of task automation features, such as tests, deploys and checks. These tasks are described in a declarative configuration file (.yaml) that is versioned within the repository. The file also specifies which events trigger these routines. A set of linked tasks is called a pipeline or workflow. Within a task, we refer to each minimum unit as a step.
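As a minimal sketch of that structure, the workflow below shows a trigger, one task (a "job" in GHA terms) and its steps. The file name, the branch name and the `make test` target are assumptions made for illustration; they are not taken from our repositories.

```yaml
# .github/workflows/pr-check.yaml (hypothetical file name)
name: PR check

# Trigger: the event that starts this workflow
on:
  pull_request:
    branches: [main]   # assumption: the default branch is called "main"

jobs:
  # A job is one of the "tasks" that make up the workflow
  test:
    runs-on: ubuntu-latest
    steps:
      # Each entry below is a step, the minimum unit of a job
      - name: Check out the repository
        uses: actions/checkout@v4
      - name: Run the test suite
        run: make test   # assumption: the project exposes a `make test` target
```

Because the file lives in the repository, a change to the pipeline goes through the same review as a change to the code.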
Actions was a response from GitHub Inc. to GitLab CI, its main competitor. GitLab follows the strategy of providing complete solutions that involve project management, knowledge base, automation and monitoring using the same platform.
Composing different actions is simple, since they are self-contained and do not influence the functioning of others. This does not happen in Jenkins, where plugins are installed globally and often conflict. In case of a conflict, or of a solution that does not perfectly fit our use case, it is possible to fork the action or contribute directly to its repository, using the language we know best and in a much simpler process than in Jenkins. The versioning of GitHub’s executors is much more solid than that of Jenkins, since it is not common to find conflicts between new versions and old actions.
The computational resources for running pipelines in GHA can be divided into two categories. GitHub Runners are instances managed by GitHub and made available on demand when a pipeline starts. The user does not have to worry about this infrastructure, and the cost is calculated by execution time (in minutes). GH “Team” users have 3000 minutes of Linux instances included per month. Once this quota is used up, additional minutes are purchased separately.
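To illustrate how independent actions compose, here is a sketch of the ECR build/tag/push routine mentioned earlier, assembled from publicly available actions pinned to a major version. The secret names, the region and the `example-app` repository name are placeholders, and the action versions are simply recent ones, not necessarily the ones we run.

```yaml
name: Build and push image

on:
  push:
    branches: [main]   # assumption: the default branch is called "main"

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Each action is pinned by version and runs only inside this job
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}          # placeholder secret name
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}  # placeholder secret name
          aws-region: us-east-1                                        # placeholder region

      - name: Log in to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build, tag and push the image
        env:
          REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          REPOSITORY: example-app   # placeholder ECR repository
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t "$REGISTRY/$REPOSITORY:$IMAGE_TAG" .
          docker push "$REGISTRY/$REPOSITORY:$IMAGE_TAG"
```

Upgrading one action means changing one `uses:` line in one workflow, so a bad upgrade cannot break pipelines that did not ask for it.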
Another way to use GHA is through Self Hosted Runners. In this case, it is necessary to manage the computational resources (for example, AWS EC2 instances) and register them as executors able to receive workloads, either for the entire organization or for a specific repository.
Pipelines define what type of runner they demand, and are queued according to availability. With the GitHub executors there are virtually no limits for parallel pipelines, although there is a limit for simultaneous tasks within the same pipeline. For self-hosted runners, each registered instance can perform one task at a time.
The approaches are complementary, and serve as the basis for the different projects, technologies and use cases that our teams need to deal with. The GitHub executors are much more practical, especially for fast pipelines. However, they can be slower in tasks that need major customization of the environment, such as pulling images or downloading libraries. Accessing internal network resources is also more complex through the GitHub executors. Tasks that access multiple resources can be more efficient on self-hosted runners.
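Both kinds of runner can be mixed in a single workflow through the `runs-on` key. The sketch below assumes a self-hosted runner registered with the default `self-hosted`, `linux` and `x64` labels; the `make` targets and the timeout value are hypothetical.

```yaml
name: ETL pipeline

on:
  push:
    branches: [main]   # assumption: the default branch is called "main"

jobs:
  lint:
    # Fast check on a GitHub-hosted runner
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint   # hypothetical target

  long-running-etl:
    needs: lint
    # Queued until a registered runner carrying all of these labels is free
    runs-on: [self-hosted, linux, x64]
    timeout-minutes: 240   # hypothetical limit for a job that can take hours
    steps:
      - uses: actions/checkout@v4
      - run: make etl    # hypothetical target
```

The quick job gets an on-demand GitHub instance, while the heavy job waits for one of our own machines and so does not spend the included minutes.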
How we dealt with Self Hosted Runners
Since we have several complex pipelines that can take a few hours, we had to come up with a self-hosted stack that satisfies our main requirements: really low maintenance, ephemeral and scalable. We ended up with an ECS cluster in which a containerized runner is started for each pipeline, using EC2 instances. More details about our approach will be explained in an upcoming article that we are writing with the AWS crew that helped us come up with this solution. Hang tight!
Migration so far
We decided on a gradual migration, since there are some really complex jobs that are working well with Jenkins, so this will be a long process. We started with a four-week proof of concept with the following steps:
- Discussions about GHA
- Self Hosted Stack
- Selected pipeline migration
Now we are finishing the last step, documenting our stack and repositories. We moved out 4 complex pipelines related to ETL tasks, encompassing deploys, ECR containers, ECS task definitions, cross-repository events and manual triggers. We are now monitoring those freshly created workflows, and some conclusions are starting to emerge:
- Our self-hosted solution is really low maintenance and scalable, since several times a week no instances are running at all, and all the management is handled by ECS. We will have some discussions about instance size and bootstrapping optimization, as well as monitoring approaches, but this is nothing compared to the effort it takes to keep Jenkins running.
- GitHub Actions is really new, and lots of features are still being worked out. So there are still sharp edges and somewhat verbose solutions that could get better. This resulted in longer pipeline definitions than we had on Jenkins, although we no longer have any hidden “agent preparation” code, such as installing dependencies or credentials. It is all explicit in the pipeline, which is another reason we ended up with more code than before.
- Reserving extra resources for a given pipeline, or even a given pipeline run, is a great feature we achieved with our new self-hosted approach. CPU-intensive pipelines can be given more vCPUs, and the same goes for memory. The trade-offs are the cost of the instances and the provisioning time, but it is very useful for certain long-running jobs.
- Some features that are mature on Jenkins are still evolving on GHA. For instance, triggering a build in another repository is trivial on Jenkins, but we had no easy way of doing that with GHA. The good news is that pipelines are really extensible, so we’ve built our own action to do that. Surely more will come!
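For readers facing the same cross-repository need, one possible approach is the `repository_dispatch` event, sketched below. This is not the action we built. The repository `example-org/etl-jobs`, the event type `etl-deploy-requested` and the `DISPATCH_TOKEN` secret (a token with access to the target repository) are all hypothetical. The receiving workflow also accepts `workflow_dispatch`, which allows manual triggers.

```yaml
# Sender: a workflow in the source repository (hypothetical)
name: Request ETL deploy
on:
  push:
    branches: [main]
jobs:
  notify:
    runs-on: ubuntu-latest
    steps:
      - name: Ask the other repository to run its workflow
        env:
          DISPATCH_TOKEN: ${{ secrets.DISPATCH_TOKEN }}   # hypothetical secret
        run: |
          curl --fail --silent --show-error --request POST \
            --header "Authorization: Bearer ${DISPATCH_TOKEN}" \
            --header "Accept: application/vnd.github+json" \
            --data "{\"event_type\":\"etl-deploy-requested\",\"client_payload\":{\"source_repo\":\"${GITHUB_REPOSITORY}\"}}" \
            "https://api.github.com/repos/example-org/etl-jobs/dispatches"
---
# Receiver: a workflow on the default branch of example-org/etl-jobs (hypothetical)
name: ETL deploy
on:
  repository_dispatch:
    types: [etl-deploy-requested]
  workflow_dispatch:   # also allows a manual run from the GitHub UI
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Show who asked for this run
        env:
          # Empty when the run was started manually
          SOURCE_REPO: ${{ github.event.client_payload.source_repo }}
        run: echo "Triggered by ${SOURCE_REPO:-a manual run}"
```

The two YAML documents would live in separate files in separate repositories. Passing the payload through `env` instead of interpolating it into the script keeps untrusted values out of the shell command line.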
We felt that Jenkins was not the ideal platform for a distributed, squad-oriented, agile workflow, nor an enabling ecosystem for a truly growing DevOps culture. After some small experiments with GitHub Actions, we decided to start a proof of concept, leading to a gradual, non-disruptive migration of our Jenkins pipelines to GHA. We came up with a handy solution to scale and manage self-hosted GitHub Runners, ideal for dealing with long-running complex pipelines. As we keep moving pipelines out, we will surely face different challenges, which will be addressed in due time. We expect to publish another blog post in a few months sharing the next steps and what we accomplished, in the hope of helping others walk the same path.
[1] FORSGREN, Nicole; HUMBLE, Jez; KIM, Gene. Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. IT Revolution, 2018.
[2] FORSGREN, Nicole et al. 2019 Accelerate State of DevOps Report. 2019. Available at https://services.google.com/fh/files/misc/state-of-devops-2019.pdf
[3] KIM, Gene et al. The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations. IT Revolution, 2016.
[4] BOSCH, Jan. Continuous software engineering: An introduction. In: Continuous software engineering. Springer, Cham, 2014. p. 3–13.