Growing Fast & Staying Reliable — Our Journey Towards CI
Nowadays, it seems like every software development company practices the Agile software development methodology, or at least a version of it. At this point, the ability to deliver reliable code flexibly and rapidly while quickly adapting and responding to changes has become an industry standard. For a company facing fast growth and the need for rapid response times, it becomes clear that a CI solution is a necessity to avoid compromising the quality of the product.
Designing and implementing a CI solution comes with its fair share of challenges. For us, like many other SaaS companies, those have included understanding what a developer’s task is, determining what our testing environment should look like and where the process should be triggered from, and deciding which technologies we would need to integrate the whole process seamlessly. This forced us to look inwards and consider our technology and day-to-day workflow in order to ultimately arrive at the ideal CI solution for us.
In this post, I’ll share how we overcame these challenges and offer an insider look at the technologies that helped us implement our own Yotpo-CI.
Dynamic Environments — CI Environment on Demand
One of our first challenges was understanding what we expected from our testing environment. Sure, it had to meet the regular demands one would have of any testing environment, but when you think about CI, you have to think scale. A single shared environment would have to be robust enough to handle lots of traffic, and it would also be constantly overwritten with the specific commits that make up whichever task is being tested. This is problematic when dozens of CI processes are running constantly and in parallel. This led us to the conclusion that we needed to provide environments dynamically and on demand for each task we want to test.
With HashiCorp tools deeply incorporated into our production infrastructure, we were able to achieve this by using Nomad as our scheduler and deployer for containerized applications. We use Consul for service discovery: each application deployed by Nomad is a client registered with our Consul service, which lets it easily find the services and resources it depends on through DNS or the HTTP API. This makes it fairly simple to define, configure, and raise a living and breathing environment.
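To make the lookup concrete, here is a minimal sketch of the two addresses a service could use to find a dependency through Consul. The `.service.consul` DNS suffix and the `/v1/health/service/<name>` endpoint are Consul defaults; the service name `reviews-api`, the idea of tagging instances per environment, and the agent address are hypothetical and only there for illustration.

```python
from typing import Optional
from urllib.parse import quote


def consul_dns_name(service: str, tag: Optional[str] = None) -> str:
    """Build the DNS name Consul answers for a registered service.

    Assumes Consul's default "consul" domain. Tagging instances per
    environment is an illustrative assumption, not our documented setup.
    """
    prefix = f"{tag}." if tag else ""
    return f"{prefix}{service}.service.consul"


def consul_health_url(agent: str, service: str) -> str:
    """Build the HTTP API URL that lists only healthy instances of a service."""
    return f"{agent}/v1/health/service/{quote(service)}?passing=true"


# Hypothetical service and environment names.
print(consul_dns_name("reviews-api", tag="proj-123"))
print(consul_health_url("http://localhost:8500", "reviews-api"))
```

Either address can be handed to an application as plain configuration, so the application itself does not need to know which environment it is running in.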
Another advantage is the fact that you can simply define a containerized Nomad job based on your application using templating. This allowed us not only to define a sound testing environment, but also to easily modify and expand upon it as our stack keeps growing and changing. Once all the templates are well-defined, the ease with which you can deploy them using Nomad allows us to instantiate whole customized environments on demand.
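As a rough illustration of the templating idea, the sketch below fills a simplified Nomad job definition for one service in one on-demand environment. It uses Python's `string.Template` purely for demonstration; the template layout follows Nomad's job, group, and task structure, but the service name, environment id, registry address, and datacenter are hypothetical, and our real templates carry far more configuration.

```python
from string import Template

# Simplified HCL job template. Only "$"-prefixed placeholders are substituted,
# so the HCL braces are left untouched. All values are illustrative.
JOB_TEMPLATE = Template('''\
job "${service}-${env_id}" {
  datacenters = ["${datacenter}"]

  group "${service}" {
    task "${service}" {
      driver = "docker"

      config {
        image = "${registry}/${service}:${docker_tag}"
      }
    }
  }
}
''')


def render_job(service, env_id, docker_tag, registry, datacenter="dc1"):
    """Fill the template for one service in one on-demand environment."""
    return JOB_TEMPLATE.substitute(
        service=service,
        env_id=env_id,
        docker_tag=docker_tag,
        registry=registry,
        datacenter=datacenter,
    )


# Hypothetical values: one job per service, suffixed with the environment id.
print(render_job("reviews-api", "proj-123", "a1b2c3d", "registry.example.com"))
```

Suffixing the job name with an environment id is one way to let many environments coexist on the same cluster without colliding.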
The Trigger: Jira
For unit testing, we run Travis CI whenever a commit is pushed to a branch, to make sure that the inner logic is sound.
Intuitively, one would think that the CI process should be triggered when a commit is pushed to the master branch or when a pull request is created. In some cases that’s correct, but a lot of tasks at Yotpo are made up of different pull requests from different repositories. These pull requests are often dependent on each other to create a whole. While each of these pull requests can be safely deployed to production on its own by using feature flags, we wanted to continuously run behavioural and regression tests on a whole task, so we came to the conclusion that the triggering should be done elsewhere.
The only element that encapsulates all of these pull requests is the corresponding JIRA story that defines the task as a whole. Through a GitHub hook that relies on a naming convention, the JIRA task links to all of its related GitHub commits. It also records the developer who is assigned to the task, the team board, and all sorts of other useful data.
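A naming convention like this usually means the issue key appears in the branch name or commit message. The sketch below shows one way such keys could be extracted; the regular expression matches the common `PROJECT-123` key shape, and the branch and commit strings are made up for the example rather than taken from our repositories.

```python
import re
from typing import List

# Issue keys of the form PROJECT-123: an uppercase project key, a hyphen, a number.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def issue_keys(text: str) -> List[str]:
    """Return the distinct issue keys mentioned in the text, in order of appearance."""
    seen: List[str] = []
    for key in ISSUE_KEY.findall(text):
        if key not in seen:
            seen.append(key)
    return seen


# Hypothetical branch name and commit message.
print(issue_keys("feature/PROJ-123-add-widget"))
print(issue_keys("PROJ-123 PROJ-7: fix rating widget, refs PROJ-123"))
```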
Since the JIRA board represents the task’s life cycle on its way to being deployed to production, developers are constantly moving their tasks on it as part of their daily routine. By triggering the process from JIRA when a task reaches the “CI” state, we managed to get all the necessary data our process needs while also making it seamless to the developer.
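The trigger condition itself can be sketched as a small check on the incoming webhook payload. The field names used here (`webhookEvent`, `changelog`, `items`, `toString`) are my understanding of the payload JIRA sends for issue updates and should be checked against your JIRA version; the sample payload and the issue key are invented.

```python
def entered_ci_state(payload: dict, ci_status: str = "CI") -> bool:
    """Return True when the webhook reports an issue moving into the CI status."""
    if payload.get("webhookEvent") != "jira:issue_updated":
        return False
    items = payload.get("changelog", {}).get("items", [])
    # A status transition shows up as a changelog item on the "status" field.
    return any(
        item.get("field") == "status" and item.get("toString") == ci_status
        for item in items
    )


# Invented sample payload, trimmed to the fields the check reads.
sample = {
    "webhookEvent": "jira:issue_updated",
    "issue": {"key": "PROJ-123"},
    "changelog": {
        "items": [{"field": "status", "fromString": "In Progress", "toString": "CI"}]
    },
}
print(entered_ci_state(sample))
```

Checking the changelog rather than the issue's current status avoids re-triggering the pipeline when an issue that is already in “CI” is edited for some other reason.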
The Flow — How It All Comes Together
Once a JIRA task reaches the “CI” state, a Jenkins pipeline is triggered via a webhook. As the first step, the pipeline extracts all the relevant GitHub data connected to the task and validates it. The validation process consists of:
- Verifying that the JIRA task has open pull requests connected to it
- Verifying that the Travis build of each of these pull requests has finished successfully
- Double-checking that Travis tagged and pushed the docker image representing the commit to our ECR
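The three checks above can be sketched as a pure function over data the pipeline has already collected. This is only an outline of the logic: the `PullRequest` fields are hypothetical stand-ins for whatever the GitHub, Travis, and ECR lookups return, and the actual pipeline runs in Jenkins rather than Python.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PullRequest:
    """Hypothetical summary of one pull request linked to the task."""
    repo: str
    number: int
    is_open: bool
    travis_passed: bool
    image_in_ecr: bool


def validate_task(pull_requests: List[PullRequest]) -> List[str]:
    """Return a list of validation errors; an empty list means the task is valid."""
    open_prs = [pr for pr in pull_requests if pr.is_open]
    if not open_prs:
        return ["task has no open pull requests"]

    errors = []
    for pr in open_prs:
        label = f"{pr.repo}#{pr.number}"
        if not pr.travis_passed:
            errors.append(f"{label}: Travis build has not finished successfully")
        if not pr.image_in_ecr:
            errors.append(f"{label}: docker image not found in ECR")
    return errors


# Invented example: one healthy pull request and one with a failed build.
prs = [
    PullRequest("reviews-api", 41, True, True, True),
    PullRequest("widgets", 7, True, False, False),
]
for error in validate_task(prs):
    print(error)
```

Collecting every error instead of stopping at the first one gives the developer the full picture in a single notification.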
Once the task is validated, the pipeline instantiates a custom environment for the task by deploying the required Nomad jobs with their corresponding docker tags that Travis pushed to ECR. This step also verifies that the environment is properly allocated in our Nomad server and performs the necessary applicative health checks to make sure that the environment is healthy and ready to go.
The next step is the actual testing. We use TestRail to manage our test cases. It allows us to easily create custom test plans based on the development teams, and it provides a solid reporting system. Once the test plan is created, the pipeline deploys a Nomad batch job that represents the test plan and corresponds to the custom environment. After the job is deployed, it runs, in parallel, the subset of tests assigned to that specific team.
Based on the result of the test plan, the pipeline handles the JIRA task accordingly by either advancing it on the board when the test run was successful or pushing it back if it failed. Additionally, it provides the link to the detailed TestRail report in the JIRA task’s comment section along with a notification on the relevant Slack channel.
Once the pipeline finishes or is aborted, the environment is destroyed. It’s important to note that the pipeline can fail at any step of the way. Whether it’s in the validation step, the environment creation, or the tests themselves, the pipeline captures the specific failure, aborts, and notifies accordingly. This provides developers with all the information they need as to why their CI process failed.
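The "fail at any step, always clean up" behaviour maps naturally onto a try/finally structure. The sketch below is a simplified stand-in for the Jenkins pipeline, not its actual code: the step names mirror the stages described above, while the callables, messages, and the simulated failure are hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_pipeline(
    steps: List[Step],
    destroy_env: Callable[[], None],
    notify: Callable[[str], None],
) -> bool:
    """Run steps in order, stop at the first failure, and always tear down."""
    try:
        for name, step in steps:
            try:
                step()
            except Exception as exc:
                # Capture which step failed and why, then abort the run.
                notify(f"CI failed during {name}: {exc}")
                return False
        notify("CI passed")
        return True
    finally:
        # Runs whether the pipeline finished, failed, or was interrupted.
        destroy_env()


def failing_health_check() -> None:
    raise RuntimeError("health check timed out")


# Invented run: validation passes, environment creation fails, tests never start.
log: List[str] = []
run_pipeline(
    [
        ("validation", lambda: log.append("validated")),
        ("environment creation", failing_health_check),
        ("tests", lambda: log.append("tested")),
    ],
    destroy_env=lambda: log.append("environment destroyed"),
    notify=log.append,
)
print(log)
```

Because the teardown sits in `finally`, a failed or aborted run cannot leave an orphaned environment behind.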
Conclusion — Implementing CI The “Right” Way
When we first started researching the “ideal” CI solution we found out that there are many different implementations for it. Every company has its own best practices because every solution has to be coupled with the company’s architecture, infrastructure, methodologies, culture, and most importantly business. While we were able to draw many conclusions from observing other implementations, we learned it’s best to look closely at your own stack and workflow to implement the solution that best aligns with your company’s needs.
We’re still working on our CI/CD implementation, and there’s a long way to go before it’s “perfect.” Nevertheless, we wanted to share with you some insights from the milestones along the way, the most valuable being the list below. It’s a must-ask list that anyone tackling the challenge of a new CI solution can use to make sure they’re on track:
- What defines a testable task? Is it just a commit on a single repository?
- Whether you’re using a single environment with a queuing mechanism, or environments on demand, how can you make your testing environments robust enough to handle multiple CI processes?
- Where should you trigger it from?
- Who should orchestrate it?
- How can it seamlessly be integrated into the developer’s workflow?
- How can it provide developers with the feedback they need?
Once you have the answers, you’ll have a clear path forward for designing and implementing a CI solution that best integrates with your infrastructure, and suits your unique needs.