CI evolution for Android at hh.ru

Pavel Strelchenko
hh.ru
20 min read · Jun 7, 2022

‘What the hell is going on here?!’ That’s how our infrastructure scripts could have been described until recently. Something had to change, and we changed it.

My name’s Pavel Strelchenko, and I’m an Android developer at HeadHunter. I’m going to tell you how our CI has evolved over three years: what problems we faced, how we analyzed and tried to fix them, what we did, and what results we got.

This article is based on our video blog, so if you prefer watching to reading, welcome to our YouTube channel. We’ve added a lot of useful extra links here, so check them out too.

An extra important disclaimer

After watching the video or reading the article, you might get the impression that setting up CI for Android app builds is a tremendously complicated task that can be made easier with the fastlane utility.

That impression is wrong; shake it off.

First, setting up CI is easy in 80–90% of cases, because in 2022 we already have plenty of tools that simplify the process (take a look at Jenkins, CircleCI, GitHub Actions, etc.).

Second, fastlane is not a silver bullet. The hh team is not ready to recommend it unreservedly yet. Remember that before adopting any tool, you need to study it to understand whether it suits your particular case, assess all the risks, and estimate its adoption cost.

Don’t be a trusting soul, double-check the information yourself.

May you be blessed with a stable CI!

What was it like living without CI?

Let’s start with a story that happened three years ago. My colleague, Alexander Blinov, described that period very well in his video on the history of refactoring. Back then, the code base and the infrastructure were changing simultaneously.

At that moment we didn’t have any infrastructure scripts at all. One dedicated developer, Anton, played the role of the build server. He assembled release and debug APKs at the testers’ request and handed them over for test runs.

Build, this is Anton. Anton, this is the build you need to assemble.

Of course, regular builds were off the table. We didn’t run any tests on a regular basis. Consequently, any commit to the ‘develop’ or ‘master’ branch could break absolutely everything.

Apart from that, we didn’t run any static code analysis. As a result, the code style was inconsistent throughout the project. For example, in our commit history you could find special commits whose sole purpose was to re-apply the adopted code style to the entire codebase. There’s no joy in that, my word.

Really, a lot of commits?

The history contained numerous commits correcting code style in lots of files.

Because the code style wasn’t clearly fixed in the IDE settings, a commit often had to touch several files just to correct formatting. A few days would pass, and we had to correct a huge pile of files all over again.

What did we do to improve the situation?

We realized that we’d had enough of living like that, and that it was time to set up CI.

  • Set up build machines — we went to the hh infrastructure team and ordered dedicated build machines with the configuration we needed. We didn’t need much: Java and the Android SDK;
  • Wrote out the build types — we created a table listing each build’s name, the criteria by which a given flow should be triggered, and exactly what should be run within it.
Planning matrix for CI

In the table we thoroughly described why a particular build was needed, which trigger should invoke it, whether it needed unit tests or static code analysis, and which apps and build types it should assemble.

Thus, we managed to get four major build types:

  • Pull request (PR) build — a developer creates a PR, and we launch the build on our CI server. This build checks app compilation and runs static code analysis and unit tests;
  • Night build — a regular nightly build. Our developers work on features in separate feature branches, and we assemble these branches every day until they are merged to ‘develop’. Here we run unit tests and UI tests and check the compilation of all apps. In a nutshell, this build assembles everything.
  • PR to Develop build — launched when you try to merge into develop. When developers finish working on their feature, they try to merge it into develop. This is the moment when we need to examine the feature thoroughly: run all tests, static code analysis, and overall compilation.
  • Custom build — a build that could be configured for almost anything. Potentially it was capable of building release versions, debug versions, flavors, optionally running unit tests, and so on.

How did we implement all the four flows?

We wrote a single huge Gradle script. It was designed to do almost anything: account for any steps, launch any tests, and so on. In effect, it implemented the Custom build plan.

What it looked like:

Just a small piece of that gigantic Gradle-script

One screen isn’t enough to display it all: the same file contained all the necessary Groovy classes and plenty of utilities.

Variables from our CI server toggled particular build features on and off. Thanks to this unified script, we managed to configure all four of our flows.

What variables are we talking about?

Variables in Bamboo for build configuration
  • is_crashlytics_release — the flag that controlled whether the final build was pushed to Crashlytics Beta (tears of nostalgia);
  • step_app_names — the list of apps to build; in this plan only the applicant app was listed;
  • step_build_types — all the necessary build types: Debug, PreRelease, Release;
  • step_extra — an additional Gradle task to run;
  • step_to_fabric — it’s not quite clear why it said ‘true’ here, since that step expected ‘AppName:BuildType’ entries for building additional test apps;
  • step_ui_tests — whether UI tests should be run;
  • step_unit_tests — the Gradle task for running tests in this particular build;

To be concise: the customization worked, although not always consistently.

Hooray, the CI started working!

What conclusions did we make?

  • Life without CI is a pain indeed — the earlier you turn it on and start running tests and static code analysis on a regular basis, the sooner your app and code base quality will reach a whole new level.

And then half a year passed.

What happens when you are not aware of what you’re doing

It was only after a year and a half of using and patching our super-script that we realized it wasn’t as good as we wanted it to be.

  • The script was a long sheet of Groovy code — it lived in a single Gradle file. Most of it was one huge Gradle task, written with no regard for Gradle best practices;

A little note

Writing the script in Groovy wasn’t actually that bad. Back then Kotlin scripting wasn’t very popular, and KTS itself was just getting started.

  • We didn’t immediately grasp what the Gradle task actually did

Its job boiled down to the following:

  • composing a bash command as text;
  • appending it to a shell file;
  • executing that shell file.

So here’s what happened: we launched a Gradle task on CI. That task launched bash, which… launched Gradle once again.

Well, you got it

Pimp my ride!

I put a totally different joke in the video, not the one that I wanted, so let this be here.

At some point we found out that our PR builds were taking approximately an hour. We couldn’t inspect build scans, because the Gradle task was launched in an odd way.

  • Confusing static code analysis — these scripts were invoked through obscure paths, so we couldn’t properly update detekt;

In addition, development of the applicant app kicked off at the same time, quite intensively. With a huge number of new features and refactorings, we had to speed up regression testing.

How did we solve the new problems?

  • We tried to stop using the super-customized script — to do so, we moved bash command generation into bash scripts configured through our CI server’s UI (Bamboo provides this option out of the box). We didn’t dig into the problems inside those bash scripts; we just took them as they were and started running them regularly;

What did it look like?

Bash scripts right on Bamboo

Bamboo has a special task type, Script Configuration, and that’s where we copied the necessary scripts.

  • We began setting up UI test runs and asked our colleagues from the infrastructure teams to help us. They found a Kubernetes cluster and configured emulator launches (you can read the whole report on it here). Our part was to describe a small Docker container with Java, the Android SDK, and the Marathon utility, which was supposed to help us run UI tests.

The results of half a year of working

What did we manage to achieve?

  • We stopped using the Custom Build script in all flows except the release one — the script still lived on in our code base until we got rid of it about three months ago.
  • The logic of the infrastructure scripts became a bit clearer — we visualized it in the UI of our CI server and broke it into steps, and life became easier;
  • We fixed the problem with our PR builds — they no longer took an hour; plus, we got build scans working again;
  • Regular UI test runs accelerated regression testing — we got the opportunity to release more often.

What did we learn from that time?

We learnt that it was no use writing infrastructure-scripts in haste. It’s better to explore the technology you’re going to utilize, prepare the infrastructure, make a bunch of decisions on tools, discuss the process of work and so on.

Half a year passed like that.

I know what you were doing 2 years ago

Within half a year, the team went through some changes, and a couple of problems arose:

  • The people working on infrastructure scripts changed — due to the low bus factor, we had to learn the build scripts from scratch again;
  • We hadn’t refactored the bash scripts for a long time — while we kept expanding their functionality, they became more and more convoluted;
  • Logic duplication in scripts — soon we noticed that the scripts duplicated each other in almost every flow we launched on CI. They differed only in small details: some required a special Gradle flag, others needed tests. Any change or update we wanted to make had to be duplicated across almost all settings of the different plans and flows.

Let me take a look

Find five differences

And now imagine that there are 10 tabs like that.

  • CI script changes affected every developer — the CI plan configuration was shared by all developers, so to test updates we had to either create an intermediate plan or resort to special tricks;
  • CI logs were hard to read — our CI server writes information about step execution into the log when a particular step starts inside a build. However, the stream of logs took up so much space that it was easy to miss those lines.
  • Sometimes the build machine configurations changed — occasionally the infrastructure team changed something on machines that were already set up, and we lost the utilities our launches needed.

How did we live, what did we do?

  • We tried to solve the problem of updating the sh-scripts — we extracted them from our CI as separate sh-files and added them to the repository. CI then launched not a fixed sh-script, but the one inside the repository. As a bonus, we got the ability to change any sh-script and test it inside a particular branch. True, we still couldn’t add new steps on CI, because that still affected all plans. However, one bash script could launch another bash script, and that’s how we were able to test new steps.
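
The scheme above can be sketched as a tiny wrapper: the CI server keeps one fixed step that simply delegates to whatever version of the script lives in the checked-out branch (the path ci/pr_checks.sh is an illustrative example, not our real layout):

```shell
# The only thing hardcoded on the CI server: delegate to the script
# from the repository, so every branch carries its own version of it.
# The script path used below is a hypothetical example.

run_from_repo() {
  script_path="$1"
  shift

  if [ -f "$script_path" ]; then
    # The branch has its own copy - run it, forwarding all arguments
    sh "$script_path" "$@"
  else
    echo "No $script_path in this branch, skipping" >&2
  fi
}

# On the CI server the fixed step is then just:
# run_from_repo ci/pr_checks.sh
```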

Were there many scripts?

sh-scripts inside the repository

A lot. And many of them duplicated each other.

  • We simplified log reading — we added a little utility to our scripts. It accepted the name of the script to launch and its arguments. Using a special construction in bash, it figured out which script we wanted to launch, printed a clear description surrounded by lots of special characters so it stood out in the logs, and then launched the script.

What construction are we talking about?

If you ever wondered what a ‘switch-case’ looks like in bash, here it is:
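
The original screenshot didn’t survive the export, but a dispatcher in that spirit can be sketched like this (the script names, paths, and banner format are made-up examples, not the real hh.ru ones):

```shell
# Sketch of the log-friendly dispatcher described above: maps a script
# name to a readable description, prints a loud banner so the step is
# easy to spot in the log stream, then launches the matching sh-file.
# Script names and paths are hypothetical examples.

run_script() {
  script_name="$1"
  shift

  case "$script_name" in
    unit_tests)
      description="Running unit tests"
      ;;
    static_analysis)
      description="Running static code analysis"
      ;;
    build_apk)
      description="Assembling APK"
      ;;
    *)
      echo "Unknown script: $script_name" >&2
      return 1
      ;;
  esac

  echo "############################################"
  echo "### $description"
  echo "############################################"

  sh "ci/$script_name.sh" "$@"
}
```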

  • Moved some tasks to Docker — to minimize the risks connected with infrastructure changes, we started contemplating moving our tasks into a Docker container, APK builds in particular. Until then, they ran directly on the build machines.

It was the moment of enlightenment for us

We understood that:

  • It’s necessary to increase the bus factor — build up the expertise of all the developers working on infrastructure scripts. That way you won’t lose the valuable knowledge you’ve been gathering if people decide to leave. Since then we’ve been holding demos and writing wiki articles, so that the knowledge our developers carry is also kept somewhere else;
  • Docker simplifies infrastructure relocation — now we can move our builds between different machines at any time; the only requirement is that bash and Docker are installed;
  • Don’t write sophisticated constructions in bash — huge functions and assorted switch-cases look heavy in bash and are barely maintainable.

That was the state of our code base almost up to the present moment. Now let’s move a bit closer to May 2021.

The matter of the not-so-long-gone past

It’s been a year since we restarted the development of our new employment application. It was frozen for a year before that, and we hadn’t launched it on CI.

In addition, we continued to change the functionality of the existing infrastructure scripts without any global refactoring. Yes, we were aware of some of the problems. We knew ways to solve them, but we didn’t have a compelling reason to do it. There’s a perfect Russian saying that describes what we were going through: hedgehogs poked, but kept eating the cactus. It’s like the task of Sisyphus, except imagine Sisyphus carrying a gigantic cactus up the hill. Poor thing.

Our editor begged to put this image

Hedgehogs poked, but kept eating the cactus.

I couldn’t say no to him.

What were the problems that we faced?

  • Messy and complicated bash scripts — moving the bash scripts out of our old Groovy script without any changes had serious repercussions. We didn’t study the bash scripts’ logic thoroughly when relocating them to Bamboo, nor when moving them back to the repository. That played a dirty trick on us: at some point we realized the scripts were a mess. One called another, which in turn called a third. No logic, no consistency. Arguments were passed between scripts and converted into nonsense along the way.
  • Every second script uses Bamboo variables — why is that bad? Because we remained tightly coupled to our CI server. If we had decided to move to any other CI server (e.g. Jenkins), we would have faced an unpleasant surprise and a lot of extra work;

CI variables?

The scripts are bound to specific CI variables

Some scripts required the name of the current branch or the plan where the build was launched; others needed the build number, the path to the artifacts, and so on.

  • Not all work is done inside a Docker container — it was 2021, and we still hadn’t moved all our infrastructure scripts inside a Docker container. Part of them still ran on the build machines as usual and suffered from every relocation; the other part ran inside a Docker container;
  • A zoo of programming languages in the infrastructure scripts — after three years of the infrastructure scripts’ existence, a whole zoo of programming languages had emerged: Groovy, Kotlin, Bash, Python, even Go!
  • ‘Scripts from outside’ — part of the infrastructure scripts lived in a separate repository that had no connection to the mobile developers. When those scripts crashed, we had no idea how to fix the problem.

Scripts’ versioning

Add to all these issues the fact that our infrastructure scripts were still configured through the UI of our CI server, and you get the complete ‘suffering-developer starter kit’.

Why is it bad? Because that configuration wasn’t versioned at all, and it was universal for all the plans. That provoked a new wave of problems.

  • Adding a new step became more complicated — we had to either be patient and suffer in silence from infrastructure script crashes, or wait until a script with a fix was merged to ‘develop’. There was a third option: checking whether a particular file existed in the repository and, if it did, launching it. That complicated the script update process;
  • Step duplication between Bamboo plans — to reuse the steps of the described flows, we had to duplicate their descriptions between the plan settings on Bamboo. One plan launched an APK assembly, another did the same, and we had to duplicate all these script launches between the plans.
  • No code review for plan configuration — which means the quality of the plans could suffer significantly;
  • Not every developer can fix the settings — only the privileged few have access to the plan configuration on CI, and only they can change or correct anything;
  • No way to reuse common logic between platforms — and there was such logic: Slack notifications, adding links to Jira and GitHub. Since the scripts were configured on Bamboo, we couldn’t reuse them.

All this mess caused the HELL I was talking about at the start. We had to get rid of it, but we lacked a trigger that would nudge us to start changing everything.

How we were going to solve the problems

While the trigger was nowhere on the horizon, we kept planning improvements anyway.

  • Moving tools to Docker — the problem of utilities running outside Docker could be solved by moving all the tools into a Docker container. We described a Dockerfile with everything we needed installed: Java, the Android SDK, Marathon, Allure, Python, Gradle Profiler, and lots of other stuff.
  • A single entry point into Docker — we wrote a special bash script that has become the new single entry point for our infrastructure scripts. It simply runs the Docker container and hands over control, so the commands and utilities we need can be invoked inside the container’s context;

It looked pretty simple

Single access point to Docker container

Having written such a script, you can use it on CI this way:

sh ci/run_in_docker.sh "some_command_for_execution"
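
The script itself isn’t shown above, so here is a minimal sketch of what such an entry point might look like. The image name, mount paths, and the DOCKER override (handy for dry runs) are our assumptions, not the real hh.ru setup:

```shell
# ci/run_in_docker.sh (sketch): a single entry point that runs an
# arbitrary command inside the build container. The image name and
# mounts are illustrative assumptions; the DOCKER variable can be
# overridden (e.g. with 'echo') for dry runs.

run_in_docker() {
  "${DOCKER:-docker}" run \
    --rm \
    --volume "$(pwd)":/workspace \
    --workdir /workspace \
    android-build-image:latest \
    bash -c "$1"
}

# CI then calls: sh ci/run_in_docker.sh "some_command_for_execution",
# which boils down to: run_in_docker "$1"
```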
  • Converting CI variables to ENV — the dependence of our infrastructure scripts on the CI server could be resolved by gathering the list of all CI variables in one place. We pass them into the Docker container as ENV variables, and now those are the only variables we use inside the scripts. If one day we decide to leave Bamboo, it will be enough to change just one file, the one that maps the Bamboo variables.

Moving CI variables

Converting CI variables in ENV

That’s how you can pass the ENVs into Docker:

docker run \
  --env-file <(echo "$envs" | grep -E '.+=.+') \
  ...

It passes through only the ENVs whose value after the ‘=’ sign is not an empty string.
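
You can check that filter in isolation: only NAME=value lines with non-empty text on both sides of the ‘=’ survive (the variable names here are made up for the demo):

```shell
# Demonstrates the grep filter from the docker invocation above:
# only lines with non-empty text on both sides of '=' survive.
# Variable names are made-up examples.
envs='BUILD_NUMBER=42
BRANCH_NAME=feature/login
EMPTY_VAR=
NOT_A_PAIR'

filtered=$(echo "$envs" | grep -E '.+=.+')
echo "$filtered"
# BUILD_NUMBER=42
# BRANCH_NAME=feature/login
```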

The rest of the problems, such as the absence of script versioning, the language zoo, difficult script updates, and the lack of review, could be solved with the approach called Infrastructure as Code.

To be honest, full-blown Infrastructure as Code would have been beyond us. But simply converting the infrastructure scripts to code was quite affordable. All that was left was to choose a tool and decide how to implement those infrastructure details inside our code base.

Miles away a man was showing QA

Out of the blue they emerged: our testers, our incredible QA department. Our tester Danya talked about automating our release flow in one of the episodes of ‘HHella cool stories’. Thanks to that, tools like Ruby and fastlane appeared in the code base of the Android app.

Our testers had serious intentions. They didn’t want to rewrite the scripts that were already working on iOS, which is why they pushed for fastlane.

Even though Android developers are more used to Kotlin, Gradle scripts, and bash, we still decided to give fastlane with Ruby a chance.

  • This way we could say goodbye to the language zoo — Ruby is capable of almost everything that Python and Groovy can do;
  • The opportunity to reuse some of the iOS team’s code — iOS developers already had a huge amount of expertise in Ruby and fastlane, so we could lean on their experience during reviews and automate the process starting from their drafts;
  • Extensive Ruby and fastlane ecosystem — Ruby and fastlane have a broad ecosystem built around continuous integration and continuous delivery tasks. We did a bit of research and realized that all our tasks could be completed with Ruby and fastlane, so we could retire the old scripts.

In general, by making revolutionary changes like moving to fastlane and Ruby, we could fix all the mentioned problems at once. We thought: ‘Why not?’

Yes, we were aware of the risks. Android developers are absolutely not used to working with Ruby, and putting all the scripts on new rails is a long and complicated process. However, the problems that had been piling up for a while simply outweighed the risks, and we started preparing for the changes.

Major changes in scripts

This time we decided to approach changing the infrastructure scripts systematically, so we created a Miro board with all the info on our CI flow: which steps to use, which of them are reusable, which arguments they accept, what has to be launched in parallel and what can run sequentially, etc.

Miro pictures

We’ve written out the elements of the plan and what’s happening in them.
A lot was mentioned
An example of a detailed job plan step by step
There were lots of such detailed plans

We started modifying our scripts with the plan connected to pull requests. Since we had decided to run all these actions on CI inside a Docker container, that’s where we added the tools for launching Ruby and fastlane (rvm / bundler).

After that, we started porting the existing infrastructure scripts to Ruby. Along the way, we tried to reuse the utilities the iOS team already had.

We made a separate repository for that, designed as a fastlane plugin, and started moving the common code there. This worked out for many things: Jira integration, git utilities, common constants and models, plus the code for our release flow.

Common repository

Our small repo with common Ruby scripts

The structure of the repository was created automatically by `fastlane new_plugin`.

A bit about pain and suffering

After all of this, you’re probably asking: ‘Did it really go that smoothly with Ruby and fastlane?’ It seems we have to talk about the sad parts too. And to your delight, there was plenty to cry about.

  • Android developers rarely work with Ruby — it’s hard to reprogram your Java-tuned brain, and even harder to rewire a Kotlin-focused mind. Familiar concepts disappear: there’s no abstract class in Ruby, no Java-style interface. Instead, other language features appear, for example duck typing.
  • Dynamic typing hurts — we hit a quite painful example, which we only found a month after the code had been added.

What was the pain about?

We had written a script that was supposed to check whether our test stands were up to date. One of its steps was creating a JSON model and sending it to the server. When creating the model, the date had to be converted into an array of numbers for our server.

FULL_DATE_TIME_FORMAT = '[%Y,%m,%d,%H,%M,%S,0]'

def self.current_time_for_fixtures(plus_days: nil)
  time = Time.now
  time -= (24 * 60 * 60 * plus_days) unless plus_days.nil?
  time.strftime(FULL_DATE_TIME_FORMAT).split(',').map(&:to_i)
end

Due to the incorrect FULL_DATE_TIME_FORMAT (note the square brackets), the conversion broke, and the function returned an array containing zeros. This broke a bunch of our UI tests with some wild flakiness. Ideally, when the function failed to convert something, it should have thrown some kind of exception, but no: Ruby’s to_i silently interpreted the error as 0.

This is how it works correctly:

FULL_DATE_TIME_FORMAT = '%Y,%m,%d,%H,%M,%S,0'

def self.current_time_for_fixtures(plus_days: nil)
  time = Time.now
  time -= (24 * 60 * 60 * plus_days) unless plus_days.nil?
  time.strftime(FULL_DATE_TIME_FORMAT).split(',').map do |arg|
    Integer(arg, 10)
  end
end

In short: pain.

  • Android Studio doesn’t support Ruby — for Ruby you need VS Code (with the plugins VSCode Ruby, Ruby, ruby-rubocop, Ruby Solargraph), IntelliJ IDEA Ultimate (with the Ruby plugin), or RubyMine;
  • Insufficient fastlane support in IDEs — the specifics of fastlane are not supported, which causes a lot of problems. Once I couldn’t call a fastlane Action and spent half an hour hunting for the mistake, only to find out that the Action class was missing the Action suffix. IDEA inspections would have come in handy there, but there weren’t any.
  • Problems with the private common repository — we lost a lot of time before we learned how to use our fastlane plugin on CI. There were problems with the special ENV variable holding the access token for the GitHub repository;
  • Absence of common rules for writing Ruby code — we easily shared a code style with the help of RuboCop, but we had no common rules for naming functions or generalizing shared code. So we wrote code separately from our iOS colleagues, which caused misunderstandings during reviews.
  • Problems launching bash commands from fastlane — for some reason the grep command kept failing. Eventually we rewrote that part in Ruby, and everything started working.

The advantages of the current approach

Despite all the difficulties, switching to fastlane has significantly improved our infrastructure scripts.

  • We managed to describe the flows of our CI plans as high-level functions — such functions are easy to read and maintain;

Example

desc 'Run checks for PR: static analysis + unit tests + app build'
lane :checks_for_pr do
  add_jira_link_to_pr(
    pull_path: HeadHunter::GitHub.pulls_path("android")
  )
  static_analysis
  build_and_run_unit_tests
  approve_if_possible
end
# And then - separate functions `add_jira_link_to_pr` / `static_analysis` / etc.
  • We switched almost all CI plans to fastlane — and we’re going to convert the rest, because we really enjoyed the process.
  • We dismissed the language zoo — now we have only bash as a single entry point, plus Ruby;
  • CI plan configuration now lives in the repository with the source code — these scripts now go through code review, and they can be reviewed both by Android developers and by the iOS team.
  • The attempt to reuse common logic with the iOS team proved successful — we moved the common scripts to a separate repository, and now both teams use it.

The conclusion of the story

What decisions do we consider the most regrettable throughout the whole history of our evolution?

  • Not setting up CI sooner — this was our biggest mistake at the beginning. In other words: the earlier you launch a CI that makes your app builds, tests, and static code analysis a regular process, the better. Don’t put it off; it’s a direct path to a significant improvement in your app and code base quality.
  • A low bus factor around infrastructure — if you lose the sacred knowledge about the different parts of the app, restoring it takes a lot of time. Share the information, hold demos, write it down on the wiki;
  • A language zoo in infrastructure scripts is true evil — use as few tools as possible; it makes the whole process much easier to figure out.
  • Describing CI tasks only in the CI server’s UI — it causes all the issues I have described above.

What decisions do we consider successful?

  • The Infrastructure as Code approach — it lets you version your infrastructure scripts, review them, change them easily, update them, and create new steps;
  • Reusing scripts between platforms — it’s not suitable for every team, but it’s great when you manage to reuse common infrastructure logic;
  • Increasing the bus factor — the more people know about your infrastructure and its details, the better.

That’s all I wanted to say. If you have any feedback, leave it in the comments below.

Android developer at hh.ru, recent Mobius speaker, passionate about IntelliJ IDEA plugins development