Modularising the Badoo iOS app: dealing with knock-on effects
In my previous article, I explained how we singled out the chat functionality in our app for modularisation. Everything had gone well, and we were preparing to apply that experience to modularising the whole of Badoo iOS app development. We even gave a presentation on the approach to the product, testing and continuous integration teams, before gradually starting to introduce modularisation into our processes.
We immediately realised there would be problems so we didn’t rush things. We rolled out the solution in stages and this helped us identify the problems we explore below.
In this article I explain:
- How we got lost in a complicated dependency graph
- How we rescued the CI from overload
- What to do if the app launches more slowly with each new module
- Which indicators you need to monitor, and why this is necessary.
Complex dependency graph
Only once the number of modules really started to grow did we realise that linking and supporting an extensive dependency graph is pretty complicated. Our mental picture of 50 modules arranged in an attractive hierarchy was quickly dispelled by harsh reality.
This is how the Badoo app dependency graph looked when we had about 50 modules:
No effort was spared on visualising this in a convenient and readable manner, but it came to nothing. The complicated graph helped us to realise several things:
- Visualising all dependencies is both a complicated task and one that makes little overall sense. When working with a graph you need to know both the problem being looked for or addressed and which modules are affected. This tool can only be used effectively in connection with a graph built around one or, at most, just a few modules.
- Visualisation is complicated and this, in turn, makes debugging complicated. It is no easy matter to identify fundamental problems in a complex dependency graph.
- Adding dependencies, especially intermediate ones (we spoke about these in part one), is a highly resource-intensive task that not only creates an additional workload for developers but also distracts them from their product-related tasks.
What we took away from this is that simple visualisation is still necessary, but that work with the dependency graph needs to be automated. This is what led us to create our internal utility, deps. Its purpose was to solve our new tasks: searching for problems in the dependency graph, pointing developers to them and correcting typical linking errors.
The utility’s main characteristics are:
- It is a console Swift application
- It works with xcodeproj files using the XcodeProj framework
- It understands third-party dependencies (we haven’t been active or keen on accepting them into the project, but we do use some of them: they are loaded and built through Carthage)
- It is included in the continuous integration processes
- It knows the demands our dependency graph is subject to and works in accordance with these.
The final point is the reason it is not worth waiting for similar open-source utilities to appear. Today there is no universal, well-described, one-size-fits-all structure for projects and their settings. Different companies may differ in a number of graph parameters:
- Static or dynamic linking
- Tools for supporting third-party dependencies (Carthage, CocoaPods, Swift Package Manager)
- Resources stored in separate frameworks or at app level
- And others.
For this reason, if you are looking at having upwards of 100 modules, at some stage you will most likely have to think about writing a similar utility.
So, in the interests of automating the work with the dependency graph, we have developed several commands:
1. Doctor. This command verifies whether all dependencies are correctly linked and embedded in the app. After running it, we either get a list of errors in the graph (for example, something missing in the “Link Binary With Libraries” or “Embed Frameworks” phase), or the script says that everything is fine and you can proceed.
2. Fix. This is an evolution of the “doctor” command: it automatically corrects the problems that “doctor” finds.
3. Add. This adds a dependency between two modules. When the app is simple and small, adding a dependency between two frameworks looks like a simple task. But when the graph is complex and multi-level, and you are working with explicit dependencies switched on, adding the necessary dependencies is not something you want to do manually each time. Thanks to the “add” command, developers simply specify two framework names (the dependent one and the “dependee”), and all the build phases receive the necessary dependencies, as the graph illustrates.
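To make the “doctor” idea more tangible, here is a minimal sketch in Swift. It is not the real deps implementation: the `Target` type, its two fields and the single rule checked (an app must embed every framework it reaches through link dependencies) are assumptions for illustration, and the real utility reads this information from xcodeproj files rather than from a hard-coded dictionary.

```swift
import Foundation

// Hypothetical, simplified model of a target. All names here are illustrative.
struct Target {
    let linked: Set<String>    // contents of the "Link Binary With Libraries" phase
    let embedded: Set<String>  // contents of the "Embed Frameworks" phase (apps only)
}

/// Collects every module reachable from `root` by following `linked` edges.
func transitiveDependencies(of root: String, in targets: [String: Target]) -> Set<String> {
    var seen = Set<String>()
    var stack = Array(targets[root]?.linked ?? [])
    while let next = stack.popLast() {
        if seen.insert(next).inserted {
            stack.append(contentsOf: targets[next]?.linked ?? [])
        }
    }
    return seen
}

/// Returns a sorted list of problems for an app target; an empty list means the graph is healthy.
func doctor(app: String, targets: [String: Target]) -> [String] {
    guard let appTarget = targets[app] else { return ["Unknown target \(app)"] }
    let required = transitiveDependencies(of: app, in: targets)
    let missing = required.subtracting(appTarget.embedded).sorted()
    return missing.map { "\(app): \($0) is used but not embedded" }
}

// Made-up graph: App links Chat, Chat links Analytics, but App embeds only Chat.
let targets: [String: Target] = [
    "App": Target(linked: ["Chat"], embedded: ["Chat"]),
    "Chat": Target(linked: ["Analytics"], embedded: []),
    "Analytics": Target(linked: [], embedded: []),
]
let problems = doctor(app: "App", targets: targets)
problems.forEach { print($0) }
```

A “fix” step in the same spirit would take each reported module and add it to the app's embed phase instead of only printing the problem.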
As a result, the script for creating a new module, as per the template, has also become part of the “deps” utility. What does this give us in the end?
- Automated support for the graph. We find errors right in the pre-commit hook, keeping the graph stable and correct, and giving the developer a chance to correct these errors in automatic mode.
- Simplified editing. You can add a new module or connections between modules with a single command, which greatly simplifies development.
Continuous integration (CI) unable to cope
Before we moved to modularisation we had several applications, and our strategy was simply to “check everything”. Irrespective of what had actually changed in our monorepository, we simply rebuilt all the applications and ran all the tests. If everything was fine, we allowed changes to make it into the main development branch.
Modularisation quickly made it clear that this approach scales poorly. There is a linear correlation between the number of modules and the number of CI agents needed to run the checks in parallel: when the number of modules increases, the queues for the CI also increase. Of course, up to a certain point you can simply purchase new build agents, but we opted for a more rational path.
Incidentally, this wasn’t the only problem with the CI. We found that the infrastructure was also subject to problems: the simulator might not run, the memory might prove insufficient, the disk might get damaged etc. While these problems were proportionally small in the scale of things, as the number of modules to be tested (jobs run on agents overall) increased, the absolute number of incidents grew, and the CI team were no longer able to handle incoming requests from developers promptly.
We tried several approaches, all unsuccessful. Here’s more detail so that you can avoid making the same mistakes.
The obvious solution was to stop building and testing everything all the time. The CI needs to test what needs to be tested. Here is what failed to work:
- Calculation of changes based on the structure of directories. We tried keeping a file with a directory ↔ module mapping in the repository, in order to see which modules needed to be tested. However, literally within a week, an increase in production crashes led us to discover that a file had been moved to a new module and changed at the same time, while the mapping file had not been updated. Updating the mapping file automatically seemed impossible, so we moved on in search of other solutions.
- Running integration tests on the CI at night. Overall, this is not a bad idea, but the developers did not thank us for this. It became a regular occurrence that you would go home trusting that everything would be fine, and in the morning on the corporate messenger, you would get a message from the CI, that 25 tests had been unsuccessful. That meant the first thing you had to do was to get to the bottom of the previous night’s problems, which were potentially still blocking someone’s work. Basically, not wanting to spoil people’s breakfast we continued our search for the optimal solution.
- The developer indicates what to test. This experiment ended more quickly than all the others, literally within a couple of days. It led us to conclude that there are two types of developers:
  - Those who test everything, just in case
  - The over-confident, who believe themselves incapable of breaking anything.
As you will have understood, the compulsive testers kept the queues for the CI much the same, while the over-confident developers caused our main development branch to break.
In the end, we returned to the idea of automated calculation of changes, but from a slightly different angle. We had the deps utility, which knew about the dependency graph and project files, and through Git we could obtain a list of changed files. We extended deps with the “affected” command, which gave us a list of affected modules based on the changes shown by the version control system. Even more importantly, it took account of dependencies between modules: if some modules depend on an affected module, they have to be tested as well, so that, for example, if the interface of a lower module changes, the upper one still builds.
Example: changes to the “Registration” and “Analytics” blocks on our diagram point to the need to test the “Chat”, “Sign In with Apple” and “Video streaming” modules, as well as the app itself.
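The core of such an “affected” calculation can be sketched as a walk over the inverted dependency graph. The following Swift snippet is a simplified illustration, not the actual deps code: the function name, the graph and its edges are made up to mirror the example above, and the step that maps changed files (for instance from `git diff --name-only`) to modules is left out.

```swift
import Foundation

/// Given a map of module -> direct dependencies, returns the changed modules
/// plus every module that depends on them, directly or transitively.
func affectedModules(changed: Set<String>, dependencies: [String: Set<String>]) -> Set<String> {
    // Invert the graph: dependency -> modules that depend on it directly.
    var dependents: [String: Set<String>] = [:]
    for (module, deps) in dependencies {
        for dep in deps {
            dependents[dep, default: []].insert(module)
        }
    }
    // Walk "upwards" from the changed modules, visiting each module once.
    var affected = changed
    var queue = Array(changed)
    while let current = queue.popLast() {
        for parent in dependents[current, default: []] {
            if affected.insert(parent).inserted {
                queue.append(parent)
            }
        }
    }
    return affected
}

// Hypothetical edges, chosen only to reproduce the example in the text.
let graph: [String: Set<String>] = [
    "App": ["Chat", "SignInWithApple", "VideoStreaming", "Payments"],
    "Chat": ["Analytics"],
    "VideoStreaming": ["Analytics"],
    "SignInWithApple": ["Registration"],
    "Registration": [],
    "Analytics": [],
    "Payments": [],
]
let affected = affectedModules(changed: ["Registration", "Analytics"], dependencies: graph)
print(affected.sorted())
```

In this made-up graph “Payments” is left out of the result, because nothing it depends on has changed, so the CI would skip it.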
This was a tool for the CI. But developers also had the option of viewing locally what could potentially be affected by their changes, in order to manually check that the dependent modules still worked.
This delivers a range of benefits:
- In the CI we only test what has actually been affected directly or indirectly.
- There was no longer a linear correlation between the duration of CI tests and the number of modules.
- The developer understands what may be affected by their changes, and so where they need to be careful.
- The mechanism was subsequently adapted not only for testing error-free module compilation but also for running unit tests on affected modules. This in turn substantially reduced the load on the CI and changes reached the master more quickly.
Waiting for E2E (End-to-End) tests to complete
For the Badoo application, we have over 2000 end-to-end tests, which launch the app and walk through usage scenarios, checking for the expected results. If you ran all these tests on one machine, getting through all the scenarios would take about 60 hours. For this reason, on the CI all the tests are launched in parallel, insofar as the number of free agents permits.
After successful implementation of filtering based on changes for unit tests, we wanted to implement a similar mechanism for end-to-end tests. And this is where we encountered a clear problem: there is no direct correlation between end-to-end tests and modules. For example, a scenario for sending messages in the chat also tests the chat module, the module for loading images and the module which is the entry point for the chat. In actual fact, one scenario may indirectly test up to seven modules.
To resolve this problem, we created a semi-automated mechanism based on a mapping between modules and the sets of functional tests that we have.
For each new module, a separate script in the CI checks that the module is present in this mapping. This makes sure developers remember to synchronise with the testing team and tag the module with the necessary groups of tests.
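As a rough illustration of how such a mapping can be used, here is a small Swift sketch. The mapping contents, module names, test group names and both function names are hypothetical; it only shows the two operations described above: checking that every module is present in the mapping, and collecting the test groups to run for a set of affected modules.

```swift
import Foundation

// Hypothetical mapping kept up to date by developers and testers:
// module name -> groups of end-to-end tests that exercise it.
let testGroups: [String: Set<String>] = [
    "Chat": ["chat", "messaging"],
    "ImageLoader": ["chat", "profile"],
    "Registration": ["signup"],
]

/// CI check: returns the modules that are missing from the mapping, sorted by name.
func unmappedModules(allModules: Set<String>, mapping: [String: Set<String>]) -> [String] {
    return allModules.filter { mapping[$0] == nil }.sorted()
}

/// Union of the test groups to run for a set of affected modules.
func groupsToRun(affected: Set<String>, mapping: [String: Set<String>]) -> Set<String> {
    return affected.reduce(into: Set<String>()) { result, module in
        result.formUnion(mapping[module] ?? [])
    }
}

let missing = unmappedModules(allModules: ["Chat", "Payments", "Registration"], mapping: testGroups)
let groups = groupsToRun(affected: ["Chat", "ImageLoader"], mapping: testGroups)
print("Missing from mapping:", missing)
print("Test groups to run:", groups.sorted())
```

In this sketch the CI job for a new module would fail while `missing` is non-empty, which is what prompts the conversation with the testing team.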
This sort of solution can hardly be described as optimal, but implementing it still brought tangible advantages:
1. The load on the CI was substantially reduced. Here is a graph to prove it. It records the time spent in the queue by the task for running end-to-end tests (the red line shows where tests got smart):
2. The noise from infrastructure problems was reduced (the fewer tests run, the fewer crashes due to frozen agents, broken simulators, lack of space etc.).
3. The mapping of modules and their tests became a place where development and testing departments were able to synchronise. In the context of development, the programmer and the tester now discuss, for example, which of the groups of tests available may also be suitable for testing a new module.
App launching too slowly
Apple openly states that dynamic linking slows down the launch of an application. However, it is precisely this which is the default option for any new project created in Xcode. This is what the real graph, showing the launch time for our projects, looks like:
In the centre of the graph, you can see a sharp reduction in time taken. This is due to the transition to static linking. What is this connected with? dyld, Apple's tool for dynamically loading modules, performs labour-intensive tasks, and not in an entirely optimal manner. There is a linear correlation between the runtime of these tasks and the number of modules. This was the main cause of our application launching more slowly: the more new modules we added, the slower dyld became (the blue plot shows the number of modules added).
After the transition to static linking, the number of modules stopped affecting the application's launch speed. And when iOS 13 came out, Apple transitioned to using dyld3 to launch applications, which also helped speed up this process.
It should be noted though that static linking also carries with it a number of limitations:
- One of the most inconvenient limitations is that resources cannot be stored in static modules. This problem has been partially resolved in the relatively new XCFrameworks, but this technology cannot yet be considered to have been tested over time, or to be fully ready for business use. We resolved this problem by creating, alongside each module, separate bundles, which are then packed into the finished application or into the test bundle for running tests. The deps utility also ensures the integrity of the graph when working with resources, which has added several new rules. Right now, we are on the way to migrating our modules to XCFrameworks, but it is proving to be quite time-consuming.
- Having transitioned to static linking, you need to test the application carefully for runtime crashes. Many of them can be corrected simply by accepting less-than-optimal build settings. For example, for almost all Objective-C modules, you need to switch on the -all_load flag. Once again, because the build settings had been moved out into xcconfigs (see part one), resolving all these problems was not as tortuous as it might have been.
And so, at this point we have dealt with the two main problems: we have moved resources out to separate bundles and we have corrected the build configuration. As a result:
- The number of modules in the application has ceased to slow its running speed
- The size of the application has been reduced by about 30%
- The number of crashes is down three-fold due to faster launching. iOS “kills” applications if they take a long time to launch. On devices that are more than three years old, with weak batteries, where the processor has not been able to run at full capacity for some time, this really does happen, and such cases have become far less common.
The figures show the direction to go
We have considered global changes which we had to make for a “quiet life” when it comes to the modularisation process:
- Automation of work with the dependency graph: keep an eye on the graph, do things in a way that makes it easy to see what depends on what, and where the bottlenecks are occurring
- Reduce the load on the CI by filtering modules to be tested and using smart tests: don’t fall into the trap of making the duration of CI tests directly dependent on the number of modules
- Static linking: most likely, you will have to move to static linking since, by the time you have 50–60 modules, the drop in launch speed will start to become noticeable to you and probably to your managers as well, as it did for us.
I have not said this directly but, as you may have noticed, we have graphs showing the time that tasks spend in the CI queue, the launch speed of the application etc. All the changes described above were prompted by the values of metrics which are important when developing an application. Whatever process changes you are considering, it is essential that you are able to measure their results. Without that, the changes probably won’t make much sense.
In our case, we realised that changes would have a great impact on developers, so the main metric we kept our eye on, and continue to do so, is the build time for the project on the developer’s hardware. Yes, there are indicators for the full build time on the CI, but developers rarely perform a full build, they use different equipment, and so on. In other words, both the build configuration and the environment differ.
For this reason, we have generated a graph showing the average build time for our applications on developers’ computers:
If the graph displays variations (a sharp decrease in the build speed for all applications), this most likely means that something has changed in the build configurations, and that won’t be good for developers. We look into such problems and try to resolve them.
Another interesting conclusion we drew from these analytics is that a module building slowly is not always a reason to optimise it.
Take a look at this graph. The axes show average build duration and the number of builds per day, respectively. It is clear that, of the modules which take the longest to build, there are those which build just twice a day; overall, their impact on the general experience of working on the project is extremely small. On the other hand, there are modules which only take 10 seconds to build, but they are built more than 1500 times a day. We need to keep a careful eye on these. Basically, try not to limit monitoring to just one module, but see the bigger picture.
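The reasoning behind this graph is simple arithmetic: what matters is the average build duration multiplied by the number of builds per day. The Swift sketch below ranks modules by that product. The module names and the 300-second figure are invented; only the “10 seconds, 1500 builds a day” and “twice a day” figures echo the text.

```swift
import Foundation

struct ModuleBuildStats {
    let name: String
    let averageSeconds: Double
    let buildsPerDay: Int

    /// Total time all developers spend building this module per day, in seconds.
    var dailyCost: Double { averageSeconds * Double(buildsPerDay) }
}

// Illustrative numbers only.
let stats = [
    ModuleBuildStats(name: "HeavyLegacy", averageSeconds: 300, buildsPerDay: 2),
    ModuleBuildStats(name: "CommonUI", averageSeconds: 10, buildsPerDay: 1500),
]

// Most expensive module first.
let ranked = stats.sorted { $0.dailyCost > $1.dailyCost }
for module in ranked {
    print("\(module.name): \(module.dailyCost / 60) minutes of build time per day")
}
```

With these numbers the “fast” module costs 15,000 seconds (250 minutes) of build time a day, against 600 seconds (10 minutes) for the “slow” one, which is why the frequently built module deserves the attention.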
Moreover, it soon became clear which equipment was still suited for working on our project, and which was becoming obsolete.
For example, it is clear now that the 2017 iMac Pro 5K is no longer the best hardware for building Badoo, while the 2018 MacBook Pro 15 is still not bad at all.
But the main conclusion we drew was that it is essential to improve developers’ quality of life. In the process of modularisation you are likely to get so immersed in the technology that you forget why you are doing it. So, it is important to keep in mind the basic motivations and to acknowledge that the repercussions will affect you.
Measuring build time
In order to obtain data on the build duration on developers’ computers, we created a special macOS application called Zuck. It sits in the status bar and monitors all xcactivitylog files in DerivedData. xcactivitylogs are files containing the same information we see in the Xcode build logs, but in Apple’s difficult-to-parse format. Based on these you can see when the build of an individual module began and finished, and the order in which modules were built.
The utility has white and blacklists so we only monitor working projects. If a developer has downloaded a demo project of a library from GitHub, we are not going to send data about its build anywhere.
We send information on building our projects to the internal analytics system, where there is a wide range of tools available for building graphs and analysing data. For example, we have the Anomaly Detection tool, which flags values that deviate too much from the predicted values. If the build time has changed significantly compared with the previous day, the Core team receives a notification and starts to investigate what has gone wrong and where.
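The Anomaly Detection tool is internal and its algorithm is not described here, so the following Swift sketch swaps in a deliberately simple rule as a stand-in: flag today’s value if it deviates from the average of recent days by more than a given fraction. The function name, the tolerance and the numbers are all assumptions.

```swift
import Foundation

/// Flags `today` if it deviates from the mean of `history` by more than
/// `tolerance`, expressed as a fraction (0.2 means 20%).
/// A simple threshold rule, used here only as a stand-in for a real anomaly detector.
func isAnomalous(today: Double, history: [Double], tolerance: Double) -> Bool {
    guard !history.isEmpty else { return false }
    let baseline = history.reduce(0, +) / Double(history.count)
    guard baseline > 0 else { return false }
    return abs(today - baseline) / baseline > tolerance
}

// Made-up average build times in seconds for the previous five days (mean is 120).
let lastWeek: [Double] = [118, 121, 120, 119, 122]

let regression = isAnomalous(today: 150, history: lastWeek, tolerance: 0.2)
let normalDay = isAnomalous(today: 125, history: lastWeek, tolerance: 0.2)
print("150 s flagged:", regression)
print("125 s flagged:", normalDay)
```

Even a check this crude, run once a day against locally collected build times, would catch the kind of sharp change in build configuration described above.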
Overall, measuring the local build time yields the following important benefits:
- We measure the impact of changes on developers
- We have the option of comparing full and incremental builds
- We know what needs to be improved
- We obtain the results of experiments quickly. For example, we change the optimisation settings, and by the very next day we can see on the graph how this has affected developers. If the experiment has been unsuccessful, we quickly roll back the changes.
I will now give you an indication of the metrics which will need monitoring if you have started to move in the direction of modularisation:
1. The launch time for the application. Xcode provides this information in the Organizer section. The metric quickly indicates any problems which have arisen.
2. Artifacts size. This metric helps us to identify linking problems quickly. There may be cases where the linker does not notify us, for example, when a given module is duplicated. However, this will be shown by the increase in size of the compiled application. It is worth paying close attention to this particular graph. The simplest way of obtaining information of this kind is from the CI.
3. Infrastructural indicators (build time, time tasks spent in the CI queue etc.). Modularisation will have an impact on many of the company’s processes, but infrastructure is particularly important because it has to be scaled. Do not fall into the same trap we did, when changes had to queue for 5–6 hours before being merged to master.
Naturally, there will always be room for improvement, but these are the main indicators which allow you to track the most critical problems and errors. Some metrics can also be collected locally if you are not able to invest in complex monitoring. Write some local scripts and look at the main indicators at least once a week. This will help you to understand whether, overall, you are moving in the right direction.
If your impression is, “This is what I am doing now, and things will get better whatever happens,” but you are unable to measure what “better” means, it is best to wait before conducting an experiment along these lines. After all, you probably have a manager to whom you will have to report the results of your work.
After what I have told you, maybe your impression is that building a modularisation process at your company will require lots of energy and resources. This is true in part but let me give you some statistics. Actually, we aren’t that big.
- In total, we currently have 43 iOS developers working for us.
- Four of them are on the Core team.
- We now have two main applications and N experimental ones.
- About 2 million lines of code.
- About 78% of these are in modules.
The final figure is gradually increasing: old features and rewritten legacy code are slowly moving from the main target applications to modules.
In these two articles I have been “selling” modularisation to you but of course, it does have its downsides as well:
- Processes become more complicated: you have to solve a range of issues both relating to your department and rank-and-file iOS developers, and other departments you interact with: QA, CI, product managers etc
- During the process of moving to modularisation, problems will arise that you did not expect. You cannot anticipate everything; you are going to need additional resources and the input of other teams
- Everything you build is going to need additional support: processes, new internal tools — and someone is going to have to be responsible for all this
- Applications based on modules are more dependent on one another. For example, when you change something in the module for working with the database for the Badoo app, within a few minutes you may receive a message from the CI to say that the tests in the Bumble application have not been successful. Such instances will need solutions.
But the good news is that all these downsides are manageable. They are nothing to be afraid of since these are all things that are resolvable. You just need to be ready for them.
We have already looked at the considerable advantages of modularisation but allow me to reiterate them briefly here:
- Flexible scaling of iOS development: new teams start working in their own modules, with a harmonised approach to their creation and support
- Clear areas of responsibility: on the one hand, there are specific people who are responsible for specific sections of code; on the other hand, rotation between teams is possible since the approaches are shared, and the only difference is what’s inside the modules
- Further development of infrastructure and monitoring: these are indispensable. Without them you will find it very difficult to create and support the modularisation process.
My parting words to you are that when you start implementing modularisation you don’t have to take on a large part of the application as your first module. Take on smaller, simpler chunks and then experiment. It is not a good idea either to blindly follow any of the examples from other companies; everyone has their own approach to development and the chances of someone else’s experience suiting you perfectly are slim. Make sure you develop infrastructure and monitor your results.
I wish you every success and if you have any questions do not hesitate to add them to the comment section!