Plugin-Based Architecture and Scaling iOS Development at Capital One

Korhan Bircan
Capital One Tech
Apr 10, 2018 · 12 min read


How does one of the biggest mobile organizations in the enterprise domain manage to ship a new version of its flagship app every other week without compromising on quality? With an active user base of tens of millions of customers and around 100 software and QA engineers working on the same iOS codebase, shipping more features every year, and doing it more frequently, was certainly no small achievement.

The history of our current iOS app begins in 2014, when we decided to build a fully native servicing app in-house. We retooled our development infrastructure, brought world-class talent on board, and focused on building a fantastic mobile experience for our users on top of a modern API infrastructure. Our efforts came to fruition, and in 2015 we released the flagship iPhone and iPad apps along with an Apple Watch extension.

By 2016, both apps had been welcomed enthusiastically by our users and we started receiving industry-wide accolades for our work. A lot had also changed in terms of technology, team makeup, and the scale of the app. We converted our iPhone app into a universal app and sunset the standalone iPad app. Our organization doubled in size each year, and we faced many of the same challenges other growing iOS organizations face: build times, merge conflicts, and engineers bumping into each other. With a codebase of over 350,000 lines, we started hitting a wall.

Today, I want to talk about an architecture that helped us overcome these issues by enabling us to build and test features in parallel with minimal risk of regressions. We have been successfully building and rapidly iterating certain features on this architecture for over a year now, while dramatically expanding the number of contributors to the platform.

The Code Grows and Grows

Looking at the number of code commits vs. the project timeline below, you can see a ramp going into mid-2015 as we marched to production with the first version of the app. Another ramp follows later in the year, right before the iPad app launch. We then see a dip around Christmas and New Year’s Eve, when we historically don’t have production releases. Going into 2016, there’s a greater stream of commits, both because more people were contributing to the codebase and because we were merging the iPhone and iPad codebases. Afterwards, we see something of a plateau, and that’s for a couple of reasons.

Contributions to master branch, excluding merge commits, since the inception of the flagship app.

First, the bet we took on writing the app in Swift paid off. Not only did it help us attract and hire the best talent, but thanks to its type safety and compile-time checks, the app was more stable than anything we’d built previously. At the same time, however, build times were longer than expected. Some developers were reporting 30-minute waits on clean builds, and sporadic Xcode crashes were also a problem. Swift wasn’t backwards compatible or ABI stable, so every migration to a new version took a non-trivial amount of effort. If you look closely at the graph, you won’t see a dip over the 2016 Christmas break because an ambitious group of engineers were working diligently to port our codebase from Swift 2.3 to Swift 3.0!

That’s not to say Swift was the only factor at play. Having close to 100 people work on the same codebase meant that when something was broken, it was broken for everybody. There were more merge conflicts that needed resolution. More features meant that there were now more regression tests to run which took more time. And unfortunately, there were many more regressions added with each new release.

In the face of these challenges, it would’ve been easy to throw our hands in the air and give in. However, we were determined to maintain our velocity and increase code quality as the engineering team expanded, so we started experimenting with a number of practices. We used all the monitoring tools available to us to figure out which features did not ship on time, how late they shipped, and why. This helped us improve our estimates by identifying common pain points such as API availability and contract conformance, establishing standards, engaging with backend teams and getting their buy-in, and integrating health check mechanisms into our, and their, software delivery processes.

We also invested in our build infrastructure: increasing capacity, optimizing build scripts, and reducing the time between PR submission and the completion of the required tests. We refocused our test automation efforts by developing and maintaining multiple sets of tests for different purposes in the CI/CD cycle. For example, what we call “smoke tests” use mock API responses and are triggered by every PR submission. Critical tests run daily on the master branch, and our mandate is to remediate any surfacing defects the same day. Then we have a more comprehensive set of tests that run nightly on master and on individual team branches. We improved our unit test coverage by integrating it as a metric into every PR submission, and we even automated the creation of some tests, such as those for accessibility. We also looked in depth at how we could improve Swift build times and came up with some interesting observations and workarounds.

Architecting Against Roadblocks

Even though our efforts made a remarkable impact, we started observing evidence that they alone would not suffice. If the trends continued, we might not be able to maintain our velocity and quality while keeping up with the demand for adding more features and engineers to the mobile organization. We started rethinking our approach to mobile development. We needed a new mechanism that would support the mobile organization’s growth for years to come, provide guardrails, and empower experimentation. We needed to think ahead and proactively create solutions to problems before they hit us.

We prototyped a number of ideas, tested them in practice, and evaluated the pros and cons. After numerous iterations, we ended up developing a plugin-based architecture along with a partnership model that satisfied our goals. This architecture was scalable and stable, and it allowed us to create valuable content and integrate both internal and external development teams into the software development life cycle without compromising on velocity or quality.

Here is how it works at a high level.

Think of the flagship app as a container, and features as standalone plugins that are developed in isolation. There is a well-defined interface between the app and the plugins. By conforming to a number of protocols, the plugins communicate with the app without knowing anything about its implementation details. It’s encapsulation at the app scale.

The flagship app acts as a container and its services are made available through protocols.
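To make this concrete, here is a minimal sketch of what such a contract could look like. The protocol names and requirements below are purely illustrative, not our actual interface:

```swift
import UIKit

/// Services the container app makes available to its plugins
/// (illustrative; the real surface area is much larger).
protocol ContainerServices {
    func isFeatureEnabled(_ feature: String) -> Bool
    func log(event: String, attributes: [String: String])
}

/// The contract a feature framework conforms to. The container knows
/// nothing about a plugin beyond this interface, and vice versa.
protocol Plugin {
    /// Plugins are instantiated by the container, so a parameterless
    /// initializer is required.
    init()

    /// Unique identifier used for feature flagging, ordering, and logging.
    var identifier: String { get }

    /// Called once during the setup phase with the app's services.
    func initialize(services: ContainerServices)

    /// The tile displayed on the main screen's collection view.
    func makeTileView() -> UIView

    /// The experience presented when the user taps the tile.
    func makeDetailViewController() -> UIViewController
}
```

The dependency flows through protocols in both directions: a plugin consumes app services through one interface and exposes its feature through another.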

Our platform architecture team provides the feature teams with a starter project that comes with core networking, persistence, and UI libraries, unit and UI tests, and the necessary metadata. The starter project already conforms to these protocols and has stub implementations for their functions. This project compiles into a framework, so when the feature is ready to be released, the owning team publishes a new version of their framework. Our team then pulls the new versions of all plugins once a week, performs automated testing, static analysis, and security scans, and integrates the plugins into the flagship app. Every step in the process is extensively automated. Each feature is tied to remote configuration, so we can turn it on or off on the fly. Once the feature ships, we monitor metrics and collect feedback. The feature team then iterates on the feature and we gradually release it to a larger user base in stages.
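Sticking with the hypothetical protocol above, a freshly generated starter project might ship with a stub conformance along these lines:

```swift
import UIKit

/// Illustrative stub, as a starter project might generate it. The
/// feature team replaces the placeholders as the feature matures.
final class MyFeaturePlugin: Plugin {
    let identifier = "com.example.my-feature" // placeholder identifier

    private var services: ContainerServices?

    init() {}

    func initialize(services: ContainerServices) {
        self.services = services
    }

    func makeTileView() -> UIView {
        return UIView() // later: a templated tile from the shared UI framework
    }

    func makeDetailViewController() -> UIViewController {
        return UIViewController() // later: the feature's entry screen
    }
}
```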

Over the last eighteen months, we have successfully developed and experimented with many features on this architecture. Because we were able to quickly release features in a throttled fashion and had analytics streaming baked into our pipeline, we immediately started analyzing user engagement patterns. This helped us iterate on our ideas and we have released a number of tools that help our users take more control of their financial lives.

Plugin Interactions

Let’s look at how these plugins interact with the flagship app during its lifecycle. The main screen of the app has a collection view that presents a customer’s accounts, various offers, or things like messaging. Plugins are exposed in the same collection view through data source and delegate patterns. During the setup phase, all the frameworks are analyzed to determine which of them are plugins. Those plugins are then initialized and their permissions, authenticity, and integrity are checked. A rules engine determines the display order of the components, and a plugin data source then aggregates both the plugin and account information and feeds it into the collection view.

Components participating in the view controller lifecycle.
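Assuming, purely for illustration, that each plugin framework advertises its plugin as the bundle’s principal class, the discovery step could be sketched like this (the permission, authenticity, and integrity checks and the rules engine are omitted for brevity):

```swift
import Foundation

/// Hypothetical setup pass: walk the linked frameworks, identify
/// the ones that are plugins, and initialize them.
func discoverPlugins(services: ContainerServices) -> [Plugin] {
    var plugins: [Plugin] = []
    for bundle in Bundle.allFrameworks {
        // Treat a framework as a plugin when its principal class
        // conforms to the Plugin protocol.
        guard let pluginType = bundle.principalClass as? Plugin.Type else {
            continue
        }
        let plugin = pluginType.init()
        plugin.initialize(services: services)
        plugins.append(plugin)
    }
    return plugins
}
```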

Now that the collection view knows how many items to display and that our plugins have been identified and initialized, we can look at how each cell gets rendered.

A plugin is built as a standalone component that does not know the implementation details of its container app.

The plugin data source asks each plugin to provide a view as their entry point. For this purpose, we have a framework that provides templated tiles which can be customized as desired. These tiles are then registered with the collection view. When the collection view requests a cell for a specific plugin, the plugin data source notifies the plugin so that it can configure its tile and start prefetching data if necessary. Some plugins dynamically display data on their tiles and they can request the collection view to update their UI through the plugin data source.

Plugins configure and update their collection view cells.
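In code, that hand-off might look roughly like the following; the cell and data source types are hypothetical stand-ins for our production components:

```swift
import UIKit

/// Illustrative cell that hosts a plugin-provided tile.
final class PluginTileCell: UICollectionViewCell {
    static let reuseIdentifier = "PluginTileCell"

    func embed(_ tile: UIView) {
        contentView.subviews.forEach { $0.removeFromSuperview() }
        tile.frame = contentView.bounds
        tile.autoresizingMask = [.flexibleWidth, .flexibleHeight]
        contentView.addSubview(tile)
    }
}

/// Hypothetical data source that aggregates plugin tiles into the
/// main screen's collection view.
final class PluginDataSource: NSObject, UICollectionViewDataSource {
    private let plugins: [Plugin]

    init(plugins: [Plugin]) {
        self.plugins = plugins
        super.init()
    }

    func register(with collectionView: UICollectionView) {
        collectionView.register(
            PluginTileCell.self,
            forCellWithReuseIdentifier: PluginTileCell.reuseIdentifier)
    }

    func collectionView(_ collectionView: UICollectionView,
                        numberOfItemsInSection section: Int) -> Int {
        return plugins.count
    }

    func collectionView(_ collectionView: UICollectionView,
                        cellForItemAt indexPath: IndexPath) -> UICollectionViewCell {
        let cell = collectionView.dequeueReusableCell(
            withReuseIdentifier: PluginTileCell.reuseIdentifier,
            for: indexPath) as! PluginTileCell
        // The plugin configures its tile here and can start
        // prefetching data if necessary.
        cell.embed(plugins[indexPath.item].makeTileView())
        return cell
    }
}
```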

Now that the plugins are visible, they can start receiving user events. They can choose to handle UI gestures to provide custom behaviors such as horizontal scrolling or taps. By default, every user interaction is logged as an event and the component is notified of tap gestures. The component can respond to a tap by requesting that a custom view controller be presented, after which the component owns the user experience. We provide standard controls so that at any point in the interaction, the user can navigate back to the main screen of the app.
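For example, a tap might flow through a hypothetical host view controller like this:

```swift
import UIKit

/// Hypothetical host for the main screen, showing how a tap is
/// routed to a plugin.
final class MainViewController: UIViewController, UICollectionViewDelegate {
    var plugins: [Plugin] = []
    var services: ContainerServices?

    func collectionView(_ collectionView: UICollectionView,
                        didSelectItemAt indexPath: IndexPath) {
        let plugin = plugins[indexPath.item]
        // Every user event is logged by default.
        services?.log(event: "tile.tapped",
                      attributes: ["plugin": plugin.identifier])
        // The plugin owns the experience from here; standard navigation
        // controls let the user return to the main screen at any point.
        let detail = plugin.makeDetailViewController()
        navigationController?.pushViewController(detail, animated: true)
    }
}
```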

Architected for Growth

So, how does this architecture support the growth of our organization?

Normally, teams move in two-week sprints, with all features going into a single repo and each team owning a fork of that repo. During the sprint, each team develops in its fork, performing reviews and testing. Towards the end of the sprint, validated code in the team forks is merged into the single repo, where we perform integration and regression tests. The drawback of this model is that adding more teams eventually hits a ceiling, because every team is working on the same codebase.

The flagship teams move in two-week sprints working on the same codebase. Plugin teams move in one-week sprints working on isolated codebases.

The plugin architecture, on the other hand, allows teams to work on isolated codebases. The plugin teams build on a standalone harness that we provide them. In this harness, we expose common facilities such as core networking, data management, utility libraries, UI controls, and design components. The plugin teams move in one-week sprints and at the end of every week, they publish a semantic version of their feature as a framework. The core team performs automated analysis, security scans, testing, and review of the code. At the end of the following week, the feature is incorporated into the flagship app with our dependency manager.

This allows development and testing of features to occur in parallel in a distributed fashion, which helps scale the organization while accelerating delivery. As an added benefit, this architecture prevents massive view controllers and layering violations because all features are sandboxed and have to interact with each other and the app through well-defined interfaces.

Guardrails

Let’s take a look at how we reduce risk and take care of plumbing for the feature teams so they can focus their time and energy on creating value for our users.

Design Side

On the design side, we hold a number of collaboration sessions during which our designers provide an initial design consultation. These conversations reinforce the design guidelines and advocate reuse of both the design component frameworks and the UI design resources. Some features require unique design patterns, and sometimes this results in the feature teams contributing new design standards that are then reused by other feature teams. Once the designs have matured and are signed off, developers can start implementing the features and the user experience flows.

Engineering Side

On the engineering side, this entails three stages.

  • First, we onboard a feature team and get their project off the ground. Internally, we practice the open source model and have built a number of core frameworks covering networking, data persistence, UI controls, design components, analytics, feature flagging, and helper functionality. We expose this functionality out of the box and provide a template project that is already integrated with our Mobile Orchestration layer.
  • Second, during the development phase we provide support by helping identify complexities around dependencies and backend integration, and by encouraging software best practices such as clean architecture, defensive coding, scalability, and security. Our seasoned DevOps pipeline builds linting, static analysis, performance testing, security scanning, and code coverage metrics into the code review process, reducing the risk of human error.
  • In the last phase, when the feature is in production, we closely instrument and monitor the feature’s performance and health metrics. We also have systems in place to make sure the feature is powered by APIs that are available, accessible, and conform to their contracts. We do this on both the client and the backend side using a number of in-house tools along with various software analytics services. We then analyze the trends and have automated systems alert stakeholders when necessary.

Empowered Experimentation

Each feature starts out as a hypothesis. We think our users have a problem, and we have determined that this problem is worth solving. We know there are multiple ways to solve it, but we don’t know which one is best. However, we have performed in-house and external user testing, so we have some good ideas. We have also defined what success could look like. We then implement one of these solutions, present it to our users, collect data, analyze it, make a data-driven decision, and iterate on our solution until we’re convinced that we’ve created an experience our users both need and want.

One of the central reasons this model has been so successful is our commitment to user-centric decision making. Out-of-the-box tooling includes a feature switch SDK backed by remote configuration APIs on the backend. Through staged rollouts, we’re able to release features to a subset of our user base and observe their engagement patterns. We have baked-in analytics SDKs that securely record and report user events in near-real time so we can analyze further metrics and fine-tune our flows. We are therefore able to create an active feedback loop between engineers, designers, and users, leading to faster iterations and better user experiences.
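In terms of the earlier sketches, gating plugins behind these switches reduces to a simple filter over the discovered plugins (hypothetical, as before):

```swift
/// Only plugins whose remote feature switch is on for this user make
/// it onto the main screen, which is what enables staged rollouts to
/// a subset of the user base.
func visiblePlugins(_ plugins: [Plugin],
                    services: ContainerServices) -> [Plugin] {
    return plugins.filter { services.isFeatureEnabled($0.identifier) }
}
```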

Conclusion

So far, we have been successful in using our guiding principles to build a world-class engineering organization by hiring and retaining the best talent; learning about, adopting, and adding to industry best practices; and staying on the leading edge of tools and technologies. When our codebase became so large that our ability to iterate rapidly on features began to slow, we built an architecture that encapsulated the complexity of our codebase and defined new processes that helped integrate new developers and partners into the software delivery lifecycle. Through parallel development and testing of new features, we’ve set our organization up to keep scaling for years to come. We also improved our seasoned tooling and automation pipeline to commoditize feature testing and the collection of user engagement metrics. Through empowered experimentation, testing out ideas before they become full-fledged features is now much faster and far more cost effective.

This is the technology, architecture, and team with the most potential I have ever had the pleasure of working with. I’m very excited to see what we will achieve next!

DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2018 Capital One.
