From Monolith to Microservices
This blog details how Intuit's financial data aggregation application evolved into a core technology platform that powers all of Intuit's product applications. It covers the case for change, the target state, benefits, challenges, the journey line, next steps, and takeaways.
Signs of a Monolith and the Case for Change
With any large enterprise, there comes a point of realization that the core application, originally designed with one purpose in mind, has morphed over the years into a group of services, each designed to solve a different need. As a result, the core application was no longer efficiently supporting the business. This was true for Intuit, and it became vital for us to figure out what ailed this monolith and design a long-term solution.
How the Codebase Was Built
To tackle this monolith, we began by exploring how the codebase was built. It had a high level of complexity, with too many features baked into a single all-in-one codebase, along with thousands of unit tests. Without consistent APIs, many nonstandard integrations, or one-offs, had been deployed. Tight coupling existed at every level, including modules and datastores, without clear boundaries. Quality took a big hit on functional test cases: many gaps were present, and band-aid fixes often led to further quality issues.
How the Application Operated
For the second step, we examined how the application operated. A single runtime of this application revealed problems around scalability, availability, performance, and resiliency. Vertical and horizontal scaling hit limits because it was an all-or-nothing scenario. The availability of individual modules was impacted since everything ran as a single deployment. Resiliency suffered because an issue could not be isolated when something went wrong, bringing down the whole system. In addition, lengthy release cycles negatively impacted both team productivity and organizational scale: it was a herculean effort to manage releases, with many engineers committed to building, testing, and deploying. From a security perspective, tightly coupled integrations, along with the age and health of the codebase, made it difficult to quickly adapt to new security standards and models.
Once we concluded this discovery process, it became clear that migrating to microservices running on a platform was the likely solution. The next step was figuring out how to get there.
Breaking Up the Monolith
Before going into domain decomposition, it was important for the team to consider where the Financial Data Platform would fit in the map of capabilities required to run Intuit products, including TurboTax, QuickBooks, ProConnect, and Mint. All of these product families are built on top of Ecosystem Capabilities such as accounting, payments, payroll, capital, and personal finance management (see boxes under “Ecosystem Capabilities” in the diagram). These capabilities depend on Foundational Capabilities: Identity, Customer Success, Money Movement, and Financial Data Platform (FDP), to name a few. FDP enables end customers to bring together their financial data: banking transactions, tax forms, documents, and other information required to manage either their personal finances or the accounting side of their business. Further breakdown of FDP showcases the beginning of the decomposition step (see boxes outlined in green in the diagram).
Decomposition Opportunities & Considerations with Abstractions, Cardinality, Security, and Velocity of Change
Across the stack, the team examined data, application, protocol, and communication, and it soon became clear that the current model no longer adequately supported the business needs. For example, account types were modeled as separate entities in the system. As more and more banks offered customers different kinds of accounts (such as personal loans and trading accounts), it became harder for our internal model to keep up with these external changes. We took the opportunity to improve the existing model and introduce new abstractions, such as an Account entity that represents all the different account types (see the diagram below).
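To make the abstraction concrete, here is a minimal sketch of what a single Account entity covering many account types could look like. The field names and the specific account types are illustrative assumptions, not Intuit's actual model.

```python
from dataclasses import dataclass
from enum import Enum


class AccountType(Enum):
    """Illustrative account types; new bank offerings become new enum
    values (or metadata) instead of new internal entities."""
    CHECKING = "checking"
    SAVINGS = "savings"
    CREDIT_CARD = "credit_card"
    PERSONAL_LOAN = "personal_loan"
    TRADING = "trading"


@dataclass
class Account:
    """One Account entity representing any account type, so the internal
    model no longer needs a separate entity per product a bank offers."""
    account_id: str
    provider_id: str
    account_type: AccountType
    display_name: str
    balance_cents: int = 0


# Supporting a new bank product is now a data change, not a model change:
loan = Account("acct-1", "fi-204", AccountType.PERSONAL_LOAN, "My Loan", 500_000)
```

The design choice here is that variability moves into data (the type field) rather than structure, which is what lets the model absorb new external account kinds without schema churn.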
We also knew that banks’ integrations were changing, moving away from web integration into API-based integration. However, during such migrations, parallel channels would need to co-exist. There was a need to manage multiple credentials and multiple ways to access the bank on behalf of the customer. This confirmed that the model’s cardinality had to evolve to make it more flexible.
Security was always top of mind, and no compromise on compliance or security was ever permitted. Not only did security require ongoing enhancements, it had to stay on par with the latest Intuit and industry security standards evolving in parallel. Banking credentials had to be protected end to end, assuming a zero-trust operating environment. PCI compliance requirements also drove decomposition decisions to isolate access to payment instrument data elements.
The number of providers (financial institutions, or FIs) was growing by the thousands. We needed the speed to onboard and integrate providers more quickly while adhering to security and compliance standards. This included handling high variability in data sets, transactions, account details, and data quality among FIs. The team wanted a solution that could react to late discoveries of data issues, sometimes found only in production.
The Target State
To provide a uniform integration experience to product developers, as well as deliver a consistent experience to our customers, we had to isolate the product tiers from external provider outages while eliminating variability from one provider to another. After many discussions, debates, and iterations, the team landed on a target state concept (see diagram). Besides recognizing functional services, this target state identified layers, which also served as a pattern for defining future services.
The blue layer (business process services) would include services around business functionality and integration with TurboTax, QuickBooks, ProConnect, and Mint. One realization was to pull business logic, and parts of the user experience that previously lived within the products themselves, into the new platform. To name a few such capabilities: onboarding new banking connections (where a customer authenticates with the bank and authorizes Intuit to act on their behalf), built as a widget embedded into the product experience; continuous banking data acquisition and reconciliation with internal references; and categorizing banking transactions for tax, accounting, budgeting, and other purposes.
The orange layer (entity services) would encapsulate data access. The entities defined in the model (as described in the previous section) would be manifested as separate services. For example, services would manage reads and writes for providers' metadata, profile information, credential sets, accounts, and transactions. Each of these services would own one entity's data and the access to it. No data-level integration would exist outside of the orange layer, and there would be no crossover between the services (the account service should not directly access the transaction entity's persistence layer).
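The "one entity per service, no crossover" rule can be sketched as follows. This is a minimal illustration with hypothetical service and method names; the point is that the account service depends on the transaction service's API, never on its datastore.

```python
class TransactionService:
    """Owns the transaction entity and its persistence; its API is the
    only way in or out of that data."""

    def __init__(self):
        self._store = {}  # stand-in for the service's private datastore

    def record(self, account_id, txn_id, amount_cents):
        self._store.setdefault(account_id, []).append((txn_id, amount_cents))

    def list_for_account(self, account_id):
        return list(self._store.get(account_id, []))


class AccountService:
    """Owns account data. To read transactions it goes through the
    TransactionService API, never that service's persistence layer."""

    def __init__(self, transaction_service):
        self._transactions = transaction_service  # API dependency, not a shared DB

    def account_summary(self, account_id):
        txns = self._transactions.list_for_account(account_id)
        return {"account_id": account_id, "txn_count": len(txns)}
```

Because the datastore is private to one service, either team can swap its persistence technology without the other noticing, which is exactly the autonomy described later in the benefits section.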
The green layer (connectivity) would encapsulate access to third parties (financial data providers). This layer would manage the variability of 20,000+ providers, including multiple channels/protocols of integration (web, OFX, API, file-based) and differences in API definitions, data quality, and data sets. It would also provide resiliency against third-party outages, long response times, and other production issues.
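One common way to hide channel variability behind a layer like this is an adapter interface per channel, selected from provider metadata. The class and field names below are hypothetical, a sketch of the pattern rather than FDP's actual code.

```python
from abc import ABC, abstractmethod


class ProviderConnector(ABC):
    """The single contract the green layer exposes upward, hiding
    per-provider and per-channel variability from callers."""

    @abstractmethod
    def fetch_accounts(self, credentials):
        ...


class ApiConnector(ProviderConnector):
    def fetch_accounts(self, credentials):
        # A real implementation would call the provider's API here.
        return [{"id": "a1", "source": "api"}]


class WebConnector(ProviderConnector):
    def fetch_accounts(self, credentials):
        # A real implementation would drive a web-based integration here.
        return [{"id": "a1", "source": "web"}]


def connector_for(provider_profile) -> ProviderConnector:
    """Pick the channel from provider metadata; upstream layers never
    need to know which channel served the data."""
    return ApiConnector() if provider_profile.get("supports_api") else WebConnector()
```

During a bank's migration from web to API integration, both connectors can coexist behind the same interface, which is how parallel channels (as described above) stay invisible to the rest of the platform.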
Benefits of Three Layers
Recognizing three distinct layers helped the team divide ownership and move independently. Clear ownership enabled the organization to scale to more teams.
With this independence and autonomy, teams now had the freedom to choose the technology and protocol best suited for a given task. For example, the rate limit service in the green layer used gRPC as its transfer protocol, while other services used REST. Since data access was isolated behind the services, teams could experiment with and evaluate different data persistence technologies without impacting application developers.
One team was in charge of all connectivity and optimizing connections with financial data providers. They had the focus to go deeper, better understand functionality, and then optimize access to different banks and different APIs. They built a set of tools enabling product managers and business owners to set up and try out new banking connections with less dependence on the engineering team.
Connectivity to third parties was isolated into a separate layer, preventing cascading failures not only within FDP, but also in product availability, benefitting internal product developers. Asynchronous invocation of this layer enabled more control over error handling and retries, and shifted the user experience toward notifying the user when a task is done rather than making them wait and retry on failure.
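The asynchronous retry-then-notify flow can be sketched with `asyncio`. This is an illustrative minimal version, assuming a hypothetical `fetch` callable for the provider call and a `notify` callback for the user-facing notification.

```python
import asyncio


async def acquire_with_retries(fetch, notify, attempts=4, base_delay=0.01):
    """Retry a flaky provider call with exponential backoff, then notify
    the user either way, instead of making them wait synchronously."""
    for attempt in range(attempts):
        try:
            result = await fetch()
            notify("data ready")
            return result
        except ConnectionError:
            await asyncio.sleep(base_delay * 2 ** attempt)
    notify(f"failed after {attempts} attempts")
    return None


def flaky_provider(failures_before_success=2):
    """A fake provider that fails twice before succeeding, standing in
    for a third-party outage that resolves on retry."""
    state = {"calls": 0}

    async def fetch():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ConnectionError("provider unavailable")
        return {"accounts": 3}

    return fetch


events = []
result = asyncio.run(acquire_with_retries(flaky_provider(), events.append))
```

Because the user is notified on completion rather than blocked on the call, transient provider outages cost the platform retries, not the customer's patience.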
Challenges

Operational overhead became an issue with more teams, more people, and more moving parts, and the teams had doubts about the granularity of the services they had designed, especially when a service was sized for functionality that had not yet materialized.
As FDP went through its transformation, changes occurred across the enterprise, such as new standards and new security considerations. FDP was playing catch-up with the rest of the company, including adopting new capabilities provided by other foundational domains, such as Identity and service authentication and authorization.
Sometimes the team had to make tough decisions or tradeoffs; we could not do everything and had to prioritize, sometimes taking on tech debt in the process. Throughout the transformation, the team had to acknowledge these compromises and either address the tech debt right away or plan for it in the next set of iterations. Finally, maintaining flexibility and implementing significant regulatory change were additional challenges.
The Journey Line: Plan vs Reality
Whenever you start out to drive a transformational journey, there is a plan, with a defined starting place and a straight line toward the finish. Then there is the reality that hits during the execution phase: lots of zigzags, requiring iterations, an open mind, and temporary compromises along the way.
To get to a plan, we first pulled together a case for change: acknowledging the monolith's existence, laying out what needed to change and how much, and establishing that this was beyond a simple rewrite or fix.
To earn a long-term commitment from stakeholders and leaders, our plan outlined that this journey would take multiple quarters (a three-year target) to complete. We prioritized capabilities in the plan by balancing business value, technical feasibility, and the effort and autonomy it would take our teams to transition to the new implementation. A strong case, presented in iterative phases, was necessary to secure that commitment.
While defining the microservices iterations, we also needed to consider what would happen once the monolith was broken up. How would we isolate issues? Which processes would need to change? What kind of support would we get from other platform and domain teams? Creating a pattern of microservices layers (as described in the previous section) helped sequence the design and implementation.
As we moved into the execution phase, we started developing and testing while using the legacy code as a reference point for validation. The next step in each execution phase was to enable live traffic and transition the capability to the new deployment. If a step included data migration, it took extra time to prepare and validate the migrated data while running traffic in parallel, before fully switching over to the new system.
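Using the legacy code as the reference during parallel runs is often implemented as a shadow comparison: legacy still answers the caller while the new path runs alongside and divergences are recorded. The sketch below is a hypothetical minimal version of that idea, not FDP's actual harness.

```python
def shadow_compare(request, legacy_handler, new_handler, record_mismatch):
    """Serve the response from legacy (still the source of truth) while
    running the new service in shadow and recording any divergence
    for later investigation."""
    legacy_result = legacy_handler(request)
    try:
        new_result = new_handler(request)
        if new_result != legacy_result:
            record_mismatch(request, legacy_result, new_result)
    except Exception as exc:  # the shadow path must never affect the caller
        record_mismatch(request, legacy_result, repr(exc))
    return legacy_result


mismatches = []
# Parity case: both implementations agree, nothing is recorded.
shadow_compare("req-1", lambda r: {"ok": True}, lambda r: {"ok": True},
               lambda *m: mismatches.append(m))
# Divergence case: the difference is logged, but legacy still answers.
answer = shadow_compare("req-2", lambda r: {"ok": True}, lambda r: {"ok": False},
                        lambda *m: mismatches.append(m))
```

The key property is that the new code can be wrong, slow, or crash without any user impact, which makes it safe to validate against live traffic before cutting over.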
We realized that scope would shift and standards would always evolve, so we had to be open to changes, including new features (such as new types of banking data being requested) and new internal or industry standards. It was still important to remain focused on intermediate milestones (monthly, quarterly), even when we knew we were accumulating some technical debt during a phase. We revised our plan to address the workarounds made along the way and to incorporate new enterprise-level asks (e.g., changing the hosting model, migrating to AWS).
We also learned that we were making too many changes in the same phase, especially when data migration was in scope. So we revised the plan and added extra validation, both online and offline, before completing each transition.
With critical services, we took the approach that traffic should be transitioned slowly rather than cut over immediately. This included percentage-based routing to check whether parity held, whether things were working well, and whether any major issues were impacting users. This slow, controlled transition of traffic was key.
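Percentage-based routing is commonly done by hashing a stable caller identifier into buckets, so each caller consistently lands on one side while the dial is turned up. A minimal sketch, with hypothetical names, assuming a string caller id:

```python
import zlib


def bucket_for(caller_id: str) -> int:
    """Stable 0-99 bucket (CRC32 of the id), so a given caller always
    lands on the same side of the split between runs and hosts."""
    return zlib.crc32(caller_id.encode()) % 100


def route(caller_id, new_traffic_percent, legacy_handler, new_handler):
    """Send roughly new_traffic_percent of callers to the new service;
    everyone else stays on the legacy path until parity is proven."""
    if bucket_for(caller_id) < new_traffic_percent:
        return new_handler(caller_id)
    return legacy_handler(caller_id)
```

Dialing `new_traffic_percent` from 1 to 100 over days or weeks gives exactly the controlled transition described above: a regression surfaces while it affects only a small, identifiable slice of users, and rollback is a single config change.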
When implementing a service based on existing code, different strategies had to be considered. The options ranged from reusing the existing module with a new interface on top, to applying the Strangler pattern by rewriting the application code (but keeping the data model), to doing a full rewrite including the data store. We revisited these options for each capability and set of services; it was not a one-size-fits-all decision across the board. We also encapsulated some of the existing code into a library, as long as it had no outbound dependencies.
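The Strangler pattern mentioned above is typically realized as a facade in front of the monolith that routes migrated capabilities to new services and everything else to the old code. The names below are illustrative, a sketch of the pattern rather than FDP's implementation.

```python
class StranglerFacade:
    """One entry point in front of the monolith. Capabilities are routed
    to new services as they are migrated; anything not yet migrated
    falls back to the monolith, so callers never change."""

    def __init__(self, monolith_handler):
        self._monolith = monolith_handler
        self._migrated = {}  # capability name -> new service handler

    def migrate(self, capability, handler):
        """Flip one capability over to its new implementation."""
        self._migrated[capability] = handler

    def handle(self, capability, payload):
        handler = self._migrated.get(capability, self._monolith)
        return handler(payload)


facade = StranglerFacade(lambda payload: f"monolith:{payload}")
# "accounts" has been rewritten as a service; "transactions" has not yet.
facade.migrate("accounts", lambda payload: f"account-service:{payload}")
```

Because the facade holds the routing table, each capability can move on its own schedule, which is what makes the per-capability strategy choice described above workable in practice.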
Next Steps

There is a need to reduce operational overhead and streamline deployment across scrum teams. We are working on containerization and leveraging container orchestration to get to “No Ops” nirvana.
To better manage resiliency in a world of many services, we are looking to further utilize a streaming architecture and replace some point-to-point integrations. The need to better handle error conditions across service call chains, and to better understand cause vs. effect, is leading us to define service behavior per layer and standardize contract definitions.
Decentralizing API routing and service authN and authZ with a service mesh will likely help us reduce the number of hops and overall latency.
Keeping cloud hosting costs in check is becoming one of our top concerns. We continue to look at the current decomposition for opportunities to apply a serverless option, utilize a sidecar, or consolidate some of the services, while avoiding the recreation of monolithic stacks.
With financial provider data sources and formats diversifying, we are introducing machine learning and artificial intelligence to solve for data extraction, categorization, and data freshness.
The above-mentioned next steps are possible because of the major re-platforming the FDP team has courageously taken on and dedicated itself to seeing through to the finish line. Although the target state continues to evolve, looking back on this journey has reassured us once more that moving away from the monolith was the right decision.
Author: Snezana Sahter is a distinguished architect at Intuit focusing on the Financial Data Platform, which enables all of Intuit's products to connect with banks across the globe. She has been with Intuit since 2017, excited for the opportunity to expand her e-commerce and marketplace domain knowledge into FinTech. Domain-driven design and modeling APIs while dealing with legacy systems have been her areas of interest for the past several years. Prior to Intuit, she was a principal architect at eBay for 10+ years, responsible for modeling and API strategy in the Identity and Risk Management domains. Originally from Serbia, she has spent most of her engineering career in the San Francisco Bay Area.
Author: Thirugnanam Subbiah worked as a Principal Software Engineer at Intuit, focusing on the Financial Data Platform from 2009 to 2019. He led efforts from the ground up to break the prior monolithic system into multiple microservices, which was the genesis of the Financial Data Platform. Thiru also led the migration of the whole Financial Data Platform to AWS, addressing performance, scale, availability, and latency. One of his last contributions, before deciding to pursue opportunities outside of Intuit, was streamlining the CI/CD pipeline.