Intuit Persistence Service: A Paved Road for Persistence

Seshu Adunuthula
Intuit Engineering
Published May 13, 2022

This blog post is co-authored by Intuit’s Seshu Adunuthula, Director, Data Platform; Raj Parameswaran, Software Development Manager; and Larry Raab, Principal Engineer, all in the Intuit AI+Data organization.

Today’s product engineering teams have no shortage of choices when it comes to selecting a database for their services and applications. With a large number of data persistence technologies available — both proven and unproven at scale — it’s a significant undertaking for any engineering team to weigh the pros and cons of these different choices when contemplating a new microservice, or when refactoring or decomposing an existing service.

Making such critical technology decisions can be complex, time-consuming, and outside a team’s core competencies. As a result, teams — especially small ones — are sometimes put into decision-making roles they are ill-equipped for, or they may simply “follow along” with another team’s choice. Ultimately, this can lead to a sub-optimal solution for their particular use case.

In this blog we describe a “Paved Road for Persistence” we’ve built and deployed at Intuit to improve reliability, scalability and developer productivity for persistence. We took a comprehensive, platform approach to evaluating database options that meet the needs of product engineering teams. We then built a set of capabilities around these choices, including a Persistence Specification Language (PSL) for schema management, polyglot persistence to make automatic copies across different database types, ingestion into the data lake, application-level encryption for sensitive data, and much more.

A Pivotal First Step: Selecting the Right Databases

From the outset, it was imperative that we focus on a targeted set of choices, and invest in building out the associated paved roads for each option in four database categories:

  • Relational: Preferred technology choice for Intuit’s small business and self-employed group for our QuickBooks portfolio of products and services. The relational data model with SQL query capabilities is the most popular and widely deployed database model in the industry.
  • NoSQL: Document/key value stores offer many advantages such as schema-less/flexible schema models and global scalability. This is the preferred technology choice for Intuit’s consumer group and our TurboTax and Mint products and services.
  • Search: A search index is an important access pattern, especially for document/key value stores that do not support secondary indexes. Search databases are typically read-only copies of relational and NoSQL databases that support additional access patterns.
  • Graph: An important category for use cases that involve traversing relationships, such as risk/fraud and identity. It has a small but growing footprint within Intuit.

Selecting the right database options was a pivotal decision for Intuit, requiring rigorous analysis. Listed in the table below are the recommended first and second options for each of the database categories that would enable us to deliver on the scale, reliability and productivity needs of Intuit product engineering teams across the company.

For Relational, Aurora PostgreSQL is the recommended choice, as it is the most convenient option for services migrating off Oracle. For NoSQL, we selected DynamoDB because it is a proven solution that has been successfully deployed at scale for large global services (e.g., Amazon retail). Amazon Elasticsearch Service (now OpenSearch) has established itself as the industry leader in the search category. Rigorous evaluations led us to select TigerGraph as the recommended choice for graph databases, with AWS Neptune as the second option.

Building the Paved Road

In the remainder of this blog we’ll take a deep dive into the capabilities and components of the Intuit Persistence Service (IPS) that is enabling Intuit engineering teams across the company to accelerate development of awesome experiences for more than 100M consumer, small business and self-employed customers with products such as TurboTax, QuickBooks and Mint.

Database Entities & Persistence Specification Language

An entity model that is consistent and uniform across all the database technologies is a critical component of Intuit’s persistence paved road. The entities, and relationships between entities, are defined using Intuit’s internally developed Persistence Specification Language (PSL), an XML-based schema definition language.

Entities: An entity refers to a record in the database, and an entity type is analogous to the database schema. Let’s look at how an entity type is defined using PSL with a real-world example below. This is an expert profile entity type used to store information about an expert (tax, small business accounting, etc.) working with the customer on one of Intuit’s “live” platforms. For simplicity of visualization, the XML tree representation is shown below.

  • The entityTypeDef section contains metadata about the type itself, including the typeName, which uses a folder-based notation to identify the specific entity type in the entity tree.
  • This is followed by the set of attributes that comprise the entity. These are akin to the column names in a table. Each attribute has a type, which can be a simple type such as a string, a composite type such as a Map, or an enumeration, as shown with the addressType attribute. A key element of the attribute definition is the data classification, which falls into one of five levels specified by Intuit — in increasing order of sensitivity: Public, Restricted, Sensitive, Highly Sensitive and Secret. Strict rules across Intuit govern how data belonging to each level is handled.
  • The final, optional section contains one or more indexes. This is similar to the concept of an index in a traditional database and allows for constant-time retrieval of records based on the indexed attributes. Indexes can be unique, meaning exactly one entity matches the attribute values, or non-unique, in which case a set of entities maps to the particular combination of indexed attributes.
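
To make the structure concrete, here is a hypothetical PSL sketch of the expert profile entity type. The element and attribute names below are illustrative of the sections described above, not the exact PSL vocabulary:

```xml
<entityTypeDef typeName="/live/expertProfile">
  <!-- Attributes are akin to column names in a table. -->
  <attributes>
    <attribute name="employeeId" type="string" dataClassification="Sensitive"/>
    <attribute name="tenantId" type="string" dataClassification="Restricted"/>
    <attribute name="addressType" type="enum" values="HOME,WORK" dataClassification="Restricted"/>
    <attribute name="skills" type="map" dataClassification="Sensitive"/>
  </attributes>
  <!-- Optional index section: a unique index means exactly one entity
       matches the indexed attribute values. -->
  <indexes>
    <index name="expertProfileByEmployee" unique="true">
      <indexAttribute name="employeeId"/>
      <indexAttribute name="tenantId"/>
    </index>
  </indexes>
</entityTypeDef>
```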

Entity Relationships: Entity Relationships establish a relationship or connection between two different entities. This is similar to a foreign key relationship between items belonging to two tables in a relational database.

For illustrative purposes, let’s consider the relationship between two entity types: account and transaction. The account could be a customer account for a financial institution, and include attributes such as accountId, name, address and balance. The transaction refers to a financial transaction and includes attributes such as transactionId, amount and date. An account could have many transactions associated with it. In such a case, PSL allows for a one-to-many relationship between an account entity and a transaction entity. Below we see how it is defined.
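
As a sketch of how such a one-to-many relationship might be declared in PSL (the element names here are illustrative, not the exact PSL vocabulary):

```xml
<relationship name="accountTransactions" cardinality="ONE_TO_MANY">
  <source typeName="/finance/account"/>
  <target typeName="/finance/transaction"/>
  <!-- Each transaction carries the accountId of its owning account,
       analogous to a foreign key in a relational database. -->
  <joinAttribute source="accountId" target="accountId"/>
</relationship>
```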

Mapping Entities to Databases

The internal representation of entities differs based on the underlying database. Let’s explore this for the NoSQL (DynamoDB/Cassandra) offering.

In DynamoDB/Cassandra, the data for each entity is stored in a single entity table. The internal structure of the table is abstracted from clients, which use a REST (representational state transfer) API to call into IPS and perform CRUD (create, read, update and delete) operations on the data.

Entities are keyed off an entity key, a UUID (universally unique identifier) generated by the system. A single entity table spans all entity types defined for a particular business unit (BU). Each entry in the table corresponds to a single attribute of the entity. To illustrate this, let’s take a look at the expert profile entity discussed earlier. An entity of this type would be stored in the entity table as shown below. (The table has been simplified for illustrative purposes and some of the columns are omitted.)

A GET request for a single entity would retrieve all rows corresponding to that entity and return the composed result.
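
As an illustration of this storage layout — a minimal sketch, not the actual IPS implementation — a GET can be modeled as grouping attribute rows by entity key:

```python
# Minimal sketch of the attribute-per-row layout described above: each row
# holds one attribute of an entity, keyed by the system-generated UUID.
# Row shapes and values are illustrative, not the real table schema.
from collections import defaultdict

def compose_entities(rows):
    """Group (entity_key, attribute_name, attribute_value) rows into entities."""
    entities = defaultdict(dict)
    for entity_key, attribute_name, attribute_value in rows:
        entities[entity_key][attribute_name] = attribute_value
    return dict(entities)

# Hypothetical rows for the expert profile entity discussed earlier.
rows = [
    ("abc-123", "employeeId", "E-42"),
    ("abc-123", "tenantId", "intuit-live"),
    ("abc-123", "specialty", "tax"),
]

# A GET for entity "abc-123" retrieves its rows and returns the composed result.
profile = compose_entities(rows)["abc-123"]
```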

As mentioned earlier, indexes allow for quick look-ups of the data by using the indexed attributes in place of the entity key. All the indexes themselves are stored in a separate index table. Just like the entity table, the index table spans all types of entities and their indexes. The structure of this table is shown below (simplified for illustrative purposes).

The primary key of the table is composed of both the index_name and the index_value. So the entity abc-123 can now be looked up based on the domain (business) attributes employeeId and tenantId rather than the database key, which is IPS generated.
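
The lookup path can be sketched as follows (index and attribute names are illustrative, and the composite-key encoding is an assumption, not the real table schema):

```python
# Sketch of the index table described above: the primary key combines
# index_name and index_value, and resolves to the IPS-generated entity key.
def make_index_key(index_name, attribute_values):
    # Assumed encoding: indexed attribute values joined in a fixed order.
    return (index_name, "|".join(attribute_values))

# Hypothetical index table contents: (index_name, index_value) -> entity_key.
index_table = {
    ("expertProfileByEmployee", "E-42|intuit-live"): "abc-123",
}

def lookup_entity_key(index_name, *attribute_values):
    """Resolve domain (business) attributes to the system-generated entity key."""
    return index_table.get(make_index_key(index_name, attribute_values))
```

A unique index resolves to at most one entity key; a non-unique index would instead map the same composite key to a set of entity keys.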

There are similar mappings of PSL to Search, Relational and Graph databases, but that is a subject for another blog article.

Scalability & Reliability Improvements

IPS is amongst the most heavily used platforms within Intuit, and high availability is a core tenet of the team’s operating model. For example, during the most recent tax season, IPS NoSQL alone served a peak throughput of more than 100K TPS. Here’s how:

Scaling Strategies

The IPS stack consists of application servers and the database layer. The application runs on either EC2 for NoSQL or Intuit’s internal container service, Intuit Kubernetes Service (IKS), for search. While Intuit runs its own Cassandra servers on EC2 for NoSQL, the Elasticsearch offering uses the AWS-managed Elasticsearch service. For application servers running on EC2, scaling is handled by AWS Auto Scaling. On the Cassandra side, the number of Cassandra nodes is determined by estimated traffic and data volume. When CPU usage on the Cassandra nodes breaches a certain threshold, the team employs vertical scaling: the EC2 instances running the Cassandra nodes are replaced with a more powerful instance type, in an automated manner with zero downtime.
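
The vertical-scaling step can be sketched as a simple decision rule. The instance types and the CPU threshold below are illustrative assumptions, not Intuit’s actual values:

```python
# Sketch of the vertical-scaling decision described above: when CPU on a
# Cassandra node breaches a threshold, move to the next larger instance type.
INSTANCE_LADDER = ["r5.xlarge", "r5.2xlarge", "r5.4xlarge"]  # illustrative
CPU_THRESHOLD = 0.70                                         # illustrative

def next_instance_type(current, cpu_utilization):
    """Return a more powerful instance type when CPU breaches the threshold."""
    if cpu_utilization < CPU_THRESHOLD:
        return current
    i = INSTANCE_LADDER.index(current)
    return INSTANCE_LADDER[min(i + 1, len(INSTANCE_LADDER) - 1)]
```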

Monitoring, Alerting, Golden Signals

Operational excellence is the cornerstone of a well-managed platform. For IPS, we rely on a carefully curated set of metrics for monitoring the health of the system, which are collected from each node, both application and database, and routed to Intuit’s observability platform, powered by VMware Wavefront. We use the Wavefront dashboard to visually monitor our metrics. These metrics are also routed to the alerting system, which pages the on-call when certain thresholds are breached. All system and application logs are available in Splunk, which is the on-call’s key tool for querying and triaging issues. Some of the key metrics are also mapped to “golden signals,” which provide a summary of the operating health of the system. Following is a high-level view of what we track on the IPS dashboard:

  • Availability of the service as a percentage. A drop in availability immediately triggers an incident.
  • Throughput in transactions per second across two regions. The Cassandra clusters are replicated across US-east and US-west, with both serving traffic.
  • Errors by errorCode (IPS-specific errorCodes). Incidents are triggered based on thresholds set for various types of errors.
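
A minimal sketch of how such golden-signal thresholds might drive alerting (the specific thresholds below are illustrative assumptions, not the team’s actual values):

```python
# Sketch of threshold-based alerting over golden signals: each monitoring
# pass compares the signals against thresholds and raises incidents.
def evaluate_signals(availability_pct, error_rate_pct,
                     availability_floor=99.9, error_ceiling=1.0):
    """Return the list of incidents this monitoring pass would raise."""
    incidents = []
    if availability_pct < availability_floor:
        incidents.append("availability-drop")      # pages the on-call
    if error_rate_pct > error_ceiling:
        incidents.append("error-threshold-breach")
    return incidents
```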

Incident Management

At Intuit, “incidents” are issues that have the potential to impact our customers. They are rated in priority from P1 to P4 (priority1 to priority4), with P1 being the highest severity. Incidents require immediate attention from on-calls, typically from more than one affected service or application. The Intuit operations center (IOC) is a central team that manages incidents across Intuit and spins up an incident call where they are triaged.

With IPS as the paved road for persistence, the platform team took over central responsibility for managing database-related incidents and learning from them to improve system stability and operational excellence. This approach was instrumental in drastically reducing database-related incidents at Intuit. This tax season, there were no P1 or P2 database-related incidents.

Polyglot Persistence

Polyglot persistence refers to the practice of storing the same data in multiple database technologies in order to support differing access patterns. The most common example of polyglot persistence is making the data in DynamoDB available in ElasticSearch for sorting, filtering and searching large quantities of data with sub-second response times. The figure below shows the IPS polyglot persistence architecture.

NoSQL and relational databases are the source databases for polyglot persistence. Trail logs from these datastores are converted into data change events and a polyglot engine is responsible for creating read-only replicas for the database tables in search and graph stores. The polyglot engine is responsible for any data transformations needed when data is copied from one data store technology to another.

The polyglot engine is flexible enough to support copying from any source type to any destination type. For example, a relational table copy optimized for an analytical access pattern (reporting) can be made from a source table optimized for transactions. In practice, the most common pattern for polyglot persistence is from relational/NoSQL stores to a search store.
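
As a sketch of the engine’s core transformation (the event shape and field names are assumptions, not the real IPS schema), a data change event from a source store might be projected into a search document like this:

```python
# Sketch of a polyglot copy: a data change event from a NoSQL/relational
# source is transformed into a document for a read-only search replica.
def to_search_document(change_event):
    """Flatten a change event into a search-index document."""
    doc = {"id": change_event["entityKey"]}
    doc.update(change_event["after"])          # new attribute values
    doc["_updatedAt"] = change_event["timestamp"]
    return doc

# Hypothetical change event produced from the source store's trail logs.
change_event = {
    "entityKey": "abc-123",
    "after": {"employeeId": "E-42", "specialty": "tax"},
    "timestamp": "2022-04-18T00:00:00Z",
}
doc = to_search_document(change_event)
```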

Domain Events

Domain Events are messages with a well-defined schema that are discoverable and adhere to an Intuit eventing standard called “event carried state transfer.” This is an architectural pattern used in system design to update clients of a system such that they don’t need to contact the source system to complete further work. For example, a customer management system might fire off events whenever a customer changes their details (such as an address) with events that contain details of the data that changed. A recipient can then update its own copy of customer data with changes without needing to communicate with the main customer system to complete future work. IPS generates domain events for entities defined using PSL.

A domain entity transformer converts the data change events (from NoSQL and relational stores) and transforms them into the domain event schemas defined in a central repository. The events also adhere to the “event carried state transfer” standard to allow the destination to make their own copies of the event data.
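
A sketch of the transformer under the event-carried state transfer pattern (the field names are assumptions, not the central repository’s actual schemas):

```python
# Sketch of "event carried state transfer": the domain event carries the
# full changed state, so consumers can update their own copies without
# calling back to the source system.
def to_domain_event(change_event, event_type):
    return {
        "eventType": event_type,                 # e.g. "expertProfile.updated"
        "entityKey": change_event["entityKey"],
        "state": change_event["after"],          # full state, not a pointer
        "occurredAt": change_event["timestamp"],
    }

# Hypothetical data change event from a NoSQL or relational store.
change_event = {
    "entityKey": "abc-123",
    "after": {"employeeId": "E-42", "specialty": "tax"},
    "timestamp": "2022-04-18T00:00:00Z",
}
event = to_domain_event(change_event, "expertProfile.updated")
```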

Data Lake Ingestion

Ingestion into the data lake has traditionally been a manual process, taking days to weeks to onboard new data sets. IPS makes this process completely self-service by automatically mapping entities defined in PSL to data lake tables defined in the Hive Metastore.

IPS supports three different data formats for data lake ingestion.

  • Raw Data: The raw domain event data is persisted in Amazon S3 data storage buckets in JSON/Avro formats. Primarily, this storage is useful for maintaining long-term, relatively inexpensive backups of persistence data in the data lake, and not optimized for access.
  • Time-series Hive Tables: The entity model is mapped into a Hive schema and a Hive table entry is created in the Hive Metastore. Hierarchical event data can be default-mapped into a flat Hive table or a user can define their custom mapping logic in Python.
  • Materialized Hive Table: Data is converted to Parquet for optimized query access, and deletes/updates are applied to the Hive table.
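
The default flattening for the time-series Hive tables might be sketched like this (the dotted column-name convention is an assumption; as noted above, users can also supply custom mapping logic in Python):

```python
# Sketch of default-mapping a hierarchical event into the columns of a
# flat Hive table: nested fields become dotted column names.
def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level column -> value mapping."""
    row = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, column + "."))
        else:
            row[column] = value
    return row
```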

Recently, Intuit has also adopted the Delta Lake table format for data lake ingestion. This option enables real-time data ingestion and materialization, which means data change events are available in the data lake in a matter of seconds.

Enabling Authentication & Authorization at the Persistence Layer

Traditionally, authorization for access privileges is performed in the app tier, where the appropriate database calls are made. One of the most distinctive features of IPS is its support for authentication and authorization at the persistence layer for Intuit customers, as it’s tightly integrated with Intuit’s AuthZ identity and authorization service. Authorization models supported by IPS include:

  • Customer is the owner: In this AuthZ scheme, the subject owns the resource, and only the subject should be able to access it.
  • Realm is the owner: The subject is part of a realm that owns the resource. Hence, the subject is expected to have access to the resource, governed by policies set by the owners of the data.
  • System offline user: A system offline user is a trusted Intuit service that has full access to a subset of resources within a certain context.
  • Delegation: Subject S owns the resource and can delegate access to it to another subject S′.
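
A highly simplified sketch of dispatching an access check across these models (the model names, fields and logic below are illustrative, not the actual AuthZ service API):

```python
# Sketch of the authorization models listed above, reduced to a single
# dispatch function over toy subject/resource records.
def is_authorized(subject, resource, model, policies=None):
    if model == "customer-owner":
        # Only the owning subject may access the resource.
        return resource["owner"] == subject["id"]
    if model == "realm-owner":
        # Subject belongs to the owning realm; access governed by policies.
        return (resource["realm"] == subject.get("realm")
                and subject["id"] in (policies or {}).get(resource["realm"], []))
    if model == "system-offline":
        # A trusted Intuit service with full access in a certain context.
        return subject.get("trustedService", False)
    if model == "delegation":
        # The owner has delegated access to this subject.
        return subject["id"] in resource.get("delegates", [])
    return False
```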

The feature-rich and advanced authorization capabilities of IPS are a key differentiator that helped drive adoption of IPS within Intuit, as it offloaded the burden of implementing sophisticated authorization capabilities at the application layer. It also simplifies security and compliance reviews.

Protecting Sensitive Data with Application-level Encryption

Intuit requires that all sensitive data be encrypted at rest, and that all highly-sensitive and secret data be encrypted at the application layer.

With IPS, application-level encryption (ALE) is handled automatically on behalf of the developer, who only needs to annotate sensitive attributes in PSL (Persistence Specification Language). A platform approach gives Intuit assurance that we’re protecting our customers’ data across all Intuit applications and services, and eliminates the need for individual development teams to manage security keys.
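
Conceptually, the write path might look like the following sketch, where only attributes annotated at the highest classification levels are encrypted. The toy XOR “cipher” stands in for a real KMS-backed cipher and is not how IPS encrypts data:

```python
# Sketch of ALE driven by PSL data classifications: attributes classified
# Highly Sensitive or Secret are encrypted before the write.
SENSITIVE_LEVELS = {"Highly Sensitive", "Secret"}

def toy_encrypt(value, key=0x5A):
    # Placeholder cipher for illustration only; a real system would use a
    # managed key service and an authenticated encryption scheme.
    return bytes(b ^ key for b in value.encode()).hex()

def encrypt_for_write(entity, classifications):
    """Encrypt only the attributes whose classification requires ALE."""
    out = {}
    for attr, value in entity.items():
        if classifications.get(attr) in SENSITIVE_LEVELS:
            out[attr] = toy_encrypt(value)
        else:
            out[attr] = value
    return out
```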

Conclusion and Future Work

As a team, we’re incredibly proud of our accomplishments to date, and excited about what the future holds. As one of the most mature and widely-deployed platforms at Intuit, IPS is delivering on its promise:

  • In the recent peak tax season, it enabled efficient scaling of microservices for TurboTax consumer filers to 10X normal loads, with best-in-class MTTD (mean time to detect) and MTTR (mean time to recovery) metrics; uniform and consistent monitoring and alerting; conversion of CDC (change data capture) logs to domain events; self-service and real-time ingestion into the data lake; and much more.
  • Similarly, IPS NoSQL served 140 billion transactions with a peak TPS of 110K during the most recent tax season without a single P1/P2 incident. 1.9 billion events were ingested into search, and Intuit virtual experts performed more than 8.78 million searches.
  • And, hundreds of entities have been deployed in IPS NoSQL and search, and we’re actively working on deploying many more in the IPS relational database. Developers at Intuit take advantage of the many capabilities, including entity model, polyglot persistence, domain events and data lake ingestion.

We’re continuing to invest heavily in IPS to make the service even more reliable and a delight for the Intuit developer community (e.g., complex domain events, with data joined from multiple data sources, common control plane for managing all offerings, and more).

So, keep an eye out for IPS updates on this channel, as we continue to evolve the platform!

Many key individuals were responsible for building our Intuit Persistence Service: Achal Kumar, Amit Matety, Larry Raab, Juby Victor, Ben Covi, Denson Pokta, Harshavardhan Srinivasan, Kiran Prasana, Chau Nguyen, Muthu Dakshinamurthy, Tanzeem Katmale, Nishant Rangrej and Vineeth Kumar


Seshu Adunuthula is the Head of Data Lake Infrastructure at Intuit. He is passionate about the Next Generation Data Platform built on public cloud.