How Chase Transitioned its Data Lake from Hadoop to AWS — Part 1
By: Praveen Tandra, Principal Software Engineer, and Sudhir Rao, Managing Director of Data Engineering, Chase
Over the last decade, the data ecosystem at Chase has evolved both organically and inorganically, incorporating multiple systems, methodologies and platforms. This evolution has produced complex integrations and significant tech debt, including data duplication, persistent data quality issues, metadata drift and platform incompatibilities, as disparate technologies were applied to address immediate problems.
The on-premises platforms for Chase’s Data Lake (Apache Hadoop-based) and Data Warehouse (vendor technology) are reaching a point where scaling and managing them is increasingly difficult, and they lack the agility needed to adapt to new use cases. With the on-premises Hadoop-based Data Lake reaching the end of its shelf life, we embarked on a journey to find a new, flexible and future-proof home and migrate the Data Lake to the public cloud.
Here’s how we began our complex journey of migrating our consumer and community banking (CCB) data pipelines and ecosystem from Hadoop to AWS, including the events, decisions and intricacies involved, such as:
- Fixing metadata issues encountered while migrating custom file formats to open file formats.
- Keeping the Enterprise Data Warehouse (EDW) intact to avoid disruption to data consumers.
- Migrating Extract, Transform & Load (ETL) compute (legacy and modern) to cloud-native scaling on AWS.
- Intelligent Data Validation and Optimization.
- Fully collaborative efforts across teams to adopt the public cloud for our Data Lake.
Adoption of the Public Cloud
The Hadoop ecosystem has become an industry standard for processing Big Data at scale, used by data engineers, data scientists and analysts through widely adopted programming interfaces for handling large volumes of data.
At Chase, the Hadoop ecosystem was deeply entrenched as a backbone for data processing. It feeds data into warehouses and data lakes, supporting various use cases such as Analytics, Business Intelligence (BI), Data Science, Marketing, Risk, and more for the firm.
Hadoop addressed many challenges for Chase, with thousands of datasets generated daily and loaded into the data warehouse for various downstream use cases. Over several years, however, like many systems it became a classic example of infrastructure tech debt.
Lingering issues in our Hadoop environment included:
- Much of our data was still stored in vendor-proprietary and TEXT formats.
- Frequent hardware refreshes were required, with pre-planning.
- Software was at the end of its shelf life.
- Regular upgrades, patching and maintenance were necessary.
- Hadoop clusters limited expansion with their old, coupled compute/storage architecture.
- Datasets were duplicated to address performance bottlenecks on individual Hadoop clusters.
- User consumption capabilities were constrained, e.g. infrastructure, performance and row-level security.
Modern cloud data infrastructure is built around the pattern of de-coupled storage and compute. This approach allows for the independent scaling of resources, eliminating the need to duplicate datasets merely to achieve a boost in compute power.
As shown in the diagram, we intended to swap the massive on-premises Hadoop-based lake with an AWS cloud-based implementation while keeping the sources and consumption intact as much as possible. Data consumption mainly happens on the enterprise data warehouse shown on the right, which caters to a plethora of use cases ranging from business analytics and operations to data science, among others, with hundreds of applications and thousands of analysts querying the data daily.
Adopting a cloud-based Data Lake is a true game changer for data engineering use cases and a significant improvement over the tightly coupled storage-compute architecture of Hadoop. Additionally, the public cloud offers the opportunity to start fresh with a “clean room” approach, allowing for the removal of outdated legacy data and introducing transparency to the operating expenditure model.
Specifically, we adopted AWS S3 for storage, while Kubernetes became our go-to solution for compute orchestration and management. We continue to leverage multiple Extract, Transform & Load (ETL) tools alongside open-source frameworks such as Apache Spark™.
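As a rough illustration of this decoupled pattern (not our production pipeline code), a Spark job running on Kubernetes can read from and write back to S3 independently of any long-lived cluster; the bucket, paths and column names below are placeholders.

```python
# Minimal sketch of decoupled storage and compute: Spark (orchestrated on
# Kubernetes) reads from and writes to S3. Bucket, prefixes and column names
# are placeholders, not real Chase datasets.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("s3-decoupled-compute-sketch")
    .getOrCreate()  # on Kubernetes, master/executor settings come from spark-submit
)

# Storage lives in S3, independent of the compute cluster's lifetime.
accounts = spark.read.parquet("s3a://example-datalake/accounts/")

daily_balance = (
    accounts
    .groupBy("account_id", "business_date")
    .agg(F.sum("amount").alias("daily_balance"))
)

# Results land back in S3; the Kubernetes-managed executors can then be scaled
# down or torn down without affecting the stored data.
daily_balance.write.mode("overwrite").parquet(
    "s3a://example-datalake/curated/daily_balance/")
```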
Lead Us from Darkness to Light: Open File Formats
In our legacy Hadoop setup, most of the data was stored in TEXT format, which was error-prone and inefficient for large-scale analytical workloads. With the move to AWS, we resolved to adopt an efficient format for storing large volumes of data in the cloud. It was essential for this format to be open, to promote interoperability and avoid vendor lock-in.
Based on the recommendation of our Data Architecture team, we chose Apache Parquet as our storage format to tackle some of these challenges. Parquet is an open-source, columnar file format in the data ecosystem that offers optimized storage, excellent compression options, fast queries and much-needed schema enforcement.
The benefit of using Parquet is its ability both to solve existing issues and to propel the Data Lake into the future. Adopting Apache Parquet has also paid off significantly: the industry is currently adopting Apache Iceberg™ as a layer on top of Apache Parquet, realizing “Lake House” (Data Lake + Warehouse) semantics and positioning us for a future convergence of technologies. Our Data Lake team is currently well underway adopting Apache Iceberg. Apache Iceberg is an open table format that works well with Apache Parquet files, facilitating ACID (Atomicity, Consistency, Isolation and Durability) semantics (limited but evolving) like those of a relational warehouse.
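As a rough sketch of what those table semantics look like in practice, using the open-source Iceberg Spark integration (the catalog name, S3 warehouse path and table below are illustrative, not our internal setup):

```python
# Illustrative only: creating and modifying an Iceberg table (whose data files
# are Parquet) with Spark SQL. Catalog name, warehouse path and table name are
# placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://example-datalake/iceberg/")
    .getOrCreate()
)

# Iceberg stores data as Parquet by default and adds table semantics on top.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.db.transactions (
        account_id STRING,
        txn_ts     TIMESTAMP,
        amount     DECIMAL(18, 2)
    ) USING iceberg
""")

# Row-level changes become possible, unlike with plain Parquet directories.
spark.sql("DELETE FROM lake.db.transactions WHERE amount = 0")
```

The key point is that these warehouse-like operations run directly against open Parquet files in object storage, which is what makes the Lake House convergence attractive.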
Significant benefits of Parquet include:
- Schema enforcement and evolution: The schema is built into a Parquet dataset, reducing misinterpretations and making schema evolution easy.
- Efficient storage: Column-wise compression is applied to the data by default, with efficient encoding schemes.
- Query performance: Improved query performance by skipping over non-relevant data, making aggregations faster, similar to a warehouse.
- Open standards: Data is interoperable with other ecosystems including multiple vendors and open-source tools.
- Payload integrity with built-in headers: Data quality is significantly enhanced by incorporating built-in checksums and headers that include row counts alongside the schema, eliminating the need for additional artifacts. Binary encoding leaves no room for mistakes that are typical with delimited files.
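To make the schema-enforcement point concrete, here is a minimal sketch (with made-up paths, delimiter and columns) of converting a delimited TEXT dataset into Parquet with an explicit schema rather than relying on loosely typed strings:

```python
# Minimal sketch: converting a delimited TEXT dataset to Parquet with an
# explicit schema. Paths, delimiter and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DecimalType, TimestampType
)

spark = SparkSession.builder.appName("text-to-parquet-sketch").getOrCreate()

schema = StructType([
    StructField("account_id", StringType(), nullable=False),
    StructField("txn_ts", TimestampType(), nullable=True),
    StructField("amount", DecimalType(18, 2), nullable=True),
])

raw = (
    spark.read
    .option("sep", "|")          # legacy delimiter
    .option("header", "true")
    .schema(schema)              # enforce types instead of inferring strings
    .csv("s3a://example-datalake/raw/transactions_text/")
)

# Parquet carries the schema (and checksums) with the data, so downstream
# readers no longer depend on out-of-band metadata or delimiter conventions.
raw.write.mode("overwrite").parquet(
    "s3a://example-datalake/standardized/transactions/")
```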
Challenges with TEXT Data
We observed multiple data format issues, including those noted below, with historical Hadoop data while standardizing it in AWS Data Lake.
- Nullability: Null and blank representations were not clearly defined or handled consistently across datasets on Hadoop (ASCII), so these gaps had to be addressed as part of the migration.
- Loosely Typed Data: Data types in schema declarations were not regulated, and the choice of precision and scale was independent of the data, impacting the actual values stored through truncation, padded zeros and other issues.
- Timestamp Formats: Datetime values were stored in a variety of formats, some with assumed time zones. The differing formats and implicit time zone assumptions created issues.
- Character Encoding: Because TEXT files carry no implicit encoding support, the encodings of source and target were assumed, creating inconsistencies in handling special characters and unknown-character replacement issues.
- Special Characters (“\n”, “\r”, etc.): Special characters present in column values conflicted with line endings and field delimiters, so the data required additional steps to read correctly.
These broad issues were later addressed through our next step, Project Metafix.
“Houston, we have a problem” — Project Metafix
After finalizing some of the infrastructure decisions, our next major challenge was tackling the fragmented and outdated metadata for thousands of datasets. We needed to reconcile, converge and adapt decades-old metadata — encompassing multiple versions, generations and hosting systems — to the cloud and Parquet, all while ensuring no disruption to our customers’ normal operations.
Given the enormous number of artifacts spread across Warehouse and Hadoop clusters, the task was becoming overwhelmingly complex and large in scale.
It was time for all hands on deck, including our Data Delivery (aka Data Engineering), Data Lake, Data Pipeline, Data Consumption and Data Governance teams. We launched Project Metafix to align everyone and efficiently tackle the problem at hand. Teams were brought together from across continents to focus on the issues and develop solutions.
And tackle it we did, addressing several key challenges in each of the areas (a simplified code sketch follows this list):
- Null vs. Blank: Parquet offers first-class support for null values, similar to a data warehouse, which was lacking in TEXT files.
- Loosely Typed Data: We standardized data types, such as ensuring decimals have consistent scale and precision, and properly distinguishing them from integers.
- Timestamp Format (non-binary, i.e. TEXT): We unified timestamps from over a dozen different formats, standardizing on UTC with ISO format.
- Date Format: We reconciled date representations across different calendars used by various ETL systems such as Julian and Gregorian.
- Character Encoding: We resolved issues with character encoding, including ISO-8859–1, UTF-8 and Unicode, among others, standardizing on UTF-8.
- Special Characters (“\n”, “\r”, etc.): We resolved special character issues appropriately. Thanks to Apache Parquet, we don’t need to deal with delimiters and line endings anymore.
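A simplified sketch of this kind of standardization, with hypothetical column names and only two of the dozen-plus source timestamp formats we actually handled, might look like the following:

```python
# Simplified sketch of the standardization rules described above: blanks become
# true nulls, decimals get a consistent precision/scale, and timestamps from a
# couple of source formats are unified on UTC. Column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("metafix-style-standardization")
    .config("spark.sql.session.timeZone", "UTC")  # standardize on UTC
    .getOrCreate()
)

df = spark.read.parquet("s3a://example-datalake/staging/customer_events/")

standardized = (
    df
    # Null vs. blank: treat empty or whitespace-only strings as true nulls.
    .withColumn("middle_name",
                F.when(F.trim(F.col("middle_name")) == "", None)
                 .otherwise(F.col("middle_name")))
    # Loosely typed data: pin decimals to an agreed precision and scale.
    .withColumn("balance", F.col("balance").cast("decimal(18,2)"))
    # Timestamps: parse known source formats and keep a single UTC timestamp.
    .withColumn("event_ts",
                F.coalesce(
                    F.to_timestamp("event_ts_raw", "yyyy-MM-dd HH:mm:ss"),
                    F.to_timestamp("event_ts_raw", "MM/dd/yyyy HH:mm:ss"),
                ))
)

standardized.write.mode("overwrite").parquet(
    "s3a://example-datalake/standardized/customer_events/")
```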
Through collaboration and innovation, our Data Delivery teams led the automation of metadata changes across legacy applications. They consolidated legacy applications’ metadata into innovation databases, paving the way for efficiency across the data technology at Chase. The work not only streamlined the process of metadata modifications but also democratized access to the accelerator for dataset and pipeline promotions, empowering other teams to effortlessly promote 11,451 datasets from identification of changes to production.
The solution went beyond automation, advancing the entire data lifecycle from identifying metadata differences in datasets to code promotion and validation across DPL (Data Pipeline), CCMS (Catalog Management Service), AWS Glue DDL and Warehouse. The accelerator became an example of a new standard at Chase for efficiency and reliability. It eliminated manual interventions and minimized human error, saving countless hours while enhancing data integrity and consistency.
Project Metafix by the Numbers
Overall, we achieved 80% efficiency through automation (~45K hours). Central to this was the development of a sophisticated Python and Java framework capable of parsing complex vendor and home-grown metadata files as well as intricate SQL queries to discover and manage differences.
The solution surpassed traditional boundaries, enabling seamless integration between diverse data sources and repositories and facilitating swift and accurate metadata updates and promotion to production using a Java accelerator built on a simple spreadsheet template with pre-defined rules.
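The real framework is far richer than anything we can show here, but as a toy illustration of the core idea, i.e. comparing a legacy schema definition against the target state and reporting the differences that need promotion, consider the following (the dataset fields and types are invented):

```python
# Toy illustration of metadata diffing: compare a legacy schema with the
# target-state schema and report columns that were added, removed or retyped.
# Field names and types are invented; the real framework also parses vendor
# metadata files and SQL to build these definitions.

def diff_schemas(legacy: dict, target: dict) -> dict:
    """Return added, removed and retyped columns between two schema definitions."""
    added = [col for col in target if col not in legacy]
    removed = [col for col in legacy if col not in target]
    retyped = [
        (col, legacy[col], target[col])
        for col in legacy
        if col in target and legacy[col] != target[col]
    ]
    return {"added": added, "removed": removed, "retyped": retyped}

legacy_schema = {"acct_id": "string", "bal": "string", "open_dt": "string"}
target_schema = {"acct_id": "string", "bal": "decimal(18,2)", "open_dt": "date"}

print(diff_schemas(legacy_schema, target_schema))
# {'added': [], 'removed': [],
#  'retyped': [('bal', 'string', 'decimal(18,2)'), ('open_dt', 'string', 'date')]}
```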
Through their dedication, expertise and collaborative spirit, teams have not only transformed how metadata is managed at Chase but have also laid the foundation for future innovations in data management and automation.
Tracking Data Movement — Lineage 2.0
This team-wide, semi-automated exercise connected all the platforms and data assets end to end and gave us a denormalized view of data assets and their footprint across all legacy and target state platforms.
Over multiple iterations spanning a few months, we defined the scope of legitimate datasets that needed to be migrated to the cloud and subsequently provisioned to the Enterprise Data Warehouse (EDW).
We started with an inventory of more than 22,000 datasets from Hadoop (Horton) and diligently re-created end-to-end lineage from ingestion all the way to EDW consumption. This metadata was crucial in finalizing the migration scope with a high degree of confidence. Post Lineage 2.0, we ended up with approximately 12,000 datasets with valid rule code (RC) assignments. A rule code is a way to determine the significance of a dataset in the context of migration.
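As a rough sketch of how end-to-end lineage can be stitched together (the real exercise was semi-automated and far larger; the dataset names and edges below are made up), a simple walk from an ingested source down to its EDW consumers looks like this:

```python
# Toy lineage walk: given "dataset -> downstream datasets" edges, collect every
# asset reachable from an ingested source. Names and edges are made up; the
# real Lineage 2.0 exercise spanned 22,000+ datasets across Hadoop, AWS and EDW.
from collections import deque

edges = {
    "hadoop.raw_transactions": ["lake.transactions_std"],
    "lake.transactions_std": ["lake.daily_balance", "edw.txn_fact"],
    "lake.daily_balance": ["edw.balance_summary"],
}

def downstream(source: str, graph: dict) -> list:
    """Breadth-first walk of the lineage graph starting from one source dataset."""
    seen, queue, order = {source}, deque([source]), []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

print(downstream("hadoop.raw_transactions", edges))
# ['lake.transactions_std', 'lake.daily_balance', 'edw.txn_fact', 'edw.balance_summary']
```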
Key Takeaways and Next Steps
The Lineage document helped identify datasets needed for history copy versus one-time loads. It also helped in defining consumable, provisioned and risk-impacted datasets. Capturing the scheduler information helped us monitor, track and compare execution stats between Hadoop and AWS jobs. Observability of all the pipelines is critical to success.
Stay tuned as we continue sharing the team’s migration journey from Hadoop to AWS, including managing friction points and achieving data stability and optimization.
JP Prabhakara, Senior Director of Software Engineering, Michael Zeltser, Senior Principal Software Engineer, and Chandra Modigunta, Director of Software Engineering, contributed to this article.
Like what you’re reading? Check out all our opportunities in tech here.
JPMorgan Chase is an Equal Opportunity Employer, including Disability/Veterans
For Informational/Educational Purposes Only: The opinions expressed in this article may differ from other employees and departments of JPMorgan Chase & Co. Opinions and strategies described may not be appropriate for everyone and are not intended as specific advice/recommendation for any individual. You should carefully consider your needs and objectives before making any decisions and consult the appropriate professional(s). Outlooks and past performance are not guarantees of future results.
Any mentions of third-party trademarks, brand names, products and services are for referential purposes only and any mention thereof is not meant to imply any sponsorship, endorsement, or affiliation.