Online Data Migration from HBase to TiDB with Zero Downtime
Ankita Girish Wagh | Senior Software Engineer, Storage and Caching
Introduction and Motivation
At Pinterest, HBase is one of the most critical storage backends, powering many online storage services like Zen (graph database), UMS (wide column datastore), and Ixia (near real time secondary indexing service). The HBase Ecosystem, though having various advantages like strong consistency at row level in high volume requests, flexible schema, low latency access to data, Hadoop integration, etc. cannot serve the needs of our clients for the next 3–5 years. This is due to high operational cost, excessive complexity, and missing functionalities like secondary indexes, support for transactions, etc.
After evaluating 10+ different storage backends and benchmarking three shortlisted backends with shadow traffic (asynchronously copying production traffic to non production environment) and in-depth performance evaluation, we have decided to use TiDB as the final candidate for Unified Storage Service.
The adoption of Unified Storage Service powered by TiDB is a major challenging project spanning over multiple quarters. It involves data migration from HBase to TiDB, design and implementation of Unified Storage Service, API migration from Ixia/Zen/UMS to Unified Storage Service, and Offline Jobs migration from HBase/Hadoop ecosystem to TiSpark ecosystem while maintaining our availability and latency SLA.
In this blog post, we will first learn the various approaches considered for data migration with their trade offs. We will then do a deep dive on how the data migration was conducted from HBase to TiDB for one of the first use cases having 4 TB table size serving 14k read qps and 400 write qps with zero downtime. Lastly we will learn how the verification was done to achieve 99.999% data consistency and how the data consistency was measured between the two tables.
Data Migration Strategies
In general, a simplified strategy for data migration with zero downtime consists of the following:
- Assuming you have database A and you would like to migrate the data to database B, you would first start to double write to database A and database B.
- Import the dump of database A in database B while resolving conflicts with live writes.
- Do a validation of both data sets.
- Stop writing to database A.
Each use case is different and can present its own set of unique challenges.
We considered various approaches for doing data migration and finalized the methodology based on various trade offs:
- Doing double writes ( writing to 2 sources of truths in sync/async fashion) from the service to both tables (HBase and TiDB) and using the TiDB backend mode in the lightning for data ingestion.
This strategy is the simplest and easiest to implement. However, the speed offered by TiDB backend mode is 50GB/hour so it’s only useful for data migration of smaller tables.
- Take a snapshot dump of the HBase table and stream live writes from HBase cdc (change data capture) to a kafka topic, then do data ingestion of that dump using local mode in the lightning tool. Later, start double writes from the service layer and apply all updates from the kafka topic.
This strategy was difficult to implement due to complicated conflict resolution when applying the cdc updates. Additionally, our home grown tools for capturing HBase cdc only store the key. Hence some development effort was also needed.
- An alternative to the above strategy is where we read the keys from cdc and store them in another data store. Later, after starting the double writes to both tables, we read their latest value from the source of truth (HBase) and write to TiDB. This strategy was implemented but had the risk of losing the updates if the async path of storing keys via cdc had availability issues.
After evaluating trade offs of all strategies, we decided to take the following approach described in the rest of this blog.
Client: A downstream service/library which talks to thrift service
Service: Thrift service which serves online traffic; for the purpose of this migration, it’s Ixia
MR Job: An application which is run on map reduce framework
Async write: The service returns an OK response to the client without waiting for a response from the database
Sync write: The service returns a response to the client only after receiving a response from the database
Double write: The service writes to both underlying tables in either sync or async manner
Since HBase is schemaless and TiDB uses strict schema, before this migration can be started, a schema needs to be designed containing correct data types and indexes. For the purpose of this 4 TB table, there is 1:1 mapping between HBase and TiDB schemas. This means the TiDB schema was designed by using a map reduce job to analyze all columns in a hbase row and their maximum size. The queries were then analyzed to create correct indexes. Here are the steps:
- We used hbasesnapshotmanager to take HBase snapshot and store it as csv dump in s3. We stored the CSV rows as Base64 encoded to work around special character limitations. Then we used TiDB lightning in local mode to start ingesting this csv dump while doing base64 decoding before storing the row in TiDB. Once ingestion is finished and the TiDB table is online, start async dual writes to TiDB. The async dual writes ensure that TiDB SLA does not impact service SLA. Although we have a monitoring / paging setup for TiDB, we kept TiDB in the shadow mode at this time.
- Perform snapshotting of HBase and TiDB table using a Map reduce Job. The rows were first converted into a common object and stored as SequenceFiles in S3. We developed a custom TiDB Snapshot Manager using MR Connector and used hbasesnapshotmanager for HBase.
- Read these sequence files using a map reduce job that writes the unmatched rows back to s3.
- Read these unmatched rows from s3, read its latest value from the service (backed by HBase), and write the value to the secondary database (TiDB).
- Enable double sync writes so writes go to both HBase and TiDB. Run the reconcile job in step 3, 4 & 5 to compare data parity in TiDB and HBase daily to get stats on data mismatch between TiDB and HBase and reconcile them. The double sync write mechanism didn’t have a rollback in case write to 1 db fails. Hence the reconciliation jobs need to run periodically to ensure there is no inconsistency.
- Keep sync write to TiDB and enable async write to HBase. Enable reads from TiDB. At this stage the service SLA completely relies on the availability of TiDB. We keep async writes to HBase as a best effort to maintain data consistency in case we need to rollback again.
- Stop writing to HBase completely and deprecate the HBase table.
Dealing with Inconsistencies
- Inconsistency scenarios due to backend unavailability
The double writes framework built at Ixia service layer doesn’t rollback writes in case they partially fail due to unavailability of either database. This kind of scenario is taken care of by running reconciliation jobs periodically to keep both HBase & TiDB tables in sync. The primary database, HBase is considered as the source of truth when fixing such inconsistencies. This in practice means if a write had failed in HBase but succeeded in TiDB, during the reconciliation process, it will be deleted from TiDB.
- Inconsistency scenario due to race condition during double writes and reconciliation.
There is a possibility of writing stale data to TiDB if the events happen in the following sequence: (1)reconciliation job reads from HBase; (2) live write is written to HBase synchronously and TiDB asynchronously; (3) reconciliation job writes previously read value to TiDB.
This class of issues is also resolved by running reconciliation jobs multiple times because after every run, the number of such inconsistencies decreases significantly. In practice, the reconciliation job only needed to be run one time to achieve 99.999% consistency between HBase and TiDB for the 4 TB table serving 400 write QPS. This was verified by taking the dump of both HBase and TiDB tables a second time and comparing its values. During comparison of rows, we saw 99.999% consistency for the tables.
- We saw 3x-5x p99 latency reduction for reads. The p99 query latency went down from 500 ms to 60ms for this use case.
- Read after write consistency was achieved, which is one of our goals for migrating use cases specifically from ixia.
- Simpler architecture in terms of the number of components involved once the migration is complete. This would drastically aid when debugging production issues.
Challenges & Learnings
In-house TiDB Deployment
Deploying TiDB in the Pinterest infrastructure has been a great learning experience for us since we are not using TiUP (TiDB’s one stop deployment tool). This is because a lot of responsibilities of TiUP overlap with internal pinterest systems (for example deployment systems, operational tooling automation service, metrics pipelines, TLS certificate management, etc.) and the cost of bridging the gap between the two outweighed its benefits.
Hence we maintain our own code repo of TiDB releases and have build, release, and deployment pipelines. There are a lot of nuances on safe cluster management, which we had to learn the hard way as TiUP takes care of it otherwise.
We now have our own TiDB platform built on top of Pinterest’ AWS infrastructure where we can do version upgrades, instance type upgrades, and cluster scaling operations with no downtime.
We ran into a couple of issues when doing the data ingestion and reconciliation listed below. Please note that we got full support from Pingcap on every step. We have also contributed some patches to the TiDB codebase which have been merged upstream.
- TiDB lightning version 5.3.0 didn’t support automatic TLS certificate refresh, which was a hard problem to debug due to lack of relevant logs. Pinterest’s internal certificate management service refreshes certificates every 12 hours, therefore we had to go through some failed ingestion jobs and work with pingcap to get it resolved. The feature has since been released in the TiDB 5.4.0 version.
- The local mode of lightning consumes a lot of resources and impacted online traffic on a separate table being served from the same cluster during the data ingestion phase. Pingcap worked with us to provide short term and long term remediation of the Placement Rules so the replica serving online traffic doesn’t get impacted by local mode.
- TiDB MR Connector needed some scalability fixes to be able to snapshot a 4 TB table in reasonable time. The MR Connector also needed some TLS improvements, which have since been contributed and merged.
After tunings and fixes, we were able to ingest 4 TB of data in ~ 8 hours, and running one round of reconciliation and verification took around seven hours.
The table we migrated as part of this exercise is served by ixia. We ran into a couple of reliability issues with the async/sync double writes and query pattern changes. The issues at the thrift service (ixia) became harder to debug as a result of the complex distributed systems architecture of ixia itself. Please read more about it in our other blog.
We would like to thank all past and present members of the Storage and Caching team at Pinterest for their contributions in helping us introduce one of the latest NewSQL technologies in Pinterest’s Storage stack.
We would like to thank the Pingcap team for continued support, joint investigations and RCAs of the complicated issues.
Finally, we would like to thank our clients for having immense patience and showing tremendous support as we do this huge migration, one table at a time.