Spot: A Payment Record Search Engine
The story of building a simple service under complex circumstances.
Written by Manik Surtani. Originally posted February 1st, 2016.
Spot is the payment record search engine built at Square. It’s a seemingly simple service that grew in complexity as we encountered new challenges. Today, we’re diving into the technical details of this critical payments search engine.
Setting the stage
Square’s core systems started as a monolithic service. As Square grew, we found it harder and harder to manage the monolith codebase in many aspects, including the number of committers and frequency of deployments. In a bid to reduce these complexities, all features (except some aspects of processing payments) were refactored and extracted into separate, discrete services.
Payment processing was deemed too mission-critical and too scary to attempt a major refactor. This decision wasn’t surprising given that we had little design documentation and implementation notes. Also, few of the original authors were available to consult, and it was written in a dynamically typed language (no joy of static analysis and refactoring tools). So while the actual work of processing payments was moved out to various services, the monolith was still the orchestration point and the system of record for looking up payments. (Yes, we knew we were kicking the figurative can of tech debt down the road.)
In early 2015, we realized that leaving payment processing in the monolith was going to cause it to reach its scaling limits soon. Payment volume was seeing rapid growth, and we were simply running out of space. But, we were already running one of the largest MySQL servers in the industry, with high capacity, expensive enterprise flash drives allocated to it. These were made all the more expensive with multiple replica sets for redundancy, read
paths, etc. We no longer wanted to throw money at the problem by buying bigger hardware, but we needed to find a way to remove technical debt.
We investigated various options for building a system to search for payments. Most approaches fit into two broad categories: B-tree indexes (most relational databases) and inverted indexes (full-text search engines such as Apache Lucene and derivatives like SOLR and ElasticSearch).
Inverted indexes seemed like an obvious solution, where entire payment records could be indexed and searched for by almost any field. However, in practice, inverted indexes didn’t perform well for our use cases, some of which involved numeric aggregations and searching on date ranges. Inverted indexes are optimized for textual searches, and often don’t perform well when working with numeric data. We also found inverted indexes to be inefficient when it came to space — as the entire payment record was indexed.
With B-trees, a separate tree is maintained for each field indexed, with each node pointing to the record in question. B-trees proved to be much faster on range searches, as well as more space efficient since we’d only create indexes on the fields we cared about, (i.e. a small fraction of an entire payment record). So we turned to MySQL again: an efficient, proven storage engine backed by B-tree indexing.
On the surface, Spot is a simple, boring service that converts requested search filters to a SQL WHERE clause which is run against a database. It also indexes and stores payments as they enter Square’s ecosystem. However as I foreshadowed, Spot grew to be less simple in order to meet our throughput and latency expectations.
Spot is implemented in Java using Square’s Java Service Container (a Guice-based container exposing protobuf services, not dissimilar to DropWizard), and using JOOQ for database interaction. Backed by sharded MySQL, we treated the database as a key/value store with plus secondary indexes, as described later. Payments are sharded by Square seller (a merchant), which makes aggregation queries more efficient when serving requests. (A request here could come from a seller looking for their daily total sales.)
After a capacity planning exercise, we chose to go to production with 16 database shards. Even with an initial load of billions of records, we’d have enough capacity to see us well into 2019 before we’d have to think about adding more shards.
Payment records are encoded as protocol buffers, making them easy to pass around from one service to the next in a platform-independent manner. We maintain two tables: the first one, payments, simply mapped payment tokens to serialized payment records. The second, payments_searchable, contains specific fields extracted from the payment record and stored them as separate columns, allowing MySQL to build B-tree indexes from them. This takes some inspiration from Bret Taylor’s article on FriendFeed’s architecture.
payment_token VARBINARY(64) PRIMARY KEY
payment_token VARBINARY(64) PRIMARY KEY,
created_at_ms BIGINT(20) ... etc ...
Spot reads data from Esperanto, Square’s service that orchestrates communication with credit and debit card processing gateways. To enable efficient database resource usage and to minimize database locking, new payment records are read from Esperanto in batches of a few hundred and added to Spot’s database in a single, large INSERT INTO … (records … ) ON DUPLICATE KEY UPDATE (records… ) SQL statement. Such batch updates are resource-efficient: a single JDBC connection is used, and a single database transaction is used. As a further micro-optimization, having the records ordered by primary key makes MySQL scanning its clustered index to add records/check for duplicates faster. This leads to some interesting SQL generation logic, but JOOQ makes this easy.
Searching and Pagination
Most search queries are scoped to a merchant. As such, they’re directed to a specific database shard. However, Spot also supports unscoped queries by fanning out to all shards. By using GuavaListenableFutures, we’re able to stream results in parallel from all shards, apply observers to each result stream, and return early when adequate results are received. This is extremely efficient when you’re fanning out to search for a single record, or you need a fixed size set of results that may span all shards which need to be ordered or paginated (both in terms of time as well as garbage creation).
Given the number of indexes on payments_searchable designed to answer queries pertaining to our supported use cases, we had some composite indexes that overlapped to some degree. Under load tests, MySQL was prone to picking a less-than-ideal index from time to time, causing latency spikes. So, we built an index hinter, which analyses the elements of a WHERE clause and applies an index hint to the query. Index hints live in a configuration file and are looked up based on WHERE clause components.
Given our usage of the database, we found that, during load tests, our MySQL shards were maxing out on IOPS, but were fine in terms of CPU usage — hovering around 10% utilization. So, we opted to explore compression, trading off the headroom we had in CPU capacity for storage capacity.
We experimented with compressing payment records before storing them in the database using bothzlib and snappy. This only bought us about 10% savings in storage capacity — mostly because this approach didn’t apply any compression to the indexes we created, but also each compressed record maintained its own dictionary and lost efficiency. This approach could have improved with more flexible compression libraries with shared dictionaries, semantic compression, etc., but that would have greatly increased complexity. Further, this would eat up app host CPU and not database CPU, which is where we had spare capacity. So we dropped this approach.
InnoDB page compression
We then tried InnoDB page compression and had much greater success. We’d typically see 5–10 payment records in a single InnoDB page, so dictionaries were better utilized. Indexes would get compressed too. After tuning compression target ratios to suit the sizes of typical payment records, we were able to compress our dataset to 40% of its original size.
As for performance, we expected latency to increase as a part of this tradeoff, and it did. But given that Spot’s write path is so unstressed, we barely noticed it; Spot is a read-heavy system (after all, it is a search engine). Even better, search and lookup latency didn’t increase. We attribute this to greater efficiency in buffer pool usage, since MySQL caches entire pages (with compressed pages, more things get cached, which possibly compensate for the additional work to decompress pages). Database shard CPU usage under load increased by about 30%, leaving plenty of headroom. And in terms of IOPS utilization (our limiting factor under stress), it remained the same.
Spot’s API allows clients to specify arbitrary filters when searching for payments, and supports non-merchant-scoped searches. This results in fanning out queries to all database shards and aggregating results in memory. We found a few badly written requests that caused long-running queries. Due to the way JDBC and MySQL interact, even after such requests time out in Spot and the JDBC connection is recycled back into a connection pool, MySQL would keep the query running. We had cases where this became a resource hog on the database shards, where a badly written query would do a table scan and inhale IOPS.
So, we added a safeguard: the query sniper. The query sniper is a tool to prevent long-running queries eating up shared resources, and is blogged about in more detail separately.
Load testing and tuning
A lot of the details described above are a direct result of designing and running proper load, stress, and failure tests for the various usage scenarios we expected. Given that this is a new system, we were able to load test on production hardware before we made it generally available, saving us the need to build a load test environment. In addition to the implementation details discussed above, this load testing helped us tune database shards, JVMs, optimize code for minimal garbage collection pressure. There is a lot that we explored here, and possibly a topic for a future blog post on on GC tuning and techniques when measuring latency.
The final product
We now have a payments search service in production today that responds to bulk lookup requests (by bulk primary keys) in about 15ms (P99), and 1.5ms (median) at a sustained 1100 requests/sec. Concurrently, it also responds to searches (which hit secondary indexes) in about 8ms (P99) and 2ms (median), at a sustained 180 requests/sec.
Spot’s stable too. We rarely see pages or incur downtime, and we don’t see latency spikes thanks to the query sniper. The code is clean thanks to Java 8’s support for streams and lambdas — useful, since we do a lot of collation and manipulation of result sets in memory. And we’ve got enough database capacity to last us well into 2019.
Predictions, by definition, don’t take into account the unpredictable, such as acquisitions that may increase our processing volume or new product offerings that may warrant additional fields to index in Spot. That said, we designed Spot to be able to add capacity by increasing database shards and to add search fields by adding more index tables, both on the fly. Time will tell how the design works when the need for more capacity inevitably arises.
Thank you to the WebScale team (in no particular order: Kathy Spradlin, John Pongsajapan, Zach Anker, Gabriel Gilder, Andrew Lazarus, and Alyssa Pohahau) and Square’s group of MySQL experts for their tireless effort.