by Zach Owens
As we look to our digital future, one of Nike’s key initiatives takes center stage: a microservices architecture with cloud deployment. Over the last few years, a small number of data-center-deployed monolithic systems has been incrementally migrated to an expansive set of microservices that serve our consumers 24/7 on Nike.com, SNKRS, the Nike app and other digital experiences.
In any microservice-based architecture, data storage is a critical building block for business logic. DynamoDB is now the go-to NoSQL data store because of its simple scaling and low operational burden.
Over the past few years, DynamoDB has stood up to Nike’s scale and demands. Even against our most challenging engineering problems, such as high-heat product launches where we deal with millions of consumers within a window of just a few minutes, DynamoDB reliably provides a consistent experience for consumers.
When we started the cloud migration, Nike’s services typically used self-hosted Couchbase and Cassandra clusters. Cassandra, in particular, was widely understood and adopted by teams. Cassandra’s latency and scalability were a good fit for most use cases, as Nike’s services are typically designed to scale with the number of users as the primary scaling dimension.
However, as time went on, the engineering effort required to maintain Cassandra and Couchbase clusters increased, and the database maintenance overhead slowed overall progress. Time spent scaling, upgrading and maintaining a database cluster was time taken away from adding features or improving service code. We needed to eliminate this roadblock.
Most of Nike’s microservices are deployed in AWS, so DynamoDB was an obvious choice as a replacement. DynamoDB has a storied history at Amazon: it was originally developed as Dynamo for the Amazon Shopping Cart as an alternative to a relational database. Today, DynamoDB is a service-based offering where any team can spin up a DynamoDB Table with a simple API call. Once provisioned, DynamoDB is always on and provides low-latency reads and writes.
Now new services typically choose DynamoDB because of the low barrier to entry.
Our architecture principles dictate that a service is only accessed by the service’s API and not by the database directly. As a result, we can migrate existing microservices to DynamoDB without client impact.
Learning the Model
Most teams were fairly familiar with key-value store design, usually from past experience with Cassandra and/or Couchbase. However, DynamoDB has some significant differences from previously used databases.
DynamoDB tables are designed around items (“rows” in a typical database), attributes (“columns”) and the key schema. The item key consists of a hash key and an optional range key, which forms a fixed key schema. Other than the key, the attributes between items are unrestricted and can vary from item to item for flexibility.
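As a sketch, the key schema described above might be expressed as boto3-style create_table parameters (parameter shape only, no AWS call is made; the table name and attribute names here are hypothetical):

```python
# Hypothetical "Orders" table: the hash key (customerId) groups items,
# and the optional range key (orderId) sorts items within that group.
table_definition = {
    "TableName": "Orders",
    "KeySchema": [
        {"AttributeName": "customerId", "KeyType": "HASH"},
        {"AttributeName": "orderId", "KeyType": "RANGE"},
    ],
    "AttributeDefinitions": [
        {"AttributeName": "customerId", "AttributeType": "S"},
        {"AttributeName": "orderId", "AttributeType": "S"},
    ],
}

# Two items sharing a hash key; non-key attributes may vary freely.
item_a = {"customerId": "c-123", "orderId": "o-001", "total": 120}
item_b = {"customerId": "c-123", "orderId": "o-002", "giftWrap": True}
```

Note how `item_a` and `item_b` share the same key schema but carry completely different non-key attributes.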
A write to DynamoDB can be a complete overwrite or a partial update via the UpdateItem API, which supports compare-and-set semantics.
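For example, a compare-and-set update might look like the following UpdateItem parameters (shape only, no call is made; the table name, key values and status values are illustrative):

```python
# Move an order from PAID to SHIPPED only if it is still PAID;
# if another writer changed it first, the update is rejected.
update_params = {
    "TableName": "Orders",
    "Key": {"customerId": {"S": "c-123"}, "orderId": {"S": "o-001"}},
    "UpdateExpression": "SET orderStatus = :new",
    "ConditionExpression": "orderStatus = :expected",
    "ExpressionAttributeValues": {
        ":new": {"S": "SHIPPED"},
        ":expected": {"S": "PAID"},
    },
}
```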
On the read side, a single record can be queried by key. Tables containing a range key can be queried by hash key, returning multiple records that are sorted by the range key. Utilizing range keys for sorting is a powerful technique where multiple records within a hash key bucket can be efficiently queried. A common situation utilizing this technique is a range key set to a timestamp to form a list of records in chronological order.
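The chronological-list pattern can be sketched in plain Python (a simulation of the Query API’s sort-by-range-key behavior, with hypothetical item and attribute names):

```python
# Items sharing a hash key (userId), with an ISO-8601 timestamp as the
# range key so lexicographic order equals chronological order.
items = [
    {"userId": "u-1", "createdAt": "2019-03-02T10:00:00Z", "event": "login"},
    {"userId": "u-1", "createdAt": "2019-03-01T09:00:00Z", "event": "signup"},
    {"userId": "u-1", "createdAt": "2019-03-03T11:00:00Z", "event": "purchase"},
]

def query_by_hash_key(items, user_id, ascending=True):
    """Return all items for one hash key, ordered by the range key."""
    matches = [i for i in items if i["userId"] == user_id]
    return sorted(matches, key=lambda i: i["createdAt"], reverse=not ascending)

history = query_by_hash_key(items, "u-1")
```

Reversing the sort (DynamoDB’s `ScanIndexForward=False`) yields the most recent records first.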
Global Secondary Indexes (GSIs) allow different forms of querying on the same underlying data. Previously, teams would typically write to more than one Cassandra table (with the versions used at the time) to support multiple forms of querying. Now, a single write to DynamoDB can accomplish the same result by utilizing GSIs. Under the hood, DynamoDB creates a shadow table on your behalf, using the GSI’s separate key schema, and copies data from the main table to the GSIs on write. Not only does this save time and engineering effort, but it is also safer from a data-durability standpoint than the double-write scenario in our Cassandra implementations.
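A table definition with one GSI might look like this sketch (boto3-style parameter shape only, no AWS call; the table, index and attribute names are hypothetical):

```python
# An "Orders" table keyed by orderId, plus a GSI keyed by customerEmail
# so the same data can also be queried per customer — one write, two
# query patterns, no application-level double write.
table_with_gsi = {
    "TableName": "Orders",
    "KeySchema": [
        {"AttributeName": "orderId", "KeyType": "HASH"},
    ],
    "AttributeDefinitions": [
        {"AttributeName": "orderId", "AttributeType": "S"},
        {"AttributeName": "customerEmail", "AttributeType": "S"},
    ],
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "by-customer-email",
            "KeySchema": [
                {"AttributeName": "customerEmail", "KeyType": "HASH"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
}
```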
Overall, the DynamoDB model is simple and flexible, allowing evolution of the services to add additional features.
With millions of consumers utilizing Nike’s front-end user experiences, scalability is often a concern for microservices. Scalability is two-pronged: a service must be able to store data for millions of consumers, and it must be able to horizontally scale to handle a large volume of requests. The beauty of DynamoDB’s design is it allows both forms of scaling to occur — perfect for Nike’s unique challenge set.
DynamoDB is a partitioned store, in that there is a set of independent and replicated segments that form a table. It scales up partitions automatically by splitting them into two smaller partitions. Being a service-based offering, DynamoDB performs the automatic partition-split behind the scenes with no downtime.
If the key schemas for both the table and GSIs are designed correctly, DynamoDB will happily store more data by increasing the number of underlying partitions seamlessly. For many services, data accumulates with long term use. DynamoDB is a perfect fit for the use cases in this category.
In addition to storage scaling, DynamoDB also has the ability to scale the supported throughput on a table or GSI. Write and read throughput values are provisioned separately and can be changed with an API call.
Throughput can also be auto-scaled based on the read and write demand. So DynamoDB is also a perfect fit for situations where demand fluctuates.
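Changing provisioned throughput is a single UpdateTable call; as a sketch, the parameters might look like this (shape only, no AWS call; the table name and capacity numbers are illustrative):

```python
# Reads and writes are provisioned independently and can be changed
# at any time without downtime.
update_throughput = {
    "TableName": "Orders",
    "ProvisionedThroughput": {
        "ReadCapacityUnits": 5000,
        "WriteCapacityUnits": 1000,
    },
}
```

With boto3 these parameters would be passed to `update_table`; auto-scaling simply adjusts the same values on your behalf based on a target utilization.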
To add capacity to Cassandra or Couchbase clusters, engineers must increase the number of EC2 instances used for the database. These events can cause cluster stability issues because of the migration of data between machines. While some scaling events are non-disruptive, great care has to be taken to protect against negative effects on live traffic or data loss. DynamoDB, on the other hand, does the same heavy lifting on our behalf without negative side effects to the microservice. For us, that means no angry sneakerheads trying to cop the latest kicks — a huge win.
Challenges At Higher Scale
The past few years with DynamoDB have undoubtedly unlocked faster delivery for our services. However, some challenges have arisen as more workloads utilize DynamoDB.
Key Design and Throttling
Unlike most other data stores, DynamoDB actively prevents bad behavior through throttling and request rejection. A primary cause of throttling is a “hot key,” where a large amount of traffic is sent to a single partition over a short period of time.
DynamoDB prefers keys to be well distributed so that, as the amount of data grows, keys are free to move onto separate partitions. Because the overall table load is shared between partitions, high frequency of requests to the same hash key over a short period of time can also lead to a hot key scenario.
While most teams are able to design the table key space with broad key spacing over a large hash field, some teams run into issues when designing GSI keys. In some flawed designs, a low-cardinality attribute is chosen as the GSI hash key. When a GSI is throttled because of a hot key during a write, it can have a negative effect on the main table as well (i.e. main table writes can be throttled).
Unfortunately, hot keys are often not found until production data is added to the table and by then, it is usually too late. Good up-front key design is the best way to avoid production issues.
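One common mitigation can be sketched here (an illustrative technique, not a description of any specific Nike table): write sharding, where a shard suffix derived from another attribute is appended to a low-cardinality GSI hash key so that writes fan out across several partitions:

```python
import hashlib

# Illustrative write-sharding sketch for a low-cardinality GSI hash key
# (e.g. a status field). The suffix is deterministic, so the same item
# always maps to the same shard; reads query all shards and merge.
NUM_SHARDS = 8

def sharded_gsi_key(status, order_id, shards=NUM_SHARDS):
    """Derive a shard number from another attribute and suffix the key."""
    digest = hashlib.md5(order_id.encode()).hexdigest()
    return f"{status}#{int(digest, 16) % shards}"
```

The trade-off is a fan-out read (one query per shard), which is why getting the key design right up front is still the preferred path.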
Throughput During Spikes
Nike has a unique problem when it comes to product launches and sneaker drops. This traffic pattern isn’t just a hockey stick; it’s a steep cliff! When we are about to release new products with high consumer interest, traffic across many services can spike sharply just before launch time, often more than 500% above the baseline traffic.
This particular traffic pattern presents interesting implications for infrastructure scaling. Typical deployments of microservices rely on horizontal auto-scaling. Auto-scaling tracks demand and automatically increases or decreases the number of instances of a service. This maintains the available capacity to take requests. When request rates spike within seconds, auto-scaling is not able to react quickly enough to meet the demand.
To counteract this, we rely on pre-scaling of both instances of a service and the service’s DynamoDB throughput. Because we happen to know when a particular product is launching, we can anticipate the wall of traffic and scale up affected tables beforehand. We can even perform this scaling automatically (again, using DynamoDB’s API) without engineers manually intervening.
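As a sketch, the pre-scaling math is simple: pick a launch multiplier and a headroom factor, then provision capacity ahead of the window (all numbers here are illustrative, not Nike’s actual figures):

```python
import math

# Illustrative traffic figures only.
BASELINE_READS_PER_SEC = 2000
LAUNCH_MULTIPLIER = 6      # roughly "500% above baseline"
HEADROOM = 1.25            # safety margin on top of the expected peak

def prescale_capacity(baseline, multiplier=LAUNCH_MULTIPLIER, headroom=HEADROOM):
    """Capacity units to provision ahead of a product launch."""
    return math.ceil(baseline * multiplier * headroom)
```

A scheduled job can feed the result into an UpdateTable call before the launch window, then scale back down afterward.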
Even with pre-scaling, however, some tables have been hit with throttling during critical moments. This usually happens when tables haven’t been scaled up enough. Fortunately, DynamoDB will only throttle requests above the provisioned throughput amount. This means that only a small percentage of requests are rejected. Automatic retries by the AWS SDK can also work around momentary throttling.
Throughput-based throttling is a consistent issue that requires careful tracking of capacity to meet up-time objectives.
Wrong Tool for Some Jobs
Although DynamoDB has been a welcome solution to many of our challenges, we have found that it doesn’t fit every workload. Even a tried-and-true, widely deployed database isn’t the right tool for every single use case.
Often we’ve needed a data store that allows more complex query patterns, and such needs are usually a perfect fit for Elasticsearch. We typically deploy the AWS Elasticsearch service, which gives us the same operational benefits plus support for complex queries.
For some use cases, a graph database or a relational database is more appropriate. For large data sets, even flat files in S3 can be the better choice, as can Redis for transient data or Kinesis for streaming data.
In short, we’ve been careful to choose the right tool for the job.
DynamoDB has played a crucial role in Nike’s migration to the cloud. AWS takes care of our data storage and query scalability with a simple-to-use API and model. As a service-based data store solution, DynamoDB has also provided consistently high availability, data durability and reasonable performance.
Since we adopted DynamoDB, AWS has continued to add features. Some other capabilities we’re exploring:
- We are starting to take a hard look at global tables as a way to provide multi-region capabilities where it makes sense.
- DAX (DynamoDB Accelerator) is also useful in cases with high-throughput reads on low-cardinality data sets.
- TTL is widely deployed to meet our compliance needs and reduce storage costs.
- Encryption at rest is also a nice recent addition to secure consumer data.
- Finally, the new backup and restore capability will significantly reduce the dependence on EMR-based backup solutions.
Overall, DynamoDB has been a great long term solution to our data storage problems. We are spending less time managing database clusters and more time creating new services, adding features to existing services and innovating Nike’s consumer experiences.
Want to join the Nike Digital Team? Check out the available jobs here.