by Ken Smith
SIM is Nike’s in-house store inventory management solution. It is the
source of truth for stock on hand in Nike stores, among its other
responsibilities. SIM calculates stock on hand based on inventory
movement events in the stores. These events include product receiving,
transferring, sales, returns, etc. Recently, our team was given better
visibility into the costs of our services by tagging our resources and
leveraging our in-house cost tools. We observed many
higher-than-expected costs, but the standout was Amazon Web Services
(AWS) DynamoDB: although our DynamoDB tables made up a very small
percentage of our total resources, they accounted for well over half of
our total AWS cost. For this post, we’ll focus on one of the
SIM tables; however, these issues exist with many of the tables in SIM.
The majority of the services that use these tables were stood up
following a cookie-cutter pattern adopted by the team using examples
influenced by other practices across Nike Digital Engineering.
Nike has two store inventory activities that cause significant traffic
spikes; they’ll be referred to in this post as Physical Inventory
and Cutover. Physical Inventory is the process of counting all products
in a store and adjusting the stock on hand to align with the verified
count. This is done to account for drift that happens naturally in the
stores due to theft, receiving shipments with inaccurate content lists,
etc. Cutover is the process of loading all stock data from the legacy
system of record into SIM. The loads generated in the SIM system by
these activities aren’t massive in comparison to what some other
systems take but are enough to cause issues if not handled properly.
The four main issues with our DynamoDB configurations were:
- Write capacity units (WCU) were set too high in anticipation of spikes
- Read capacity units (RCU) were set too high in anticipation of spikes
- Heavy use of Global Secondary Indexes (GSIs)
- Poorly composed partition keys
The supply-stock-events table belongs to the service in our system that
persists all of the events that impact inventory in our stores. It is
part of a Command Query Responsibility Segregation style micro-service
that is responsible for the stock on hand in our stores. This was our
most expensive table and seemed like the most impactful place to begin.
When looking at the traffic patterns of our supply-stock-events table in
the us-east-1 region of AWS for efficiency opportunities, the most
obvious issue was over-provisioned WCU on the table.
The over-provisioning of the WCU was in place for a few reasons:
- During the Cutover process, moving from the old inventory system to SIM causes a spike of write load to this table.
- During the Physical Inventory process, the data causes a spike of write load to this table.
SIM is a global platform, so the Cutover and Physical Inventory
processes happen across many time zones for many stores in a short
window of time. Cutover and Physical Inventory processes are planned
activities, and the tables have been scaled manually in the past.
However, we have had cases where the activities were not communicated
clearly to our team, so scaling wasn’t coordinated and problems arose
in production.
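For a planned activity, pre-scaling a table comes down to a single UpdateTable call. As a minimal sketch (the capacity values here are illustrative, not our production settings), the request parameters can be built ahead of time and handed to boto3’s `update_table`:

```python
def build_scale_request(table_name: str, rcu: int, wcu: int) -> dict:
    """Build UpdateTable parameters that raise provisioned throughput
    ahead of a planned spike such as Cutover or Physical Inventory."""
    return {
        "TableName": table_name,
        "ProvisionedThroughput": {
            "ReadCapacityUnits": rcu,
            "WriteCapacityUnits": wcu,
        },
    }

# Pre-scale before the activity window opens (values are illustrative)
params = build_scale_request("supply-stock-events", rcu=2000, wcu=5000)
```

The resulting dict would be passed to `boto3.client("dynamodb").update_table(**params)` before the window, and again with lower values once the activity completes.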
An obvious solution to this problem might be to leverage DynamoDB
autoscaling, but the team had been reluctant to do so because analysis
by other teams had shown scale-up times long enough to be prohibitive.
We tested the
DynamoDB autoscaling in our test environment under loads similar to what
we experience during activities that cause spikes, like the Cutover and
Physical Inventory processes. The burst capacity we consumed and the
time to scale up during these activities were good enough to make
autoscaling a great solution for the variable loads this table needs to
support.
Autoscaling is probably not a solution for all cases, but it worked
quite nicely for this case. The minimum capacities needed to support
production were upped a bit from what was found to be acceptable in the
test environment, but it wasn’t by much compared to what the capacity
was set to before moving to autoscaling. Using autoscaling, this table
now runs in production with an average provisioned WCU well under 10
percent of what was provisioned in production before.
Another factor considered for optimizing this table was the number of
GSIs. The provisioned throughput for each of the four GSIs on this
table was configured individually, which was quite expensive.
expensive. Fortunately, autoscaling was applied to the GSIs as well.
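For reference, DynamoDB autoscaling is wired up through the Application Auto Scaling API: register a scalable target per dimension (the table itself and each GSI), then attach a target-tracking policy. Here is a sketch of the write-side request parameters, with the capacities, policy name, and index name as illustrative assumptions rather than our actual settings:

```python
def scaling_requests(table: str, min_wcu: int, max_wcu: int, index: str = ""):
    """Build Application Auto Scaling parameters for a table's (or a
    GSI's) write capacity: a scalable target plus a tracking policy."""
    if index:
        resource_id = f"table/{table}/index/{index}"
        dimension = "dynamodb:index:WriteCapacityUnits"
    else:
        resource_id = f"table/{table}"
        dimension = "dynamodb:table:WriteCapacityUnits"
    target = {
        "ServiceNamespace": "dynamodb",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_wcu,
        "MaxCapacity": max_wcu,
    }
    policy = {
        "ServiceNamespace": "dynamodb",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyName": f"{table}-wcu-tracking",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Scale so consumed/provisioned WCU hovers near 70%
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
            },
        },
    }
    return target, policy

table_target, table_policy = scaling_requests("supply-stock-events", 50, 5000)
```

The dicts would be handed to `boto3.client("application-autoscaling")` via `register_scalable_target(**target)` and `put_scaling_policy(**policy)`; the read side is analogous with the `ReadCapacityUnits` dimensions and the `DynamoDBReadCapacityUtilization` metric.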
The over-provisioning on the RCU was in place for a few reasons:
- Cutover process as discussed above
- Physical Inventory process as discussed above
- Data Pipeline backups
Autoscaling clearly solves two out of the three problems and was put in
place along with the write autoscaling configuration. The complexity
here lies in the last reason: Data Pipeline backups for disaster
recovery. This is a job scheduled to run hourly that pulls all of the
table’s data and puts it in S3. This poses a problem because it
requires a lot of RCU in order for the job to complete before it kicks
off again. There is also no TTL on the data in this table, so it is
currently growing without bound. This means the job is going to take
longer and require more capacity as the table grows.
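If a retention window were ever acceptable for this data (an assumption; today the events are kept indefinitely), bounding the table’s growth would be a matter of enabling TTL and stamping each item with an epoch expiry. A sketch, with the attribute name `expiresAt` and the 90-day window as illustrative choices:

```python
import time

# Parameters for boto3's update_time_to_live call; once enabled,
# DynamoDB deletes items whose `expiresAt` epoch value is in the past.
ttl_spec = {
    "TableName": "supply-stock-events",
    "TimeToLiveSpecification": {"Enabled": True, "AttributeName": "expiresAt"},
}

def expiry_epoch(retention_days: int, now=None) -> int:
    """Epoch seconds at which an item written now should expire."""
    base = time.time() if now is None else now
    return int(base + retention_days * 24 * 60 * 60)

# Stamp each event item at write time, e.g. with a 90-day expiry
item_expires_at = expiry_epoch(90)
```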
Why not just use native DynamoDB backup and turn those jobs off? To add
a little complexity, a few of our data aggregation jobs for reporting
run on AWS EMR and depend on the data these Data Pipeline jobs land in
S3. To address the disaster recovery concern, we leveraged the
On-Demand Backup service running in our AWS account, which only
requires a couple of tags on the table: one to specify the backup
cadence and another to
specify retention time of the backups. To address the EMR job concern,
the schedule for the EMR jobs was identified, which ended up being once
per day, and the Data Pipeline job will be adjusted to meet that need
soon. This will reduce the Data Pipeline execution count per day from 24
to 1. The Data Pipeline spikes RCU on this table to around 8,500. The
average RCU consumed, excluding Data Pipeline capacity, is less than 25.
Reducing these jobs by 24x will make quite a difference. Another way we
can create efficiency is by eliminating the Data Pipeline jobs
completely and switching to ETL jobs that access the data from the
tables directly. But this is a larger investment of time and a more
complex effort.
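Opting a table into the On-Demand Backup service is then just a matter of tagging it. The tag keys below (`backup-cadence`, `backup-retention-days`) are hypothetical stand-ins for the tool’s real keys; the dict matches the shape boto3’s DynamoDB `tag_resource` expects:

```python
def backup_tag_request(table_arn: str, cadence: str, retention_days: int) -> dict:
    """Build TagResource parameters that opt a table into scheduled
    backups. The tag keys are illustrative, not the tool's actual keys."""
    return {
        "ResourceArn": table_arn,
        "Tags": [
            {"Key": "backup-cadence", "Value": cadence},
            {"Key": "backup-retention-days", "Value": str(retention_days)},
        ],
    }

# Account ID and retention are illustrative
params = backup_tag_request(
    "arn:aws:dynamodb:us-east-1:123456789012:table/supply-stock-events",
    cadence="daily",
    retention_days=35,
)
```

The dict would be passed to `boto3.client("dynamodb").tag_resource(**params)`.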
Global Secondary Indexes (GSIs)
To get even more optimized, we can look at things like migrating our
tables that have a lot of actively used GSIs to a relational datastore.
Amazon Aurora is an option as a relational replacement for our DynamoDB
tables; it supports automatic backups and, based on some
back-of-the-napkin math, would drop cost even further. However, the
migration option will take
careful consideration, since migrating and refactoring data sources is a
fairly large and complex effort.
Partition keys on some of our tables have caused some provisioning
challenges that impact our cost as well. As mentioned above, we have
“GSI bloat” on our tables. Some of those GSIs could have been avoided
by not using Universally Unique Identifiers (UUIDs) (e.g.
02f44553-37fe-3070-bc04-59d65433968a) as partition keys. Granted, UUIDs
theoretically give a good distribution over DynamoDB partitions, but
they also make it difficult to query tables as efficiently as when there
are more meaningful key values to query.
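To make the trade-off concrete: with a UUID partition key, any store-level question forces a GSI, whereas a composite key built from values callers already hold can be answered from the base table. The key names and layout below are illustrative, not SIM’s actual schema:

```python
def event_keys(store_id: str, event_time: str, event_id: str) -> dict:
    """Composite keys: partition by store, sort by timestamp plus a
    uniquifier, so "events for store X in window Y" is a plain Query
    with a key condition rather than a GSI lookup."""
    return {
        "pk": f"STORE#{store_id}",
        "sk": f"EVENT#{event_time}#{event_id}",
    }

keys = event_keys("1234", "2019-06-01T12:00:00Z", "02f44553")
# A Query with KeyConditionExpression
#   pk = :pk AND begins_with(sk, :prefix)
# then retrieves one store's events for a day with no GSI involved.
```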
On the flip side of the UUID key strategy, we have tables that use
legacy store identifiers for the partition key value, which causes a
different issue altogether: partition hotspots. Hotspots are caused when
the values used to partition the data land the data in a small subset of
the partitions of a table far more than the other partitions. RCUs and
WCUs are distributed evenly across the number of partitions supporting
the table. If the data being written is bunched into a single
partition, the table’s overall capacity can be raised drastically while
only a small share of that capacity lands on the one partition that is
actually throttling. In order to prevent throttling, the table needs
to be provisioned high enough that the instances supporting each
partition can support the hotspot. This can be enormously expensive and
wasteful. An example of a table doing this in SIM is
supply-stock-journal (also called “supply-stock-ledger”), which is an
aggregate view of the store stock events.
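One mitigation AWS documents for this kind of hotspot is write sharding: append a small, deterministic suffix to the hot partition key so one store’s writes spread across several partitions, and fan reads out across the suffixes. A sketch, with the shard count as an illustrative choice:

```python
import hashlib

SHARDS = 10  # illustrative; sized to the hottest store's write rate

def sharded_pk(store_id: str, event_id: str) -> str:
    """Spread one store's writes over SHARDS partition key values by
    hashing a per-item field into a deterministic suffix."""
    suffix = int(hashlib.sha256(event_id.encode()).hexdigest(), 16) % SHARDS
    return f"{store_id}#{suffix}"

def all_shard_pks(store_id: str) -> list:
    """Reads fan out: query each shard's key and merge the results."""
    return [f"{store_id}#{i}" for i in range(SHARDS)]
```

The cost is read-side fan-out, which is why a redesigned key or a relational store is the better long-term fix here.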
Neither of these partition key strategies are recommended, and the
outcomes of using them are well-documented in AWS blogs. To solve these
issues, we’d have to redesign the partition keys on the tables, which
would be a significant and complex refactor. As called out above in the
GSI section, putting this kind of effort into migrating to a relational
datastore would prove to be a better investment.
The cost of the supply-stock-events table was quite high when we
started, and with proper autoscaling in place we were able to drop its
DynamoDB cost by 85%. Once the Data Pipeline jobs are
only running once per day, the cost is projected to drop to nearly 5% of
the original cost.
To recap what we learned in our efforts to optimize SIM tables:
- Over-provisioning tables to support heavy traffic spikes can be difficult to coordinate and requires manual processes to support.
- A lot of GSIs on tables to support many access patterns is not optimal.
- Poor partition key construction causes hotspots and over-provisioning to support load on those hotspots.
- It is not optimal to use a batch job strategy to copy data and use the single source for both disaster recovery backups and as data sources for EMR jobs.
The choice of datastore is the major consideration that I think would
have prevented a lot of the over-spending seen here. Looking at the
load requirements, the access patterns, and the data recovery needs,
relational databases would have been a better fit for a lot of the SIM
use cases at the time. When DynamoDB was chosen for SIM, the AWS RDS
product already supported configurable snapshot backups. Those backups
would have resolved the data recovery concern, and a relational
database would have addressed the key composition and GSI issues as
well. DynamoDB autoscaling
and On-Demand Backups were not available when the decision to use
DynamoDB was made, but they have proven to be valuable in reducing the
impacts we’ve seen here. If there was an architectural requirement to
use DynamoDB when it was chosen for SIM, I think a less read-intensive
backup solution would have been something to consider, as well as
better composition of partition keys.
The last point to cover is the cookie-cutter aspect of our services and
how using the rinse-and-repeat model replicated these issues throughout
the SIM system. To prevent spreading impact throughout a system like
this, it is important that teams reusing patterns know why they are
using the patterns and what problems the patterns are solving.
SIM is still under active feature development in production and has
been a great success for Nike. As we move forward building features and
maintaining our system, we have put careful thought into which
technologies we choose to implement solutions for our consumers and to
refactor our older services. The learnings covered here have helped us
understand the value of upfront exploratory investment.
Interested in joining the Nike Team? Check out career opportunities here.