Building Object Store — Storing Images in Cassandra at Walmart Scale
“If necessity is the mother of all invention(and innovation), then constraints (especially time) is definitely the father.” Last year at WalmartLabs, we faced the need to come up with a solution, without the luxury of a vast time frame on which to pivot. We learned a lot along the way and it is that journey which is the backdrop to the story I’m about to tell.
Walmart has been using proprietary technology to store all of its product images and other assets. As Walmart started scaling assortment massively, we were facing severe scaling issues for setting up the product images. Although setting up of product images was slow and somewhat error prone, not to mention that they resided in third party data center, there was always something more important to work on until one fine day we were told that the old system is going to be decommissioned. We had five months to not only build a new system but also migrate 100’s of millions of images from old to new without impacting user experience! In the spirit of keeping this blog concise, I am going to talk about only the most critical things that came up in our journey.
We knew the degree of the specifics we can chalk out now would help us drive the design later and ultimately determine the degree of success. We also choose to view this system as a generic Object Store whose first big use case was storing images. Most critical needs were:
- Capable for storing 100’s of millions of product images & assets
- Highly Available and Extremely Fault Tolerant
- Linear scalability
- Have a 99 percentile response time of not more than 2 digits in milliseconds
All of the above needs are directly tied to the fact that any issue in this system is going to impact customer experience (products with no images) in Walmart.com and at Walmart’s scale loss of revenue in thousands of dollars. Yes, a picture is worth a lot more than a thousand words in the e-commerce world!
What would be extremely fundamental to this platform would be choosing where and how to store these objects/images. The first obvious question was In-house vs Outside Vendor? Given that
- We have been using an outside vendor for the last several years and things didn’t look very good when we started scaling
- We are talking about 100’s of millions of product images for Walmart.com
- Understanding the exact nuances of our use-cases and wanting to build a generic object store at scale who first use-case being images
We decided to go ahead with an in-house solution. Picking the right technology was going to be one of the core decisions that we needed to get right. Listing down the things that would drive this decision
- Deployment strategy of this highly available system would be multi-datacenter, multi-cloud which means whatever we pick would need to support such an architecture seamlessly
- Linear scalability as data grows to hundreds of tera-bytes meant an inherently distributed system
- Fault Tolerance and data redundancy must be supported out of the box to support high availability
Ceph based storage supported by one of the internal platform teams capable of storing peta-bytes of data was available. It looked promising and we were excited! It fulfilled the above needs, well almost. While evaluating we soon realized
- Ceph based storage would not be available until the end of the year in multiple data centers where applications handling site traffic are deployed
- Storage Cluster would be shared with multiple other teams who would be contributing their own traffic meant unpredictable read/write performance which is unacceptable
- Data nodes holding the data in the cluster were not SSD’s which is fine for batch systems but is not preferred for real time mission critical systems
In the spirit of letting numbers decide we did some performance testing which confirmed the above assumptions. We did not get predictable performance and many times 95 percentile reads/writes times of images were in 10’s of seconds for the Ceph based storage.
Cassandra - distributed database known for high availability and linear scalability was something we were already familiar with and used in the past. But isn’t it counter intuitive to store images which are essentially files in a database? Probably but when you are operating at Walmart’s massive scale, most rules of thumb break anyway. And we thought it’s best to test existing beliefs/intuition, what you already know time and again and let the machine decide!
- Cassandra satisfied our requirements completely, we could get bare metal machines with SSD’s and have them in all the data centers we wanted plus Cassandra has cross datacenter replication out of the box
- Having used Cassandra previously, we already had the necessary expertise within the team
- By owning everything end-end, we could now look at tweaking whatever we wanted, however we wanted to support our needs
Letting numbers guide us again, we did some performance testing and the read/write numbers (95th and 99th percentile) were simply outstanding for 90% of our use cases! Our performance tests involved storing millions of images of different sizes worth terabytes of data. The remaining 10% were all large images with 10’s of MB in size and the response times were still acceptable. It was pretty clear — we were going to store images in a database.
Cassandra supports the blob data type and it is intuitive to use it to store objects. Stripping down our schema, in its simplest form looked like this
CREATE TABLE object_store (
PRIMARY KEY (key)
This worked just fine until we tested for objects, more specifically images of varying sizes and a pattern began to emerge. Read/write of high resolution large images got slower when under load. After further investigation and observing results, we learnt that storing/Reading big objects were increasing the GC pressure on Cassandra node as well as the application since no streaming is possible without reading the file completely. CQL does not support streaming (yet).
One intuitive approach would be to
- Split up larger objects into smaller chunks and store them in different nodes of the Cassandra cluster
- Perform async parallel write of object chunks to several nodes at once speeding up the total write time taken
- Perform async parallel read of object chunks and do the aggregation at the application level
- Even better would be to directly stream the chunks in-order using transfer-encoding Chunked supported by Http protocol, thereby eliminating the need for the application to have anything in memory
We performance tested this idea and were happy to note that splitting and storing chunks of large objects resulted in 3X improvement in the 95th and 99th percentile response times for read and write. We also load tested create-update ratio of objects small and big which is more or less 90–10 and its effect on compaction. We did notice some compaction activity and but it’s effect on read/write performance of objects was negligible. Tuning specific Cassandra connection parameters and settings is part of a separate post, so I wont go into those details for now.
Is less more? We have been toying with the idea of writing absolutely minimal code to get things done without compromising readability. But is minimal simple? Simplicity is the hallmark of readability which is directly proportional to ease of maintenance. So along with minimal codebase, we wanted a simple codebase. One way of looking at simplicity in code is
- Viewing everything from a single threaded perspective while actually being fully multi-threaded thereby not having to worry about synchronization/locking
- Expressing error handling in a concise way which can otherwise bloat the code base very quickly
- Writing code declaratively following the thought process of focussing on what needs to happen instead of how
Reactive Paradigm where we viewed Cassandra database as an Observable and reacting to data coming in an asynchronous way was very intriguing. RxJava gave us the constructs to
- Write simple, non-blocking code reacting to stream of events
- Guaranteed to be run by a single thread at any given time
- Not having to worry about synchronization, locks, blocking and other multi-threading issues
- Large number of operators (much like the ones in Functional Programming) with great documentation through marble diagrams
As an example, for one of our use-cases, we were able to express
- Asynchronous, parallel writes/reads
- ‘N’ retries with variable delay on specific exceptions
in less than 15 lines of code! Less code meant less potential for bugs. We micro-bench marked this piece of code and saw an order of magnitude increase in write/read performance against an internal library used previously. On the flip side of things,
- Thinking functionally, writing higher-order functions and reacting to streams of data needs a big leap in thinking (it is well worth the effort to write beautiful succinct code)
- Debugging the sequence of events expressed in code takes a while to get used to
The application was going to be a multi-dc (Network Topology strategy), multi-cloud from a deployment standpoint for extremely high-availability. Redundancy factor was set to 3. Cassandra consistency level was set to EACH_QUORUM for writes and LOCAL_QUORUM for reads. With these settings, we could afford to have one node down per object and still have no impact. This provided a good balance of consistency and availability keeping all our use cases for object store in mind.
Resize on the fly - Images
Walmart.com requires several different sizes of an image for a product. These differently sized product images are used in several places within the site like search, shelves etc. Either
- These resized images can be created when the original image is setup and fetched directly at runtime
- They can be created on the fly at run time when that specific size is requested
As we are going to use CDN’s to cache product images, we decided to go with option 2 as these resized images would be cached after the very first expensive call when the image requested is resized. Further investigations revealed 5 times reduction in data size and the approach seemed well worth it.
Migrating 100’s of millions of images from old system to the new Object Store was going to be a mammoth task and hence we decided to do it in multiple phases in the below order
- Released the Object Store only for any newly created products and any new updates to existing images coming in. After observing for a week and validating that things are working as expected
- Created migration api’s which given an product id would move the images from old system to new system, update meta-data etc
- Migrated millions of products images in batches per week over a course of several weeks
- Lastly created Http 301 permanent redirects from old image url to the new one to handle dangling links which are going to be a lot in the first few months
Some of the fun things we learned include:
- Being generic is good but focusing on the most important use cases will help drive design and attend to things that matter
- When faced with multiple ways of building a system, it’s best to validate these design options by running them through load, getting the numbers and then let the machine decide. Even better would be to micro-bench mark snippets of code during development. It can really help decide between one way versus another
- Rules of thumb most likely will break when operating at scale and hence it’s best to validate ideas and beliefs
- Time is a always an important factor and hence it’s best to start somewhere reasonable, move fast, fail fast, and evolve
I hope you found some things of interest in this post and have enjoyed this blog. Please keep an eye out for more to come!