How Flipkart reduced their data size by 96%

Rahul
3 min read · Nov 10, 2022

Learn how Flipkart decreased their data needs drastically without any change in performance.

Here’s the original article from Flipkart. It covers Flipkart’s Ratings & Reviews system and its cloud and data architecture.

Here we will not discuss their architecture, only how they reduced their data needs so drastically.

Currently, they are storing the ids of ratings in Redis and materialised views/combined data in Aerospike, for faster access.

Flipkart’s data reduction steps

Data Cleanup

  • Removing unused fields from the data. Store whatever you are using at that point in time, and remove unnecessary data.
  • Review your data often.
  • Reduction in data size by 10%
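
To make the cleanup step concrete, here is a minimal sketch of pruning a record down to the fields that are actually read. The record, its field names, and the `USED_FIELDS` set are all made up for illustration; they are not taken from Flipkart’s article.

```python
import json

# Hypothetical review record; every field name here is illustrative only.
review = {
    "review_id": 1,
    "product_id": 42,
    "rating": 5,
    "title": "Great phone",
    "legacy_score": None,
    "debug_trace": "x" * 200,
    "internal_flags": [],
}

# Fields the read path actually uses (an assumption for this sketch).
USED_FIELDS = {"review_id", "product_id", "rating", "title"}


def prune(record: dict, keep: set) -> dict:
    """Return a copy of the record containing only the fields in `keep`."""
    return {key: value for key, value in record.items() if key in keep}


before = len(json.dumps(review).encode("utf-8"))
after = len(json.dumps(prune(review, USED_FIELDS)).encode("utf-8"))
print(f"{before} bytes -> {after} bytes")
```

How much you save depends entirely on how much dead weight your own records carry, so measure before and after on real data.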

Data Serialisation

Encoding

  • Instead of storing data as JSON, store it in a binary format such as Avro, ProtoBuf, or Thrift.
  • Since JSON is stored as strings, it takes up about 40% more space than binary formats.
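
To see why text encoding costs more, here is a rough comparison using Python’s built-in `struct` module as a stand-in for a schema-based binary format. This is not Avro, ProtoBuf, or Thrift, and the record and its field widths are assumptions made for this sketch.

```python
import json
import struct

# Hypothetical rating record; names and values are illustrative only.
record = {"product_id": 123456, "review_number": 42, "rating": 5, "helpful_votes": 17}

# Text encoding: field names and digits are all stored as characters.
as_json = json.dumps(record).encode("utf-8")

# Fixed binary layout (assumed widths): uint32, uint16, uint8, uint16,
# little-endian with no padding. The field names live in the schema, not the data.
LAYOUT = struct.Struct("<IHBH")
as_binary = LAYOUT.pack(
    record["product_id"],
    record["review_number"],
    record["rating"],
    record["helpful_votes"],
)

decoded = LAYOUT.unpack(as_binary)
print(len(as_json), "bytes as JSON vs", len(as_binary), "bytes as binary")
```

A hand-rolled fixed layout like this has no schema evolution or optional fields, which is exactly what the real formats add, so expect their savings to be smaller than in this toy case.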

Data types

  • Use ENUM types rather than strings.
  • For example, instead of storing values as Goa/Delhi/Karnataka/Andaman and Nicobar, which take up 3/5/9/19 bytes respectively, store them as integers, which always take 1 byte, and use them as enums in code.
  • Use Integers wherever possible.
  • Specify the precision for decimal values, thereby reducing the storage needed to store floats.
  • Optimise the way sequential IDs are stored.
  • For example, skylla-2018-12-25-<uuid>. The prefix and date are 17 characters long and so take up 17 bytes.
  • To reduce the size, convert skylla to an ENUM (which can also be extended in future) and convert the date to a number (days since 1st Jan 2000).
  • This would just need 3 bytes (1 byte for ENUM & 2 bytes for int date) as compared to 17 bytes if they are stored as strings.
  • Reduction in data size by 70%
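
The ID example above can be sketched as follows. The `Prefix` enum, the helper names, and the exact byte layout are assumptions for illustration; the article only describes the idea of 1 byte for the enum plus 2 bytes for the day count.

```python
import struct
from datetime import date
from enum import IntEnum


class Prefix(IntEnum):
    """Hypothetical enum for ID prefixes; more members can be added later."""
    SKYLLA = 1


EPOCH = date(2000, 1, 1)
# 1 unsigned byte for the enum + 2 unsigned bytes for the day count, no padding.
HEADER = struct.Struct("<BH")


def encode_header(prefix: Prefix, day: date) -> bytes:
    """Pack the prefix and date into 3 bytes."""
    return HEADER.pack(prefix, (day - EPOCH).days)


def decode_header(raw: bytes):
    """Recover the prefix and date from the 3-byte header."""
    prefix, days = HEADER.unpack(raw)
    return Prefix(prefix), date.fromordinal(EPOCH.toordinal() + days)


raw = encode_header(Prefix.SKYLLA, date(2018, 12, 25))
print(len("skylla-2018-12-25"), "bytes as text vs", len(raw), "bytes packed")
```

A 2-byte unsigned day count covers 65,535 days from the epoch, which is roughly 179 years, so it is plenty for this kind of ID.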

Data Compression

  • Compress and store data.
  • Use the right compression algorithm for each kind of data.
  • They used Shoco for textual data and Snappy for binary data.
  • Reduction in data size by 35%
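
Here is a minimal compress-and-store round trip. Shoco and Snappy are not part of Python’s standard library, so this sketch swaps in `zlib` purely to show the pattern; the payload is made up and the ratio you get will differ from Flipkart’s.

```python
import zlib

# Made-up, deliberately repetitive payload; real data will compress differently.
payload = b'{"rating": 5, "title": "Great phone"}' * 50

# Compress before writing to the store...
compressed = zlib.compress(payload, level=6)

# ...and decompress after reading it back.
restored = zlib.decompress(compressed)

print(len(payload), "bytes raw vs", len(compressed), "bytes compressed")
```

General-purpose compressors like this do poorly on very short strings, which is the niche a short-string compressor such as Shoco is aimed at.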

UUIDs

  • UUIDs (for example, 5f011f46-c502-4f41-a17f-44c01faaa46f) take up 36 bytes when stored as strings.
  • So, instead of using UUIDs, come up with your own unique ID logic, if possible.
  • Example from them. Let’s assume there are two products Px & Py. The first review written for Px will have Review Number 1. The first review written for Py will also have Review Number 1. A combination of the ProductId & Review Number will uniquely identify a review.
  • Now, instead of the entire UUID of 36 bytes, store just 2 integers of just 2 bytes. That’s a reduction by at least 20x.
  • Reduction in data size by 94%
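
The per-product review number idea can be sketched like this. The `ReviewIds` class, the in-memory counter, and the chosen integer widths (4 bytes for the product ID, 2 bytes for the review number) are assumptions for this example; pick widths that fit your own ID ranges.

```python
import struct
import uuid

# Assumed widths for this sketch: uint32 product ID + uint16 review number.
KEY = struct.Struct("<IH")


class ReviewIds:
    """Hypothetical allocator handing out per-product review numbers."""

    def __init__(self):
        self._last = {}  # product_id -> last review number issued

    def allocate(self, product_id: int) -> bytes:
        """Return a packed (product_id, review_number) key for a new review."""
        number = self._last.get(product_id, 0) + 1
        self._last[product_id] = number
        return KEY.pack(product_id, number)


ids = ReviewIds()
first_px = ids.allocate(10)   # first review for product 10
first_py = ids.allocate(20)   # first review for product 20
second_px = ids.allocate(10)  # second review for product 10

print(len(str(uuid.uuid4())), "bytes as a UUID string vs", len(first_px), "bytes packed")
```

In a real system the counter would have to live somewhere shared and be incremented atomically; the dictionary here only stands in for that.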

Reducing duplicate data

  • Remove any duplicate data that is stored multiple times.

Compress Further

  • Since these are just arrays of IDs/integers, they can be compressed further with an integer compression library such as FastPFor.
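
To give a feel for integer-list compression, here is a small sketch using delta encoding plus variable-length bytes. This is a simpler technique than what FastPFor does, and it is used here only as a stand-in; the function names and sample IDs are made up. It assumes the IDs are non-negative and sorted in ascending order.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int in 7-bit groups, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_ids(sorted_ids) -> bytes:
    """Store each ID as the varint-encoded gap from the previous ID."""
    out = bytearray()
    prev = 0
    for value in sorted_ids:
        out += encode_varint(value - prev)
        prev = value
    return bytes(out)


def decompress_ids(raw: bytes) -> list:
    """Reverse compress_ids: decode the gaps and add them back up."""
    ids, current, shift, prev = [], 0, 0, 0
    for byte in raw:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current
            ids.append(prev)
            current, shift = 0, 0
    return ids


sample = [1000, 1003, 1010, 1500, 90000]
packed = compress_ids(sample)
print(4 * len(sample), "bytes as 32-bit ints vs", len(packed), "bytes packed")
```

The closer together the IDs are, the smaller the gaps and the fewer bytes each one needs.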

In Conclusion…

Using techniques like data cleanup, serialisation, encoding, and compression, Flipkart was able to reduce their data storage needs by 96%.

Let me know if you have any more techniques in mind.

That’s it! Please let me know about your views and comment below for any clarifications or ask them directly on their article 😄

If you found value in reading this, please consider sharing it with your friends and also on social media 🙏

Also, to be notified about my upcoming articles, subscribe to my newsletter below (I’ll not spam you 😂)

You can find me on Twitter and LinkedIn ✌️
