Tech Transformation: Real-time Messaging at Walmart

Ning.Zhang
Walmart Global Tech Blog
Nov 1, 2016
Photo Credit: Apache Kafka

Many leading organizations have reported benefits from Service-Oriented Architecture (SOA), and Walmart likewise re-built its eCommerce website (walmart.com) on SOA and an elastic cloud. An important subset of SOA is message-driven architecture, which serves as a channel for asynchronous communication to decouple tightly bundled components. The result is a more scalable and efficient architecture in which each component or service can be crafted and scaled out independently, communicating with the others through the messaging platform.

Traditional message queues used to be the standard solution for message-driven architecture, but they show inherent flaws when it comes to handling large scale.

Photo Credit: Apache Kafka

When we look at the space of high-throughput messaging solutions today, Apache Kafka has proven to be a top choice: it is a highly scalable, durable, and distributed messaging system in which a single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Kafka has been widely adopted by Silicon Valley technology companies such as Twitter, Netflix, LinkedIn, Uber, Airbnb, and Pinterest, as well as by large IT shops outside the valley, such as Cerner and Goldman Sachs. The Apache Kafka project maintains a public list of notable Kafka users.

WalmartLabs, the technology division of Walmart, has been leveraging Kafka for many different use cases for the past two years. In this blog, we would like to introduce the early Kafka use cases, the transition of our Kafka deployment model, and the lessons learned from this transition.

Early Days

Kafka was adopted by Walmart in early 2014 after many months of surveying the options and even contemplating writing our own messaging system. The primary use case was our tracking-pixel feed, for which we were using a custom fork of Apache Flume combined with a simple shared TCP stream service designed to feed downstream real-time applications.

Another use case was our application log collection service, which was also using Apache Flume. In both cases we were unsatisfied with the performance, flexibility, buffering capacity, and resilience.

Back then, Kafka offered little in the way of delivery guarantees, and most teams used it simply for its high throughput and low latency, which were good enough for them even though a tiny amount of message loss could occur in rare cases.
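
To make that trade-off concrete, here is a minimal sketch (not our production code) of a Java producer tuned for throughput over strict delivery guarantees; the broker address and topic name are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PixelFeedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-local-dc1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=1: the partition leader acknowledges without waiting for replicas.
        // This favors throughput and latency and tolerates rare message loss;
        // acks=all plus retries would strengthen delivery at the cost of latency.
        props.put(ProducerConfig.ACKS_CONFIG, "1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("tracking-pixel", "pageview", "{...}"));
        }
    }
}
```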

Shared Kafka on Bare-metal

We started with a few independent Kafka clusters deployed on bare-metal servers and quickly expanded to seven geographically distributed data centers across the United States.

Of the seven data centers, five host “local” Kafka clusters that contain only local messages. The other two host “aggregation” clusters, each in a different DC for redundancy. We used Kafka MirrorMaker to aggregate messages from the local clusters into the aggregation clusters.
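
Conceptually, MirrorMaker is a consumer on the source cluster chained to a producer on the destination cluster. The sketch below illustrates that idea with the plain Java clients; the cluster addresses, topic name, and group id are hypothetical, and the real MirrorMaker tool handles offset management, batching, and failover far more robustly:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NaiveMirror {
    public static void main(String[] args) {
        // Consumer reads from the "local" cluster (hypothetical address).
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "kafka-local-dc1:9092");
        consumerProps.put("group.id", "mirror-demo");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        // Producer writes to the "aggregation" cluster (hypothetical address).
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "kafka-agg-dc6:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("tracking-pixel"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // Re-publish each record to the same topic on the aggregation cluster.
                    producer.send(new ProducerRecord<>(record.topic(), record.key(), record.value()));
                }
            }
        }
    }
}
```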

Shared Kafka Deployment on Bare-metal with Redundancy

Some applications are only concerned with what is going on in a single data center, so they consume only from their local cluster. Many other applications need a full view of what is going on across all data centers, so they consume from an aggregation cluster.

All Kafka clusters were maintained by a dedicated group of operations folks, while the application teams that wanted to use the Kafka service (a.k.a. tenants) could independently create their topics, plug in their producers and consumers, and start pumping messages through a Kafka cluster; hence this model is called “shared” Kafka.
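
As a rough illustration of that self-service step, here is what topic creation looks like with the modern Java AdminClient (clusters of that era would more likely have used the kafka-topics.sh script); the cluster address, topic name, and sizing below are hypothetical:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTenantTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address of the shared cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-shared-dc1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical tenant topic: 12 partitions, replication factor 3
            NewTopic topic = new NewTopic("team-a.orders", 12, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```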

In the first year of running Kafka, the centrally managed, shared Kafka model worked quite well for both the Kafka tenants and the operations team, because:

  1. Kafka tenants did not have to worry about hardware capacity or routine Kafka operations; they could focus solely on their own application development.
  2. The number of Kafka tenants was still small and each tenant sent a controlled volume of messages, so tenants could still enjoy good enough throughput even while sharing the hardware capacity with others.
  3. Tenants expected nothing more from Kafka than a high-throughput messaging system. In the rare cases when Kafka experienced transient or tiny amounts of message loss, most applications could tolerate it because the business use cases were non-critical, or they handled the bad cases in their own processing.

Soon after, however, as more tenants migrated their applications onto Kafka and came to depend on it, the shared Kafka model started to hit more and more bumps:

  1. Competition for capacity: if one tenant produced a traffic spike to a shared Kafka cluster, the rest of the Kafka tenants were impacted immediately.
  2. Lack of authentication: any tenant could produce messages to another tenant’s Kafka topic or, even worse, change or delete it by mistake.
  3. “Buggy” clients: a software bug in a tenant’s Kafka client implementation could exhaust Kafka resources and block other teams from connecting to the cluster.
  4. No “one size fits all”: as Kafka itself matured into a production-ready project over time, tenants started to set different expectations on configuration, cluster availability, and the reliability of message delivery.

Unavoidably, the shared Kafka model became inefficient as more tenants were on-boarding and planning to use it for critical business. From then on we started to look for a more productive Kafka deployment and operation model, and OneOps came into our vision.

Self-Serving Kafka Powered by Cloud and OneOps

OneOps is a multi-cloud application lifecycle management platform open-sourced by WalmartLabs. On the upstream side, OneOps orchestrates application patterns/packs so that they can be configured, operated, and managed throughout their lifecycle. On the downstream side, OneOps can deploy any application pattern to the major public and private cloud providers, such as OpenStack.

Walmart has been adopting OpenStack as the computing backbone for its eCommerce business, and OneOps is the bridge that connects OpenStack to the applications. Inspired and motivated by what OneOps offers, we started the Kafka transition from the shared bare-metal deployment to a cloud-oriented, decentralized, self-serving deployment.

OneOps fosters a true DevOps culture; part of what that means is that the team that spins up a service through OneOps is responsible for its entire lifecycle, e.g. adding capacity, fine-tuning configurations, upgrading/patching, and bouncing the service when needed. This new OneOps (self-serving) model allows Kafka users to customize and deploy their own Kafka clusters through OneOps, using a production-driven Kafka “OneOps pack”. The business value of the self-serving model can be summarized as:

It is more business-oriented: each team deploys and owns its infrastructure and services to meet its business goals and expectations, rather than relying on shared clusters and services with weak guarantees.

The Kafka pack not only deploys and configures a Kafka cluster automatically, but also deploys a GUI for Kafka. The GUI hosts operational functions and performance monitoring, providing a “one-stop” experience that covers most daily Kafka operational tasks along with comprehensive Kafka-level and server-level performance metrics. In addition, the Kafka pack can alert users through OneOps when things start to break, such as a Kafka process going down, under-replicated partitions, leader loss, low disk space, or very high CPU utilization.
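
The pack’s internal alerting is not shown here, but as a simplified sketch of the kind of check it performs, the Java AdminClient can flag under-replicated partitions (an in-sync replica set smaller than the assigned replica set); the broker address below is a placeholder:

```java
import java.util.Collection;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-team-a:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            Collection<String> topicNames = admin.listTopics().names().get();
            for (TopicDescription desc : admin.describeTopics(topicNames).all().get().values()) {
                for (TopicPartitionInfo p : desc.partitions()) {
                    // A partition is under-replicated when its in-sync replica set
                    // is smaller than its assigned replica set.
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("Under-replicated: %s-%d (isr=%d, replicas=%d)%n",
                                desc.name(), p.partition(), p.isr().size(), p.replicas().size());
                    }
                }
            }
        }
    }
}
```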

By adopting the self-serving Kafka model, the application teams now have full control over operations, capacity, and configuration tuning, leading to guaranteed performance on dedicated resources. The monitoring and alerting greatly enhance the visibility and production-readiness of the messaging service powered by Kafka.

Lessons Learned from the Transition

The self-serving model enables application teams to own and operate their Kafka clusters quickly and easily. Meanwhile, it exposed new challenges to both the Kafka cluster owners and the Kafka pack developers:

Kafka cluster owner: learn and adapt to the new self-serving model and DevOps culture. Owning a Kafka cluster from day one, they were responsible for operating and monitoring it, and especially for learning how to handle common failures quickly and recover with minimal impact.

Kafka pattern/pack developer: provide a robust, highly available, and scalable Kafka deployment strategy. For example, the Kafka pack takes advantage of the OneOps “auto-pilot” feature, which can auto-repair a bad node, auto-replace it when auto-repair fails, and auto-scale the Kafka cluster when the load hits a pre-defined threshold.

The success of running the self-serving Kafka clusters depends on close communication and collaboration between owners and developers. The developers are not only responsible for developing and supporting the pack, but also serve as the internal Kafka experts whom the owners can consult on any Kafka technical question, ranging from best practices, deployment strategy, and configuration tuning to building out the Kafka ecosystem.

In the other direction, the developers take the valuable feedback from the owners’ daily operations and generalize it to improve the Kafka pack in the next iteration. This forms a positive loop that continually accumulates best practices and gives them back to existing and new owners.

Summary

This is not the end of our Kafka story: more technical details remain to be shared on the deployment strategy, our internal Kafka fork, the monitoring pipeline, and high availability across data centers.

Stay tuned for the deep-dives about Kafka!

Twitter: @ningZhang6
