Retail services inside a Walmart store, such as pharmacy, auto-care, or wireless-activation services, are generally multi-step, long-running workflows that require an associate to help the customer during certain parts of the flow. The technology solutions supporting these services have evolved over many years and are highly optimised for local operations within the store. As Walmart makes the right technology investments to move many of these workflows to the cloud, one capability that will enable efficient work and labour management in store is the ability to distribute work to remote facilities or other lightly loaded neighbourhood stores that have qualified associates with free time at hand.
This paper describes one such implementation: a centralised work dispenser that routes the remote-qualified steps of a workflow to a remote facility or neighbouring stores based on business rules and constraints.
Routing work is a standard computer science problem. As in almost all routing problems, there is a queue of pending tasks and a set of workers who can work on them, and an algorithm decides which worker works on which task and for how long. The problem is simple when both work and workers are uniform: all work can be added to the same queue, and any worker can take up any task from it. The complexity of routing in this scenario lies in the fact that not all work is the same, and work cannot be routed uniformly to all incoming requestors. For instance, some work cannot be done outside the originating store, primarily for compliance or business-rule reasons. Similarly, not all store associates are the same: some work needs to be handled by a specialist role, and what work they can do is determined by the licences they hold and by business rules. These constraints add complexity to a standard routing problem, and some of these aspects are specifically discussed in this paper.
System Design — Key components
- Work Dispenser — Interface for store and remote associates to submit and get prioritised work.
- Partition Manager (PM) — Responsible for maintaining the prioritised list of work and dispensing it to store and remote associates. A single instance of the partition manager is designed to take responsibility for dispensing work for a subset of stores, to allow for an in-memory implementation of the PM in the future.
- Global View Manager (GVM) — Maintains a near-real-time (NRT) view of the workload available across all stores, scored by priority. This service is used to select a store to dispense work from for a remote associate.
- Central Event Bus (CEB) — Service bus delivering work lifecycle events from the work dispenser to the PM.
- Scoring Service (SS) — Runs at a scheduled frequency, querying routing snapshots from the Routing DB and computing scores (representing the prioritised load on a given store) that are used to determine the prioritised store needing help during get-work. The goal of the scoring algorithm is to find stores that are in need of help and give them a higher score.
- Rules Engine — Interfaces with the business to capture the key constraints on routing during enqueue and dequeue, and provides a rules execution platform for routing components at runtime.
- Profile Service — Supplies information about all the stores where an associate is qualified to work.
- When new work arrives from a store, the work dispenser notifies the PM to enqueue the work. The PM runs the enqueue rules to determine the role, the work execution location, and the qualifications for the work, and pushes the work to the Routing DB.
- Store associates' requests for work are directed to the partition manager, which returns the highest-priority store-eligible work.
- Remote associates are generally qualified to work in more than one store. The GVM first returns the most suitable store, using scores generated by the Scoring Service, to identify the PM, which then returns the highest-priority work for that store after applying the required constraints for the requesting associate.
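The remote get-work flow above can be sketched as follows. This is a minimal in-memory stand-in for the Profile Service, GVM, and PM interplay; all class names, method names, and sample data are illustrative assumptions, not the actual service APIs, and the fallback to lower-scored stores is omitted for brevity.

```java
import java.util.*;

// Sketch of remote dispatch: Profile Service -> GVM (pick store by score) -> PM (dequeue).
public class RemoteDispatch {
    // Profile Service stand-in: associate -> stores they are qualified for (assumed data)
    static final Map<String, List<String>> QUALIFIED =
            Map.of("assoc-1", List.of("store-A", "store-B"));
    // GVM stand-in: NRT score per store (higher score = more in need of help)
    static final Map<String, Double> STORE_SCORES =
            Map.of("store-A", 0.4, "store-B", 0.9);
    // PM stand-in: per-store queues of remotable work, ordered by priority
    static final Map<String, Deque<String>> REMOTABLE_WORK =
            Map.of("store-A", new ArrayDeque<>(List.of("rx-refill-17")),
                   "store-B", new ArrayDeque<>(List.of("rx-verify-42")));

    static Optional<String> getWork(String associateId) {
        List<String> stores = QUALIFIED.getOrDefault(associateId, List.of());
        // GVM picks the single highest-scored store the associate can serve...
        return stores.stream()
                .max(Comparator.comparingDouble((String s) -> STORE_SCORES.getOrDefault(s, 0.0)))
                // ...and that store's PM returns its highest-priority remotable work
                .map(store -> REMOTABLE_WORK.getOrDefault(store, new ArrayDeque<>()).poll());
    }

    public static void main(String[] args) {
        // store-B has the higher score, so its work is dispensed first
        System.out.println(getWork("assoc-1").orElse("no work"));
    }
}
```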
Partition Manager — Design Evolution
The Partition Manager is a key component of this system, as it is responsible for maintaining the queue of work at scale and for running the algorithm that finds the best-suited work for an associate during a get-work call, at high concurrency and low latency.
The following are the key design attributes that drive the technology choices for the routing engine.
- Latency — the time taken to make a routing decision under concurrent load.
- Scale — the number of concurrent requests that a given partition can handle.
- Duplication — the amount of work picked up by multiple associates and hence wasted.
The priority of work is determined by its estimated completion time, so a simple solution would have been to put all the work in a single sorted set. This would have hurt concurrent access, as every request would have locked the set, which in turn would have increased latency. To enable concurrent access, work is divided into slots: slots are ordered by priority, and all work in a slot is considered to be of the same priority. Work is slotted based on its estimated completion time into an existing slot, or a new slot is created if none is available. A concurrent data structure like Java's ConcurrentHashMap allowed multiple threads to browse a slot, and a lock on each work item avoided race conditions.
The complete data structure is locked only to add or remove slots; as a further optimisation, slots were pre-created for the day and empty slots were deleted only at the end of the day.
Two instances of this data structure were required for each store: one for store-bound work and the other for remotable work.
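A minimal sketch of the slotted structure described above, assuming slots keyed by a fixed-width priority bucket derived from estimated completion time, and an atomic flag per work item providing the try-lock that avoids duplication. Names and the slot width are illustrative assumptions.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Slots are ordered by priority; work within a slot is treated as equal priority.
// A per-item try-lock lets concurrent readers browse a slot without locking the set.
public class SlottedWorkQueue {
    static final long SLOT_WIDTH_MILLIS = 15 * 60 * 1000; // assumed: one slot per 15 min

    static final class Work {
        final String id;
        final AtomicBoolean claimed = new AtomicBoolean(false); // per-item lock
        Work(String id) { this.id = id; }
    }

    // slot key (priority bucket, earlier completion = higher priority) -> items in the slot
    private final ConcurrentSkipListMap<Long, ConcurrentHashMap<String, Work>> slots =
            new ConcurrentSkipListMap<>();

    public void enqueue(String workId, long estCompletionMillis) {
        long slot = estCompletionMillis / SLOT_WIDTH_MILLIS;
        slots.computeIfAbsent(slot, k -> new ConcurrentHashMap<>())
             .put(workId, new Work(workId));
    }

    // Browse slots in priority order; claim the first unclaimed item (try-lock semantics).
    public Optional<String> dequeue() {
        for (Map<String, Work> slot : slots.values()) {
            for (Work w : slot.values()) {
                if (w.claimed.compareAndSet(false, true)) {
                    slot.remove(w.id);
                    return Optional.of(w.id);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        SlottedWorkQueue q = new SlottedWorkQueue();
        q.enqueue("late-work", 60 * 60 * 1000);   // completes later, lower priority
        q.enqueue("urgent-work", 5 * 60 * 1000);  // completes sooner, higher priority
        System.out.println(q.dequeue().orElse("empty")); // urgent-work comes out first
    }
}
```

The `compareAndSet` on each item is what gives browse-then-claim semantics: threads that lose the race simply move to the next item rather than blocking, mirroring the READPAST behaviour described later for the database-backed design.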
- Building a stateful service in a containerised fashion is hard, as container platforms are targeted more at stateless micro-services. The choices left were either to run a single-instance cluster or to build an inefficient solution to keep multiple instances of the partition manager in sync.
- Complexity increased manifold when bootstrapping state during a service restart came into the picture. Bootstrapping took a long time, which meant extended downtime for the partition manager. Also, containers can restart frequently and without notice, so the availability of the partition manager was severely affected.
- The Central Event Bus was used to sync state between instances of the partition manager, and at high loads it was observed that the bulk of an instance's time was spent processing state-sync messages from other instances.
Centralised Persistent Store
The complexity of building a stateful service led to making the partition manager stateless, with state stored in a central persistence store. While a central persistence store provides an easy way to build and query a sorted set of work records, latency at high concurrency with minimal locking was still a concern with this approach. All get-work requests contend for the highest-priority work, leading to row contention with the potential to increase latency. SQL Server provides query hints as a way to work around this problem, using the ROWLOCK and READPAST hints. The ROWLOCK hint specifies that a row-level lock should be taken in place of a more restrictive table- or page-level lock. The READPAST hint specifies that if multiple requests try to select the same row, only one request gets the lock and the remaining requests move past it and select the next row. The combination of these two hints provides try-lock semantics similar to the in-memory data structure.
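A hypothetical dequeue query using the hints described above. The table and column names are assumptions, and the UPDLOCK hint is added as a commonly recommended companion (so the selected row is locked for update rather than shared-locked); the source text names only ROWLOCK and READPAST.

```java
// Builds the SQL Server dequeue statement: contending requests skip rows that are
// already locked (READPAST) instead of blocking on the highest-priority row, with
// locking kept at row granularity (ROWLOCK), giving try-lock semantics.
public class DequeueQuery {
    static String dequeueSql() {
        return """
            WITH next AS (
                SELECT TOP (1) *
                FROM work_queue WITH (ROWLOCK, READPAST, UPDLOCK)
                WHERE status = 'PENDING'
                ORDER BY priority_score DESC
            )
            UPDATE next SET status = 'DISPENSED'
            OUTPUT inserted.work_id;
            """;
    }

    public static void main(String[] args) {
        System.out.println(dequeueSql());
    }
}
```

Under this pattern, N concurrent get-work calls each lock and dispense a different row in a single round trip, rather than queueing behind a lock on the top row.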
For performance comparison, both storage modes were tested in a three-node cluster at multiple throughputs. At elementary load the performance of database and in-memory storage was comparable, with in-memory marginally faster, but at high load there was no comparison: SQL Server was miles ahead of in-memory storage. The reason for the poor performance of in-memory storage was the heavy background sync work done by the instances of the cluster, a side effect of maintaining a stateful system.
- Score optimisation — The per-store score plays a key role in avoiding starvation and making sure the right store gets attention. For the first version, a simple scoring algorithm was chosen that takes into consideration only the current workload in a given store. In future, aspects like the users logged in to a store and their throughput can be taken into account to identify stores that need help, rather than relying on current workload alone. The idea is to learn from each rollout and eventually make the scoring algorithm ML-based.
- Improve monitoring and alerting — There could be cases where work that needs to be routed is still sitting in the system. To account for this, a system can watch all incoming work and alert the routing engine if work is still seen as incomplete after the average work-route time has been exceeded. This becomes a hint for the routing engine to route the work at the highest priority to the next available associate.
- Newer business needs — Accommodating varying constraints around how work should be prioritised will require design changes when they cannot be expressed as business rule changes. One example: in the current design, work is prioritised at the time of ingress into the routing engine; this might need to change if requirements move to a model of dynamic priority that varies with time, rather than a static priority determined at ingestion.
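The first-version scoring described above, based on current workload only, can be sketched as a simple aggregate over a store's pending work. The deadline-based weighting, field names, and constants here are illustrative assumptions, not the algorithm actually deployed.

```java
import java.util.List;

// v1-style scoring sketch: score a store purely by its current pending workload,
// weighting work nearer its estimated completion time more heavily so stores at
// risk of missing deadlines rank higher. Weighting scheme is an assumption.
public class StoreScorer {
    public record Pending(long millisToEstCompletion) {}

    static double score(List<Pending> pendingWork) {
        double score = 0.0;
        for (Pending p : pendingWork) {
            // Clamp to at least one minute to avoid dividing by ~zero for overdue work.
            double hoursLeft = Math.max(p.millisToEstCompletion(), 60_000) / 3_600_000.0;
            score += 1.0 / hoursLeft; // closer deadline -> larger contribution
        }
        return score;
    }

    public static void main(String[] args) {
        double busy = score(List.of(new Pending(30 * 60_000L), new Pending(60 * 60_000L)));
        double idle = score(List.of(new Pending(4 * 3_600_000L)));
        System.out.println(busy > idle); // the busier, more urgent store scores higher
    }
}
```

A deliberately simple function like this is easy to recompute at the Scoring Service's scheduled frequency, and leaves room to fold in logged-in users and associate throughput, or an ML-learned model, in later versions as the text suggests.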