Hey Jeff, great post.
Ben Stahl

Hey Ben,

Thanks for the comment. The previous system did not implement jitter. In the post I simplified the overload case somewhat. In some write patterns we saw many calls clustered together at the same moment because of retries, and in those cases jitter would have made some difference. Most of the time, however, the issue was quite literally one of bandwidth and throughput: there were simply too many slow calls in flight at once for the single Redis instance powering a shard to handle, even with those calls more or less evenly spaced out. I did think about introducing jitter in the places where the system knew it was about to issue a large number of simultaneous calls, but those places were pretty infrequent, and the majority of calls originated at totally independent locations in our system.
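
For anyone unfamiliar with the term, here is a rough sketch of what jitter on retries usually looks like. It is not code from the system described in the post. The function names, the `base` and `cap` values, and the choice of `ConnectionError` as the retryable failure are all assumptions made for illustration.

```python
import random
import time


def backoff_delays(max_retries, base=0.1, cap=5.0, rng=random):
    """Yield one randomized sleep duration (in seconds) per retry.

    base and cap are illustrative defaults, not values from the post.
    """
    for attempt in range(max_retries):
        # Exponentially growing upper bound, limited by cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Pick a random point below the bound so that clients which
        # failed at the same moment do not all retry at the same moment.
        yield rng.uniform(0.0, ceiling)


def call_with_retries(operation, max_retries=5, sleep=time.sleep):
    """Run operation(), retrying on ConnectionError with jittered waits.

    ConnectionError stands in for whatever error a real client raises.
    """
    for delay in backoff_delays(max_retries):
        try:
            return operation()
        except ConnectionError:
            sleep(delay)
    # Last attempt: let any error propagate to the caller.
    return operation()
```

Note that this only spreads retries out in time. It does not reduce the total number of calls, which is why it would not have helped in the case where a shard was simply saturated.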

Hope that helps! Happy to elaborate further if you’re interested.