Better Prioritize LLM Tasks for Higher System Throughput

How to replace the naive “first-come-first-serve” rule

Benjamin Marie

As demand surges, efficient scheduling of LLM tasks is crucial to ensure high-quality service, minimizing latency for users while maximizing overall system throughput.

Traditional first-come-first-serve (FCFS) scheduling often leads to significant delays, particularly under high load, due to Head-Of-Line (HOL) blocking. Although shortest-job-first (SJF) and shortest-remaining-time-first (SRTF) scheduling algorithms are known to reduce average latency, they are rarely implemented because they require knowledge of request lengths, which are typically assumed to be difficult to predict.
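The latency gap between FCFS and SJF is easy to see in a toy single-server simulation. The sketch below is illustrative only (the request lengths are made up); it measures each request's completion time when jobs run back-to-back, so a long job at the head of the FCFS queue delays everything behind it:

```python
# Toy comparison of FCFS vs SJF average completion time.
# Request "lengths" stand in for LLM generation lengths (hypothetical values).

def average_completion_time(lengths):
    """Mean completion time when jobs run back-to-back in the given order."""
    clock, total = 0, 0
    for n in lengths:
        clock += n          # job finishes after all earlier jobs plus itself
        total += clock
    return total / len(lengths)

requests = [50, 3, 40, 2, 30, 1]  # hypothetical generation lengths (tokens)

fcfs = average_completion_time(requests)          # serve in arrival order
sjf = average_completion_time(sorted(requests))   # serve shortest first

print(f"FCFS average completion time: {fcfs:.1f}")  # ~90.3
print(f"SJF  average completion time: {sjf:.1f}")   # ~41.3
```

With these numbers, SJF roughly halves the average completion time, because the short requests no longer wait behind the 50-token job at the head of the line.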

The paper "Efficient LLM Scheduling by Learning to Rank" challenges this assumption, arguing that precise knowledge of request lengths isn't necessary: instead, knowing just the relative order of request lengths can be sufficient for effective scheduling.


To measure how closely a predicted schedule aligns with the ideal SJF/SRTF schedule, the authors propose using Kendall’s Tau, a rank correlation coefficient. They demonstrate that higher similarity to the ideal schedule, as indicated by Kendall’s Tau, generally results in lower latency and improved…
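As a rough illustration of the metric, Kendall's Tau counts pairs of requests that the two orderings agree on (concordant) versus disagree on (discordant); it is 1 when a predictor ranks every pair the same way as the ideal shortest-first schedule and -1 when it reverses every pair. The values below are invented, and the paper's actual predictor is a learned ranker, not the hand-written lists shown here:

```python
# Minimal Kendall's Tau (tau-a) between two score lists of equal length.
from itertools import combinations

def kendall_tau(x, y):
    """(concordant - discordant) pairs, normalized by the total pair count."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

true_lengths = [120, 15, 60, 200, 30]   # actual generation lengths (made up)
predicted    = [100, 20, 80, 150, 25]   # predictor's estimates (made up)

# The estimates are inaccurate, but they rank every pair correctly,
# so tau is 1.0 and the induced schedule matches the ideal SJF order.
print(kendall_tau(true_lengths, predicted))  # 1.0
```

This is exactly the paper's point: the predictor above gets every absolute length wrong, yet it would still produce the ideal SJF schedule, because only the relative order matters.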
