Better Prioritize LLM Tasks for Higher System Throughput
How to replace the naive “first-come-first-serve” rule
As demand surges, efficient scheduling of LLM tasks is crucial to ensure high-quality service, minimizing latency for users while maximizing overall system throughput.
Traditional first-come-first-serve (FCFS) scheduling often leads to significant delays, particularly under high load, due to Head-Of-Line (HOL) blocking. Although shortest-job-first (SJF) and shortest-remaining-time-first (SRTF) scheduling algorithms are known to reduce average latency, they are rarely implemented because they require knowledge of request lengths, which are typically assumed to be difficult to predict.
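To see why HOL blocking hurts, consider a toy single-worker queue (a hedged sketch, not taken from the paper; the request lengths are hypothetical generation times):

```python
# Toy comparison of FCFS vs. SJF on one worker: each request's "length" is a
# hypothetical generation time in seconds; completion time = time until it finishes.

def avg_completion_time(lengths):
    """Average completion time when requests are served one at a time, in order."""
    clock, total = 0.0, 0.0
    for length in lengths:
        clock += length      # this request finishes only after all earlier ones
        total += clock
    return total / len(lengths)

requests = [30.0, 2.0, 3.0, 1.0]              # one long request arrived first
print(avg_completion_time(requests))           # FCFS: 33.25s (short jobs stuck behind the long one)
print(avg_completion_time(sorted(requests)))   # SJF:  11.50s (short jobs finish first)
```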
The paper Efficient LLM Scheduling by Learning to Rank challenges this assumption, arguing that precise knowledge of request lengths isn't necessary: just knowing the relative order of request lengths can be sufficient for effective scheduling.
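A minimal sketch of that idea (the class, function names, and the predictor are assumptions for illustration, not the paper's implementation): the scheduler only needs scores whose ordering tracks the ordering of true output lengths, so it can sort the pending queue by predicted rank instead of by exact length.

```python
# Rank-based scheduling sketch: approximate SJF using only a relative ordering.
# `predict_rank_score` stands in for a learned ranker and is hypothetical here.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Request:
    prompt: str
    rank_score: float = field(default=0.0)  # lower score = expected shorter output

def schedule(pending: List[Request], predict_rank_score: Callable[[str], float]) -> List[Request]:
    """Return pending requests in predicted-shortest-first order."""
    for req in pending:
        req.rank_score = predict_rank_score(req.prompt)
    return sorted(pending, key=lambda r: r.rank_score)
```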
To measure how closely a predicted schedule aligns with the ideal SJF/SRTF schedule, the authors propose using Kendall's Tau, a rank correlation coefficient. They demonstrate that higher similarity to the ideal schedule, as indicated by Kendall's Tau, generally results in lower latency and higher system throughput.
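As a rough illustration (using SciPy's kendalltau; the numbers below are made up), a predicted ranking whose order matches the true output lengths yields a Tau near 1.0:

```python
# Kendall's Tau compares two orderings: 1.0 means the predicted ranking orders
# requests exactly like their true output lengths; -1.0 means fully reversed.
from scipy.stats import kendalltau

true_lengths     = [120, 15, 300, 42, 8]      # actual generated token counts (toy values)
predicted_scores = [0.7, 0.2, 0.9, 0.3, 0.1]  # ranker's scores; only their order matters

tau, p_value = kendalltau(predicted_scores, true_lengths)
print(f"Kendall's Tau = {tau:.2f}")  # closer to 1.0 -> schedule closer to ideal SJF/SRTF
```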