VisSched: An Auction based Scheduler for Vision Workloads on Heterogeneous Processors
In this blog, I will be discussing my recent research work on scheduling vision workloads on heterogeneous processors by exploiting the phase behavior in the execution of these workloads. Moreover, this scheduling scheme uses auction theory with modified payoffs and bids such that the algorithm is theoretically starvation-free. This work has been presented as a poster at DAC 2020, and as a full paper at ESWEEK CASES 2020. It is published in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD).
Due to an upsurge in the number of IoT, and mobile devices that support computer vision and image processing applications, there has been an emergence of new benchmark suites. The peculiar characteristic of these benchmark suites is that they share a lot of common code kernels. Due to the limited compute capability of the IoT devices, these applications are often offloaded to the cloud servers.
Now, the conventional way to accelerate these applications is to use FPGAs or GPUs. However, these devices do not support multi-application concurrency very efficiently. In fact, there are multiple such predictors  that predict the performance of multi-application concurrency on GPUs with the help of CPU statistics before actually deploying such applications on a GPU. Moreover, we target the applications that work well on multicore CPUs and hence we develop specialized scheduling algorithms for them on the asymmetric multicores.
Figure 1 shows the general scheduling setup where N threads compete amongst themselves to gain a scheduling quantum on K resources. This is a complicated problem, however, it fits rightly in the auction theoretic framework. Our overall approach is shown in Figure 2.
Nash Equilibrium is an important concept in Game Theory. The setting assumes multiple players with strategies and payoffs/utilities. In our case, the threads will be the players and they will have a virtual wallet that will be used in the bidding process. The highest bidder will get to execute on the core.
Nash Equilibrium is basically a stable operating point of the strategies for the players. No player can unilaterally change its strategy without getting worse off.
Let us consider the standard Bob and Alice problem, where both of them are given some pocket money by their moms to buy an ice-cream. Both the kids want to get the ice-cream as well as save some money for buying a toy. Figure 3 explains the setup and the scenarios that can occur.
Based on the scenarios in Figure 3, the bid will be some fraction of the wallet balance (W).
Bid = μ * W
The utility or payoff is defined as the satisfaction or advantage that a player gets by consuming a resource, good, or service.
Utility = (1-μ) * W; for winner, 0; for loser
Let us complicate the scenario such that the ice-cream vendor is able to differentiate between the kids who are really needy and those who do not value the ice-cream. We introduce an auctioneer’s fee (F) to quantify the same. The equations then become:
Bid = μ * (W-F)
Utility = (1-μ) * (W-F) ;for winner, 0; for loser
Let us now extend the formulation to the scheduling problem. Unlike Bob and Alice's one-time ice-cream purchasing competition, a scheduling problem involves multiple such competitions among the threads at every scheduling quantum. In such cases, we need to incentivize the players to bid in the current round and not save their money for the future auction rounds. Thus, we introduce a winner’s bonus to tempt the players to bid, win, and get the bonus added to their wallet. Thus the equation for utility now becomes:
Utility = (1-μ) * (W-F) + B; for winner, 0; for loser
Another decision that we need to make is regarding the policy to update the wallet balance after every auction round. For the winner, it can be equal to its utility, however, for the loser, we cannot let it retain its old wallet balance. This is because the loser will then continue losing forever and there will be starvation, which is an undesirable phenomenon in a scheduling problem. Thus, we introduce a loser’s subsidy (L) to increment the loser’s wallet balance by a small amount after every auction round. The new wallet balance then becomes:
Wallet (W_new) = (1-μ) * (W-F) + B; for winner,
Wallet (W_new) =W_old + L; for loser
We use these results from game theory to calculate the Nash Equilibrium:
If the distributions of W-F and B are the same for all the N players, then ∀ i ∈ N, μ_i = (N-1)/N, else we find the equilibrium strategies using Monte Carlo simulations accelerated by AI techniques.
3. Experimental Setup and Workloads
Figure 4 shows the experimental setup of the multicore architecture used in this work. The right side of Figure 4 shows the different kinds of benchmarks suites that were used in the experimentation. We used Tejas architectural simulator for our experimentation.
4. Characterization of Workloads
We perform a characterization of the workloads to gain further insights into their execution. We divide the execution of the workloads into smaller intervals, where each interval is identified by its instruction mix (percentage of different types of instructions such as SSE, ALU, MEM). The instruction mix forms the feature vector of the interval and these intervals are then clustered using K-means clustering. We observe five tightly coupled unique clusters as shown in Figure 5. Each cluster is named on the basis of the dominant instructions in the intervals present in these clusters. Thus, the entire execution of the workloads can be thought of as consisting of phases (intervals with certain dominating instructions). The same phases across the workloads are positively correlated.
The left side of Figure 6 shows the modified scheduling problem. We use the phase behavior obtained from the characterization of the workloads to simplify our scheduling decisions. The right side of Figure 6 shows the flow chart of our scheduling scheme, VisSched. It is self-explanatory.
5. Our Approach
Figure 7 shows the overall design of the hardware scheduler. We use wallet queue to store the current wallet balance of all the threads, phase counters to predict the phase at runtime, a phase table to find the appropriate waiting queue to enqueue a thread, ready queue for each core set, and miss rates for all the threads to calculate the cache costs. We also maintain the PC and the register states in the hardware structures to reduce the overheads of context switching.
5.1 Auction Process
We will skip the details of the phase predictor and the phase-to-core mapping. For the auctioning process, we need to plug in our definitions of the bonus (B), auctioneer’s fee (F), and loser’s subsidy (L). Figure 8 shows our definitions of these terms. Here λ, γ, and κ are the constants (determined empirically). W_base is the base wallet balance and C is the migration cost. ICacheCost is the moving average of the ICache miss rate over the last five intervals. Similarly, we can define the cache costs for other cache types.
5.2 Pattern Table
Figure 9 shows a row of the pattern table. We store the bidding strategies for a combination of the competing bidders in a row of the pattern table. This is done to prevent running Monte-Carlo simulations multiple times and reuse the results.
Figure 10 shows the step by step flow chart to fill the pattern table. The PT generation algorithm here refers to finding the Nash Equilibrium. It is done in the same way as explained above in the background section.
We obtain a 17% performance improvement and 14% ED² improvement over the nearest state-of-the-art technique (as shown in Figure 11) for a bag of tasks. Each bag contains multiple different workloads. This is primarily due to our phase behavior and the novel auction theoretic formulation.
Figure 12 shows that we obtain better fairness using our scheme as compared to the state-of-the-art schemes. Fairness quantifies the relative slowdown of the applications in the bag. The idea is that the applications with minimum and maximum slowdowns should not be far apart. This is mainly due to our augmentations to the auction theoretic framework: bonus, auctioneer’s fee, and loser’s subsidy.
Figure 13 shows the search space of the bidding strategies in the case of 2 bidders and 4 bidders. We observe that the search space is nearly convex with slight concave bumps. We observe that the non-convexity ratio is always ≤5.6%. Hence using a greedy approach such as hill-climbing, we can reach a local optimum. Our aim is to reach a non-trivial solution, N1 as opposed to the trivial solution, N2.
In this work, we propose an auction theoretic scheduling scheme that provides a high degree of fairness and is starvation free. Moreover, we identify the unique phase behavior of the vision workloads. Such behavior is absent from the popular CPU workloads such as Spec and Parsec. We also provide theoretical guarantees regarding the quality of the schedule.
Thank you for reading. That’s all folks! Hope you liked the article :)
For any questions, feel free to read the full paper or direct your questions to my email id: diksha [dot] moolchandani [at] cse [dot] iitd [dot] ac [dot] in.