HG's NOTES on lottery and disk
==============================

LOTTERY AND DISK

Isolation with Flexibility: A Resource Management Framework for Central Servers
http://www.eecs.harvard.edu/~syrah/papers/usenix-00-resmgmt/usenix-00-resmgmt.pdf
Disk Scheduling with Quality of Service Guarantees
http://www.bell-labs.com/project/eclipse/disk.ps

- As a Web site's working set of frequently accessed documents expands, the site may require an increasing share of the server's disk bandwidth in order to offer reasonable responsiveness
- Ticket exchanges: a CPU-intensive application could exchange some of its disk tickets for some of the CPU tickets of an I/O-intensive application
- Supports proportional sharing of disk bandwidth using the hierarchical YFQ algorithm [Bru99b]
- YFQ approximates ideal proportional sharing by maintaining a virtual time function and per-disk queues of outstanding disk requests for each resource principal
- Each queue has an associated finish tag that reflects:
  * the principal's past disk activity
  * its current share of the disk
  * the length of the request at the head of the queue
- Requests from the queues with the smallest finish tags are forwarded to the device driver in small batches that can be reordered by the driver or device to achieve better aggregate throughput
- YFQ fairness:
  * strict fairness may incur excessive disk latency and seek overhead, harming aggregate throughput
  * this can be mitigated by batching
- A principal's disk tickets are active whenever it has an outstanding request
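A minimal sketch of the smallest-finish-tag dispatch just described. The class names, the batch size, and the exact finish-tag formula (start-time/finish-time fair-queueing style, `max(finish, vtime) + length/share`) are my own simplifications, not taken from the YFQ paper:

```python
from collections import deque

class Principal:
    def __init__(self, name, share):
        self.name = name        # resource principal
        self.share = share      # its current share of the disk (from tickets)
        self.queue = deque()    # outstanding request lengths, FIFO
        self.finish = 0.0       # finish tag of the request at the queue head

class YFQSketch:
    def __init__(self, batch_size=2):
        self.batch_size = batch_size
        self.vtime = 0.0        # global virtual time

    def submit(self, p, length):
        p.queue.append(length)
        if len(p.queue) == 1:   # queue was empty: stamp a fresh finish tag
            p.finish = max(p.finish, self.vtime) + length / p.share

    def dispatch(self, principals):
        """Forward a small batch from the backlogged queue(s) with the
        smallest finish tags; the driver/device may reorder the batch."""
        batch = []
        backlogged = [p for p in principals if p.queue]
        while backlogged and len(batch) < self.batch_size:
            p = min(backlogged, key=lambda q: q.finish)
            length = p.queue.popleft()
            batch.append((p.name, length))
            self.vtime = p.finish            # advance virtual time
            if p.queue:                      # re-stamp for the new head request
                p.finish = max(p.finish, self.vtime) + p.queue[0] / p.share
            else:
                backlogged.remove(p)
        return batch
```

A principal with a larger share accumulates finish-tag "cost" more slowly per byte, so its requests are dispatched proportionally more often.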
- To adjust to dynamic changes in the number of active tickets, the base value of a principal's disk tickets is recomputed (using cached values if possible) whenever a request reaches the head of its queue, and this value is used to compute the queue's finish tag
- Conclusion
  * the idea of lottery can be extended across resources, i.e. you can exchange disk tickets for CPU tickets or memory tickets

LOTTERY AND DISK 2

From Lottery and Stride Scheduling (thesis), Section 6.3 Disk Scheduling
- Conventional disk schedulers are designed to maximize disk bandwidth utilization without any consideration for the relative importance of individual requests, e.g. shortest-seek-first, shortest-access-time-first
- Hence, some requests are treated very unfairly in terms of response time; worse, this can lead to starvation
- SCAN helps somewhat, but still exhibits high response times for requests to cylinders near the extreme ends of the disk
- Proportional-share control is also provided for filesystem buffer cache space

LOTTERY AND PHYSICAL MEMORY

- Difficulty: determining which processes are actively competing for memory
- Problem:
  * if a blocked process's memory tickets are considered inactive, the process will lose a large number of pages while blocked
  * alternatively, keeping all memory tickets active even when a process is idle is also bad, because pages belonging to idle processes then tend to remain in memory indefinitely
- Hence, give memory guarantees only to privileged processes that explicitly request them; other processes compete for the unreserved portion of memory not wired by the kernel
- Algorithm:
  * the pageout daemon is altered: pages owned by processes that have not exceeded their memory guarantee are not reclaimed
  * the pageout daemon must continually recompute the current base value of the processes' memory tickets
- He develops the FDC (funded delay cost) algorithm, which roughly approximates proportional-share disk scheduling for small ratios
- FDC computes the merit of each request as FD/C, where
  * F is the funding associated with the request
  * D is the delay since it was issued
  * C is the actual cost of servicing the request
- FDC services the request with the maximum merit
- FDC sacrifices aggregate performance in order to give preferential treatment to requests with high ticket allocations
- Results
  * high-ticket requests issued by important, time-critical apps are accelerated by FDC
  * low-ticket apps that issue a large number of disk requests are prevented from monopolizing the disk

REMZI's NOTE
============

# 1. Basic Concepts: Tickets represent your share

- Tickets: represent the share of a resource that a process should receive
- The percentage of tickets that a process holds represents its share of the system resource in question
- E.g.: two processes, A with 75 tickets (numbered 0 to 74) and B with 25 tickets (75 to 99)
  + the shares are achieved probabilistically
  + every time slice, hold a lottery: draw a winning ticket, and run the process that holds it

# 2. Ticket Mechanism

- Ticket currency:
  + allows a user with a set of tickets to allocate tickets among their own jobs in whatever currency they would like
  + the system then automatically converts that currency into the correct global value
  + hence, a currency defines a resource management abstraction barrier
  + Example: assume users A and B have each been given 100 tickets. User A is running two jobs, A1 and A2, and gives them each 500 tickets (out of 1000 total) in User A's own currency. User B is running only one job and gives it 10 tickets (out of 10 total).
    User A -> 500 (A's currency) to A1 -> 50  (global currency)
           -> 500 (A's currency) to A2 -> 50  (global currency)
    User B -> 10  (B's currency) to B1 -> 100 (global currency)

- Ticket transfer:
  + a process can temporarily hand off its tickets to another process
  + useful in a client/server setting:
    ~ a client process sends a message to a server asking it to do some work on the client's behalf
    ~ to speed up the work, the client can pass its tickets to the server
    ~ when finished, the server transfers the tickets back to the client
- Ticket inflation:
  + a process can temporarily raise or lower the number of tickets it owns
  + one greedy process could give itself a vast number of tickets and take over the machine
  + hence, it is only applied among mutually trusting processes:
    ~ if any one process knows it needs more CPU time, it can boost its ticket value as a way to reflect that need to the system, all without communicating with any other processes
- Compensation tickets:
  + a client that consumes only a fraction f of its allocated resource quantum can be granted a compensation ticket that inflates its value by 1/f until the client starts its next quantum
  + this permits I/O-bound tasks that use few processor cycles to start quickly
  + Example:
    ~ A and B each hold tickets valued at 400 base units
    ~ thread A always consumes its full 100 ms time quantum
    ~ thread B consumes 20 ms before yielding the processor (f = 1/5)
    ==> thread B is granted a compensation ticket valued at 1600 base units when it yields the processor, so its total funding until its next quantum is 400 * (1/f) = 2000

# 3. Implementation

- Really simple
- Need:
  + a good random number generator to pick the winning ticket,
  + a simple data structure to track the processes of the system (e.g., a list),
  + the total number of tickets
- Assume we keep a list of processes, each with its ticket count:

    head -> ( A | 100 ) -> ( B | 50 ) -> ( C | 250 ) -> null

- Steps:
  1. Pick a winning ticket, say 300.
  2. Traverse the list, summing ticket counts, and stop at the process whose range covers the winning ticket (in this case, C).

# 4. How to Assign Tickets?
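(Stepping back to the implementation section for a moment: the draw-and-traverse loop can be sketched as follows. This is a minimal sketch; the `Process` class and function names are my own, not from any particular kernel.)

```python
import random

class Process:
    def __init__(self, name, tickets):
        self.name = name
        self.tickets = tickets

def lottery_pick(processes, rng=random):
    """Hold one lottery: draw a winning ticket in [0, total) and walk the
    list, accumulating ticket counts until the counter passes the winner."""
    total = sum(p.tickets for p in processes)
    winner = rng.randrange(total)
    counter = 0
    for p in processes:
        counter += p.tickets
        if counter > winner:
            return p

procs = [Process("A", 100), Process("B", 50), Process("C", 250)]
# a winning ticket of 300 falls in C's range [150, 400)
```

Over many draws, each process wins in proportion to its ticket count, which is exactly the probabilistic share described in Section 1.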
"This problem is a tough one, because of course how the system behaves is strongly dependent on how tickets are allocated. One approach is to assume that the users know best; in such a case, each user is handed some number of tickets, and a user can allocate tickets to any jobs they run as desired. However, this solution is a non-solution: it really doesn't tell you what to do. Thus, given a set of jobs, the 'ticket-assignment problem' remains open."

# 5. Why not deterministic?

- Randomness gives us a simple scheduler
- but it does not deliver exactly the right proportions, especially over short time scales
- Solution: *stride scheduling*
  + a simple deterministic proportional-share scheduler
  + each job in the system has a stride, inversely proportional to the number of tickets it has
  + E.g.
    ~ tickets: (A, 100), (B, 50), (C, 250)
    ~ dividing a large constant (say 10,000) by each ticket count gives stride values for A, B, and C of 100, 200, and 40
  + every time a process runs, increment a counter for it (called its *pass* value) by its stride to track its global progress
  + the scheduler then uses the stride and pass values to determine which process should run next:
    ~ pick the process with the lowest pass value so far

  Example:

      Pass(A)      Pass(B)      Pass(C)     Who Runs?
    (stride=100) (stride=200) (stride=40)
         0            0            0            A
        100           0            0            B
        100          200           0            C
        100          200          40            C
        100          200          80            C
        100          200          120           A
        200          200          120           C
        200          200          160           C
        200          200          200          ...

PROBLEM WITH STRIDE SCHEDULING: GLOBAL STATE

Now suppose a new job enters in the middle of our stride scheduling example above; what should its pass value be? Should it be set to 0? If so, it will monopolize the CPU. With lottery scheduling, there is no such per-process global state; we simply add a new process with whatever tickets it has, update the single global variable that tracks how many total tickets we have, and go from there.
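The stride/pass mechanism above can be sketched in a few lines. This is a minimal sketch; the names and the choice of 10,000 as the stride constant are assumptions matching the worked example, not a fixed part of the algorithm:

```python
STRIDE_CONSTANT = 10_000  # large constant; stride = constant / tickets

class Job:
    def __init__(self, name, tickets):
        self.name = name
        self.stride = STRIDE_CONSTANT // tickets
        self.pass_ = 0  # global progress counter ("pass" value)

def stride_schedule(jobs, slices):
    """Run for `slices` time slices, always picking the job with the
    lowest pass value and advancing it by its stride (ties go to the
    job that appears first in the list)."""
    order = []
    for _ in range(slices):
        current = min(jobs, key=lambda j: j.pass_)
        order.append(current.name)
        current.pass_ += current.stride
    return order

jobs = [Job("A", 100), Job("B", 50), Job("C", 250)]
```

Over the first 8 slices the jobs run in ratio 2:1:5, exactly their 100:50:250 ticket split; and a freshly constructed `Job` starts with pass 0, which is precisely the global-state problem with new arrivals described above.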
In this way, lottery makes it much easier to incorporate new processes in a sensible manner.

The nice things about lottery:
- simple scheduler
- no global state

**LOTTERY SCHEDULING**, my own notes
====================================

WHY NOT USE MLFQ WITH PRIORITY?
-------------------------------
- relies on ad-hoc assignment of priorities
- priorities lack support for encapsulation and modularity
  + say module A has two tasks and wants to control the share between just those two; global priorities give it no clean way to express that
- complex: setting priorities and other parameters to obtain a desired behavior is a black art
  + poorly understood
- tickets are simpler: they provide both responsiveness and modularity (through the use of currencies)

SOME OTHER EXAMPLES OF USING LOTTERY?
-------------------------------------
- Synchronization resources:
  + a lock currency, funded by the tickets of the threads blocked on the lock
  + implication: the thread that holds the lock runs with its own funding plus the funding of all waiting threads
    ==> similar to priority inheritance; solves the priority-inversion problem
  + disk scheduling?

Inverse Lottery
---------------
- OK, so lottery works for *indivisible time-shared resources*, such as an entire processor or a spinlock. How about finely divisible space-shared resources (like memory)?
- Main idea:
  + a "loser" is chosen to relinquish a unit of resource that it holds
  + conducting an inverse lottery is similar to holding a normal lottery, except that inverse probabilities are used
  + the probability p that a client holding t tickets will be selected by an inverse lottery with a total of n clients and T tickets is

      p = (1 - t/T) * 1/(n - 1)

    Hence the more tickets a client has, the more likely it is to avoid having a unit of its resource revoked

How about multiple resources?
-----------------------------
- Still an open question ...

What is bad about lottery?
--------------------------
- it does not deal well with I/O-bound jobs
  + although compensation tickets mitigate the problem a little, they only help interactive jobs that actively relinquish the CPU and so actively participate in the next allocation
  + the story is different for an I/O-bound process p: say p goes to sleep while blocked on an I/O; there is no telling when the I/O completes. The fact that p went to sleep is different from p actively relinquishing the CPU. When p is unblocked, it should receive the CPU as quickly as possible to reduce its response time
- over short time intervals, it may not deliver the desired proportions
- how do we assign/allocate tickets to processes? It is hard; there is nothing like MLFQ, which actively adjusts the priority of processes based on their history
- moreover, process behavior may change over time, so it may be desirable to adjust shares dynamically (MLFQ-like)
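As a closing sketch, the compensation-ticket arithmetic from Remzi's notes (A and B each funded at 400 base units; B uses only 20 ms of a 100 ms quantum) can be written out. The function name and signature are my own, not from the thesis:

```python
def compensation_ticket(base_funding, used, quantum):
    """A client that consumes only a fraction f = used/quantum of its
    quantum gets a compensation ticket inflating its funding by 1/f
    until the start of its next quantum.  Returns the ticket's value,
    i.e. the difference between the inflated and the base funding."""
    f = used / quantum
    return base_funding * (1 / f - 1)

# Thread B: 400 base units, uses 20 ms of a 100 ms quantum (f = 1/5),
# so its compensation ticket is worth 400 * (5 - 1) = 1600 base units,
# for a total funding of 2000 until its next quantum.
comp = compensation_ticket(400, 20, 100)
```

A client that uses its whole quantum (f = 1) gets a ticket worth 0, so CPU-bound threads are unaffected; the inflation only boosts clients that yield early, which is why it fails to help a process that merely blocks on I/O.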