HG's NOTES on lottery and disk
==============================

LOTTERY AND DISK

Isolation with Flexibility: A Resource Management Framework for Central Servers
http://www.eecs.harvard.edu/~syrah/papers/usenix-00-resmgmt/usenix-00-resmgmt.pdf
Disk Scheduling with Quality of Service Guarantees
http://www.bell-labs.com/project/eclipse/disk.ps

- As a Web site's working set of frequently accessed documents expands, the site may require an increasing share of the server's disk bandwidth in order to offer reasonable responsiveness
- Ticket exchanges: a CPU-intensive application could exchange some of its disk tickets for some of the CPU tickets of an I/O-intensive application
- Supports proportional sharing of disk bandwidth using the hierarchical YFQ algorithm [Bru99b]
- YFQ approximates ideal proportional sharing by maintaining a virtual time function and per-disk queues of outstanding disk requests for each resource principal
- Each queue has an associated finish tag that reflects:
  * the principal's past disk activity
  * its current share of the disk
  * the length of the request at the head of the queue
- Requests from the queues with the smallest finish tags are forwarded to the device driver in small batches that can be reordered by the driver or device to achieve better aggregate throughput
- YFQ fairness:
  * strict fairness may incur excessive disk latency and seek overhead, harming aggregate throughput
  * this can be mitigated by batching
- A principal's disk tickets are active whenever it has an outstanding request
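A minimal sketch of the smallest-finish-tag dispatch just described. The class names, the batch size, and the exact finish-tag formula (start-time/finish-time fair-queueing style, `max(finish, vtime) + length/share`) are my own simplifications, not taken from the YFQ paper:

```python
from collections import deque

class Principal:
    def __init__(self, name, share):
        self.name = name        # resource principal
        self.share = share      # its current share of the disk (from tickets)
        self.queue = deque()    # outstanding request lengths, FIFO
        self.finish = 0.0       # finish tag of the request at the queue head

class YFQSketch:
    def __init__(self, batch_size=2):
        self.batch_size = batch_size
        self.vtime = 0.0        # global virtual time

    def submit(self, p, length):
        p.queue.append(length)
        if len(p.queue) == 1:   # queue was empty: stamp a fresh finish tag
            p.finish = max(p.finish, self.vtime) + length / p.share

    def dispatch(self, principals):
        """Forward a small batch from the backlogged queue(s) with the
        smallest finish tags; the driver/device may reorder the batch."""
        batch = []
        backlogged = [p for p in principals if p.queue]
        while backlogged and len(batch) < self.batch_size:
            p = min(backlogged, key=lambda q: q.finish)
            length = p.queue.popleft()
            batch.append((p.name, length))
            self.vtime = p.finish            # advance virtual time
            if p.queue:                      # re-stamp for the new head request
                p.finish = max(p.finish, self.vtime) + p.queue[0] / p.share
            else:
                backlogged.remove(p)
        return batch
```

A principal with a larger share accumulates finish-tag "cost" more slowly per byte, so its requests are dispatched proportionally more often.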
- To adjust to dynamic changes in the number of active tickets, the base value of a principal's disk tickets is recomputed (using cached values if possible) whenever a request reaches the head of its queue, and this value is used to compute the queue's finish tag
- Conclusion
  * the idea of lottery can be extended across resources, i.e. you can exchange disk tickets for CPU tickets or memory tickets

LOTTERY AND DISK 2

From Lottery and Stride Scheduling (thesis), Section 6.3 Disk Scheduling
- Conventional disk schedulers are designed to maximize disk bandwidth utilization without any consideration for the relative importance of individual requests, e.g. shortest-seek-first, shortest-access-time-first
- Hence, some requests are treated very unfairly in terms of response time; worse, this can lead to starvation
- SCAN helps somewhat, but still exhibits high response times for requests to cylinders near the extreme ends of the disk
- Proportional-share control is also provided for filesystem buffer cache space

LOTTERY AND PHYSICAL MEMORY

- Difficulty: determining which processes are actively competing for memory
- Problem:
  * if a blocked process's memory tickets are considered inactive, the process will lose a large number of pages while blocked
  * alternatively, keeping all memory tickets active even when a process is idle is also bad, because pages belonging to idle processes then tend to remain in memory indefinitely
- Hence, give memory guarantees only to privileged processes that explicitly request them; other processes compete for the unreserved portion of memory not wired by the kernel
- Algorithm:
  * the pageout daemon is altered: pages owned by processes that have not exceeded their memory guarantee are not reclaimed
  * the pageout daemon must continually recompute the current base value of the processes' memory tickets
- He develops the FDC (funded delay cost) algorithm, which roughly approximates proportional-share disk scheduling for small ratios
- FDC computes the merit of each request as FD/C, where
  * F is the funding associated with the request
  * D is the delay since it was issued
  * C is the actual cost of servicing the request
- FDC services the request with the maximum merit
- FDC sacrifices aggregate performance in order to give preferential treatment to requests with high ticket allocations
- Results
  * high-ticket requests issued by important, time-critical apps are accelerated by FDC
  * low-ticket apps that issue a large number of disk requests are prevented from monopolizing the disk

REMZI's NOTE
============

# 1. Basic Concepts: Tickets represent your share

- Tickets: represent the share of a resource that a process should receive
- The percentage of tickets that a process holds represents its share of the system resource in question
- E.g.: two processes, A with 75 tickets (numbered 0 to 74) and B with 25 tickets (75 to 99)
  + the shares are achieved probabilistically
  + every time slice, hold a lottery: draw a winning ticket, and run the process that holds it

# 2. Ticket Mechanism

- Ticket currency:
  + allows a user with a set of tickets to allocate tickets among their own jobs in whatever currency they would like
  + the system then automatically converts that currency into the correct global value
  + hence, a currency defines a resource management abstraction barrier
  + Example: assume users A and B have each been given 100 tickets. User A is running two jobs, A1 and A2, and gives them each 500 tickets (out of 1000 total) in User A's own currency. User B is running only one job and gives it 10 tickets (out of 10 total).
    User A -> 500 (A's currency) to A1 -> 50  (global currency)
           -> 500 (A's currency) to A2 -> 50  (global currency)
    User B -> 10  (B's currency) to B1 -> 100 (global currency)

- Ticket transfer:
  + a process can temporarily hand off its tickets to another process
  + useful in a client/server setting:
    ~ a client process sends a message to a server asking it to do some work on the client's behalf
    ~ to speed up the work, the client can pass its tickets to the server
    ~ when finished, the server transfers the tickets back to the client
- Ticket inflation:
  + a process can temporarily raise or lower the number of tickets it owns
  + one greedy process could give itself a vast number of tickets and take over the machine
  + hence, it is only applied among mutually trusting processes:
    ~ if any one process knows it needs more CPU time, it can boost its ticket value as a way to reflect that need to the system, all without communicating with any other processes
- Compensation tickets:
  + a client that consumes only a fraction f of its allocated resource quantum can be granted a compensation ticket that inflates its value by 1/f until the client starts its next quantum
  + this permits I/O-bound tasks that use few processor cycles to start quickly
  + Example:
    ~ A and B each hold tickets valued at 400 base units
    ~ thread A always consumes its full 100 ms time quantum
    ~ thread B consumes 20 ms before yielding the processor (f = 1/5)
    ==> thread B is granted a compensation ticket valued at 1600 base units when it yields the processor, so its total funding until its next quantum is 400 * (1/f) = 2000

# 3. Implementation

- Really simple
- Need:
  + a good random number generator to pick the winning ticket,
  + a simple data structure to track the processes of the system (e.g., a list),
  + the total number of tickets
- Assume we keep a list of processes, each with its ticket count:

    head -> ( A | 100 ) -> ( B | 50 ) -> ( C | 250 ) -> null

- Steps:
  1. Pick a winning ticket, say 300.
  2. Traverse the list, summing ticket counts, and stop at the process whose range covers the winning ticket (in this case, C).

# 4. How to Assign Tickets?
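(Stepping back to the implementation section for a moment: the draw-and-traverse loop can be sketched as follows. This is a minimal sketch; the `Process` class and function names are my own, not from any particular kernel.)

```python
import random

class Process:
    def __init__(self, name, tickets):
        self.name = name
        self.tickets = tickets

def lottery_pick(processes, rng=random):
    """Hold one lottery: draw a winning ticket in [0, total) and walk the
    list, accumulating ticket counts until the counter passes the winner."""
    total = sum(p.tickets for p in processes)
    winner = rng.randrange(total)
    counter = 0
    for p in processes:
        counter += p.tickets
        if counter > winner:
            return p

procs = [Process("A", 100), Process("B", 50), Process("C", 250)]
# a winning ticket of 300 falls in C's range [150, 400)
```

Over many draws, each process wins in proportion to its ticket count, which is exactly the probabilistic share described in Section 1.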
"This problem is a tough one, because of course how the system behaves is strongly dependent on how tickets are allocated. One approach is to assume that the users know best; in such a case, each user is handed some number of tickets, and a user can allocate tickets to any jobs they run as desired. However, this solution is a non-solution: it really doesn't tell you what to do. Thus, given a set of jobs, the 'ticket-assignment problem' remains open."

# 5. Why not deterministic?

- Randomness gives us a simple scheduler
- but it does not deliver exactly the right proportions, especially over short time scales
- Solution: *stride scheduling*
  + a simple deterministic proportional-share scheduler
  + each job in the system has a stride, inversely proportional to the number of tickets it has
  + E.g.
    ~ tickets: (A, 100), (B, 50), (C, 250)
    ~ dividing a large constant (say 10,000) by each ticket count gives stride values for A, B, and C of 100, 200, and 40
  + every time a process runs, increment a counter for it (called its *pass* value) by its stride to track its global progress
  + the scheduler then uses the stride and pass values to determine which process should run next:
    ~ pick the process with the lowest pass value so far

  Example:

      Pass(A)      Pass(B)      Pass(C)     Who Runs?
    (stride=100) (stride=200) (stride=40)
         0            0            0            A
        100           0            0            B
        100          200           0            C
        100          200          40            C
        100          200          80            C
        100          200          120           A
        200          200          120           C
        200          200          160           C
        200          200          200          ...

PROBLEM WITH STRIDE SCHEDULING: GLOBAL STATE

Now suppose a new job enters in the middle of our stride scheduling example above; what should its pass value be? Should it be set to 0? If so, it will monopolize the CPU. With lottery scheduling, there is no such per-process global state; we simply add a new process with whatever tickets it has, update the single global variable that tracks how many total tickets we have, and go from there.
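The stride/pass mechanism above can be sketched in a few lines. This is a minimal sketch; the names and the choice of 10,000 as the stride constant are assumptions matching the worked example, not a fixed part of the algorithm:

```python
STRIDE_CONSTANT = 10_000  # large constant; stride = constant / tickets

class Job:
    def __init__(self, name, tickets):
        self.name = name
        self.stride = STRIDE_CONSTANT // tickets
        self.pass_ = 0  # global progress counter ("pass" value)

def stride_schedule(jobs, slices):
    """Run for `slices` time slices, always picking the job with the
    lowest pass value and advancing it by its stride (ties go to the
    job that appears first in the list)."""
    order = []
    for _ in range(slices):
        current = min(jobs, key=lambda j: j.pass_)
        order.append(current.name)
        current.pass_ += current.stride
    return order

jobs = [Job("A", 100), Job("B", 50), Job("C", 250)]
```

Over the first 8 slices the jobs run in ratio 2:1:5, exactly their 100:50:250 ticket split; and a freshly constructed `Job` starts with pass 0, which is precisely the global-state problem with new arrivals described above.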
In this way, lottery makes it much easier to incorporate new processes in a sensible manner.

The nice things about lottery:
- simple scheduler
- no global state

**LOTTERY SCHEDULING**, my own notes
====================================

WHY NOT USE MLFQ WITH PRIORITY?
-------------------------------
- relies on ad-hoc assignment of priorities
- priorities lack support for encapsulation and modularity
  + say module A has two tasks and wants to control the share between just those two; global priorities give it no clean way to express that
- complex: setting priorities and other parameters to obtain a desired behavior is a black art
  + poorly understood
- tickets are simpler: they provide both responsiveness and modularity (through the use of currencies)

SOME OTHER EXAMPLES OF USING LOTTERY?
-------------------------------------
- Synchronization resources:
  + a lock currency, funded by the tickets of the threads blocked on the lock
  + implication: the thread that holds the lock runs with its own funding plus the funding of all waiting threads
    ==> similar to priority inheritance; solves the priority-inversion problem
  + disk scheduling?

Inverse Lottery
---------------
- OK, so lottery works for *indivisible time-shared resources*, such as an entire processor or a spinlock. How about finely divisible space-shared resources (like memory)?
- Main idea:
  + a "loser" is chosen to relinquish a unit of resource that it holds
  + conducting an inverse lottery is similar to holding a normal lottery, except that inverse probabilities are used
  + the probability p that a client holding t tickets will be selected by an inverse lottery with a total of n clients and T tickets is

      p = (1 - t/T) * 1/(n - 1)

    Hence the more tickets a client has, the more likely it is to avoid having a unit of its resource revoked

How about multiple resources?
-----------------------------
- Still an open question ...

What is bad about lottery?
--------------------------
- it does not deal well with I/O-bound jobs
  + although compensation tickets mitigate the problem a little, they only help interactive jobs that actively relinquish the CPU and so actively participate in the next allocation
  + the story is different for an I/O-bound process p: say p goes to sleep while blocked on an I/O; there is no telling when the I/O completes. The fact that p went to sleep is different from p actively relinquishing the CPU. When p is unblocked, it should receive the CPU as quickly as possible to reduce its response time
- over short time intervals, it may not deliver the desired proportions
- how do we assign/allocate tickets to processes? It is hard; there is nothing like MLFQ, which actively adjusts the priority of processes based on their history
- moreover, process behavior may change over time, so it may be desirable to adjust shares dynamically (MLFQ-like)
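As a closing sketch, the compensation-ticket arithmetic from Remzi's notes (A and B each funded at 400 base units; B uses only 20 ms of a 100 ms quantum) can be written out. The function name and signature are my own, not from the thesis:

```python
def compensation_ticket(base_funding, used, quantum):
    """A client that consumes only a fraction f = used/quantum of its
    quantum gets a compensation ticket inflating its funding by 1/f
    until the start of its next quantum.  Returns the ticket's value,
    i.e. the difference between the inflated and the base funding."""
    f = used / quantum
    return base_funding * (1 / f - 1)

# Thread B: 400 base units, uses 20 ms of a 100 ms quantum (f = 1/5),
# so its compensation ticket is worth 400 * (5 - 1) = 1600 base units,
# for a total funding of 2000 until its next quantum.
comp = compensation_ticket(400, 20, 100)
```

A client that uses its whole quantum (f = 1) gets a ticket worth 0, so CPU-bound threads are unaffected; the inflation only boosts clients that yield early, which is why it fails to help a process that merely blocks on I/O.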