Question:
* WSClock and WS accuracy?

According to the paper: the main dissimilarity between WS and WSClock lies in the difference between W and RW and its use in the load control. When a task is activated, |W| (or |RW|) frames of memory are "committed" to the task. Memory frames are "allocated" only as the task executes and demands them by referencing them. If, at activation, a page p ∈ (W ∩ R') is not referenced for θ, the periodic WS-scan of WS will remove p from W, while WSClock lacks a mechanism to perform this operation: WSClock can remove inactive pages only if they are resident. However, this inaccuracy of WSClock is confined to the first θ after each activation. Their claim: once a task has executed for at least θ units of virtual time since activation, W and RW are identical. The largest discrepancy between W and RW occurs immediately after activation and decreases rapidly during the loading phase. Since RW <= W during the loading phase, WSClock underestimates the amount of memory that a loading task may require, so it tends to make additional activations and overcommit memory during this phase. Thus, the LT/RT control not only prevents overcommitment of auxiliary memory, but also prevents overcommitment of main memory by WSClock.

Take away:
- What is the WS model? The algorithm?
- Why is it impractical in its pure form? (keeping track of reference times...)
- What is the approximation? (see the WSClock paper)
- Load control?
- WSClock only approximates the *resident* working set --> hence memory can be overcommitted. Solution: LT/RT. Why LT/RT?

# Discussion: WSClock

# MODEL OF VIRTUAL MEMORY SYSTEM
- A task is modeled by
  + an address space P = {p_i | i = 1..m} of pages
  + a reference string {r_t | t = 1..T}, where each r_t is an ordered pair (p, d) with
    > p in P
    > d a boolean, true if the reference dirties the page
- Configuration model:
  + a central processor
  + main memory with M page frames
  + a set of I/O devices
- OS model: scheduler, memory manager, load control (a small data-structure sketch follows this section)
  + Scheduler
    ~ each task is in either the *ready* queue or the *active* queue
      > a task remains in the active queue until (1) its time-slice is exhausted, (2) it is deactivated by the load control, or (3) it completes and becomes dormant
    ~ uses a multi-level load-balancing queue
  + Memory manager
    ~ allocates main memory frames to each task and requests paging I/O when needed
    ~ at any given time, each task's address space P is partitioned into a resident set R (pages that occupy main memory frames) and a missing set R'
    ~ a page fault occurs when the processor executes a reference (p, d) with p not in R
    ~ demand paging (a page is brought in only when it is needed, i.e. no pre-paging)
  + Load control:
    ~ goal: maximize the number of active tasks without inducing thrashing
    ~ thrashing occurs when so many tasks are active that the sum of their memory needs exceeds the size of main memory --> memory is overcommitted
    ~ *load* control: moving tasks back and forth between the ready queue and the active queue
      > when memory is undercommitted, tasks may be moved from the ready queue to the active queue
      > when memory is overcommitted, tasks are moved from the active queue back to the ready queue
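Not from the paper: a minimal Python sketch of the task model above, just to make P, R, R', VT, and demand paging concrete. The class and field names (Task, address_space, resident, reference) are my own, and the frame limit M is not enforced here.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    address_space: set[str]                            # P: the task's pages
    resident: set[str] = field(default_factory=set)    # R: pages occupying main memory frames
    virtual_time: int = 0                              # VT: number of completed references

    @property
    def missing(self) -> set[str]:                     # R' = P \ R
        return self.address_space - self.resident

    def reference(self, page: str, dirty: bool) -> bool:
        """Execute one reference r_t = (p, d); return True if it page-faults."""
        fault = page not in self.resident
        if fault:
            self.resident.add(page)    # demand paging: loaded only because it was referenced
        self.virtual_time += 1
        return fault

# Usage:
t = Task(address_space={"p1", "p2", "p3"})
print(t.reference("p1", dirty=False))   # True  -> first touch faults the page in
print(t.reference("p1", dirty=True))    # False -> p1 is now resident
print(t.missing)                        # {'p2', 'p3'} (order may vary)
```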
# WORKING SET POLICY

Working set definition
----------------------
- The working set W is the set of all pages p in P referenced in the previous θ units of the task's virtual time (the virtual time of a task is the number of references that have been completed for that task)
- The arrival of a new page p into W is signaled by a page fault
- Detecting the departure of a page from W is more difficult. It requires
  + LR(p): the last-reference time of each page
  + a procedure that detects pages for which VT - LR(p) >= θ
- Two general approaches to maintaining LR(p):
  + Software:
    ~ use an approximation, e.g. the page-frame use-bit: the use-bit is tested and cleared, and if the page was recently referenced, LR(p) is set to the task's current VT (this runs in the background at some regular interval)
    ~ a WS-scan is performed periodically to update the working set
  + Hardware support: each page frame is associated with a counter that approximates the virtual time since last reference directly (i.e. VT - LR(p))
    ~ the counter is cleared when the page is referenced
    ~ the counter is automatically incremented every fixed amount of time
    ~ a periodic WS-scan is still needed

Load Control and Replacement Algorithm
--------------------------------------
- When a task is active, |W| frames are committed to the task
- The total memory commitment W_active is the sum of |W| over all active tasks
- The WS load control will activate a ready task unless the |W| of the first ready task exceeds M - W_active
- A: the set of available frames
- Page replacement policy:
  + if there is a free page frame in A, use it
  + otherwise, deactivate an active task (move it to the ready queue); that task's resident pages are placed in A

Page Writing and Reclaiming
---------------------------
- When a dirty page is placed in A, it must be cleaned before it can be replaced
- Approach 1: couple the writing of the dirty page with the reading of the page that replaces it ==> slow, requires two I/O operations per fault
- Approach 2: replace only the clean pages in A, and clean the dirty pages in A when there are no outstanding page read requests for a device

# CLOCK POLICY
- A simple approximation of the global LRU replacement algorithm
- Why not LRU? More complex: it needs to keep track of the time each page was last referenced
- Detailed algorithm (Figure 1; a sketch of the scan loop follows this section):
  + all main memory page frames are ordered in a fixed circular list
  + a pointer always points to the last frame replaced
  + when a frame is needed to hold a missing page, the pointer is advanced clockwise, scanning the frames in circular order
  + the use-bit of each frame is tested and cleared
  + if the bit was set, the frame was recently used and is not replaced
  + if the bit was clear, the frame is replaceable provided the page is clean
  + if a replaceable page is dirty, it is scheduled for cleaning and not replaced
  + when the clock locates a clean, not-recently-used page, the algorithm halts, leaving the pointer at the chosen frame to mark the starting point for the next scan
- Global policies such as CLOCK let all active tasks compete for main memory allocation. There is no mechanism to determine a task's memory needs independently of the other tasks, hence no explicit estimate of memory commitment ==> requires an *adaptive feedback control mechanism* (e.g. the page fault rate is monitored; if it is too high, deactivate some task, and if it is too low, activate some task)
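Not the paper's Figure 1, just a minimal Python sketch of the clock-scan above; Frame, schedule_cleaning, and the two-circuit guard are my own additions.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    page: str
    use_bit: bool = False    # set (by hardware) whenever the page is referenced
    dirty: bool = False

def clock_select(frames: list[Frame], hand: int, schedule_cleaning) -> tuple[int, int]:
    """One clock-scan: advance the hand until a clean, not-recently-used frame is found.
    Returns (chosen_frame_index, new_hand_position)."""
    scanned = 0
    while scanned < 2 * len(frames):      # guard for the sketch; in reality we would wait on cleaning I/O
        hand = (hand + 1) % len(frames)   # advance clockwise from the last replaced frame
        scanned += 1
        frame = frames[hand]
        if frame.use_bit:                 # test and clear the use-bit
            frame.use_bit = False         # recently used: skip
        elif frame.dirty:
            schedule_cleaning(hand)       # replaceable but dirty: schedule cleaning, keep scanning
        else:
            return hand, hand             # clean and not recently used: replace this frame
    raise RuntimeError("no clean replaceable frame yet (cleaning I/O outstanding)")

# Usage:
frames = [Frame("p0", use_bit=True), Frame("p1"), Frame("p2", use_bit=True), Frame("p3", dirty=True)]
print(clock_select(frames, hand=0, schedule_cleaning=lambda i: print("clean frame", i)))  # -> (1, 1)
```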
# COMPARING WS and CLOCK
- Implementation:
  + WS is more complex, since it requires
    ~ scheduling and executing the WS-scan procedure (to keep the working sets up to date)
    ~ memory to store LR(p) for every page of every task
    ~ an algorithm to maintain the set A of available pages, including methods to reclaim pages from A
  + CLOCK is
    ~ simpler
    ~ less memory-hungry
    ~ but requires an adaptive feedback control mechanism (which may be difficult to tune, compared with WS)
- Policy:
  + WS has the advantages of good task isolation and predictive load control (WS just needs to look at M - W_active, whereas CLOCK needs a heuristic, e.g. the page fault rate) --> a local policy
  + In CLOCK, every task competes for memory globally --> a global policy
    ~ some tasks may monopolize main memory and force other tasks to run slowly
    ~ this can lead to thrashing
- Performance:
  + hard to compare

# WSCLOCK

Basic Features
--------------
Combines the best features of both WS and CLOCK:
- keeps the thrashing-preventive load control and task-isolation properties of WS
- eliminates:
  1. the WS-scan (in fact, the clock-scan is an approximation of the WS-scan)
  2. the space for LR(p) for non-resident pages
  3. the available-frame set A
  4. the page reclamation procedure (which comes with A)
- uses the simple mechanism found in CLOCK, but does not require an adaptive-feedback load control (as CLOCK does)
- simpler than both WS and CLOCK

Data Structure
--------------
- Main memory frames are arranged in a fixed circular CLOCK-like list
- The CLOCK pointer identifies the last frame replaced in the previous clock-scan
- LR(p) is kept only for resident pages (instead of LR(p) for every page p in P)
- When a page fault occurs, a page read request is placed in the paging queue; when that request is served, the replacement algorithm is invoked to obtain a frame containing a clean replaceable page to hold the incoming page

Algorithm
---------
- Detailed algorithm: Figure 2 (one error in this figure!); a rough sketch of the scan appears at the end of this section
  + to examine a frame, WSClock tests and clears the frame's use-bit
  + if the use-bit was set, meaning the page was referenced recently, WSClock updates LR(p) to the owning task's VT
  + otherwise (if the use-bit was clear), if VT - LR(p) > θ, the page is removed from the working set
  + a page p is replaceable if either
    ~ p is not in the working set, or
    ~ the owning task is inactive
  + if a replaceable page is dirty, it is scheduled for cleaning and not replaced
  + when WSClock finds a clean replaceable page, it halts and leaves the pointer at the chosen frame
- No need for the available-page set A, because WSClock simply searches for and finds a replaceable page when one is needed
  + this is a trade-off: with A, you pay the overhead of maintaining the free pool, but obtaining a frame is faster when you need one
- Page reclamation is eliminated because a page is not removed from a task's resident set R until it is selected for replacement

Load Control
------------
- WSClock scans only pages in the resident set R; it cannot account for pages of W that are NOT in R ==> WSClock approximates only the resident working set RW
- In terms of load control, WSClock is similar to WS; the only difference is that WSClock uses |RW| as the memory commitment of each task
  + RW_active: the sum of |RW| over all active tasks
  + the RW of a ready task is the value of RW when that task was deactivated
  (==> so WSClock is a good approximation only when a task's working set is mostly resident)
- Detecting overcommitment: the clock-scan fails to find a clean replaceable page in a full circuit of the frames (some task must then be deactivated to relieve the overcommitment)
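Again, not the paper's Figure 2 — a minimal Python sketch of the WSClock scan just described. Task, Frame, lr, theta, and the Overcommitted exception are invented names; LR(p) is stored on the frame here, matching "LR(p) only for resident pages" above.

```python
from dataclasses import dataclass

@dataclass
class Task:
    virtual_time: int        # VT of the owning task
    active: bool

@dataclass
class Frame:
    owner: Task
    use_bit: bool
    dirty: bool
    lr: int                  # LR(p), kept only for resident pages (i.e. on the frame)

class Overcommitted(Exception):
    """A full circuit found no clean replaceable page: deactivate some task."""

def wsclock_select(frames: list[Frame], hand: int, theta: int, schedule_cleaning) -> tuple[int, int]:
    """One WSClock scan; returns (chosen_frame_index, new_hand_position)."""
    for _ in range(len(frames)):                     # at most one full circuit
        hand = (hand + 1) % len(frames)
        f = frames[hand]
        if f.use_bit:
            f.use_bit = False                        # test and clear the use-bit
            f.lr = f.owner.virtual_time              # recently referenced: update LR(p)
            continue
        in_working_set = f.owner.active and (f.owner.virtual_time - f.lr <= theta)
        if in_working_set:
            continue                                 # not replaceable
        if f.dirty:
            schedule_cleaning(hand)                  # replaceable but dirty: clean it, keep scanning
            continue
        return hand, hand                            # clean replaceable page: replace here
    raise Overcommitted()                            # signal the load control

# Usage:
t = Task(virtual_time=100, active=True)
frames = [Frame(owner=t, use_bit=False, dirty=False, lr=10),   # VT - LR = 90 > θ: left the working set
          Frame(owner=t, use_bit=True,  dirty=False, lr=95)]
print(wsclock_select(frames, hand=1, theta=50, schedule_cleaning=lambda i: None))   # -> (0, 0)
```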
# LT/RT Control

Introduction
------------
- When a task is activated, it enters the *loading* phase: it has few pages p ∈ R and must load its missing pages before it can execute efficiently
- It then enters the *running* phase:
  + few page faults occur
  + the task runs efficiently
- Typically, the virtual time of the loading phase is small, while its real time is large because of paging I/O delays
- If activations occur frequently, many loading tasks contend for access to the I/O devices ==> loading time is proportional to the number of concurrent loading tasks
- This can keep the I/O devices (auxiliary memory) very busy even though main memory is undercommitted
==> Solution: the loading-task/running-task control (LT/RT control)
  + reduces the mean loading time of a task
  + ensures a more consistent balance of loading and running tasks

Implementation of LT/RT
-----------------------
- Idea: limit the number of concurrent loading tasks (a small sketch of the activation check is appended at the end of these notes)
- L: the maximum number of concurrent loading tasks
  + determined empirically
  + close to the number of paging devices that can be accessed simultaneously
- How to detect whether a task is in the loading phase or the running phase? Heuristic: a task is loading until it has executed for t units of virtual time or has requested an I/O operation
- Why consider a task that has requested an I/O operation a running task? Because otherwise an I/O-bound task might prevent new task activations for a long time
- Three effects:
  + reduces the mean delay between the time that main memory becomes available to activate a task and the time that the task enters the productive running phase
  + main memory is used more effectively, since fewer page frames are committed to unproductive loading tasks
  + *improves processor utilization*, because it maintains a more consistent balance of loading tasks and running tasks (i.e. the CPU spends more time serving tasks that already run efficiently rather than waiting for tasks that are still loading)
- Further optimizations:
  + deactivating a task when its time-slice ends:
    ~ delay the deactivation if there are already L loading tasks active. Why? Because otherwise, after the deactivation, we would activate a task from the ready queue, making the number of loading tasks L + 1
    ~ deactivate only when the number of loading tasks is < L and there is insufficient uncommitted memory to activate the first ready task
  + process page reads for running tasks before reads for loading tasks. Why? Because running tasks generate few page faults compared with loading tasks; serving them first improves processor utilization, since we give preference to tasks that execute efficiently
- LT/RT is independent of the WS and CLOCK load controls
  + it still limits the number of loading tasks even when they would not overcommit main memory
  (basically, there is room for a lot of ad-hoc tuning here)

# Experiment
From the graph, we see that:
1) With LT/RT, processor utilization increases
2) WSClock performs equivalently to WS
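As referenced above, a minimal Python sketch of the LT/RT activation check (my own reconstruction, not the paper's mechanism; TaskState, t_threshold, and the field names are invented, and commitment stands for |W| under WS or |RW| under WSClock):

```python
from dataclasses import dataclass

@dataclass
class TaskState:
    virtual_time: int    # VT since activation
    issued_io: bool      # has the task requested an I/O operation since activation?
    commitment: int      # |W| (under WS) or |RW| (under WSClock)

def is_loading(task: TaskState, t_threshold: int) -> bool:
    """Heuristic from the notes: loading until the task has executed for t_threshold
    units of virtual time or has requested an I/O operation."""
    return task.virtual_time < t_threshold and not task.issued_io

def may_activate(first_ready: TaskState, active: list[TaskState],
                 M: int, L: int, t_threshold: int) -> bool:
    """LT/RT-style activation check: activate the first ready task only if
    (1) fewer than L active tasks are still loading, and
    (2) its memory commitment fits in the uncommitted part of main memory."""
    loading = sum(is_loading(t, t_threshold) for t in active)
    committed = sum(t.commitment for t in active)
    return loading < L and first_ready.commitment <= M - committed

# Usage:
active = [TaskState(virtual_time=5,   issued_io=False, commitment=20),   # still loading
          TaskState(virtual_time=500, issued_io=False, commitment=30)]   # running
ready = TaskState(virtual_time=0, issued_io=False, commitment=25)
print(may_activate(ready, active, M=100, L=2, t_threshold=100))          # True
```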