Kai Li @ Princeton & Paul Hudak @ Yale
ACM Transactions on Computer Systems, 1989
Shared Virtual Memory System in this paper
The memory coherence problem in shared virtual memory systems is similar, but not identical, to that in multicache schemes
A single address space is shared by all processors
Model of a parallel program: multiple threads, possibly running on multiple processors, share a single virtual address space
To achieve high parallelism, the process manager (i.e., thread manager) and the memory allocator should be tightly coupled with the mapping manager. This paper deals with neither the process manager nor the memory allocator
Previous work
Alternatives for data sharing in loosely coupled multiprocessor systems
The system was redesigned for the Lotus OS Kernel
Definition of coherence
Why not use a multicache coherence algorithm?
Two design issues: page size & memory coherence strategy
Reasons for a large page size and for a small page size
Good reasons for using the same page size as the conventional paging system
Spectrum of solutions to the memory coherence problem: X-axis = page ownership strategy, Y-axis = page synchronization method
                | Fixed ownership | Dynamic: centralized manager | Dynamic: distributed manager (fixed) | Dynamic: distributed manager (dynamic) |
Invalidation    | Not allowed     | OK                           | Good                                 | Good                                   |
Write-broadcast | Very expensive  | Very expensive               | Very expensive                       | Very expensive                         |
This paper considers only the invalidation-based cells rated OK or Good (the green cells on the original slide)
Each process usually has its own page table, whose fields are:
lock is an atomic operation that blocks the requesting process until the lock is granted
unlock releases the lock and schedules a process from the queue of processes waiting for the lock
To invalidate a page, the requesting processor sends an invalidation message to each processor in the page's copy set
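A minimal sketch of the invalidate primitive (Python; the send helper and the tuple message format are assumptions, not from the paper):

```python
def invalidate(page, copy_set, send):
    """Ask every processor holding a read copy of `page` to drop it."""
    for processor in copy_set:           # one point-to-point message per copy holder
        send(processor, ("invalidate", page))
```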
The central manager has a table called 'info', with an entry for each page; each entry has the fields: owner, copy-set & lock
Each processor has a page table, PTable, with fields: access & lock |
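A sketch of these two data structures in Python (field names follow the notes; the class names and types are hypothetical):

```python
import threading
from dataclasses import dataclass, field

@dataclass
class InfoEntry:
    """One entry of the central manager's 'info' table (one per page)."""
    owner: int                                      # processor currently owning the page
    copy_set: set = field(default_factory=set)      # processors holding read copies
    lock: threading.Lock = field(default_factory=threading.Lock)

@dataclass
class PTableEntry:
    """One entry of a processor's PTable (one per page)."""
    access: str = "null"                            # "null", "read", or "write"
    lock: threading.Lock = field(default_factory=threading.Lock)
```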
Read page fault
(1) requestor: lock PTable[page#]
(2) requestor: ask central manager for read access to the page and a copy of it
(3) manager: lock info[page#]
(4) manager: add requestor to info[page#].copy-set
(5) manager: ask info[page#].owner to send the copy of the page to requestor
(6) owner: lock PTable[page#]
(7) owner: PTable[page#].access = read
(8) owner: send the copy to requestor
(9) owner: unlock PTable[page#]
(8) requestor: receive the page from the owner
(9) requestor: send confirmation to the manager
(10) requestor: PTable[page#].access = read
(11) requestor: unlock PTable[page#]
(9) manager: receive confirmation from requestor
(10) manager: unlock info[page#]
Write page fault
(1) requestor: lock PTable[page#]
(2) requestor: ask central manager for write access to the page
(3) manager: lock info[page#]
(4) manager: invalidate(page#, info[page#].copy-set)
(5) manager: info[page#].copy-set = empty
(6) manager: ask info[page#].owner to send the copy of the page to requestor
(7) manager: info[page#].owner = requestor
(8) owner: lock PTable[page#]
(9) owner: send the copy to requestor
(10) owner: PTable[page#].access = null
(11) owner: unlock PTable[page#]
(9) requestor: receive the page from the owner
(10) requestor: send confirmation to the manager
(11) requestor: PTable[page#].access = write
(12) requestor: unlock PTable[page#]
(10) manager: receive confirmation from requestor
(11) manager: unlock info[page#]
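A compact sketch of the manager's side of both handlers (Python; send(dest, msg) and wait_confirm(src) are assumed messaging helpers, and the tables are treated as if they were locally accessible):

```python
def manager_read_fault(info, page, requestor, send, wait_confirm):
    """Centralized manager: grant read access to `page` for `requestor`."""
    entry = info[page]
    with entry.lock:                                       # lock info[page#]
        entry.copy_set.add(requestor)                      # record the new reader
        send(entry.owner, ("send_copy", page, requestor))  # owner ships a read copy
        wait_confirm(requestor)                            # requestor confirms receipt
    # info[page#] is unlocked when the with-block exits

def manager_write_fault(info, page, requestor, send, wait_confirm):
    """Centralized manager: grant write access to `page` for `requestor`."""
    entry = info[page]
    with entry.lock:
        for holder in entry.copy_set:                      # invalidate all read copies
            send(holder, ("invalidate", page))
        entry.copy_set.clear()
        send(entry.owner, ("move_page", page, requestor))  # old owner ships the page
        entry.owner = requestor                            # ownership follows write access
        wait_confirm(requestor)
```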
The central manager could become a bottleneck, so per-page synchronization is moved to each owner
Read page fault
(1) requestor: lock PTable[page#]
(2) requestor: ask central manager for read access to the page and a copy of it
(3) manager: lock MngerLock
(4) manager: ask info[page#].owner to send the copy of the page to requestor
(5) manager: unlock MngerLock
(5) owner: lock PTable[page#]
(6) owner: add requestor to PTable[page#].copy-set
(7) owner: PTable[page#].access = read
(8) owner: send the copy to requestor
(9) owner: unlock PTable[page#]
(8) requestor: receive the page from the owner
(9) requestor: PTable[page#].access = read
(10) requestor: unlock PTable[page#]
Write page fault
(1) requestor: lock PTable[page#]
(2) requestor: ask central manager for write access to the page
(3) manager: lock MngerLock
(4) manager: ask info[page#].owner to send the copy of the page to requestor
(5) manager: info[page#].owner = requestor
(6) manager: unlock MngerLock
(5) owner: lock PTable[page#]
(6) owner: send the page and copy-set to requestor
(7) owner: PTable[page#].access = null
(8) owner: unlock PTable[page#]
(6) requestor: any request for the page is queued until the requestor receives the page from the previous owner
(7) requestor: receive the page from the owner
(8) requestor: invalidate(page#, PTable[page#].copy-set)
(9) requestor: PTable[page#].access = write
(10) requestor: PTable[page#].copy-set = empty
(11) requestor: unlock PTable[page#]
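The key change is that the copy set now lives in the owner's PTable entry. A small sketch of the owner's read service under that assumption (hypothetical helper names):

```python
def owner_serve_read(ptable, page, requestor, send):
    """Improved centralized manager: the owner, not the manager, keeps the copy set."""
    entry = ptable[page]
    with entry.lock:
        entry.copy_set.add(requestor)         # copy-set bookkeeping moved to the owner
        entry.access = "read"                 # owner keeps the page, but read-only
        send(requestor, ("page_copy", page))  # ship the copy to the faulting processor
```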
At this point you may want a quick review of the Locus file system
Each processor is given a predetermined subset of pages to manage |
Mapping pages to the appropriate processors is difficult; a hash function is used
Page fault handling applies the hash function to find the manager for the page and then follows the same procedure as the centralized manager algorithm
It is difficult to find a fixed distribution function that fits all applications well
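A sketch of such a fixed distribution function; the mod hash is only illustrative, since the whole point above is that no single choice fits every application:

```python
def manager_of(page_number: int, num_processors: int) -> int:
    """Fixed distributed manager: any processor can compute who manages a page."""
    return page_number % num_processors
```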
Owner = manager; in fact there is no separate manager at all
Read page fault
(1) requestor: lock PTable[page#]
(2) requestor: broadcast read access request
(3) owner: lock PTable[page#]
(4) owner: add requestor to PTable[page#].copy-set
(5) owner: PTable[page#].access = read
(6) owner: send the copy to requestor
(7) owner: unlock PTable[page#]
(6) requestor: receive the page from the owner
(7) requestor: PTable[page#].access = read
(8) requestor: unlock PTable[page#]
Write page fault
(1) requestor: lock PTable[page#]
(2) requestor: broadcast write access request
(3) owner: lock PTable[page#]
(4) owner: send the page and copy-set to requestor
(5) owner: PTable[page#].access = null
(6) owner: unlock PTable[page#]
(4) requestor: receive the page & copy-set from the owner
(5) requestor: invalidate(page#, PTable[page#].copy-set)
(6) requestor: PTable[page#].access = write
(7) requestor: PTable[page#].copy-set = empty
(8) requestor: PTable[page#].owner = myself
(9) requestor: unlock PTable[page#]
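A sketch of the request broadcast (Python; send is an assumed helper). It also shows why this scheme is wasteful: every processor receives every request, but only the owner acts on it:

```python
def broadcast_request(page, req_type, my_id, num_processors, send):
    """Broadcast distributed manager: fault handling starts with a broadcast."""
    for proc in range(num_processors):
        if proc != my_id:
            send(proc, (req_type, page, my_id))   # only the current owner will respond
```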
All broadcast operations should be atomic. I don't know why, but consider this situation: two processors broadcast write requests for the same page at about the same time; the old owner answers the first request and gives up ownership, while the new owner has already seen (and ignored) the second request before becoming the owner, so without an atomic broadcast the second request may never be answered
Too many messages are generated, and unrelated processors waste time receiving and ignoring them
Read page fault
(1) requestor: lock PTable[page#]
(2) requestor: send read access request to PTable[page#].probOwner
(3) probOwner: lock PTable[page#]
(4) probOwner: forward the request to PTable[page#].probOwner, if it is not the real owner
(5) probOwner: unlock PTable[page#]
(5) owner: lock PTable[page#]
(6) owner: add requestor to PTable[page#].copy-set
(7) owner: PTable[page#].access = read
(8) owner: send the page to the requestor
(9) owner: unlock PTable[page#]
(8) requestor: receive the page from the owner
(9) requestor: PTable[page#].probOwner = owner
(10) requestor: PTable[page#].access = read
(11) requestor: unlock PTable[page#]
Write page fault (assuming probOwner is not the real owner)
(1) requestor: lock PTable[page#]
(2) requestor: send write access request to PTable[page#].probOwner
(3) probOwner: lock PTable[page#]
(4) probOwner: forward the request to PTable[page#].probOwner, if it is not the real owner
(5) probOwner: PTable[page#].probOwner = requestor
(6) probOwner: unlock PTable[page#]
(5) owner: lock PTable[page#]
(6) owner: PTable[page#].access = null
(7) owner: send the page and copy-set to the requestor
(8) owner: PTable[page#].probOwner = requestor
(9) owner: unlock PTable[page#]
(7) requestor: receive the page & copy-set from the owner
(8) requestor: invalidate(page#, PTable[page#].copy-set)
(9) requestor: PTable[page#].access = write
(10) requestor: PTable[page#].copy-set = empty
(11) requestor: PTable[page#].probOwner = myself
(12) requestor: unlock PTable[page#]
(8) copy holders: receive invalidate
(9) copy holders: lock PTable[page#]
(10) copy holders: PTable[page#].access = null
(11) copy holders: PTable[page#].probOwner = requestor
(12) copy holders: unlock PTable[page#]
Note: probOwner is just a hint and a starting point to find the real owner of the page
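A sketch of how a request chases probOwner hints until it reaches the real owner; on a write fault every processor on the path redirects its hint to the requestor, which keeps future searches short (the data shapes here are assumptions):

```python
def locate_owner(start, prob_owner, is_owner, requestor=None):
    """Follow probOwner hints from `start` until the true owner is found.

    prob_owner: dict {processor: guessed owner}  (hints, possibly stale)
    is_owner:   dict {processor: bool}           (who really owns the page)
    If `requestor` is given (a write fault), every processor on the path
    points its hint at the requestor.  Returns (owner, hops).
    """
    node, hops = start, 0
    while not is_owner[node]:
        next_node = prob_owner[node]
        if requestor is not None:
            prob_owner[node] = requestor      # collapse the chain on write faults
        node, hops = next_node, hops + 1
    return node, hops

# Example: hints form the chain 0 -> 1 -> 2, and 2 is the real owner.
hints = {0: 1, 1: 2, 2: 2}
owns  = {0: False, 1: False, 2: True}
print(locate_owner(0, hints, owns, requestor=0))  # (2, 2); hints on the path now point to 0
```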
Two questions
Important observation
By periodically broadcasting the identity of the real owner, the number of steps needed to find the owner can be reduced. How often should this be done? See Tables II and III.
So far only the copy-set of the owner is correct and complete. This requires that:
Ideas
Read page fault
(1) requestor: lock PTable[page#]
(2) requestor: send read access request to PTable[page#].probOwner
(3) probOwner: lock PTable[page#]
(4) probOwner: forward the request to PTable[page#].probOwner, if it does not have a copy
(5) probOwner: unlock PTable[page#]
(5) copy-holder: lock PTable[page#]
(6) copy-holder: add requestor to PTable[page#].copy-set
(7) copy-holder: PTable[page#].access = read
(8) copy-holder: send the page to the requestor
(9) copy-holder: unlock PTable[page#]
(8) requestor: receive the page from the copy-holder
(9) requestor: PTable[page#].probOwner = copy-holder
(10) requestor: PTable[page#].access = read
(11) requestor: unlock PTable[page#]
Write page fault
(1) requestor: lock PTable[page#]
(2) requestor: send write access request to PTable[page#].probOwner
(3) probOwner: lock PTable[page#]
(4) probOwner: forward the request to PTable[page#].probOwner, if not the owner
(5) probOwner: PTable[page#].probOwner = requestor
(6) probOwner: unlock PTable[page#]
(5) owner: lock PTable[page#]
(6) owner: PTable[page#].access = null
(7) owner: send the page and copy-set to the requestor
(8) owner: PTable[page#].probOwner = requestor
(9) owner: unlock PTable[page#]
(7) requestor: receive the page & copy-set from the owner
(8) requestor: invalidate(page#, PTable[page#].copy-set)
(9) requestor: PTable[page#].access = write
(10) requestor: PTable[page#].copy-set = empty
(11) requestor: PTable[page#].probOwner = myself
(12) requestor: unlock PTable[page#]
(8) copy holders: receive invalidate
(9) copy holders: lock PTable[page#]
(10) copy holders: propagate invalidation to their own copy-set
(11) copy holders: PTable[page#].access = null
(12) copy holders: PTable[page#].probOwner = requestor
(13) copy holders: PTable[page#].copy-set = empty
(14) copy holders: unlock PTable[page#]
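A sketch of the fan-out invalidation under this scheme (Python; the ptables layout and field names are assumptions):

```python
def propagate_invalidate(page, processor, ptables, new_owner):
    """Distributed copy sets: each copy holder forwards the invalidation
    to the holders recorded in its own copy set, forming a tree."""
    entry = ptables[processor][page]
    for holder in entry.copy_set:                 # recurse into this holder's subtree
        propagate_invalidate(page, holder, ptables, new_owner)
    entry.copy_set = set()
    entry.access = "null"
    entry.prob_owner = new_owner                  # hints now point at the new writer
```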
Parallel Jacobi program
Parallel sorting
Parallel matrix multiplication
Parallel dot-product
Super-ideal (better than linear) speed-up for the Jacobi program! This is possible because the data set is larger than one processor's physical memory, so the single-processor run pages to disk while the shared-virtual-memory run keeps the data in the combined physical memories of the nodes.
Why did shared virtual memory show poor speed-up for the parallel dot-product problem? The communication cost is too high relative to the computation: each element fetched over the network is used in only one multiply-add, so moving pages dominates the useful work