Memory Coherence in Shared Virtual Memory Systems

Kai Li @ Princeton & Paul Hudak @ Yale

ACM Transactions on Computer Systems, 1989

1. Introduction

Shared virtual memory system in this paper:
A virtual memory system on a loosely coupled multiprocessor
Loosely coupled multiprocessor:
Private memory for each processor
Communication between processors involves software protocol stacks and virtual memory operations
The memory coherence problem in a shared virtual memory system is similar to, but not identical to, the one in multi-cache schemes

2. Shared Virtual Memory

A single address space is shared by all processors

Any processor can access memories of other processors plus its own local memory

Local memory is a big cache for the local processor

Read-only pages can have multiple copies residing in multiple local memories, and these copies must be kept coherent with one another

Each processor has a mapping manager, which:

translates virtual addresses to physical addresses

handles page faults by bringing the page in from disk or from another processor's memory

cooperates with the mapping managers on other processors to keep the shared memory coherent

Model of a parallel program: multiple threads, possibly running on multiple processors, share a single virtual address space

To achieve high parallelism, the process manager (= thread manager) and the memory allocator should be tightly coupled with the mapping manager. This paper deals with neither the process manager nor the memory allocator

Previous works

Massive amounts of work on virtual memory, but mainly for uniprocessor systems

Previous work on data sharing was mainly done for distributed computing rather than parallel computing

Alternatives for data sharing in a loosely coupled multiprocessor system

Message passing

No transparency: programming applications is difficult

Passing complicated data structures is difficult

Providing programmers with a set of primitives for handling shared global data

Needs the user's explicit control

Inefficient: even when the data are in local memory, they must be accessed through those primitives (= function calls)

Difficult to migrate programs

Shared virtual memory

Transparent

Once a page has been brought in, memory references are as efficient as conventional ones

The system was redesigned for Lotus OS Kernel

3. Memory Coherence Problem

Definition of coherence

The value returned by a read is the same as the value written by the most recent write

Why not multi-cache coherence algorithms?

Multi-cache algorithms mainly target a setting where:

A single memory is shared by multiple processors, each with its own cache

Caches are relatively small

The bus connecting the memory and the caches is fast

The delay for resolving conflicting writes must be short

But in the loosely coupled environment of this paper, communication cost is relatively high, so the number of communications must be kept small

Two design issues: page size & memory coherence strategy

3.1 Granularity: Page size

Reasons for a large page vs. a small page

Large page: transfer time is largely independent of page size, because network overhead and the cost of invoking the virtual memory software dominate

Small page: the larger the page, the higher the chance of contention (false sharing)

Good reasons for using the same page size as the conventional paging system

The existing page-fault scheme can be used

The protection scheme provided by the existing MMU can be used

3.2 Memory Coherence Strategies

Spectrum of solutions to the memory coherence problem: one axis is the page ownership strategy (fixed vs. dynamic; for dynamic ownership, a centralized or a distributed manager), the other is the page synchronization method (invalidation vs. write-broadcast)

Page synchronization | Fixed ownership | Dynamic ownership
                     |                 | Centralized manager | Distributed manager (fixed) | Distributed manager (dynamic)
Invalidation         | Not allowed     | OK                  | Good                        | Good
Write-broadcast      | Very expensive  | Very expensive      | Very expensive              | Very expensive

This paper considers only the invalidation-based schemes with dynamic ownership (the "OK" and "Good" cells above)

3.3 Page Table, Locking, and Invalidation

Each processor has its own page table, whose entries have the fields:

access: the rights of the processor to the page

copy set: the set of processors that hold a read copy of the page

lock: a tag indicating whether the page entry is locked or not

lock is an atomic operation that blocks the requesting process until the lock is granted

unlock releases the lock and schedules a process from the queue of processes waiting for the lock

To invalidate a page, the requesting processor usually sends invalidation messages (or a broadcast) to the processors in the page's copy set (these pieces are sketched below)
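
A minimal Python sketch of the per-page data structure and the invalidation step just described. The names (PageEntry, invalidate) and the send() message primitive are assumptions for illustration, not the paper's code; a threading.Lock stands in for the atomic lock/unlock operations, with waiters queuing on it as described above.

    import threading

    class PageEntry:
        """One entry of a processor's PTable, with the fields listed above."""
        def __init__(self):
            self.access = None             # None, 'read', or 'write'
            self.copy_set = set()          # processors holding a read copy
            self.lock = threading.Lock()   # atomic lock/unlock; waiters queue here

    def invalidate(page_num, copy_set, send):
        """Send an invalidation message to every processor in the copy set.
        send(dest, message) is a hypothetical message-passing primitive."""
        for proc in copy_set:
            send(proc, ('invalidate', page_num))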

4. Centralized Manager Algorithms

4.1 A Monitor-Like Centralized Manager Algorithm

The central manager has a table called 'info' with an entry for each page; each entry has the fields:

owner: the owner of the page, i.e., the processor that most recently had write access to it

copy set

lock

Each processor has a page table, PTable, with fields: access & lock

Read page fault

(1) requestor: lock PTable[page#]

(2) requestor: ask the central manager for read access to the page and a copy of it

(3) manager: lock info[page#]

(4) manager: add requestor to info[page#].copy-set

(5) manager: ask info[page#].owner to send the copy of the page to requestor

(6) owner: lock PTable[page#]

(7) owner: PTable[page#].access = read

(8) owner: send the copy to requestor

(9) owner: unlock PTable[page#]

(8) requestor: receive the page from the owner

(9) requestor: send confirmation to the manager

(10) requestor: PTable[page#].access = read

(11) requestor: unlock PTable[page#]

(9) manager: receive confirmation from requestor

(10) manager: unlock info[page#]

Write page fault (a sketch of the manager's side of both fault handlers follows these step lists)

(1) requestor: lock PTable[page#]

(2) requestor: ask central manager for write access to the page

(3) manager: lock info[page#]

(4) manager: invalidate(page#, info[page#].copy-set)

(5) manager: info[page#].copy-set = empty

(6) manager: ask info[page#].owner to send the copy of the page to requestor

(7) manager: info[page#].owner = requestor

(8) owner: lock PTable[page#]

(9) owner: send the copy to requestor

(10) owner: PTable[page#].access = null

(11) owner: unlock PTable[page#]

(9) requestor: receive the page from the owner

(10) requestor: send confirmation to the manager

(11) requestor: PTable[page#].access = write

(12) requestor: unlock PTable[page#]

(10) manager: receive confirmation from requestor

(11) manager: unlock info[page#]
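
A rough sketch of the manager's side of the two fault handlers above, written as plain Python for readability. The structure and names (InfoEntry, CentralManager, the send() callback) are assumptions; message delivery and the wait for the requestor's confirmation are only indicated by comments, so this is a sketch of the bookkeeping, not a full implementation.

    import threading

    class InfoEntry:
        """One entry of the central manager's 'info' table."""
        def __init__(self, owner):
            self.owner = owner             # processor that last had write access
            self.copy_set = set()          # processors holding read copies
            self.lock = threading.Lock()

    class CentralManager:
        def __init__(self, initial_owner_of_page):
            self.info = {p: InfoEntry(o) for p, o in initial_owner_of_page.items()}

        def handle_read_fault(self, page, requestor, send):
            e = self.info[page]
            with e.lock:                                            # lock info[page#]
                e.copy_set.add(requestor)                           # step (4)
                send(e.owner, ('send_read_copy', page, requestor))  # step (5)
                # In the full protocol the manager keeps info[page#] locked until
                # the requestor's confirmation arrives; that wait is omitted here.

        def handle_write_fault(self, page, requestor, send):
            e = self.info[page]
            with e.lock:                                            # lock info[page#]
                for proc in e.copy_set:                             # step (4)
                    send(proc, ('invalidate', page))
                e.copy_set = set()                                  # step (5)
                send(e.owner, ('send_page_give_up', page, requestor))  # step (6)
                e.owner = requestor                                 # step (7)
                # Again, the lock is held until the requestor confirms (omitted).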

4.2 An Improved Centralized Manager Algorithm

The central manager can become a bottleneck, so per-page synchronization (the lock and the copy set) is moved to the individual owners, which also eliminates the confirmation message to the manager (a sketch of the change appears at the end of this subsection)

Read page fault

(1) requestor: lock PTable[page#]

(2) requestor: ask the central manager for read access to the page and a copy of it

(3) manager: lock MngerLock

(4) manager: ask info[page#].owner to send the copy of the page to requestor

(5) manager: unlock MngerLock

(5) owner: lock PTable[page#]

(6) owner: add requestor to PTable[page#].copy-set

(7) owner: PTable[page#].access = read

(8) owner: send the copy to requestor

(9) owner: unlock PTable[page#]

(8) requestor: receive the page from the owner

(9) requestor: PTable[page#].access = read

(10) requestor: unlock PTable[page#]

Write page fault

(1) requestor: lock PTable[page#]

(2) requestor: ask central manager for write access to the page

(3) manager: lock MngerLock

(4) manager: ask info[page#].owner to send the copy of the page to requestor

(5) manager: info[page#].owner = requestor

(6) manager: unlock MngerLock

(5) owner: lock PTable[page#]

(6) owner: send the page and copy-set to requestor

(7) owner: PTable[page#].access = null

(8) owner: unlock PTable[page#]

(6) requestor: any request for the page is queued until the requestor receives the page from the previous owner

(7) requestor: receive the page from the owner

(8) requestor: invalidate(page, PTable[page#].copy-set)

(9) requestor: PTable[page#].access = write

(10) requestor: PTable[page#].copy-set = empty

(11) requestor: unlock PTable[page#]

At this point you may want a quick review of Locus file systems
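
A sketch of what changes in the improved algorithm, under assumed names rather than the paper's code: the manager keeps only a page-to-owner map behind a single MngerLock, while the copy set and the per-page lock live in the owner's PTable, so no confirmation message back to the manager is needed.

    import threading

    class ImprovedManager:
        """Manager side: only remembers which processor owns each page."""
        def __init__(self, initial_owner_of_page):
            self.owner = dict(initial_owner_of_page)
            self.mnger_lock = threading.Lock()        # MngerLock in the notes

        def handle_fault(self, page, requestor, kind, send):
            """kind is 'read' or 'write'; the request is simply forwarded to the owner."""
            with self.mnger_lock:
                send(self.owner[page], (kind + '_request', page, requestor))
                if kind == 'write':
                    self.owner[page] = requestor      # ownership moves right away

    class OwnerSide:
        """Owner side: the copy set is kept here, not at the manager."""
        def __init__(self, ptable):
            self.ptable = ptable   # page -> entry with access, copy_set, lock fields

        def on_read_request(self, page, requestor, send):
            e = self.ptable[page]
            with e.lock:
                e.copy_set.add(requestor)
                e.access = 'read'
                send(requestor, ('page_copy', page))

        def on_write_request(self, page, requestor, send):
            e = self.ptable[page]
            with e.lock:
                send(requestor, ('page_and_copy_set', page, set(e.copy_set)))
                e.access = None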

5. Distributed Manager Algorithms

5.1 A Fixed Distributed Manager Algorithm

Each processor is given a predetermined subset of pages to manage

Pages are mapped to their managers by a fixed function, e.g., a hash of the page number (see the sketch below)

On a page fault, the faulting processor applies the hash function to find the page's manager and then follows the same protocol as in the centralized manager algorithm

It is difficult to find one fixed distribution function that fits all applications well
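
A tiny sketch of the fixed distribution. A modulo hash of the page number is only one possible choice of mapping, used here as an assumption for illustration:

    def manager_of(page_num, num_processors):
        """Fixed distributed manager: every processor can compute locally,
        with no messages, which processor manages a given page."""
        return page_num % num_processors

    # Example: with 8 processors, page 21 is managed by processor 5.
    assert manager_of(21, 8) == 5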

5.2 A Broadcast Distributed Manager Algorithm

Owner = manager; in fact there is no separate manager at all, since requests are simply broadcast (a sketch of the write-fault path follows the step lists below)

Read page fault

(1) requestor: lock PTable[page#]

(2) requestor: broadcast read access request

(3) owner: lock PTable[page#]

(4) owner: add requestor to PTable[page#].copy-set

(5) owner: PTable[page#].access = read

(6) owner: send the copy to requestor

(7) owner: unlock PTable[page#]

(6) requestor: receive the page from the owner

(7) requestor: PTable[page#].access = read

(8) requestor: unlock PTable[page#]

Write page fault

(1) requestor: lock PTable[page#]

(2) requestor: broadcast write access request

(3) owner: lock PTable[page#]

(4) owner: send the page and copy-set to requestor

(5) owner: PTable[page#].access = null

(6) owner: unlock PTable[page#]

(4) requestor: receive the page & copy-set from the owner

(5) requestor: invalidate(page, PTable[page#].copy-set)

(6) requestor: PTable[page#].access = write

(7) requestor: PTable[page#].copy-set = empty

(8) requestor: PTable[page#].owner = myself

(9) requestor: unlock PTable[page#]
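
A sketch of the write-fault path of the broadcast algorithm, with hypothetical broadcast/send/receive primitives standing in for the network. It assumes the PTable entries carry access, copy-set, lock, and owner fields, as in the steps above.

    def broadcast_write_fault(page, my_id, ptable, broadcast, send, receive):
        """Requestor side. receive() blocks until the true owner replies."""
        e = ptable[page]
        with e.lock:                       # later requests for this page queue here
            broadcast(('write_request', page, my_id))
            _tag, _page, copy_set = receive()        # only the owner answers
            for proc in copy_set:                    # requestor invalidates read copies
                send(proc, ('invalidate', page))
            e.access = 'write'
            e.copy_set = set()
            e.owner = my_id

    def on_write_broadcast(page, requestor, my_id, ptable, send):
        """Every processor receives the broadcast; only the current owner reacts."""
        e = ptable[page]
        if e.owner != my_id:
            return                                   # unrelated processor: ignore
        with e.lock:
            send(requestor, ('page_and_copy_set', page, set(e.copy_set)))
            e.access = None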

 

All broadcast operations should be atomic. I don't know why. Consider this situation:

P1 and P2 request write access at the same time

P1's request reaches the owner P3 first

In the window after P3 has given up its ownership but before P1 has accepted the new ownership, P2's broadcast could arrive at P1. But P2's request will simply be queued, because PTable[page#] is locked on P1 at that moment!

Drawback: too many messages are generated, and unrelated processors waste time receiving and ignoring them

5.3 A Dynamic Distributed Manager Algorithm

Read page fault

(1) requestor: lock PTable[page#]

(2) requestor: send read access request to PTable[page#].probOwner

(3) probOwner: lock PTable[page#]

(4) probOwner: forward the request to its own PTable[page#].probOwner, if it is not the real owner

(5) probOwner: unlock PTable[page#]

(5) owner: lock PTable[page#]

(6) owner: add requestor to PTable[page#].copy-set

(7) owner: PTable[page#].access = read

(8) owner: send the page to the requestor

(9) owner: unlock PTable[page#]

(8) requestor: receive the page from the owner

(9) requestor: PTable[page#].probOwner = owner

(10) requestor: PTable[page#].access = read

(11) requestor: unlock PTable[page#]

Write page fault (assuming the first probOwner contacted is not the real owner)

(1) requestor: lock PTable[page#]

(2) requestor: send write access request to PTable[page#].probOwner

(3) probOwner: lock PTable[page#]

(4) probOwner: forward the request to PTable[page#].probOwner, if not the real owner

(5) probOwner: PTable[page#].probOwner = requestor

(6) probOwner: unlock PTable[page#]

(5) owner: lock PTable[page#]

(6) owner: PTable[page#].access = null

(7) owner: send the page and copy-set to the requestor

(8) owner: PTable[page#].probOwner = requestor

(9) owner: unlock PTable[page#]

(7) requestor: receive the page & copy-set from the owner

(8) requestor: invalidate(page, PTable[page#].copy-set)

(9) requestor: PTable[page#].access = write

(10) requestor: PTable[page#].copy-set = empty

(11) requestor: PTable[page#].probOwner = myself

(12) requestor: unlock PTable[page#]

(8) copy holders: receive invalidate

(9) copy holders: lock PTable[page#]

(10) copy holders: PTable[page#].access = null

(11) copy holders: PTable[page#].probOwner = requestor

(12) copy holders: unlock PTable[page#]

 

Note: probOwner is just a hint and a starting point to find the real owner of the page

Two questions

Does a forwarded request eventually arrive at the true owner? Yes

How many forwarding messages are needed? At most N - 1 for a single fault, where N = number of processors; for K faults on the same page the total is O(N + K log N) messages

Important observation

The number of messages needed to find the owner grows only logarithmically with the number of contending processors (a small simulation of the probOwner chain follows)
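
The probOwner chain behaves much like path compression in a union-find structure. The short simulation below is an illustration under assumed names (Node, write_fault), not the paper's code; it counts the messages needed to locate the owner and shows how the chain collapses once a fault has been serviced.

    class Node:
        def __init__(self, pid, prob_owner):
            self.pid = pid
            self.prob_owner = prob_owner    # hint: where this node thinks the owner is

    def write_fault(nodes, true_owner, requestor):
        """Follow probOwner hints from the requestor until the true owner is reached.
        true_owner is passed in only so the simulation knows where the chain ends;
        in the real system each node knows from its own PTable whether it is the owner.
        Every node on the path, plus the old owner, repoints its hint at the requestor,
        which is what shortens later searches. Returns (new owner, messages used)."""
        hops, current = 0, nodes[requestor].prob_owner
        while current != true_owner:
            nxt = nodes[current].prob_owner
            nodes[current].prob_owner = requestor     # step (5) of the write fault
            current, hops = nxt, hops + 1
        nodes[true_owner].prob_owner = requestor      # old owner's hint, step (8)
        nodes[requestor].prob_owner = requestor       # requestor now owns the page
        return requestor, hops + 1                    # +1: the hop that reaches the owner

    # Example: 8 processors whose hints form a chain 7 -> 6 -> ... -> 1 -> 0 (the owner).
    nodes = {i: Node(i, max(i - 1, 0)) for i in range(8)}
    owner, hops1 = write_fault(nodes, 0, 7)           # walks the whole chain: 7 messages
    owner, hops2 = write_fault(nodes, owner, 3)       # after compression: 1 message
    print(hops1, hops2)                               # prints: 7 1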

5.4 An Improvement by Using Fewer Broadcasts

By periodically broadcasting the identity of the real owner, the number of forwarding steps needed to find the owner can be reduced. How often should this be done? See Tables II and III in the paper (tiny sketch below)
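
A small add-on to the simulation above, with the broadcast period M treated as a tuning parameter (an assumption here): after every M ownership changes the current owner is announced, so every probOwner hint points straight at it again.

    def maybe_broadcast_owner(nodes, owner, fault_count, m):
        """After every m page faults, broadcast the real owner; each processor
        then resets its probOwner hint to point directly at that owner."""
        if fault_count % m == 0:
            for node in nodes.values():
                node.prob_owner = owner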

5.5 Distribution of Copy Sets

So far only the owner's copy-set is correct and complete. This requires that:

pages be copied from the owner, so that the copy-set stays correct

invalidations be broadcast from the owner to every copy holder

Ideas

copy the page from any copy holder, not just the owner

maintain the copy holders as a tree (with the owner as the root) and propagate invalidations from the owner down the tree (a recursive-invalidation sketch appears at the end of this subsection)

Read page fault

(1) requestor: lock PTable[page#]

(2) requestor: send read access request to PTable[page#].probOwner

(3) probOwner: lock PTable[page#]

(4) probOwner: forward the request to PTable[page#].probOwner, if it does not have the copy

(5) probOwner: unlock PTable[page#]

(5) copy-holder: lock PTable[page#]

(6) copy-holder: add requestor to PTable[page#].copy-set

(7) copy-holder: PTable[page#].access = read

(8) copy-holder: send the page to the requestor

(9) copy-holder: unlock PTable[page#]

(8) requestor: receive the page from the copy-holder

(9) requestor: PTable[page#].probOwner = copy-holder

(10) requestor: PTable[page#].access = read

(11) requestor: unlock PTable[page#]

Write page fault

(1) requestor: lock PTable[page#]

(2) requestor: send write access request to PTable[page#].probOwner

(3) probOwner: lock PTable[page#]

(4) probOwner: forward the request to PTable[page#].probOwner, if not the owner

(5) probOwner: PTable[page#].probOwner = requestor

(6) probOwner: unlock PTable[page#]

(5) owner: lock PTable[page#]

(6) owner: PTable[page#].access = null

(7) owner: send the page and copy-set to the requestor

(8) owner: PTable[page#].probOwner = requestor

(9) owner: unlock PTable[page#]

(7) requestor: receive the page & copy-set from the owner

(8) requestor: invalidate(page, PTable[page#].copy-set)

(9) requestor: PTable[page#].access = write

(10) requestor: PTable[page#].copy-set = empty

(11) requestor: PTable[page#].probOwner = myself

(12) requestor: unlock PTable[page#]

(8) copy holders: receive invalidate

(9) copy holders: lock PTable[page#]

(10) copy holders: propagate the invalidation to their own copy-set

(11) copy holders: PTable[page#].access = null

(12) copy holders: PTable[page#].probOwner = requestor

(13) copy holders: PTable[page#].copy-set = empty

(14) copy holders: unlock PTable[page#]
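
A sketch of the distributed copy sets as a tree, under assumed structure: each copy holder records only the processors it handed copies to, so an invalidation started by the new owner fans out recursively. Here ptables maps a processor id to that processor's PTable, and each recursive call stands in for an invalidation message.

    def propagate_invalidate(page, node_id, ptables):
        """Invalidate `page` at node_id and then, recursively, at every processor
        in its partial copy set, i.e., walk the copy-set tree from the top down."""
        entry = ptables[node_id][page]
        for child in entry.copy_set:
            propagate_invalidate(page, child, ptables)
        entry.access = None
        entry.copy_set = set()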

6. Experiments

6.1 Four Parallel Programs

Parallel Jacobi program

Parallel sorting

Parallel matrix multiplication

Parallel dot-product

6.2 Speedups

Super-linear (better than ideal) speed-up for the Jacobi program!

The large combined physical memory reduces disk paging, which adds speed-up on top of the speed-up gained from the multiple processors

Why does shared virtual memory show poor speed-up for the parallel dot product? The communication cost is too high relative to the small amount of computation

6.3 Memory Coherence Algorithms