SUMMARY
-------
- Optimize for the common case:
  + read-mostly workloads, but we still want to allow the infrequent destructive operation
- Example: module unloading (who knows what the module will be asked to do next)
- IDEA: split each update into two phases
  1. carry out enough of the update for new operations to see the new state, while still allowing existing operations to see the old state
  2. complete the update after all active operations have completed (garbage collection and such)
- Pros:
  + no need to lock
  + optimized for the common case (read-mostly)
  + eliminates cache-line bouncing
    ~ improves scalability
    ~ improves uniprocessor performance
- Cons:
  + works only if operations can tolerate stale data, i.e. it is OK for an operation to read slightly out-of-date data
  + must somehow detect that all active operations have completed
  + works only for updates that can be split into the two phases above
  + garbage-collection overhead
- grace period: the time during which existing operations still see the old state while new operations see the new state
  + HOW TO DETECT IT: implicitly (see details below)

Section 2.5: WHY RCU? WHAT ABOUT TRADITIONAL LOCKING? Cache-line bouncing
- traditional locking is limited by worst-case memory latency
- each lock acquisition must do a write to that lock's data structure
- the lock's data structure will therefore normally be in "modified" state *only* in the cache of the *last acquiring CPU*
- the next acquiring CPU will incur a *remote-memory-access latency penalty* when acquiring the lock

SOLUTIONS for cache-line bouncing
- RCU solves it, of course, because there is no lock at all. Note that RCU is free of cache-line bouncing only for read workloads (not for write workloads).
- another solution is brlock, the "big read lock": each CPU has a small lock, and together these constitute the big lock. A reader acquires only the lock local to its CPU (hence no cache-line bouncing), but a writer must acquire all of the locks, which is costly.

--------------------------------------------------------------------------------

**READ-COPY UPDATE**
====================

Motivation:
-----------
- traditional synchronization techniques (e.g. locks or atomic instructions) are a poor fit for an event-driven system like Linux. Why?
  + increased overhead
  + reduced scalability on multiprocessors, even with brlock (due to cache-line bouncing)
  + complex lock hierarchies, which both increase complexity and introduce hard-to-solve deadlock-avoidance issues when locks must be acquired out of order
- Event-driven system:
  + operations are small and complete quickly
  + vs. CPU-bound software such as scientific applications, in which a process may hold a lock for a long time

Idea
----
Use no locks; instead, split each update into two phases:
1) carry out enough of the update for new operations to see the new state, while still allowing existing operations to proceed on the old state
2) complete the update after all active operations have completed

When is this a good fit?
------------------------
1) it is possible to divide an update into the two phases above
2) it is possible for common-case operations to proceed on stale data (e.g. continue handling requests to a module that is being unloaded)
3) destructive updates are very infrequent (i.e. the workload is read-mostly)
4) event-driven operations complete quickly

How they do it
--------------
- The idea is to provide a "grace period": operations already in progress (those that started before the grace period began) continue to use the old state, while new operations (those that start after the grace period begins) see and operate on the new state. (A small sketch of this two-phase pattern follows below.)
- The grace period ends only after all of the old operations have completed.
- Once the grace period expires, the system may complete any required cleanup.
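A minimal sketch of the two-phase pattern above, applied to removing an element from a singly linked list (in C). The list type, the search loop, and the empty wait_for_rcu() stub are my own illustrative assumptions; only the wait_for_rcu() name comes from the paper (see the implementation notes further down).

    #include <stdlib.h>
    #include <stddef.h>

    struct element {
        struct element *next;
        int key;
    };

    /* Placeholder for the paper's grace-period primitive; a real version
     * waits until every CPU has passed through a quiescent state. */
    static void wait_for_rcu(void) { }

    void delete_element(struct element **headp, struct element *ep)
    {
        struct element **pp = headp;

        /* Phase 1: unlink ep so that NEW operations no longer find it.
         * Pre-existing readers may still hold a pointer to it, so it
         * must not be freed yet. */
        while (*pp != NULL && *pp != ep)
            pp = &(*pp)->next;
        if (*pp == ep)
            *pp = ep->next;

        /* Phase 2: wait for the grace period to expire (all operations
         * that started before the unlink have finished), then clean up. */
        wait_for_rcu();
        free(ep);
    }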
How to detect the end of the grace period
------------------------------------------
- indirectly: when every CPU has passed through a "quiescent state", a point in the code at which it is guaranteed that all previous operations have completed (e.g. in a non-preemptive kernel, a context switch is a quiescent state for a given CPU)
- using quiescent states is pessimistic, in the sense that it waits for all pre-existing operations on *all* data structures to complete, when in fact only a few of them (or none!) might be using the data structure of interest
  --> it lets us deduce that a given data structure is no longer referenced **without having to use locks or atomic operations**, hence
  --> better performance and scalability (no cache-line bouncing, etc.)
- strategies for choosing quiescent states:
  1) context switch as the quiescent state, for a non-preemptive kernel
  2) context switch, execution in the idle loop, execution in user mode, system-call entry, trap from user mode, or CPU offline
  3) voluntary context switch, appropriate for a kernel that allows preemption but forbids voluntary context switches inside read-side critical sections (Why? If a thread voluntarily gives up the CPU during a critical section, who knows what will happen to the shared data. If a thread is merely preempted during a critical section, that context switch is not treated as a quiescent state.)
  4) track the beginnings and ends of operations (e.g. with a counter per CPU): when the counter drops to zero, all pre-existing operations on that CPU have finished, so we have a quiescent state on that CPU
  Note that 1 and 2 keep track of information per CPU, whereas 3 and 4 keep track of information on a per-thread basis.

What is cache-line bouncing, and what are some solutions?
----------------------------------------------------------
In a multiprocessor system, each CPU has its own cache. When a thread on a particular CPU takes a lock, it must modify the cache line holding that lock variable and also inform the other CPUs of the change it has just made (by sending messages, for example). Hence, every lock/unlock operation triggers costly inter-CPU communication to invalidate the cache line.

More details: traditional locking is limited by worst-case memory latency. Each acquisition of the lock requires a write to the lock's data structure, so the lock's data structure will normally be in the "modified" state only in the cache of the last acquiring CPU. The next acquiring CPU will likely incur a remote-memory-access latency penalty when acquiring the lock.

How to solve the cache-line bouncing problem?
- RCU solves it, of course, because there is no lock at all. Note that RCU is free of cache-line bouncing only for read workloads (not for write workloads).
- another solution is brlock, the "big read lock": each CPU has a small lock, and together these constitute the big lock. A reader acquires only the lock local to its CPU (hence no cache-line bouncing), but a writer must acquire all of the locks, which is costly. (A rough sketch of this per-CPU scheme follows below.)
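A rough userspace sketch (in C, using pthreads) of the brlock idea just described: one small lock per CPU, readers take only their own slot, writers sweep all slots. The names (brlock, br_read_lock, NR_CPUS) and the padding size are illustrative assumptions, not the kernel's actual brlock code.

    #include <pthread.h>

    #define NR_CPUS 4

    struct brlock {
        /* keep each CPU's lock on (roughly) its own cache line so that
         * readers on different CPUs do not share a line */
        struct { pthread_mutex_t m; char pad[64]; } cpu[NR_CPUS];
    };

    void br_init(struct brlock *b)
    {
        for (int i = 0; i < NR_CPUS; i++)
            pthread_mutex_init(&b->cpu[i].m, NULL);
    }

    /* reader: touches only the local CPU's lock -> no cache-line bouncing */
    void br_read_lock(struct brlock *b, int cpu)   { pthread_mutex_lock(&b->cpu[cpu].m); }
    void br_read_unlock(struct brlock *b, int cpu) { pthread_mutex_unlock(&b->cpu[cpu].m); }

    /* writer: must acquire every CPU's lock -> expensive */
    void br_write_lock(struct brlock *b)
    {
        for (int i = 0; i < NR_CPUS; i++)
            pthread_mutex_lock(&b->cpu[i].m);
    }

    void br_write_unlock(struct brlock *b)
    {
        for (int i = NR_CPUS - 1; i >= 0; i--)
            pthread_mutex_unlock(&b->cpu[i].m);
    }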
Features of RCU
---------------
- eliminates cache-line bouncing for reading tasks, hence good for read-mostly workloads
- but it can return stale data -> good only for operations that can tolerate stale data
  + stale data can either be suppressed or tolerated
- readers cannot hold a reference to an element across a voluntary context switch
  --> otherwise another thread could update (or free) the element in the meantime, causing inconsistency, because there is no lock protecting it
- requires that modifications be compatible with lock-free access (e.g. a read access will see either the old or the new state of the list, never a mix)
- it is possible to perform an arbitrary RCU modification of any data structure by making a copy of the entire structure, but this is *inefficient* for large structures
- needs a modest amount of memory to keep track of memory waiting to be freed
  + i.e. garbage-collection overhead
- less applicable to non-event-driven software, e.g. CPU-bound scientific applications (not sure why?)

How to deal with stale data
---------------------------
- it can be flagged, so that it may be ignored
- some software simply tolerates stale data
  + e.g. data representing external state, such as routing tables

HOW TO IMPLEMENT?
=================
**wait_for_rcu()**: wait for a grace period to expire (SEE FIGURES 16, 17)
- wait_for_rcu(void) can be implemented by having the updater schedule itself onto each CPU in turn. Once this finishes, the non-preemptive nature of the Linux kernel guarantees that all operations that were in progress during phase one have completed.
- this implementation has some shortcomings:
  1) it does not work with a preemptible kernel unless preemption is prohibited during read-side critical sections
  2) it cannot be called from an interrupt handler
     + it is really slow, and we want interrupt handlers to be fast
  3) it cannot be called while holding a spinlock, because no other process could then acquire that lock and make progress
  4) it is slow
     + who knows how long the "fake" updater will take to get scheduled on every CPU?
     + it requires many context switches: one per CPU
- Another way to implement this is with a callback queue: every function on the queue is invoked once a grace period expires. This reduces the number of context switches, and it also solves shortcomings 2) and 3).

**kfree_rcu()**: wait for a grace period before freeing a block of memory
- use kmalloc_rcu() to allocate the memory
- kfree_rcu() schedules a tasklet that calls wait_for_rcu() before doing the real kfree()

TORNADO/K42 Design:
-------------------
Note that the implementation above is for a non-preemptive kernel, or for a preemptive kernel in which blocking and preemption are prohibited during read-side critical sections. The following implementation **allows read-side critical sections to block as well as to be preempted** (a rough sketch follows below):
- maintain two generation counters per CPU
- when an operation begins, it is associated with the current counter (by storing a pointer to that counter in the task), and the current counter is incremented
- when the operation ends, the corresponding generation counter is decremented
- periodically, the non-current counter is checked; if it is zero, all of its associated operations have completed. When this happens, the roles of the current and non-current generations are reversed, and a separate per-CPU generation sequence number is advanced every time a new generation becomes current.
==> Hence, once the sequence number has advanced twice, we can be sure that all operations in existence prior to the first advancement have terminated on that CPU.
- To know when all operations have terminated across all CPUs, simply use a token: a CPU holds the token and passes it on only after its sequence number has advanced by at least two.
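A very rough sketch (in C) of the per-CPU generation-counter bookkeeping described above. All names are my own illustrative assumptions, not the Tornado/K42 code, and the sketch ignores the care a real implementation needs (strictly per-CPU access, memory ordering, and the cross-CPU token).

    #include <stdatomic.h>
    #include <stdbool.h>

    struct cpu_generations {
        atomic_long counter[2];   /* operations in flight, per generation */
        int current;              /* which counter new operations join */
        unsigned long sequence;   /* advanced each time the roles swap */
    };

    /* An operation remembers which counter it incremented so it can
     * decrement that same counter when it ends, even if the roles have
     * been swapped in the meantime. */
    atomic_long *op_begin(struct cpu_generations *g)
    {
        atomic_long *c = &g->counter[g->current];
        atomic_fetch_add(c, 1);
        return c;
    }

    void op_end(atomic_long *c)
    {
        atomic_fetch_sub(c, 1);
    }

    /* Called periodically: if the non-current generation has drained,
     * swap the roles and advance the sequence number.  After two such
     * advances, every operation that existed before the first advance
     * has terminated on this CPU. */
    bool maybe_advance(struct cpu_generations *g)
    {
        int old = 1 - g->current;
        if (atomic_load(&g->counter[old]) != 0)
            return false;
        g->current = old;
        g->sequence++;
        return true;
    }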
**High-performance design for RCU** (Section 6.2)
--------------------------------------------------
There are 5 items:
1. Support kmem-cache-malloc()
2. Deal with long lock-hold durations: these could make the grace period excessively long and degrade response time
   SOLUTION: record a timestamp at the beginning of the grace period
3. Batch grace-period measurement requests (i.e. calls to wait_for_rcu)
   + what if every CPU calls wait_for_rcu?
   SOLUTION: use lists and batch multiple callbacks on them
4. Per-CPU request lists to reduce per-request overhead
   SOLUTION: replicate the list per CPU, reducing cache-line bouncing and the need for locks guarding the lists
5. A less costly algorithm for measuring grace periods
   + currently: context switches plus token passing, counters, etc. All are expensive.
   SOLUTION:

Some Applications of RCU
------------------------
1) Scalable FD management
   - what needs to be protected: the array of file descriptors
2) CPU hotplug
   - what needs to be protected: the array of online processors
3) Module unloading

OTHER LOCKING ALGORITHMS
------------------------
1. Data locking
   - a separate lock for each instance of a given data structure
   - prone to deadlock
   - reading tasks still need to acquire locks, so cache-line bouncing remains
   - list elements may be manipulated in parallel
   - but the search itself cannot be done in parallel
2. Reader-writer lock, e.g. brlock (big read lock)
   - a cache-aligned array of per-CPU locks
   - a reader acquires only its own CPU's lock
     + hence reduced cache-line bouncing
     + since the cache line containing a given CPU's lock is likely to remain in that CPU's cache
   - a writer must acquire ALL CPUs' locks
   - quite large, hard to embed within a data structure
   - great if read-mostly
3. RCU without stale data: suppress the stale data (see the sketch after this list)
   - use a separate lock per data element
   - add a deleted flag; when a reader finds the element and the deleted flag is set, treat the element as absent
   - may reintroduce some cache-line bouncing, since testing the flag requires locking
   - but only the cache lines containing the elements themselves will bounce (there is no list lock guarding the search)
4. RCU across context switches:
   - how do we hold a reference to an element across a context switch?
   - add a reference count to each element
   - cache-line bouncing due to hold()'s locking and manipulation of the refcount
   - still no list lock
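A small sketch (in C, pthreads) of item 3 above, "RCU without stale data": each element carries its own lock and a deleted flag, and a search must recheck the flag under that lock before handing the element back. The element type and function name are illustrative assumptions, not code from the paper.

    #include <pthread.h>
    #include <stddef.h>

    struct element {
        struct element *next;
        pthread_mutex_t lock;   /* per-element lock; no global list lock */
        int deleted;            /* set by the updater in phase one */
        int key;
    };

    /* Lock-free search over the list; only the matching element's own
     * cache line can bounce.  Returns the element locked, or NULL if it
     * is absent or already marked deleted (stale data is suppressed). */
    struct element *search(struct element *head, int key)
    {
        for (struct element *p = head; p != NULL; p = p->next) {
            if (p->key != key)
                continue;
            pthread_mutex_lock(&p->lock);
            if (p->deleted) {
                pthread_mutex_unlock(&p->lock);
                return NULL;
            }
            return p;   /* caller unlocks when done with the element */
        }
        return NULL;
    }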