##########################################################################
#
#   Comments and Thoughts on 
#       "Lock behavior characterization of Commercial Workloads"
#  
#   Contributors: Ravi Rajwar, Mikko's group, Min Xu and us
#
#   May 21, 2002  
#
##########################################################################

## About the Conclusion ##

(*) For correctness, it's not necessary to identify user-level thread 
    switches. Our conclusion on this is too strong to be correct. 


## About our Methods ##

(*) Lock identification algorithm: Kevin gave an example of using one 
	casa instruction to insert into a linked list. This can introduce 
	errors in our current scheme, because (1) the locking and the critical 
	section are implemented by one instruction; (2) a following normal 
	store can be mistakenly deemed a completing store for the critical 
	section. We don't know how often this happens, though. 
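
	A minimal sketch of how the misidentification could arise. The notes
	do not spell out our scheme, so this assumes (hypothetically) that it
	pairs each casa with the next plain store to the same address and
	calls that store the completing store. The trace format, the function
	name find_lock_pairs and the addresses are made up for illustration.

```python
# Hypothetical trace format: one (opcode, address) tuple per memory
# operation issued by a single thread.
def find_lock_pairs(trace):
    """Pair each casa with the next plain store to the same address.

    Returns a list of (acquire_index, release_index) tuples. This is an
    assumed simplification of the identification scheme, not the real one.
    """
    open_acquires = {}  # address -> index of the still-unmatched casa
    pairs = []
    for i, (op, addr) in enumerate(trace):
        if op == "casa":
            open_acquires[addr] = i
        elif op == "store" and addr in open_acquires:
            pairs.append((open_acquires.pop(addr), i))
    return pairs


# A real lock: casa acquires LOCK, a plain store to LOCK releases it.
lock_trace = [
    ("casa", "LOCK"),
    ("load", "counter"),
    ("store", "counter"),
    ("store", "LOCK"),
]

# Kevin's case: the single casa on HEAD is the whole list insert.
# The later plain store to HEAD (say, a list reset) is unrelated to any lock.
list_trace = [
    ("store", "node.next"),
    ("casa", "HEAD"),
    ("load", "other"),
    ("store", "HEAD"),
]

print(find_lock_pairs(lock_trace))  # [(0, 3)]
print(find_lock_pairs(list_trace))  # [(1, 3)], a false positive
```

	Under this assumed pairing rule the list insert is reported as a
	critical section spanning operations 1 to 3, although no lock exists.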

(*) Ravi suggested measuring the number of cycles spent on lock contention,
	since accesses to a lock usually induce dirty misses (100+ cycles).
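
	A first-order estimate along these lines, as a sketch only: it
	charges a fixed latency per dirty miss and assumes misses do not
	overlap. The 100-cycle default comes from the figure above; the miss
	count and total cycle count in the example are invented, and the
	function names are ours.

```python
def lock_miss_cycles(dirty_misses, miss_latency=100):
    """Cycles attributed to lock-induced dirty misses (no overlap assumed)."""
    return dirty_misses * miss_latency


def lock_miss_fraction(dirty_misses, total_cycles, miss_latency=100):
    """Share of total execution cycles spent on those misses."""
    return lock_miss_cycles(dirty_misses, miss_latency) / total_cycles


# Invented example numbers: 1M dirty misses in a 2G-cycle run.
print(lock_miss_cycles(1_000_000))                      # 100000000
print(lock_miss_fraction(1_000_000, 2_000_000_000))     # 0.05
```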

(*) Jichuan once thought that descheduling a waiting thread is a good
	thing for throughput-oriented workloads (where fine-grained locks
	abound), but Ravi pointed out that scheduling is not free: the
	overhead itself can be large. IEEE Micro had a paper on OLTP
	optimization issues (see [+] below); they actually restrain thread
	yielding on the IBM machine and switch only on I/O.

(*) The cache behavior of lock accesses can itself affect performance: 
	sometimes using one coarse-grained lock under SLE (speculative lock
	elision) can remove all the misses induced by many fine-grained locks. 
   
(*) To get a detailed cycle count, we probably need a detailed processor
	timing model (one that models a multiple-issue CPU with overlapping
	misses). Ravi mentioned an example in which lock contention is 9% of
	execution time but SLE yields only a 1% speedup, which could be
	caused by long-latency, non-overlapping misses inside critical
	sections. 
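
	To put numbers on the gap in Ravi's example, here is a small
	Amdahl-style calculation. It is only a sketch: it assumes the 9% is
	time that would vanish entirely if contention were removed, and the
	function names are ours.

```python
def ideal_speedup(contention_fraction):
    """Speedup if all contention time disappeared (Amdahl-style bound)."""
    return 1.0 / (1.0 - contention_fraction)


def recovered_fraction(contention_fraction, observed_speedup):
    """Share of the contention time that the observed speedup accounts for."""
    time_saved = 1.0 - 1.0 / observed_speedup
    return time_saved / contention_fraction


# Ravi's example: 9% of execution time in lock contention, 1% speedup.
print(f"{ideal_speedup(0.09):.3f}")             # 1.099
print(f"{recovered_fraction(0.09, 1.01):.2f}")  # 0.11
```

	So the bound is roughly a 10% speedup, and the observed 1% recovers
	only about a ninth of the contention time. The rest would have to be
	explained by something like the non-overlapped misses mentioned above,
	which is what a detailed timing model could confirm or rule out.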

(*) We may also get the numbers for 8 and 64 processors, to see whether our 
	16-CPU setting stresses lock contention. 

(*) "Lock-free section" is a confusing term, since "lock-free" already 
	has an established meaning in the community. We might just call it
	"the other". 

(*) Ravi mentioned that many DBMS vendors still use simple locking 
	implementations because they are portable and easy to use. It was also 
	noted that most of the contention (and the resulting dirty misses)
	happens in a thin system-dependent code layer (which operates on 
	a small shared data area). 
     
[+]  System optimization for OLTP workloads, 
	Kunkel, S.; Armstrong, B.; Vitale, P. 
	IEEE Micro, Volume 19, Issue 3, May-June 1999, pp. 56-64