UW-Madison
Computer Sciences Dept.

CS739 Spring 2009: Questions

  1. Survey -- Distributed Operating Systems :
    Andrew S. Tanenbaum and Robbert Van Renesse
    ACM Computing Surveys, Volume 17, Issue 4 (December 1985)
    Question: This paper surveys distributed systems as of 1985. What were the goals of these distributed systems? What were the assumptions (in terms of workload and environment) of these systems? Which design issue (i.e., communication, naming and protection, resource management, fault tolerance, and services) seems most challenging (or interesting)? Why?
  2. Sprite vs. Amoeba : A Comparison of Two Distributed Systems: Amoeba and Sprite
    Fred Douglis, M. Frans Kaashoek, John K. Ousterhout, Andrew S. Tanenbaum.
    Computing Systems, Vol. 4, No. 3, pp. 353-384, December 1991.
    Question: Do you think system projects should pay attention to technology trends? Why or why not? Do you think Amoeba or Sprite did a better job predicting future technology? Explain.
  3. NFS
    Question: Discuss one of the changes that was made to NFSv3 and one of the changes made to NFSv4. What problem did each change address? Does the change introduce any drawbacks or challenges?
  4. Coda : Disconnected Operation in the Coda File System
    James J. Kistler, M. Satyanarayanan
    13th Symposium on Operating Systems Principles, Asilomar, California, pp. 213-225. October 1991.
    Question: Coda is derived from AFS. What aspects of AFS simplify the design of Coda? Imagine you were instead building Coda on top of NFS; what aspects of Coda would be easier or harder (e.g., within Hoarding, Emulation, and Reintegration)? Make sure you specify which version of NFS you are assuming!
  5. LBFS : A Low-Bandwidth Network File System
    Athicha Muthitacharoen, Benjie Chen (MIT), David Mazieres (NYU), SOSP'01
    Question: Create your own...
  6. Speculator : Speculative execution in a distributed file system
    Edmund B. Nightingale, Peter M. Chen, Jason Flinn
    Proceedings of the twentieth ACM symposium on Operating systems principles (SOSP'05), pages 191 - 205.
    Question: Imagine you are a system administrator who needs to decide whether to deploy SpecNFS (the speculative version of NFSv3) or NFSv4. What are the pros and cons of each version? Which would you choose and why? (You can assume a reputable implementation of each version exists.)
  7. Analysis1 :
    • Black-Box : Performance Debugging for Distributed Systems of Black Boxes
      Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, Athicha Muthitacharoen
      (HP Labs, Duke, and MIT), SOSP'03
    • Paths : Path-Based Failure and Evolution Management
      Mike Y. Chen, University of California, Berkeley; Anthony Accardi, Tellme; Emre Kiciman, Stanford University; Dave Patterson, University of California, Berkeley; Armando Fox, Stanford University; Eric Brewer, University of California, Berkeley, NSDI'04
    • Question: Both of these papers describe techniques for understanding the behavior of large-scale distributed systems. Briefly, how does each of the two techniques determine that certain messages are related? What are the relative strengths and weaknesses of the two approaches, and what types of problems can each one find?
  8. Centera : Deconstructing Commodity Storage Clusters
    Haryadi Gunawi, Nitin Agrawal, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
    ISCA'05
    Question: This paper shows that one can actively delay packets to determine whether or not a subsequently sent packet is dependent. What are the strengths and weaknesses of this approach for inferring causality? Did this delay technique discover any non-obvious aspects of the Centera write protocol?
  9. MapReduce : MapReduce: Simplified Data Processing on Large Clusters
    Jeffrey Dean and Sanjay Ghemawat
    OSDI'04
    Improving MapReduce Performance in Heterogeneous Environments
    Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica, University of California, Berkeley
    OSDI'08
    Question: One of the assumptions made by the Hadoop Scheduler is that tasks in the same category (map or reduce) require roughly the same amount of work (see Section 2.2 of Heterogeneous paper). How does a MapReduce job typically try to ensure this assumption holds true? The LATE scheduler does not directly address this assumption. How does the LATE scheduler handle tasks with more work? How could you modify the scheduler (or any aspect of the MapReduce framework) to better handle jobs with high variance in work across tasks?
  10. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
    Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
    European Conference on Computer Systems (EuroSys), Lisbon, Portugal,
    March 21-23, 2007
    Question: Imagine you are reviewing this paper for SOSP. In your review, summarize the paper and then give your arguments for why or why not this paper should be accepted. What are its contributions? What are the weaknesses?
  11. Migration
    • SpriteMigration -- Transparent Process Migration: Design Alternatives and the Sprite Implementation
      Fred Douglis and John K. Ousterhout
      Software - Practice and Experience, Volume 21, Number 8, 1991, Pages 757-785.
    • V Migration : Preemptable Remote Execution Facility for the V-System
      M. Theimer, K. Lantz, and D. Cheriton
      10th Symposium on Operating Systems Principles, Orcas Island, WA, December 1985, pp. 2-12.
    • Question: Migration mechanisms make trade-offs between four factors: transparency, residual dependencies, performance, and complexity. What did the Sprite and V designers choose for each factor? How did their assumptions about their environment and usage scenarios influence each of their decisions?
  12. More Migration
    • Zap : The Design and Implementation of Zap: A System for Migrating Computing Environments
      Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh, Columbia University,
      OSDI'04
    • VMmigration : Live Migration of Virtual Machines
      Christopher Clark, Keir Fraser, and Steven Hand, University of Cambridge Computer Laboratory; Jacob Gorm Hansen and Eric Jul, University of Copenhagen; Christian Limpach, Ian Pratt, and Andrew Warfield, University of Cambridge
      Symposium on Networked Systems Design and Implementation (NSDI'05), May 2005
    • Question: Please create three questions that you think would be interesting for everyone to discuss during class. The questions can cover either or both papers. Send your questions by 12:00 (instead of 1:00), please.
  13. Porcupine: Manageability, Availability and Performance in Porcupine: A Highly Scalable Internet Mail Service
    Yasushi Saito, Brian Bershad, and Hank Levy
    17th ACM Symposium on Operating Systems Principles, Dec 1999, Kiawah Island Resort
    Question: Porcupine (and other distributed system services) characterizes state as being either hard state or soft state. What is the difference between the two? What are the advantages of treating some state as soft? Briefly, how does Porcupine recreate each piece of soft state when needed?
  14. xFS : Serverless Network File Systems
    Tom Anderson, Mike Dahlin, Jeanna Neefe, David Patterson, Drew Roselli, Randy Wang.
    SOSP 15, December 1995.
    Question: How does xFS utilize a log for data and meta-data? What is the purpose of the log? How are the data structures maintained? What are the advantages of writing to a log?
  15. GoogleFS : The Google File System
    Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung
    SOSP'03
    Question: Where does GoogleFS rely upon soft state and stale information? Discuss the implications and whether or not these appear to be good design decisions.
  16. LOCKSS : Preserving Peer Replicas By Rate-Limited Sampled Voting
    Petros Maniatis, Mema Roussopoulos, TJ Giuli, David S. H. Rosenthal, Mary Baker, Yanto Muliadi
    SOSP'03
    Question: What is the goal of a malign node in this environment? What is the best strategy a malign node can use? Must malign nodes initiate votes of their own (why or why not)? Must malign nodes participate in the votes of others (why or why not)?
  17. Dynamo : Dynamo: Amazon's Highly Available Key-Value Store
    Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels
    Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.
    Question: Amazon's key-value store, Dynamo, lets services trade off performance, durability, and availability. What are some of the techniques Dynamo uses to improve one of those three metrics? How does it allow services to control the trade-offs?
  18. Pangaea : Taming Aggressive Replication in the Pangaea Wide-Area File System
    Yasushi Saito, Christos Karamanolis, Magnus Karlsson, and Mallik Mahalingam, HP Labs, OSDI'02
    Question: Why does Pangaea have two classes of replicas: gold and bronze? What is the purpose of each (why not just have gold or just have bronze)? How does Pangaea ensure it has enough replicas?

 