CS-739: Project Suggestions

This year we have two flavors of projects. The first is more developer-oriented: you work on something within Ceph that is directly useful today and that, with a good effort on your part, hopefully becomes part of Ceph. Cool! The second is more research-oriented: you seek to learn something about Ceph and produce a research result. The best of these will turn into real published papers. Also cool! Pick a few projects you like, and then, in consultation with me, we will come to a final decision.

For developer-oriented projects, you should consult with Sage as well as me. This consultation will take place Monday, November 19th (although sending him email ahead of that time is ok -- check with me). For research-oriented projects, you only need to consult with me.

Developer-oriented Projects

1. Data-movement Visualization for Dashboard: There are some snazzy visualizations for showing flows between lots of nodes (see here and here for example). In this project, you will: add instrumentation to the OSD to gather the relevant metrics, pass them to the manager, make them available to manager modules, and add a dashboard component to see the results.
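
To make the data pipeline concrete, here is a minimal sketch of the aggregation step: reducing raw (source OSD, destination OSD, bytes) samples into a flow matrix that a chord- or Sankey-style widget could render. The sample format and function names are illustrative assumptions, not Ceph's actual interfaces.

```python
# Hypothetical sketch: aggregate (src, dst, bytes) samples reported by OSDs
# into a flow matrix suitable for a chord/Sankey-style dashboard component.
from collections import defaultdict

def build_flow_matrix(samples):
    """samples: iterable of (src_osd, dst_osd, bytes_moved) tuples."""
    flows = defaultdict(int)
    for src, dst, nbytes in samples:
        flows[(src, dst)] += nbytes
    osds = sorted({o for pair in flows for o in pair})
    index = {osd: i for i, osd in enumerate(osds)}
    matrix = [[0] * len(osds) for _ in osds]
    for (src, dst), nbytes in flows.items():
        matrix[index[src]][index[dst]] = nbytes
    return osds, matrix

if __name__ == "__main__":
    samples = [(0, 1, 4096), (0, 1, 8192), (2, 0, 1024)]
    osds, matrix = build_flow_matrix(samples)
    print(osds)    # [0, 1, 2]
    print(matrix)  # [[0, 12288, 0], [0, 0, 0], [1024, 0, 0]]
```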

2. OSD/cluster Visualization: In large-scale systems, it is challenging to see what is going on. In this project, you will develop a scalable visualization for the OSD/cluster hierarchy. Should you use a tree view? A concentric circle segment view? Figure out different approaches and build tools to showcase your results.
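
As a starting point, here is a small sketch that turns a flat parent/child hierarchy (loosely modeled on `ceph osd tree` output; the input format is an assumption) into the nested structure that tree and sunburst renderers such as d3 typically consume.

```python
# Sketch: convert a flat CRUSH-like hierarchy into nested dicts that a tree
# or concentric-segment (sunburst) renderer can consume directly.
def build_tree(nodes, parent_of):
    """nodes: {id: name}; parent_of: {child_id: parent_id or None}."""
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    def subtree(node_id):
        return {"name": nodes[node_id],
                "children": [subtree(c) for c in sorted(children.get(node_id, []))]}

    roots = [n for n in nodes if parent_of.get(n) is None]
    return [subtree(r) for r in roots]

if __name__ == "__main__":
    nodes = {0: "root", 1: "rack1", 2: "osd.0", 3: "osd.1"}
    parent_of = {0: None, 1: 0, 2: 1, 3: 1}
    print(build_tree(nodes, parent_of))
```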

3. OSD: New distributed reserver class: Here, you will implement a distributed algorithm for making efficient reservations for background work. This should include priorities and preemption. Eventually, this could replace OSD scrub scheduling and AsyncReserver (that part is a difficult integration task; writing the component itself can be done separately).
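
For a feel of the semantics involved, here is a minimal single-node sketch of a reserver with priorities and preemption, loosely inspired by AsyncReserver; a distributed version would additionally need to coordinate grants across OSDs. All names here are illustrative.

```python
import heapq, itertools

class Reserver:
    def __init__(self, max_allowed):
        self.max_allowed = max_allowed
        self.seq = itertools.count()  # tiebreaker so the heap never compares items
        self.waiting = []             # (-priority, seq, item, on_grant, on_preempt)
        self.granted = {}             # item -> (priority, on_preempt)

    def request(self, item, priority, on_grant, on_preempt=None):
        heapq.heappush(self.waiting,
                       (-priority, next(self.seq), item, on_grant, on_preempt))
        self._schedule()

    def release(self, item):
        self.granted.pop(item, None)
        self._schedule()

    def _schedule(self):
        while self.waiting:
            neg_prio, _, item, on_grant, on_preempt = self.waiting[0]
            if len(self.granted) < self.max_allowed:
                heapq.heappop(self.waiting)
                self.granted[item] = (-neg_prio, on_preempt)
                on_grant(item)
                continue
            # At capacity: preempt the lowest-priority holder, but only if it
            # is strictly lower than the highest-priority waiter.
            victim = min(self.granted, key=lambda i: self.granted[i][0])
            v_prio, v_on_preempt = self.granted[victim]
            if v_prio >= -neg_prio:
                break
            del self.granted[victim]
            if v_on_preempt:
                v_on_preempt(victim)

if __name__ == "__main__":
    r = Reserver(max_allowed=1)
    r.request("scrub-pg1", 1, lambda i: print("grant", i),
              lambda i: print("preempt", i))
    r.request("recovery-pg2", 5, lambda i: print("grant", i))
```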

4. Dedup/CAS - Reference counting with backpointers and bounded representation: A dedup content-addressable storage (CAS) tier will reference-count objects named by their hash (e.g., a SHA-1 fingerprint). We want backpointers to enable a scrub/consistency check and/or debugging. Some objects will have zillions of references. Thus, devise a representation that keeps a complete set of backpointers in the common case of a small number of references, and less precise backpointers when there are many; e.g., for small counts, point to the actual objects that reference the chunk, and for large counts, point to RADOS pool IDs, or a pool ID plus a hash prefix/mask.
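
One possible shape for such a bounded representation, sketched under the assumption of 32-bit object hashes (the threshold and degradation policy are illustrative): exact backpointers below a limit, degrading to (pool ID, hash prefix) entries above it. The key property for scrub is that the coarse form may yield false positives but never false negatives.

```python
PRECISE_LIMIT = 16  # illustrative; real value would be tuned

class BackpointerSet:
    def __init__(self):
        self.precise = set()  # {(pool_id, object_hash)}: exact references
        self.coarse = set()   # {(pool_id, hash_prefix)}: imprecise fallback

    def add(self, pool_id, object_hash, prefix_bits=8):
        if self.coarse:
            self.coarse.add((pool_id, object_hash >> (32 - prefix_bits)))
        elif len(self.precise) < PRECISE_LIMIT:
            self.precise.add((pool_id, object_hash))
        else:
            # Too many references: collapse everything to the coarse form.
            self.coarse = {(p, h >> (32 - prefix_bits)) for p, h in self.precise}
            self.precise.clear()
            self.coarse.add((pool_id, object_hash >> (32 - prefix_bits)))

    def may_reference(self, pool_id, object_hash, prefix_bits=8):
        """Never a false negative; may return false positives once coarse."""
        return ((pool_id, object_hash) in self.precise or
                (pool_id, object_hash >> (32 - prefix_bits)) in self.coarse)
```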

5. Network-connectivity Checker: Network connectivity within a cluster is critical. In this project, develop a connectivity checker within Ceph. Do this by adding OSD commands to send network traffic to another OSD, and building a manager module to implement point-to-point checks or a whole-mesh check.
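
A hedged sketch of the whole-mesh check at the manager-module level: probe every ordered OSD pair in parallel and report pairs that time out or are slow. Here `probe()` is a local stand-in for the to-be-added OSD command, not a real Ceph call.

```python
import itertools, random, time
from concurrent.futures import ThreadPoolExecutor

def probe(src_osd, dst_osd):
    """Placeholder: return round-trip time in seconds, or None on timeout."""
    time.sleep(0.001)                     # pretend to do network I/O
    return random.uniform(0.0002, 0.002)

def mesh_check(osds, rtt_threshold=0.5):
    pairs = list(itertools.permutations(osds, 2))
    with ThreadPoolExecutor(max_workers=32) as pool:
        results = pool.map(lambda p: (p, probe(*p)), pairs)
        bad = [(pair, rtt) for pair, rtt in results
               if rtt is None or rtt > rtt_threshold]
    return bad

if __name__ == "__main__":
    print(mesh_check(range(4)) or "mesh OK")
```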

6. Identifying Performance Outliers: Develop a manager module to identify performance outliers; can you notice when OSDs are consistently slow? Should you use in-manager metrics, or pull from Prometheus? Can you mitigate these problems with primary-affinity? Inject performance problems into Ceph and see if your approach can reliably detect them.
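
One simple, robust detector to start from: flag OSDs whose latency sits many median-absolute-deviations above the cluster median. The metric source (in-manager vs. Prometheus) is left open; the input here is just a plain dict.

```python
import statistics

def find_outliers(latency_by_osd, k=5.0):
    """latency_by_osd: {osd_id: latency_ms}; returns suspect OSD ids."""
    values = list(latency_by_osd.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    return [osd for osd, v in latency_by_osd.items() if (v - med) / mad > k]

if __name__ == "__main__":
    sample = {0: 4.1, 1: 3.9, 2: 4.3, 3: 41.0, 4: 4.0}
    print(find_outliers(sample))  # [3]
```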

7. Device Failure-prediction Improvements: See this document for some details, or this video for some background. Develop a failure-prediction algorithm that builds a better predictor from the available data. Can we make meaningful predictions with imprecise failure data (i.e., no explicit this-device-failed indication)? Also, add server-side infrastructure to collect the relevant metrics.
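
As a deliberately simple baseline, here is a sketch that ranks devices by growth in reallocated and pending sector counts, two well-known SMART failure signals. The snapshot format and weights are assumptions; a real predictor would be learned from data.

```python
def risk_score(history):
    """history: time-ordered list of dicts with 'realloc'/'pending' counts."""
    first, last = history[0], history[-1]
    growth = ((last["realloc"] - first["realloc"]) +
              (last["pending"] - first["pending"]))
    return growth + 0.1 * last["realloc"]  # weight is an arbitrary guess

def rank_devices(histories):
    """histories: {device_id: [snapshot, ...]} -> device ids, riskiest first."""
    return sorted(histories, key=lambda d: risk_score(histories[d]), reverse=True)

if __name__ == "__main__":
    histories = {
        "sda": [{"realloc": 0, "pending": 0}, {"realloc": 0, "pending": 0}],
        "sdb": [{"realloc": 2, "pending": 0}, {"realloc": 40, "pending": 8}],
    }
    print(rank_devices(histories))  # ['sdb', 'sda']
```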

Research-oriented Projects

8. PACE Analysis of Ceph: The PACE framework explores the impact of correlated crashes upon distributed systems (see this for details). In this project, you will apply PACE to Ceph to study its behavior under this type of fault. What happens when nodes go down together? Study Ceph in detail and find its flaws.
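
To see why correlated crashes blow up the state space, consider this toy illustration: if each replica may have persisted any prefix of its pending writes at crash time (a simplification of what PACE actually models), a checker must explore the cross product of the local states.

```python
import itertools

def local_states(writes):
    """Simplification: assume ordered persistence, so states are prefixes."""
    return [writes[:i] for i in range(len(writes) + 1)]

def globally_reachable_states(per_node_writes):
    per_node = [local_states(w) for w in per_node_writes]
    return list(itertools.product(*per_node))

if __name__ == "__main__":
    # Three replicas, each with two pending writes at crash time.
    states = globally_reachable_states([["a1", "a2"], ["b1", "b2"], ["c1", "c2"]])
    print(len(states), "global states to check")  # 27
```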

9. Faults and Divergence in Replicated State Machines: One hard problem in RSMs is ensuring no divergence; even if replicated servers receive the same inputs, there is no guarantee they will behave the same, so careful programming is required. In this project, you will inject faults into Ceph's Paxos implementation to see if you can get the replicas to diverge (to make different decisions and thus exhibit observably different behavior). What happens when a disk write fails? When a memory allocation fails? What if memory becomes corrupt? Etc.
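
One way to structure the experiment is to interpose on the write path, inject failures or corruption, and then compare replica state afterward. The toy harness below shows the idea on a fake store; it is not Ceph's monitor code.

```python
import random

class FaultyStore:
    def __init__(self, fail_rate=0.0, corrupt_rate=0.0, seed=0):
        self.data, self.rng = {}, random.Random(seed)
        self.fail_rate, self.corrupt_rate = fail_rate, corrupt_rate

    def write(self, key, value):
        if self.rng.random() < self.fail_rate:
            raise IOError("injected write failure")
        if self.rng.random() < self.corrupt_rate:
            value = value ^ 0x1  # injected single-bit flip
        self.data[key] = value

def apply_log(store, log):
    """Apply a sequence of (key, value) decisions; skip failed writes."""
    for key, value in log:
        try:
            store.write(key, value)
        except IOError:
            pass  # a real replica must handle this carefully -- or it diverges
    return store.data

if __name__ == "__main__":
    log = [(i, i * 10) for i in range(100)]
    a = apply_log(FaultyStore(), log)
    b = apply_log(FaultyStore(fail_rate=0.02, seed=42), log)
    print("diverged keys:", sorted(k for k in a if a.get(k) != b.get(k)))
```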

10. Science of Scalability: Scalability is a key feature of distributed systems, yet what does it really mean? In this project, you'll develop techniques to study how various systems scale. What are the key limitations on scale, and how can you evaluate a particular system's ability to overcome them? There are a number of ways to go here: build a simulation platform, or figure out how to run real code on some kind of emulator to stress-test it. Use your tool to simulate or emulate Ceph's behavior at scale. Related: the David paper.
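
For the modeling route, one worked starting point is Gunther's Universal Scalability Law, which bounds speedup by a contention term (alpha) and a coherence term (beta); fitting those two parameters to measurements at a few cluster sizes lets you extrapolate, cautiously, to sizes you cannot deploy. The parameter values below are made up for illustration.

```python
def usl_throughput(n, alpha, beta, single_node=1.0):
    """Relative throughput of n nodes under the Universal Scalability Law:
    C(n) = n / (1 + alpha*(n - 1) + beta*n*(n - 1))."""
    return single_node * n / (1 + alpha * (n - 1) + beta * n * (n - 1))

if __name__ == "__main__":
    # Note how throughput peaks and then retrogrades once coherence costs win.
    for n in (1, 8, 64, 512, 4096):
        print(n, round(usl_throughput(n, alpha=0.02, beta=0.0001), 1))
```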

11. Software Overheads of Storage: As underlying storage devices become faster (speeds approaching those of memory are now nearly achievable), software overheads will come to dominate storage performance. In this project, you will evaluate the overheads of Ceph assuming an infinitely fast storage medium (e.g., memory). Where does the time go? Use profilers and related tools to obtain detailed breakdowns, and analyze where the overheads are excessive. Can you fix Ceph's software to make it ready for the next generation of storage devices?
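
The measurement style might look like the sketch below: time each stage of a request path against an in-memory backend, so that device time is roughly zero and only software overhead remains. The stage names are placeholders, not Ceph's actual op path.

```python
import time
from collections import defaultdict

totals = defaultdict(int)  # per-stage nanoseconds

def timed(stage, fn, *args):
    t0 = time.perf_counter_ns()
    result = fn(*args)
    totals[stage] += time.perf_counter_ns() - t0
    return result

def handle_write(store, key, value):
    buf = timed("encode", lambda: repr(value).encode())
    timed("checksum", lambda: hash(buf))
    timed("commit", store.__setitem__, key, buf)  # "device" is just a dict

if __name__ == "__main__":
    store = {}
    for i in range(100000):
        handle_write(store, i, i)
    total = sum(totals.values())
    for stage, ns in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{stage:10s} {ns / total:6.1%}")
```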

12. Tail Latency and Predictable Local Storage Systems: Tail latency has become a central focus in the performance of large-scale systems. The question is not what the average response time is, but rather what the 99th percentile of requests will see. Clearly, the closer 99th-percentile behavior is to the average, the more predictable your system is. In this project, you'll start by measuring latencies of OSDs (the key building blocks in distributed storage) to understand what kinds of latency profiles are common. What functionality in the OSD can lead to different observed tail performance? If you find some interesting problems, you could then take the next step and start to build a more predictable local storage system; how can you make the local storage system a highly predictable building block for larger-scale systems? Another direction is to study the impact of long-tail latencies within Ceph -- how much does unpredictability at the local level affect global operations?
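
The first measurement step is mechanical: reduce raw per-op latency samples to a percentile profile so OSDs (and changes to them) can be compared. A minimal sketch, assuming samples are already collected as a list of milliseconds; how you collect them (op tracer, client-side timing) is up to you.

```python
def percentile(sorted_samples, p):
    idx = min(int(p / 100.0 * len(sorted_samples)), len(sorted_samples) - 1)
    return sorted_samples[idx]

def latency_profile(samples_ms):
    s = sorted(samples_ms)
    return {p: percentile(s, p) for p in (50, 90, 99, 99.9)}

if __name__ == "__main__":
    import random
    random.seed(1)
    samples = [random.expovariate(1 / 2.0) for _ in range(100000)]  # mean 2 ms
    prof = latency_profile(samples)
    print({p: round(v, 2) for p, v in prof.items()})
    print("p99/p50 ratio:", round(prof[99] / prof[50], 1))
```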

13. Unwritten Contracts: Each system has an unwritten contract (read this paper about SSDs for one example): the set of rules you should follow in order to use the system well. In this project, you'll try to develop such a contract for Ceph. What tools and techniques are needed to discover said rules?