CS-739: Project Suggestions

Billion-File File Systems: Local file systems form the core of many distributed storage systems. As local hard drives and SSDs scale, the number of objects stored in each node grows. The challenge here is to study such local systems at scale; how well do file systems handle your billion-file challenge? How can we study such systems? Which file systems hold up well under such workloads, and which don't? What techniques are needed to use said file systems effectively? Related slides: http://events.linuxfoundation.org/slides/2010/linuxcon2010_wheeler.pdf
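
To get a feel for the measurement side, a minimal sketch (paths and batch sizes are placeholders, not a full methodology) might time file creation in batches and watch whether throughput degrades as the object count grows:

```python
# Minimal sketch: time file creation in batches to see how a local file
# system behaves as object counts grow. ROOT/BATCH are illustrative; a
# real study would vary fs type, directory fanout, metadata ops, etc.
import os, time

ROOT = "/mnt/testfs/manyfiles"   # assumed mount point of the fs under test
BATCH = 100_000
NUM_BATCHES = 50                  # scale up toward a billion as space allows

os.makedirs(ROOT, exist_ok=True)
for b in range(NUM_BATCHES):
    start = time.perf_counter()
    d = os.path.join(ROOT, f"dir{b:05d}")
    os.mkdir(d)
    for i in range(BATCH):
        with open(os.path.join(d, f"f{i:07d}"), "w") as f:
            f.write("x")
    elapsed = time.perf_counter() - start
    print(f"batch {b}: {BATCH/elapsed:,.0f} creates/sec")
```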

Cloud-native WiscKey: The cloud-native movement imagines how we can build software systems assuming that the cloud exists. Here, we take one small step: how can we take WiscKey and make it more suitable for deployment in the cloud? How can we use a storage system such as S3 to create snapshots of its data, etc.?
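
One possible shape for the snapshot piece, sketched with boto3 (the bucket name and on-disk layout are assumptions; WiscKey itself ships no such code): upload the value log before the index files, so a snapshot's index never points at vLog data missing from S3.

```python
# Hedged sketch: snapshot a WiscKey-style store (LSM index + value log)
# to S3 with boto3. Bucket and file-naming scheme are invented here.
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "wisckey-snapshots"     # hypothetical bucket
DB_DIR = "/var/lib/wisckey"      # assumed data directory

def snapshot(db_dir: str, snap_id: str) -> None:
    # Value log first, then index files, to keep snapshots self-consistent.
    files = sorted(os.listdir(db_dir),
                   key=lambda f: 0 if f.endswith(".vlog") else 1)
    for name in files:
        s3.upload_file(os.path.join(db_dir, name), BUCKET,
                       f"{snap_id}/{name}")

snapshot(DB_DIR, "snap-0001")
```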

Data-driven Lambdas: Lambda-based computing is increasingly popular. In this project, you'll build support for data-driven lambdas, by taking an existing scalable storage system and integrating it with OpenLambda such that the location of lambda execution will take advantage of locality where possible. Part of this will include building some kind of Lambda benchmark suite to test your ideas.
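
The core policy decision can be prototyped in a few lines; here is one hedged sketch where the storage map and load table stand in for real cluster state:

```python
# Sketch of locality-aware placement: route an invocation to a node that
# already holds its input object, falling back to the least-loaded node.
def place_lambda(input_key, object_locations, node_load):
    """object_locations: key -> list of nodes holding replicas
       node_load: node -> outstanding invocations"""
    replicas = object_locations.get(input_key, [])
    if replicas:
        return min(replicas, key=lambda n: node_load[n])  # locality first
    return min(node_load, key=node_load.get)              # else least-loaded

locations = {"img-42": ["node-a", "node-c"]}
load = {"node-a": 3, "node-b": 0, "node-c": 1}
print(place_lambda("img-42", locations, load))  # -> node-c
```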

Distributed Block Allocation: Block allocation is at the heart of distributed storage systems. In this project, you'll first study how block allocation works in one or more existing distributed storage systems. What algorithms are used? What inputs do they take? Then, you'll analyze how well such schemes work at scale, in the presence of heterogeneous space constraints, in the presence of performance problems, and other interesting scenarios. Are current approaches good enough to work well in all these scenarios?
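
As a strawman to analyze against, here is a toy allocator in the HDFS/GFS mold; the inputs are invented for illustration, and real systems also weigh racks, write pipelines, and more:

```python
# Toy allocator: pick replica targets weighted by free space, skipping
# unhealthy or full nodes, never placing two replicas on one node.
import random

def choose_targets(nodes, replicas=3):
    """nodes: dict name -> {'free': bytes_free, 'healthy': bool}"""
    candidates = {n: s["free"] for n, s in nodes.items()
                  if s["healthy"] and s["free"] > 0}
    picked = []
    for _ in range(min(replicas, len(candidates))):
        names, weights = zip(*candidates.items())
        n = random.choices(names, weights=weights)[0]
        picked.append(n)
        del candidates[n]
    return picked

cluster = {"n1": {"free": 10**12, "healthy": True},
           "n2": {"free": 10**9,  "healthy": True},
           "n3": {"free": 5 * 10**11, "healthy": True},
           "n4": {"free": 10**12, "healthy": False}}
print(choose_targets(cluster))
```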

Faults and Divergence in Replicated State Machines: One hard problem in RSMs is ensuring no divergence; even if replicated servers receive the same inputs, there is no guarantee they will behave the same, so careful programming is required. In this project, you will inject faults into some real replicated services and see if you can get the replicas to diverge (to make different decisions and thus lead to observably different behavior). What happens when a disk write fails? When a memory allocation fails? What if memory becomes corrupt? etc.
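
A tiny model of the experiment, before touching a real service (the Replica class is a stand-in, and the fault is a silently dropped update): wrap the write path so it fails with some probability, then compare state digests across replicas.

```python
# Minimal divergence harness: inject write failures per-replica, then
# hash each replica's state; differing digests mean divergence.
import hashlib, random

class Replica:
    def __init__(self, seed, fail_prob=0.01):
        self.state, self.rng = [], random.Random(seed)
        self.fail_prob = fail_prob
    def apply(self, cmd):
        if self.rng.random() < self.fail_prob:   # injected write failure
            return                                # update silently dropped
        self.state.append(cmd)
    def digest(self):
        return hashlib.sha256("".join(self.state).encode()).hexdigest()

replicas = [Replica(seed=s) for s in (1, 2, 3)]
for i in range(10_000):
    for r in replicas:
        r.apply(f"cmd{i}")
for n, r in enumerate(replicas):
    print(f"replica {n}: {r.digest()[:12]}")     # divergence is visible here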

Fault Analysis of A Distributed Block Store: Take the open-source DRBD replicated block store, and perform an analysis of its reaction to faults. Use this paper as a model. How does the system handle machine faults, network faults, disk faults, performance delays and timeouts, etc.? How robust is it in these scenarios? This important class of system is understudied: only you can rectify this through your efforts.
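
One fault scenario, sketched (the device path, peer IP, and timings are assumptions, and this needs root): partition the DRBD peer with iptables while timing writes to the replicated device, then heal and watch recovery.

```python
# Sketch: measure write latency to /dev/drbd0 before, during, and after
# a network partition of the peer. Hedged: real runs should add timeouts,
# sweep fault types, and script many trials.
import subprocess, time

PEER = "10.0.0.2"                  # assumed peer address
DEV = "/dev/drbd0"                 # assumed DRBD device

def timed_write():
    t = time.perf_counter()
    subprocess.run(["dd", "if=/dev/zero", f"of={DEV}", "bs=4k",
                    "count=1", "oflag=direct"], check=True)
    return time.perf_counter() - t

print("healthy write:", timed_write())
subprocess.run(["iptables", "-A", "INPUT", "-s", PEER, "-j", "DROP"])
time.sleep(5)                      # let the partition take effect
print("partitioned write:", timed_write())
subprocess.run(["iptables", "-D", "INPUT", "-s", PEER, "-j", "DROP"])
```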

Gray Study, Redone: We read Gray's paper on how systems fail and what to do about it. Can we redo this study for a modern IT shop? Work with the CSL to get failure logs and the like, analyze them, and try to reproduce Gray's data. What is the same now, and what is different, in this context? What new things can we learn, especially considering the increasingly cloudy world?
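
The tabulation step is simple once you have logs; a sketch in the spirit of Gray's breakdown (the CSV format here is hypothetical, and real CSL logs will need their own parser):

```python
# Sketch: bucket failure-log entries by cause and report percentages,
# echoing Gray's outage tabulation. Columns assumed: date,component,cause.
import csv
from collections import Counter

counts = Counter()
with open("failures.csv") as f:
    for row in csv.DictReader(f):
        counts[row["cause"]] += 1

total = sum(counts.values())
for cause, n in counts.most_common():
    print(f"{cause:20s} {n:6d}  ({100*n/total:.1f}%)")
```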

Kill The Provably Correct Systems: Provably correct systems are increasingly the focus of a branch of systems work. In this project, you'll show the folly of such approaches by introducing faults and other bugs into these proven-correct systems and seeing how they react. Do their proofs guarantee much at all, in the real world? Can you turn the world upside down by finding flaws in these perfect systems?
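
One concrete fault to try first, since proofs typically assume a correct environment: flip a byte in the system's persisted state and watch the restart. The path below is a placeholder for whatever the system under test writes to disk.

```python
# Sketch: corrupt one random byte of on-disk state, probing behavior
# outside the proof's fault model.
import os, random

def flip_random_byte(path: str) -> int:
    size = os.path.getsize(path)
    off = random.randrange(size)
    with open(path, "r+b") as f:
        f.seek(off)
        b = f.read(1)[0]
        f.seek(off)
        f.write(bytes([b ^ 0xFF]))        # invert one byte
    return off

print("corrupted offset", flip_random_byte("/var/lib/verified-kv/state.bin"))
```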

Lambda Application Suite: While Lambdas represent an increasingly popular way to build distributed applications, little is known about their strengths and weaknesses in this regard. In this project, you'll build one serious application of your choice on top of Lambdas. Pick something substantial (e.g., email? social network?); then use the application as a way to evaluate the successes and weaknesses of Lambdas as an approach. Perhaps the best kind of application is one that likely needs to handle bursts as part of its workload. Also OK: find real applications built in this style as a starting point.
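
If bursts are the interesting case, a tiny burst generator is enough to start the evaluation (the URL and burst size are placeholders for your deployed endpoint):

```python
# Sketch: fire a burst of concurrent requests at a Lambda-backed endpoint
# and record completion times, to see how the platform absorbs spikes.
from concurrent.futures import ThreadPoolExecutor
import time, urllib.request

URL, BURST = "https://example.com/lambda-endpoint", 200

def hit(_):
    t = time.perf_counter()
    urllib.request.urlopen(URL).read()
    return time.perf_counter() - t

with ThreadPoolExecutor(max_workers=BURST) as pool:
    lats = sorted(pool.map(hit, range(BURST)))
print(f"burst of {BURST}: median {lats[BURST//2]:.3f}s, worst {lats[-1]:.3f}s")
```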

Log-structured Merge Tree File System: Log-structured merge trees are increasingly common, but only as user-level storage systems. In this project, you'll build a new Linux file system around the LSM concept, to show its advantages at the lower layers of a system. Get experience with what it is like to build an in-kernel file system; also, suffer a lot, because kernel programming is akin to suffering.
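
Before suffering in the kernel, a toy user-level LSM fixes the ideas: a memtable that flushes to sorted, immutable runs, with reads checking newest-first. A file-system version would map these runs onto on-disk segments (and add compaction, omitted here).

```python
# Toy LSM: in-memory memtable, flushed to sorted runs at a size limit.
import bisect

class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable, self.runs, self.limit = {}, [], memtable_limit
    def put(self, k, v):
        self.memtable[k] = v
        if len(self.memtable) >= self.limit:
            self.runs.append(sorted(self.memtable.items()))  # flush
            self.memtable = {}
    def get(self, k):
        if k in self.memtable:
            return self.memtable[k]
        for run in reversed(self.runs):                      # newest first
            i = bisect.bisect_left(run, (k,))
            if i < len(run) and run[i][0] == k:
                return run[i][1]
        return None

db = ToyLSM()
for i in range(10):
    db.put(f"k{i}", i)
print(db.get("k3"), db.get("nope"))
```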

Parallel Cloud Build: One of the most important tasks for a developer is the build system. Today's cloud gives us an opportunity to build the best and fastest build system there is. In this project, you'll build a fast parallel build system for the Linux kernel or other incredibly large source tree (in fact, the bigger the source tree, the better). How fast can you make such a build?
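
The scheduling core is a dependency DAG run by a worker pool, launching a target the moment its prerequisites finish; here is a hedged local sketch where compile commands are stubbed (a real system would shell out to gcc on cloud workers):

```python
# Sketch: run a build DAG with a thread pool; a target is submitted as
# soon as all of its prerequisites complete.
from concurrent.futures import ThreadPoolExecutor
import threading

def build(dag, compile_fn, workers=8):
    """dag: target -> set of prerequisite targets"""
    remaining = {t: set(d) for t, d in dag.items()}
    dependents = {}
    for t, deps in dag.items():
        for d in deps:
            dependents.setdefault(d, []).append(t)
    lock, done, left = threading.Lock(), threading.Event(), [len(dag)]
    pool = ThreadPoolExecutor(max_workers=workers)

    def run(t):
        compile_fn(t)
        ready = []
        with lock:
            left[0] -= 1
            if left[0] == 0:
                done.set()
            for child in dependents.get(t, []):
                remaining[child].discard(t)
                if not remaining[child]:
                    ready.append(child)
        for child in ready:
            pool.submit(run, child)

    for t, deps in dag.items():
        if not deps:
            pool.submit(run, t)
    done.wait()
    pool.shutdown()

build({"a.o": set(), "b.o": set(), "app": {"a.o", "b.o"}},
      lambda t: print("built", t))
```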

Personal Clouds (PC): In this project, you'll make it as easy to use the cloud as it is to run a python script on your local machine. The PC project will have you build a system to enable the easy launch of python/whatever scripts on a cloud service; problems you'll have to solve include how to securely launch a job, how to access data, and how to manage cloud resources within a given budget. This will be a fun implementation project, if you like that sort of thing.
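
A sketch of the launch path using boto3/EC2 (the AMI id is a placeholder, and key handling, result retrieval, and budget enforcement are the real work left to you): boot a small instance whose user-data runs the user's script.

```python
# Hedged sketch: "run this python script in the cloud" via EC2 user-data.
import boto3

ec2 = boto3.client("ec2")

def launch_script(script_path: str, instance_type="t3.micro"):
    with open(script_path) as f:
        user_data = "#!/bin/bash\npython3 - <<'EOF'\n" + f.read() + "\nEOF\n"
    resp = ec2.run_instances(ImageId="ami-0123456789abcdef0",  # placeholder AMI
                             InstanceType=instance_type,
                             MinCount=1, MaxCount=1,
                             UserData=user_data)
    return resp["Instances"][0]["InstanceId"]

print("launched", launch_script("analysis.py"))
```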

Python At Scale: An increasing amount of data analysis is done via Python. However, Python has problems at scale, especially when using a lot of memory. In this project, you'll quantify the overheads of Python during typical analysis tasks and prototype methods and techniques to build a memory-efficient Python. Such analysis may be particularly important in cloud-execution systems such as AWS Lambda, when using Python as the main developer language.
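
A quick first quantification, using the standard library's tracemalloc: compare the raw data size to what Python actually allocates for a common container shape.

```python
# Measure per-object overhead of a typical Python record layout.
import sys, tracemalloc

tracemalloc.start()
objs = [{"id": i, "val": float(i)} for i in range(100_000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

raw = 100_000 * (8 + 8)   # one int + one float payload, in C terms
print(f"allocated: {current/2**20:.1f} MiB for {raw/2**20:.1f} MiB of raw data")
print("one dict:", sys.getsizeof(objs[0]), "bytes")
```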

Science of Scalability: Scalability is a key feature of distributed systems, yet what does it really mean? In this project, you'll develop techniques to study how various systems scale. What are the key limitations on scale, and how can you evaluate a particular system's ability to overcome these limitations? There are a number of ways to go with this: build a simulation platform, or figure out how to run real code on some kind of emulator to stress-test it. Use your tool to simulate famous distributed systems of our time, including GFS, BigTable, Dynamo, and others.
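
One simple analytical lens to calibrate against, not a claim about any particular system: the Universal Scalability Law, X(N) = N / (1 + a(N-1) + bN(N-1)), where a captures contention and b captures coherence cost. The parameters below are hypothetical.

```python
# Universal Scalability Law: throughput speedup vs. node count.
def usl(n, a, b):
    return n / (1 + a * (n - 1) + b * n * (n - 1))

# Hypothetical knobs: 3% contention, 0.01% coherence overhead.
for n in (1, 8, 64, 512, 4096):
    print(f"{n:5d} nodes -> speedup {usl(n, 0.03, 0.0001):8.1f}")
```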

SSD Fault Simulation: We read about how SSDs fail in class, and some theories of why the very unusual bathtub-shaped failure rates arise. In this project, you'll build a simulator of SSDs that includes localized failure behavior, and see if you can replicate the type of behavior seen in the SSD failure paper. What are important parameters? How does such failure (and the remapping needed in the FTL) affect performance? Does the theory of SSD failure found in that paper match what you can produce via simulation?
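
A skeleton of such a simulator, with every constant an invented knob to explore: each block accrues wear per erase, and its failure probability grows with wear, so failures cluster late in life (one candidate explanation for the shapes in the failure study).

```python
# Sketch: wear-dependent block failure in a simulated SSD.
import random

BLOCKS, STEPS = 10_000, 200
wear = [0] * BLOCKS
failed = set()

for step in range(STEPS):
    for _ in range(500):                        # erases per step (arbitrary)
        b = random.randrange(BLOCKS)
        if b in failed:
            continue
        wear[b] += 1
        if random.random() < 1e-6 * wear[b] ** 2:   # failure grows with wear
            failed.add(b)                            # FTL must remap this block
    if step % 40 == 0:
        print(f"step {step}: {len(failed)} failed blocks")
```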

Tail Latency and Predictable Local Storage Systems: Tail latency has become a central focus in performance of large-scale systems. The question is not what the average response time is, but rather what the 99th percentile of requests will see. Clearly, the closer 99th-percentile behavior is to average, the more predictable the behavior of your system is. In this project, you'll start by measuring latencies of local file systems (the key building blocks in distributed storage) to understand what kind of latency profiles are common. What functionality in the file system can lead to different observed tail performance? (Think about reading/writing, caching with different-sized memories, path hierarchy depth, lots of files in a directory, and other functionality that could affect performance.) If you find some interesting problems, you could then take the next step and start to build a more predictable local storage system; how can you make it such that the local storage system is a highly predictable building block for larger-scale systems?
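
A starting point for the measurement (the path and sizes are placeholders; vary caching, directory depth, and the rest from here): time many 4KB reads at random offsets and report the median against the 99th percentile.

```python
# Measure p50 vs. p99 latency of random 4KB reads from one file.
import os, random, time
from statistics import quantiles

PATH, READS = "/mnt/testfs/bigfile", 10_000
size = os.path.getsize(PATH)
lat = []
fd = os.open(PATH, os.O_RDONLY)
for _ in range(READS):
    off = random.randrange(0, max(1, size - 4096))
    t = time.perf_counter()
    os.pread(fd, 4096, off)
    lat.append((time.perf_counter() - t) * 1e6)
os.close(fd)
qs = quantiles(lat, n=100)
print(f"p50 {qs[49]:.1f}us  p99 {qs[98]:.1f}us")
```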

Unwritten Contracts: Each system has an unwritten contract (read this paper about SSDs for one example), which is the set of rules you should follow in order to use the system well. In this project, you'll pick some system (e.g., a distributed key-value store, a local file system) and build tools and techniques to discover said rules. This will be a fun exploration of a complex system of some kind.
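
To give the flavor of a contract-discovery probe: compare sequential against random reads on the system under test; a large gap suggests an unwritten rule ("access sequentially"). Real tools would sweep many such dimensions, and the file path here is a placeholder.

```python
# Probe: sequential vs. random 4KB reads. Drop caches between runs, or
# the second pass will be served from memory and mask the gap.
import os, random, time

def read_all(fd, offsets):
    t = time.perf_counter()
    for off in offsets:
        os.pread(fd, 4096, off)
    return time.perf_counter() - t

PATH = "/mnt/testfs/probe.dat"
fd = os.open(PATH, os.O_RDONLY)
size = os.path.getsize(PATH)
seq = list(range(0, size - 4096, 4096))
rnd = seq[:]
random.shuffle(rnd)
print("sequential:", read_all(fd, seq))
print("random:    ", read_all(fd, rnd))
os.close(fd)
```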

User-level Logging Alternatives: As we saw in the ALICE paper, there are a number of different ways to implement local update protocols. In this work, you will study the range of different approaches used across systems, and try to classify them into a taxonomy. What are the common techniques? Can a general approach be developed and plugged in underneath a number of different systems, without losing performance? You will start with a survey of how things are built, and then perhaps try building a generic library that can serve as a substitute, thus making high-performance correct crash consistency achievable for all.
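
The canonical protocol in this space, as background for the taxonomy: append a record to a write-ahead log and fsync it before applying the update in place, so a crash can always replay or discard a partial update. A generic library would hide exactly this ordering; the sketch below is one minimal rendering of it.

```python
# Write-ahead logging: make the log record durable before the in-place
# update, preserving crash consistency.
import os

def wal_append(log_path: str, record: bytes) -> None:
    fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, len(record).to_bytes(4, "little") + record)
        os.fsync(fd)                 # record durable before in-place update
    finally:
        os.close(fd)

wal_append("/tmp/app.wal", b"set k=v")
# ...only now is it safe to update the main on-disk data structure...
```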