Nitin Agrawal
Department of Computer Sciences
University of Wisconsin-Madison
nitina at cs.wisc.edu
Nitin Agrawal's Home Page

Research

Enabling Realistic and Practical File-System Benchmarking (thesis research):

Everyone cares about data, from scientists running simulations to families storing photos and tax returns. Thus, the file and storage systems that store and retrieve our important data play an essential role in our computer systems. In spite of tremendous advances in file system design, approaches to benchmarking them still lag far behind. My dissertation research bridges this gap with three contributions.

In the first part of my thesis, I perform a large-scale analysis of file-system metadata collected over a period of five years. I use the metadata snapshots to study temporal changes in file size, file age, file-type frequency, and namespace structure, and draw lessons for designers of file systems and related software. We also present a generative model that explains the namespace structure and the distribution of directory sizes [FAST ’07, TOS ’07].
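
To give a flavor of such a generative model, here is a minimal sketch in the spirit of that work: a directory tree grown by preferential attachment, where directories that already have many subdirectories tend to attract more. The attachment rule and the constant c are illustrative assumptions for this sketch, not the published model.

    # Illustrative sketch (not the published model): grow a directory tree by
    # preferential attachment; a new directory picks a parent with probability
    # proportional to the parent's current subdirectory count plus a constant c.
    import random
    from collections import defaultdict

    def generate_namespace(num_dirs=1000, c=2, seed=0):
        rng = random.Random(seed)
        children = defaultdict(list)    # directory id -> list of subdirectory ids
        dirs = [0]                      # start with just the root directory
        for new_dir in range(1, num_dirs):
            weights = [len(children[d]) + c for d in dirs]
            parent = rng.choices(dirs, weights=weights, k=1)[0]
            children[parent].append(new_dir)
            dirs.append(new_dir)
        return children

    if __name__ == "__main__":
        tree = generate_namespace()
        largest = sorted((len(subdirs) for subdirs in tree.values()), reverse=True)
        print("largest subdirectory counts:", largest[:10])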

Once we have a good understanding of the properties of metadata, we need a mechanism to put this information to practical use. Most system designers and evaluators rely on ad hoc assumptions and (often inaccurate) rules of thumb. Furthermore, the lack of standardization and reproducibility makes file system benchmarking ineffective. To remedy these problems, I develop Impressions, a framework to generate statistically accurate file-system images with realistic metadata and content. Impressions is flexible in supporting user-specified constraints on various file-system parameters, and it uses a number of statistical techniques to generate consistent images. Using desktop search as a case study, I demonstrate that incorporating the effects of metadata and file content is crucial to understanding the performance and storage characteristics of file systems and applications [FAST ’09].
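
To illustrate the basic idea, the sketch below creates a small file-system image whose file sizes are drawn from a user-specified distribution and spread over a directory tree. The lognormal parameters, the flat directory layout, and the 1 MiB write cap are illustrative assumptions for this sketch; Impressions itself supports far richer constraints and resolves them statistically.

    # Illustrative sketch: populate a directory tree with files whose sizes are
    # drawn from a (hypothetical) lognormal distribution.
    import os
    import random

    def create_image(root, num_files=200, num_dirs=20, mu=8.5, sigma=2.0, seed=0):
        rng = random.Random(seed)
        dirs = []
        for d in range(num_dirs):
            path = os.path.join(root, "dir%03d" % d)
            os.makedirs(path, exist_ok=True)
            dirs.append(path)
        for i in range(num_files):
            size = int(rng.lognormvariate(mu, sigma))        # file size in bytes
            target = os.path.join(rng.choice(dirs), "file%05d.dat" % i)
            with open(target, "wb") as f:
                f.write(os.urandom(min(size, 1 << 20)))      # cap writes at 1 MiB
        return num_dirs, num_files

    if __name__ == "__main__":
        print(create_image("/tmp/impressions_demo"))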

In the second part of my thesis, I investigate techniques for creating realistic benchmark workloads. Synthetic file-system benchmarks are widely used, but they are largely based on the benchmark writer’s interpretation of the real workload. This approximation is insufficient, since even a simple operation through the API may end up exercising the file system in very different ways due to the effects of features such as caching and prefetching. I have taken first steps toward creating “realistic synthetic” benchmarks by building a tool, CodeMRI, that leverages file-system domain knowledge and a small amount of system profiling in order to better understand how a benchmark is stressing the system and to deconstruct its workload [HotMetrics ’08, PER ’08].
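
As a toy illustration of the underlying idea, the sketch below compares which internal file-system functions two workloads exercise, and how often, using cosine similarity between their profiles. The function names and counts are made up; CodeMRI's actual techniques are more involved.

    # Toy illustration: compare the internal-function profiles of a real workload
    # and a synthetic benchmark. Function names and counts are made up.
    import math

    def cosine_similarity(profile_a, profile_b):
        keys = set(profile_a) | set(profile_b)
        dot = sum(profile_a.get(k, 0) * profile_b.get(k, 0) for k in keys)
        norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
        norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    real_workload = {"do_generic_file_read": 9200, "readpage": 4100, "writepage": 350}
    synthetic     = {"do_generic_file_read": 300,  "readpage": 120,  "writepage": 9000}

    print("profile similarity: %.2f" % cosine_similarity(real_workload, synthetic))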

The last part of my thesis addresses the problem of scalable benchmarking. Storage capacities have seen a tremendous increase in the past few years; terabyte-sized disks are now easily available for desktop computers. However, in order to benchmark file systems and applications that operate on such large disk partitions, the required setup is often cumbersome and the benchmarks take an inconveniently long time to finish. I am currently working on a system that makes it practical to run benchmarks on large file systems [SUBMIT '09].

Solid-State Storage Devices: Flash-based SSDs have the potential to change the storage landscape. Recently, I worked on design tradeoffs that are relevant to NAND-flash solid-state storage. We analyzed several of these tradeoffs using a trace-based disk simulator that we built to characterize different SSD organizations. More specifically, we worked on designing high-performance solid-state drives for I/O-intensive workloads. We proposed algorithms for cleaning and wear-leveling flash media to make it viable for use in environments with high I/O rates, along with improvements to the performance of random writes, a substantial drawback of flash-based disks in their current form. Our analysis was driven by traces captured from running systems, including a full-scale TPC-C benchmark, an Exchange server workload, and various standard file-system benchmarks. From our analysis, we found that SSD performance and lifetime are highly workload-sensitive, and that complex systems problems that normally appear higher in the storage stack, or even in distributed systems, are relevant to device firmware. We also presented the design of high-performance flash disks and disk-array configurations based on these disks [USENIX ’08].
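
To make the cleaning and wear issues concrete, here is a minimal sketch of a page-mapped flash translation layer with greedy cleaning under a random-write workload. The block geometry, the greedy victim choice, and the cleaning trigger are illustrative assumptions for this sketch, not the policies proposed in the paper.

    # Illustrative sketch (not the paper's design): a page-mapped FTL with greedy
    # cleaning. When the free-block pool runs low, the block with the fewest valid
    # pages is cleaned: still-valid pages are migrated, then the block is erased.
    # Per-block erase counts give a crude picture of wear.
    import random

    PAGES_PER_BLOCK, NUM_BLOCKS, LOGICAL_PAGES = 64, 32, 512

    class ToyFTL:
        def __init__(self):
            self.mapping = {}                                  # lpn -> (block, page)
            self.valid = [set() for _ in range(NUM_BLOCKS)]    # valid pages per block
            self.erases = [0] * NUM_BLOCKS
            self.free_blocks = list(range(1, NUM_BLOCKS))
            self.active, self.next_page = 0, 0

        def write(self, lpn, migrating=False):
            if self.next_page == PAGES_PER_BLOCK:              # active block is full
                self.active, self.next_page = self.free_blocks.pop(), 0
            old = self.mapping.get(lpn)
            if old:
                self.valid[old[0]].discard(old[1])             # invalidate old copy
            self.mapping[lpn] = (self.active, self.next_page)
            self.valid[self.active].add(self.next_page)
            self.next_page += 1
            if not migrating and len(self.free_blocks) < 2:    # low on free blocks
                self._clean()

        def _clean(self):
            candidates = [b for b in range(NUM_BLOCKS)
                          if b != self.active and b not in self.free_blocks]
            victim = min(candidates, key=lambda b: len(self.valid[b]))
            for page in sorted(self.valid[victim]):            # migrate valid pages
                lpn = next(l for l, loc in self.mapping.items() if loc == (victim, page))
                self.write(lpn, migrating=True)
            self.erases[victim] += 1                           # erase and reuse block
            self.free_blocks.append(victim)

    ftl, rng = ToyFTL(), random.Random(0)
    for _ in range(20000):
        ftl.write(rng.randrange(LOGICAL_PAGES))                # random-write workload
    print("erase counts per block:", ftl.erases)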

Reliability in the Storage Stack: Hardware constituents of the storage stack (storage devices, interconnects, etc.) fail, and software (device firmware, drivers, file systems) exhibits bugs and other inconsistencies, leading to data loss and corruption. I have worked on techniques to help understand the causes of failures in the storage stack:
• Analyzed how commodity file systems handle disk failures. I built a type-aware fault-injection framework for the Reiser file system and analyzed how it handles partial disk failures; a minimal sketch of the type-aware idea appears after this list [SOSP ’05].
• File systems demonstrate inconsistent and inadequate handling of latent sector errors and other partial disk failures. In order to identify the root causes of the observed inconsistencies, I developed Differential Failure Analysis, a combination of static and run-time analysis to achieve a thorough understanding of the failure handling characteristics of file-system source code [USENIX ’06].
• Developed and applied type-aware corruption to understand the effects of disk-pointer corruption on file-system reliability, using Windows NTFS and Linux ext3 as case studies; my contribution was the analysis of ext3 [DSN ’08].
• Evaluated failure handling of SCSI drivers by injecting faults at the lowest level of the tiered SCSI architecture and observing the detection and recovery mechanisms employed by the upper driver levels.
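
As referenced above, here is a minimal sketch of the type-aware fault-injection idea: a shim between the file system and the disk classifies each block by type and fails requests that match the experiment's policy. The block types, the classifier, and the toy on-disk layout are illustrative assumptions; the actual framework derives block types from knowledge of the file system's on-disk layout.

    # Illustrative sketch of type-aware fault injection. The classifier and the
    # toy on-disk layout below are made up for illustration.
    class TypeAwareInjector:
        def __init__(self, disk, classify, policy):
            self.disk = disk          # object providing read(block) / write(block, data)
            self.classify = classify  # block number -> block type, e.g. "journal"
            self.policy = policy      # set of (operation, block_type) pairs to fail

        def read(self, block):
            if ("read", self.classify(block)) in self.policy:
                raise IOError("injected read fault on %s block %d" % (self.classify(block), block))
            return self.disk.read(block)

        def write(self, block, data):
            if ("write", self.classify(block)) in self.policy:
                raise IOError("injected write fault on %s block %d" % (self.classify(block), block))
            return self.disk.write(block, data)

    class RamDisk:
        def __init__(self, blocks=128):
            self.blocks = [b"\0" * 4096 for _ in range(blocks)]
        def read(self, block):
            return self.blocks[block]
        def write(self, block, data):
            self.blocks[block] = data

    # Toy layout: blocks 0-7 hold the journal; fail all reads of journal blocks.
    classify = lambda b: "journal" if b < 8 else "data"
    injector = TypeAwareInjector(RamDisk(), classify, {("read", "journal")})
    injector.write(3, b"x" * 4096)     # writes still succeed
    try:
        injector.read(3)               # reads of journal blocks are failed
    except IOError as fault:
        print(fault)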

Commodity Storage Clusters: High-end storage systems are increasingly being built from commodity components. We designed techniques for characterizing complex storage clusters in the context of the EMC Centera storage system. By correlating disk and network traffic with the running workload, using both passive observation and controlled delays, we inferred the structure of the software system as well as its policies (e.g., how it performs caching, replication, and load balancing) without any access to the source code [ISCA ’05].
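
The delay technique can be illustrated with a toy model: if slowing down one node inflates client-visible write latency, that node is probably on the synchronous write path; if not, propagation to it is likely asynchronous. The latencies and policies below are made up for illustration and say nothing about Centera's actual behavior.

    # Toy illustration of delay-based inference; numbers and policies are made up.
    def write_latency(node_delays, replicas, synchronous):
        base = 1.0                                      # fixed overhead in ms
        if synchronous:
            return base + max(node_delays[n] for n in replicas)
        return base + node_delays[replicas[0]]          # only the primary is on the path

    def probe(synchronous):
        replicas = ["A", "B", "C"]
        normal = {"A": 2.0, "B": 2.0, "C": 2.0}
        slowed = dict(normal, C=50.0)                   # inject a delay at node C
        before = write_latency(normal, replicas, synchronous)
        after = write_latency(slowed, replicas, synchronous)
        print("synchronous=%s: latency %.1f ms -> %.1f ms" % (synchronous, before, after))

    probe(synchronous=True)    # delay at C is visible: C is on the critical path
    probe(synchronous=False)   # delay at C is hidden: replication happens off the path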

Regulatory Compliant Storage: Federal regulations such as Sarbanes-Oxley and HIPAA mandate stricter enforcement of data retention, access, and tampering guidelines. I worked on an auditing framework to enforce regulatory compliance on archival storage. The focus was to provide continuous verification of system state and to support feature-rich querying. As part of a larger project on compliant storage at IBM Almaden, I designed and built a prototype of the auditing engine.

I/O Request Scheduling: I worked on a request-scheduling policy for storage systems called Interference-Aware Smallest-Cost-First (iSCF). Whereas existing scheduling policies are oblivious to the location of blocks and do not distinguish between blocks on disk and blocks in the cache, iSCF takes into account both where the blocks servicing a request reside and their position within the cache [UW-CS '04].
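
To illustrate the general idea of a cache-aware, smallest-cost-first scheduler, the sketch below estimates a service cost for each pending request, charging (nearly) nothing for cached blocks and a seek-distance-based cost for uncached ones, and dispatches the cheapest. The cost model is a made-up stand-in for the actual iSCF policy.

    # Illustrative sketch of a cache-aware smallest-cost-first dispatcher.
    # The cost constants below are made-up stand-ins, not the iSCF cost model.
    def next_request(pending, cache, head_pos, seek_cost_per_block=0.01, hit_cost=0.05):
        def cost(req):
            block = req["block"]
            if block in cache:
                return hit_cost                                   # served from the cache
            return 1.0 + seek_cost_per_block * abs(block - head_pos)
        return min(pending, key=cost)

    pending = [{"id": 1, "block": 900}, {"id": 2, "block": 120}, {"id": 3, "block": 480}]
    cache = {900}                                                 # block 900 is cached
    print(next_request(pending, cache, head_pos=500))             # picks the cached request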

Some of the above work was done as part of research during my internships.

Publications

Generating Realistic Impressions for File-System Benchmarking
Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau.
Proceedings of the 7th Conference on File and Storage Technologies (FAST '09), Feb 2009, San Francisco, CA.
Best Paper Award
A Five-Year Study of File-System Metadata
Nitin Agrawal, William J. Bolosky, John R. Douceur, Jacob R. Lorch.
ACM Transactions on Storage (TOS), Volume 3, Issue 3 (Oct 2007)
Design Tradeoffs for SSD Performance
Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark Manasse, Rina Panigrahy.
USENIX Annual Technical Conference (USENIX '08), June 2008, Boston, MA.
Towards Realistic File-System Benchmarks with CodeMRI
Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau.
Appears in ACM HotMetrics '08, June 2008, Annapolis, MD, and in SIGMETRICS Performance Evaluation Review (PER), Volume 36, Issue 2 (Sep 2008).
Analyzing the Effects of Disk Pointer Corruption
Lakshmi Bairavasundaram, Meenali Rungta, Nitin Agrawal, Andrea C. Arpaci-Dusseau,
Remzi H. Arpaci-Dusseau, Michael M. Swift.
38th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN '08), June 2008, Anchorage, AK.
A Five-Year Study of File-System Metadata
Nitin Agrawal, William J. Bolosky, John R. Douceur, Jacob R. Lorch.
Proceedings of the 5th Conference on File and Storage Technologies
(FAST '07), Feb 2007, San Jose, CA.
Selected as a top paper and forwarded to ACM TOS
IRON File Systems
Vijayan Prabhakaran, Lakshmi N. Bairavasundaram, Nitin Agrawal, Haryadi S. Gunawi,
Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau.
Proceedings of the 20th ACM Symposium on Operating Systems Principles
(SOSP'05), October 2005, Brighton, UK
Deconstructing Commodity Storage Clusters
Haryadi S. Gunawi, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Jiri Schindler. 
Proceedings of the 32nd International Symposium on Computer Architecture
(ISCA'05), June 2005, Madison, WI

Other Publications
Speedy and Scalable File-System Benchmarking with Compressions
Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau.
7th Conference on File and Storage Technologies (FAST '09), Feb 2009, San Francisco, CA (WIP Session)
Separating Policy and Mechanism for Failure Handling in Commodity File Systems
Nitin Agrawal, Haryadi S. Gunawi, Vijayan Prabhakaran, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau.
USENIX Annual Technical Conference (USENIX '06), June 2006, Boston, MA (Poster Session)
Interference Aware Scheduling in Storage Systems
Nitin Agrawal.
Master's Thesis, Department of Computer Sciences, University of Wisconsin-Madison, May 2004
Symbolic Rule-Extraction from Artificial Neural Networks
Nitin Agrawal.
Bachelor's Thesis, Department of Computer Sciences, Institute of Technology, BHU, India, May 2003

Internships

Microsoft Research, Mountain View (Silicon Valley Lab), Summer 2007
I worked on designing high-performance flash-based solid-state storage devices with Ted Wobber, Andrew Birrell, and Chuck Thacker. This work appears in USENIX '08.
 
Microsoft Research, Redmond (Systems and Networking Group), Summer 2005
Performed a large-scale analysis of file-system metadata collected over a period of five years, with Bill Bolosky, John Douceur, and Jay Lorch. The metadata snapshots were used to study temporal changes in file size, file age, file-type frequency, and namespace structure, and to draw lessons for designers of file systems. Appears in FAST '07 and ACM TOS '07.
 
IBM Almaden Research Center, San Jose (Storage Systems Group), Summer 2004
The work involved designing a unified auditing framework for regulatory-compliant archival storage. The focus was to provide continuous verification of system state and to support feature-rich querying. I designed and built a prototype of the auditing engine.
 
University of Dortmund, Germany (Artificial Intelligence Group), Summer 2002
Worked on MiningMart, a long-term European research project aimed at providing end-user data-warehouse mining. The work involved designing a chain of preprocessing operators in the MiningMart Meta Model, with an SVM as the final learning step.
 
 
Tata Consultancy Services (TCS) Ltd, India, Summer 2001
The work involved developing an interactive system to automate project management reviews. I worked as part of a team to develop the prototype and also designed and built some components independently.