Computer Sciences Dept.

Raghu Ramakrishnan

Professor of Computer Sciences

Research Interests

Database systems. In particular, Big Data, cloud computing, data marketplaces, semantics-driven information services, and web-scale data management

I AM NOW AT Microsoft, heading the Information Services Lab (ISL)

The best way to reach me is to send email to first-name AT microsoft DOT com

Picture of Raghu Ramakrishnan

Current Research Projects

(If you're looking for past projects, scroll past the current stuff!)

My two currently funded research projects are in data mining:

Goal-Oriented Privacy Preservation This is a newly funded NSF Cybertrust project, carried out with David DeWitt and Jude Shavlik in CS, Larry Hanrahan (Chief Epidemiologist, State of Wisconsin), and Amy Trentham-Dietz (Population Health Sciences). Sophisticated data analysis and mining techniques are necessary tools for understanding complex datasets in an increasing number of application domains, including bioinformatics, health care, e-commerce, fraud detection, network attacks, and homeland security. The increasing use of these techniques has also created a heightened awareness of their potential for compromising privacy, and the challenge of controlling data access in accordance with privacy policies while enabling useful analysis has emerged as a central and ubiquitous problem. To meet this challenge, we need to go well beyond traditional access-control strategies because analysis and mining techniques can potentially be applied to published data to infer data that is intended to be private. We are investigating the trade-off between privacy and security guarantees and the utility of the published data for specific analysis tasks; we call this goal-oriented privacy preservation.

EDAM: Exploratory Data Analysis and Monitoring This project is a collaborative effort between computer scientists and environmental chemists at Carleton College and UW-Madison, supported by an NSF Medium ITR grant. The goal is to develop data mining techniques for advancing the state of the art in analyzing atmospheric aerosol datasets. There is a great need to better understand the sources, dynamics, and compositions of atmospheric aerosols. The traditional approach for particle measurement, which is the collection of bulk samples of particulates on filters, is not adequate for studying particle dynamics and real-time correlations. This has led to the development of a new generation of real-time instruments that provide continuous or semi-continuous streams of data about certain aerosol properties. However, these instruments have added a significant level of complexity to atmospheric aerosol data, and dramatically increased the amounts of data to be collected, managed, and analyzed. In the EDAM project, we are investigating techniques for automatically labeling mass spectra from different kinds of aerosol mass spectrometers, and then analyzing and exploring the rich spatiotemporal information collected from multiple geographically distributed instruments.

Beyond the specific problems that arise in the context of atmospheric aerosols, we are interested in addressing some fundamental challenges in data mining and monitoring:

  • Support for multi-step analyses: Most data analysis tasks involve several steps, and indeed, several analysis steps. Yet, most of the literature on the topic addresses algorithms for some one step, such as constructing a decision tree or identifying clusters. We are investigating ways to compositionally specify, optimize, and provide life-cycle support for multi-step analyses. In terms of optimization, our focus is on developing cost-based strategies, analogous to database query optimization, for a broad class of compositionally specified multi-step analyses. The underlying objective is to enable to specify a large number of potential analyses (differing in carefully selected parameterized choices), and to develop a system that can automatically, perhaps over days in a Condor-style environment, evaluate these different analyses and bring "interesting" instances to the attention of the analyst. Thus, we hope to minimize the human burden in the iterative process of data exploration. We are also interested in techniques for maintaining detailed provenance information for all data, including instrument readings and other source data as well as results of intermediate analysis steps, and for reasoning with this information for data validation.
  • Exploratory Search for Interesting Subsets of Data: We are exploring a stylized generate-and-test paradigm for identifying interesting subsets of data, which we call subset mining. The key idea is that dimensional hierarchies, originally exploited in multidimensional exploratory interfaces for OLAP, offer a natural class of datasubsets as meaningful, interpretable candidates for exploration. Each of these subsets can be associated with quantitative metrics that reflect a broad range of analysis and summarization techniques (e.g., a subset of a loan application dataset can be summarized by a number that indicates the degree of racial bias reflected in this data, viewed as a training set for a classifier), and the resulting data cube offers a powerful data exploration interface.

Another area in which I have worked in recent years is currency-aware replication:

CICADA: Consistency in Currency-Aware Data Access This is a project that is being conducted jointly with Phil Bernstein and Paul Larson at Microsoft Research. Increasingly, applications are using cached versions of data, even data that is stored in a DBMS. A familiar example is eBay, where several webpages show slightly out-of-date prices for items that are being bid on. Applications use their knowledge of when it is acceptable to use data that is not current in order to improve performance. However, by making copies outside the DBMS, they are forced to take on the responsibility of maintaining these caches, leading to more complex and error-prone application code. Our vision is to extend DBMS support for cached copies, allowing applications to be made simpler without sacrificing the efficiency gains of using copies when appropriate. The main idea is that applications should be able to specify whether a copy of a piece of data (e.g., a table, a row) can be used and how outdated that copy is allowed to be, as part of the SQL statement. The DBMS can exploit this information and use copies to improve performance as long as it guarantees that the application's currency requirements are met. The challenges to be addressed include query optimization and evaluation in the presence of cached copies, workload-aware cache maintenance, and concurrency control and serializability theory for transactions that can see stale data.

A research topic that I have been increasingly active in is the intersection of text and relational data management.

Managing Text My interest in this area began with my work at QUIQ, where we developed a novel system for using online communities for customer support. A key challenge was organizing question and answer threads and efficiently supporting queries involving a combination of structured and keyword criteria against a frequently updated corpus. See:

Raghu Ramakrishnan, Andrew Baptist, Vuk Ercegovac, Matt Hanselman, Navin Kabra, Amit Marathe, Uri Shaft:
Mass Collaboration: A Case Study
8th International Database Engineering and Applications Symposium (IDEAS 2004), 2004, Coimbra, Portugal

Navin Kabra, Raghu Ramakrishnan, Vuk Ercegovac:
The QUIQ Engine: A Hybrid IRDB System
University of Wisconsin-Madison Technical Report TR-1449, 2002

Raghu Ramakrishnan:
Mass Collaboration and Data Mining
Keynote talk at KDD 2001

Recent work has included a benchmark for text in relational database systems:

Vuk Ercegovac, David J. DeWitt, Raghu Ramakrishnan:
The TEXTURE Benchmark: Measuring Performance of Text Queries on a Relational DBMS
VLDB Conference, 2005

Not to mention joint work with the Avatar group at IBM Almaden on doing OLAP queries over imprecise data (such as that obtained by data mining over text):

Douglas Burdick, Prasad Deshpande, T. S. Jayram, Raghu Ramakrishnan, and Shivakumar Vaithyanathan:
OLAP Over Uncertain and Imprecise Data
VLDB 2005

AnHai Doan from UIUC and I have joined forces on a new project that addresses extraction management with Community Information Management (CIM) as a driving application; stay tuned for the initial results ... (If you use dbworld, you'll soon see some exciting spin-offs from this project.)

Past Research Projects

I'm interested in data modeling, database design, data analysis, and management. I've worked extensively on deductive databases; data mining; OLAP; database query optimization; retrieval from repositories of text, images, and sequence data; visual data exploration; and database integration. Pointers to some earlier projects are listed below.

  • SEQ Sequences: Querying ordered relations

    In the SEQ project, we were the first to propose the use of sorted relations to support a broad range of sequence queries, and showed how SQL, the standard relational query language which views relations as unordered collections of rows, can be extended to express such queries. The "window function" extension in the ANSI/ISO SQL standard follows this proposal closely. The query optimization ideas we developed, in particular, "one-pass, stream-access evaluation", anticipated window-based optimizations in the extensive literature on processing data-streams by nearly a decade. (Although we regrettably stopped short of investigating streaming data and continuous query processing!) Another interesting result was a probabilistic optimization framework for top-k queries.

    Some representative papers are listed below; for more, go here.

    Praveen Seshadri, Miron Livny, Raghu Ramakrishnan:
    Sequence Query Processing
    ACM SIGMOD Conference, 1994.

    Praveen Seshadri, Miron Livny, Raghu Ramakrishnan:
    The Design and Implementation of a Sequence Database System
    VLDB Conference, 1996.

    Raghu Ramakrishnan, Donko Donjerkovic, Arvind Ranganathan, Kevin S. Beyer, Muralidhar Krishnaprasad:
    SRQL: Sorted Relational Query Language
    International Conference on Scientific and Statistical Database Systems (SSDBM), 1998.

    Donko Donjerkovic, Raghu Ramakrishnan:
    Probabilistic Optimization of Top N Queries
    VLDB Conference, 1999.

  • DEMON Mining evolving and streaming data.

    We did some of the earliest work on mining data streams in the DEMON project, which was a follow-on to the SEQ project and developed algorithms for finding clusters in multidimensional data and the algorithms for constructing decision trees. While they can be optimized further with multiple passes, these algorithms are capable of operating on continuous data streams, and are part of a robust framework for incrementally mining evolving data and measuring the changes (proposed in an ICDE 2000 paper that won the Best Student Paper award). These issues have been actively researched since then, and continue to be hot research topics. UW-Madison's WARF division has patented two of our algorithms, including BIRCH (scalable clustering; one of the 3 most cited database publications in the past decade according to a recent study by Rahm and Thor) and BOAT (scalable decision tree construction).

    This survey paper is a good starting point:

    Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan:
    Mining Data Streams under Block Evolution
    SIGKDD Explorations, January 2002.

    Some representative papers are listed below; for more, go here.

    Tian Zhang, Raghu Ramakrishnan, and Miron Livny:
    BIRCH: An Efficient Data Clustering Method for Very Large Databases
    ACM SIGMOD Conference, 1996.

    Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan:
    DEMON--Data Evolution and Monitoring
    IEEE Transactions on Knowledge Discovery and Data Mining, 2001. (Full version of paper in ICDE 2000 that won the Best student paper award.)

    Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan, and Wei-Yin Loh:
    A Framework for Measuring Changes in Data Characteristics
    Journal of Computer and System Sciences, 2002. (Full version of paper in PODS 1999.)

    Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan, and Wei-Yin Loh:
    BOAT -- Optimistic Decision Tree Construction.
    ACM SIGMOD Conference, 1999.

    Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan:
    CACTUS--Clustering Categorical Data Using Summaries
    ACM SIGKDD Conference, 1999.

  • CORAL Deductive databases.

    The CORAL project on deductive databases contributed a number of central algorithms and theoretical results, including the widely used magic sets query transformation and several refinements of incremental fixpoint computation. While deductive databases have not been commercially successful (although IBM's DB2 database system supports recursive queries and the SQL standard includes support for recursion), the results have had significant impact in other contexts. The magic sets idea greatly improved the performance of nested SQL queries, which are important in decision-support applications (and a significant component of the TPC-D benchmark), and is now used in commercial database systems for that reason. The ideas behind incremental fixpoint evaluation have found extensive application in view materialization, now supported by all the major database vendors. And the ideas behind proof-tree based explanation systems are now finding applicability for maintaining provenance of data-we built the first proof-tree based explanation system, EXPLAIN, as part of the widely distributed CORAL system, and even today, it is arguably the most complete proof-based explanation system for queries, supporting aggregation, negation, and recursion.

    This survey paper is a good starting point:

    Raghu Ramakrishnan, Divesh Srivastava and S. Sudarshan:
    Efficient Bottom-Up Evaluation of Logic Programs
    The State of the Art in Computer Systems and Software Engineering, 1992.

    Some representative papers are listed below; for more, go here.

    Raghu Ramakrishnan, Divesh Srivastava, S. Sudarshan and Praveen Seshadri:
    The CORAL Deductive System
    The VLDB Journal, 1994.

    Catriel Beeri and Raghu Ramakrishnan:
    On the Power of Magic
    Journal of Logic Programming, 1991.

    Raghu Ramakrishnan:
    Magic Templates: A Spellbinding Approach to Logic Programs
    Journal of Logic Programming, 1991.

    Tarun Arora, Raghu Ramakrishnan, William G. Roth, Praveen Seshadri and Divesh Srivastava:
    Explaining Program Execution in Deductive Systems
    International Conference on Deductive and Object-Oriented Databases, 1993.

    S. Sudarshan and Raghu Ramakrishnan:
    Aggregation and Relevance in Deductive Databases
    VLDB Conference, 1991.

  • PIQ Query by image content.

    The most notable achievements of this project were the results on high-dimensional indexing, including the following widely cited paper:

    Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft
    When is "Nearest Neighbor" Meaningful?
    International Conference on Database Theory (ICDT), 1999.

  • DEVise Interactive visualization of large datasets.

    The DEVise tool is still available! Originally began as an investigation of scalable interactive visualization, but we never really got into the scalability aspects, unfortunately. The basic framework remains promising, and the resulting tool is very flexible, and fun to play with ... and this project also supported the work on the paper that proposed iceberg cubes.

    Some representative papers are listed below; for more, go here.

    Miron Livny, Raghu Ramakrishnan, Kevin S. Beyer, Guangshun Chen, Donko Donjerkovic, Shilpa Lawande, Jussi Myllymaki, R. Kent Wenger:
    DEVise: Integrated Querying and Visual Exploration of Large Datasets
    ACM SIGMOD Conference, 1997.

    Kevin S. Beyer and Raghu Ramakrishnan:
    Bottom-Up Computation of Sparse and Iceberg CUBEs
    ACM SIGMOD Conference, 1999.

  • COD Semantic database integration.

    We investigated the problem of relational schema matching using information theoretic techniques, and showed several problems to be undecidable, leading to a best paper award at EDBT 1994.

  • GESTALTS Internet-scale database federation.

    The project that never quite was. QUIQ happened instead.

Computer Sciences | UW Home