Raghu Ramakrishnan
Professor of Computer Sciences
  
Database systems; in particular, Big Data, cloud computing, data marketplaces, semantics-driven 
information services, and web-scale data management.

I AM NOW AT Microsoft, heading the Information Services Lab (ISL).

The best way to reach me is to send email to first-name AT microsoft DOT com.
 
 
Current Research Projects 
(If you're looking for past projects, scroll past the current stuff!)
My two currently funded research projects are in data mining:
 
Goal-Oriented Privacy Preservation  This is a newly funded NSF Cybertrust project, 
carried out with David DeWitt and Jude Shavlik in CS, Larry Hanrahan (Chief Epidemiologist, State of Wisconsin),
and Amy Trentham-Dietz (Population Health Sciences). Sophisticated data analysis and mining techniques 
are necessary tools for understanding complex datasets in an increasing number of application domains, 
including bioinformatics, health care, e-commerce, fraud detection, network attacks, and homeland security. 
The increasing use of these techniques has also created a heightened awareness of their 
potential for compromising privacy, and the challenge of controlling data access in accordance 
with privacy policies while enabling useful analysis has emerged as a central and ubiquitous problem. 
To meet this challenge, we need to go well beyond traditional access-control strategies because 
analysis and mining techniques can potentially be applied to published data to infer data that is 
intended to be private.  We are investigating the trade-off between privacy and 
security guarantees and the utility of the published data for specific analysis tasks; 
we call this goal-oriented privacy preservation. 
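As a toy illustration of this trade-off (and not this project's actual method), the sketch below coarsens a quasi-identifier until every record is indistinguishable from at least k-1 others, the standard k-anonymity condition; the records and attribute choices are invented. Wider buckets mean stronger privacy but less utility for analysis.

```python
from collections import Counter

def generalize_age(age, width):
    """Coarsen an age into a bucket of the given width, e.g. 34 -> '30-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def smallest_group(records, width):
    """Size of the smallest quasi-identifier group after generalization."""
    groups = Counter((generalize_age(age, width), zipcode)
                     for age, zipcode in records)
    return min(groups.values())

# Hypothetical records keyed by (age, zip code) quasi-identifiers.
records = [(23, "53706"), (27, "53706"), (31, "53706"), (36, "53706")]

# Widen the age buckets until every group has at least k = 2 members:
# more privacy (harder to single anyone out), less analytic utility.
k, width = 2, 5
while smallest_group(records, width) < k:
    width *= 2
```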
 
EDAM: Exploratory Data Analysis and Monitoring This project is a collaborative effort
between computer scientists and environmental chemists at Carleton College and UW-Madison,
supported by an NSF Medium ITR grant.  The goal is
to develop data mining techniques for advancing the state of the art
in analyzing atmospheric aerosol datasets. There is a great need to better 
understand the sources, dynamics, and compositions of atmospheric aerosols. The
traditional approach for particle measurement, which is the collection of 
bulk samples of particulates on filters, is not adequate for studying particle dynamics 
and real-time correlations. This has led to the development of a new generation 
of real-time instruments that provide continuous or semi-continuous streams of data about
certain aerosol properties. However, these instruments have added a significant 
level of complexity to atmospheric aerosol data, and dramatically 
increased the amounts of data to be collected, managed, and analyzed.  
In the EDAM project, we are investigating techniques for automatically labeling mass
spectra from different kinds of aerosol mass spectrometers, and then analyzing and
exploring the rich spatiotemporal information collected from multiple geographically
distributed instruments.
 
Beyond the specific problems that arise in the context of atmospheric aerosols, we are
interested in addressing some fundamental challenges in data mining and monitoring: 
 
-  Support for multi-step analyses:  Most data analysis tasks involve several steps.  Yet most of the literature on the topic addresses algorithms
for a single step, such as constructing a decision tree or identifying clusters.  We are 
investigating ways to compositionally 
specify, optimize, and provide life-cycle support for multi-step
analyses.  
In terms of optimization, our focus is on developing cost-based strategies, analogous
to database query optimization, for a broad class of compositionally specified multi-step
analyses.  The underlying objective is to enable analysts to specify a large number of potential
analyses (differing in carefully selected parameterized choices), and to develop a system
that can automatically, perhaps over days in a Condor-style environment, evaluate these
different analyses and bring "interesting" instances to the attention of the analyst.  Thus, we
hope to minimize the human burden in the iterative process of data exploration.
We are also interested in techniques for maintaining detailed provenance information for
all data, including instrument readings and other source data as well as 
results of intermediate analysis steps, and for reasoning with this information for
data validation.
 -  Exploratory Search for Interesting Subsets of Data:  
We are exploring a stylized generate-and-test paradigm for identifying
interesting subsets of data, which we call subset mining.
The key idea is that dimensional hierarchies, originally exploited in multidimensional
exploratory interfaces for OLAP, offer a natural class of data subsets as meaningful,
interpretable candidates for exploration.  Each of these subsets can be associated with quantitative metrics
that reflect a broad range of analysis and summarization techniques (e.g., a subset of a loan
application dataset can be summarized by a number that indicates the degree of racial bias reflected
in this data, viewed as a training set for a classifier), and the resulting data cube offers
a powerful data exploration interface.
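The generate-and-test loop above can be sketched in a few lines; the records, the two groups, and the disparity metric below are all invented stand-ins for the loan-bias example:

```python
from collections import defaultdict

# Toy loan-application records: (state, group, approved).
rows = [
    ("WI", "A", 1), ("WI", "A", 1), ("WI", "B", 0), ("WI", "B", 0),
    ("MN", "A", 1), ("MN", "A", 0), ("MN", "B", 1), ("MN", "B", 0),
]

def disparity(subset):
    """|approval rate of group A - approval rate of group B| on this subset,
    a crude quantitative 'bias' metric attached to each candidate subset."""
    rates = {}
    for g in ("A", "B"):
        vals = [ok for (_, grp, ok) in subset if grp == g]
        rates[g] = sum(vals) / len(vals) if vals else 0.0
    return abs(rates["A"] - rates["B"])

# Generate candidate subsets from the dimension hierarchy (ALL -> state),
# then test each one by scoring it with the metric.
subsets = {"ALL": rows}
by_state = defaultdict(list)
for r in rows:
    by_state[r[0]].append(r)
subsets.update(by_state)

scored = sorted(subsets, key=lambda s: disparity(subsets[s]), reverse=True)
```

Ranking the scored cells surfaces the "interesting" subsets (here, WI) for the analyst, which is the exploration interface the cube provides.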
  
 
Another area in which I have worked in recent years is currency-aware replication:
 
CICADA: Consistency in Currency-Aware Data Access This is a project that is being conducted
jointly with Phil Bernstein and Paul Larson at Microsoft Research.
Increasingly, applications are using cached versions of data, even data that is stored in a
DBMS. A familiar example is eBay, where several webpages show slightly out-of-date prices 
for items that are being bid on.  Applications use their knowledge of when it is acceptable
to use data that is not current in order to improve performance.  However, by making copies 
outside the DBMS, they are forced to take on the responsibility of maintaining these caches,
leading to more complex and error-prone application code.  Our vision is to extend DBMS support
for cached copies, allowing applications to be made simpler without sacrificing the
efficiency gains of using copies when appropriate.  The main idea is that applications should
be able to specify whether a copy of a piece of data (e.g., a table, a row) can be used and
how outdated that copy is allowed to be, as part of the SQL statement.  The DBMS can exploit
this information and use copies to improve performance as long as it 
guarantees that the application's currency requirements are met.
The challenges to be addressed include query optimization and evaluation in the
presence of cached copies, workload-aware cache maintenance, and concurrency control
and serializability theory for transactions that can see stale data.
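On the application side, the contract might look like the following sketch. The class and the currency-clause syntax in the docstring are hypothetical illustrations; the hard parts named above (optimizer support, workload-aware cache maintenance, serializability with stale reads) live inside the DBMS, not here.

```python
import time

class CurrencyAwareCache:
    """Serve a cached copy only if it is no staler than the caller's bound.

    A toy analogue of attaching a currency requirement to a query, e.g.
    (hypothetical syntax) SELECT ... WITH CURRENCY BOUND 10 SECONDS.
    """
    def __init__(self, fetch, clock=time.monotonic):
        self.fetch = fetch    # authoritative read against the DBMS
        self.clock = clock    # injectable for testing
        self.cache = {}       # key -> (value, time fetched)

    def get(self, key, max_staleness):
        hit = self.cache.get(key)
        if hit is not None and self.clock() - hit[1] <= max_staleness:
            return hit[0]     # the copy satisfies the currency bound
        value = self.fetch(key)          # bound violated: go to the DBMS
        self.cache[key] = (value, self.clock())
        return value
```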
 
 
A research topic that 
I have been increasingly active in is the intersection of text and relational data management.
 
Managing Text  My interest in this area began with my work at QUIQ, where we 
developed a novel system for using online communities for customer support.  
A key challenge was organizing question and answer threads and efficiently supporting 
queries involving a combination of structured and keyword criteria 
against a frequently updated corpus. See:
 
Raghu Ramakrishnan, Andrew Baptist, Vuk Ercegovac, Matt Hanselman, Navin Kabra, Amit Marathe, Uri Shaft: Mass Collaboration: A Case Study. 8th International Database Engineering and Applications Symposium (IDEAS 2004), Coimbra, Portugal, 2004.

Navin Kabra, Raghu Ramakrishnan, Vuk Ercegovac: The QUIQ Engine: A Hybrid IRDB System. University of Wisconsin-Madison Technical Report TR-1449, 2002.

Raghu Ramakrishnan: Mass Collaboration and Data Mining. Keynote talk at KDD 2001.
 
Recent work has included a benchmark for text in relational database systems:
 
Vuk Ercegovac, David J. DeWitt, Raghu Ramakrishnan: The TEXTURE Benchmark: Measuring Performance of Text Queries on a Relational DBMS. VLDB Conference, 2005.
 
We have also done joint work with the Avatar group at IBM Almaden on OLAP queries over imprecise data (such
as that obtained by data mining over text):

Douglas Burdick, Prasad Deshpande, T. S. Jayram, Raghu Ramakrishnan, and Shivakumar Vaithyanathan: OLAP Over Uncertain and Imprecise Data. VLDB 2005.
 
AnHai Doan from UIUC and I have joined forces on a new project that addresses extraction management
with Community Information Management (CIM) as a driving application; stay tuned for the 
initial results ... (If you use dbworld, you'll soon see some exciting spin-offs from this project.)
 
 
 
 
 
I'm interested in data modeling, database design, data analysis, 
and management. I've worked extensively on deductive databases; data mining;
OLAP; database query optimization; retrieval from repositories of 
text, images, and sequence data; visual data exploration; 
and database integration.  Pointers to some earlier projects are listed below.
 
 
-   SEQ Sequences: Querying ordered relations
In the SEQ project, we were the first to propose the use of sorted relations to support a 
broad range of sequence queries, and showed how SQL, the standard relational query 
language which views relations as unordered collections of rows, can be 
extended to express such queries.  The "window function" extension in the ANSI/ISO SQL standard 
follows this proposal closely.  The query optimization ideas we developed, 
in particular, "one-pass, stream-access evaluation", anticipated window-based optimizations in the 
extensive literature on processing data streams by nearly a decade. (Although we regrettably
stopped short of investigating streaming data and continuous query processing!)
Another interesting result was a probabilistic optimization framework for top-k queries.
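The window functions mentioned above are now standard SQL; a small example of an ordered query over a relation, runnable through Python's bundled sqlite3 module (assuming the underlying SQLite library is 3.25 or newer, which introduced window-function support):

```python
import sqlite3

# A 2-point moving average over a time-ordered relation, expressed with
# the SQL window-function extension ("OVER ... ROWS BETWEEN ...").
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE readings (t INTEGER, val REAL)")
con.executemany("INSERT INTO readings VALUES (?, ?)",
                [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)])

rows = con.execute("""
    SELECT t,
           AVG(val) OVER (ORDER BY t
                          ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
    FROM readings ORDER BY t
""").fetchall()
```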
 
Some representative papers are listed below; for more,  go here.
 
Praveen Seshadri, Miron Livny, Raghu Ramakrishnan: Sequence Query Processing. ACM SIGMOD Conference, 1994.

Praveen Seshadri, Miron Livny, Raghu Ramakrishnan: The Design and Implementation of a Sequence Database System. VLDB Conference, 1996.

Raghu Ramakrishnan, Donko Donjerkovic, Arvind Ranganathan, Kevin S. Beyer, Muralidhar Krishnaprasad: SRQL: Sorted Relational Query Language. International Conference on Scientific and Statistical Database Systems (SSDBM), 1998.

Donko Donjerkovic, Raghu Ramakrishnan: Probabilistic Optimization of Top N Queries. VLDB Conference, 1999.
  -   DEMON Mining evolving and streaming data.
We did some of the earliest work on mining data streams in the DEMON project, a follow-on 
to the SEQ project, developing algorithms for finding clusters in multidimensional data 
and for constructing decision trees. While these algorithms can be optimized further 
with multiple passes, they are capable of operating on continuous data streams, 
and they are part of a robust framework for incrementally mining evolving data and measuring the 
changes (proposed in an ICDE 2000 paper that won the Best Student Paper award).  These issues have been actively researched since then, and continue to be 
hot research topics. UW-Madison's WARF division has patented two of our algorithms, 
including BIRCH (scalable clustering; one of the 3 most cited database publications in the past
decade according to a recent study by Rahm and Thor) 
and BOAT (scalable decision tree construction).
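The heart of BIRCH is the clustering feature, a constant-size summary (N, LS, SS) of a cluster from which centroid and radius are recoverable; a one-dimensional sketch:

```python
import math

class ClusteringFeature:
    """BIRCH-style summary of a cluster of 1-D points: (N, LS, SS).

    N = count, LS = linear sum, SS = sum of squares.  Each point is
    absorbed in O(1) and then discarded, which is what lets BIRCH
    cluster data in a single pass over a stream.
    """
    def __init__(self):
        self.n, self.ls, self.ss = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += x * x

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # sqrt(E[x^2] - E[x]^2): root-mean-square distance of the
        # cluster's members from its centroid.
        return math.sqrt(max(self.ss / self.n - self.centroid() ** 2, 0.0))
```

Two clustering features can also be merged by adding their components, which is how BIRCH maintains its CF-tree as data arrives.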
 
This survey paper is a good starting point:
 
Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan: Mining Data Streams under Block Evolution. SIGKDD Explorations, January 2002.
 
Some representative papers are listed below; for more,  go here.
 
Tian Zhang, Raghu Ramakrishnan, and Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. ACM SIGMOD Conference, 1996.

Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan: DEMON: Data Evolution and Monitoring. IEEE Transactions on Knowledge and Data Engineering, 2001.  (Full version of the ICDE 2000 paper that won the Best Student Paper award.)

Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan, and Wei-Yin Loh: A Framework for Measuring Changes in Data Characteristics. Journal of Computer and System Sciences, 2002.  (Full version of a PODS 1999 paper.)

Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan, and Wei-Yin Loh: BOAT: Optimistic Decision Tree Construction. ACM SIGMOD Conference, 1999.

Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan: CACTUS: Clustering Categorical Data Using Summaries. ACM SIGKDD Conference, 1999.
  -   CORAL Deductive databases.
The CORAL project on deductive databases contributed a number of central algorithms and theoretical results, including the widely used magic sets query transformation and several refinements of incremental fixpoint computation.  While deductive databases have not been commercially successful (although IBM's DB2 database system supports recursive queries and the SQL standard includes support for recursion), the results have had significant impact in other contexts.

The magic sets idea greatly improved the performance of nested SQL queries, which are important in decision-support applications (and a significant component of the TPC-D benchmark), and is now used in commercial database systems for that reason.  The ideas behind incremental fixpoint evaluation have found extensive application in view materialization, now supported by all the major database vendors.

The ideas behind proof-tree based explanation systems are now finding applicability for maintaining provenance of data: we built the first proof-tree based explanation system, EXPLAIN, as part of the widely distributed CORAL system, and even today it is arguably the most complete proof-based explanation system for queries, supporting aggregation, negation, and recursion.
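Of these techniques, the incremental (semi-naive) fixpoint idea is easy to sketch: each round joins only the newly derived facts against the base relation, rather than recomputing everything. A minimal version for transitive closure (the magic sets transformation, which additionally pushes query bindings into this loop, is not shown):

```python
def transitive_closure(edges):
    """Semi-naive bottom-up fixpoint for the Datalog program
         path(X, Y) :- edge(X, Y).
         path(X, Y) :- path(X, Z), edge(Z, Y).
    Only the delta (facts new in the previous round) is joined with
    edge each iteration, the incremental refinement described above."""
    path = set(edges)
    delta = set(edges)
    while delta:
        new = {(x, w) for (x, y) in delta for (z, w) in edges if y == z}
        delta = new - path          # keep only genuinely new facts
        path |= delta
    return path
```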
 
This survey paper is a good starting point:
 
Raghu Ramakrishnan, Divesh Srivastava and S. Sudarshan: Efficient Bottom-Up Evaluation of Logic Programs. The State of the Art in Computer Systems and Software Engineering, 1992.
 
Some representative papers are listed below; for more,  go here.
 
Raghu Ramakrishnan, Divesh Srivastava, S. Sudarshan and Praveen Seshadri: The CORAL Deductive System. The VLDB Journal, 1994.

Catriel Beeri and Raghu Ramakrishnan: On the Power of Magic. Journal of Logic Programming, 1991.

Raghu Ramakrishnan: Magic Templates: A Spellbinding Approach to Logic Programs. Journal of Logic Programming, 1991.

Tarun Arora, Raghu Ramakrishnan, William G. Roth, Praveen Seshadri and Divesh Srivastava: Explaining Program Execution in Deductive Systems. International Conference on Deductive and Object-Oriented Databases, 1993.

S. Sudarshan and Raghu Ramakrishnan: Aggregation and Relevance in Deductive Databases. VLDB Conference, 1991.
  -   PIQ Query by image content.
The most notable achievements of this project were the results on high-dimensional indexing, including 
the following widely cited paper:
 
Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, Uri Shaft: When is "Nearest Neighbor" Meaningful? International Conference on Database Theory (ICDT), 1999.
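The paper's central observation, that the contrast between the nearest and farthest neighbor collapses as dimensionality grows for many data distributions, is easy to reproduce with uniform random data; the sample sizes and seed below are arbitrary:

```python
import math
import random

def contrast(dim, n=200, seed=0):
    """(Dmax - Dmin) / Dmin over distances from a random query point to n
    uniform random points in [0,1]^dim.  When this ratio is near zero,
    'nearest neighbor' carries little information."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist([rng.random() for _ in range(dim)], query)
             for _ in range(n)]
    dmin, dmax = min(dists), max(dists)
    return (dmax - dmin) / dmin
```

In low dimensions the ratio is large; by a few hundred dimensions the distances concentrate and the ratio is a small fraction of its 2-D value.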
  -   DEVise Interactive visualization of large datasets.
The DEVise tool is still available!  It originally began as an investigation of scalable interactive
visualization, but we never really got into the scalability aspects, unfortunately.  The basic 
framework remains promising, and the resulting tool is very flexible and fun to play with.
This project also supported the work on the paper that proposed iceberg cubes.
 
Some representative papers are listed below; for more,  go here.
 
 
Miron Livny, Raghu Ramakrishnan, Kevin S. Beyer, Guangshun Chen, Donko Donjerkovic, Shilpa Lawande, Jussi Myllymaki, R. Kent Wenger: DEVise: Integrated Querying and Visual Exploration of Large Datasets. ACM SIGMOD Conference, 1997.

Kevin S. Beyer and Raghu Ramakrishnan: Bottom-Up Computation of Sparse and Iceberg CUBEs. ACM SIGMOD Conference, 1999.
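An iceberg cube keeps only the group-by cells that pass a threshold such as a minimum count. The naive enumeration below just shows the output the BUC paper targets; BUC itself computes the cube bottom-up so that whole subtrees of cells are pruned as soon as an ancestor fails the threshold:

```python
from collections import Counter
from itertools import combinations

def iceberg_cube(rows, ndims, minsup):
    """Group-bys over every subset of the first ndims columns, keeping
    only cells whose count meets minsup (the iceberg condition)."""
    cube = {}
    for k in range(ndims + 1):
        for combo in combinations(range(ndims), k):
            counts = Counter(tuple(row[i] for i in combo) for row in rows)
            for cell, count in counts.items():
                if count >= minsup:
                    cube[(combo, cell)] = count
    return cube

# Toy fact table with two dimension columns.
cube = iceberg_cube([("a", "x"), ("a", "y"), ("a", "x"), ("b", "x")],
                    ndims=2, minsup=2)
```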
  -   COD Semantic database integration.
We investigated the problem of relational schema matching using information theoretic techniques, and showed several problems to be undecidable, leading to a best paper award at EDBT 1994.
  -   GESTALTS Internet-scale database federation.
The project that never quite was.  QUIQ happened instead.
   
 
  