DAWN'08: Database Aspects explored by Wisconsin's New researchers

December 2 and 4, 2008, Madison, WI



Program

Day 1: December 2, 2008 (Rm 2310CS)
1:00-1:15 PECARI: Pig Engineers Care About R Improvements
Tristan Ravitch, Evan Driscoll and James Jolly

Pig Latin is an imperative query language geared toward expressing transformations of data stored in large files sharded across many machines. The Pig environment gives programmers more control over the order in which operations are applied to data than environments tied to declarative languages like SQL.

No significant optimizations are currently implemented in Pig. Our project aims to find a middle ground in the automatic query optimization spectrum, applying simple join-order optimizations to Pig query execution plans while still preserving the rest of the language's specificity.

Our talk outlines these optimizations, discusses what makes them possible, and notes how they need to be modified to work in the current Pig environment.
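As a rough illustration of the kind of join-order optimization the abstract describes, the sketch below greedily orders joins by estimated intermediate result size. The relation names, statistics, and the single-key selectivity estimate are hypothetical and are not taken from the Pig implementation.

```python
# Hypothetical sketch of a simple join-order heuristic: start from the
# smallest relation, then repeatedly add the relation that keeps the
# estimated intermediate result smallest.  Statistics are made up.

def greedy_join_order(relations, stats):
    """Return a left-deep join order chosen greedily by estimated join size."""
    remaining = set(relations)
    current = min(remaining, key=lambda r: stats[r]["rows"])
    remaining.remove(current)
    order = [current]
    current_rows = stats[current]["rows"]
    current_distinct = stats[current]["distinct_key"]
    while remaining:
        # Textbook estimate: |L| * |R| / max(distinct join-key values).
        def est(r):
            return (current_rows * stats[r]["rows"]
                    / max(current_distinct, stats[r]["distinct_key"]))
        nxt = min(remaining, key=est)
        current_rows = est(nxt)
        current_distinct = min(current_distinct, stats[nxt]["distinct_key"])
        remaining.remove(nxt)
        order.append(nxt)
    return order

if __name__ == "__main__":
    stats = {  # hypothetical table statistics
        "visits":  {"rows": 10_000_000, "distinct_key": 500_000},
        "users":   {"rows": 500_000,    "distinct_key": 500_000},
        "blocked": {"rows": 1_000,      "distinct_key": 1_000},
    }
    print(greedy_join_order(["visits", "users", "blocked"], stats))
```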
1:15-1:30 SigMatch – Fast Signature Matching
Ramakrishnan Kandhan and Nikhil Teletia

Viruses have become a major threat to Internet security, and virus detection and removal have become an integral part of computing. Fast virus scanning has grown increasingly important over the years due to the massive increase in storage capacity and the exponential growth in the number of virus signatures. In this project we propose to build a high-speed, scalable virus-scanning algorithm that can be integrated into existing anti-virus applications such as ClamAV.
1:30-1:45 Rendering Arbitrary SQL Data through MediaWiki
Jeff Ballard and Khai Tran

Accessing structured data via the Web is very important, yet many challenges remain. Typical methods to explore the data provide only partial functionality. HTML forms are well understood and easy for novices to use, but they support only a limited set of queries with a fixed set of parameters. In contrast, SQL provides the desired flexibility but is only useful to experts. A third method, keyword search over structured data, does not exploit the relations expressed by the schema well.

In this project, we propose an end-to-end system that combines these methods to overcome their weaknesses. First, we extend a web page with Microsoft Excel-style "cells" so that we can easily collect arguments and render the results in natural language. Then, we use keyword search to lead the user to the correct page. To allow expert users to modify and write additional SQL queries, we implement our system in a wiki format.

We believe this provides a useful extension to the HTML paradigm, allowing users to explore related structured data. We have applied our solution to a slice of Wikipedia -- the Major League Baseball domain -- helping users navigate the information within this domain more easily.
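To make the "cell" idea concrete, here is a toy sketch of substituting a parameter into a named, expert-authored SQL query and rendering the result inline. The markup syntax, query names, table, and data are all hypothetical illustrations, not the actual system.

```python
# Hypothetical sketch: a wiki-style {{cell:query|arg}} marker is expanded by
# running a stored SQL query with the argument and splicing in the result.
import re
import sqlite3

QUERIES = {  # named queries an expert might author on the wiki (illustrative)
    "career_home_runs": "SELECT SUM(hr) FROM batting WHERE player = ?",
}

def render(wikitext, conn):
    """Replace {{cell:query|arg}} markers with the query result."""
    def expand(match):
        name, arg = match.group(1), match.group(2)
        row = conn.execute(QUERIES[name], (arg,)).fetchone()
        return str(row[0])
    return re.sub(r"\{\{cell:(\w+)\|([^}]+)\}\}", expand, wikitext)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE batting (player TEXT, hr INTEGER)")
    conn.executemany("INSERT INTO batting VALUES (?, ?)",
                     [("Hank Aaron", 44), ("Hank Aaron", 39)])
    page = "Hank Aaron hit {{cell:career_home_runs|Hank Aaron}} home runs."
    print(render(page, conn))
```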
1:45-2:00 SSD-Join: A Cost Model of Join Algorithms on Flash Memory SSD
Jongwon Yoon and Jaeyoung Do

In relational databases, the join is not only the most commonly used way to combine information but also one of the most expensive operations. It is therefore important that the query optimizer select the most appropriate join algorithm, guided by an accurate join cost model, when generating a query plan. Much research has been devoted to deriving detailed cost models for join methods; however, there is no known work for flash memory SSDs, which are rapidly expanding as a new data storage medium that could substitute for magnetic disks. In this work, we derive a detailed join cost model for SSDs that accounts for their distinct characteristics. Moreover, we use our cost model to choose a good I/O buffer allocation for join algorithms.
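For flavor, the sketch below estimates the I/O cost of a classic two-pass (partition-then-probe) hash join with separate read and write costs, which is one way the read/write asymmetry of flash can enter a join cost model. This is an illustrative textbook formula, not the authors' model, and all parameter values are hypothetical.

```python
# Illustrative sketch (not the authors' model): I/O cost of a two-pass hash
# join, with distinct per-page read and write costs to reflect flash SSDs.

def grace_hash_join_io_cost(pages_r, pages_s, read_cost, write_cost):
    """Partition phase reads and writes both inputs; probe phase rereads them."""
    partition = (pages_r + pages_s) * (read_cost + write_cost)
    probe = (pages_r + pages_s) * read_cost
    return partition + probe

if __name__ == "__main__":
    # Hypothetical per-page costs in milliseconds: SSD reads much cheaper
    # than writes, magnetic disk roughly symmetric.
    print("SSD :", grace_hash_join_io_cost(10_000, 50_000, 0.1, 0.4))
    print("Disk:", grace_hash_join_io_cost(10_000, 50_000, 8.0, 8.0))
```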
2:00-2:15 CALF: Comparison of Attribute Layouts on Flash
Satish Kotha and Priyananda Shenoy

Most databases today physically store data in a row-store layout, in spite of results showing the advantages of a column-store layout. This is mostly because the seek cost of disk drives imposes a steep penalty on the random accesses that column stores rely on. With solid state drives replacing disk drives in database applications, there is a need to re-evaluate the question of column layouts. Moreover, existing databases and evaluations tend to focus on reads, whereas write costs are significant for solid state disks.

This project reexamines the question of column layout models for flash-based databases. We propose a flexible data storage model that partitions attributes based on a given workload, taking into account both reads and writes. Given the workload, we find the optimal grouping of attributes, with each group stored in a different page, that minimizes the total cost for that workload. We evaluate the performance of this intelligent partitioning against n-ary and column storage models.
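A toy sketch of this kind of layout search follows: enumerate attribute groupings and pick the one with the lowest workload cost, charging more for pages that must be rewritten than for pages that are only read. The cost constants, workload, and per-group page assumption are hypothetical simplifications, not the authors' model.

```python
# Toy sketch: exhaustive search over attribute groupings for a small schema,
# scoring each grouping against a workload with asymmetric read/write costs.

READ_COST, WRITE_COST = 1.0, 4.0   # hypothetical per-page costs (flash asymmetry)

def set_partitions(items):
    """Yield all partitions of a small list of attributes."""
    if len(items) == 1:
        yield [items]
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        for i, group in enumerate(partition):
            yield partition[:i] + [[first] + group] + partition[i + 1:]
        yield [[first]] + partition

def layout_cost(groups, workload):
    """workload: list of (frequency, attributes read, attributes written)."""
    total = 0.0
    for freq, reads, writes in workload:
        for group in groups:
            g = set(group)
            if g & reads:
                total += freq * READ_COST     # the group's page must be read
            if g & writes:
                total += freq * WRITE_COST    # the group's page must be rewritten
    return total

def best_layout(attrs, workload):
    return min(set_partitions(attrs), key=lambda g: layout_cost(g, workload))

if __name__ == "__main__":
    workload = [
        (100, {"id", "price"}, set()),                    # frequent narrow read
        (10,  {"id"}, {"stock"}),                         # occasional update
        (1,   {"id", "price", "desc", "stock"}, set()),   # rare full scan
    ]
    print(best_layout(["id", "price", "desc", "stock"], workload))
```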

Day 2: December 4, 2008 (Rm 3310CS)
1:00-1:15 FABLE: identiFying unique Approximate suBgraphs in cLusters Expeditiously
Amanda Burton, Debbie Chasman and Dalibor Zeleny

Recent biological research has produced data in the form of graphs describing the gene expression of stem cells at different stages of differentiation. Probing for similarities and differences between graphs from different kinds of differentiated cells can give us insight into the mechanism of differentiation. Other biological graph data may be approached in this way as well. For example, features unique to clusters representing different species or cellular processes may shed light on events in genetic history. We can generalize the problem thus: given a database of graphs clustered by some measure of similarity, find subgraphs that occur frequently in one cluster but infrequently in all others.

While the mining of frequent subgraphs is well studied, only the GeneGO Analysis by Tian and Patel attempts to discover unique subgraphs. They use a brute-force algorithm whose computational complexity grows rapidly as the requested size of unique subgraphs increases. The FABLE project began where GeneGO left off, using the same algorithm to look for larger unique subgraphs.  We also explore options for improving the algorithm's run-time using heuristics.
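As a small illustration of the selection criterion ("frequent in one cluster but infrequent in all others"), the sketch below filters candidate subgraphs once their per-cluster support has already been computed, which is the expensive part the abstract refers to. The thresholds, pattern labels, and support values are hypothetical.

```python
# Toy sketch of the uniqueness criterion, assuming subgraph support per
# cluster is already known.  Thresholds and data are illustrative only.

def unique_patterns(support, target, min_support=0.6, max_other_support=0.1):
    """support: {pattern: {cluster: fraction of graphs containing pattern}}."""
    result = []
    for pattern, by_cluster in support.items():
        if by_cluster.get(target, 0.0) < min_support:
            continue
        others = [v for c, v in by_cluster.items() if c != target]
        if all(v <= max_other_support for v in others):
            result.append(pattern)
    return result

if __name__ == "__main__":
    support = {   # hypothetical support values
        "A->B->C": {"neural": 0.8, "liver": 0.05, "skin": 0.0},
        "A->C":    {"neural": 0.9, "liver": 0.7,  "skin": 0.6},
    }
    print(unique_patterns(support, "neural"))
```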
1:15-1:30 Improved Aggregation for Graph Summarization
Shan-Hsiang Shen and Ning Zhang

Graphs are widely used to model real-world objects and their relationships, and large graph datasets are common in many application domains. Graph summarization techniques are important for understanding the underlying characteristics of large graphs. Two recent pieces of work address this problem from different points of view; their results are convincing and useful, but they still have limitations and leave questions open. In this paper, we follow up on that work and introduce new strategies to address those open questions. First, we relax the homogeneity requirement on node attributes: the summarization groups nodes based on the similarity of attribute values as well as the relationships between nodes. Second, for numerical attributes, we mine cut-offs that convert numerical attributes into categorical ones. Third, rather than letting the user exhaustively explore summaries at different resolutions, we suggest good summaries based on an interestingness measure. Finally, we evaluate our strategies through extensive experiments.
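To give a feel for attribute-and-relationship based grouping, the sketch below starts from groups defined by an attribute value and then refines them until nodes in the same group also connect to the same set of groups, in the spirit of SNAP-style summarization. It is a generic illustration under that assumption, not the authors' algorithm.

```python
# Rough sketch of one grouping strategy: group by attribute value, then
# refine until nodes in a group share the same set of neighbor groups.
from collections import defaultdict

def summarize(nodes, edges, attr):
    """nodes: iterable of ids; edges: set of (u, v); attr: {node: value}."""
    group = {n: attr[n] for n in nodes}          # initial grouping by attribute
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    while True:
        # Signature = (own group, sorted set of neighbor groups).
        sig = {n: (group[n], tuple(sorted({group[m] for m in neighbors[n]})))
               for n in nodes}
        if len(set(sig.values())) == len(set(group.values())):
            return sig                            # refinement has stabilized
        group = sig

if __name__ == "__main__":
    nodes = ["a1", "a2", "b1", "b2"]
    edges = {("a1", "b1"), ("a2", "b2"), ("b1", "b2")}
    attr = {"a1": "student", "a2": "student", "b1": "prof", "b2": "prof"}
    print(summarize(nodes, edges, attr))
```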
1:30-1:45 Online Emerging Pattern Detection
Chao Xie, Min Qiu and Zhuo Tao

As large graph databases become common in many emerging database applications, more graph mining and analysis techniques are required to extract crucial information from graphs. Previous work has largely focused on detecting frequent patterns in static graph databases. The task we address in this project is detecting newly emerging patterns over a stream of graphs.

We view detecting emerging patterns as a task complementary to finding frequent patterns, which has been the focus of data mining research for many years. Indexing at the node level (with neighborhood information) is efficient for matching subgraphs against the database in a node-expansion manner and scales linearly with the size of the database, but it loses subgraph information, which would amount to data of exponential size. Many subgraphs are redundant by nature, so it makes sense to retain only the larger subgraphs rather than the smaller ones. We therefore propose a two-level indexing system that indexes both frequent patterns and whole graphs, expecting that most of the matching work can be done at the first level while graphs containing emerging patterns pass through to the second level. We also propose a mechanism to dynamically update pattern frequencies as time goes on.
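One plausible form of the dynamic frequency-update mechanism mentioned above is an exponentially decayed count, so that recently seen patterns outweigh old ones. The decay scheme, threshold, and pattern labels below are illustrative assumptions, not the authors' design.

```python
# Small sketch: decayed pattern counts over a stream of graphs, flagging
# patterns whose decayed count crosses an (illustrative) threshold.
class DecayedPatternCounter:
    def __init__(self, decay=0.95, threshold=3.0):
        self.decay = decay
        self.threshold = threshold
        self.counts = {}

    def observe(self, patterns_in_graph):
        """Called once per graph arriving on the stream."""
        for p in self.counts:                       # age all existing counts
            self.counts[p] *= self.decay
        flagged = []
        for p in patterns_in_graph:
            self.counts[p] = self.counts.get(p, 0.0) + 1.0
            if self.counts[p] >= self.threshold:
                flagged.append(p)
        return flagged

if __name__ == "__main__":
    counter = DecayedPatternCounter()
    stream = [{"x-y"}, {"x-y"}, {"x-y", "y-z"}, {"x-y"}]
    for i, patterns in enumerate(stream):
        print(i, counter.observe(patterns))
```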
1:45-1:55 Tree-based parallel Smith-Waterman
Avrilia Floratou

Biological sequence comparison is an important tool for researchers in molecular biology. The Smith-Waterman algorithm, based on dynamic programming, is one of the most fundamental algorithms in bioinformatics, as it finds the best local alignments. However, it is not widely used due to its large memory requirements and high computational cost. Several parallel implementations of the Smith-Waterman algorithm that improve its performance have been proposed. We focus on the PSW-DC algorithm, a parallel Smith-Waterman algorithm based on the divide-and-conquer method. The memory space required by this algorithm is reduced significantly in comparison with existing parallel implementations. The goal of this project is to further improve the performance of this algorithm by deploying it over a tree of distributed nodes. Moreover, we describe techniques that further reduce the amount of memory needed without affecting the result of the algorithm.
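For reference, the core Smith-Waterman recurrence that parallel variants such as PSW-DC decompose is short; below is a plain, unoptimized score-only sketch with a linear gap penalty. The scoring parameters are hypothetical defaults.

```python
# Plain, unoptimized Smith-Waterman local alignment score (linear gap
# penalty); parallel variants split this dynamic-programming table.

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

if __name__ == "__main__":
    print(smith_waterman_score("GGTTGACTA", "TGTTACGG"))
```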
1:55-2:05 Characterization of Hash Join Algorithms in Multicore Architectures
Derek Hower

As we continue our journey deeper into the multicore era, it will be necessary for Database Management Systems to adopt new algorithms that can utilize multiple hardware threads if they hope to continue scaling with new hardware generations. Towards that end, we analyze the characteristics of a multithreaded hash-join on different multicore architectures. We propose three different partitioning schemes designed to exploit the fast interthread communication available in multicore architectures. Our experiments are run on two current architectures that represent different tradeoffs in terms of communication latency and resource contention, namely the Sun Niagara (a.k.a. T1) and the Intel Clovertown platforms. Results are pending.
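To show the basic structure being parallelized, here is a minimal sketch of a hash-partitioned join in which each partition pair is handed to its own worker. It illustrates the partition/build/probe structure only; it is not one of the three proposed schemes, and CPython's GIL means the threads here are structural rather than a source of real speedup.

```python
# Illustrative sketch of a hash-partitioned join: both inputs are split on
# the join key so each partition pair can be joined independently.
from concurrent.futures import ThreadPoolExecutor

def partition(rows, key, n):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def join_partition(r_part, s_part, key):
    table = {}
    for r in r_part:                       # build phase
        table.setdefault(r[key], []).append(r)
    out = []
    for s in s_part:                       # probe phase
        for r in table.get(s[key], []):
            out.append({**r, **s})
    return out

def parallel_hash_join(r, s, key, n_workers=4):
    r_parts = partition(r, key, n_workers)
    s_parts = partition(s, key, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(lambda args: join_partition(*args),
                           [(r_parts[i], s_parts[i], key) for i in range(n_workers)])
    return [row for part in results for row in part]

if __name__ == "__main__":
    r = [{"id": i, "name": f"user{i}"} for i in range(8)]
    s = [{"id": i % 4, "amount": 10 * i} for i in range(8)]
    print(len(parallel_hash_join(r, s, "id")))
```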
2:05-2:15 iKnow: An automatic information extraction system
Vidhya Murali

The iKnow project aims to automate the process of keyword extraction. Automatic keyword extraction is a process by which representative terms are systematically extracted from a text with minimal or no human intervention. Keywords thus extracted can help with summarizing documents, indexing documents, classifying web pages, and improving information retrieval. In this work, we use a combination of statistical and linguistic techniques to extract keywords. A combination of term frequency-inverse document frequency (TF-IDF), n-grams, POS tagging, and concept clustering is employed to identify keywords, and we observe a marked improvement in the precision of the results. In this way, we efficiently automate the process of identifying entities (concepts) and the relationships between them from domain-specific web pages.
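As a small illustration of the statistical part of such a pipeline, the sketch below ranks the terms of one document by TF-IDF; the n-gram, POS-tagging, and concept-clustering stages are omitted, and the example documents are hypothetical.

```python
# Minimal TF-IDF keyword ranking over a tiny, hypothetical corpus.
import math
from collections import Counter

def tfidf_keywords(docs, doc_index, top_k=3):
    """Return the top_k terms of docs[doc_index] ranked by TF-IDF."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(term for doc in tokenized for term in set(doc))
    tf = Counter(tokenized[doc_index])
    n = len(docs)
    score = {t: (tf[t] / len(tokenized[doc_index])) * math.log(n / df[t])
             for t in tf}
    return sorted(score, key=score.get, reverse=True)[:top_k]

if __name__ == "__main__":
    docs = [
        "the pitcher threw a perfect game for the brewers",
        "the brewers traded their starting pitcher last season",
        "the stadium sold out for the season opener",
    ]
    print(tfidf_keywords(docs, 0))
```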