DAWN'12
Workshop on Database Aspects Explored by Wisconsin's New DB Researchers 

December 12, 2012, 11:00AM-12:15PM
and
December 14, 2012, 11:00AM-12:15PM
CS 2310 (Note: not the usual classroom)
Madison, WI

Picnic Point at Dawn. Photo by: Jeff Miller, UW-Madison University Communications



December 12, 2012. Talks are in room CS 2310.

11:00-11:15

Genome-Health Risk Prediction of Residual Feed Intake in Dairy Cattle
Chen Yao

Incorporating health history information improved the predictive accuracy of genomic evaluation of residual feed intake (RFI) in dairy cattle. The random forests (RF) algorithm made better use of this extra information than the GBLUP model, possibly because of its ability to exploit complex interactions. Of the 11 health traits considered, birth body weight, scour in the calf period, and having twin calves showed significant effects on RFI. Further study of the interactions between these health traits and RFI is needed.

11:15-11:30

Complex Event Processing
Vinitha Gankidi and Harshad Deshmukh

Complex Event Processing (CEP) is an emerging area where streams of incoming data are examined to find complex and meaningful patterns. For example, a live feed of tweets or updates from Facebook can be processed to detect hot topics as they occur. Storm, from Twitter, is one such CEP engine; it makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. To explore the scalability of Storm, we ran SQL-style queries on Twitter data with increasing workloads and analyzed the response time and CPU usage.

11:30-11:45

Mitigating Skew in MapReduce: Black Boxes are Evil
Qiang Zeng

The user-defined partitioning and grouping functions in MapReduce close the door to good reduce-side skew mitigation. Existing skew handling algorithms fail to guarantee correctness in the presence of these black boxes. The opacity of user-defined reduce functions makes it impossible to distribute a large, or computationally expensive, group over more than one reducer, resulting in skewed execution. Our proposal is to expose the grouping to enable workload distribution optimizations, and to develop a skew mitigation approach with theoretical performance guarantees, which is particularly optimized for join-style reduce tasks. Preliminary results show order-of-magnitude speedups for skewed joins in our Hadoop prototype.
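
As a rough illustration of what distributing one large group over more than one reducer can mean for a join-style task, the sketch below routes records of a two-way equi-join so that an oversized key is spread over several reducers. It is a generic fragment-and-replicate sketch, not the approach presented in the talk; the function names, the `max_group_size` threshold, and the in-memory "reducers" are all assumptions made for illustration.

```python
from collections import defaultdict

def assign_reducers(left, right, num_reducers, max_group_size):
    """Route (key, value) records of a two-way equi-join to reducers.

    Keys with at most max_group_size left records go to a single reducer
    chosen by hash. Larger keys have their left records dealt round-robin
    over the reducers, and their right records are replicated to every
    reducer that received one of those left records.
    (Hypothetical helper; the threshold is an illustrative parameter.)
    """
    left_counts = defaultdict(int)
    for key, _ in left:
        left_counts[key] += 1

    buckets = [{"L": [], "R": []} for _ in range(num_reducers)]
    seen = defaultdict(int)  # left records already routed, per skewed key
    for key, value in left:
        if left_counts[key] > max_group_size:
            r = (hash(key) + seen[key]) % num_reducers
            seen[key] += 1
        else:
            r = hash(key) % num_reducers
        buckets[r]["L"].append((key, value))

    for key, value in right:
        if left_counts[key] > max_group_size:
            fanout = min(left_counts[key], num_reducers)
            targets = [(hash(key) + i) % num_reducers for i in range(fanout)]
        else:
            targets = [hash(key) % num_reducers]
        for r in targets:
            buckets[r]["R"].append((key, value))
    return buckets

def reduce_join(bucket):
    """Join the left and right records that landed on one reducer."""
    right_by_key = defaultdict(list)
    for key, value in bucket["R"]:
        right_by_key[key].append(value)
    return [(key, lv, rv)
            for key, lv in bucket["L"]
            for rv in right_by_key[key]]
```

Each left record lands on exactly one reducer and meets every matching right record there, so the union of the per-reducer outputs equals the full join. This routing is only possible when the system knows the reduce task is a join on the key, which is the kind of information a black-box grouping function hides.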

11:45-noon

An Evaluation of Lucene and PostgreSQL Full Text Search Engine
Lihao Wang and Kan Tao

In the context of “big data”, full text search is a critical tool for identifying natural language documents that satisfy a user query. Although most traditional DBMSs have a built-in full text search engine, Apache Lucene, an independent information retrieval library, has gained more popularity for indexing and searching large datasets. In this project, we propose a benchmark that combines index building time, index storage, and query speed to compare the performance of Apache Lucene and the PostgreSQL full text search engine. We then investigate which system performs better under various indexing scenarios, such as indexing with the Generalized Inverted Index (GIN) and the Generalized Search Tree (GiST) in PostgreSQL, and under various search scenarios, such as varying dataset sizes.

Noon-12:15

The Impact of Multi-Tenancy in Desktop and Mobile Systems using LevelDB and SQLite3
Derek Severson

Interactive web applications have increased in sophistication and now need browsers to offer efficient client-side storage. The push to standardize the web via HTML5 has led to the IndexedDB API fulfilling web applications' local storage needs. Due to the vast number of webpages users visit at any given time, browsers must provide an implementation of IndexedDB that performs well in a multi-tenancy environment. My goal is to benchmark and compare the performance trends of SQLite3 and LevelDB as the level of multi-tenancy increases to Internet scale on a desktop and a mobile device.


 

December 14, 2012. Talks are in room CS 2310.

11:00-11:15

SQL and NoSQL comparison on Interactive Data-Serving Environments

Junyan Chen

In this new era of “big data”, traditional RDBMSs are no longer the only viable option for data-driven applications. NoSQL systems are strong competitors to traditional RDBMSs for both interactive data-serving environments and analytical decision support workloads. A recent study compared the performance of NoSQL and SQL databases for OLTP and DW workloads. In this project, we aim to extend that study to include systems it did not cover, and we focus only on NoSQL systems for interactive data-serving environments. In particular, we compare SQL Server and Cassandra, a popular NoSQL system designed by Facebook, using the YCSB benchmark to characterize how these systems perform in interactive data-serving environments.

11:15-11:30

Optimizing image storage

Joy Arulraj

Efficient image storage and retrieval mechanisms are crucial to improve end-user experience with minimal cost in modern storage systems. Our focus is primarily on image objects, in particular those encoded in JPEG format. We intend to tile the JPEG images, identify similar image tiles and then reduce storage cost by leveraging these data patterns. The trade-off between storage benefits and associated image retrieval latency is also evaluated. We have observed 5-10% savings in overall storage cost using our current image storage mechanism.
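
To make the tiling idea concrete, here is a minimal sketch that stores each distinct tile once and keeps a per-image list of tile references. It makes several simplifying assumptions that are not from the talk: images are already decoded into rows of 8-bit grayscale values rather than JPEG data, only byte-identical tiles are shared (the abstract speaks of similar tiles), and the `TileStore` class and its methods are hypothetical names.

```python
import hashlib

def split_into_tiles(pixels, tile):
    """Yield each tile's pixel values as bytes, in row-major tile order."""
    height, width = len(pixels), len(pixels[0])
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            block = [row[left:left + tile] for row in pixels[top:top + tile]]
            yield bytes(v for row in block for v in row)

class TileStore:
    """Hypothetical store that keeps one copy of each distinct tile."""

    def __init__(self, tile=8):
        self.tile = tile
        self.tiles = {}    # SHA-256 digest -> tile bytes
        self.images = {}   # image name -> (height, width, tile digests)

    def put(self, name, pixels):
        digests = []
        for data in split_into_tiles(pixels, self.tile):
            digest = hashlib.sha256(data).hexdigest()
            self.tiles.setdefault(digest, data)  # store only unseen tiles
            digests.append(digest)
        self.images[name] = (len(pixels), len(pixels[0]), digests)

    def get(self, name):
        """Rebuild the image from its tile references."""
        height, width, digests = self.images[name]
        pixels = [[0] * width for _ in range(height)]
        refs = iter(digests)
        for top in range(0, height, self.tile):
            for left in range(0, width, self.tile):
                data = self.tiles[next(refs)]
                tile_w = min(self.tile, width - left)
                for i, value in enumerate(data):
                    pixels[top + i // tile_w][left + i % tile_w] = value
        return pixels
```

Reading an image back requires one lookup per tile, which is one simple way to see why storage savings and retrieval latency pull in opposite directions.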

11:30-11:45

Comparison of Clustering Algorithms for Large Data Sets

Evan Samanas and Halit Erdogan

We present an empirical comparison of large-scale clustering algorithms. We run experiments on both real and synthetic data sets and investigate the performance of the algorithms in terms of speed-up and scale-up. We also examine the quality of the clusterings produced by the algorithms using available quality metrics. Finally, we present interesting applications that show how clustering could be important and useful in some large-scale applications.

11:45-noon

A Rule-Based Stand Alone Query Optimizer for Main Memory Systems

Brian Sullivan

Changing database systems, with their different storage methods and performance properties, have posed a challenge to query optimization. A customizable query optimizer that makes no assumptions about how the database is configured would have an advantage in this changing landscape. Additionally, systems with larger amounts of main memory will benefit from query optimizers that can intelligently use this extra memory. This project implements a top-down query optimizer that uses a rules engine to optimize queries and outputs the query plan in a standardized format, and provides a caching strategy for speeding up optimization across multiple queries.

Noon-12:15

Evaluating Database Sort on Modern GPUs

Jason Power

In the past few years, graphics processing units (GPUs) have become much more programmable, with improved languages and faster drivers. Previous work evaluated GPUs for sorting within database management systems and found them competitive with CPUs. However, in the most recent entries to the sort benchmark, the CPU reigned supreme. By applying modern GPU algorithms and programming environments, we show that GPUs are once again competitive with, and in fact can outperform, CPU sorting algorithms.


 

If you are looking for the CS 764 course home page, click here.