If you are looking for the CS 764 course home page, click here.
Bascom Hall at Dawn. Photo by: Jeff Miller, UW-Madison University Communications
December 12, 2011. Talks are in room CS 2310. |
|
9:00-9:15 |
Fast Querying on Non-Key Values for Main Memory Sparse Tables The falling cost of main memory, combined with the relatively high performance cost of storing and retrieving tables on disk, has led to infrastructures that store entire databases as main-memory objects. This has opened new areas of research, such as evaluating distributed databases with flexible key-value data models that reside completely in main memory. With the database now moving to main memory, we need to rethink the traditional data access structures and algorithms used for relational database operators so that they are optimized for other goals, such as improving storage utilization even at the cost of a few additional CPU cycles per operation. In this project, we build databases using CSB+-trees, Sparse Hash indices, and Extended Hash indices that reside completely in main memory, and evaluate the Search and Insert operations on these structures. The data models evaluated are sparse data tables stored in two formats: the Horizontal Schema and Interpreted Storage. |
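To make the two storage formats concrete, here is a minimal Python sketch. It assumes the usual meaning of the terms: a horizontal schema keeps one slot per attribute (mostly empty for sparse data), while interpreted storage keeps only the attribute/value pairs that are present. The attribute names, the helper functions, and the plain dictionary used as a non-key value index are all hypothetical stand-ins, not the project's CSB+-tree, Sparse Hash, or Extended Hash structures.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical sparse table: four attributes, most of them missing in any row.
ATTRIBUTES = ["color", "weight", "voltage", "isbn"]


def to_horizontal(row: Dict[str, str]) -> List[Optional[str]]:
    """Horizontal schema: one slot per attribute, None where no value exists."""
    return [row.get(name) for name in ATTRIBUTES]


def to_interpreted(row: Dict[str, str]) -> List[Tuple[int, str]]:
    """Interpreted storage: keep only (attribute id, value) pairs that exist."""
    return [(i, row[name]) for i, name in enumerate(ATTRIBUTES) if name in row]


def lookup_interpreted(record: List[Tuple[int, str]], attr_id: int) -> Optional[str]:
    """Reading one attribute scans the pairs: more CPU work, less space."""
    for i, value in record:
        if i == attr_id:
            return value
    return None


def build_value_index(records: List[List[Tuple[int, str]]]) -> Dict[Tuple[int, str], List[int]]:
    """Stand-in index on non-key values: (attr id, value) -> row numbers."""
    index: Dict[Tuple[int, str], List[int]] = {}
    for row_no, record in enumerate(records):
        for pair in record:
            index.setdefault(pair, []).append(row_no)
    return index
```

The trade-off the abstract mentions shows up directly: the interpreted record is shorter, but every attribute read pays for a scan.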
9:15-9:30 |
An Evaluation of Machine Learning Algorithms on Multicore In the multicore era, with increasingly affordable storage and computing resources, traditional machine learning techniques cannot satisfy some sophisticated analysis requests, and massively parallel processing of big data is critical to an organization's success. In this course project, we design and evaluate techniques to parallelize machine learning algorithms on multicore processors. |
9:30-9:45 |
PALM: A B+ Tree for Main Memory Databases Tremendous growth in main memory size has led to a proliferation of databases that reside completely in memory. Focus has now shifted toward designing algorithms and data structures for these main memory databases. A B+ tree is often used as an indexing structure in databases. In our project, we have implemented PALM, a recently proposed technique for performing multiple concurrent queries on an in-memory B+ tree. PALM uses the Bulk Synchronous Parallel model to carry out concurrent operations without latches. PALM is thus much more scalable and performs better than previously proposed approaches. |
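One way to picture latch-free batch processing in a bulk synchronous style is sketched below in Python. This is not the PALM implementation: the one-level "tree", the `separators` list, and all function names are hypothetical, and splits and internal-node updates are left out entirely. The sketch only shows the core idea that a batch is first routed to leaves in a read-only step and, after a barrier, each leaf is modified by exactly one task, so no two tasks touch the same leaf.

```python
from bisect import bisect_right
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical one-level "tree": separators[i] is the smallest key of leaf i + 1.
separators = [10, 20, 30]
leaves = [dict() for _ in range(len(separators) + 1)]


def find_leaf(key):
    """Read-only search step: map a key to the index of the leaf that owns it."""
    return bisect_right(separators, key)


def apply_to_leaf(leaf_id, ops):
    """A single task owns this leaf for the whole superstep, so no latch is taken."""
    results = []
    for op, key, value in ops:
        if op == "insert":
            leaves[leaf_id][key] = value
        else:  # "search"
            results.append((key, leaves[leaf_id].get(key)))
    return results


def run_batch(batch, workers=4):
    # Superstep 1: find the target leaf of every operation without modifying anything.
    by_leaf = defaultdict(list)
    for op in batch:
        by_leaf[find_leaf(op[1])].append(op)
    # Barrier, then superstep 2: all operations of a leaf go to exactly one task.
    out = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(apply_to_leaf, leaf_id, ops) for leaf_id, ops in by_leaf.items()]
        for future in futures:
            out.extend(future.result())
    return sorted(out, key=lambda pair: pair[0])
```

Operations on the same leaf keep their batch order, so a search issued after an insert of the same key in one batch sees the inserted value.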
9:45-9:55 |
Break/Catchup |
9:55-10:10 |
The changing game of SIMD-optimized join algorithms The multiprocessor-optimized join has been a topic of academic interest since the 1980s. It involves combining advances in both software (e.g. algorithm performance, compiler design) and hardware (e.g. CPU parallelism, memory capacity). Join algorithms have been designed to take advantage of improvements such as SIMD; however, because the context is continually changing, designs that have been considered state of the art may sometimes return surprising results. This project demonstrates this incongruity in the SIMD-optimized block nested loop join and compares its potential performance against hash joins. |
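For readers unfamiliar with the block nested loop join, the following scalar Python sketch shows its structure. It is an illustration only: the function name, the tuple layout, and the `block_size` parameter are assumptions, and no SIMD is used. The innermost comparison loop is the part a SIMD-optimized version would presumably vectorize.

```python
def block_nested_loop_join(outer, inner, block_size=2):
    """Join two lists of (key, payload) tuples on key, one outer block at a time."""
    result = []
    for start in range(0, len(outer), block_size):
        block = outer[start:start + block_size]
        # The inner relation is scanned once per outer block, not once per outer row.
        for inner_key, inner_payload in inner:
            # Comparing one inner key against every key in the block is the
            # step a SIMD version would perform with vector instructions.
            for outer_key, outer_payload in block:
                if outer_key == inner_key:
                    result.append((outer_key, outer_payload, inner_payload))
    return result
```

Larger blocks mean fewer passes over the inner relation, which is why the block size is usually chosen to fit the available cache or memory.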
10:10-10:25 |
Sort Merge vs. Main Memory Hash Join Algorithm for Multi-core CPUs Join is a basic but important database operation. Large-scale multi-core processors are already common, and the future will be dominated by ever larger multi-core processors. Exploiting the salient features of modern processors to execute join algorithms more efficiently is challenging and crucial work. Hash join and sort merge join are two typical and popular join algorithms, and opinions differ on which is more efficient. For modern multi-core CPUs, Changkyu Kim et al. conclude that multiple hardware threads on each core and vector instructions operating on 128-bit vectors favor the sort merge join algorithm, and that sort merge join will eventually outperform hash join. However, Spyros et al. point out that the simple hash join algorithm, which does not partition the input relations, outperforms all other algorithms. Consequently, this project compares the performance of a multi-threaded SIMD sort merge algorithm with that of the simple hash join algorithm. |
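The two algorithms being compared can be sketched in single-threaded Python as follows. This is a minimal illustration under stated assumptions: relations are lists of (key, payload) tuples, the function names are hypothetical, and neither multi-threading nor SIMD is shown. The hash join builds one hash table over an unpartitioned input, matching the "simple" variant described above.

```python
from collections import defaultdict


def hash_join(build, probe):
    """Simple hash join: one hash table, no partitioning of either input."""
    table = defaultdict(list)
    for key, payload in build:
        table[key].append(payload)
    return [(key, b, p) for key, p in probe for b in table.get(key, [])]


def sort_merge_join(left, right):
    """Sort both inputs on the key, then merge runs of equal keys."""
    left = sorted(left, key=lambda t: t[0])
    right = sorted(right, key=lambda t: t[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Pair every left row carrying this key with the whole right run.
            while i < len(left) and left[i][0] == lk:
                for k in range(j, j_end):
                    out.append((lk, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out
```

Both functions return the same set of joined rows; they differ in where the work goes (hashing and random access versus sorting and sequential scans), which is the trade-off the project measures.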
10:25-10:40 |
Power Management in High Throughput Computing Environment With the dramatic increase in data storage and data processing needs, clusters have been scaled up substantially to meet demand. It is now very common for a commercial cluster to have thousands of nodes that are either busy or idle awaiting requests. It has therefore become urgent for many cluster systems to adopt effective power management that reduces power consumption without sacrificing too much performance. The usual simple approach of turning off under-utilized machines is not good enough, because it does not consider the penalty of killing long-running, non-checkpointable jobs. The possibility of killing such jobs affects both power consumption and job response time. Our solution takes both utilization and job running time into consideration when applying a turn-off/turn-on policy, with the goal of saving as much power as possible while degrading cluster performance as little as possible. |
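A turn-off decision that weighs both utilization and job running time might look like the Python sketch below. The abstract does not give the actual policy, so everything here is an assumption made for illustration: the `Job` and `Node` classes, the two thresholds, and the rule that only work from non-checkpointable jobs counts as lost.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    runtime_hours: float   # how long the job has been running so far
    checkpointable: bool   # True if the job can be resumed after a shutdown


@dataclass
class Node:
    utilization: float     # fraction of capacity in use, 0.0 to 1.0
    jobs: List[Job]


# Hypothetical thresholds, chosen only for this example.
UTIL_THRESHOLD = 0.2
MAX_LOST_HOURS = 1.0


def lost_work_hours(node: Node) -> float:
    """Work that a shutdown would throw away: runtime of jobs that cannot checkpoint."""
    return sum(job.runtime_hours for job in node.jobs if not job.checkpointable)


def should_power_off(node: Node) -> bool:
    """Turn a node off only if it is lightly used and little work would be lost."""
    if node.utilization >= UTIL_THRESHOLD:
        return False
    return lost_work_hours(node) <= MAX_LOST_HOURS
```

A utilization-only policy would drop the second check, which is exactly the case where a long-running, non-checkpointable job gets killed.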
December 14, 2011. Talks are in room CS 2310. |
|
9:00-9:15 |
NoSQL Database System Benchmark: HBase, Cassandra, CouchDB With the rise of NoSQL databases, these schema-less systems have emerged as a new solution for storing larger and more flexible data types. So far, the major NoSQL databases have been classified into four major categories. In this project, we introduce two of them: the column-oriented store (HBase, Cassandra) and the document-oriented store (CouchDB). We propose a query-based benchmark to understand how these NoSQL database systems perform under several traditional query workloads, such as projection, aggregation, range search, bulk load, and join. We start by understanding the architecture and operations of these NoSQL systems with respect to their data models. Then we add secondary indexes to boost the performance of each database system. |
9:15-9:30 |
Analysis of SIMD String Processing Instructions SIMD vector instructions help improve CPU performance by processing multiple data values using a single instruction. While most newer x86-based processors support them, few peer-reviewed studies have examined the impacts that these instructions could have on database systems. Older SIMD instructions, as well as newer SIMD instructions for processing strings, are especially relevant to some database operations, such as processing aggregates, bitmaps, column-stored data, and strings. String comparisons and regular expression matching are the latest areas for which SIMD instructions may be useful. Here, we present a quantitative analysis of these impacts using SQLite, Quickstep, and an open source regular expression library. |
9:30-9:45 |
Concurrency Controls of Cache-Sensitive B+-Tree on QuickStep Recent trends in database systems have focused on optimizing cache utilization for database indexes stored in main memory. At the same time, the availability of many-core architectures presents interesting questions about scaling the performance of concurrent query processing. The traditional B+-tree has been studied extensively in both regards. In this project, we extend a latch-minimized concurrency protocol and a tree-partition protocol to the Cache-Sensitive B+-tree (CSBtree), and compare their performance in a multi-core environment with a single-threaded CSBtree. We aim to incorporate previous knowledge of the effect of larger index node sizes, and of partitioning techniques in shared-nothing and shared-memory systems, to explore possible new concurrency control designs. |
9:45-9:55 |
Break/Catchup |
9:55-10:10 |
Performance Evaluation and Improvement for Persistent Storage in HTML5 IndexedDB is an HTML5 API for storing significant amounts of structured data locally inside the user's browser for offline use. Unlike the traditional relational model, which uses tables to store information and relations, IndexedDB can avoid some costly operations such as joins. IndexedDB enables high-performance search over locally stored data using indices. We study the performance impact of using a CSB-tree index structure to implement IndexedDB in WebKit for storing large amounts of structured data in local storage. The results show that the CSB-tree index can improve the performance of IndexedDB compared with implementations that use traditional index structures. |
10:10-10:25 |
Evaluating W3C Web Storage: A Benchmark and Application Study of a Key-Value Store In the past several years, interactive web applications have grown increasingly sophisticated, providing functionality previously seen only in desktop applications. As these applications grow more advanced, however, they will require more robust client-side storage capabilities from the browser to perform well. The W3C provides two standards for this: IndexedDB and Web Storage. This paper examines the Web Storage interface, specifically Local Storage, which provides a key-value store for each domain that persists across browser sessions. Most browsers impose tight limits on the amount of space each such store can use; this paper explores how Web Storage behaves when these restrictions are lifted, focusing specifically on performance in Google's Chrome browser. We compare performance results to those of a raw SQLite3 database to determine the overhead of the web browser implementation, as well as to those of an HTML5/JavaScript-based web application running Local Storage queries and rendering the results in the HTML DOM. |
10:25-10:40 |
Event Detection over the Twitter Stream with Topic Keyword Clusters and Big Data Clustering Algorithm Since the 1990s, there has been much research on how to detect events from online resources. The most representative effort is the Topic Detection and Tracking (TDT) project, which was supported by the U.S. government in the 1990s. The main task in the TDT project was to identify topics from online news documents. In recent years, as activity on social media (e.g., Twitter, Facebook, MySpace, YouTube) has become common for most people, individuals create enormous amounts of online data in various forms every day. The real-time nature of Twitter allows us to identify emerging topics faster than with any other online resource. In our project, we present a topic detection system that identifies emerging events from thousands of millions of Twitter messages. The main task of our system is to cluster topic (bursty) keywords and relevant Twitter messages accurately and in a scalable manner. |
10:40-10:55 |
Distributed Select Query Processing using SQLite A great deal of research has been devoted to distributed database systems. Topics under the umbrella of distributed database systems (DDMS) include concurrency control, backups, and storing data at sites far away from each other. We chose to consider a more local form of distributed database system, similar to a cluster. The focus of this paper is to show how one can use a distributed database over multiple machines (in close proximity) to increase the performance of single-block select queries and joins. We compare a few distributed join algorithms. We used SQLite as the database engine to process our queries. We chose SQLite because it is a lightweight, open source, flat-file database that can be easily studied and tweaked. |
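The basic idea of running a single-block select over partitioned data can be sketched with Python's standard `sqlite3` module. This is an illustration under assumptions, not the project's system: each in-memory SQLite database stands in for one machine, the `items` table and the helper names are hypothetical, and the nodes are queried one after another rather than in parallel over a network.

```python
import sqlite3


def make_node(rows):
    """Stand-in for one machine: an in-memory SQLite database holding one partition."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER, price REAL)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    return conn


def distributed_select(nodes, sql, params=()):
    """Run the same single-block SELECT on every node and concatenate the results."""
    results = []
    for conn in nodes:
        results.extend(conn.execute(sql, params).fetchall())
    return results


# Spread ten rows round-robin over three "machines".
data = [(i, float(i)) for i in range(10)]
nodes = [make_node(data[n::3]) for n in range(3)]

rows = distributed_select(nodes, "SELECT id FROM items WHERE price > ?", (6.5,))
print(sorted(r[0] for r in rows))  # prints [7, 8, 9]
```

Because each node scans only its own partition, a selection like this parallelizes trivially; joins are harder because matching rows may live on different nodes, which is where the choice of distributed join algorithm matters.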