I am an Assistant Professor in the Computer Sciences Department at the University of Wisconsin-Madison. Before joining UW-Madison, I was a Postdoctoral Associate in the database group at CSAIL, MIT, working with Prof. Michael Stonebraker and Prof. Samuel Madden. I completed my Ph.D. in Computer Science at MIT in 2017, working with Prof. Srinivas Devadas. I earned my Bachelor of Science (B.S.) in 2012 from the Institute of Microelectronics at Tsinghua University, Beijing, China.
I work on database systems and currently focus on (1) GPU-native analytics, (2) robust query processing, (3) cloud-native databases, and (4) transaction processing.
I am actively looking for graduate and undergraduate students interested in database systems. If you would like to work with me, please email me your CV.
My group builds database systems, focusing on four thrusts: (1) GPU-native analytics, (2) robust query processing, (3) cloud-native databases, and (4) transaction processing. Selected projects in each thrust are highlighted below.
Thrust 1: GPU-Native Analytics
GPUs are a natural fit for data analytics due to their massive parallelism, but GPU query processing at scale remains challenging. We study how to build OLAP databases that run natively on GPUs.

This line of research has matured into Sirius, an open-source GPU-native SQL engine [code][website][NVIDIA Dev Blog][slack] · talks: [GTC'26][CMU DB seminar]. Sirius enables drop-in GPU acceleration for DuckDB and other SQL databases, without changing the user interface. Sirius supports CPU fallback for full compatibility. Below are related projects and publications.
- Sirius [CIDR'26]: Demonstrates that GPU hardware and software trends make GPU data analytics viable at scale. Explains Sirius' design principles, architecture, performance, and future roadmap.
- Lancelot [code][VLDB'24]: Lancelot scales Mordred to multiple GPUs in a single node by improving data placement and query execution.
- GPU-UDAF [DaMoN@SIGMOD'23]: This work optimizes user-defined aggregate functions (UDAFs) in cuDF through a block-wide execution model and just-in-time (JIT) compilation, achieving a 3600x speedup. The work has been integrated and released in NVIDIA RAPIDS cuDF version 23.02.
- Mordred [code][VLDB'22]: A heterogeneous CPU-GPU query execution engine that optimizes data placement (i.e., semantic-aware caching) and query execution (i.e., segment-level query execution).
- GPU-compression [code][SIGMOD'22]: A highly optimized GPU compression scheme that achieves a high compression ratio and fast decompression speed.
- Crystal [code][SIGMOD'20]: A library that can run full SQL queries on the GPU and saturate GPU memory bandwidth.
Thrust 2: Robust Query Processing
OLAP databases lack robustness when executing complex multi-join queries: performance can degrade due to a bad join order or data skew. We aim to improve query robustness by closing the gap between database theory and systems. This line of research has matured into robust, a DuckDB community extension [code].
- SplitJoin [code][arXiv][VLDB'26]: A framework that introduces split as a first-class query operator. SplitJoin partitions input tables and uses different join orders for different partitions, thereby reducing intermediate table sizes.
- Yannakakis Survey [ICDT'26]: Surveys recent advancements in making Yannakakis' algorithm more practical, from both theory and systems perspectives.
- Distributed Predicate Transfer [code][SIGMOD'25]: This work extends predicate transfer to support distributed query processing and develops a transfer pruning algorithm to further improve efficiency.
- Robust Predicate Transfer [code][SIGMOD'25]: This work implements predicate transfer in DuckDB, improves its robustness by enforcing a single root in the join tree, and conducts extensive evaluation on TPC-H, JOB, and TPC-DS.
- Predicate Transfer [code][CIDR'24]: A method that optimizes multi-join queries by pre-filtering tables to reduce the join input size. Predicate transfer is inspired by the seminal theoretical results by Yannakakis but leverages Bloom filters to become more practical.
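To make the pre-filtering idea behind predicate transfer concrete, here is a minimal Python sketch for a two-table join. It is an illustration under stated assumptions, not the implementation from the papers above: the `BloomFilter` class, the `transfer` and `hash_join` helpers, and the `customers`/`orders` tables are all hypothetical names chosen for this example, and real systems apply the idea across a full join graph.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def transfer(source_rows, source_key, target_rows, target_key):
    """Build a filter on the source join keys and use it to prune the target."""
    bf = BloomFilter()
    for row in source_rows:
        bf.add(row[source_key])
    return [row for row in target_rows if bf.might_contain(row[target_key])]


def hash_join(left, left_key, right, right_key):
    """Plain hash join; removes any false positives the filters let through."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


# Hypothetical data: 100 customers (every tenth is in "EU"), 1000 orders.
customers = [{"cust_id": i, "region": "EU" if i % 10 == 0 else "US"} for i in range(100)]
orders = [{"order_id": i, "cust_id": i % 100} for i in range(1000)]

eu_customers = [c for c in customers if c["region"] == "EU"]  # local predicate
# Forward pass: the predicate on customers is "transferred" to orders.
pruned_orders = transfer(eu_customers, "cust_id", orders, "cust_id")
# Backward pass: surviving orders prune customers in turn.
pruned_customers = transfer(pruned_orders, "cust_id", eu_customers, "cust_id")

result = hash_join(pruned_orders, "cust_id", pruned_customers, "cust_id")
print(len(orders), "orders reduced to", len(pruned_orders), "before the join")
```

Because a Bloom filter never produces false negatives, the pre-filtering step cannot drop a row that belongs in the join result; false positives only cost extra work, and the join itself discards them.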
Thrust 3: Cloud-Native Databases
Databases are moving to the cloud for its elasticity, high availability, and cost competitiveness. Modern cloud-native databases adopt a unique storage-disaggregation architecture, in which computation and storage are decoupled. This architecture brings new challenges and opportunities in DBMS design.
- Disaggregation [VLDB'25][disseminate podcast]: This paper offers a perspective on the disaggregation trend, tracing its evolution, and presents a set of research efforts in this architecture.
Cloud-native data warehouse:
- FlexPushdownDB journal [code][VLDBJ'24]: Extends FPDB to support advanced pushdown operators (e.g., Bloom filter, selection bitmap, and shuffle) and adaptive pushdown, which pushes tasks back to compute servers when storage layer computation is limited.
- FlexPushdownDB [code][VLDB'21]: A cloud-native OLAP DBMS that combines caching and pushdown at fine granularity in a storage-disaggregation architecture.
- PushdownDB [code][ICDE'20]: A cloud-native OLAP system that leverages AWS S3 Select to push down selection, projection, and aggregation to speed up query processing.
- Cloud-DW [VLDB'19]: Evaluation of several popular cloud-native data warehouse systems that have different architectures.
Cloud-native transaction processing:
- Marlin [code][SIGMOD'26]: Marlin is a cloud-native coordination mechanism that fully embraces storage disaggregation. Marlin eliminates the need for external coordination services by consolidating coordination functionality into the existing cloud-native database it manages.
- Hermes [code][VLDB'25]: Hermes is a middleware sitting above the cloud storage that enables off-the-shelf real-time analytics, allowing ACID transactions across existing TP and AP engines in the cloud environment.
- R^3 [code][VLDB'23]: R3 is a Record-Replay-Retroaction tool that simplifies debugging database-backed applications. R3 can replay an application in the same order as the original execution; it also enables retroaction, allowing the replay to run modified code instead of the original code.
- Epoxy [code][VLDB'23]: Epoxy provides ACID transactions across heterogeneous data stores (e.g., MongoDB, ElasticSearch, GCS, MySQL) to simplify cloud application development.
- Cornus [code][VLDB'22]: An optimized two-phase commit protocol in a cloud-native database. Cornus reduces 2PC latency and eliminates blocking by leveraging the unique architectural features of storage disaggregation.
- Litmus [code][SIGMOD'22]: A DBMS that provides verifiable proofs of atomicity and serializability for transactions, through the codesign of database and cryptographic tools.
Thrust 4: Transaction Processing
Scalable transaction processing on multicore CPUs:
Computer architectures are moving towards manycore machines with dozens or even hundreds of cores on a single chip. We develop new techniques for modern database management systems (DBMSs) to make transaction processing scalable for this level of massive parallelism.
- Three-Tree [code][SIGMOD'24]: Exploration of OLTP buffer management strategies with two-tier main memory; no existing design can win in all measured dimensions.
- Two-Tree [code][CIDR'23][VLDBJ'25]: Two-Tree splits a single index structure (e.g., B-tree) into a top in-memory tree for hot records and a bottom tree for cold pages, and achieves 1.7x higher throughput than the conventional One-Tree design.
- Polaris [code][SIGMOD'23]: Polaris enables priority among transactions for the state-of-the-art OCC protocol Silo, and achieves 17x lower tail latency for high-contention workloads.
- Blink-Hash [code][VLDB'23]: A new index design that enhances a tree-based index with hash leaf nodes to mitigate the contention of monotonic insertions, a pattern common in time-series workloads.
- HATtrick [code][SIGMOD'22]: A benchmark for HTAP systems that uses two new performance metrics: throughput frontier and freshness score. Three representative systems are evaluated.
- Plor [code][SIGMOD'22]: A technique called pessimistic locking and optimistic reading (Plor) to reduce tail latency for high-contention transactional workloads, while maintaining high throughput.
- Bamboo [code][SIGMOD'21]: An optimized two-phase locking (2PL) protocol that mitigates hotspot overhead by releasing locks early during transaction execution.
- Taurus [code][VLDB'20]: A lightweight parallel logging scheme that avoids the central logging bottleneck by writing to multiple log streams.
- TicToc [code][SIGMOD'16]: A scalable timestamp-based concurrency control protocol that resolves the timestamp allocation bottleneck through data-driven timestamp management.
- DBx1000 [code][VLDB'14]: Scalability analysis of seven classic concurrency control protocols on a simulated 1000-core CPU.
Scalable distributed transaction processing:
Online transaction processing (OLTP) DBMSs are increasingly deployed on distributed machines. Compared to centralized systems, distributed DBMSs face new challenges, including extra network latency and the requirements of high availability and distributed commitment.
- Lotus [code][VLDB'22]: Optimizes multi-partition transactions in a distributed and partitioned database.
- Coco [code][VLDB'21]: A distributed OLTP DBMS that mitigates the synchronization overhead of distributed commitment and data replication by committing transactions in epochs.
- Aria [code][VLDB'20]: A deterministic distributed DBMS that no longer requires knowing transactions' read/write sets before execution. Aria also achieves higher throughput than previous deterministic DBMSs.
- STAR [code][VLDB'19]: A distributed DBMS where data replicas use asymmetric architectures (e.g., non-partitioned and partition-based). A transaction is executed in the replica that delivers better performance.
- Sundial [code][VLDB'18]: A distributed concurrency control protocol that is algorithmically similar to TicToc; Sundial integrates cache coherence and concurrency control into a unified protocol.