I am thrilled to announce that I will be joining the faculty at the Computer Science and Engineering department of the University of California, San Diego in Fall 2016.
I am looking for strong and motivated students to join my research group. If you are (or will be) a student at UCSD CSE this Fall or later and you are interested in the intersection of data management and machine learning, please email me. Go Tritons!


Arun Kumar is a PhD candidate in the Department of Computer Sciences at the University of Wisconsin-Madison, where he obtained his Master's in Computer Sciences in 2011. He obtained his Bachelor's in Computer Science and Engineering from the Indian Institute of Technology, Madras in 2009.

He is co-advised by professors Jeffrey Naughton and Jignesh Patel, and he is a research assistant at the Microsoft Jim Gray Systems Lab, which is headed by emeritus professor David DeWitt. He also collaborates with professors Steve Wright and Jerry Zhu as well as with Microsoft. Previously, he worked with professor Christopher Ré.

Systems and ideas based on his research have been released as part of the MADlib open-source library, shipped as part of products from EMC, Oracle, Cloudera, and IBM, and are being explored for use in production by Microsoft and LogicBlox. A paper he co-authored was accorded the Best Paper Award at ACM SIGMOD 2014. He was awarded the Anthony C. Klug NCR Fellowship in Database Systems in 2015. He is a co-winner of the 2016 UW-Madison CS Graduate Student Research Award.

• New! Humbled to be a co-winner of the department's annual Graduate Student Research Award for best PhD research!
• Our vision paper on managing the iterative process of model selection appeared in ACM SIGMOD Record! Time to look beyond individual ML implementations and build holistic Model Selection Management Systems!
• The Hamlet paper got accepted to SIGMOD'16. Looking to get existential and dramatic in San Francisco!


My primary research interests are in data management, especially the intersection of data management and machine learning (popularly known as "Data Science" or "Big Data Analytics"). I enjoy working on problems that are motivated by real applications and are formally grounded, mostly with a focus on usability, developability, performance, and scalability. I also enjoy insightful conversations with analysts and developers on the front lines of data management and data analysis.


Building an ML model is seldom a one-shot slam dunk; it is usually an iterative process. To make this process of "model selection" easier and faster, we repurpose classical database ideas and envision a new class of analytics systems we call Model Selection Management Systems (MSMS).
To join or not to join? That is the question. In this project, we connect statistical learning theory and relational joins to show why, and how, we can often avoid entire input tables when learning over normalized data, improving performance without significantly reducing accuracy.
In this project, we extend our paradigm of factorized learning to several ML models in the popular R environment and also introduce factorized scoring. We devise a cost-based optimizer to pick the fastest approach and also help analysts with comparing features from multiple tables.
In this project, we make it easier to apply machine learning over normalized data, which requires joins during feature engineering. We devise novel techniques to push machine learning computations down through joins and study the tradeoffs involved in improving performance.
In this project, based on our conversations with analysts in many enterprise settings, we formulate a framework of declarative operations for the black art of exploratory feature selection in analytics. We design a novel optimizer that improves performance.
In this project, we build a unified system to implement several data analytics techniques by integrating incremental gradient descent into an RDBMS. This work has been incorporated into products from Oracle and EMC. We also contributed code to the open-source library MADlib.
In this project, we integrate the management of uncertain content, specifically Optical Character Recognition (OCR) data, with an RDBMS. We use a probabilistic model and devise a novel approximation framework to trade off between quality and performance.


Program Committee:
ACM SIGMOD 2016 Undergraduate Research Poster Competition
USENIX HotCloud 2016

ACM Transactions on Database Systems (TODS) 2015
IEEE Transactions on Knowledge and Data Engineering (TKDE) 2014


Lingjiao Chen (MS, UW-Madison)
Zhiwei Fan (BS, UW-Madison)
Fengan Li (MS, UW-Madison)
Fujie Zhan (BS, UW-Madison)

Mona Jalal (MS, UW-Madison)
Boqun Yan (BS, UW-Madison; First employment: Google)