I am thrilled to announce that I will be joining the faculty of the Department of Computer Science and Engineering at the University of California, San Diego, in Fall 2016.
I am looking for strong and motivated students to join my research group. If you are (or will be) a student at UCSD CSE this Fall or later and you are interested in the intersection of data management and machine learning, please email me. Go Tritons!


Arun Kumar is a PhD candidate in the Department of Computer Sciences at the University of Wisconsin-Madison, where he obtained his Master's in Computer Sciences in 2011. He obtained his Bachelor's in Computer Science and Engineering from the Indian Institute of Technology, Madras in 2009.

He is co-advised by professors Jeffrey Naughton and Jignesh Patel, and he is a research assistant at the Microsoft Jim Gray Systems Lab, which is headed by emeritus professor David DeWitt. He also collaborates with professors Steve Wright and Jerry Zhu as well as with Microsoft. Previously, he worked with professor Christopher Ré.

Systems and ideas based on his research have been released as part of the MADlib open-source library, shipped as part of products from EMC, Oracle, Cloudera, and IBM, and are being explored for use in production by Microsoft and LogicBlox. A paper he co-authored was accorded the Best Paper Award at ACM SIGMOD 2014. He was awarded the Anthony C. Klug NCR Fellowship in Database Systems in 2015. He is a co-winner of the 2016 UW-Madison CS Graduate Student Research Award.

Curriculum Vitae

Recent News

• New! Humbled to be a co-winner of the department's annual Graduate Student Research Award for best PhD research!
• Our vision paper on managing the iterative process of model selection appeared in ACM SIGMOD Record! Time to look beyond individual ML implementations and build holistic Model Selection Management Systems!
• The Hamlet paper got accepted to SIGMOD'16. Looking to get existential and dramatic in San Francisco!


My primary research interests are in data management, especially the intersection of data management and machine learning (popularly known as "Data Science," or "Big Data Analytics"). I enjoy working on problems that are motivated by real applications and are formally grounded, mostly with a focus on usability, developability, performance, and scalability. I also enjoy insightful conversations with analysts and developers on the front lines of data management and data analysis.


Building an ML model is seldom a one-shot slam dunk; it is usually an iterative process. To make this process of "model selection" easier and faster, we repurpose classical database ideas and envision a new class of analytics systems we call Model Selection Management Systems (MSMS).
To join or not to join? That is the question. In this project, we connect statistical learning theory and relational joins to show why, and how, we can often avoid entire input tables when learning over normalized data, improving performance without significantly reducing accuracy.
In this project, we extend our paradigm of factorized learning to several ML models in the popular R environment and also introduce factorized scoring. We devise a cost-based optimizer to pick the fastest approach and also help analysts with comparing features from multiple tables.
In this project, we make it easier to apply machine learning over normalized data, which requires joins during feature engineering. We devise novel techniques to push machine learning computations down through joins and study the tradeoffs involved in improving performance.
In this project, we formulate a framework of declarative operations for the black art of exploratory feature selection in analytics based on our conversations with analysts in many enterprise settings. We design a novel optimizer that improves performance.
In this project, we build a unified system to implement several data analytics techniques by integrating incremental gradient descent into an RDBMS. This work has been incorporated into products from Oracle and EMC. We also contributed code to the open-source library MADlib.
In this project, we integrate the management of uncertain content, specifically Optical Character Recognition (OCR) data, with an RDBMS. We use a probabilistic model and devise a novel approximation framework to trade off between quality and performance.
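As a rough illustration of what "pushing machine learning computations down through joins" can mean in the projects above, here is a minimal sketch in Python. It is not the actual implementation from any of the papers; the toy tables `S` and `R`, the weight vectors `w_S` and `w_R`, and the function names are all hypothetical. It assumes a linear model scored over a key-foreign key join, and shows that the inner product over the attribute table's features can be computed once per attribute-table tuple and reused, instead of being recomputed for every joined row.

```python
# Illustrative sketch only: names and data are assumptions, not from the papers.

def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def scores_materialized(S, R, w_S, w_R):
    """Join first, then compute w . x over each wide joined row."""
    out = []
    for x_S, fk in S:
        joined = x_S + R[fk]              # materialize the joined feature vector
        out.append(dot(w_S + w_R, joined))
    return out

def scores_factorized(S, R, w_S, w_R):
    """Compute the R-side partial inner product once per R tuple and reuse it."""
    partial = {rid: dot(w_R, x_R) for rid, x_R in R.items()}
    return [dot(w_S, x_S) + partial[fk] for x_S, fk in S]

# R: attribute table keyed by id; S: entity table of (features, foreign key).
R = {1: [1.0, 2.0], 2: [0.0, 3.0]}
S = [([1.0], 1), ([2.0], 1), ([3.0], 2)]
w_S, w_R = [0.5], [1.0, -1.0]

print(scores_materialized(S, R, w_S, w_R))
print(scores_factorized(S, R, w_S, w_R))
```

Both functions return the same scores; the factorized version does less arithmetic when many rows of `S` refer to the same row of `R`, which is the kind of redundancy the blurbs above allude to.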


  • New! To Join or Not to Join? Thinking Twice about Joins before Feature Selection
    Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu
    ACM SIGMOD 2016 [Paper] [Tech Report] [Code and Data]
  • New! Model Selection Management Systems: The Next Frontier of Advanced Analytics
    Arun Kumar, Robert McCann, Jeffrey Naughton, and Jignesh M. Patel
    ACM SIGMOD Record Dec 2015 (Vision Track) [Paper] [Survey]
  • Demonstration of Santoku: Optimizing Machine Learning over Normalized Data
    Arun Kumar, Mona Jalal, Boqun Yan, Jeffrey Naughton, and Jignesh M. Patel
    VLDB 2015 (Demo) [Paper] [Code and Data]
  • Learning Generalized Linear Models Over Normalized Data
    Arun Kumar, Jeffrey Naughton, and Jignesh M. Patel
    ACM SIGMOD 2015 [Paper] [Code]
  • Materialization Optimizations for Feature Selection Workloads (Best Paper Award)
    Ce Zhang, Arun Kumar, and Christopher Ré
    ACM SIGMOD 2014 [Paper] (Invited to ACM TODS 2016)
  • Distributed and Scalable PCA in the Cloud
    Arun Kumar, Nikos Karampatziakis, Paul Mineiro, Markus Weimer, and Vijay Narayanan
    NIPS BigLearn 2013 [Paper]
  • Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System
    Pradap Konda, Arun Kumar, Christopher Ré, and Vaishnavi Sashikanth
    VLDB 2013 (Demo) [Paper]
  • Hazy: Making it Easier to Build and Maintain Big-data Analytics
    Arun Kumar, Feng Niu, and Christopher Ré
    ACM Queue, 2013 (Invited to CACM March 2013) [Paper]
  • Brainwash: A Data System for Feature Engineering
    Michael Anderson, Dolan Antenucci, Victor Bittorf, Matthew Burgess, Michael Cafarella, Arun Kumar, Feng Niu, Yongjoo Park, Christopher Ré, and Ce Zhang
    CIDR 2013 (Vision Track) [Paper]
  • Towards a Unified Architecture for in-RDBMS Analytics
    Xixuan Feng*, Arun Kumar*, Benjamin Recht, and Christopher Ré
    ACM SIGMOD 2012 [Paper] [Tech Report] [Code and Data]
  • The MADlib Analytics Library or MAD Skills, the SQL
    Joseph M. Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar
    VLDB 2012 (Industrial Track) [Paper]
  • Probabilistic Management of OCR Data using an RDBMS
    Arun Kumar and Christopher Ré
    VLDB 2012 [Paper] [Tech Report] [Code and Data]
  • On Reducing Delay in Mobile Data Collection-based WSNs
    Arun K. Kumar, Krishna M. Sivalingam, and Adithya Kumar
    Springer Wireless Networks 2012 [Paper]
  • Flexible Multimedia Content Retrieval Using InfoNames
    Arun Kumar, Ashok Anand, Athula Balachandran, Vyas Sekar, Aditya Akella, and Srinivasan Seshan
    ACM SIGCOMM 2010 (Demo) [Paper]
  • InfoNames: An Information-Based Naming Scheme for Multimedia Content
    Arun Kumar, Athula Balachandran, Vyas Sekar, Aditya Akella, and Srinivasan Seshan
    UW-Madison Technical Report TR1677 [Paper]
  • Energy-Efficient Mobile Data Collection in WSNs with Delay Reduction using Wireless Communication
    Arun K. Kumar and Krishna M. Sivalingam
    IEEE/ACM COMSNETS 2010 [Paper]


Program Committee:
ACM SIGMOD 2016 Undergraduate Research Poster Competition
USENIX HotCloud 2016

ACM Transactions on Database Systems (TODS) 2015
IEEE Transactions on Knowledge and Data Engineering (TKDE) 2014


Lingjiao Chen (MS, UW-Madison)
Zhiwei Fan (BS, UW-Madison)
Fengan Li (MS, UW-Madison)
Fujie Zhan (BS, UW-Madison)

Mona Jalal (MS, UW-Madison)
Boqun Yan (BS, UW-Madison; First employment: Google)