Mu Cai

Hi, I am a final-year Ph.D. candidate in the Computer Sciences Department at the University of Wisconsin-Madison, advised by Prof. Yong Jae Lee.

My recent research interest lies in multimodal generative models. I am especially interested in visual prompting, video and 3D understanding, and analyzing the limitations of CLIP.

Email / CV / GitHub / Google Scholar / LinkedIn / Twitter (X) / Blog

Recent talk on criticizing and creating vision-language models. [YouTube English, Chinese]

I expect to graduate around May 2025 and am looking for a Research Scientist position focused on multimodal models. Do not hesitate to send me an email if you are interested!



Selected Publications




NEW! Matryoshka Multimodal Models
Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee
arXiv, 2024
[arXiv] [code] [Project Page] [Demo]

NEW! TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao^, Yong Jae Lee^, Jianwei Yang^
arXiv, 2024
(^ equal advising)
[arXiv] [Project Page] [Code] [Datasets] [Leaderboard]

NEW! Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
Jianrui Zhang*, Mu Cai*, Yong Jae Lee
arXiv, 2024
(*equal contribution)
[arXiv] [Project Page] [Code] [Datasets] [Leaderboard]

NEW! ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
[arXiv] [code] [Demo] [Project Page] [Youtube]

NEW! CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
Jianrui Zhang*, Mu Cai*, Tengyang Xie, Yong Jae Lee
Findings of the Association for Computational Linguistics (ACL Findings), 2024
(*equal contribution)
[arXiv] [code] [Project Page]

NEW! VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
Bocheng Zou*, Mu Cai*, Jianrui Zhang, Yong Jae Lee
EMNLP main, 2024
[arXiv] [code] [Project Page] [Dataset]

NEW! Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding
Mu Cai*, Zeyi Huang*, Yuheng Li, Haohan Wang, and Yong Jae Lee
WACV, 2025
(*equal contribution)
[arXiv] [code]





All Publications
NEW! Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Yong Jae Lee, and Krishna Kumar Singh
Proceedings of the European Conference on Computer Vision (ECCV), 2024
[arXiv] [code] [Project Page]

NEW! Cross-Modal Self-Supervised Learning with Effective Contrastive Units for Point Clouds
Mu Cai, Chenxu Luo, Yong Jae Lee, and Xiaodong Yang
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024
[arXiv] [code] [Youtube video]

NEW! LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, and Michael S. Ryoo
arXiv, 2024
[arXiv] [code]

NEW! Yo'LLaVA: Your Personalized Language and Vision Assistant
Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee
NeurIPS, 2024
[arXiv] [code] [Project Page]

NEW! LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang*, Mu Cai*, Bingxin Xu, Yong Jae Lee^, Yan Yan^
arXiv, 2024
(*equal contribution, ^equal advising)
[arXiv] [code] [Project Page]

Investigating the Catastrophic Forgetting in Multimodal Large Language Models
Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma
Conference on Parsimony and Learning (Proceedings Track) (CPAL), 2023
[arXiv]

A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance
Zeyi Huang, Andy Zhou, Zijian Ling, Mu Cai, Haohan Wang, and Yong Jae Lee
Proceedings of the International Conference on Computer Vision (ICCV), 2023
[arXiv]

Out-of-distribution Detection via Frequency-regularized Generative Models
Mu Cai and Yixuan Li
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023 (Spotlight)
[arXiv] [code] [Youtube]

Masked Discrimination for Self-Supervised Learning on Point Clouds
Haotian Liu, Mu Cai, and Yong Jae Lee
Proceedings of the European Conference on Computer Vision (ECCV), 2022
[arXiv] [code] [talk]

VOS: Learning What You Don’t Know by Virtual Outlier Synthesis
Xuefeng Du, Zhaoning Wang, Mu Cai, and Yixuan Li
Proceedings of the International Conference on Learning Representations (ICLR), 2022
[arXiv] [code]

Frequency Domain Image Translation: More Photo-realistic, Better Identity-preserving
Mu Cai, Hong Zhang, Huijuan Huang, Qichuan Geng, Yixuan Li, and Gao Huang
Proceedings of the International Conference on Computer Vision (ICCV), 2021
[arXiv] [code]

A Game-Theoretic Strategy-Aware Interaction Algorithm with Validation on Real Traffic Data
Liting Sun*, Mu Cai*, Wei Zhan, and Masayoshi Tomizuka
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020
(*equal contribution)
[PDF]