Mu Cai

Hi, I am a fourth-year Ph.D. candidate in the Computer Sciences Department at the University of Wisconsin-Madison, advised by Prof. Yong Jae Lee.

My recent research interests lie in the applications and fundamental limitations of multimodal generative models. I am especially interested in visual prompting, video and 3D understanding, and analyzing the limitations of CLIP.

Email  /  CV  /  GitHub  /  Google Scholar  /  LinkedIn  /  Twitter (X)  /  Blog

Recent talk on compositional vision-language models in the input space. [YouTube link]



Research




NEW! Matryoshka Multimodal Models
Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee
arXiv, 2024
[arXiv] [code] [Project Page] [Demo]

NEW! CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
Jianrui Zhang*, Mu Cai*, Tengyang Xie, Yong Jae Lee
Findings of the Association for Computational Linguistics (ACL Findings), 2024
(*equal contribution)
[arXiv] [code] [Project Page]

NEW! ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
[arXiv] [code] [Demo] [Project Page]

NEW! Cross-Modal Self-Supervised Learning with Effective Contrastive Units for Point Clouds
Mu Cai, Chenxu Luo, Yong Jae Lee, and Xiaodong Yang
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024
[PDF]

NEW! LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, and Michael S. Ryoo
arXiv, 2024
[arXiv] [code]

NEW! Yo'LLaVA: Your Personalized Language and Vision Assistant
Thao Nguyen, Haotian Liu, Mu Cai, Yuheng Li, Utkarsh Ojha, Yong Jae Lee
arXiv, 2024
[arXiv] [code] [Project Page]

NEW! LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang*, Mu Cai*, Bingxin Xu, Yong Jae Lee^, Yan Yan^
arXiv, 2024
(*equal contribution, ^equal advising)
[arXiv] [code] [Project Page]

NEW! Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding
Mu Cai*, Zeyi Huang*, Yuheng Li, Haohan Wang, and Yong Jae Lee
arXiv, 2023
(*equal contribution)
[arXiv] [code]

Investigating the Catastrophic Forgetting in Multimodal Large Language Models
Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma
Conference on Parsimony and Learning (Proceedings Track) (CPAL), 2023
[arXiv]

A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance
Zeyi Huang, Andy Zhou, Zijian Ling, Mu Cai, Haohan Wang, and Yong Jae Lee
Proceedings of the International Conference on Computer Vision (ICCV), 2023
[arXiv]

Out-of-distribution Detection via Frequency-regularized Generative Models
Mu Cai, and Yixuan Li
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023 (Spotlight)
[arXiv] [code]

Masked Discrimination for Self-Supervised Learning on Point Clouds
Haotian Liu, Mu Cai, and Yong Jae Lee
Proceedings of the European Conference on Computer Vision (ECCV), 2022
[arXiv] [code] [talk]

VOS: Learning What You Don’t Know by Virtual Outlier Synthesis
Xuefeng Du, Zhaoning Wang, Mu Cai, and Yixuan Li
Proceedings of the International Conference on Learning Representations (ICLR), 2022
[arXiv] [code]

Frequency Domain Image Translation: More Photo-realistic, Better Identity-preserving
Mu Cai, Hong Zhang, Huijuan Huang, Qichuan Geng, Yixuan Li, and Gao Huang
Proceedings of the International Conference on Computer Vision (ICCV), 2021
[arXiv] [code]

A Game-Theoretic Strategy-Aware Interaction Algorithm with Validation on Real Traffic Data
Liting Sun*, Mu Cai*, Wei Zhan, and Masayoshi Tomizuka
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020
(*equal contribution)
[PDF]