Mu Cai

NEW! Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gemini Team: ..., Mu Cai, ...
[arXiv] [Page]

NEW! Toward Versatile and Efficient Multimodal Models
Mu Cai, PhD Thesis
[arXiv]

NEW! Humanity's Last Exam
Center for AI Safety, Scale AI
Nature, 2026
[paper]

NEW! Contamination Detection for VLMs using Multi-Modal Semantic Perturbation
Jaden Park, Mu Cai, Feng Yao, Jingbo Shang, Soochahn Lee, and Yong Jae Lee
Proceedings of the International Conference on Learning Representations (ICLR), 2026
[arXiv] [code] [Project Page]

NEW! When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios
Kele Shao*, Keda Tao*, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang
Transactions on Machine Learning Research (TMLR), 2026
[arXiv] [Project Page]

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang*, Mu Cai*, Bingxin Xu, Yong Jae Lee^, Yan Yan^
Proceedings of the International Conference on Computer Vision (ICCV), 2025
(*equal contribution, ^equal advising)
[arXiv] [code] [Project Page]

Magma: A Foundation Model for Multimodal AI Agents
Jianwei Yang*◊, Reuben Tan◊, Qianhui Wu◊, Ruijie Zheng‡, Baolin Peng‡, Yongyuan Liang‡, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lar Liden, Jianfeng Gao
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
[arXiv] [code] [Project Page]

Matryoshka Multimodal Models
Mu Cai, Jianwei Yang, Jianfeng Gao, Yong Jae Lee
Proceedings of the International Conference on Learning Representations (ICLR), 2025
[arXiv] [code] [Project Page] [Demo]

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, and Michael S. Ryoo
Proceedings of the International Conference on Learning Representations (ICLR), 2025
[arXiv] [code]

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
[arXiv] [code] [Demo] [Project Page] [Youtube]

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, Yao Dou, Jaden Park, Jianfeng Gao^, Yong Jae Lee^, Jianwei Yang^
arXiv, 2024
(^ equal advising)
[arXiv] [Project Page] [Code] [Datasets] [Leaderboard]

All Publications

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding
Mu Cai*, Zeyi Huang*, Yuheng Li, Haohan Wang, and Yong Jae Lee
WACV, 2025
(*equal contribution)
[arXiv] [code]

Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models
Hyunsik Chae, Seungwoo Yoon, Jaden Park, Chloe Yewon Chun, Yongin Cho, Mu Cai, Yong Jae Lee, Ernest K. Ryu
arXiv, 2025
[arXiv] [code]

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
Jianrui Zhang*, Mu Cai*, Yong Jae Lee
arXiv, 2024 [Accepted by NeurIPS 2025 D&B Track AC, rejected by PC]
(*equal contribution)
[arXiv] [Project Page] [Code] [Datasets] [Leaderboard]

VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
Bocheng Zou*, Mu Cai*, Jianrui Zhang, Yong Jae Lee
EMNLP main, 2024
[arXiv] [code] [Project Page] [Dataset]

CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models
Shubham Bharti, Shiyun Cheng, Jihyun Rho, Jianrui Zhang, Mu Cai, Yong Jae Lee, Martina Rau, Xiaojin Zhu
arXiv, 2024
[arXiv]

Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Yuheng Li, Haotian Liu, Mu Cai, Yijun Li, Eli Shechtman, Zhe Lin, Yong Jae Lee, and Krishna Kumar Singh
Proceedings of the European Conference on Computer Vision (ECCV), 2024
[arXiv] [code] [Project Page]

CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
Jianrui Zhang*, Mu Cai*, Tengyang Xie, Yong Jae Lee
Findings of the Association for Computational Linguistics: ACL 2024
(*equal contribution)
[arXiv] [code] [Project Page]

Cross-Modal Self-Supervised Learning with Effective Contrastive Units for Point Clouds
Mu Cai, Chenxu Luo, Yong Jae Lee, and Xiaodong Yang
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024
[arXiv] [code] [Youtube video]

Yo'LLaVA: Your Personalized Language and Vision Assistant
Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, Yong Jae Lee
NeurIPS, 2024
[arXiv] [code] [Project Page]

Investigating the Catastrophic Forgetting in Multimodal Large Language Models
Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma
Conference on Parsimony and Learning (Proceedings Track) (CPAL), 2023
[arXiv]

A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance
Zeyi Huang, Andy Zhou, Zijian Ling, Mu Cai, Haohan Wang, and Yong Jae Lee
Proceedings of the International Conference on Computer Vision (ICCV), 2023
[arXiv]

Out-of-distribution Detection via Frequency-regularized Generative Models
Mu Cai and Yixuan Li
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023 (Spotlight)
[arXiv] [code] [Youtube]

Masked Discrimination for Self-Supervised Learning on Point Clouds
Haotian Liu, Mu Cai, and Yong Jae Lee
Proceedings of the European Conference on Computer Vision (ECCV), 2022
[arXiv] [code] [talk]

VOS: Learning What You Don’t Know by Virtual Outlier Synthesis
Xuefeng Du, Zhaoning Wang, Mu Cai, and Yixuan Li
Proceedings of the International Conference on Learning Representations (ICLR), 2022
[arXiv] [code]

Frequency Domain Image Translation: More Photo-realistic, Better Identity-preserving
Mu Cai, Hong Zhang, Huijuan Huang, Qichuan Geng, Yixuan Li, and Gao Huang
Proceedings of the International Conference on Computer Vision (ICCV), 2021
[arXiv] [code]

A Game-Theoretic Strategy-Aware Interaction Algorithm with Validation on Real Traffic Data
Liting Sun*, Mu Cai*, Wei Zhan, and Masayoshi Tomizuka
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020
(*equal contribution)
[PDF]