I am a Research Scientist at Google DeepMind.
My recent research focuses on multimodal generative models. I am especially interested in multimodal agents, video and 3D understanding, and analyzing the limitations of CLIP.
Email / CV / GitHub / Google Scholar / LinkedIn / Twitter (X) / Blog
Recent talk on criticizing and creating vision-language models. [YouTube English, Chinese]