The world around us is multi-modal in nature. My research in vision-language (VL) aims to build machines that can jointly perceive, understand, and reason over the vision and language modalities to perform real-world tasks, such as describing visual environments or creating images from text descriptions. One major challenge in VL is building fine-grained semantic alignments between visual entities and language references, known as the visual grounding problem. In this talk, I’ll present our research on building more effective VL systems through a visual grounding perspective. Specifically, I will discuss (1) a fast and accurate one-stage paradigm for the stand-alone visual grounding task, (2) jointly learning visual grounding to benefit various VL tasks such as captioning and question answering, and (3) unified VL understanding and generation based on grounded VL representations. Finally, I will conclude by discussing future directions for VL and the path toward a generalist model.
Zhengyuan Yang is a Senior Researcher at Microsoft. He received his Ph.D. in Computer Science from the University of Rochester, advised by Prof. Jiebo Luo, and his bachelor's degree from the University of Science and Technology of China (USTC). His research interests lie at the intersection of computer vision and natural language processing, including multi-modal vision-language understanding and generation. He received the 2022 ACM SIGMM Outstanding Ph.D. Thesis Award, the ICPR 2018 Best Industry Related Paper Award, and a 2020 Twitch Research Fellowship. He serves as an Associate Editor for IEEE TCSVT and a Senior Program Committee (SPC) member for AAAI 2023. For more information, please visit zhengyuan.info.