Unified Vision-Language Modeling through A Visual Grounding Perspective
Monday, March 27, 2023 11:30 AM
About this Event
Zhengyuan Yang
Senior Researcher
Microsoft
Our surrounding world is multi-modal in nature. My research in vision-language (VL) aims to build machines that can jointly perceive, understand, and reason over the vision and language modalities to perform real-world tasks, such as describing visual environments or creating images from text descriptions. One major challenge in VL is to build fine-grained semantic alignments between visual entities and language references, known as the visual grounding problem. In this talk, I’ll present our research on building more effective VL systems through a visual grounding perspective. Specifically, I will discuss (1) a fast and accurate one-stage visual grounding paradigm for the stand-alone visual grounding task, (2) jointly learning visual grounding to benefit various VL tasks such as captioning and question answering, and (3) unified VL understanding and generation based on grounded VL representations. Finally, I will conclude my talk by discussing future directions for VL and how to advance the field toward a generalist model.
Event Details
Dial-In Information
Join Zoom Meeting
https://wustl.zoom.us/j/97957203574?pwd=NUVYblZWdzBwYko0MUM2aWNxbHhXdz09