Zhengyuan Yang

Senior Researcher

Microsoft 

Our surrounding world is multi-modal in nature. My research in vision-language (VL) aims to build machines that can jointly perceive, understand, and reason over the vision and language modalities to perform real-world tasks, such as describing visual environments or creating images from text descriptions. One major challenge in VL is building fine-grained semantic alignments between visual entities and language references, known as the visual grounding problem. In this talk, I'll present our research on building more effective VL systems from a visual grounding perspective. Specifically, I will discuss (1) a fast and accurate one-stage visual grounding paradigm for the stand-alone visual grounding task, (2) jointly learning visual grounding to benefit various VL tasks such as captioning and question answering, and (3) unified VL understanding and generation based on grounded VL representations. Finally, I will conclude by discussing future directions for VL and how to move toward a generalist model.
