Asset Details

MbrlCatalogueTitleDetail

Do you wish to reserve the book?

Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

by Liu, Jiazheng , Karlsson, Börje F , Zheng, Sipeng , Lu, Zongqing

in Annotations / Datasets / Large language models / Questions / Reasoning

2025

Yes Please

Hey, we have placed the reservation for you!

By the way, why not check out events that you can attend while you pick your title.

Oops! Something went wrong.

Looks like we were not able to place the reservation. Kindly try again later.

Paper

Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

Liu, Jiazheng,

Karlsson, Börje F,

Zheng, Sipeng,

Lu, Zongqing

2025

Overview

Multimodal large language models (MLLMs), built on large-scale pre-trained vision towers and language models, have shown great capabilities in multimodal understanding. However, most existing MLLMs are trained on single-turn vision question-answering tasks, which do not accurately reflect real-world human conversations. In this paper, we introduce MMDiag, a multi-turn multimodal dialogue dataset. This dataset is collaboratively generated through deliberately designed rules and GPT assistance, featuring strong correlations between questions, between questions and images, and among different image regions; thus aligning more closely with real-world scenarios. MMDiag serves as a strong benchmark for multi-turn multimodal dialogue learning and brings more challenges to the grounding and reasoning capabilities of MLLMs. Further, inspired by human vision processing, we present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities. DiagNote consists of two modules (Deliberate and Gaze) interacting with each other to perform Chain-of-Thought and annotations respectively, throughout multi-turn dialogues. We empirically demonstrate the advantages of DiagNote in both grounding and jointly processing and reasoning with vision and language information over existing MLLMs.

Share this book

Add to My Shelf

Publisher

Cornell University Library, arXiv.org

Subject

Annotations

/ Datasets

/ Large language models

/ Questions

/ Reasoning