Asset Details
Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions
by
Szpektor, Idan
, Giryes, Raja
, Ben Kish, Assaf
, Yanuka, Moran
, Bitton, Yonatan
in
Hallucinations
/ Standard data
2024
Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions
by
Szpektor, Idan
, Giryes, Raja
, Assaf Ben Kish
, Moran Yanuka
, Bitton, Yonatan
in
Hallucinations
/ Standard data
2024
Please be aware that the book you have requested cannot be checked out. If you would like to checkout this book, you can reserve another copy
We have requested the book for you!
Your request is successful and it will be processed during the Library working hours. Please check the status of your request in My Requests.
Oops! Something went wrong.
Looks like we were not able to place your request. Kindly try again later.
Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions
Paper
Overview
Recent research increasingly focuses on training vision-language models (VLMs) with long, detailed image captions. However, small-scale VLMs often struggle to balance the richness of these captions with the risk of hallucinating content during fine-tuning. In this paper, we explore how well VLMs adapt to such captions. To quantify caption quality, we propose Decomposed NLI (DNLI), an evaluation framework that breaks down generated captions into individual propositions, assessing each in isolation. This fine-grained analysis reveals a critical balance between capturing descriptive details and preventing hallucinations. Our findings show that simply reducing caption complexity or employing standard data curation techniques does not effectively resolve this issue. To tackle this challenge, we introduce Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that automatically adapts training data with the model's existing knowledge and visual understanding. KnowAda minimizes hallucinations while preserving high descriptiveness. We validate this approach across several small-scale VLMs (up to 7B parameters) and dense caption datasets, demonstrating that KnowAda effectively balances hallucination reduction and descriptiveness. Our results show that KnowAda outperforms various baselines in both automatic metrics and human evaluations. We will release our code and models.
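To make the DNLI idea concrete, here is a minimal sketch of per-proposition scoring: a generated caption is broken into individual propositions and each is judged in isolation against a reference description with an off-the-shelf NLI classifier. The sentence-splitting decompose stub, the roberta-large-mnli checkpoint, and the aggregate metrics are illustrative assumptions, not the authors' released implementation.

    # Sketch of DNLI-style evaluation; decomposition and model choice are assumptions.
    from transformers import pipeline

    # Off-the-shelf NLI classifier; the paper's actual model may differ.
    nli = pipeline("text-classification", model="roberta-large-mnli")

    def decompose(caption: str) -> list[str]:
        # Stub: treat each sentence as one proposition. The paper implies a
        # finer-grained decomposition (e.g., produced by an LLM).
        return [s.strip() for s in caption.split(".") if s.strip()]

    def dnli_score(reference: str, generated: str) -> dict:
        # Judge each proposition of the generated caption in isolation
        # against the reference description.
        verdicts = []
        for prop in decompose(generated):
            out = nli({"text": reference, "text_pair": prop})
            label = (out[0] if isinstance(out, list) else out)["label"]
            verdicts.append((prop, label))
        n = max(len(verdicts), 1)
        return {
            "entailed": sum(l == "ENTAILMENT" for _, l in verdicts) / n,
            "contradicted": sum(l == "CONTRADICTION" for _, l in verdicts) / n,
            "verdicts": verdicts,
        }

    print(dnli_score(
        "A brown dog runs across a grassy field.",
        "A dog is running. The field is grassy. The dog wears a red collar.",
    ))

The per-proposition verdicts expose the trade-off the paper describes: a caption can score well on entailed detail while still carrying contradicted (hallucinated) propositions that a single caption-level score would hide.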
Publisher
Cornell University Library, arXiv.org
Subject
Hallucinations / Standard data