Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping
by Derr, Tyler; Zhan, Xuhui
in Alignment / Cognition & reasoning / Cognitive tasks / Mapping / Optical character recognition / Reasoning / Representations / Task complexity / Vision / Visual tasks
2025
Paper
Overview
Traditional multimodal learning approaches require expensive alignment pre-training to bridge vision and language modalities, typically projecting visual features into discrete text token spaces. We challenge both fundamental assumptions underlying this paradigm by proposing Inverse-LLaVA, a novel approach that eliminates alignment pre-training entirely while inverting the conventional mapping direction. Rather than projecting visual features to text space, our method maps text embeddings into continuous visual representation space and performs fusion within transformer intermediate layers. Through selective additive components in attention mechanisms, we enable dynamic integration of visual and textual representations without requiring massive image-text alignment datasets. Comprehensive experiments across nine multimodal benchmarks demonstrate nuanced performance trade-offs: Inverse-LLaVA achieves notable improvements on reasoning-intensive and cognitive tasks (MM-VET: +0.2%, VizWiz: +1.8%, ScienceQA: +0.2%, cognitive reasoning: +27.2%), while showing expected decreases in perception tasks requiring memorized visual-text associations (celebrity recognition: -49.5%, OCR: -21.3%). These results provide the first empirical evidence that alignment pre-training is not necessary for effective multimodal learning, particularly for complex reasoning tasks. Our work establishes the feasibility of a new paradigm that reduces computational requirements by 45%, challenges conventional wisdom about modality fusion, and opens new research directions for efficient multimodal architectures that preserve modality-specific characteristics. Our project website with code and additional resources is available at https://inverse-llava.github.io.
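To make the inverted mapping direction concrete, here is a minimal sketch in pure Python. It is not the authors' implementation: the overview does not specify the layer design, so the function name `text_to_vision_fusion`, the projection matrices `w_in` and `w_out`, the scalar `gate`, and all dimensions are assumptions chosen for illustration. It shows one plausible reading of the idea: text hidden states are projected into the visual feature space, attention weights over visual patches are computed there, and the attended result is added back to the text states as an additive component.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights or features."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def text_to_vision_fusion(text_hidden, vision_feats, w_in, w_out, gate=1.0):
    """Hypothetical additive fusion that maps text INTO vision space.

    text_hidden:  (T, d_text) text hidden states at some intermediate layer
    vision_feats: (P, d_vis)  continuous visual features, left unprojected
    w_in:         (d_text, d_vis) assumed text-to-vision projection
    w_out:        (d_vis, d_text) assumed projection back to the text width
    gate:         assumed scalar controlling the size of the additive term
    """
    d_vis = len(vision_feats[0])
    # Project text states into the visual representation space.
    queries = matmul(text_hidden, w_in)                      # (T, d_vis)
    # Score each text position against each visual patch in vision space.
    vision_t = [list(col) for col in zip(*vision_feats)]     # (d_vis, P)
    scores = matmul(queries, vision_t)                       # (T, P)
    weights = [softmax([s / math.sqrt(d_vis) for s in row]) for row in scores]
    # Gather visual information and map it back to the text width.
    attended = matmul(weights, vision_feats)                 # (T, d_vis)
    delta = matmul(attended, w_out)                          # (T, d_text)
    # Additive component: the original text state plus a gated visual update.
    fused = [
        [h + gate * d for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(text_hidden, delta)
    ]
    return fused, weights


rng = random.Random(0)
text_hidden = random_matrix(4, 8, rng)     # 4 text tokens, width 8 (illustrative)
vision_feats = random_matrix(6, 16, rng)   # 6 visual patches, width 16 (illustrative)
w_in = random_matrix(8, 16, rng)
w_out = random_matrix(16, 8, rng)

fused, weights = text_to_vision_fusion(text_hidden, vision_feats, w_in, w_out)
print(len(fused), len(fused[0]))  # 4 8
```

In this sketch the visual features are never projected into the text token space, and setting `gate=0.0` returns the text states unchanged, which is one simple way to picture an additive component that can be applied selectively.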
Publisher
Cornell University Library, arXiv.org