Omni2Sound: Towards Unified Video-Text-to-Audio Generation

by Chen, Zehua; Ke, Qiuhong; Jiang, Yuxuan; Dai, Yusheng; Zhu, Jun; Cai, Jianfei; Gao, Baolong

in Alignment / Benchmarks / Bias / Competition / Visual tasks

2026

Paper

Overview

Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight V-A-T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate visual bias of MLLMs, a Junior-Senior Agent Handoff for a 5× cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight V-A-T alignment. Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and off-screen audio generation faithfulness. Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation, including challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions.

Publisher

Cornell University Library, arXiv.org

Subject

Bias