SpeechCaps: Advancing Instruction-Based Universal Speech Models with Multi-Talker Speaking Style Captioning

by Min-Han Shih, Hung-yi Lee, Chien-yu Huang, Chi-Yuan Hsiao, Ke-Han Lu

in Datasets / Emotion recognition / Large language models / Linguistics / Speech processing

2024

Paper

Overview

Instruction-based speech processing is becoming popular. Studies show that training with multiple tasks boosts performance, but collecting diverse, large-scale tasks and datasets is expensive. Thus, it is highly desirable to design a fundamental task that benefits other downstream tasks. This paper introduces a multi-talker speaking style captioning task to enhance the understanding of speaker and prosodic information. We used large language models to generate descriptions for multi-talker speech. We then pre-trained our model on this captioning task and followed it with instruction tuning. Evaluation on Dynamic-SUPERB shows our model outperforming the baseline pre-trained only on single-talker tasks, particularly in speaker and emotion recognition. Additionally, tests on a multi-talker QA task reveal that current models struggle with attributes such as gender, pitch, and speaking rate. The code and dataset are available at https://github.com/cyhuang-tw/speechcaps.

Publisher

Cornell University Library, arXiv.org

Subject

Datasets / Emotion recognition / Large language models / Linguistics / Speech processing