Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
16,805 result(s) for "audio-visual"
Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-Wise Pseudo Labeling
2024
The Audio-Visual Video Parsing task aims to identify and temporally localize the events that occur in either or both the audio and visual streams of audible videos. It is typically performed in a weakly-supervised manner, where only video-level event labels are provided, i.e., the modalities and timestamps of the labels are unknown. Due to the lack of densely annotated labels, recent work attempts to leverage pseudo labels to enrich the supervision. A commonly used strategy is to generate pseudo labels by categorizing the known video event labels for each modality. However, the labels are still confined to the video level, and the temporal boundaries of events remain unlabeled. In this paper, we propose a new pseudo label generation strategy that explicitly assigns labels to each video segment by utilizing prior knowledge learned from the open world. Specifically, we exploit large-scale pretrained models, namely CLIP and CLAP, to estimate the events in each video segment and generate segment-level visual and audio pseudo labels, respectively. We then propose a new loss function that exploits these pseudo labels by taking into account their category-richness and segment-richness. A label denoising strategy is also adopted to further improve the visual pseudo labels by flipping them whenever abnormally large forward losses occur. We perform extensive experiments on the LLP dataset, demonstrate the effectiveness of each proposed design, and achieve state-of-the-art video parsing performance on all types of event parsing, i.e., audio events, visual events, and audio-visual events. Furthermore, our experiments verify that the high-quality segment-level pseudo labels produced by our method can be flexibly combined with other audio-visual video parsing backbones and consistently improve their performance. We also examine the proposed pseudo label generation strategy on the related weakly-supervised audio-visual event localization task, and the results again verify the benefits and generalizability of our method.
Journal Article
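The segment-wise pseudo-labeling idea in the abstract above can be sketched compactly. The following is a minimal illustration, not the authors' code: it assumes per-segment embeddings from frozen CLIP (visual frames) or CLAP (audio) encoders plus text embeddings of the event class names, and the 0.25 threshold is invented for this sketch.

    import numpy as np

    def segment_pseudo_labels(seg_embs, class_embs, video_labels, thresh=0.25):
        # seg_embs:     (T, D) per-segment embeddings (CLIP frames or CLAP audio)
        # class_embs:   (C, D) text embeddings of the event class names
        # video_labels: (C,)   binary video-level labels (the weak supervision)
        seg = seg_embs / np.linalg.norm(seg_embs, axis=-1, keepdims=True)
        cls = class_embs / np.linalg.norm(class_embs, axis=-1, keepdims=True)
        sims = seg @ cls.T                           # (T, C) cosine similarities
        pseudo = (sims > thresh).astype(np.float32)  # per-segment binary labels
        return pseudo * video_labels                 # keep only events the video has

Masking by the video-level labels mirrors the weak supervision: a segment can only be pseudo-labeled with events the whole video is known to contain.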
Singing out : the musical voice in audiovisual media
by Haworth, Catherine, editor; Carroll, Beth (Lecturer in Film and Literature), editor
in Singing; Audio-visual materials; Music
2025
'Singing Out' explores a broad range of singing voices and sung moments, from lavish film musical sequences, television and videogames, through to online platforms, advertising, and multimedia installation work. It illustrates the diverse ways in which the singing voice is produced and understood in different media across international contexts, taking into consideration issues such as corporeal form, age, race, reception, and gender. The act of singing emphasises issues of identity, technology, and the identifying markers of the voice itself, heightening communication, acting as an aid to memory, and inviting judgement.
Reframing Holocaust Testimony
2015
Institutions that have collected video testimonies from the few remaining Holocaust survivors are grappling with how to continue their mission to educate and commemorate. Noah Shenker calls attention to the ways that audiovisual testimonies of the Holocaust have been mediated by the institutional histories and practices of their respective archives. Shenker argues that testimonies are shaped not only by the encounter between interviewer and interviewee, but also by technical practices and the testimony process. He analyzes the ways in which interview questions, the framing of the camera, and curatorial and programming preferences impact how Holocaust testimony is molded, distributed, and received.
Deep Audio-visual Learning: A Survey
2021
Audio-visual learning, which aims to exploit the relationship between the audio and visual modalities, has drawn considerable attention since deep learning began to be applied successfully. Researchers tend to leverage the two modalities either to improve the performance of tasks previously considered single-modality or to address new, challenging problems. In this paper, we provide a comprehensive survey of recent developments in audio-visual learning. We divide current audio-visual learning tasks into four subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. State-of-the-art methods, as well as the remaining challenges of each subfield, are discussed. Finally, we summarize the commonly used datasets and challenges.
Journal Article
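To make one of the four subfields concrete, here is a minimal sketch of audio-visual correspondence learning: a symmetric contrastive (InfoNCE-style) loss that pulls embeddings of matching audio and visual clips together and pushes mismatched pairs apart. The encoders producing the embeddings and the temperature value are assumptions of this sketch, not specified by the survey.

    import torch
    import torch.nn.functional as F

    def av_correspondence_loss(audio_emb, visual_emb, temperature=0.07):
        # audio_emb, visual_emb: (B, D) embeddings of the same B video clips
        a = F.normalize(audio_emb, dim=-1)
        v = F.normalize(visual_emb, dim=-1)
        logits = a @ v.T / temperature                      # (B, B) similarities
        targets = torch.arange(a.size(0), device=a.device)  # matches on the diagonal
        # Symmetric: audio retrieves its visual clip, and vice versa.
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.T, targets)) / 2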
Introduction to audiovisual archives
by Stockinger, Peter
in Audio-visual archives; Audio-visual materials; Audio-visual materials--Classification
2013, 2012
Today, audiovisual archives and libraries have become very popular, especially in the field of collecting, preserving, and transmitting cultural heritage. However, the data in these archives and libraries – videos, images, sound tracks, etc. – constitute, as such, only potential cognitive resources for a given public (or "target community"). They have to undergo more or less significant qualitative transformations in order to become intellectual goods relevant to a user or community.
These qualitative transformations are performed through a series of concrete operations, such as audiovisual text segmentation, content description and indexing, pragmatic profiling, and translation. These and other operations constitute what we call the semiotic turn in dealing with digital (audiovisual) texts, corpora of texts, or even entire (audiovisual) archives and libraries. They demonstrate, practically and theoretically, the well-known "from data to metadata" or "from (simple) information to (relevant) knowledge" problem – a problem that directly influences the effective use, social impact, and relevance, and therefore also the future, of digital knowledge archives.
Indeed, this problem lies at the heart of a variety of important R&D programs and projects all over the world.
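The operations listed above amount to enriching raw media with structured, community-relevant metadata. As a purely illustrative sketch (the schema below is invented here, not taken from the book), a record for one analyzed segment might look like this:

    from dataclasses import dataclass, field

    @dataclass
    class SegmentRecord:
        start_s: float                                    # audiovisual text segmentation
        end_s: float
        description: str = ""                             # content description
        index_terms: list = field(default_factory=list)   # indexing
        target_community: str = "general"                 # pragmatic profiling
        translations: dict = field(default_factory=dict)  # translation, keyed by language

    record = SegmentRecord(12.0, 45.5,
                           description="Curator explains preservation workflows",
                           index_terms=["audio-visual archives", "preservation"])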
A Review of Recent Advances on Deep Learning Methods for Audio-Visual Speech Recognition
2023
This article provides a detailed review of recent advances in audio-visual speech recognition (AVSR) methods developed over the last decade (2013–2023). Despite the recent success of audio-only speech recognition systems, the problem of audio-visual (AV) speech decoding remains challenging. In comparison to previous surveys, we focus on the important progress brought by the introduction of deep learning (DL) to the field and skip the description of long-known traditional "hand-crafted" methods. We also discuss the recent application of DL to AV speech fusion and recognition. We first discuss the main AV datasets used in the literature for AVSR experiments, since we consider AVSR a data-driven machine learning (ML) task. We then review the methodology used for visual speech recognition (VSR), followed by recent advances in AV methodology. We separately discuss the evolution of core AVSR methods, pre-processing and augmentation techniques, and modality fusion strategies. We conclude with a discussion of the current state of AVSR and our vision for future research.
Journal Article
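As a toy illustration of one modality-fusion strategy such a review covers, feature-level (early) fusion simply concatenates time-aligned audio and lip-region features before a shared sequence encoder. The architecture and dimensions below are assumptions made for this sketch, not taken from the article.

    import torch
    import torch.nn as nn

    class EarlyFusionAVSR(nn.Module):
        def __init__(self, audio_dim=80, visual_dim=512, hidden=256, vocab=40):
            super().__init__()
            self.proj = nn.Linear(audio_dim + visual_dim, hidden)
            self.encoder = nn.GRU(hidden, hidden, batch_first=True)
            self.head = nn.Linear(hidden, vocab)   # e.g. frame-level CTC logits

        def forward(self, audio_feats, visual_feats):
            # audio_feats: (B, T, audio_dim); visual_feats: (B, T, visual_dim),
            # assumed pre-aligned to a common frame rate.
            fused = torch.cat([audio_feats, visual_feats], dim=-1)
            h, _ = self.encoder(torch.relu(self.proj(fused)))
            return self.head(h)                    # (B, T, vocab)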