4 results for "end-to-end emotion classification"
Deep-Learning-Based Multimodal Emotion Classification for Music Videos
Music videos contain a great deal of visual and acoustic information. Each information source within a music video influences the emotions conveyed through the audio and video, suggesting that only a multimodal approach is capable of achieving efficient affective computing. This paper presents an affective computing system that relies on music, video, and facial expression cues for emotion analysis. We applied audio–video information exchange and boosting methods to regularize the training process and reduced the computational cost by using a separable convolution strategy. In sum, our empirical findings are as follows: (1) multimodal representations efficiently capture all acoustic and visual emotional cues included in each music video, (2) the computational cost of each neural network is significantly reduced by factorizing the standard 2D/3D convolution into separate channel and spatiotemporal interactions, and (3) information-sharing methods incorporated into multimodal representations help guide individual information flow and boost overall performance. We tested our findings across several unimodal and multimodal networks against various evaluation metrics and visual analyzers. Our best classifier attained 74% accuracy, an F1-score of 0.73, and an area-under-the-curve score of 0.926.
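The cost-saving step mentioned here, factorizing a standard 2D/3D convolution into per-channel and spatiotemporal parts, can be illustrated with a short PyTorch sketch. This is a minimal illustration, not the authors' code; layer sizes and class names are assumptions.

    # Minimal sketch (not the paper's implementation) of the two factorizations the
    # abstract mentions: depthwise-separable 2D convolution (channel factorization)
    # and (2+1)D convolution (spatiotemporal factorization). Names are illustrative.
    import torch
    import torch.nn as nn

    class DepthwiseSeparableConv2d(nn.Module):
        """Per-channel (depthwise) conv followed by 1x1 (pointwise) channel mixing."""
        def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
            super().__init__()
            self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding, groups=in_ch)
            self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

        def forward(self, x):                    # x: (batch, channels, height, width)
            return self.pointwise(self.depthwise(x))

    class SpatioTemporalConv3d(nn.Module):
        """(2+1)D factorization: spatial conv over H,W followed by temporal conv over T."""
        def __init__(self, in_ch, out_ch, k=3):
            super().__init__()
            self.spatial = nn.Conv3d(in_ch, out_ch, (1, k, k), padding=(0, k // 2, k // 2))
            self.temporal = nn.Conv3d(out_ch, out_ch, (k, 1, 1), padding=(k // 2, 0, 0))
            self.act = nn.ReLU(inplace=True)

        def forward(self, x):                    # x: (batch, channels, time, height, width)
            return self.act(self.temporal(self.act(self.spatial(x))))

    if __name__ == "__main__":
        frames = torch.randn(2, 3, 8, 64, 64)                 # dummy video clip
        images = torch.randn(2, 3, 64, 64)                    # dummy image batch
        print(SpatioTemporalConv3d(3, 16)(frames).shape)      # torch.Size([2, 16, 8, 64, 64])
        print(DepthwiseSeparableConv2d(3, 16)(images).shape)  # torch.Size([2, 16, 64, 64])

Both factorizations replace one large kernel with two cheaper ones, which is the general source of the computational savings the abstract reports.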
Can We Ditch Feature Engineering? End-to-End Deep Learning for Affect Recognition from Physiological Sensor Data
To further extend the applicability of wearable sensors in various domains such as mobile health systems and the automotive industry, new methods for accurately extracting subtle physiological information from these wearable sensors are required. However, the extraction of valuable information from physiological signals is still challenging: smartphones can count steps and compute heart rate, but they cannot recognize emotions and related affective states. This study analyzes the possibility of using end-to-end multimodal deep learning (DL) methods for affect recognition. Ten end-to-end DL architectures are compared on four different datasets with diverse raw physiological signals used for affect recognition, including emotional and stress states. The DL architectures specialized for time-series classification were enhanced to learn simultaneously from multiple sensors, each with its own sampling frequency. To enable a fair comparison among the different DL architectures, Bayesian optimization was used for hyperparameter tuning. The experimental results showed that model performance depends on the intensity of the physiological response induced by the affective stimuli, i.e., the DL models recognize stress induced by the Trier Social Stress Test more successfully than they recognize emotional changes induced by watching affective content, e.g., funny videos. Additionally, the results showed that CNN-based architectures may be more suitable than LSTM-based architectures for affect recognition from physiological sensors.
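The multi-sensor setup described here, raw signals with different sampling frequencies feeding one end-to-end model, can be sketched in PyTorch as a small CNN with one branch per sensor. This is not the study's code; sensor names, sampling rates, and layer sizes are assumptions.

    # Minimal sketch (not the study's implementation): an end-to-end CNN that learns
    # from several raw physiological channels, each sampled at its own frequency, by
    # giving every sensor its own 1D-convolutional branch and fusing pooled features.
    import torch
    import torch.nn as nn

    def sensor_branch(hidden=32):
        # Per-sensor encoder over (batch, 1, samples); works for any signal length
        # because global average pooling collapses the time axis at the end.
        return nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )

    class MultiSensorAffectNet(nn.Module):
        def __init__(self, sensors=("eda", "bvp", "temp"), hidden=32, n_classes=3):
            super().__init__()
            self.branches = nn.ModuleDict({s: sensor_branch(hidden) for s in sensors})
            self.classifier = nn.Linear(hidden * len(sensors), n_classes)

        def forward(self, signals):              # signals: dict sensor name -> (batch, 1, samples)
            feats = [self.branches[s](signals[s]) for s in self.branches]
            return self.classifier(torch.cat(feats, dim=1))

    if __name__ == "__main__":
        batch = {"eda": torch.randn(4, 1, 4 * 60),    # e.g. 4 Hz electrodermal activity
                 "bvp": torch.randn(4, 1, 64 * 60),   # e.g. 64 Hz blood volume pulse
                 "temp": torch.randn(4, 1, 4 * 60)}   # e.g. 4 Hz skin temperature
        print(MultiSensorAffectNet()(batch).shape)    # torch.Size([4, 3])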
Multi-view text classification through integrated RNN autoencoder learning of word, sentence, emotion and paragraph representations
Text classification performance can be constrained by single-view approaches that process documents through a single representational lens and struggle to capture the multi-dimensional nature of textual information. We propose FMV-RNN-AE (Feature integration Multi-View RNN Autoencoder), an end-to-end framework that systematically integrates four complementary textual views: word-level embeddings, sentence-level representations, emotion-based features, and paragraph-level semantics. FMV-RNN-AE employs standard RNN autoencoders to learn compressed view-specific representations, followed by a learnable fusion module and joint optimization for classification, focusing on the principled integration of these components rather than introducing a fundamentally new architecture. Comprehensive evaluation across seven benchmark datasets shows consistent improvements of 4.7% over strong single-view approaches and 2.2–4.0% over existing multi-view methods, with particularly strong performance on sentiment-oriented tasks (93.5% accuracy on Hate Speech, 92.7% on IMDb). Compared with BERT, FMV-RNN-AE achieves comparable average accuracy while using 7.2× fewer parameters and 50% less memory, at the cost of approximately 4× higher inference latency due to sequential multi-view processing. Thus, the framework is best interpreted as a memory-efficient, task-sensitive alternative suited to latency-tolerant, batch, or offline scenarios rather than real-time applications. These results highlight the potential of carefully designed multi-view autoencoder integration for improving text classification robustness across diverse domains under constrained memory budgets.
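The overall pattern, per-view RNN autoencoders whose latent codes are fused by a learnable module and optimized jointly with the classifier, can be sketched as follows in PyTorch. This is not the FMV-RNN-AE implementation; view names, dimensions, and the loss weighting are assumptions.

    # Minimal sketch (not the paper's code): one GRU autoencoder per view, a learnable
    # fusion layer over the concatenated view codes, and a joint reconstruction +
    # classification objective. All sizes and names are illustrative.
    import torch
    import torch.nn as nn

    class ViewAutoencoder(nn.Module):
        """GRU encoder/decoder for one view given as a sequence of feature vectors."""
        def __init__(self, feat_dim, latent_dim=64):
            super().__init__()
            self.encoder = nn.GRU(feat_dim, latent_dim, batch_first=True)
            self.decoder = nn.GRU(latent_dim, feat_dim, batch_first=True)

        def forward(self, x):                    # x: (batch, seq_len, feat_dim)
            _, h = self.encoder(x)               # h: (1, batch, latent_dim)
            z = h.squeeze(0)                     # compressed view-specific code
            recon, _ = self.decoder(z.unsqueeze(1).repeat(1, x.size(1), 1))
            return z, recon

    class MultiViewClassifier(nn.Module):
        def __init__(self, view_dims, latent_dim=64, n_classes=2):
            super().__init__()
            self.views = nn.ModuleDict({name: ViewAutoencoder(d, latent_dim)
                                        for name, d in view_dims.items()})
            self.fusion = nn.Sequential(nn.Linear(latent_dim * len(view_dims), latent_dim),
                                        nn.ReLU())
            self.classifier = nn.Linear(latent_dim, n_classes)

        def forward(self, inputs):               # inputs: dict view name -> (batch, seq, dim)
            codes, recon_loss = [], 0.0
            for name, ae in self.views.items():
                z, recon = ae(inputs[name])
                codes.append(z)
                recon_loss = recon_loss + nn.functional.mse_loss(recon, inputs[name])
            logits = self.classifier(self.fusion(torch.cat(codes, dim=1)))
            return logits, recon_loss

    if __name__ == "__main__":
        views = {"word": 100, "sentence": 128, "emotion": 10, "paragraph": 128}
        model = MultiViewClassifier(views)
        batch = {name: torch.randn(8, 20, d) for name, d in views.items()}
        logits, recon_loss = model(batch)
        labels = torch.randint(0, 2, (8,))
        loss = nn.functional.cross_entropy(logits, labels) + 0.1 * recon_loss  # joint objective
        loss.backward()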
EMVAS: end-to-end multimodal emotion visualization analysis system
Accurately interpreting human emotions is crucial for enhancing human–machine interactions in applications such as driver monitoring, adaptive learning, and smart environments. Conventional unimodal systems fail to capture the complex interplay of emotional cues in dynamic settings. To address these limitations, we propose EMVAS, an end-to-end multimodal emotion visualization analysis system that seamlessly integrates visual, auditory, and textual modalities. The preprocessing architecture utilizes silence-based audio segmentation alongside end-to-end DeepSpeech2 audio-to-text conversion to generate a synchronized and semantically consistent data stream. For feature extraction, facial landmark detection and action unit analysis capture fine-grained visual cues; Mel-frequency cepstral coefficients, log-scaled fundamental frequency, and the Constant-Q transform extract detailed audio features; and a Transformer-based encoder processes textual data for contextual emotion analysis. These heterogeneous features are projected into a unified latent space and fused using a self-supervised multitask learning framework that leverages both shared and modality-specific representations to achieve robust emotion classification. An intuitive front end provides real-time visualization of temporal trends and emotion-frequency distributions. Extensive experiments on benchmark datasets and real-world scenarios demonstrate that EMVAS outperforms state-of-the-art baselines, achieving higher classification accuracy, improved F1 scores, lower mean absolute error, and stronger correlations.
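As a rough illustration of the fusion stage described here, the PyTorch sketch below projects heterogeneous visual, audio, and text features into one shared latent space before classification. It is not the EMVAS code, and the paper's self-supervised multitask objectives are not reproduced; feature sizes and names are assumptions.

    # Minimal sketch (not the EMVAS implementation): per-modality projections into a
    # shared latent space, concatenation-based fusion, and an emotion head. The
    # self-supervised multitask objectives mentioned in the abstract are omitted;
    # all sizes and names below are illustrative.
    import torch
    import torch.nn as nn

    class EmotionFusionModel(nn.Module):
        def __init__(self, modality_dims, latent_dim=128, n_emotions=7):
            super().__init__()
            # One projection per modality maps its own feature size into the shared space.
            self.project = nn.ModuleDict({m: nn.Sequential(nn.Linear(d, latent_dim), nn.ReLU())
                                          for m, d in modality_dims.items()})
            self.fuse = nn.Sequential(nn.Linear(latent_dim * len(modality_dims), latent_dim),
                                      nn.ReLU())
            self.emotion_head = nn.Linear(latent_dim, n_emotions)

        def forward(self, feats):                # feats: dict modality -> (batch, feat_dim)
            latents = [self.project[m](x) for m, x in feats.items()]
            fused = self.fuse(torch.cat(latents, dim=1))
            return self.emotion_head(fused)

    if __name__ == "__main__":
        dims = {"visual": 136, "audio": 120, "text": 768}  # e.g. landmarks/AUs, MFCC/F0/CQT, text encoder
        model = EmotionFusionModel(dims)
        feats = {m: torch.randn(4, d) for m, d in dims.items()}
        print(model(feats).shape)                          # torch.Size([4, 7])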