937 results for "Multimodal perception"
Graph-Attention Fusion with VAE Cross-Modal Mapping and Reinforcement-Learning Visualization for Real-Time AR
In AR scenarios, the intelligent generation and visualization of multimodal perception information face challenges such as feature heterogeneity, insufficient semantic alignment, and unstable real-time performance. To address these issues, this study proposes a feature modeling method that integrates an Attention-GCN for multimodal fusion, a variational autoencoder (VAE) with geometric/temporal constraints for cross-modal mapping, and a reinforcement-learning (PPO)-driven optimization mechanism to form a "perception–generation–presentation–feedback" closed-loop system. Experiments are conducted on a self-built multimodal dataset of 28,000 sequences, with results evaluated on a held-out test set to ensure reliability. Baseline comparisons include a unimodal CNN and a heuristic fusion model under the same computational conditions. Results demonstrate that the proposed framework achieves an average delay of 1.42 ± 0.08 s, a frame rate of 57 ± 1.5 fps, a semantic alignment rate of 92.4 ± 1.1%, and an interaction interruption rate of 3.5 ± 0.4%, outperforming baselines in efficiency, semantic consistency, and rendering stability. These findings highlight the framework’s feasibility for real-time multimodal interaction in AR scenarios and its scalability across mid-range devices.
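As a rough illustration of the fusion-plus-mapping idea in this abstract, the PyTorch sketch below pairs a single graph-attention layer over modality nodes with a small VAE head for cross-modal mapping. All module names, dimensions, and the fully connected modality graph are assumptions made for the example; the PPO-driven optimization loop is not shown, and nothing here reproduces the paper's actual implementation.

    # Hypothetical sketch of "Attention-GCN fusion + VAE cross-modal mapping".
    # Layer sizes and the fusion strategy are illustrative assumptions only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphAttentionFusion(nn.Module):
        """Fuse per-modality node features with one graph-attention layer."""
        def __init__(self, in_dim: int, out_dim: int):
            super().__init__()
            self.proj = nn.Linear(in_dim, out_dim, bias=False)
            self.attn = nn.Linear(2 * out_dim, 1, bias=False)

        def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            # x: (N, in_dim) node features (one node per modality/feature group)
            # adj: (N, N) adjacency describing which modalities interact
            h = self.proj(x)                                      # (N, out_dim)
            n = h.size(0)
            pairs = torch.cat(
                [h.unsqueeze(1).expand(n, n, -1), h.unsqueeze(0).expand(n, n, -1)],
                dim=-1,
            )                                                     # (N, N, 2*out_dim)
            scores = F.leaky_relu(self.attn(pairs)).squeeze(-1)   # (N, N)
            scores = scores.masked_fill(adj == 0, float("-inf"))
            alpha = torch.softmax(scores, dim=-1)                 # attention over neighbors
            return alpha @ h                                      # fused node features

    class CrossModalVAE(nn.Module):
        """Map fused features into a latent space and decode the target modality."""
        def __init__(self, in_dim: int, latent_dim: int, out_dim: int):
            super().__init__()
            self.mu = nn.Linear(in_dim, latent_dim)
            self.logvar = nn.Linear(in_dim, latent_dim)
            self.decoder = nn.Sequential(nn.Linear(latent_dim, out_dim), nn.Tanh())

        def forward(self, fused: torch.Tensor):
            mu, logvar = self.mu(fused), self.logvar(fused)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
            return self.decoder(z), mu, logvar

    # Toy usage: three modality nodes over a fully connected graph.
    x = torch.randn(3, 64)
    adj = torch.ones(3, 3)
    fused = GraphAttentionFusion(64, 32)(x, adj)
    recon, mu, logvar = CrossModalVAE(32, 16, 64)(fused)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # VAE regularizer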
A survey on integration of large language models with intelligent robots
In recent years, the integration of large language models (LLMs) has revolutionized the field of robotics, enabling robots to communicate, understand, and reason with human-like proficiency. This paper explores the multifaceted impact of LLMs on robotics, addressing key challenges and opportunities for leveraging these models across various domains. By categorizing and analyzing LLM applications within core robotics elements—communication, perception, planning, and control—we aim to provide actionable insights for researchers seeking to integrate LLMs into their robotic systems. Our investigation focuses on LLMs developed post-GPT-3.5, primarily in text-based modalities while also considering multimodal approaches for perception and control. We offer comprehensive guidelines and examples for prompt engineering, facilitating beginners’ access to LLM-based robotics solutions. Through tutorial-level examples and structured prompt construction, we illustrate how LLM-guided enhancements can be seamlessly integrated into robotics applications. This survey serves as a roadmap for researchers navigating the evolving landscape of LLM-driven robotics, offering a comprehensive overview and practical guidance for harnessing the power of language models in robotics development.
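The survey's emphasis on tutorial-level, structured prompt construction can be pictured with a minimal sketch like the one below, which only assembles a planning prompt as a string. The field layout, skill names, and scene format are hypothetical choices for illustration, not guidelines taken from the paper; the resulting string would be passed to whatever chat-style LLM endpoint is in use.

    # Hypothetical structured prompt for LLM-guided robot task planning.
    # Skill names, scene fields, and the output format are illustrative assumptions.
    def build_planning_prompt(task: str, skills: list[str], scene: dict[str, str]) -> str:
        """Compose a text prompt asking an LLM for a step-by-step skill plan."""
        lines = [
            "You are a robot task planner.",
            "Available skills (use only these):",
            *[f"- {s}" for s in skills],
            "Current scene:",
            *[f"- {obj}: {state}" for obj, state in scene.items()],
            f"Task: {task}",
            "Respond with a numbered list of skill calls, one per line,",
            "using the form skill_name(argument).",
        ]
        return "\n".join(lines)

    prompt = build_planning_prompt(
        task="put the red cup on the shelf",
        skills=["pick(object)", "place(object, location)", "move_to(location)"],
        scene={"red cup": "on table", "shelf": "empty", "gripper": "free"},
    )
    print(prompt)  # send this string to any chat-style LLM endpoint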
Music–color associations are mediated by emotion
Experimental evidence demonstrates robust cross-modal matches between music and colors that are mediated by emotional associations. US and Mexican participants chose colors that were most/least consistent with 18 selections of classical orchestral music by Bach, Mozart, and Brahms. In both cultures, faster music in the major mode produced color choices that were more saturated, lighter, and yellower whereas slower, minor music produced the opposite pattern (choices that were desaturated, darker, and bluer). There were strong correlations (0.89 < r < 0.99) between the emotional associations of the music and those of the colors chosen to go with the music, supporting an emotional mediation hypothesis in both cultures. Additional experiments showed similarly robust cross-modal matches from emotionally expressive faces to colors and from music to emotionally expressive faces. These results provide further support that music-to-color associations are mediated by common emotional associations.
Why we are not all synesthetes (not even weakly so)
A little over a decade ago, Martino and Marks (Current Directions in Psychological Science 10:61–65, 2001) put forward the influential claim that cases of intuitive matchings between stimuli in different sensory modalities should be considered as a weak form of synesthesia. Over the intervening years, many other researchers have agreed—at the very least, implicitly—with this position (e.g., Bien, ten Oever, Goebel, & Sack, NeuroImage 59:663–672, 2012; Eagleman, Cortex 45:1266–1277, 2009; Esterman, Verstynen, Ivry, & Robertson, Journal of Cognitive Neuroscience 18:1570–1576, 2006; Ludwig, Adachi, & Matzuzawa, Proceedings of the National Academy of Sciences of the United States of America 108:20661–20665, 2011; Mulvenna & Walsh, Trends in Cognitive Sciences 10:350–352, 2006; Sagiv & Ward, 2006; Zellner, McGarry, Mattern-McClory, & Abreu, Chemical Senses 33:211–222, 2008). Here, though, we defend the separatist view, arguing that these cases are likely to form distinct kinds of phenomena despite their superficial similarities. We believe that crossmodal correspondences should be studied in their own right and not assimilated, either in terms of the name used or in terms of the explanation given, to synesthesia. To conflate these two phenomena is both inappropriate and potentially misleading. Below, we critically evaluate the evidence concerning the descriptive and constitutive features of crossmodal correspondences and synesthesia and highlight how they differ. Ultimately, we wish to provide a general definition of crossmodal correspondences as acquired, malleable, relative, and transitive pairings between sensory dimensions and to provide a framework in which to integrate the nonsystematic cataloguing of new cases of crossmodal correspondences, a tendency that has increased in recent years.
An AI-Driven Multimodal Sensor Fusion Framework for Fraud Perception in Short-Video and Live-Streaming Platforms
With the rapid proliferation of short-video platforms and live-streaming commerce ecosystems, marketing activities are increasingly manifested through complex multimodal sensing signals. These heterogeneous sensor data streams exhibit strong temporal dependency, high cross-modal coupling, and progressive evolutionary characteristics, making early-stage fraud perception particularly challenging for conventional unimodal or static analytical paradigms. Existing approaches often fail to effectively capture weak anomalous cues emerging across multimodal channels during the initial stages of fraudulent campaigns. To address these limitations, an artificial intelligence-driven multimodal sensor perception framework is proposed for temporal fraud detection in short-video environments. A multimodal temporal alignment module is first designed to synchronize heterogeneous sensor signals with inconsistent sampling granularities. Subsequently, a shared temporal encoding network is constructed to learn evolution-aware representations across multimodal sensor sequences. On this basis, a cross-modal temporal attention fusion mechanism is introduced to dynamically weight sensor contributions at different behavioral stages. Finally, a fraud evolution modeling and early risk prediction module is developed to characterize the progressive intensification of fraudulent activities and to enable risk assessment under incomplete temporal observations. Extensive experiments conducted on real-world datasets collected from multiple mainstream short-video platforms demonstrate the effectiveness of the proposed AI-driven sensing framework. The model achieves an overall accuracy of 0.941, precision of 0.865, recall of 0.812, and F1 score of 0.838, with the AUC further reaching 0.956, significantly outperforming text-based, vision-based, temporal, and conventional multimodal baselines. In early-stage detection scenarios utilizing only the first 30% of video content, the framework maintains stable performance advantages, achieving a precision of 0.812, recall of 0.704, and F1 score of 0.754, validating its capability for proactive fraud warning.
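A minimal sketch of the cross-modal temporal attention fusion step described here might look as follows in PyTorch: each modality stream, assumed already resampled to a shared time base, is projected into a common space, encoded by a shared GRU, and fused with per-step attention over modalities. The dimensions, module names, and the final risk head are illustrative assumptions, not the paper's architecture.

    # Hypothetical cross-modal temporal attention fusion for fraud-risk scoring.
    # Dimensions, the shared GRU encoder, and the alignment rule are assumptions.
    import torch
    import torch.nn as nn

    class CrossModalTemporalFusion(nn.Module):
        def __init__(self, modal_dims: list[int], hidden: int = 64):
            super().__init__()
            # Per-modality projections into a shared feature space.
            self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in modal_dims)
            # Shared temporal encoder applied to every modality stream.
            self.encoder = nn.GRU(hidden, hidden, batch_first=True)
            # Scores how much each modality should contribute at each time step.
            self.score = nn.Linear(hidden, 1)
            self.classifier = nn.Linear(hidden, 1)  # fraud-risk logit

        def forward(self, streams: list[torch.Tensor]) -> torch.Tensor:
            # streams[i]: (batch, time, modal_dims[i]), resampled to a shared rate
            encoded = []
            for x, proj in zip(streams, self.proj):
                h, _ = self.encoder(proj(x))          # (batch, time, hidden)
                encoded.append(h)
            h_all = torch.stack(encoded, dim=2)       # (batch, time, n_modalities, hidden)
            weights = torch.softmax(self.score(h_all), dim=2)  # attention over modalities
            fused = (weights * h_all).sum(dim=2)      # (batch, time, hidden)
            return self.classifier(fused[:, -1])      # risk estimate from the last step

    # Toy usage: text, vision, and interaction streams with different feature widths.
    model = CrossModalTemporalFusion(modal_dims=[32, 128, 8])
    logit = model([torch.randn(4, 20, 32), torch.randn(4, 20, 128), torch.randn(4, 20, 8)])
    risk = torch.sigmoid(logit)  # (4, 1) probability-like fraud score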
Video Ergo Sum: Manipulating Bodily Self-Consciousness
Humans normally experience the conscious self as localized within their bodily borders. This spatial unity may break down in certain neurological conditions such as out-of-body experiences, leading to a striking disturbance of bodily self-consciousness. On the basis of these clinical data, we designed an experiment that uses conflicting visual-somatosensory input in virtual reality to disrupt the spatial unity between the self and the body. We found that during multisensory conflict, participants felt as if a virtual body seen in front of them was their own body and mislocalized themselves toward the virtual body, to a position outside their bodily borders. Our results indicate that spatial unity and bodily self-consciousness can be studied experimentally and are based on multisensory and cognitive processing of bodily information.
Auditory and vibrotactile interactions in perception of timbre acoustic features
Recently, there has been increasing interest in developing auditory-to-vibrotactile sensory devices. However, the potential of these technologies is constrained by our limited understanding of which features of complex sounds can be perceived through vibrations. The present study aimed to investigate the vibrotactile perception of acoustic features related to timbre, an essential component for identifying environmental, speech and musical sounds. Discrimination thresholds were measured for six features: three spectral (number of harmonics, harmonic roll-off ratio, even-harmonic attenuation) and three temporal (attack time, amplitude modulation depth and amplitude modulation frequency) using auditory, vibrotactile and combined auditory + vibrotactile stimulation in 31 adult humans with normal tactile and auditory sensitivity. Results revealed that all spectral and temporal features can be reliably discriminated via vibrotactile stimulation alone. However, for spectral features, vibrotactile thresholds were significantly higher (i.e., worse) than auditory thresholds, whereas, for temporal features, only the vibrotactile amplitude modulation frequency threshold was significantly higher. With simultaneous auditory and tactile presentation, thresholds significantly improved for attack time and amplitude modulation depth, but not for any of the spectral acoustic features. These results suggest that vibrotactile temporal cues have a more straightforward potential for assisting auditory perception, while vibrotactile spectral cues may require specialized signal processing schemes.
“Moving to the beat” improves timing perception
Here, we demonstrate that “moving to the beat” can improve the perception of timing, providing an intriguing explanation as to why we often move when listening to music. In the first experiment, participants heard a series of isochronous beats and identified whether the timing of a final tone after a short silence was consistent with the timing of the preceding sequence. On half of the trials, participants tapped along with the beat, and on half of the trials, they listened without moving. When the final tone occurred later than expected, performance in the movement condition was significantly better than performance in the no-movement condition. Two additional experiments illustrate that this improved performance is due to improved timekeeping, rather than to a shift in strategy. This work contributes to a growing literature on sensorimotor integration by demonstrating body movement’s objective improvement in timekeeping, complementing previous explorations involving subjective tasks.
Perception of intersensory synchrony: A tutorial review
For most multisensory events, observers perceive synchrony among the various senses (vision, audition, touch), despite the naturally occurring lags in arrival and processing times of the different information streams. A substantial amount of research has examined how the brain accomplishes this. In the present article, we review several key issues about intersensory timing, and we identify four mechanisms of how intersensory lags might be dealt with: by ignoring lags up to some point (a wide window of temporal integration), by compensating for predictable variability, by adjusting the point of perceived synchrony on the longer term, and by shifting one stream directly toward the other.
Lip Movements Affect Infants' Audiovisual Speech Perception
Speech is robustly audiovisual from early in infancy. Here we show that audiovisual speech perception in 4.5-month-old infants is influenced by sensorimotor information related to the lip movements they make while chewing or sucking. Experiment 1 consisted of a classic audiovisual matching procedure, in which two simultaneously displayed talking faces (visual [i] and [u]) were presented with a synchronous vowel sound (audio /i/ or /u/). Infants' looking patterns were selectively biased away from the audiovisual matching face when the infants were producing lip movements similar to those needed to produce the heard vowel. Infants' looking patterns returned to those of a baseline condition (no lip movements, looking longer at the audiovisual matching face) when they were producing lip movements that did not match the heard vowel. Experiment 2 confirmed that these sensorimotor effects interacted with the heard vowel, as looking patterns differed when infants produced these same lip movements while seeing and hearing a talking face producing an unrelated vowel (audio /a/). These findings suggest that the development of speech perception and speech production may be mutually informative.