Search Results

1,892 result(s) for "Cross-modal"
Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective
Fine-grained representation is fundamental to species classification based on deep learning, and in this context, cross-modal contrastive learning is an effective method. The diversity of species, coupled with the inherent contextual ambiguity of natural language, poses a primary challenge in the cross-modal representation alignment of conservation area image data. Integrating cross-modal retrieval tasks with generation tasks contributes to cross-modal representation alignment based on contextual understanding. However, during the contrastive learning process, apart from learning the differences in the data itself, a pair of encoders inevitably learns the differences caused by encoder fluctuations. The latter leads to convergence shortcuts, resulting in poor representation quality and an inaccurate reflection, within the shared feature space, of the similarity relationships between samples in the original dataset. To achieve fine-grained cross-modal representation alignment, we first propose a residual attention network to enhance consistency during momentum updates in cross-modal encoders. Building upon this, we propose momentum encoding from a multi-task perspective as a bridge for cross-modal information, effectively improving cross-modal mutual information and representation quality and optimizing the distribution of feature points within the cross-modal shared semantic space. By acquiring momentum encoding queues for cross-modal semantic understanding through multi-tasking, we align ambiguous natural language representations around the invariant image features of factual information, alleviating contextual ambiguity and enhancing model robustness.
Experimental validation shows that our proposed multi-task cross-modal momentum encoder outperforms similar models on standardized image classification tasks and image–text cross-modal retrieval tasks on public datasets by up to 8% on the leaderboards, demonstrating the effectiveness of the proposed method. Qualitative experiments on our self-built conservation area image–text paired dataset show that the proposed method accurately performs cross-modal retrieval and generation tasks among 8142 species, proving its effectiveness on fine-grained conservation area image–text datasets.
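The momentum-updated encoders described in this abstract follow the pattern of MoCo-style contrastive learning, where the momentum (key) encoder is an exponential moving average of the query encoder. A minimal pure-Python sketch of that update, with flat lists standing in for encoder weights (function name and shapes illustrative, not from the paper):

```python
def momentum_update(query_params, key_params, m=0.999):
    """EMA update: the key (momentum) encoder trails the query encoder.

    query_params, key_params: lists of floats standing in for encoder
    weights; m is the momentum coefficient. Returns updated key weights
    m * k + (1 - m) * q for each weight pair.
    """
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

# Example: with m = 0.9 the key encoder moves 10% toward the query encoder.
updated = momentum_update([1.0, 2.0], [0.0, 0.0], m=0.9)
```

With m close to 1, the key encoder drifts slowly, which damps the encoder-fluctuation differences the abstract identifies as a source of convergence shortcuts.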
Sliding perspectives: dissociating ownership from self-location during full body illusions in virtual reality
Bodily illusions have been used to study bodily self-consciousness and disentangle its various components, among them the sense of ownership and self-location. Congruent multimodal correlations between the real body and a fake humanoid body can in fact trigger the illusion that the fake body is one's own and/or disrupt the unity between the perceived self-location and the position of the physical body. However, the extent to which changes in self-location entail changes in ownership is still a matter of debate. Here we address this problem with the support of immersive virtual reality. Congruent visuotactile stimulation was delivered to healthy participants to trigger full body illusions from different visual perspectives, each resulting in a different degree of overlap between real and virtual body. Changes in ownership and self-location were measured with novel self-posture assessment tasks and with an adapted version of the cross-modal congruency task. We found that, despite their strong coupling, self-location and ownership can be selectively altered: self-location was affected when having a third person perspective over the virtual body, while ownership toward the virtual body was experienced only in the conditions with total or partial overlap. Thus, when the virtual body was seen in the far extra-personal space, changes in self-location were not coupled with changes in ownership. When a partial spatial overlap was present, ownership was instead typically experienced with a boosted change in the perceived self-location. We discuss these results in the context of current knowledge of the multisensory integration mechanisms contributing to self-body perception. We argue that changes in the perceived self-location are associated with the dynamic representation of peripersonal space encoded by visuotactile neurons. On the other hand, our results speak in favor of visuo-proprioceptive neuronal populations being a driving trigger in full body ownership illusions.
Generic HRTFs May Be Good Enough in Virtual Reality: Improving Source Localization through Cross-Modal Plasticity
Auditory spatial localization in humans is performed using a combination of interaural time differences, interaural level differences, and spectral cues provided by the geometry of the ear. To render spatialized sounds within a virtual reality (VR) headset, either individualized or generic Head Related Transfer Functions (HRTFs) are usually employed. The former require arduous calibrations but enable accurate auditory source localization, which may lead to a heightened sense of presence within VR. The latter obviate the need for individualized calibrations but result in less accurate auditory source localization. Previous research on auditory source localization in the real world suggests that our representation of acoustic space is highly plastic. In light of these findings, we investigated whether auditory source localization could be improved for users of generic HRTFs via cross-modal learning. The results show that pairing a dynamic auditory stimulus with a spatio-temporally aligned visual counterpart enabled users of generic HRTFs to improve subsequent auditory source localization. Exposure to the auditory stimulus alone or to asynchronous audiovisual stimuli did not improve auditory source localization. These findings have important implications for human perception as well as the development of VR systems, as they indicate that generic HRTFs may be enough to enable good auditory source localization in VR.
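The interaural time differences mentioned above have a classical closed form for a spherical head, Woodworth's approximation ITD = (r/c)(sin θ + θ). This is textbook background on the localization cues the study relies on, not the paper's own method:

```python
import math

def woodworth_itd(azimuth_deg, head_radius_m=0.0875, c=343.0):
    """Interaural time difference (seconds) for a spherical head.

    Woodworth's classic approximation: ITD = r/c * (sin(theta) + theta),
    with theta the source azimuth in radians (0 = straight ahead),
    r the head radius, and c the speed of sound in air.
    """
    theta = math.radians(azimuth_deg)
    return (head_radius_m / c) * (math.sin(theta) + theta)

# A source at 90 degrees azimuth yields the maximum ITD, roughly
# 0.66 ms for an average head radius.
```

Individualized HRTFs refine such coarse geometric cues with the listener's own spectral filtering; the study shows that audiovisual exposure can partly compensate when only generic filters are available.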
SFCF-Net: Spatial-Frequency Synergistic Learning for Casting Defect Segmentation of Pre-Service Aircraft Engine Blades in Industrial Radiographic Inspection
Turbine blades serve as critical components in aircraft engines, yet casting defects inevitably arise during manufacturing. Therefore, accurate pre-service turbine blade defect detection is critical for aircraft engine safety. However, existing deep learning-based detection methods face several challenges: poor image quality, intra-class variance, inter-class similarity, and irregular defect geometries. Moreover, most existing defect detection methods rely primarily on spatial-domain features, which are insufficient for capturing fine-grained texture information, limiting their ability to discriminate complex defect patterns. To address these challenges, we propose a novel Spatial-Frequency Complementary Fusion Network (SFCF-Net) that synergistically integrates spatial- and frequency-domain features through complementary cross-modal fusion for accurate defect segmentation. First, we introduce a Selective Cross-modal Calibration (SCC) module that selectively calibrates spatial-frequency features through gated cross-modal interactions, effectively preserving fine-grained details under poor image conditions. Next, we propose a Cross-modal Refinement and Complementation (CRC) module that employs dual-stage attention mechanisms to model intra- and inter-modal feature dependencies, enabling robust discrimination between similar defect categories while maintaining consistency within the same defect class. Finally, we propose an Asymmetric Window Attention (AWA) module that employs bidirectional rectangular windows for accurate defect geometric characterization. Comprehensive experiments on the Aero-engine Turbine Blade Casting Defect Segmentation (ATBCD-Seg) dataset and a public benchmark demonstrate that SFCF-Net consistently outperforms state-of-the-art methods across multiple evaluation metrics, meeting practical requirements for automated quality control in blade manufacturing.
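The abstract describes the SCC module's gated cross-modal interaction only at a high level. A common form of gated fusion computes fused = g * spatial + (1 - g) * freq with a sigmoid gate g. A scalar toy sketch of that idea (the gate parameters w and b stand in for learned weights; this is not the paper's implementation):

```python
import math

def gated_fusion(spatial_feat, freq_feat, w=1.0, b=0.0):
    """Toy gated fusion of paired spatial- and frequency-domain features.

    For each feature pair, a sigmoid gate decides how much of each
    modality to keep: fused = g * spatial + (1 - g) * freq, where
    g = sigmoid(w * (spatial + freq) + b). A scalar stand-in for the
    learned gating a module like SCC would perform per channel.
    """
    fused = []
    for s, f in zip(spatial_feat, freq_feat):
        g = 1.0 / (1.0 + math.exp(-(w * (s + f) + b)))
        fused.append(g * s + (1.0 - g) * f)
    return fused
```

Because the gate is a convex weight in [0, 1], each fused value stays between its two inputs, which is what lets the fusion preserve fine-grained detail from whichever domain carries it.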
Adaptive benefit of cross-modal plasticity following cochlear implantation in deaf adults
Significance: Following sensory deprivation, the sensory brain regions can become colonized by the other intact sensory modalities. In deaf individuals, evidence suggests that visual language recruits auditory brain regions and may limit hearing restoration with a cochlear implant. This suggestion underpins current rehabilitative recommendations that deaf individuals undergoing cochlear implantation should avoid using visual language. However, here we show the opposite: recruitment of auditory brain regions by visual speech after implantation is associated with better speech understanding with a cochlear implant. This suggests adaptive benefits of visual communication because visual speech may serve to optimize, rather than hinder, restoration of hearing following implantation. These findings have implications for both neuroscientific theory and the clinical rehabilitation of cochlear implant patients worldwide.
It has been suggested that visual language is maladaptive for hearing restoration with a cochlear implant (CI) due to cross-modal recruitment of auditory brain regions. Rehabilitative guidelines therefore discourage the use of visual language. However, neuroscientific understanding of cross-modal plasticity following cochlear implantation has been restricted due to incompatibility between established neuroimaging techniques and the surgically implanted electronic and magnetic components of the CI. As a solution to this problem, here we used functional near-infrared spectroscopy (fNIRS), a noninvasive optical neuroimaging method that is fully compatible with a CI and safe for repeated testing. The aim of this study was to examine cross-modal activation of auditory brain regions by visual speech from before to after implantation and its relation to CI success. Using fNIRS, we examined activation of superior temporal cortex to visual speech in the same profoundly deaf adults both before and 6 mo after implantation.
Patients’ ability to understand auditory speech with their CI was also measured following 6 mo of CI use. Contrary to existing theory, the results demonstrate that increased cross-modal activation of auditory brain regions by visual speech from before to after implantation is associated with better speech understanding with a CI. Furthermore, activation of auditory cortex by visual and auditory speech developed in synchrony after implantation. Together these findings suggest that cross-modal plasticity by visual speech does not exert previously assumed maladaptive effects on CI success, but instead provides adaptive benefits to the restoration of hearing after implantation through an audiovisual mechanism.
Cross-Modal Simplex Center Learning for Speech-Face Association
Speech-face association aims to achieve identity matching between facial images and voice segments by aligning cross-modal features. Existing research primarily focuses on learning shared-space representations and computing one-to-one similarities between cross-modal sample pairs to establish their correlation. However, these approaches do not fully account for intra-class variations between the modalities or the many-to-many relationships among cross-modal samples, which are crucial for robust association modeling. To address these challenges, we propose a novel framework that leverages global information to align voice and face embeddings while effectively correlating identity information embedded in both modalities. First, we jointly pre-train face recognition and speaker recognition networks to encode discriminative features from facial images and voice segments. This shared pre-training step ensures the extraction of complementary identity information across modalities. Subsequently, we introduce a cross-modal simplex center loss, which aligns samples with identity centers located at the vertices of a regular simplex inscribed on a hypersphere. This design enforces an equidistant and balanced distribution of identity embeddings, reducing intra-class variations. Furthermore, we employ an improved triplet center loss that emphasizes hard sample mining and optimizes inter-class separability, enhancing the model’s ability to generalize across challenging scenarios. Extensive experiments validate the effectiveness of our framework, demonstrating superior performance across various speech-face association tasks, including matching, verification, and retrieval. Notably, in the challenging gender-constrained matching task, our method achieves a remarkable accuracy of 79.22%, significantly outperforming existing approaches. These results highlight the potential of the proposed framework to advance the state of the art in cross-modal identity association.
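The identity centers at the vertices of a regular simplex inscribed in a hypersphere, as used by the simplex center loss above, can be built with a standard construction: take the k standard basis vectors, subtract their centroid, and normalize. A sketch of that construction (not code from the paper):

```python
import math

def simplex_centers(k):
    """Unit-norm identity centers at the vertices of a regular simplex.

    Take the k standard basis vectors in R^k, subtract their centroid,
    and normalize. The resulting k unit vectors are pairwise equidistant
    with inner product -1/(k-1): the equiangular, balanced layout that
    a simplex center loss aligns identity embeddings to.
    """
    centers = []
    for i in range(k):
        v = [(1.0 if j == i else 0.0) - 1.0 / k for j in range(k)]
        norm = math.sqrt(sum(x * x for x in v))
        centers.append([x / norm for x in v])
    return centers
```

Because every pair of centers is equally far apart, no identity is geometrically privileged, which is the property that reduces intra-class variation relative to centers learned freely.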
The effect of transcranial random noise stimulation (tRNS) over bilateral parietal cortex in visual cross-modal conflicts
In complex sensory environments, visual cross-modal conflicts often affect auditory performance. The inferior parietal cortex (IPC) is involved in processing visual conflicts, particularly when cognitive control processes such as inhibitory control and working memory are required. This study investigated the effect of bilateral IPC tRNS on reducing visual cross-modal conflicts and explored whether its efficacy depends on the conflict type. Forty-four young adults were randomly allocated to receive either active tRNS (100–640 Hz, 2 mA for 20 min) or sham stimulation. Participants repeatedly performed tasks in three phases: before, during, and after stimulation. Results showed that tRNS significantly enhanced task accuracy across both semantic and non-semantic conflicts compared to sham, and produced a greater benefit in semantic conflict after stimulation. Correlation analyses indicated that individuals with lower baseline performance benefited more from active tRNS during stimulation in the non-semantic conflict task. There were no significant differences between groups in reaction time for either conflict type. These findings provide important evidence for the use of tRNS in reducing visual cross-modal conflicts, particularly in suppressing semantic distractors, and highlight the critical role of bilateral IPC in modulating visual cross-modal conflicts.
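The band-limited noise that defines tRNS (here 100–640 Hz) can be illustrated by synthesizing such a waveform as a sum of random-phase sinusoids drawn from the band. A toy sketch, not the authors' stimulation software, with all parameters illustrative:

```python
import math
import random

def trns_waveform(duration_s=0.01, fs=10000, f_low=100.0, f_high=640.0,
                  amplitude_ma=1.0, n_components=64, seed=0):
    """Band-limited noise sketch for tRNS.

    Sums n_components sinusoids with frequencies drawn uniformly from
    [f_low, f_high] and random phases, then rescales so the peak equals
    amplitude_ma (milliamps). Returns a list of samples at rate fs.
    """
    rng = random.Random(seed)
    freqs = [rng.uniform(f_low, f_high) for _ in range(n_components)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]
    n = int(duration_s * fs)
    raw = [sum(math.sin(2.0 * math.pi * f * t / fs + p)
               for f, p in zip(freqs, phases)) for t in range(n)]
    peak = max(abs(x) for x in raw) or 1.0
    return [amplitude_ma * x / peak for x in raw]
```

The key property is that the spectrum is confined to the stated band, unlike white noise; clinical devices additionally control current density and ramping, which this sketch ignores.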
The neural representations underlying asymmetric cross‐modal prediction of words
Cross‐modal prediction serves a crucial adaptive role in the multisensory world, yet the neural mechanisms underlying this prediction are poorly understood. The present study addressed this important question by combining a novel audiovisual sequence memory task, functional magnetic resonance imaging (fMRI), and multivariate neural representational analyses. Our behavioral results revealed a reliable asymmetric cross‐modal predictive effect, with a stronger prediction from the visual to the auditory (VA) modality than from the auditory to the visual (AV) modality. Mirroring the behavioral pattern, we found that the superior parietal lobe (SPL) showed higher pattern similarity for VA than AV pairs, and the strength of the predictive coding in the SPL was positively correlated with the behavioral predictive effect in the VA condition. Representational connectivity analyses further revealed that the SPL mediated the neural pathway from the visual to the auditory cortex in the VA condition but was not involved in the auditory-to-visual cortex pathway in the AV condition. Direct neural pathways within the unimodal regions were found for the visual-to-visual and auditory-to-auditory predictions. Together, these results provide novel insights into the neural mechanisms underlying cross‐modal sequence prediction.
Highlights: Visual–auditory (VA) prediction was stronger than auditory–visual (AV) prediction; the SPL showed higher neural similarity for VA than AV prediction; an indirect pathway from the visual to the auditory cortex via the SPL underlies VA prediction; in contrast, direct connectivity within modality‐specific areas supports within‐modal predictions.
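The "pattern similarity" these multivariate analyses rely on is conventionally the Pearson correlation between voxel-wise activation patterns. A minimal sketch of that standard measure (not the authors' analysis code; assumes non-constant patterns):

```python
import math

def pattern_similarity(x, y):
    """Pearson correlation between two activation patterns.

    x and y are equal-length lists of voxel responses to two
    conditions; the result lies in [-1, 1], with higher values
    meaning more similar multivoxel patterns.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

In an analysis like the one described, higher similarity between the pattern evoked by the predicting stimulus and the predicted one is taken as evidence of predictive coding in that region.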
Cross-Modal Transformer-Based Streaming Dense Video Captioning with Neural ODE Temporal Localization
Dense video captioning is a critical task in video understanding, requiring precise temporal localization of events and the generation of detailed, contextually rich descriptions. However, current state-of-the-art (SOTA) models face significant challenges in event boundary detection, contextual understanding, and real-time processing, limiting their applicability to complex, multi-event videos. In this paper, we introduce CMSTR-ODE, a novel Cross-Modal Streaming Transformer with Neural ODE Temporal Localization framework for dense video captioning. Our model incorporates three key innovations: (1) Neural ODE-based temporal localization for continuous and efficient event boundary prediction, improving the accuracy of temporal segmentation; (2) cross-modal memory retrieval, which enriches video features with external textual knowledge, enabling more context-aware and descriptive captioning; and (3) a Streaming Multi-Scale Transformer Decoder that generates captions in real time, handling objects and events of varying scales. We evaluate CMSTR-ODE on three benchmark datasets (YouCook2, Flickr30k, and ActivityNet Captions), where it achieves SOTA performance, significantly outperforming existing models in terms of CIDEr, BLEU-4, and ROUGE scores. Our model also demonstrates superior computational efficiency, processing videos at 15 frames per second, making it suitable for real-time applications such as video surveillance and live video captioning. Ablation studies highlight the contributions of each component, confirming the effectiveness of our approach. By addressing the limitations of current methods, CMSTR-ODE sets a new benchmark for dense video captioning, offering a robust and scalable solution for both real-time and long-form video understanding tasks.
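A Neural ODE temporal localizer evolves a hidden state continuously in time by integrating a learned dynamics function; the simplest integrator that makes the idea concrete is fixed-step Euler. A sketch with an arbitrary callable standing in for the learned network (illustrative, not the CMSTR-ODE implementation):

```python
def odeint_euler(f, h0, t0, t1, steps=100):
    """Fixed-step Euler integration of dh/dt = f(h, t) from t0 to t1.

    In a Neural-ODE temporal localizer, f would be a learned network
    and the integrated state would be decoded into event boundaries;
    here f is any callable taking (state, time).
    """
    h, t = h0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * f(h, t)
        t += dt
    return h

# Sanity check: dh/dt = h integrates the exponential, so starting from
# h(0) = 1 and integrating to t = 1 approaches e as steps grows.
approx_e = odeint_euler(lambda h, t: h, 1.0, 0.0, 1.0, steps=1000)
```

Treating boundaries as the output of a continuous-time trajectory, rather than a fixed grid of frame-level scores, is what lets such a model place event boundaries between sampled frames.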
Multisensory inclusive design with sensory substitution
Sensory substitution techniques are perceptual and cognitive phenomena used to represent one sensory form with an alternative. Current applications of sensory substitution techniques are typically focused on the development of assistive technologies whereby visually impaired users can acquire visual information via auditory and tactile cross-modal feedback. But despite their evident success in scientific research and furthering theory development in cognition, sensory substitution techniques have not yet gained widespread adoption within sensory-impaired populations. Here we argue that shifting the focus from assistive to mainstream applications may resolve some of the current issues regarding the use of sensory substitution devices to improve outcomes for those with disabilities. This article provides a tutorial guide on how to use research into multisensory processing and sensory substitution techniques from the cognitive sciences to design new inclusive cross-modal displays. A greater focus on developing inclusive mainstream applications could lead to innovative technologies that could be enjoyed by every person.
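A typical visual-to-auditory substitution rule, in the spirit of devices such as The vOICe, maps vertical position to pitch and brightness to loudness while scanning the image column by column. A toy sketch of one column of that mapping (all parameter values illustrative, not from the article):

```python
def column_to_tones(column, f_low=200.0, f_high=8000.0):
    """Map one image column to (frequency_hz, amplitude) pairs.

    Each pixel row gets a frequency on a log scale between f_low and
    f_high (top rows high-pitched), and its brightness (0..1) becomes
    the tone's amplitude. Playing the columns left to right yields a
    soundscape that encodes the image.
    """
    n = len(column)
    tones = []
    for row, brightness in enumerate(column):
        frac = 1.0 - row / max(n - 1, 1)      # top row -> 1.0
        freq = f_low * (f_high / f_low) ** frac
        tones.append((freq, brightness))
    return tones
```

The same mapping could drive a mainstream cross-modal display rather than an assistive one, which is the shift in focus the article argues for.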