Search Results

1,756 results for "Cross-modal"
Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective
Fine-grained representation is fundamental to species classification based on deep learning, and in this context, cross-modal contrastive learning is an effective method. The diversity of species, coupled with the inherent contextual ambiguity of natural language, poses a primary challenge for cross-modal representation alignment of conservation area image data. Integrating cross-modal retrieval tasks with generation tasks contributes to cross-modal representation alignment based on contextual understanding. However, during contrastive learning, apart from learning the differences in the data itself, a pair of encoders inevitably learns the differences caused by encoder fluctuations. The latter leads to convergence shortcuts, resulting in poor representation quality and a shared feature space that inaccurately reflects the similarity relationships between samples in the original dataset. To achieve fine-grained cross-modal representation alignment, we first propose a residual attention network to enhance consistency during momentum updates in cross-modal encoders. Building upon this, we propose momentum encoding from a multi-task perspective as a bridge for cross-modal information, effectively improving cross-modal mutual information and representation quality, and optimizing the distribution of feature points within the cross-modal shared semantic space. By acquiring momentum encoding queues for cross-modal semantic understanding through multi-tasking, we align ambiguous natural language representations around the invariant image features of factual information, alleviating contextual ambiguity and enhancing model robustness. Experimental validation shows that our proposed multi-task cross-modal momentum encoder outperforms comparable models on standardized image classification tasks and image–text cross-modal retrieval tasks on public datasets by up to 8%, demonstrating the effectiveness of the proposed method. Qualitative experiments on our self-built conservation area image–text paired dataset show that the proposed method accurately performs cross-modal retrieval and generation tasks among 8142 species, proving its effectiveness on fine-grained image–text conservation area datasets.
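The abstract describes a momentum-encoder mechanism but gives no implementation. The sketch below is only a minimal MoCo-style illustration of the momentum update and InfoNCE objective it builds on, in PyTorch; the class and variable names, the momentum coefficient, and the temperature are assumptions, and the paper's residual attention network and multi-task queues are not reproduced here.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class MomentumPair(nn.Module):
    """One branch of a cross-modal contrastive model: an online encoder
    trained by gradient descent plus a momentum-updated target encoder."""

    def __init__(self, encoder: nn.Module, momentum: float = 0.999):
        super().__init__()
        self.online = encoder                 # updated by the optimizer
        self.target = copy.deepcopy(encoder)  # updated only by momentum
        for p in self.target.parameters():
            p.requires_grad = False
        self.m = momentum

    @torch.no_grad()
    def momentum_update(self) -> None:
        # target <- m * target + (1 - m) * online
        for po, pt in zip(self.online.parameters(), self.target.parameters()):
            pt.data.mul_(self.m).add_(po.data, alpha=1.0 - self.m)

# Toy usage: a linear "encoder" over 512-d features and an InfoNCE loss.
branch = MomentumPair(nn.Linear(512, 128))
feats = torch.randn(8, 512)
q = F.normalize(branch.online(feats), dim=-1)      # query embeddings
with torch.no_grad():
    k = F.normalize(branch.target(feats), dim=-1)  # momentum keys
logits = q @ k.t() / 0.07                          # temperature 0.07
loss = F.cross_entropy(logits, torch.arange(8))    # matched pairs on the diagonal
loss.backward()
branch.momentum_update()
```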
Generic HRTFs May Be Good Enough in Virtual Reality: Improving Source Localization through Cross-Modal Plasticity
Auditory spatial localization in humans is performed using a combination of interaural time differences, interaural level differences, as well as spectral cues provided by the geometry of the ear. To render spatialized sounds within a virtual reality (VR) headset, either individualized or generic Head Related Transfer Functions (HRTFs) are usually employed. The former require arduous calibrations, but enable accurate auditory source localization, which may lead to a heightened sense of presence within VR. The latter obviate the need for individualized calibrations, but result in less accurate auditory source localization. Previous research on auditory source localization in the real world suggests that our representation of acoustic space is highly plastic. In light of these findings, we investigated whether auditory source localization could be improved for users of generic HRTFs via cross-modal learning. The results show that pairing a dynamic auditory stimulus, with a spatio-temporally aligned visual counterpart, enabled users of generic HRTFs to improve subsequent auditory source localization. Exposure to the auditory stimulus alone or to asynchronous audiovisual stimuli did not improve auditory source localization. These findings have important implications for human perception as well as the development of VR systems as they indicate that generic HRTFs may be enough to enable good auditory source localization in VR.
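As background to this entry (not the authors' code): rendering a spatialized source with any HRTF, generic or individualized, reduces to convolving the mono signal with the left- and right-ear head-related impulse responses (HRIRs) for the source direction. A minimal sketch, with placeholder impulse responses standing in for a real HRIR lookup:

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(mono: np.ndarray, hrir_left: np.ndarray,
               hrir_right: np.ndarray) -> np.ndarray:
    """Render a mono signal as binaural stereo by HRIR convolution."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)

# Placeholder HRIRs; a real renderer would look these up from a generic
# HRTF dataset by azimuth and elevation.
fs = 48_000
mono = np.random.randn(fs)                 # 1 s of noise as a test source
hrir_l = np.zeros(256); hrir_l[0] = 1.0    # dummy left-ear response
hrir_r = np.zeros(256); hrir_r[12] = 0.7   # crude ITD/ILD stand-in, right ear
stereo = spatialize(mono, hrir_l, hrir_r)  # shape (len(mono) + 255, 2)
```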
Sliding perspectives: dissociating ownership from self-location during full body illusions in virtual reality
Bodily illusions have been used to study bodily self-consciousness and disentangle its various components, among them the sense of ownership and self-location. Congruent multimodal correlations between the real body and a fake humanoid body can in fact trigger the illusion that the fake body is one's own and/or disrupt the unity between the perceived self-location and the position of the physical body. However, the extent to which changes in self-location entail changes in ownership is still a matter of debate. Here we address this problem with the support of immersive virtual reality. Congruent visuotactile stimulation was delivered to healthy participants to trigger full body illusions from different visual perspectives, each resulting in a different degree of overlap between the real and virtual body. Changes in ownership and self-location were measured with novel self-posture assessment tasks and with an adapted version of the cross-modal congruency task. We found that, despite their strong coupling, self-location and ownership can be selectively altered: self-location was affected when participants had a third-person perspective over the virtual body, while ownership toward the virtual body was experienced only in the conditions with total or partial overlap. Thus, when the virtual body was seen in far extra-personal space, changes in self-location were not coupled with changes in ownership. When a partial spatial overlap was present, ownership was instead typically experienced together with a boosted change in perceived self-location. We discuss the results in the context of current knowledge of the multisensory integration mechanisms contributing to self-body perception. We argue that changes in perceived self-location are associated with the dynamic representation of peripersonal space encoded by visuotactile neurons. Our results also speak in favor of visuo-proprioceptive neuronal populations being a driving trigger in full body ownership illusions.
Cross-Modal Simplex Center Learning for Speech-Face Association
Speech-face association aims to achieve identity matching between facial images and voice segments by aligning cross-modal features. Existing research primarily focuses on learning shared-space representations and computing one-to-one similarities between cross-modal sample pairs to establish their correlation. However, these approaches do not fully account for intra-class variations between the modalities or the many-to-many relationships among cross-modal samples, which are crucial for robust association modeling. To address these challenges, we propose a novel framework that leverages global information to align voice and face embeddings while effectively correlating identity information embedded in both modalities. First, we jointly pre-train face recognition and speaker recognition networks to encode discriminative features from facial images and voice segments. This shared pre-training step ensures the extraction of complementary identity information across modalities. Subsequently, we introduce a cross-modal simplex center loss, which aligns samples with identity centers located at the vertices of a regular simplex inscribed on a hypersphere. This design enforces an equidistant and balanced distribution of identity embeddings, reducing intra-class variations. Furthermore, we employ an improved triplet center loss that emphasizes hard sample mining and optimizes inter-class separability, enhancing the model’s ability to generalize across challenging scenarios. Extensive experiments validate the effectiveness of our framework, demonstrating superior performance across various speech-face association tasks, including matching, verification, and retrieval. Notably, in the challenging gender-constrained matching task, our method achieves a remarkable accuracy of 79.22%, significantly outperforming existing approaches. These results highlight the potential of the proposed framework to advance the state of the art in cross-modal identity association.
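The simplex geometry in this entry has a compact constructive form: centering the standard basis of R^C and renormalizing yields C equidistant unit vectors, i.e., the vertices of a regular simplex inscribed in the hypersphere. The loss below is a hedged sketch of the general idea (cosine distance to a fixed identity vertex), not the paper's exact formulation; all names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def simplex_vertices(num_classes: int) -> torch.Tensor:
    """Vertices of a regular simplex inscribed in the unit hypersphere.

    Centering the standard basis of R^C and renormalizing gives C points
    that are pairwise equidistant and lie on the unit sphere.
    """
    eye = torch.eye(num_classes)
    centered = eye - eye.mean(dim=0, keepdim=True)
    return F.normalize(centered, dim=1)

def simplex_center_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                        centers: torch.Tensor) -> torch.Tensor:
    """Pull L2-normalized embeddings toward their identity's vertex."""
    z = F.normalize(embeddings, dim=1)
    return (1.0 - (z * centers[labels]).sum(dim=1)).mean()  # cosine distance

# Toy usage with 10 identities; in this simple construction the embedding
# dimension must match the number of identity centers.
centers = simplex_vertices(10)               # shape (10, 10)
z = torch.randn(32, 10, requires_grad=True)  # stand-in face/voice embeddings
labels = torch.randint(0, 10, (32,))
loss = simplex_center_loss(z, labels, centers)
loss.backward()
```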
The effect of transcranial random noise stimulation (tRNS) over bilateral parietal cortex in visual cross-modal conflicts
In complex sensory environments, visual cross-modal conflicts often affect auditory performance. The inferior parietal cortex (IPC) is involved in processing visual conflicts, particularly when cognitive control processes such as inhibitory control and working memory are required. This study investigated the effect of bilateral IPC tRNS on reducing visual cross-modal conflicts and explored whether its efficacy depends on the conflict type. Forty-four young adults were randomly allocated to receive either active tRNS (100–640 Hz, 2 mA for 20 min) or sham stimulation. Participants repeatedly performed tasks in three phases: before, during, and after stimulation. Results showed that tRNS significantly enhanced task accuracy in both semantic and non-semantic conflicts compared to sham, with a greater benefit for semantic conflicts after stimulation. Correlation analyses indicated that individuals with lower baseline performance benefited more from active tRNS during stimulation in the non-semantic conflict task. There were no significant between-group differences in reaction time for either conflict type. These findings provide important evidence for the use of tRNS in reducing visual cross-modal conflicts, particularly in suppressing semantic distractors, and highlight the critical role of bilateral IPC in modulating visual cross-modal conflicts.
The neural representations underlying asymmetric cross‐modal prediction of words
Cross‐modal prediction serves a crucial adaptive role in the multisensory world, yet the neural mechanisms underlying this prediction are poorly understood. The present study addressed this important question by combining a novel audiovisual sequence memory task, functional magnetic resonance imaging (fMRI), and multivariate neural representational analyses. Our behavioral results revealed a reliable asymmetric cross‐modal predictive effect, with a stronger prediction from the visual to the auditory (VA) modality than from the auditory to the visual (AV) modality. Mirroring the behavioral pattern, we found that the superior parietal lobe (SPL) showed higher pattern similarity for VA than AV pairs, and the strength of the predictive coding in the SPL was positively correlated with the behavioral predictive effect in the VA condition. Representational connectivity analyses further revealed that the SPL mediated the neural pathway from the visual to the auditory cortex in the VA condition but was not involved in the auditory-to-visual cortex pathway in the AV condition. Direct neural pathways within the unimodal regions were found for the visual‐to‐visual and auditory‐to‐auditory predictions. Together, these results provide novel insights into the neural mechanisms underlying cross‐modal sequence prediction.
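The "pattern similarity" analyses mentioned in this entry are not spelled out here; a common operationalization in multivariate fMRI work is the Pearson correlation between multivoxel activity patterns. A minimal sketch under that assumption (the data are random stand-ins, not study data):

```python
import numpy as np

def pattern_similarity(pat_a: np.ndarray, pat_b: np.ndarray) -> float:
    """Pearson correlation between two multivoxel activity patterns."""
    a = (pat_a - pat_a.mean()) / pat_a.std()
    b = (pat_b - pat_b.mean()) / pat_b.std()
    return float((a * b).mean())

# Toy comparison over 200 "voxels" in a region of interest: similarity of
# the pattern evoked by a visual cue to that of its auditory successor.
rng = np.random.default_rng(0)
visual_cue = rng.standard_normal(200)
auditory_target = rng.standard_normal(200)
print(pattern_similarity(visual_cue, auditory_target))
```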
Adaptive benefit of cross-modal plasticity following cochlear implantation in deaf adults
Significance: Following sensory deprivation, the sensory brain regions can become colonized by the other intact sensory modalities. In deaf individuals, evidence suggests that visual language recruits auditory brain regions and may limit hearing restoration with a cochlear implant. This suggestion underpins current rehabilitative recommendations that deaf individuals undergoing cochlear implantation should avoid using visual language. However, here we show the opposite: Recruitment of auditory brain regions by visual speech after implantation is associated with better speech understanding with a cochlear implant. This suggests adaptive benefits of visual communication because visual speech may serve to optimize, rather than hinder, restoration of hearing following implantation. These findings have implications for both neuroscientific theory and the clinical rehabilitation of cochlear implant patients worldwide. It has been suggested that visual language is maladaptive for hearing restoration with a cochlear implant (CI) due to cross-modal recruitment of auditory brain regions. Rehabilitative guidelines therefore discourage the use of visual language. However, neuroscientific understanding of cross-modal plasticity following cochlear implantation has been restricted due to incompatibility between established neuroimaging techniques and the surgically implanted electronic and magnetic components of the CI. As a solution to this problem, here we used functional near-infrared spectroscopy (fNIRS), a noninvasive optical neuroimaging method that is fully compatible with a CI and safe for repeated testing. The aim of this study was to examine cross-modal activation of auditory brain regions by visual speech from before to after implantation and its relation to CI success. Using fNIRS, we examined activation of superior temporal cortex to visual speech in the same profoundly deaf adults both before and 6 mo after implantation. Patients’ ability to understand auditory speech with their CI was also measured following 6 mo of CI use. Contrary to existing theory, the results demonstrate that increased cross-modal activation of auditory brain regions by visual speech from before to after implantation is associated with better speech understanding with a CI. Furthermore, activation of auditory cortex by visual and auditory speech developed in synchrony after implantation. Together these findings suggest that cross-modal plasticity by visual speech does not exert previously assumed maladaptive effects on CI success, but instead provides adaptive benefits to the restoration of hearing after implantation through an audiovisual mechanism.
Cortical Neuroplasticity and Cognitive Function in Early-Stage, Mild-Moderate Hearing Loss: Evidence of Neurocognitive Benefit From Hearing Aid Use
Age-related hearing loss (ARHL) is associated with cognitive decline as well as structural and functional brain changes. However, the mechanisms underlying neurocognitive deficits in ARHL are poorly understood and it is unclear whether clinical treatment with hearing aids may modify neurocognitive outcomes. To address these topics, cortical visual evoked potentials (CVEPs), cognitive function, and speech perception abilities were measured in 28 adults with untreated, mild-moderate ARHL and 13 age-matched normal hearing (NH) controls. The group of adults with ARHL were then fit with bilateral hearing aids and re-evaluated after 6 months of amplification use. At baseline, the ARHL group exhibited more extensive recruitment of auditory, frontal, and pre-frontal cortices during a visual motion processing task, providing evidence of cross-modal re-organization and compensatory cortical neuroplasticity. Further, more extensive cross-modal recruitment of the right auditory cortex was associated with greater degree of hearing loss, poorer speech perception in noise, and worse cognitive function. Following clinical treatment with hearing aids, a reversal in cross-modal re-organization of auditory cortex by vision was observed in the ARHL group, coinciding with gains in speech perception and cognitive performance. Thus, beyond the known benefits of hearing aid use on communication, outcomes from this study provide evidence that clinical intervention with well-fit amplification may promote more typical cortical organization and functioning and provide cognitive benefit.
Cross-Modal Transformer-Based Streaming Dense Video Captioning with Neural ODE Temporal Localization
Dense video captioning is a critical task in video understanding, requiring precise temporal localization of events and the generation of detailed, contextually rich descriptions. However, current state-of-the-art (SOTA) models face significant challenges in event boundary detection, contextual understanding, and real-time processing, limiting their applicability to complex, multi-event videos. In this paper, we introduce CMSTR-ODE, a novel Cross-Modal Streaming Transformer with Neural ODE Temporal Localization framework for dense video captioning. Our model incorporates three key innovations: (1) Neural ODE-based Temporal Localization for continuous and efficient event boundary prediction, improving the accuracy of temporal segmentation; (2) cross-modal memory retrieval, which enriches video features with external textual knowledge, enabling more context-aware and descriptive captioning; and (3) a Streaming Multi-Scale Transformer Decoder that generates captions in real time, handling objects and events of varying scales. We evaluate CMSTR-ODE on three benchmark datasets: YouCook2, Flickr30k, and ActivityNet Captions, where it achieves SOTA performance, significantly outperforming existing models in terms of CIDEr, BLEU-4, and ROUGE scores. Our model also demonstrates superior computational efficiency, processing videos at 15 frames per second, making it suitable for real-time applications such as video surveillance and live video captioning. Ablation studies highlight the contributions of each component, confirming the effectiveness of our approach. By addressing the limitations of current methods, CMSTR-ODE sets a new benchmark for dense video captioning, offering a robust and scalable solution for both real-time and long-form video understanding tasks.
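The Neural ODE component is the most unusual piece of this design: instead of scoring boundaries at fixed frame positions, a hidden state is evolved continuously in time by a learned derivative and read out as a boundary score. The sketch below is a rough, assumed illustration of that idea in PyTorch, using fixed-step Euler integration in place of an adaptive ODE solver; none of the names or sizes come from the paper.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Learned time derivative dh/dt = f(h) of the event-boundary state."""

    def __init__(self, dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                                 nn.Linear(dim, dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

def euler_integrate(func: ODEFunc, h0: torch.Tensor, steps: int,
                    dt: float = 0.1) -> torch.Tensor:
    """Fixed-step Euler integration; returns the full state trajectory."""
    h, traj = h0, [h0]
    for _ in range(steps):
        h = h + dt * func(h)
        traj.append(h)
    return torch.stack(traj)               # (steps + 1, batch, dim)

# Toy usage: seed the state with encoded frame features, then read out a
# boundary probability at each integration time point.
func = ODEFunc()
readout = nn.Linear(32, 1)
h0 = torch.randn(4, 32)                    # stand-in frame features
traj = euler_integrate(func, h0, steps=20)
boundary_scores = torch.sigmoid(readout(traj)).squeeze(-1)  # (21, 4)
```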
Cats match voice and face: cross-modal representation of humans in cats (Felis catus)
We examined whether cats have a cross-modal representation of humans, using a cross-modal expectancy violation paradigm originally used with dogs by Adachi et al. (Anim Cogn 10:17–21, 2007). We compared cats living in houses and in cat cafés to assess the potential effect of postnatal experience. Cats were presented with the face of either their owner or a stranger on a laptop monitor after playback of the voice of one of two people calling the subject’s name. In half of the trials, the voice and face were of the same person (congruent condition), whereas in the other half the stimuli did not match (incongruent condition). The café cats paid attention to the monitor longer in incongruent than in congruent conditions, showing an expectancy violation; house cats showed no similar tendency. These results show that at least café cats can predict their owner’s face upon hearing the owner’s voice, suggesting that they possess a cross-modal representation of at least one human. There may be a minimal kind or amount of postnatal experience that leads to the formation of a cross-modal representation of a specific person.