Catalogue Search | MBRL
Search Results Heading
Explore the vast range of titles available.
MBRLSearchResults
-
DisciplineDiscipline
-
Is Peer ReviewedIs Peer Reviewed
-
Item TypeItem Type
-
SubjectSubject
-
YearFrom:-To:
-
More FiltersMore FiltersSourceLanguage
Done
Filters
Reset
30
result(s) for
"Xu, Boshen"
Sort by:
Sintering protonic zirconate cells with enhanced electrolysis stability and Faradaic efficiency
by
Dong, Pei
,
Chen, Dongchang
,
Koomson, Samuel
in
639/301/299/893
,
639/4077/893
,
Chemical synthesis
2025
The emerging applications of steam electrolysis and electrochemical synthesis at 300–600 °C set stringent requirements on the stability of protonic ceramic cells, which cannot be met by Ce-rich electrolytes. A promising candidate is Ce-free BaZr
0.8
Y
0.2
O
3−
δ
, but its usage has long been hindered due to the high sintering temperatures required for protonic devices. Here we resolved the issue through a co-sintering process, in which the shrinkage stress of a readily sinterable support layer helps to densify the pure BaZr
0.8
Y
0.2
O
3
–δ
electrolyte membrane at low temperatures. This approach eliminates Ce and harmful sintering aids in the dense zirconate electrolyte membrane, thereby enhancing the Faradaic efficiency and electrochemical stability, especially under harsh operating conditions. The protonic zirconate cells have exceptional performance and demonstrate stable high-steam pressure electrolysis up to 0.7 atm steam pressure, −2 A cm
−2
current density and over 800 h of dynamic operation at 600 °C. Our processing breakthrough enables stabilized protonic cells for demanding applications in future energy infrastructure.
Emerging applications of steam electrolysis and electrochemical synthesis for future hydrogen technologies at intermediate temperatures set stringent requirements on the stability of protonic ceramic cells. Now a sintering approach enables densified Ce-free protonic zirconate cells with enhanced Faradaic efficiency and exceptional stability under harsh operating conditions.
Journal Article
AIGC Enabling Non-Genetic Design Methods and Practices
2024
Artificial Intelligence Generated Content (AIGC) technology aligns seamlessly with the design requirements of non-genetic heritage, offering a viable pathway for its modernization. This paper delineates the specific design needs of non-genetic heritage and utilizes a diffusion model to create themed images and animations related to this heritage. Additionally, AIGC is employed to enhance the creation of virtual reality interactive imagery. The Long Short-Term Memory (LSTM) network is deployed to classify time-series gesture data, facilitating the training and categorization of Chinese Sign Language (CSL) gestures for virtual interactive engagement with non-heritage themes. We have integrated the AIGC operation process into the theme of non-genetic inheritance, thereby constructing a robust development trajectory for AIGC-enhanced non-genetic heritage. The experimental setup is crafted to ascertain the optimal number of iterations and training durations through the control variable method. We evaluate the efficacy of the diffusion model for anti-implicit writing analysis and the performance of the speech recognition, text dialogue, and text response modules within the non-heritage multimodal interactive framework using Word Error Rate (WER) and Mean Opinion Score (MOS). A descriptive analysis of users’ interactive experiences with non-heritage content is also conducted. The results indicate that the speech recognition module achieved a WER of 0.365, while the text response module garnered an MOS of 4.49 with a standard deviation of 0.56. This multimodal, non-heritage virtual interaction, leveraging multiple modalities, enriches users’ experiences and deepens their understanding and appreciation of non-heritage content. Consequently, this enhances the high-quality development of non-genetic heritage.
Journal Article
POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World
2024
We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view. However, current methods struggle to replicate this ability of view adaptation from third-person to first-person. Although some approaches attempt to learn view-agnostic representation from large-scale video datasets, they ignore the relationships among multiple third-person views. To this end, we propose a Prompt-Oriented View-agnostic learning (POV) framework in this paper, which enables this view adaptation with few egocentric videos. Specifically, We introduce interactive masking prompts at the frame level to capture fine-grained action information, and view-aware prompts at the token level to learn view-agnostic representation. To verify our method, we establish two benchmarks for transferring from multiple third-person views to the egocentric view. Our extensive experiments on these benchmarks demonstrate the efficiency and effectiveness of our POV framework and prompt tuning techniques in terms of view adaptation and view generalization. Our code is available at https://github.com/xuboshen/pov_acmmm2023.
SPAFormer: Sequential 3D Part Assembly with Transformers
2025
We introduce SPAFormer, an innovative model designed to overcome the combinatorial explosion challenge in the 3D Part Assembly (3D-PA) task. This task requires accurate prediction of each part's poses in sequential steps. As the number of parts increases, the possible assembly combinations increase exponentially, leading to a combinatorial explosion that severely hinders the efficacy of 3D-PA. SPAFormer addresses this problem by leveraging weak constraints from assembly sequences, effectively reducing the solution space's complexity. Since the sequence of parts conveys construction rules similar to sentences structured through words, our model explores both parallel and autoregressive generation. We further strengthen SPAFormer through knowledge enhancement strategies that utilize the attributes of parts and their sequence information, enabling it to capture the inherent assembly pattern and relationships among sequentially ordered parts. We also construct a more challenging benchmark named PartNet-Assembly covering 21 varied categories to more comprehensively validate the effectiveness of SPAFormer. Extensive experiments demonstrate the superior generalization capabilities of SPAFormer, particularly with multi-tasking and in scenarios requiring long-horizon assembly. Code is available at https://github.com/xuboshen/SPAFormer.
SPAFormer: Sequential 3D Part Assembly with Transformers
2024
We introduce SPAFormer, an innovative model designed to overcome the combinatorial explosion challenge in the 3D Part Assembly (3D-PA) task. This task requires accurate prediction of each part's pose and shape in sequential steps, and as the number of parts increases, the possible assembly combinations increase exponentially, leading to a combinatorial explosion that severely hinders the efficacy of 3D-PA. SPAFormer addresses this problem by leveraging weak constraints from assembly sequences, effectively reducing the solution space's complexity. Since assembly part sequences convey construction rules similar to sentences being structured through words, our model explores both parallel and autoregressive generation. It further enhances assembly through knowledge enhancement strategies that utilize the attributes of parts and their sequence information, enabling it to capture the inherent assembly pattern and relationships among sequentially ordered parts. We also construct a more challenging benchmark named PartNet-Assembly covering 21 varied categories to more comprehensively validate the effectiveness of SPAFormer. Extensive experiments demonstrate the superior generalization capabilities of SPAFormer, particularly with multi-tasking and in scenarios requiring long-horizon assembly. Codes and model weights will be released at https://github.com/xuboshen/SPAFormer.
EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining
2025
Egocentric video-language pretraining has significantly advanced video representation learning. Humans perceive and interact with a fully 3D world, developing spatial awareness that extends beyond text-based understanding. However, most previous works learn from 1D text or 2D visual cues, such as bounding boxes, which inherently lack 3D understanding. To bridge this gap, we introduce EgoDTM, an Egocentric Depth- and Text-aware Model, jointly trained through large-scale 3D-aware video pretraining and video-text contrastive learning. EgoDTM incorporates a lightweight 3D-aware decoder to efficiently learn 3D-awareness from pseudo depth maps generated by depth estimation models. To further facilitate 3D-aware video pretraining, we enrich the original brief captions with hand-object visual cues by organically combining several foundation models. Extensive experiments demonstrate EgoDTM's superior performance across diverse downstream tasks, highlighting its superior 3D-aware visual understanding. Code: https://github.com/xuboshen/EgoDTM.
Unveiling Visual Biases in Audio-Visual Localization Benchmarks
2024
Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias. Such biases hinder these benchmarks from effectively evaluating AVSL models. To further validate our hypothesis regarding visual biases, we examine two representative AVSL benchmarks, VGG-SS and EpicSounding-Object, where the vision-only models outperform all audiovisual baselines. Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.
TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding
2025
We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.
Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation
2026
Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
2026
Self-reflection mechanisms that rely on purely text-based rethinking processes perform well in most multimodal tasks. However, when directly applied to long-form video understanding scenarios, they exhibit clear limitations. The fundamental reasons for this lie in two points: (1)long-form video understanding involves richer and more dynamic visual input, meaning rethinking only the text information is insufficient and necessitates a further rethinking process specifically targeting visual information; (2) purely text-based reflection mechanisms lack cross-modal interaction capabilities, preventing them from fully integrating visual information during reflection. Motivated by these insights, we propose REVISOR (REflective VIsual Segment Oriented Reasoning), a novel framework for tool-augmented multimodal reflection. REVISOR enables MLLMs to collaboratively construct introspective reflection processes across textual and visual modalities, significantly enhancing their reasoning capability for long-form video understanding. To ensure that REVISOR can learn to accurately review video segments highly relevant to the question during reinforcement learning, we designed the Dual Attribution Decoupled Reward (DADR) mechanism. Integrated into the GRPO training strategy, this mechanism enforces causal alignment between the model's reasoning and the selected video evidence. Notably, the REVISOR framework significantly enhances long-form video understanding capability of MLLMs without requiring supplementary supervised fine-tuning or external models, achieving impressive results on four benchmarks including VideoMME, LongVideoBench, MLVU, and LVBench.