Catalogue Search | MBRL
Explore the vast range of titles available.
197 result(s) for "fine-grained feature"
The Subword-Character Multi-Scale Transformer With Learnable Positional Encoding for Machine Translation
by Zhou, Wei; Yao, Wenjing
in fine-grained features; learnable positional encoding; machine translation
2025
The transformer addresses the efficiency bottleneck caused by sequential computation in traditional recurrent neural networks (RNNs) by leveraging self-attention to capture global dependencies in parallel. However, the subword-level modeling units and fixed-pattern positional encodings adopted by mainstream methods struggle to adequately capture fine-grained feature information in morphologically rich languages, limiting the model's ability to flexibly learn target-side word-order patterns. To address these challenges, this study constructs a subword-character multi-scale transformer architecture integrated with a learnable positional encoding mechanism. The model abandons fixed-pattern positional encodings, allowing the positional representation spaces of the source and target languages to be optimized autonomously through end-to-end training, which significantly enhances dynamic adaptability in cross-linguistic positional mapping. While preserving the global semantic modeling advantages of subword units, the framework introduces a lightweight character-level branch to supplement fine-grained features. To fuse the subword and character branches, context-aware cross-attention is employed to dynamically integrate linguistic information at different granularities. The model achieves notable BLEU improvements on the WMT'14 English-German (En-De), WMT'17 Chinese-English (Zh-En), and WMT'16 English-Romanian (En-Ro) benchmark tasks. These results demonstrate the synergistic effect of fine-grained multi-scale modeling and learnable positional encoding in enhancing translation quality and linguistic adaptability through the deep integration of linguistic features at different granularities.
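The cross-attention fusion this abstract describes, where one token stream attends over another, can be sketched minimally in numpy. This is an illustrative toy, not the paper's implementation: dimensions are arbitrary and the learned query/key/value projections are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(subword, char_feats, d_k):
    # Subword tokens act as queries; character tokens supply keys and values,
    # so character-level detail is pulled into the subword stream.
    q, k, v = subword, char_feats, char_feats  # learned projections omitted
    weights = softmax(q @ k.T / np.sqrt(d_k))  # (n_sub, n_char) attention map
    return weights @ v                          # (n_sub, d) fused features

rng = np.random.default_rng(0)
sub = rng.standard_normal((4, 8))    # 4 subword tokens, dim 8 (toy sizes)
ch = rng.standard_normal((12, 8))    # 12 character tokens, dim 8
fused = cross_attention(sub, ch, d_k=8)
```

The fused output keeps the subword sequence length, so it can be added back into the subword branch residually.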
Journal Article
Scene Uyghur Text Detection Based on Fine-Grained Feature Representation
2022
Scene text detection aims to precisely localize text in natural environments. The application scenarios of text detection have gradually shifted from plain document text to more complex natural scenes. In the complex background noise of natural scene images, objects with texture and morphology similar to text are prone to false recall, and multi-scale text is difficult to detect. To address this, a multi-directional scene Uyghur text detection model based on fine-grained feature representation and spatial feature fusion is proposed, in which feature extraction and feature fusion are improved to enhance the network's ability to represent multi-scale features. First, multiple groups of 3 × 3 convolutional filters connected in a hierarchical residual fashion build the feature-extraction network, which captures feature details and enlarges the receptive field to adapt to multi-scale text and long glued-script fonts while suppressing false positives from text-like objects. Second, an adaptive multi-level feature-map fusion strategy overcomes the inconsistency of information in multi-scale feature-map fusion. The proposed model achieves 93.94% and 84.92% F-measure on the self-built Uyghur dataset and the ICDAR2015 dataset, respectively, improving the accuracy of Uyghur text detection and suppressing false positives.
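The hierarchical-residual grouping of 3 × 3 convolutions described above follows the Res2Net pattern: channels are split into groups, and each group after the first receives the previous group's output before its own transform. A minimal numpy sketch of the wiring, with an elementwise stand-in for the learned 3 × 3 convolutions:

```python
import numpy as np

def hierarchical_residual_block(x, n_groups=4, conv=np.tanh):
    # Split the channel axis into groups; `conv` stands in for a learned
    # 3x3 convolution (this sketch only shows the hierarchical wiring).
    groups = np.split(x, n_groups, axis=0)
    outs, y = [], None
    for i, g in enumerate(groups):
        if i == 0:
            y = g            # first group passes through unchanged
        elif i == 1:
            y = conv(g)      # second group is transformed directly
        else:
            y = conv(g + y)  # later groups add the previous output first,
                             # growing the effective receptive field
        outs.append(y)
    return np.concatenate(outs, axis=0)

x = np.random.default_rng(1).standard_normal((8, 5, 5))  # (C, H, W), toy sizes
out = hierarchical_residual_block(x)
```

Each successive group sees features already filtered by the earlier groups, which is what widens the receptive field without extra depth.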
Journal Article
One-Shot Multiple Object Tracking in UAV Videos Using Task-Specific Fine-Grained Features
2022
Multiple object tracking (MOT) in unmanned aerial vehicle (UAV) videos is a fundamental task and can be applied in many fields. MOT consists of two critical procedures, i.e., object detection and re-identification (ReID). One-shot MOT, which incorporates detection and ReID in a unified network, has gained attention due to its fast inference speed. It significantly reduces the computational overhead by making two subtasks share features. However, most existing one-shot trackers struggle to achieve robust tracking in UAV videos. We observe that the essential difference between detection and ReID leads to an optimization contradiction within one-shot networks. To alleviate this contradiction, we propose a novel feature decoupling network (FDN) to convert shared features into detection-specific and ReID-specific representations. The FDN searches for characteristics and commonalities between the two tasks to synergize detection and ReID. In addition, existing one-shot trackers struggle to locate small targets in UAV videos. Therefore, we design a pyramid transformer encoder (PTE) to enrich the semantic information of the resulting detection-specific representations. By learning scale-aware fine-grained features, the PTE empowers our tracker to locate targets in UAV videos accurately. Extensive experiments on VisDrone2021 and UAVDT benchmarks demonstrate that our tracker achieves state-of-the-art tracking performance.
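The decoupling idea in this abstract, turning one shared feature into detection-specific and ReID-specific representations, can be illustrated with two toy linear heads. The paper's FDN is a learned network with cross-task interaction; this sketch only shows the branching structure, with hypothetical weight shapes:

```python
import numpy as np

def decouple(shared, w_det, w_reid):
    # Project the shared backbone feature into two task-specific spaces so
    # detection and ReID no longer fight over one representation.
    det = np.maximum(shared @ w_det, 0.0)    # detection branch (ReLU)
    reid = np.maximum(shared @ w_reid, 0.0)  # re-identification branch
    return det, reid

rng = np.random.default_rng(2)
shared = rng.standard_normal(16)             # toy shared feature vector
det, reid = decouple(shared,
                     rng.standard_normal((16, 8)),
                     rng.standard_normal((16, 8)))
```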
Journal Article
CAFE-YOLO: an object detection algorithm from UAV perspective fusing channel attention and fine-grained feature enhancement
2025
In aerial imagery captured by drones, object detection tasks often face challenges such as a high proportion of small objects, complex background interference, and insufficient lighting conditions, all of which substantially affect feature representation and detection accuracy. To address these challenges, a novel object detection algorithm named channel attention and fine-grained enhancement YOLO (CAFE-YOLO) is proposed. This algorithm incorporates a channel attention mechanism into the backbone network to enhance the focus on critical features while suppressing redundant information. Furthermore, a fine-grained feature enhancement module is introduced to extract local detail features, improving the perception of small and occluded objects. In the detection head, a lightweight attention-guided feature fusion strategy is designed to further optimize object localization and classification performance. Experimental results on the VisDrone2019 dataset show that the proposed method achieves significantly better detection performance than most existing advanced algorithms in complex drone-captured imaging scenarios. While maintaining a lightweight architecture, it reaches a mean average precision at IoU threshold 0.5 of 44.6%, demonstrating substantial improvements in both overall detection accuracy and robustness.
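A common form of the channel attention this abstract incorporates is squeeze-and-excitation: pool each channel to a scalar, pass through a small bottleneck, and reweight channels with sigmoid gates. A toy numpy sketch (weights random, reduction ratio baked into the shapes; not the paper's exact module):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    # x: (C, H, W). Squeeze spatial dims, excite through a bottleneck,
    # then scale each channel by its gate in (0, 1).
    squeeze = x.mean(axis=(1, 2))                         # (C,) global pool
    gates = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))   # (C,) channel gates
    return x * gates[:, None, None]                       # reweight channels

rng = np.random.default_rng(3)
x = rng.standard_normal((16, 6, 6))                       # toy feature map
y = channel_attention(x, rng.standard_normal((4, 16)),    # reduction 16 -> 4
                      rng.standard_normal((16, 4)))
```

Because every gate lies in (0, 1), critical channels are emphasized only relatively: redundant channels are scaled down rather than amplified.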
Journal Article
TSFE: Two-Stage Feature Enhancement for Remote Sensing Image Captioning
2024
In the field of remote sensing image captioning (RSIC), mainstream methods typically adopt an encoder–decoder framework. Methods based on this framework often use only simple feature fusion strategies, failing to fully mine the fine-grained features of the remote sensing image. Moreover, the lack of context information introduction in the decoder results in less accurate generated sentences. To address these problems, we propose a two-stage feature enhancement model (TSFE) for remote sensing image captioning. In the first stage, we adopt an adaptive feature fusion strategy to acquire multi-scale features. In the second stage, we further mine fine-grained features based on multi-scale features by establishing associations between different regions of the image. In addition, we introduce global features with scene information in the decoder to help generate descriptions. Experimental results on the RSICD, UCM-Captions, and Sydney-Captions datasets demonstrate that the proposed method outperforms existing state-of-the-art approaches.
Journal Article
Research on Emotion Recognition Method Based on Adaptive Window and Fine-Grained Features in MOOC Learning
2022
In MOOC learning, learners’ emotions have an important impact on learning outcomes. To address the problem that learners’ emotions are not readily observable during the learning process, we propose a method that identifies learner emotion by combining eye movement features and scene features. The method uses an adaptive window to partition samples and enhances sample features through fine-grained feature extraction. Adaptive window partitioning makes the eye movement information within each sample more abundant, and fine-grained feature extraction from the adaptive window increases discrimination between samples. With the proposed method, four-category emotion recognition accuracy from the single modality of eye movement reached 65.1% in MOOC learning scenarios. Both the adaptive window partitioning method and the eye-movement-based fine-grained feature extraction method proposed in this paper can also be applied to other modalities.
Journal Article
A Student Facial Expression Recognition Model Based on Multi-Scale and Deep Fine-Grained Feature Attention Enhancement
2024
In smart classroom environments, accurately recognizing students’ facial expressions is crucial for teachers to efficiently assess students’ learning states, adjust teaching strategies in a timely manner, and enhance teaching quality and effectiveness. In this paper, we propose a student facial expression recognition model based on multi-scale and deep fine-grained feature attention enhancement (SFER-MDFAE) to address the inaccurate facial feature extraction and poor robustness of facial expression recognition in smart classroom scenarios. First, we construct a novel multi-scale dual-pooling feature aggregation module to capture and fuse facial information at different scales, obtaining a comprehensive representation of key facial features. Second, we design a key region-oriented attention mechanism that focuses on the nuances of facial expressions, further enhancing the representation of multi-scale deep fine-grained features. Finally, the fusion of multi-scale and deep fine-grained attention-enhanced features yields richer and more accurate facial key information and enables accurate facial expression recognition. Experimental results demonstrate that the proposed SFER-MDFAE outperforms existing state-of-the-art methods, achieving an accuracy of 76.18% on FER2013, 92.75% on FERPlus, 92.93% on RAF-DB, 67.86% on AffectNet, and 93.74% on a real smart classroom facial expression dataset (SCFED). These results validate the effectiveness of the proposed method.
Journal Article
GLF-Net: A Semantic Segmentation Model Fusing Global and Local Features for High-Resolution Remote Sensing Images
2023
Semantic segmentation of high-resolution remote sensing images holds paramount importance in the field of remote sensing. To better mine and fully fuse the features in high-resolution remote sensing images, this paper introduces a novel Global and Local Feature Fusion Network, abbreviated as GLF-Net, which incorporates extensive contextual information and refined fine-grained features. The proposed GLF-Net, devised as an encoder–decoder network, employs the powerful ResNet50 as its baseline model. It incorporates two pivotal components within the encoder phase: a Covariance Attention Module (CAM) and a Local Fine-Grained Extraction Module (LFM). An additional wavelet self-attention module (WST) is integrated into the decoder stage. The CAM extracts features at different scales from the various stages of the ResNet and then encodes them with graph convolutions; in this way, the proposed GLF-Net model captures global contextual information with both universality and consistency. The local fine-grained extraction module refines the feature map by encoding semantic and spatial information, thereby capturing the local fine-grained features in images. Furthermore, the WST maximizes the synergy between high-frequency and low-frequency information, facilitating the fusion of global and local features for better performance in semantic segmentation. The effectiveness of the proposed GLF-Net model is validated through experiments conducted on the ISPRS Potsdam and Vaihingen datasets, whose results verify that it greatly improves segmentation accuracy.
Journal Article
FFPNet: Fine-Grained Feature Perception Network for Semantic Change Detection on Bi-Temporal Remote Sensing Images
by Xia, Kai; Feng, Hailin; Zhang, Fengwei
in Accuracy; Change detection; channel–spatial inter-correlation
2024
Semantic change detection (SCD) is an increasingly important topic in the field of remote sensing (RS) image interpretation, since it provides semantic comprehension of bi-temporal RS images by predicting change regions and change types, and has great significance for urban planning and ecological monitoring. With the availability of large-scale bi-temporal RS datasets, various models based on deep learning (DL) have been widely applied to SCD. Since convolution operators in DL extract two-dimensional feature matrices in the spatial dimensions of images and stack these matrices along the channel dimension, the feature maps of images are three-dimensional. However, recent SCD models usually overlook this stereoscopic property of feature maps. First, they are limited in capturing spatial global features during bi-temporal global feature extraction and overlook global channel features. Second, they focus only on spatial cross-temporal interaction during change feature perception and ignore channel interaction. To address these two challenges, a novel fine-grained feature perception network (FFPNet) is proposed in this paper, which employs the Omni Transformer (OiT) module to capture bi-temporal channel–spatial global features before utilizing the Omni Cross-Perception (OCP) module to achieve channel–spatial interaction between cross-temporal features. Experiments on the SECOND and LandsatSCD datasets show that our FFPNet reaches competitive performance on both countryside and urban scenes compared with recent typical SCD models.
Journal Article
Dynamic Weighting Network for Person Re-Identification
2023
Recently, hybrid Convolution-Transformer architectures have become popular due to their ability to capture both local and global image features and their lower computational cost compared with pure Transformer models. However, directly embedding a Transformer can cause the loss of convolution-based features, particularly fine-grained features, so using these architectures as the backbone of a re-identification task is not effective. To address this challenge, we propose a feature fusion gate unit that dynamically adjusts the ratio of local to global features. The feature fusion gate unit fuses the convolutional and self-attention branches of the network with dynamic parameters based on the input information. This unit can be integrated into different layers or multiple residual blocks, with varying effects on model accuracy. Using feature fusion gate units, we propose a simple and portable model called the dynamic weighting network, or DWNet, which supports two backbones, ResNet and OSNet, called DWNet-R and DWNet-O, respectively. DWNet significantly improves re-identification performance over the original baseline while maintaining reasonable computational consumption and parameter count. Finally, our DWNet-R achieves mAPs of 87.53%, 79.18%, and 50.03% on the Market1501, DukeMTMC-reID, and MSMT17 datasets, respectively; our DWNet-O achieves mAPs of 86.83%, 78.68%, and 55.66% on the same datasets.
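The input-dependent gating this abstract describes, mixing a convolutional (local) branch with a self-attention (global) branch, can be sketched as a sigmoid gate computed from both inputs. A toy numpy illustration with a hypothetical weight matrix, not the paper's actual unit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fusion_gate(local_feat, global_feat, w_gate):
    # The gate g depends on both branches, so the local/global mix is
    # dynamic per input rather than a fixed learned ratio.
    g = sigmoid(np.concatenate([local_feat, global_feat]) @ w_gate)  # in (0, 1)
    return g * local_feat + (1.0 - g) * global_feat  # convex combination

rng = np.random.default_rng(4)
local = rng.standard_normal(8)                  # conv-branch feature (toy)
glob = rng.standard_normal(8)                   # attention-branch feature
fused = fusion_gate(local, glob, rng.standard_normal((16, 8)))
```

Because the output is an elementwise convex combination, each fused value stays between the two branch values, so neither branch's signal is ever amplified beyond its input.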
Journal Article