Catalogue Search | MBRL
Explore the vast range of titles available.
122 result(s) for "Li, Tianping"
Semantic segmentation feature fusion network based on transformer
2025
Convolutional neural networks have demonstrated efficacy in acquiring local features and spatial details; however, they struggle to obtain global information, which can compromise the segmentation of important regions of an image. The Transformer can increase the expressiveness of pixels by establishing global relationships between them. Moreover, some transformer-based self-attention methods do not combine the advantages of convolution, which makes the model require more parameters and computation. To address these two issues and improve semantic segmentation results, this work uses both Transformer and CNN structures to strengthen the relationship between image-level regions and global information, improving segmentation accuracy and performance. We first build a Feature Alignment Module (FAM) to enhance spatial details and improve channel representations. Second, we compute the link between similar pixels using a Transformer structure, which enhances the pixel representation. Finally, we design a Pyramid Convolutional Pooling Module (PCPM) that both compresses and enriches the feature maps and determines the global correlations among the pixels, reducing the computational burden on the transformer. These three elements come together to form a transformer-based semantic segmentation feature fusion network (FFTNet). Our method yields 82.5% mIoU on the Cityscapes test dataset. Furthermore, we conducted various visualization tests using the Pascal VOC 2012 and Cityscapes datasets; the results show that our approach outperforms alternative approaches.
Journal Article
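A minimal sketch of the PCPM idea from the FFTNet abstract above: keys and values are compressed by multi-scale pooling before attention, shrinking the attention matrix from (HW × HW) to (HW × S). The pool sizes and linear projections here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooledAttention(nn.Module):
    """Self-attention whose keys/values come from multi-scale pooled copies
    of the feature map, so the attention matrix is (HW x S) with S << HW."""
    def __init__(self, dim, pool_sizes=(1, 3, 6, 8)):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.pool_sizes = pool_sizes
        self.scale = dim ** -0.5

    def forward(self, x):                                   # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x.flatten(2).transpose(1, 2))            # (B, HW, C)
        # Compress the spatial grid to a few coarse scales: S = 1+9+36+64 = 110 tokens.
        pooled = [F.adaptive_avg_pool2d(x, s).flatten(2) for s in self.pool_sizes]
        kv = torch.cat(pooled, dim=2).transpose(1, 2)       # (B, S, C)
        attn = (q @ self.k(kv).transpose(1, 2) * self.scale).softmax(dim=-1)
        out = attn @ self.v(kv)                             # (B, HW, C)
        return out.transpose(1, 2).reshape(B, C, H, W)

# PyramidPooledAttention(64)(torch.randn(2, 64, 32, 32)).shape -> (2, 64, 32, 32)
```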
Multi-attention fusion transformer for single-image super-resolution
2024
Recently, Transformer-based methods have gained prominence in image super-resolution (SR) tasks, addressing the challenge of long-range dependence through the incorporation of cross-layer connectivity and local attention mechanisms. However, the analysis of these networks using local attribution maps has revealed significant limitations in leveraging the spatial extent of input information. To unlock the inherent potential of the Transformer in image SR, we propose the Multi-Attention Fusion Transformer (MAFT), a novel model designed to integrate multiple attention mechanisms with the objective of expanding the number and range of pixels activated during image reconstruction. This integration enhances the effective utilization of the input information space. At the core of our model lie the Multi-Attention Adaptive Integration Groups, which facilitate the transition from dense local attention to sparse global attention through the introduction of Local Attention Aggregation and Global Attention Aggregation blocks with alternating connections, effectively broadening the network's receptive field. The effectiveness of the proposed algorithm has been validated through comprehensive quantitative and qualitative evaluation on benchmark datasets. Compared to state-of-the-art methods (e.g., HAT), the proposed MAFT achieves 0.09 dB gains on the Urban100 dataset for the ×4 SR task while containing 32.55% fewer parameters and 38.01% fewer FLOPs.
Journal Article
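As a rough illustration of MAFT's dense-local versus sparse-global alternation described above, the two grouping schemes below partition a feature map into window-local token groups and strided global token groups; attention would then be applied within each group. The helper names and sizes are assumptions for illustration, not the paper's code.

```python
import torch

def window_tokens(x, w):
    """Dense local groups: non-overlapping w x w windows of a (B, H, W, C) map."""
    B, H, W, C = x.shape                        # H, W must be divisible by w
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

def strided_tokens(x, s):
    """Sparse global groups: every s-th pixel grouped together, spanning the map."""
    B, H, W, C = x.shape                        # H, W must be divisible by s
    x = x.view(B, H // s, s, W // s, s, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, (H // s) * (W // s), C)

x = torch.randn(1, 8, 8, 32)
print(window_tokens(x, 4).shape)    # (4, 16, 32): 4 local windows of 16 tokens each
print(strided_tokens(x, 4).shape)   # (16, 4, 32): 16 groups whose tokens span the image
```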
Combining transformer global and local feature extraction for object detection
2024
Convolutional neural network (CNN)-based object detectors perform excellently but lack global feature extraction and cannot establish global dependencies between object pixels. Although the Transformer can compensate for this, it does not incorporate the advantages of convolution, which results in insufficient detail in local features, slow speed, and a large number of computational parameters. In addition, the Feature Pyramid Network (FPN) lacks information interaction across layers, which reduces the acquisition of feature context information. To solve the above problems, this paper proposes a CNN-based anchor-free object detector that combines transformer global and local feature extraction (GLFT) to enhance the extraction of semantic information from images. First, the segmented channel extraction feature attention (SCEFA) module was designed to improve the extraction of local multiscale channel features and enhance the discrimination of pixels in the object region. Second, the aggregated feature hybrid transformer (AFHTrans) module, combined with convolution, was designed to enhance the extraction of global and local feature information and to establish dependencies between the pixels of distant objects; it compensates for the shortcomings of the FPN by means of multilayer information aggregation and transmission. Compared with a pure transformer, these methods have obvious advantages. Finally, the feature extraction head (FE-Head) was designed to extract full-text information based on the features of different tasks. Accuracies of 47.0% and 82.76% were achieved on the COCO2017 and PASCAL VOC2007 + 2012 datasets, respectively, and the experimental results validate the effectiveness of our method.
Journal Article
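A hedged sketch of the "segmented channel" idea behind SCEFA as the abstract describes it: channels are split into segments, each gated by its own small squeeze-and-excite block. The grouping and gating details are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class SegmentedChannelAttention(nn.Module):
    """Split channels into segments; gate each segment with its own SE block."""
    def __init__(self, channels, groups=4, reduction=4):
        super().__init__()
        assert channels % groups == 0
        g = channels // groups
        self.groups = groups
        self.gates = nn.ModuleList(
            nn.Sequential(nn.Linear(g, g // reduction), nn.ReLU(),
                          nn.Linear(g // reduction, g), nn.Sigmoid())
            for _ in range(groups))

    def forward(self, x):                       # x: (B, C, H, W)
        B = x.shape[0]
        out = []
        for seg, gate in zip(x.chunk(self.groups, dim=1), self.gates):
            w = gate(seg.mean(dim=(2, 3)))      # global average pool per segment
            out.append(seg * w.view(B, -1, 1, 1))
        return torch.cat(out, dim=1)
```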
Research on a Face Real-time Tracking Algorithm Based on Particle Filter Multi-Feature Fusion
2019
With the revolutionary development of cloud computing and the Internet of Things, the integration and utilization of "big data" resources is a hot topic in artificial intelligence research. Face recognition has the advantages of being hard to replicate or steal, and of being simple and intuitive. Video face tracking in the context of big data has become an important research hotspot in the field of information security. In this paper, a particle filter tracking framework is proposed that fuses multiple features, adaptively adjusts the target tracking window, and adaptively updates the template. First, the skin color and edge features of the face are extracted from the video sequence, and a weighted color histogram describing the face features is computed. Then the integral histogram method is used to simplify the histogram calculation for the particles. Finally, the tracking window is adjusted according to the change in the average distance so as to track the object accurately. At the same time, the algorithm can adaptively update the tracking template, which improves tracking accuracy. The experimental results show that the proposed method improves the tracking effect and is robust under complex conditions such as varying skin color, illumination changes, and face occlusion.
Journal Article
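The particle filter loop the abstract above outlines can be skeletonized as: propagate, weight by fused feature similarity, resample. The motion noise, fusion weight alpha, Bhattacharyya-based likelihood, and the frame_hist patch-histogram callback are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def bhattacharyya(p, q):
    """Similarity of two normalized histograms, in [0, 1]."""
    return np.sum(np.sqrt(p * q))

def track_step(particles, weights, frame_hist, template, alpha=0.7, sigma=0.2):
    # 1. Propagate particles with a random-walk motion model.
    particles = particles + np.random.normal(0.0, 5.0, particles.shape)
    # 2. Weight each particle by fused color/edge similarity to the template.
    for i, p in enumerate(particles):
        color, edge = frame_hist(p)             # histograms of the patch at p
        sim = alpha * bhattacharyya(color, template["color"]) \
            + (1 - alpha) * bhattacharyya(edge, template["edge"])
        weights[i] = np.exp((sim - 1.0) / sigma ** 2)
    weights = weights / weights.sum()
    # 3. Systematic resampling to avoid weight degeneracy.
    pos = (np.arange(len(weights)) + np.random.rand()) / len(weights)
    idx = np.minimum(np.searchsorted(np.cumsum(weights), pos), len(weights) - 1)
    return particles[idx], np.full(len(weights), 1.0 / len(weights))
```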
Enhanced multi-scale networks for semantic segmentation
2024
Multi-scale representation provides an effective answer to the scale variation of objects and entities in semantic segmentation, and the ability to capture long-range pixel dependencies further facilitates it. In addition, semantic segmentation requires the effective use of pixel-to-pixel similarity in the channel direction to enhance pixel areas. By reviewing the characteristics of earlier successful segmentation models, we identify a number of crucial elements that enhance segmentation performance: a robust encoder structure, multi-scale interactions, attention mechanisms, and a robust decoder structure. The attention mechanism of the asymmetric non-local neural network (ANNet) is merged with multi-scale pyramidal modules to accelerate segmentation while maintaining high accuracy. However, ANNet does not account for the similarity between pixels in the channel direction of the feature map, leaving its segmentation accuracy unsatisfactory. We therefore propose EMSNet, a straightforward convolutional architecture for semantic segmentation that consists of an Integration of Enhanced Regional Module (IERM) and a Multi-Scale Convolution Module (MSCM). The IERM generates weights from the fourth- and fifth-stage feature maps and fuses the input features with those weights, at the cost of some additional computation; the similarity of the channel-direction feature maps is also calculated using ANNet's auxiliary loss function. The MSCM more accurately describes the interactions between channels, captures the interdependencies between feature pixels, and captures multi-scale context. Experiments show strong results on benchmark datasets: 82.2% segmentation accuracy on the Cityscapes test data, and mIoU of 45.58% and 85.46% on ADE20K and Pascal VOC, respectively.
Journal Article
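A minimal sketch of a multi-scale convolution module in the spirit of EMSNet's MSCM described above: parallel 3 × 3 branches at different dilation rates see different context sizes, and a 1 × 1 convolution fuses them. The branch count and dilation rates are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Parallel dilated 3x3 branches fused by a 1x1 convolution."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations)
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        # Each branch has a different receptive field; concatenation mixes scales.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```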
Nested attention network based on category contexts learning for semantic segmentation
by Liu, Meilin, Li, Tianping, Wei, Dongmei
in Attention mechanism, Complexity, Computational Intelligence
2024
The attention mechanism is widely used in the field of semantic segmentation because it can obtain effective long-distance dependencies by assigning different weights to objects according to different tasks. We propose a novel Nested Attention Network (NANet) for semantic segmentation, which combines Feature Category Attention (FCA) and Channel Relationship Attention (CRA) to effectively aggregate same-category contexts in both the spatial and channel dimensions. Specifically, FCA captures the dependencies between spatial pixel features and categories to aggregate features of the same category. CRA further captures the channel relationships on the output of FCA to obtain richer contexts. Numerous experiments show that NANet has fewer parameters and lower computational complexity than other state-of-the-art methods, making it a lightweight model with a lower total number of floating-point operations. We evaluated NANet on three datasets (Cityscapes, PASCAL VOC 2012, and ADE20K), and the experimental results show that NANet obtains promising results, reaching a performance of 82.6% on the Cityscapes test set.
Journal Article
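A hedged sketch of channel relationship attention in the spirit of NANet's CRA above: a C × C channel affinity matrix re-weights the feature maps, with a learned residual scale so training starts from the identity. The exact normalization is an assumption.

```python
import torch
import torch.nn as nn

class ChannelRelationAttention(nn.Module):
    """Re-weight feature maps by a C x C channel affinity matrix."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))   # residual scale, starts at 0

    def forward(self, x):                           # x: (B, C, H, W)
        B, C, H, W = x.shape
        f = x.view(B, C, -1)                        # (B, C, HW)
        affinity = torch.softmax(f @ f.transpose(1, 2), dim=-1)  # (B, C, C)
        out = (affinity @ f).view(B, C, H, W)
        return self.gamma * out + x                 # identity at initialization
```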
Mix-layers semantic extraction and multi-scale aggregation transformer for semantic segmentation
by Maoxia, Zhou, Yang, Xiaolong, Cui, Zhaotong
in Coders, Complexity, Computational Intelligence
2025
Recently, a number of vision transformer models for semantic segmentation have been proposed, most of which achieve impressive results. However, they lack the ability to exploit the intrinsic position and channel features of the image and are less capable of multi-scale feature fusion. This paper presents a semantic segmentation method that successfully combines attention and multi-scale representation, thereby enhancing performance and efficiency. A mix-layers semantic extraction and multi-scale aggregation transformer decoder (MEMAFormer) is proposed, which consists of two components: a mix-layers dual-channel semantic extraction module (MDCE) and a semantic aggregation pyramid pooling module (SAPPM). The MDCE incorporates a mix-layers cross attention module (MCAM) and an efficient channel attention module (ECAM). In MCAM, horizontal connections between encoder and decoder stages are employed as feature queries for the attention module, and the hierarchical feature maps derived from different encoder and decoder stages are integrated into keys and values. To address long-term dependencies, ECAM selectively emphasizes interdependent channel feature maps by integrating relevant features across all channels. Pyramid pooling compresses the feature maps, which reduces the amount of computation without compromising performance. SAPPM is composed of several distinct pooling kernels that extract context with a deeper flow of information, forming a multi-scale feature by integrating various feature sizes. The MEMAFormer-B0 model demonstrates superior performance compared to SegFormer-B0, exhibiting gains of 4.8%, 4.0%, and 3.5% on the ADE20K, Cityscapes, and COCO-Stuff datasets, respectively.
Journal Article
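The MCAM description above, encoder-decoder horizontal connections serving as queries against hierarchical keys and values, reduces at its core to a cross-attention call. A minimal sketch under assumed shapes and projections, not the paper's implementation:

```python
import torch
import torch.nn as nn

class CrossStageAttention(nn.Module):
    """Encoder skip features query decoder features of the same width."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, enc_feat, dec_feat):
        # enc_feat, dec_feat: (B, N, C) flattened feature maps.
        out, _ = self.attn(query=enc_feat, key=dec_feat, value=dec_feat)
        return out + enc_feat                   # residual connection
```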
Target Tracking Based on Camshift Algorithm and Multi-feature Fusion
2020
This paper proposes a new algorithm on the basis of the Camshift algorithm and multi-feature fusion. First, the target area is weighted by a Gaussian function. Then LBP is used to extract local texture features, and the texture features are fused with the color features to obtain a joint color and texture histogram. Finally, to track the target accurately and cope with occlusion, a Kalman filter is used to forecast the position of the moving target. Comparison of the experimental results shows that the new algorithm can effectively overcome interference from other objects and track the target accurately in more complex environments.
Journal Article
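The CamShift-plus-Kalman combination in the abstract above maps directly onto OpenCV primitives. In this hedged sketch the back-projection uses a hue-only histogram; the paper's fused color and texture histogram would slot into calcBackProject the same way.

```python
import cv2
import numpy as np

kalman = cv2.KalmanFilter(4, 2)                 # state: x, y, vx, vy; measured: x, y
kalman.transitionMatrix = np.array([[1, 0, 1, 0], [0, 1, 0, 1],
                                    [0, 0, 1, 0], [0, 0, 0, 1]], np.float32)
kalman.measurementMatrix = np.eye(2, 4, dtype=np.float32)
kalman.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3

def step(frame_hsv, roi_hist, track_window):
    pred = kalman.predict()                     # prediction bridges occlusions
    backproj = cv2.calcBackProject([frame_hsv], [0], roi_hist, [0, 180], 1)
    crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, track_window = cv2.CamShift(backproj, track_window, crit)
    x, y, w, h = track_window
    kalman.correct(np.array([[x + w / 2], [y + h / 2]], np.float32))
    return track_window, (float(pred[0, 0]), float(pred[1, 0]))
```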
Adaptive dictionary learning based on local configuration pattern for face recognition
2020
Sparse representation-based classification and collaborative representation-based classification with regularized least squares have been used successfully in face recognition. The over-complete dictionary is crucial for approaches based on sparse or collaborative representation because it directly determines recognition accuracy and recognition time. In this paper, we propose an algorithm for adaptive dictionary learning driven by the input test image. First, nearest neighbors of the test image are labeled in the local configuration pattern (LCP) subspace using the statistical similarity and configuration similarity defined in this paper. Then the face images labeled as nearest neighbors are used as atoms to build the adaptive representation dictionary, which means all atoms of this dictionary are nearest neighbors and are structurally more similar to the test image. Finally, the test image is collaboratively represented and classified class by class with this adaptive over-complete compact dictionary. Nearest neighbors are labeled by local binary pattern and microscopic features in the very low-dimensional LCP subspace, so the labeling is very fast. The number of nearest neighbors varies with the test sample and is generally much smaller than the number of all training samples, which significantly reduces the computational cost. In addition, the atoms of the proposed dictionary are high-dimensional face image vectors rather than lower-dimensional LCP feature vectors, which ensures both that no information contained in the face image is lost and that the atoms are structurally more similar to the test image, greatly increasing recognition accuracy. We also use the Fisher ratio to assess the robustness of the proposed dictionary. Extensive experiments on representative face databases with variations in lighting, expression, pose, and occlusion demonstrate that the proposed approach is superior in both recognition time and accuracy.
Journal Article
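The classification step this abstract builds on, collaborative representation with regularized least squares (CRC-RLS), has a closed form. A minimal sketch follows; the adaptive nearest-neighbor dictionary construction in the LCP subspace is not reproduced here.

```python
import numpy as np

def crc_rls(D, labels, y, lam=1e-3):
    """D: (d, n) dictionary, one atom per column; labels: (n,); y: (d,) test image."""
    n = D.shape[1]
    # Closed-form regularized least squares: alpha = (D^T D + lam*I)^-1 D^T y
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ y)
    best, best_err = None, np.inf
    for c in np.unique(labels):
        mask = labels == c
        # Class-wise reconstruction residual, regularized by coefficient energy.
        err = np.linalg.norm(y - D[:, mask] @ alpha[mask]) / np.linalg.norm(alpha[mask])
        if err < best_err:
            best, best_err = c, err
    return best
```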
Mutually reinforcing non-local neural networks for semantic segmentation
2023
The ability to capture pixels' long-distance interdependence is beneficial to semantic segmentation, which also requires the effective use of pixel-to-pixel similarity in the channel direction to enhance pixel regions. Asymmetric Non-local Neural Networks (ANNet) combine multi-scale spatial pyramid pooling modules and non-local blocks to reduce model parameters without sacrificing performance. However, ANNet does not consider pixel similarity in the channel direction of the feature map, so its segmentation effect is not ideal. This article proposes Mutually Reinforcing Non-local Neural Networks (MRNNet) to improve ANNet. MRNNet consists of a channel enhancement regions module (CERM) and a position-enhanced pixels module (PEPM). In contrast to the Asymmetric Fusion Non-local Block (AFNB) in ANNet, CERM does not combine the feature maps of the high and low stages but rather utilizes ANNet's auxiliary loss function; calculating the similarity between feature maps in the channel direction improves the category representation of the feature maps in the channel aspect and reduces matrix multiplication computation. PEPM enhances pixels in the spatial direction of the feature map by calculating the similarity between pixels along that direction. Experiments show that our segmentation accuracy on the Cityscapes test data reaches 81.9%. Compared to ANNet, the model's parameters are reduced by 11.35 M, and, averaged over ten different 2048 × 1024 images, the inference time of MRNNet is 0.103 s faster than that of ANNet.
Journal Article
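A hedged sketch of spatial pixel enhancement in the spirit of MRNNet's PEPM above: pixel-to-pixel similarity in the spatial direction re-weights values, added residually. The projection widths are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SpatialPixelEnhance(nn.Module):
    """Enhance each pixel with features of spatially similar pixels."""
    def __init__(self, dim):
        super().__init__()
        inner = dim // 2
        self.q = nn.Conv2d(dim, inner, 1)
        self.k = nn.Conv2d(dim, inner, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        self.scale = inner ** -0.5

    def forward(self, x):                               # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)        # (B, HW, C')
        k = self.k(x).flatten(2)                        # (B, C', HW)
        attn = torch.softmax(q @ k * self.scale, dim=-1)    # (B, HW, HW)
        v = self.v(x).flatten(2).transpose(1, 2)        # (B, HW, C)
        return (attn @ v).transpose(1, 2).reshape(B, C, H, W) + x
```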