5,504 result(s) for "Vision transformers"
A survey of the vision transformers and their CNN-transformer based variants
Vision transformers have become popular as a possible substitute for convolutional neural networks (CNNs) in a variety of computer vision applications. With their ability to model global relationships in images, these transformers offer large learning capacity. However, they may suffer from limited generalization, as they do not tend to model the local correlations in images. Recently, hybrid vision transformers that combine the convolution operation with the self-attention mechanism have emerged to exploit both local and global image representations. These hybrid vision transformers, also referred to as CNN-Transformer architectures, have demonstrated remarkable results in vision applications. Given the rapidly growing number of hybrid vision transformers, a taxonomy and explanation of these hybrid architectures has become necessary. This survey presents a taxonomy of recent vision transformer architectures and, more specifically, of hybrid vision transformers. It also discusses key features of these architectures, such as attention mechanisms, positional embeddings, multi-scale processing, and convolution. In contrast to previous survey papers, which focus primarily on individual vision transformer architectures or on CNNs, this survey emphasizes the emerging trend of hybrid vision transformers. By showcasing the potential of hybrid vision transformers to deliver exceptional performance across a range of computer vision tasks, it sheds light on the future directions of this rapidly evolving family of architectures.
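To make the hybrid idea concrete, here is a minimal PyTorch sketch of one residual block that pairs a depthwise convolution branch (local correlations) with a self-attention branch (global relationships); the block name, sizes, and layer choices are illustrative assumptions, not a specific architecture from the survey.

    import torch
    import torch.nn as nn

    class HybridBlock(nn.Module):
        """Toy CNN-Transformer block: convolution models local structure,
        self-attention models global relationships (illustrative only)."""
        def __init__(self, dim: int, num_heads: int = 4):
            super().__init__()
            self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, x):                          # x: (B, C, H, W)
            b, c, h, w = x.shape
            x = x + self.local(x)                      # local branch, residual
            t = x.flatten(2).transpose(1, 2)           # (B, H*W, C) tokens
            n = self.norm(t)
            g, _ = self.attn(n, n, n)                  # global self-attention
            t = t + g                                  # residual
            return t.transpose(1, 2).reshape(b, c, h, w)

    x = torch.randn(2, 64, 14, 14)
    print(HybridBlock(64)(x).shape)                    # torch.Size([2, 64, 14, 14])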
ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond
Vision transformers have shown great potential in various computer vision tasks owing to their strong capability to model long-range dependencies using the self-attention mechanism. Nevertheless, they treat an image as a 1D sequence of visual tokens and lack an intrinsic inductive bias (IB) for modeling local visual structures and handling scale variance, which must instead be learned implicitly from large-scale training data over longer training schedules. In this paper, we leverage these two IBs and propose the ViTAE transformer, which uses a reduction cell for multi-scale features and a normal cell for locality. The two kinds of cells are stacked in both isotropic and multi-stage manners to form two families of ViTAE models, i.e., the vanilla ViTAE and ViTAEv2. Experiments on the ImageNet dataset as well as downstream tasks on the MS COCO, ADE20K, and AP10K datasets validate the superiority of our models over baseline and representative models. Moreover, we scale the ViTAE model up to 644 M parameters and obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set, without using extra private data. This demonstrates that the introduced inductive bias still helps when the model size becomes large. The source code and pretrained models are publicly available.
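As a rough illustration of the multi-scale idea behind the reduction cell, the sketch below embeds an image with parallel dilated convolutions before any attention is applied; the dilation rates, stride, and module name are assumptions for illustration, not the paper's exact design.

    import torch
    import torch.nn as nn

    class MultiScaleEmbed(nn.Module):
        """Toy reduction-cell flavour: parallel dilated convolutions gather
        context at several scales before tokens enter attention layers."""
        def __init__(self, in_ch: int, dim: int, dilations=(1, 2, 4)):
            super().__init__()
            self.branches = nn.ModuleList([
                nn.Conv2d(in_ch, dim, kernel_size=3, stride=2,
                          padding=d, dilation=d)
                for d in dilations
            ])
            self.fuse = nn.Conv2d(dim * len(dilations), dim, kernel_size=1)

        def forward(self, x):                      # x: (B, in_ch, H, W)
            feats = [b(x) for b in self.branches]  # same spatial size per branch
            return self.fuse(torch.cat(feats, dim=1))

    x = torch.randn(1, 3, 224, 224)
    print(MultiScaleEmbed(3, 64)(x).shape)         # torch.Size([1, 64, 112, 112])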
SegViT v2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers
This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation within the encoder–decoder framework and introduces SegViTv2. We introduce a novel Attention-to-Mask (ATM) module to build a lightweight decoder that is effective for plain ViTs. The proposed ATM converts the global attention map into semantic masks for high-quality segmentation results. Our decoder outperforms the popular UPerNet decoder with various ViT backbones while consuming only about 5% of its computational cost. For the encoder, we address the relatively high computational cost of ViT-based encoders and propose a Shrunk++ structure that incorporates edge-aware query-based down-sampling (EQD) and query-based up-sampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to 50% while maintaining competitive performance. Furthermore, we adapt SegViT to continual semantic segmentation, demonstrating nearly zero forgetting of previously learned knowledge. Experiments show that the proposed SegViTv2 surpasses recent segmentation methods on three popular benchmarks: ADE20k, COCO-Stuff-10k, and PASCAL-Context. The code is available at https://github.com/zbwxp/SegVit.
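The attention-to-mask idea can be sketched as class queries whose similarity to patch tokens is reused directly as mask logits; the head below is a simplified stand-in (a single attention-like similarity followed by a sigmoid), not the full ATM module.

    import torch
    import torch.nn as nn

    class ATMHead(nn.Module):
        """Toy attention-to-mask decoder: class queries cross-attend to patch
        tokens, and the similarity map doubles as per-class mask logits."""
        def __init__(self, dim: int, num_classes: int):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_classes, dim))
            self.scale = dim ** -0.5

        def forward(self, tokens):                 # tokens: (B, N, dim)
            q = self.queries.unsqueeze(0)          # (1, K, dim) class queries
            sim = torch.einsum('bkd,bnd->bkn',
                               q.expand(tokens.size(0), -1, -1),
                               tokens) * self.scale
            return sim.sigmoid()                   # (B, K, N) soft masks

    tokens = torch.randn(2, 196, 256)              # e.g. 14x14 patch tokens
    print(ATMHead(256, num_classes=19)(tokens).shape)  # torch.Size([2, 19, 196])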
EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm
Motivated by biological evolution, this paper explains the rationality of the Vision Transformer by analogy with the proven, practical evolutionary algorithm (EA) and shows that the two have consistent mathematical formulations. Inspired by effective EA variants, we then propose a novel pyramid EATFormer backbone built solely from the proposed EA-based transformer (EAT) block, which consists of three residual parts, i.e., multi-scale region aggregation, global and local interaction, and feed-forward network modules, to model multi-scale, interactive, and individual information, respectively. Moreover, we design a task-related head docked to the transformer backbone to complete the final information fusion more flexibly, and we improve a modulated deformable MSA to dynamically model irregular locations. Extensive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory experiments demonstrate the effectiveness and superiority of our approach over state-of-the-art methods. For example, our Mobile (1.8 M), Tiny (6.1 M), Small (24.3 M), and Base (49.0 M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; EATFormer-Tiny/Small/Base with Mask R-CNN obtain 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing the contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP, respectively, with fewer FLOPs; and EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K with UPerNet, exceeding Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer.
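A hedged sketch of the three-residual-part structure the abstract describes (multi-scale aggregation, global-local interaction, feed-forward network); the kernel sizes, fusion by summation, and class name are assumptions, and the improved deformable MSA is omitted.

    import torch
    import torch.nn as nn

    class EATLikeBlock(nn.Module):
        """Toy three-part residual block loosely mirroring the EAT recipe:
        (1) multi-scale aggregation, (2) global attention plus local conv
        interaction, (3) feed-forward network (illustrative only)."""
        def __init__(self, dim: int, num_heads: int = 4):
            super().__init__()
            self.msra = nn.ModuleList([nn.Conv2d(dim, dim, k, padding=k // 2)
                                       for k in (1, 3, 5)])
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
            self.norm2 = nn.LayerNorm(dim)
            self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim))

        def forward(self, x):                          # x: (B, C, H, W)
            b, c, h, w = x.shape
            x = x + sum(m(x) for m in self.msra)       # 1) multi-scale aggregation
            t = x.flatten(2).transpose(1, 2)           # (B, HW, C)
            n = self.norm1(t)
            g, _ = self.attn(n, n, n)                  # global branch
            l = self.local(x).flatten(2).transpose(1, 2)  # local branch
            t = t + g + l                              # 2) global-local interaction
            t = t + self.ffn(self.norm2(t))            # 3) feed-forward network
            return t.transpose(1, 2).reshape(b, c, h, w)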
SCT: A Simple Baseline for Parameter-Efficient Fine-Tuning via Salient Channels
Pre-trained vision transformers have strong representations that benefit various downstream tasks. Recently, many parameter-efficient fine-tuning (PEFT) methods have been proposed, and their experiments demonstrate that tuning only 1% extra parameters can surpass full fine-tuning in low-data scenarios. However, these methods overlook task-specific information when fine-tuning diverse downstream tasks. In this paper, we propose a simple yet effective method called "Salient Channel Tuning" (SCT), which leverages task-specific information by forwarding the model on task images to select a subset of channels in a feature map, enabling us to tune only 1/8 of the channels at a significantly lower parameter cost. SCT outperforms full fine-tuning on 18 of the 19 tasks in the VTAB-1K benchmark while adding only 0.11M parameters to ViT-B, 780× fewer than its full fine-tuning counterpart. Furthermore, experiments on domain generalization and few-shot learning show that SCT surpasses other PEFT methods at lower parameter cost, demonstrating the strong capability and effectiveness of our tuning technique in the low-data regime. The code will be available at https://github.com/zhaohengyuan1/SCT.git
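A minimal sketch of the channel-selection step, assuming a simple saliency rule (mean absolute activation over a batch of task images); the paper's actual criterion and tuning parameterization may differ.

    import torch

    def select_salient_channels(model, images, layer, k):
        """Rank one layer's channels by mean |activation| over task images
        and return the top-k indices (an assumed saliency rule)."""
        feats = {}
        handle = layer.register_forward_hook(
            lambda mod, inp, out: feats.setdefault('f', out))
        with torch.no_grad():
            model(images)                              # one forward pass
        handle.remove()
        f = feats['f']                                 # e.g. (B, N, C)
        scores = f.abs().mean(dim=tuple(range(f.dim() - 1)))
        return scores.topk(k).indices                  # salient channel ids

Only parameters touching those channels would then be unfrozen, with the rest of the backbone kept fixed.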
ViT-SmartAgri: Vision Transformer and Smartphone-Based Plant Disease Detection for Smart Agriculture
Invading pests and diseases degrade the quality and quantity of plants, so early and accurate identification of plant diseases is critical for plant health and growth. This work proposes a smartphone-based solution that uses a Vision Transformer (ViT) model to distinguish healthy plants from diseased ones. A collected dataset of tomato leaves was used to train ViT and Inception V3-based deep learning (DL) models to differentiate healthy and diseased plants, covering 10 tomato disease classes across 10,010 images, and the performance of the two DL models was compared. The work also presents a smartphone application (Android app) built on the ViT model, which operates via the self-attention mechanism and achieved better performance (90.99% test accuracy) than Inception V3 in our experiments. The proposed ViT-SmartAgri is promising and can be deployed at scale for smart agriculture, inspiring future work in this area.
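For flavor, here is a minimal fine-tuning setup with torchvision's pretrained ViT-B/16 and a 10-way classification head; the paper's exact backbone, input pipeline, and training recipe are not specified here, so treat this as a hypothetical starting point.

    import torch
    from torchvision.models import vit_b_16, ViT_B_16_Weights

    # Hypothetical setup: adapt an ImageNet-pretrained ViT-B/16 to
    # 10 tomato-disease classes (the paper's recipe may differ).
    model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
    model.heads = torch.nn.Linear(model.hidden_dim, 10)  # new classifier head

    for p in model.parameters():
        p.requires_grad = True                 # full fine-tuning

    x = torch.randn(4, 3, 224, 224)            # a batch of leaf images
    print(model(x).shape)                      # torch.Size([4, 10])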
Transformer architectures for computer vision: A comprehensive review and future research directions
In the past, long-range dependencies and contextual relationships in videos were captured using Convolutional Neural Networks (CNNs). Recently, Transformers have begun to be used for this purpose. Transformers have had a revolutionary impact in Natural Language Processing (NLP) and are now making significant contributions to computer vision problems. A review of the different Transformer architectures in computer vision is therefore needed to help apply them across vision applications. This paper provides a comprehensive review of Transformer architectures in computer vision, detailing their evolution from Vision Transformers (ViTs) to more advanced variants such as the Swin Transformer, Transformer-XL, and hybrid CNN-Transformer models. We examine the advantages of Transformers over traditional CNNs, their applications in object detection, image classification, and video analysis, and their computational challenges. Finally, we discuss future research directions, including self-attention mechanisms, multi-modal learning, and lightweight architectures for edge computing.
Deepfake detection using convolutional vision transformers and convolutional neural networks
Deepfake technology has advanced rapidly in recent years, creating highly realistic fake videos that can be difficult to distinguish from real ones. The rise of social media platforms and online forums has exacerbated the challenge of detecting misinformation and malicious content. Building on a large body of work on artificial intelligence techniques, this study proposes a deep learning (DL)-based method for detecting deepfakes. The system comprises three components: preprocessing, detection, and prediction. Preprocessing includes frame extraction, face detection, alignment, and feature cropping. Convolutional neural networks (CNNs) are employed for eye and nose feature detection, and a CNN combined with a vision transformer (CViT) is used for face detection. The prediction component employs a majority-voting approach that merges the results of the three models applied to the different features, each yielding an individual prediction. The model is trained on face images from the FaceForensics++ and DFDC datasets, and multiple performance metrics, including accuracy, precision, F1, and recall, are used to assess its performance. The experimental results indicate the potential and strengths of the proposed approach: the CNN achieved an enhanced accuracy of 97% and the CViT-based model 85% on the FaceForensics++ dataset, a significant improvement in deepfake detection over recent studies that affirms the potential of the suggested framework for detecting deepfakes on social media. This study contributes to a broader understanding of CNN-based DL methods for deepfake detection.
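A minimal sketch of the majority-voting step over the three feature-specific models (face, eyes, nose); the class count and tensors are placeholders, and ties here simply resolve to the lowest class index.

    import torch

    def majority_vote(logits_list):
        """Hard majority vote over per-model class predictions."""
        preds = torch.stack([l.argmax(dim=1) for l in logits_list])  # (M, B)
        num_classes = logits_list[0].size(1)
        votes = torch.zeros(preds.size(1), num_classes)
        for model_preds in preds:              # tally one model at a time
            votes[torch.arange(preds.size(1)), model_preds] += 1
        return votes.argmax(dim=1)             # (B,) final labels

    face, eyes, nose = (torch.randn(8, 2) for _ in range(3))
    print(majority_vote([face, eyes, nose]).shape)   # torch.Size([8])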
A Novel Fault Diagnosis Method of Rolling Bearing Based on Integrated Vision Transformer Model
To improve the accuracy and generalization of bearing fault diagnosis, this paper proposes an integrated vision transformer (ViT) model based on the wavelet transform and soft voting. First, the discrete wavelet transform (DWT) decomposes the vibration signal into subsignals in different frequency bands, and each subsignal is then transformed into a time-frequency representation (TFR) map by the continuous wavelet transform (CWT). Second, the TFR maps are fed into multiple individual ViT models for preliminary diagnosis. Finally, the final diagnosis is obtained by using soft voting to fuse all the preliminary results. Diagnosis tests of rolling bearings on different datasets demonstrate that the proposed integrated ViT model can accurately diagnose different fault categories and severities, and comparative analysis against an integrated CNN and an individual ViT shows higher diagnostic accuracy and generalization ability.
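The DWT-then-CWT preprocessing plus soft voting can be sketched with PyWavelets as below; the wavelet names ('db4', 'morl'), decomposition level, scales, and placeholder probabilities are assumptions, not the paper's exact settings.

    import numpy as np
    import pywt

    signal = np.random.randn(2048)                   # raw vibration signal
    coeffs = pywt.wavedec(signal, 'db4', level=3)    # DWT subband coefficients

    tfr_maps = []
    for c in coeffs:
        cwt_coeffs, _ = pywt.cwt(c, scales=np.arange(1, 65), wavelet='morl')
        tfr_maps.append(np.abs(cwt_coeffs))          # one TFR map per subband

    # Soft voting: average the class-probability vectors produced by the
    # per-subband ViT models (random placeholders stand in for them here).
    probs = [np.random.dirichlet(np.ones(4)) for _ in tfr_maps]
    print(int(np.argmax(np.mean(probs, axis=0))))    # fused fault class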
Vision transformer architecture and applications in digital health: a tutorial and survey
The vision transformer (ViT) is a state-of-the-art architecture for image recognition tasks and plays an important role in digital health applications, where medical images account for 90% of the data. This article discusses the core foundations of the ViT architecture and its digital health applications, including image segmentation, classification, detection, prediction, reconstruction, synthesis, and telehealth tasks such as report generation and security. It also presents a roadmap for implementing the ViT in digital health systems and discusses its limitations and challenges.