45 result(s) for "parameter-efficient fine-tuning"
Few-shot adaptation of multi-modal foundation models: a survey
Multi-modal (vision-language) models, such as CLIP, are replacing traditional supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of visual foundation models. These models learn robust, aligned semantic representations from billions of internet image-text pairs and can be applied to various downstream tasks in a zero-shot manner. However, in some fine-grained domains like medical imaging and remote sensing, the performance of multi-modal foundation models often leaves much to be desired. Consequently, many researchers have begun to explore few-shot adaptation methods for these models, gradually deriving three main technical approaches: (1) prompt-based methods, (2) adapter-based methods, and (3) external knowledge-based methods. Nevertheless, this rapidly developing field has produced numerous results without a comprehensive survey to systematically organize the research progress. Therefore, in this survey, we introduce and analyze the research advancements in few-shot adaptation methods for multi-modal models, summarize commonly used datasets and experimental setups, and compare the results of different methods. In addition, given the lack of reliable theoretical support for existing methods, we derive a few-shot adaptation generalization error bound for multi-modal models. The theorem reveals that the generalization error of multi-modal foundation models is constrained by three factors: domain gap, model capacity, and sample size. Based on this, we propose three possible solutions: (1) adaptive domain generalization, (2) adaptive model selection, and (3) adaptive knowledge utilization.
Investigating translation for Indic languages with BLOOMZ-3b through prompting and LoRA fine-tuning
In the domain of natural language processing, the rise of Large Language Models and Generative AI represents a noteworthy transition, enabling machines to understand and generate text resembling that produced by humans. This research conducts a thorough examination of this transformative technology, with a focus on its influence on machine translation. The study explores the translation landscape between English and Indic languages, including Hindi, Kannada, Malayalam, Tamil, and Telugu. To address this, the Large Language Model BLOOMZ-3b, developed primarily for text generation, is utilized. Multiple prompt engineering techniques for machine translation are explored in depth. The study further investigates fine-tuning the BLOOMZ-3b model using a Parameter-Efficient Fine-Tuning technique called Low-Rank Adaptation, aiming to reduce computational complexity. By combining innovative prompting approaches with fine-tuning of the BLOOMZ-3b model, this work contributes to the continuous development of machine translation technologies beyond the traditional borders of what can be done in language processing. In this regard, the research not only sheds light on the intricacy of translation problems but also sets a precedent for optimizing and adapting large language models to various languages, thereby advancing Artificial Intelligence and Natural Language Processing at large.
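Why Low-Rank Adaptation reduces computational cost can be illustrated with simple parameter arithmetic (a sketch with an illustrative layer width, not figures from the paper): for a d×k weight matrix, full fine-tuning updates d·k parameters, while a rank-r LoRA adapter trains only the low-rank factors B (d×r) and A (r×k), i.e. r·(d+k) parameters.

```python
def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA adapter on a d x k weight:
    factor B has shape (d, r) and factor A has shape (r, k)."""
    return r * (d + k)

def full_ft_params(d: int, k: int) -> int:
    """Full fine-tuning updates every entry of the d x k weight."""
    return d * k

# Hypothetical square projection layer of width 2560 (illustrative only).
d = k = 2560
r = 8
print(full_ft_params(d, k))                            # 6553600
print(lora_trainable_params(d, k, r))                  # 40960
print(lora_trainable_params(d, k, r) / full_ft_params(d, k))  # 0.00625
```

At rank 8 the adapter trains well under 1% of this layer's weights, which is the kind of saving that makes fine-tuning a multi-billion-parameter model tractable on modest hardware.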
Multi-scene camera relocalization via modulated coordinate regression and low-rank adaptation
Camera relocalization, the task of estimating a camera’s 6-DoF pose from a single image, typically necessitates training a separate model for each scene or performing fine-tuning to adapt to new environments. In this work, we present a novel approach for multi-scene camera relocalization using a single, unified regressor. Our method builds upon a pre-trained encoder, trained on a diverse set of scenes, to extract generalizable features. To adapt this encoder for specific scenes, we employ parameter-efficient fine-tuning. Additionally, we introduce a lightweight feature modulation mechanism that incorporates compact scene embeddings to condition the model, allowing it to distinguish between scenes without requiring dedicated branches or retraining. Experiments on standard relocalization benchmarks demonstrate that our method achieves competitive accuracy across multiple scenes compared to scene-specific models, while significantly reducing model complexity and training parameters. Notably, our model utilizes over 5 times fewer trainable parameters and over 3 times fewer deployment parameters than recent multi-scene counterparts, while delivering superior performance. The proposed framework provides a scalable and generalizable solution for camera relocalization in real-world, multi-environment applications.
LoRA-INT8 Whisper: A Low-Cost Cantonese Speech Recognition Framework for Edge Devices
To address the triple bottlenecks of data scarcity, oversized models, and slow inference that hinder Cantonese automatic speech recognition (ASR) in low-resource and edge-deployment settings, this study proposes a cost-effective Cantonese ASR system based on LoRA fine-tuning and INT8 quantization. First, Whisper-tiny is parameter-efficiently fine-tuned on the Common Voice zh-HK training set using LoRA with rank = 8. Only 1.6% of the original weights are updated, reducing the character error rate (CER) from 49.5% to 11.1%, a performance close to full fine-tuning (10.3%), while cutting the training memory footprint and computational cost by approximately one order of magnitude. Next, the fine-tuned model is compressed into a 60 MB INT8 checkpoint via dynamic quantization in ONNX Runtime. On a MacBook Pro M1 Max CPU, the quantized model achieves an RTF = 0.20 (offline inference 5 × real-time) and 43% lower latency than the FP16 baseline; on an NVIDIA A10 GPU, it reaches RTF = 0.06, meeting the requirements of high-concurrency cloud services. Ablation studies confirm that the LoRA-INT8 configuration offers the best trade-off among accuracy, speed, and model size. Limitations include the absence of spontaneous-speech noise data, extreme-hardware validation, and adaptive LoRA structure optimization. Future work will incorporate large-scale self-supervised pre-training, tone-aware loss functions, AdaLoRA architecture search, and INT4/NPU quantization, and will establish an mJ/char energy–accuracy curve. The ultimate goal is to achieve CER ≤ 8%, RTF < 0.1, and mJ/char < 1 for low-power real-time Cantonese ASR in practical IoT scenarios.
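The INT8 step can be illustrated with a minimal symmetric per-tensor quantization sketch (a simplification: ONNX Runtime's dynamic quantization additionally handles zero-points, per-channel scales, and operator-level details):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 codes."""
    return [qi * scale for qi in q]

w = [0.31, -1.27, 0.004, 0.9]
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
# Rounding error is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
assert max_err <= scale / 2
```

Storing one byte per weight instead of two (FP16) or four (FP32) is what shrinks the checkpoint to tens of megabytes, at the cost of the bounded rounding error shown above.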
Enhancing queries for code generation with reinforcement learning
We present a reinforcement learning framework that enhances natural language queries to improve DeepSeek code generation. A parametric refiner (Qwen with LoRA) is trained via REINFORCE while the generator remains fixed, using a scalar reward that can combine text similarity (BLEU-4, ROUGE-L, F1, Overlap) with execution signals (unit tests, syntax/timeout penalties). On the DS1000 benchmark (800 train / 200 test), RL4QE improves code similarity by 34.3%. Ablations show that BLEU-4 is the most reliable text reward overall (with F1 competitive at larger scale), and low-rank LoRA outperforms complete fine-tuning on most metrics while being more parameter-efficient. The approach transfers across foundation models (e.g., Qwen1.5/2/2.5 variants), where architecture often matters more than size. RL4QE is easy to integrate in practice (LoRA in attention projections) and supports reproducibility.
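The REINFORCE objective behind such a refiner can be sketched in a few framework-free lines (a toy illustration with made-up numbers, not the paper's implementation, which trains a LoRA-adapted Qwen refiner): the policy-gradient estimator is (R − b)·∇ log π(a), so minimizing the surrogate loss −(R − b)·log π(a) reproduces that gradient.

```python
import math

def reinforce_loss(log_prob: float, reward: float, baseline: float) -> float:
    """Surrogate loss whose gradient matches the REINFORCE estimator:
    grad E[R] ~ (R - b) * grad log pi(a)."""
    return -(reward - baseline) * log_prob

# Toy step: the refiner sampled a rewritten query with probability 0.25,
# and the generated code earned a reward of 0.7 (e.g., a BLEU-4-style score)
# against a running baseline of 0.5.
log_p = math.log(0.25)
loss = reinforce_loss(log_p, reward=0.7, baseline=0.5)
# Reward above baseline: minimizing the loss pushes log_p (and hence the
# probability of producing this rewrite) upward.
assert loss > 0
```

The baseline subtraction does not bias the gradient; it only reduces its variance, which is why a running average of past rewards is a common choice.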
Ontology-conformal recognition of materials entities using language models
Extracting structured and semantically annotated materials information from unstructured scientific literature is a crucial step toward constructing machine-interpretable knowledge graphs and accelerating data-driven materials research. This is especially important in materials science, which is adversely affected by data scarcity. Data scarcity further motivates solutions such as foundation language models for information extraction, which can in principle address several subtasks of the extraction problem across a range of domains without the need to generate costly large-scale annotated datasets for each downstream task. However, foundation language models struggle with tasks like Named Entity Recognition (NER) due to domain-specific terminologies, fine-grained entities, and semantic ambiguity. The issue is even more pronounced when entities must map directly to pre-existing domain ontologies. This work aims to assess whether foundation large language models (LLMs) can successfully perform ontology-conformal NER in the materials mechanics and fatigue domain. Specifically, we present a comparative evaluation of in-context learning (ICL) with foundation models such as GPT-4 against fine-tuned task-specific language models, including MatSciBERT and DeBERTa. The study is performed on two materials fatigue datasets, which contain annotations at a comparatively fine-grained level adhering to the class definitions of a formal ontology to ensure semantic alignment and cross-dataset interoperability. Both datasets cover adjacent domains to assess how well both NER methodologies generalize when presented with typical domain shifts. Task-specific models are shown to significantly outperform general foundation models on ontology-constrained NER. Our findings reveal a strong dependence on the quality of few-shot demonstrations in ICL for handling domain shift. The study also highlights the significance of domain-specific pre-training by comparing task-specific models that differ primarily in their pre-training corpus.
Leveraging Vision Foundation Model via PConv-Based Fine-Tuning with Automated Prompter for Defect Segmentation
In industrial scenarios, image segmentation is essential for accurately identifying defect regions. Recently, the emergence of foundation models driven by powerful computational resources and large-scale training data has brought about a paradigm shift in deep learning-based image segmentation. The Segment Anything Model (SAM) has shown exceptional performance across various downstream tasks, owing to its vast semantic knowledge and strong generalization capabilities. However, the feature distribution discrepancy, reliance on manually labeled prompts, and limited category information of SAM reduce its scalability in industrial settings. To address these issues, we propose PA-SAM, an industrial defect segmentation framework based on SAM. First, to bridge the gap between SAM's pre-training data and the distinct characteristics of industrial defects, we introduce a parameter-efficient fine-tuning (PEFT) technique incorporating lightweight Multi-Scale Partial Convolution Aggregation (MSPCA) into Low-Rank Adaptation (LoRA), named MSPCA-LoRA, which effectively enhances the image encoder's sensitivity to prior knowledge biases while maintaining PEFT efficiency. Furthermore, we present the Image-to-Prompt Embedding Generator (IPEG), which utilizes image embeddings to autonomously create high-quality prompt embeddings for directing mask segmentation, eliminating the limitations of manually provided prompts. Finally, we apply effective refinements to SAM's mask decoder, transforming SAM into an end-to-end semantic segmentation framework. On two real-world defect segmentation datasets, PA-SAM achieves mean Intersection over Union scores of 73.87% and 68.30%, as well as mean Dice coefficients of 84.90% and 80.22%, outperforming other state-of-the-art algorithms and further demonstrating its robust generalization and application potential.
A new low-rank adaptation method for brain structure and metastasis segmentation via decoupled principal weight direction and magnitude
Deep learning techniques have become pivotal in medical image segmentation, but their success often relies on large, manually annotated datasets, which are expensive and labor-intensive to obtain. Additionally, different segmentation tasks frequently require retraining models from scratch, resulting in substantial computational costs. To address these limitations, we propose PDoRA, an innovative parameter-efficient fine-tuning method that leverages knowledge transfer from a pre-trained SwinUNETR model for a wide range of brain image segmentation tasks. PDoRA minimizes the reliance on extensive data annotation and computational resources by decomposing model weights into principal and residual weights. The principal weights are further divided into magnitude and direction, enabling independent fine-tuning to enhance the model’s ability to capture task-specific features. The residual weights remain fixed and are later fused with the updated principal weights, ensuring model stability while enhancing performance. We evaluated PDoRA on three diverse medical image datasets for brain structure and metastasis segmentation. The results demonstrate that PDoRA consistently outperforms existing parameter-efficient fine-tuning methods, achieving superior segmentation accuracy and efficiency. Our code is available at https://github.com/Perfect199001/PDoRA/tree/main .
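The magnitude-direction split of the principal weights can be illustrated on a single weight column (a minimal sketch of the decomposition idea only; PDoRA additionally separates principal from residual weights and keeps the residual part frozen): a column w is rewritten as m · (w / ‖w‖), after which the magnitude m and the unit direction can be fine-tuned independently.

```python
import math

def decompose(column):
    """Split a weight column into its magnitude (L2 norm) and unit direction."""
    m = math.sqrt(sum(w * w for w in column))
    direction = [w / m for w in column]
    return m, direction

def recompose(m, direction):
    """Recombine magnitude and direction into a weight column."""
    return [m * d for d in direction]

w = [3.0, 4.0]
m, d = decompose(w)
assert abs(m - 5.0) < 1e-12                 # magnitude is the column norm
assert all(abs(a - b) < 1e-12               # recomposition is lossless
           for a, b in zip(recompose(m, d), w))
```

Tuning m and the direction separately lets the model rescale a feature without rotating it (or vice versa), which is the flexibility the decomposition is meant to buy.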
EMSAM: enhanced multi-scale segment anything model for leaf disease segmentation
Accurate segmentation of leaf diseases is crucial for crop health management and disease prevention. However, existing studies fall short in addressing issues such as blurred disease spot boundaries and complex feature distributions in disease images. Although the vision foundation model, Segment Anything Model (SAM), performs well in general segmentation tasks within natural scenes, it does not exhibit good performance in plant disease segmentation. To achieve fine-grained segmentation of leaf disease images, this study proposes an advanced model: Enhanced Multi-Scale SAM (EMSAM). EMSAM employs the Local Feature Extraction Module (LFEM) and the Global Feature Extraction Module (GFEM) to extract local and global features from images respectively. The LFEM utilizes multiple convolutional layers to capture lesion boundaries and detailed characteristics, while the GFEM fine-tunes ViT blocks using a Multi-Scale Adaptive Adapter (MAA) to obtain multi-scale global information. Both outputs of LFEM and GFEM are then effectively fused in the Feature Fusion Module (FFM), which is optimized with cross-branch and channel attention mechanisms, significantly enhancing the model's ability to handle blurred boundaries and complex shapes. EMSAM integrates lightweight linear layers as classification heads and employs a joint loss function for both classification and segmentation tasks. Experimental results on the PlantVillage dataset demonstrate that EMSAM outperforms the second-best state-of-the-art semantic segmentation model by 2.45% in Dice Coefficient and 6.91% in IoU score, and surpasses the baseline method by 21.40% and 22.57%, respectively. Particularly, for images with moderate and severe disease levels, EMSAM achieved Dice Coefficients of 0.8354 and 0.8178, respectively, significantly outperforming other semantic segmentation algorithms. Additionally, the model achieved a classification accuracy of 87.86% across the entire dataset, highlighting EMSAM's effectiveness and superiority in plant disease segmentation and classification tasks.
Augmented prediction of vertebral collapse after osteoporotic vertebral compression fractures through parameter-efficient fine-tuning of biomedical foundation models
Vertebral collapse (VC) following osteoporotic vertebral compression fracture (OVCF) often requires aggressive treatment, necessitating an accurate prediction for early intervention. This study aimed to develop a predictive model leveraging deep neural networks to predict VC progression after OVCF using magnetic resonance imaging (MRI) and clinical data. Among 245 enrolled patients with acute OVCF, data from 200 patients were used for the development dataset, and data from 45 patients were used for the test dataset. To construct an accurate prediction model, we explored two backbone architectures: convolutional neural networks and vision transformers (ViTs), along with various pre-trained weights and fine-tuning methods. Through extensive experiments, we built our model by performing parameter-efficient fine-tuning of a ViT model pre-trained on a large-scale biomedical dataset. Attention rollouts indicated that the contours and internal features of the compressed vertebral body were critical in predicting VC with this model. To further improve the prediction performance of our model, we applied the augmented prediction strategy, which uses multiple MRI frames and achieves a significantly higher area under the curve (AUC). Our findings suggest that employing a biomedical foundation model fine-tuned using a parameter-efficient method, along with augmented prediction, can significantly enhance medical decisions.