Catalogue Search | MBRL
Explore the vast range of titles available.
25 result(s) for "Special Issue on Open-World Visual Recognition"
Generalized Out-of-Distribution Detection: A Survey
2024
Out-of-distribution (OOD) detection is critical to ensuring the reliability and safety of machine learning systems. For instance, in autonomous driving, we would like the driving system to issue an alert and hand over control to humans when it detects unusual scenes or objects that it has never seen during training and cannot make a safe decision about. The term OOD detection first emerged in 2017 and has since received increasing attention from the research community, leading to a plethora of methods, ranging from classification-based to density-based to distance-based ones. Meanwhile, several other problems, including anomaly detection (AD), novelty detection (ND), open set recognition (OSR), and outlier detection (OD), are closely related to OOD detection in terms of motivation and methodology. Despite common goals, these topics have developed in isolation, and their subtle differences in definition and problem setting often confuse readers and practitioners. In this survey, we first present a unified framework called generalized OOD detection, which encompasses the five aforementioned problems, i.e., AD, ND, OSR, OOD detection, and OD. Under our framework, these five problems can be seen as special cases or sub-tasks and are easier to distinguish. Despite comprehensive surveys of related fields, the summarization of OOD detection methods remains incomplete and requires further advancement. This paper specifically addresses the gap in recent technical developments in the field of OOD detection. It also provides a comprehensive discussion of representative methods from other sub-tasks and how they relate to and inspire the development of OOD detection methods. The survey concludes by identifying open challenges and potential research directions.
Journal Article
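As a toy illustration of the classification-based family this survey covers, the sketch below scores samples by maximum softmax probability; the model, batch, and threshold are placeholders, not anything proposed by the survey itself.

```python
# Hypothetical sketch of a classification-based OOD score (maximum softmax
# probability); the threshold and the random logits are illustrative only.
import torch
import torch.nn.functional as F

def msp_ood_score(logits: torch.Tensor) -> torch.Tensor:
    """Per-sample confidence: higher max-softmax => more likely in-distribution."""
    probs = F.softmax(logits, dim=-1)
    return probs.max(dim=-1).values

def flag_ood(logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Boolean mask of samples whose confidence falls below the threshold."""
    return msp_ood_score(logits) < threshold

if __name__ == "__main__":
    logits = torch.randn(8, 10)   # e.g. a batch of 8 samples, 10 classes
    print(flag_ood(logits, threshold=0.7))
```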
A Comprehensive Survey on Test-Time Adaptation Under Distribution Shifts
2025
Machine learning methods strive to acquire a robust model during the training process that can effectively generalize to test samples, even in the presence of distribution shifts. However, these methods often suffer from performance degradation due to unknown test distributions. Test-time adaptation (TTA), an emerging paradigm, has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions. Recent progress in this paradigm has highlighted the significant benefits of using unlabeled data to train self-adapted models prior to inference. In this survey, we categorize TTA into several distinct groups based on the form of test data, namely, test-time domain adaptation, test-time batch adaptation, and online test-time adaptation. For each category, we provide a comprehensive taxonomy of advanced algorithms and discuss various learning scenarios. Furthermore, we analyze relevant applications of TTA and discuss open challenges and promising areas for future research. For a comprehensive list of TTA methods, kindly refer to https://github.com/tim-learn/awesome-test-time-adaptation.
Journal Article
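The snippet below is a rough sketch of one online test-time adaptation recipe of the kind the survey categorizes: entropy minimization over BatchNorm affine parameters, in the spirit of Tent. The learning rate, parameter selection, and single-step update are illustrative assumptions, not the survey's prescription.

```python
# Assumes a model containing BatchNorm2d layers; everything here is a sketch.
import torch
import torch.nn as nn

def collect_bn_params(model: nn.Module):
    """Adapt only BatchNorm affine parameters, keeping the rest frozen."""
    params = []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            params += [p for p in (m.weight, m.bias) if p is not None]
    return params

def entropy(probs: torch.Tensor) -> torch.Tensor:
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()

@torch.enable_grad()
def adapt_on_batch(model: nn.Module, x: torch.Tensor, lr: float = 1e-3):
    optimizer = torch.optim.SGD(collect_bn_params(model), lr=lr)
    loss = entropy(model(x).softmax(dim=1))   # minimize prediction entropy on test data
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return model(x).softmax(dim=1)            # predictions after one adaptation step
```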
PRCL: Probabilistic Representation Contrastive Learning for Semi-Supervised Semantic Segmentation
by Sun, Baigui; Xie, Haoyu; Wang, Changqi
in Computer vision; Image enhancement; Image segmentation
2024
Tremendous breakthroughs have been made in Semi-Supervised Semantic Segmentation (S4) through contrastive learning. However, due to limited annotations, the guidance on unlabeled images is generated by the model itself, which inevitably contains noise and disturbs the unsupervised training process. To address this issue, we propose a robust contrastive-based S4 framework, termed the Probabilistic Representation Contrastive Learning (PRCL) framework, to enhance the robustness of the unsupervised training process. We model pixel-wise representations as Probabilistic Representations (PRs) via multivariate Gaussian distributions and tune the contribution of ambiguous representations to tolerate the risk of inaccurate guidance in contrastive learning. Furthermore, we introduce Global Distribution Prototypes (GDP) by gathering all PRs throughout the whole training process. Since the GDP contains the information of all representations of the same class, it is robust to the instantaneous noise in representations and accommodates the intra-class variance of representations. In addition, we generate Virtual Negatives (VNs) based on the GDP to take part in the contrastive learning process. Extensive experiments on two public benchmarks demonstrate the superiority of our PRCL framework.
Journal Article
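To make the idea of probabilistic pixel representations concrete, here is a minimal sketch in which a head predicts a per-pixel mean and variance and low-certainty pixels are down-weighted in a contrastive term; the layer sizes, weighting rule, and prototype inputs are assumptions, not the PRCL implementation.

```python
# Illustrative sketch only: diagonal-Gaussian pixel representations with a
# certainty-weighted contrastive loss; shapes and the weighting are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticHead(nn.Module):
    def __init__(self, in_ch: int = 256, dim: int = 64):
        super().__init__()
        self.mu = nn.Conv2d(in_ch, dim, kernel_size=1)       # per-pixel mean
        self.log_var = nn.Conv2d(in_ch, dim, kernel_size=1)  # per-pixel log-variance

    def forward(self, feats: torch.Tensor):
        return self.mu(feats), self.log_var(feats)

def certainty_weighted_contrast(mu, log_var, proto_pos, proto_neg, tau=0.1):
    """Cosine contrastive loss whose per-pixel terms are scaled by certainty."""
    z = F.normalize(mu, dim=1)                                   # B x D x H x W
    pos = (z * proto_pos[None, :, None, None]).sum(1) / tau      # B x H x W
    neg = (z * proto_neg[None, :, None, None]).sum(1) / tau
    loss = -F.log_softmax(torch.stack([pos, neg], dim=1), dim=1)[:, 0]
    weight = torch.exp(-log_var.mean(dim=1))                     # high variance -> low weight
    return (weight * loss).mean()
```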
Learning Hierarchical Visual Transformation for Domain Generalizable Visual Matching and Recognition
by Hong, Richang; Chang, Tianyu; Wang, Meng
in Artificial neural networks; Effectiveness; Image classification
2024
Modern deep neural networks are prone to learning domain-dependent shortcuts and thus usually suffer severe performance degradation when tested in unseen target domains, owing to their poor out-of-distribution generalization, which significantly limits their real-world applications. The main reason is the domain shift arising from the large distribution gap between source and unseen target data. To this end, this paper takes a step towards training robust models for domain generalizable visual tasks, focusing mainly on learning domain-invariant visual representations to alleviate the domain shift. Specifically, we propose an effective Hierarchical Visual Transformation (HVT) network to (1) transform each training sample hierarchically into new domains with diverse distributions at three levels: Global, Local, and Pixel, and (2) maximize the visual discrepancy between the source domain and the new domains while minimizing cross-domain feature inconsistency to capture domain-invariant features. We further enhance the HVT network by introducing environment-invariant learning. To be specific, we enforce the invariance of the visual representation across automatically inferred environments by minimizing an invariant learning loss that considers the weighted average of environmental losses. In this way, we prevent the model from relying on spurious features for prediction, helping it to effectively learn domain-invariant representations and narrow the domain gap in various visual matching and recognition tasks, such as stereo matching, pedestrian retrieval, and image classification. We term the extended HVT network EHVT to distinguish the two variants. We integrate our EHVT network into different models and evaluate its effectiveness and compatibility on several public benchmark datasets. Extensive experiments clearly show that our EHVT can substantially enhance generalization performance in various tasks. Our codes are available at https://github.com/cty8998/EHVT-VisualDG.
Journal Article
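The sketch below illustrates only the invariance side of the idea: three hypothetical global-, local-, and pixel-level transformations and a feature-consistency loss across them. The concrete transforms and the loss are illustrative assumptions, not the HVT/EHVT network from the linked repository.

```python
# Sketch under assumptions: CPU tensors with 3 channels, image larger than the
# patch size; the encoder is any feature extractor supplied by the caller.
import torch
import torch.nn.functional as F

def global_transform(x):                 # image-level appearance shift
    return x * torch.empty(x.size(0), 3, 1, 1).uniform_(0.7, 1.3)

def local_transform(x, patch=32):        # perturb one random patch per image
    x = x.clone()
    b, _, h, w = x.shape
    top = torch.randint(0, h - patch, (1,)).item()
    left = torch.randint(0, w - patch, (1,)).item()
    x[:, :, top:top + patch, left:left + patch] += 0.2 * torch.randn(b, 3, patch, patch)
    return x

def pixel_transform(x, sigma=0.05):      # per-pixel noise
    return x + sigma * torch.randn_like(x)

def consistency_loss(encoder, x):
    """Encourage features to stay invariant across the transformed 'domains'."""
    f_src = encoder(x)
    losses = [F.mse_loss(encoder(t(x)), f_src)
              for t in (global_transform, local_transform, pixel_transform)]
    return sum(losses) / len(losses)
```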
Multi-modal Prototypes for Open-World Semantic Segmentation
2024
In semantic segmentation, generalizing a visual system to both seen and novel categories at inference time has always been practically valuable yet challenging. To enable such functionality, existing methods mainly rely on either providing several support demonstrations from the visual side or characterizing informative clues from the textual side (e.g., the class names). Nevertheless, both lines of work neglect the complementary nature of low-level visual and high-level language information, and explorations that consider the visual and textual modalities as a whole to promote predictions remain limited. To close this gap, we propose to encompass textual and visual clues as multi-modal prototypes to allow more comprehensive support for open-world semantic segmentation, and we build a novel prototype-based segmentation framework to realize this promise. To be specific, unlike the straightforward combination of bi-modal clues, we decompose the high-level language information into multi-aspect prototypes and aggregate the low-level visual information into more semantic prototypes, on the basis of which a fine-grained complementary fusion makes the multi-modal prototypes more powerful and accurate for prediction. Built on an elastic mask prediction module that permits any number and form of prototype inputs, we are able to solve the zero-shot, few-shot, and generalized counterpart tasks in one architecture. Extensive experiments on both the PASCAL-5i and COCO-20i datasets show the consistent superiority of the proposed method compared with previous state-of-the-art approaches, and a range of ablation studies thoroughly dissects each component in our framework, both quantitatively and qualitatively, verifying its effectiveness.
Journal Article
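As a bare-bones illustration of prototype-based mask prediction, the sketch below fuses text and visual class embeddings into prototypes and matches them against per-pixel features by cosine similarity; the simple averaging fusion and the temperature are assumptions, far coarser than the paper's fine-grained fusion module.

```python
# Illustrative prototype matching; encoders producing text_emb, visual_emb, and
# pixel_feats are assumed to exist elsewhere.
import torch
import torch.nn.functional as F

def build_prototypes(text_emb, visual_emb=None):
    """text_emb: C x D, visual_emb: C x D or None; naive averaged fusion."""
    proto = text_emb if visual_emb is None else 0.5 * (text_emb + visual_emb)
    return F.normalize(proto, dim=-1)

def predict_masks(pixel_feats, prototypes, tau: float = 0.07):
    """pixel_feats: B x D x H x W, prototypes: C x D -> per-pixel class indices."""
    feats = F.normalize(pixel_feats, dim=1)
    logits = torch.einsum("bdhw,cd->bchw", feats, prototypes) / tau
    return logits.argmax(dim=1)    # B x H x W predicted class map
```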
Towards Training-Free Open-World Segmentation via Image Prompt Foundation Models
2025
The realm of computer vision has witnessed a paradigm shift with the advent of foundational models, mirroring the transformative influence of large language models in the domain of natural language processing. This paper delves into the exploration of open-world segmentation, presenting a novel approach called Image Prompt Segmentation (IPSeg) that harnesses the power of vision foundational models. IPSeg rests on the principle of a training-free paradigm, which capitalizes on image prompt techniques. Specifically, IPSeg utilizes a single image containing a subjective visual concept as a flexible prompt to query vision foundation models such as DINOv2 and Stable Diffusion. Our approach extracts robust features for the prompt image and the input image, then matches the input representations to the prompt representations via a novel feature interaction module to generate point prompts highlighting target objects in the input image. The generated point prompts are further utilized to guide the Segment Anything Model to segment the target object in the input image. The proposed method stands out by eliminating the need for exhaustive training sessions, thereby offering a more efficient and scalable solution. Experiments on COCO, PASCAL VOC, and other datasets demonstrate IPSeg’s efficacy for flexible open-world segmentation using intuitive image prompts. This work pioneers tapping foundation models for open-world understanding through visual concepts conveyed in images.
Journal Article
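The sketch below shows a simplified version of the matching step: a pooled descriptor of the prompt image is correlated with input-image features, and the best-matching location becomes a point prompt for a promptable segmenter such as SAM. The average pooling and single-point output are assumptions, not IPSeg's feature interaction module.

```python
# Feature extraction (e.g. with a DINOv2-style backbone) is assumed to happen
# upstream; this only sketches the correlation-and-argmax step.
import torch
import torch.nn.functional as F

def point_prompt_from_matching(prompt_feats, input_feats):
    """prompt_feats: D x Hp x Wp, input_feats: D x Hi x Wi -> (row, col) on the feature grid."""
    query = F.normalize(prompt_feats.flatten(1).mean(dim=1), dim=0)   # pooled prompt descriptor, D
    keys = F.normalize(input_feats.flatten(1), dim=0)                 # D x (Hi*Wi), unit columns
    sim = query @ keys                                                # similarity per input location
    idx = int(sim.argmax())
    _, w = input_feats.shape[1], input_feats.shape[2]
    return divmod(idx, w)   # (row, col); multiply by the patch stride for pixel coordinates
```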
Open-Vocabulary Animal Keypoint Detection with Semantic-Feature Matching
2024
Current image-based keypoint detection methods for animal (including human) bodies and faces are generally divided into fully supervised and few-shot class-agnostic approaches. The former typically relies on laborious and time-consuming manual annotations, posing considerable challenges in expanding keypoint detection to a broader range of keypoint categories and animal species. The latter, though less dependent on extensive manual input, still requires annotated support images for reference during testing. To realize zero-shot keypoint detection without any prior annotation, we introduce the Open-Vocabulary Keypoint Detection (OVKD) task, which is innovatively designed to use text prompts for identifying arbitrary keypoints across any species. In pursuit of this goal, we have developed a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM). This framework synergistically combines vision and language models, creating an interplay between language features and local keypoint visual features. KDSM enhances its capabilities by integrating Domain Distribution Matrix Matching (DDMM) and other special modules, such as the Vision-Keypoint Relational Awareness (VKRA) module, improving the framework’s generalizability and overall performance. Our comprehensive experiments demonstrate that KDSM significantly outperforms the baseline in terms of performance and achieves remarkable success in the OVKD task. Impressively, our method, operating in a zero-shot fashion, still yields results comparable to state-of-the-art few-shot species class-agnostic keypoint detection methods. Codes and data are available at https://github.com/zhanghao5201/KDSM.
Journal Article
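As a conceptual sketch of text-to-keypoint matching, the code below turns keypoint-name embeddings into similarity heatmaps over image features and decodes arg-max locations; the encoders, temperature, and decoding are placeholders, not the KDSM framework.

```python
# Assumes image_feats from a vision encoder and keypoint_text_emb from a text
# encoder (e.g. CLIP-style) are provided by the caller; sketch only.
import torch
import torch.nn.functional as F

def keypoint_heatmaps(image_feats, keypoint_text_emb, tau=0.05):
    """image_feats: D x H x W, keypoint_text_emb: K x D -> K x H x W heatmaps."""
    feats = F.normalize(image_feats, dim=0)
    texts = F.normalize(keypoint_text_emb, dim=-1)
    return torch.einsum("kd,dhw->khw", texts, feats) / tau

def decode_keypoints(heatmaps):
    """Take the arg-max location of each heatmap as the predicted keypoint."""
    k, h, w = heatmaps.shape
    flat = heatmaps.view(k, -1).argmax(dim=1)
    return torch.stack([flat // w, flat % w], dim=1)   # K x 2 (row, col)
```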
OV-VIS: Open-Vocabulary Video Instance Segmentation
2024
Conventionally, the goal of Video Instance Segmentation (VIS) is to segment and categorize objects in videos from a closed set of training categories, lacking the generalization ability to handle novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation (OV-VIS), which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark OV-VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS), which contains well-annotated objects from 1196 diverse categories, surpassing the category size of existing datasets by more than an order of magnitude. Third, we propose a transformer-based OV-VIS model, OV2Seg+, which associates per-frame segmentation masks with a memory-induced transformer and classifies objects in videos with a voting module given language guidance. In addition, to monitor progress, we set up evaluation protocols for OV-VIS and propose a set of strong baseline models to facilitate future endeavors. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of OV2Seg+. The dataset and code are released at https://github.com/haochenheheda/LVVIS, and the competition website is available at https://www.codabench.org/competitions/1748.
Journal Article
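The toy sketch below covers only the tracking ingredient: per-frame instance embeddings are associated across frames by greedy cosine-similarity matching. OV2Seg+ instead uses a memory-induced transformer and a voting module, so this is purely an illustrative stand-in.

```python
# Greedy cross-frame association of instance embeddings; the threshold and the
# greedy matching scheme are illustrative assumptions.
import torch
import torch.nn.functional as F

def associate(prev_emb, curr_emb, thresh: float = 0.5):
    """prev_emb: M x D, curr_emb: N x D -> list of (prev_idx, curr_idx) matches."""
    sim = F.normalize(prev_emb, dim=1) @ F.normalize(curr_emb, dim=1).t()   # M x N
    matches, used = [], set()
    for i in range(sim.size(0)):
        j = int(sim[i].argmax())
        if sim[i, j] >= thresh and j not in used:   # keep only confident, unclaimed matches
            matches.append((i, j))
            used.add(j)
    return matches
```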
Visual Out-of-Distribution Detection in Open-Set Noisy Environments
by
Yin, Yilong
,
He, Rundong
,
Chang, Xiaojun
in
Confounding (Statistics)
,
Invariants
,
Machine learning
2024
The presence of noisy examples in the training set inevitably hampers the performance of out-of-distribution (OOD) detection. In this paper, we investigate a previously overlooked problem called OOD detection under asymmetric open-set noise, which is frequently encountered and significantly reduces the identifiability of OOD examples. We analyze the generating process of asymmetric open-set noise and observe the influential role of the confounding variable, which entangles many open-set noisy examples with a portion of the in-distribution (ID) examples, referred to as hard-ID examples, due to spurious-related characteristics. To address the issue of the confounding variable, we propose a novel method called Adversarial Confounder REmoving (ACRE) that utilizes progressive optimization with adversarial learning to curate three collections of potential examples (easy-ID, hard-ID, and open-set noisy) while simultaneously developing invariant representations and reducing spurious-related representations. Specifically, by obtaining easy-ID examples with minimal confounding effect, we learn invariant representations from ID examples that aid in identifying hard-ID and open-set noisy examples based on their similarity to the easy-ID set. Through triplet adversarial learning, we achieve the joint minimization and maximization of distribution discrepancies across the three collections, enabling the dual elimination of the confounding variable. We also leverage potential open-set noisy examples to optimize a K+1-class classifier, further removing the confounding variable and inducing a tailored K+1-Guided scoring function. Theoretical analysis establishes the feasibility of ACRE, and extensive experiments demonstrate its effectiveness and generalization. Code is available at https://github.com/Anonymous-re-ssl/ACRE0.
Journal Article
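The K+1-guided scoring idea named in the abstract can be illustrated with a minimal sketch in which the probability mass assigned to the extra open-set class acts as the OOD score; the threshold and classifier here are placeholders rather than ACRE's trained model.

```python
# Sketch of K+1-class scoring: last logit column = open-set class.
import torch
import torch.nn.functional as F

def k_plus_one_score(logits: torch.Tensor) -> torch.Tensor:
    """logits: B x (K+1); returns the open-set probability (higher = more likely OOD)."""
    return F.softmax(logits, dim=-1)[:, -1]

def predict_with_rejection(logits: torch.Tensor, threshold: float = 0.5):
    """Class predictions over the K known classes, with -1 marking rejected (OOD) samples."""
    preds = logits[:, :-1].argmax(dim=-1)
    preds[k_plus_one_score(logits) > threshold] = -1
    return preds
```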
Multi-source-free Domain Adaptive Object Detection
2024
To enhance the transferability of object detection models in real-world scenarios where data is sampled from disparate distributions, considerable attention has been devoted to domain adaptive object detection (DAOD). Researchers have also investigated multi-source DAOD to confront the challenges posed by training samples originating from different source domains. However, existing methods encounter difficulties when source data is unavailable due to privacy preservation policies or transmission cost constraints. To address these issues, we introduce and address the problem of Multi-source-free Domain Adaptive Object Detection (MSFDAOD), which seeks to perform domain adaptation for object detection using multi-source-pretrained models without any source data or target labels. Specifically, we propose a novel Divide-and-Aggregate Contrastive Adaptation (DACA) framework. First, multiple mean-teacher detection models perform effective knowledge distillation and class-wise contrastive learning within each source domain feature space, denoted as “Divide”. Meanwhile, DACA integrates proposals, obtains unified pseudo-labels, and assigns dynamic weights to student prediction aggregation, denoted as “Aggregate”. The two-step process of “Divide” and “Aggregate” enables our method to efficiently leverage the advantages of multiple source-free models and aggregate their contributions to adaptation in a self-supervised manner. Extensive experiments are conducted on multiple popular benchmark datasets, and the results demonstrate that the proposed DACA framework significantly outperforms state-of-the-art approaches for MSFDAOD tasks.
Journal Article
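Two generic ingredients named in the abstract, a mean-teacher EMA update and weighted aggregation of pseudo-labels from multiple source teachers, are sketched below under illustrative assumptions; the momentum, weights, and classification-style outputs stand in for DACA's actual proposal-integration step on detection models.

```python
# Illustrative sketch only: EMA teacher update plus weighted fusion of
# per-teacher class probabilities; shapes and hyper-parameters are assumptions.
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.999):
    """teacher <- momentum * teacher + (1 - momentum) * student."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1 - momentum)

def aggregate_pseudo_labels(teacher_probs, weights):
    """teacher_probs: list of B x C tensors (one per teacher); weights: T, summing to 1."""
    stacked = torch.stack(teacher_probs, dim=0)             # T x B x C
    fused = (weights[:, None, None] * stacked).sum(dim=0)   # weighted average over teachers
    return fused.argmax(dim=-1), fused.max(dim=-1).values   # pseudo-labels and confidences
```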