Catalogue Search | MBRL

Semantic Understanding of Scenes Through the ADE20K Dataset

by Xiao, Tete; Barriuso, Adela; Puig, Xavier in Benchmarks, Computer vision, Data acquisition

2019

Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present a densely annotated dataset, ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total, there are 25k images of complex everyday scenes containing a variety of objects in their natural spatial context. On average there are 19.5 instances and 10.5 object classes per image. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performance on both benchmarks and release open-source re-implementations of state-of-the-art models. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for semantic segmentation performance. We show that the networks trained on ADE20K are able to segment a wide variety of scenes and objects.

Journal Article

Occluded Video Instance Segmentation: A Benchmark

by Liu, Xiaoyu; Bai, Xiang; Hu, Yao in Algorithms, Datasets, Image segmentation

2022

Can our video understanding systems perceive objects when a heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, to simultaneously detect, segment, and track instances in occluded scenes. OVIS consists of 296k high-quality instance masks from 25 semantic categories, where object occlusions usually occur. While our human vision systems can understand those occluded instances by contextual reasoning and association, our experiments suggest that current video understanding systems cannot. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 16.3, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario. We also present a simple plug-and-play module that performs temporal feature calibration to complement missing object cues caused by occlusion. Built upon MaskTrack R-CNN and SipMask, we obtain a remarkable AP improvement on the OVIS dataset. The OVIS dataset and project code are available at http://songbai.site/ovis.

Journal Article

End-to-End Learning of Deep Visual Representations for Image Retrieval

by Larlus, Diane; Revaud, Jerome; Gordo, Albert in Artificial Intelligence, Computer architecture, Computer Imaging

2017

While deep learning has become a key ingredient in the top-performing methods for many computer vision tasks, it has so far failed to bring similar improvements to instance-level image retrieval. In this article, we argue that the reasons for the underwhelming results of deep methods on image retrieval are threefold: (1) noisy training data, (2) inappropriate deep architecture, and (3) suboptimal training procedure. We address all three issues. First, we leverage a large-scale but noisy landmark dataset and develop an automatic cleaning method that produces a suitable training set for deep retrieval. Second, we build on the recent R-MAC descriptor, show that it can be interpreted as a deep and differentiable architecture, and present improvements to enhance it. Last, we train this network with a siamese architecture that combines three streams with a triplet loss. At the end of the training process, the proposed architecture produces a global image representation in a single forward pass that is well suited for image retrieval. Extensive experiments show that our approach significantly outperforms previous retrieval approaches, including state-of-the-art methods based on costly local descriptor indexing and spatial verification. On Oxford 5k, Paris 6k and Holidays, we report 94.7, 96.6, and 94.8 mean average precision, respectively. Our representations can also be heavily compressed using product quantization with little loss in accuracy.
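The three-stream training mentioned in the abstract relies on a triplet loss. The sketch below shows one common form of that loss on plain Python lists, assuming squared Euclidean distance and a hypothetical margin of 0.1; the function names and the margin value are illustrative assumptions, not details taken from the article.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge-style triplet loss (margin value is an assumption).

    The loss is zero once the negative is farther from the anchor
    than the positive by at least `margin`.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)


# A well-separated triplet incurs no loss; a badly ordered one is penalized.
good = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
bad = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In this form, only triplets that violate the margin contribute to the training signal.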

Journal Article

A survey on instance segmentation: state of the art

by Bhat, Ghulam Mohiuddin; Hafiz, Abdul Mueed in Accuracy, Centroids, Classification

2020

Object detection or localization is an incremental step in progression from coarse to fine digital image inference. It not only provides the classes of the image objects, but also provides the location of the image objects which have been classified. The location is given in the form of bounding boxes or centroids. Semantic segmentation gives fine inference by predicting labels for every pixel in the input image. Each pixel is labelled according to the object class within which it is enclosed. Furthering this evolution, instance segmentation gives different labels for separate instances of objects belonging to the same class. Hence, instance segmentation may be defined as the technique of simultaneously solving the problem of object detection as well as that of semantic segmentation. In this survey paper on instance segmentation, its background, issues, techniques, evolution, popular datasets, related work up to the state of the art and future scope have been discussed. The paper provides valuable information for those who want to do research in the field of instance segmentation.

Journal Article

EfficientPS: Efficient Panoptic Segmentation

by Valada, Abhinav; Mohan, Rohit in Benchmarks, Datasets, Image annotation

2021

Understanding the scene in which an autonomous robot operates is critical for its competent functioning. Such scene comprehension necessitates recognizing instances of traffic participants along with general scene semantics which can be effectively addressed by the panoptic segmentation task. In this paper, we introduce the Efficient Panoptic Segmentation (EfficientPS) architecture that consists of a shared backbone which efficiently encodes and fuses semantically rich multi-scale features. We incorporate a new semantic head that aggregates fine and contextual features coherently and a new variant of Mask R-CNN as the instance head. We also propose a novel panoptic fusion module that congruously integrates the output logits from both the heads of our EfficientPS architecture to yield the final panoptic segmentation output. Additionally, we introduce the KITTI panoptic segmentation dataset that contains panoptic annotations for the popularly challenging KITTI benchmark. Extensive evaluations on Cityscapes, KITTI, Mapillary Vistas and Indian Driving Dataset demonstrate that our proposed architecture consistently sets the new state-of-the-art on all these four benchmarks while being the most efficient and fast panoptic segmentation architecture to date.

Journal Article

Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes

by Siva, Karthik Mustikovela; Mescheder, Lars; Rother, Carsten in Augmented reality, Computer vision, Data augmentation

2018

The success of deep learning in computer vision is based on the availability of large annotated datasets. To lower the need for hand-labeled images, virtually rendered 3D worlds have recently gained popularity. Unfortunately, creating realistic 3D content is challenging on its own and requires significant human effort. In this work, we propose an alternative paradigm which combines real and synthetic data for learning semantic instance segmentation and object detection models. Exploiting the fact that not all aspects of the scene are equally important for this task, we propose to augment real-world imagery with virtual objects of the target category. Capturing real-world images at large scale is easy and cheap, and directly provides real background appearances without the need for creating complex 3D models of the environment. We present an efficient procedure to augment these images with virtual objects. In contrast to modeling complete 3D environments, our data augmentation approach requires only a few user interactions in combination with 3D models of the target object category. Leveraging our approach, we introduce a novel dataset of augmented urban driving scenes with 360-degree images that are used as environment maps to create realistic lighting and reflections on rendered objects. We analyze the significance of realistic object placement by comparing manual placement by humans to automatic methods based on semantic scene analysis. This allows us to create composite images which exhibit both realistic background appearance and a large number of complex object arrangements. Through an extensive set of experiments, we determine the set of parameters that produces augmented data which maximally enhances the performance of instance segmentation models. Further, we demonstrate the utility of the proposed approach on training standard deep models for semantic instance segmentation and object detection of cars in outdoor driving scenarios. We test the models trained on our augmented data on the KITTI 2015 dataset, which we have annotated with pixel-accurate ground truth, and on the Cityscapes dataset. Our experiments demonstrate that the models trained on augmented imagery generalize better than those trained on fully synthetic data or models trained on limited amounts of annotated real data.

Journal Article

LIP: Local Importance-Based Pooling

by Wang, Limin; Gao, Ziteng; Wu, Gangshan in Accuracy, Artificial neural networks, Classification

2023

Spatial downsampling layers are favored in convolutional neural networks (CNNs) to downscale feature maps for larger receptive fields and less memory consumption. However, for visual recognition tasks, these layers might lose discriminative details due to improper pooling strategies. In this paper, we present a unified framework (LAN) over the common downsampling layers (e.g., average pooling, max pooling, and strided convolution) from the viewpoint of local aggregation based on importance. Within this LAN framework, we analyze the issues of these widely used pooling layers and identify the criteria for designing an effective downsampling layer. Based on this analysis, we propose a simple, general, and effective pooling operation based on local importance modeling, termed Local Importance-based Pooling (LIP). LIP is able to enhance discriminative features during the downsampling procedure by learning adaptive importance weights based on inputs. To further modulate different pooling windows for more effective pooling, we present an improved version of LIP, termed LIP++, by introducing an explicit margin term and efficient logit modules. LIP++ yields consistent accuracy improvements over the original LIP at a smaller computational cost. Extensive experiments show that our LIP method consistently yields notable gains with different CNN architectures on the image classification task. On the challenging MS COCO dataset, detectors with our LIP-ResNets as backbones obtain a consistent performance improvement over the vanilla ResNets on both bounding box detection and instance segmentation. Finally, we also verify the effectiveness of LIP on the tasks of pose estimation and semantic segmentation, demonstrating its generalization to dense prediction tasks.
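The abstract frames pooling as local aggregation weighted by importance. A minimal one-dimensional sketch of that idea follows, assuming the importance of each position is the exponential of a per-position logit. The function name `importance_pool_1d` and the hand-supplied logits are illustrative assumptions; in LIP the logits come from a learned module, which is not reproduced here.

```python
import math


def importance_pool_1d(values, logits, window=2, stride=2):
    """Pool `values` with weights exp(logit), normalized per window.

    Hypothetical helper: the logits are passed in by the caller,
    whereas a learned pooling layer would predict them from the input.
    """
    if len(values) != len(logits):
        raise ValueError("values and logits must have the same length")
    pooled = []
    for start in range(0, len(values) - window + 1, stride):
        win_values = values[start:start + window]
        win_logits = logits[start:start + window]
        # Subtract the window maximum before exponentiating for stability.
        peak = max(win_logits)
        weights = [math.exp(l - peak) for l in win_logits]
        weighted_sum = sum(w * v for w, v in zip(weights, win_values))
        pooled.append(weighted_sum / sum(weights))
    return pooled


features = [1.0, 3.0, 2.0, 6.0]
uniform = importance_pool_1d(features, [0.0] * 4)
peaked = importance_pool_1d(features, [50.0 * v for v in features])
```

With constant logits the operation reduces to average pooling, and with logits that grow steeply with the feature value it approaches max pooling, which matches the abstract's view of the common layers as special cases of importance-based aggregation.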

Journal Article

Dual-source discrimination power analysis for multi-instance contactless palmprint recognition

by Bi, Xue; Li, Ming; Leng, Lu in Access control, Accuracy, Algorithms

2017

Due to the benefits of palmprint recognition and the advantages of biometric fusion systems, it is necessary to study multi-source palmprint fusion systems. Unfortunately, research on multi-instance palmprint feature fusion has been absent until now. In this paper, we extract the features of left and right palmprints with the two-dimensional discrete cosine transform (2DDCT) to constitute a dual-source space. Normalization is applied in the dual-source space to avoid the disturbance caused by coefficients with large absolute values. Thus, complicated pre-masking is unnecessary and arbitrary removal of discriminative coefficients is avoided. Since more discriminative coefficients can be preserved and retrieved with discrimination power analysis (DPA) from the dual-source space, accuracy is improved. Experiments performed on a contactless palmprint database confirm that dual-source DPA, which is designed for multi-instance palmprint feature fusion recognition, outperforms single-source DPA.

Journal Article

An instance level analysis of data complexity

by Smith, Michael R.; Giraud-Carrier, Christophe; Martinez, Tony in Algorithms, Applied sciences, Artificial Intelligence

2014

Most data complexity studies have focused on characterizing the complexity of the entire data set and do not provide information about individual instances. Knowing which instances are misclassified and understanding why they are misclassified and how they contribute to data set complexity can improve the learning process and could guide the future development of learning algorithms and data analysis methods. The goal of this paper is to better understand the data used in machine learning problems by identifying and analyzing the instances that are frequently misclassified by learning algorithms that have shown utility to date and are commonly used in practice. We identify instances that are hard to classify correctly (instance hardness) by classifying over 190,000 instances from 64 data sets with 9 learning algorithms. We then use a set of hardness measures to understand why some instances are harder to classify correctly than others. We find that class overlap is a principal contributor to instance hardness. We seek to integrate this information into the training process to alleviate the effects of class overlap and present ways that instance hardness can be used to improve learning.
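One simple proxy for the instance hardness described above is the fraction of learning algorithms that misclassify each instance. The sketch below computes that proxy from precomputed predictions; the function name and the toy labels are illustrative assumptions, and the article's own hardness definition may differ (for example, by using predicted class probabilities rather than hard labels).

```python
def instance_hardness(true_labels, predictions_by_learner):
    """Fraction of learners that misclassify each instance.

    `predictions_by_learner` holds one list of predicted labels per
    learner, each aligned with `true_labels`. This is a simplified
    proxy, assumed here for illustration.
    """
    n_learners = len(predictions_by_learner)
    if n_learners == 0:
        raise ValueError("at least one learner is required")
    hardness = []
    for i, label in enumerate(true_labels):
        wrong = sum(1 for preds in predictions_by_learner if preds[i] != label)
        hardness.append(wrong / n_learners)
    return hardness


# Toy example: three instances, three hypothetical learners.
labels = [0, 1, 1]
predictions = [
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
]
scores = instance_hardness(labels, predictions)
```

Instances with a score near 1 are misclassified by almost every learner and are the natural candidates for the class-overlap analysis the abstract mentions.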

Journal Article

Leaf segmentation in plant phenotyping: a collation study

by Liu, Xiaoming; Polder, Gerrit; Tsaftaris, Sotirios A. in Accuracy, Algorithms, Chamfering

2016

Image-based plant phenotyping is a growing application area of computer vision in agriculture. A key task is the segmentation of all individual leaves in images. Here we focus on the most common rosette model plants, Arabidopsis and young tobacco. Although leaves do share appearance and shape characteristics, the presence of occlusions and variability in leaf shape and pose, as well as imaging conditions, render this problem challenging. The aim of this paper is to compare several leaf segmentation solutions on a unique and first-of-its-kind dataset containing images from typical phenotyping experiments. In particular, we report and discuss methods and findings of a collection of submissions for the first Leaf Segmentation Challenge of the Computer Vision Problems in Plant Phenotyping workshop in 2014. Four methods are presented: three segment leaves by processing the distance transform in an unsupervised fashion, and the other via optimal template selection and Chamfer matching. Overall, we find that although separating plant from background can be accomplished with satisfactory accuracy (> 90% Dice score), individual leaf segmentation and counting remain challenging when leaves overlap. Additionally, accuracy is lower for younger leaves. We also find that variability in datasets does affect outcomes. Our findings motivate further investigation and development of specialized algorithms for this particular application, and suggest that challenges of this form are ideally suited for advancing the state of the art. Data are publicly available (online at http://www.plant-phenotyping.org/datasets) to support future challenges beyond segmentation within this application domain.
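The plant-versus-background accuracy above is reported as a Dice score. As a rough illustration, the sketch below computes the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), for two binary masks given as nested lists. The function name, the toy masks, and the convention of returning 1.0 for two empty masks are assumptions made for this example, not details from the article.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient between two binary masks of the same shape."""
    pred = [bool(v) for row in pred_mask for v in row]
    truth = [bool(v) for row in true_mask for v in row]
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy 2x2 masks: one of the two predicted pixels is correct.
predicted = [[1, 1], [0, 0]]
reference = [[1, 0], [0, 0]]
score = dice_score(predicted, reference)
```

A score above 0.9 therefore means the predicted foreground and the reference foreground overlap almost completely relative to their combined size.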

Journal Article
