Catalogue Search | MBRL

An Exploration of Embodied Visual Exploration

by Grauman Kristen , Ramakrishnan, Santhosh K , Jayaraman Dinesh in Algorithms , Computer vision , Exploration

2021

Embodied computer vision considers perception for robots in novel, unstructured environments. Of particular importance is the embodied visual exploration problem: how might a robot equipped with a camera scope out a new environment? Despite the progress thus far, many basic questions pertinent to this problem remain unanswered: (i) What does it mean for an agent to explore its environment well? (ii) Which methods work well, and under which assumptions and environmental settings? (iii) Where do current approaches fall short, and where might future work seek to improve? Seeking answers to these questions, we first present a taxonomy for existing visual exploration algorithms and create a standard framework for benchmarking them. We then perform a thorough empirical study of the four state-of-the-art paradigms using the proposed framework with two photorealistic simulated 3D environments, a state-of-the-art exploration architecture, and diverse evaluation metrics. Our experimental results offer insights and suggest new performance metrics and baselines for future work in visual exploration. Code, models and data are publicly available.

Journal Article

Share this book

Add to My Shelf

Predicting Important Objects for Egocentric Video Summarization

by Grauman, Kristen , Lee, Yong Jae in Activities of daily living , Analysis , Artificial Intelligence

2015

We present a video summarization approach for egocentric or “wearable” camera data. Given hours of video, the proposed method produces a compact storyboard summary of the camera wearer’s day. In contrast to traditional keyframe selection techniques, the resulting summary focuses on the most important objects and people with which the camera wearer interacts. To accomplish this, we develop region cues indicative of high-level saliency in egocentric video—such as the nearness to hands, gaze, and frequency of occurrence—and learn a regressor to predict the relative importance of any new region based on these cues. Using these predictions and a simple form of temporal event detection, our method selects frames for the storyboard that reflect the key object-driven happenings. We adjust the compactness of the final summary given either an importance selection criterion or a length budget; for the latter, we design an efficient dynamic programming solution that accounts for importance, visual uniqueness, and temporal displacement. Critically, the approach is neither camera-wearer-specific nor object-specific; that means the learned importance metric need not be trained for a given user or context, and it can predict the importance of objects and people that have never been seen previously. Our results on two egocentric video datasets show the method’s promise relative to existing techniques for saliency and summarization.

Journal Article

Share this book

Add to My Shelf

Learning the Relative Importance of Objects from Tagged Images for Retrieval and Cross-Modal Search

by Grauman, Kristen , Hwang, Sung Ju in Analysis , Artificial Intelligence , Computer Imaging

2012

We introduce an approach to image retrieval and auto-tagging that leverages the implicit information about object importance conveyed by the list of keyword tags a person supplies for an image. We propose an unsupervised learning procedure based on Kernel Canonical Correlation Analysis that discovers the relationship between how humans tag images (e.g., the order in which words are mentioned) and the relative importance of objects and their layout in the scene. Using this discovered connection, we show how to boost accuracy for novel queries, such that the search results better preserve the aspects a human may find most worth mentioning. We evaluate our approach on three datasets using either keyword tags or natural language descriptions, and quantify results with both ground truth parameters as well as direct tests with human subjects. Our results show clear improvements over approaches that either rely on image features alone, or that use words and image features but ignore the implied importance cues. Overall, our work provides a novel way to incorporate high-level human perception of scenes into visual representations for enhanced image search.

Journal Article

Share this book

Add to My Shelf

Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds

by Vijayanarasimhan, Sudheendra , Grauman, Kristen in Accumulation , Active control , Active learning

2014

Active learning and crowdsourcing are promising ways to efficiently build up training sets for object recognition, but thus far techniques are tested in artificially controlled settings. Typically the vision researcher has already determined the dataset’s scope, the labels “actively” obtained are in fact already known, and/or the crowd-sourced collection process is iteratively fine-tuned. We present an approach for live learning of object detectors, in which the system autonomously refines its models by actively requesting crowd-sourced annotations on images crawled from the Web. To address the technical issues such a large-scale system entails, we introduce a novel part-based detector amenable to linear classifiers, and show how to identify its most uncertain instances in sub-linear time with a hashing-based solution. We demonstrate the approach with experiments of unprecedented scale and autonomy, and show it successfully improves the state-of-the-art for the most challenging objects in the PASCAL VOC benchmark. In addition, we show our detector competes well with popular nonlinear classifiers that are much more expensive to train.

Journal Article

Share this book

Add to My Shelf

Learning Image Representations Tied to Egomotion from Unlabeled Video

by Jayaraman, Dinesh , Grauman, Kristen in Artificial Intelligence , Computer Imaging , Computer Science

2017

Understanding how images of objects and scenes behave in response to specific egomotions is a crucial aspect of proper visual development, yet existing visual learning methods are conspicuously disconnected from the physical source of their images. We propose a new “embodied” visual learning paradigm, exploiting proprioceptive motor signals to train visual representations from egocentric video with no manual supervision. Specifically, we enforce that our learned features exhibit equivariance i.e., they respond predictably to transformations associated with distinct egomotions. With three datasets, we show that our unsupervised feature learning approach significantly outperforms previous approaches on visual recognition and next-best-view prediction tasks. In the most challenging test, we show that features learned from video captured on an autonomous driving platform improve large-scale scene recognition in static images from a disjoint domain.

Journal Article

Share this book

Add to My Shelf

Visually-Guided Audio Spatialization in Video with Geometry-Aware Multi-task Learning

by Gao, Ruohan , Garg, Rishabh , Grauman, Kristen in Audio data , Datasets , Feature extraction

2023

Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings. We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process. In particular, we develop a multi-task framework that learns geometry-aware features for binaural audio generation by accounting for the underlying room impulse response, the visual stream’s coherence with the sound source(s) positions, and the consistency in geometry of the sounding objects over time. Furthermore, we introduce two new large video datasets: one with realistic binaural audio simulated for real-world scanned environments, and the other with pseudo-binaural audio obtained from ambisonic sounds in YouTube 360∘ videos. On three datasets, we demonstrate the efficacy of our method, which achieves state-of-the-art results.

Journal Article

Share this book

Add to My Shelf

Discovering Attribute Shades of Meaning with the Crowd

by Grauman, Kristen , Kovashka, Adriana in Analysis , Artificial Intelligence , Categories

2015

To learn semantic attributes, existing methods typically train one discriminative model for each word in a vocabulary of nameable properties. However, this “one model per word” assumption is problematic: while a word might have a precise linguistic definition, it need not have a precise visual definition. We propose to discover shades of attribute meaning. Given an attribute name, we use crowdsourced image labels to discover the latent factors underlying how different annotators perceive the named concept. We show that structure in those latent factors helps reveal shades, that is, interpretations for the attribute shared by some group of annotators. Using these shades, we train classifiers to capture the primary (often subtle) variants of the attribute. The resulting models are both semantic and visually precise. By catering to users’ interpretations, they improve attribute prediction accuracy on novel images. Shades also enable more successful attribute-based image search, by providing robust personalized models for retrieving multi-attribute query results. They are widely applicable to tasks that involve describing visual content, such as zero-shot category learning and organization of photo collections.

Journal Article

Share this book

Add to My Shelf

WhittleSearch: Interactive Image Search with Relative Attribute Feedback

by Grauman, Kristen , Parikh, Devi , Kovashka, Adriana in Algorithms , Analysis , Artificial Intelligence

2015

We propose a novel mode of feedback for image search, where a user describes which properties of exemplar images should be adjusted in order to more closely match his/her mental model of the image sought. For example, perusing image results for a query “black shoes”, the user might state, “Show me shoe images like these, but sportier .” Offline, our approach first learns a set of ranking functions, each of which predicts the relative strength of a nameable attribute in an image (e.g., sportiness ). At query time, the system presents the user with a set of exemplar images, and the user relates them to his/her target image with comparative statements. Using a series of such constraints in the multi-dimensional attribute space, our method iteratively updates its relevance function and re-ranks the database of images. To determine which exemplar images receive feedback from the user, we present two variants of the approach: one where the feedback is user-initiated and another where the feedback is actively system-initiated. In either case, our approach allows a user to efficiently “whittle away” irrelevant portions of the visual feature space, using semantic language to precisely communicate her preferences to the system. We demonstrate our technique for refining image search for people, products, and scenes, and we show that it outperforms traditional binary relevance feedback in terms of search speed and accuracy. In addition, the ordinal nature of relative attributes helps make our active approach efficient—both computationally for the machine when selecting the reference images, and for the user by requiring less user interaction than conventional passive and active methods.

Journal Article

Share this book

Add to My Shelf

Learning Kernels for Unsupervised Domain Adaptation with Applications to Visual Object Recognition

by Grauman, Kristen , Sha, Fei , Gong, Boqing in Adaptation , Algorithms , Analysis

2014

Domain adaptation aims to correct the mismatch in statistical properties between the source domain on which a classifier is trained and the target domain to which the classifier is to be applied. In this paper, we address the challenging scenario of unsupervised domain adaptation , where the target domain does not provide any annotated data to assist in adapting the classifier. Our strategy is to learn robust features which are resilient to the mismatch across domains and then use them to construct classifiers that will perform well on the target domain. To this end, we propose novel kernel learning approaches to infer such features for adaptation. Concretely, we explore two closely related directions. In the first direction, we propose unsupervised learning of a geodesic flow kernel (GFK). The GFK summarizes the inner products in an infinite sequence of feature subspaces that smoothly interpolates between the source and target domains. In the second direction, we propose supervised learning of a kernel that discriminatively combines multiple base GFKs. Those base kernels model the source and the target domains at fine-grained granularities. In particular, each base kernel pivots on a different set of landmarks—the most useful data instances that reveal the similarity between the source and the target domains, thus bridging them to achieve adaptation. Our approaches are computationally convenient, automatically infer important hyper-parameters, and are capable of learning features and classifiers discriminatively without demanding labeled data from the target domain. In extensive empirical studies on standard benchmark recognition datasets, our appraches yield state-of-the-art results compared to a variety of competing methods.

Journal Article

Share this book

Add to My Shelf

Cost-Sensitive Active Visual Category Learning

by Vijayanarasimhan, Sudheendra , Grauman, Kristen in Active learning , Annotations , Applied sciences

2011

We present an active learning framework that predicts the tradeoff between the effort and information gain associated with a candidate image annotation, thereby ranking unlabeled and partially labeled images according to their expected “net worth” to an object recognition system. We develop a multi-label multiple-instance approach that accommodates realistic images containing multiple objects and allows the category-learner to strategically choose what annotations it receives from a mixture of strong and weak labels. Since the annotation cost can vary depending on an image’s complexity, we show how to improve the active selection by directly predicting the time required to segment an unlabeled image. Our approach accounts for the fact that the optimal use of manual effort may call for a combination of labels at multiple levels of granularity, as well as accurate prediction of manual effort. As a result, it is possible to learn more accurate category models with a lower total expenditure of annotation effort. Given a small initial pool of labeled data, the proposed method actively improves the category models with minimal manual intervention.

Journal Article

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter