Catalogue Search | MBRL
Explore the vast range of titles available.

93 result(s) for "Adam, Hartwig"
Sixteen facial expressions occur in similar contexts worldwide
2021
Understanding the degree to which human facial expressions co-vary with specific social contexts across cultures is central to the theory that emotions enable adaptive responses to important challenges and opportunities [1–6]. Concrete evidence linking social context to specific facial expressions is sparse and is largely based on survey-based approaches, which are often constrained by language and small sample sizes [7–13]. Here, by applying machine-learning methods to real-world, dynamic behaviour, we ascertain whether naturalistic social contexts (for example, weddings or sporting competitions) are associated with specific facial expressions [14] across different cultures. In two experiments using deep neural networks, we examined the extent to which 16 types of facial expression occurred systematically in thousands of contexts in 6 million videos from 144 countries. We found that each kind of facial expression had distinct associations with a set of contexts that were 70% preserved across 12 world regions. Consistent with these associations, regions varied in how frequently different facial expressions were produced as a function of which contexts were most salient. Our results reveal fine-grained patterns in human facial expressions that are preserved across the modern world.
An analysis of 16 types of facial expression in thousands of contexts in millions of videos revealed fine-grained patterns in human facial expression that are preserved across the modern world.
Journal Article
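The cross-regional preservation figure in the abstract above can be illustrated with a toy computation: given per-region expression-by-context association matrices, correlate each region's matrix with the mean of the others. Everything here (the association scores, the noise level) is synthetic; only the shapes — 16 expressions, 12 regions — follow the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
n_expressions, n_contexts, n_regions = 16, 50, 12

# Synthetic association scores between each facial expression and each
# context: a world-shared component plus region-specific noise.
shared = rng.normal(size=(n_expressions, n_contexts))
associations = np.stack([
    shared + 0.5 * rng.normal(size=(n_expressions, n_contexts))
    for _ in range(n_regions)
])

# Preservation across regions: correlate each region's association map
# with the average map of all other regions (leave-one-out).
def preservation(assoc):
    scores = []
    for r in range(len(assoc)):
        others = np.delete(assoc, r, axis=0).mean(axis=0)
        scores.append(np.corrcoef(assoc[r].ravel(), others.ravel())[0, 1])
    return np.array(scores)

scores = preservation(associations)
print(scores.round(2))
```

With a strong shared component, every region's leave-one-out correlation comes out high, which is the sense in which an association pattern is "preserved" across regions.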
View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose
2022
Recognition of human poses and actions is crucial for autonomous systems to interact smoothly with people. However, cameras generally capture human poses in 2D as images and videos, which can have significant appearance variations across viewpoints that make the recognition tasks challenging. To address this, we explore recognizing similarity in 3D human body poses from 2D information, which has not been well-studied in existing works. Here, we propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses. Input ambiguities of 2D poses from projection and occlusion are difficult to represent through a deterministic mapping, and therefore we adopt a probabilistic formulation for our embedding space. Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views, in comparison with 3D pose estimation models. We also show that by training a simple temporal embedding model, we achieve superior performance on pose sequence retrieval and largely reduce the embedding dimension from stacking frame-based embeddings for efficient large-scale retrieval. Furthermore, in order to enable our embeddings to work with partially visible input, we further investigate different keypoint occlusion augmentation strategies during training. We demonstrate that these occlusion augmentations significantly improve retrieval performance on partial 2D input poses. Results on action recognition and video alignment demonstrate that using our embeddings without any additional training achieves competitive performance relative to other models specifically trained for each task.
Journal Article
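The probabilistic-embedding idea in the abstract above can be sketched in a few lines: an encoder (here just untrained random weights, not the paper's network) maps flattened 2D keypoints to a Gaussian — a mean plus a log-variance that can absorb projection and occlusion ambiguity — and retrieval scores two poses by how much their Gaussians overlap. The scoring rule below (negative expected squared distance between samples) is one simple choice, labeled as an illustrative stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)
n_joints, emb_dim = 13, 16

# Untrained random weights standing in for the learned encoder: it maps
# flattened 2D keypoints to a Gaussian embedding (mean + log-variance),
# letting ambiguous 2D poses spread probability mass instead of
# collapsing to a single point.
W_mu = 0.1 * rng.normal(size=(2 * n_joints, emb_dim))
W_logvar = 0.01 * rng.normal(size=(2 * n_joints, emb_dim))

def embed(keypoints_2d):
    x = keypoints_2d.ravel()
    return x @ W_mu, x @ W_logvar          # (mean, log-variance)

def match_score(e1, e2):
    """Higher when the two Gaussian embeddings overlap more: the negative
    expected squared distance between samples from the two Gaussians."""
    (mu1, lv1), (mu2, lv2) = e1, e2
    return -np.sum((mu1 - mu2) ** 2 + np.exp(lv1) + np.exp(lv2))

query = rng.normal(size=(n_joints, 2))
near = query + 0.01 * rng.normal(size=(n_joints, 2))   # nearly the same pose
far = rng.normal(size=(n_joints, 2))                   # an unrelated pose

s_near = match_score(embed(query), embed(near))
s_far = match_score(embed(query), embed(far))
print(s_near, s_far)
```

Even with random weights, a nearby pose scores higher than an unrelated one; training would shape the space so that the same 3D pose seen from different views also lands close.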
Training Machines to Identify Species using GBIF-mediated Datasets
by Brulé, Denis; Loarie, Scott; Perona, Pietro
in Artificial intelligence; Biodiversity; cameras
2019
Advances in machine vision technology are rapidly enabling new and innovative uses within the field of biodiversity. Computers are now able to use images to identify tens of thousands of species across a wide range of taxonomic groups in real time, notably demonstrated by iNaturalist.org, which suggests species IDs to users (https://www.inaturalist.org/pages/computer_vision_demo) as they create observation records. Soon it will be commonplace to detect species in video feeds or use the camera in a mobile device to search for species-related content on the Internet. The Global Biodiversity Information Facility (GBIF) has an important role to play in advancing and improving this technology, whether in terms of data, collaboration across teams, or citation practice. But in the short term, the most important role may relate to initiating a cultural shift in accepted practices for the use of GBIF-mediated data for training of artificial intelligence (AI). “Training datasets” play a critical role in achieving species recognition capability in any machine vision system. These datasets compile representative images containing the explicit, verifiable identifications of the species they include. High-powered computers run algorithms on these training datasets, analysing the imagery and building complex models that characterize defining features for each species or taxonomic group. Researchers can, in turn, apply the resulting models to new images, determining what species or group they likely contain. Current research in machine vision is exploring (a) the use of location and date information to further improve model results, (b) identification methods beyond species-level into attribute, character, trait, or part-level ID, with an eye toward human interpretability, and (c) expertise modeling for improved determination of “research grade” images and metadata. 
The GBIF community has amassed one of the largest datasets of labelled species images available on the internet: more than 33 million species occurrence records in GBIF.org have one or more images (https://www.gbif.org/occurrence/gallery). Machine vision models, when integrated into the data collection tools in use across the GBIF network, can improve the user experience. For example, in citizen science applications like iNaturalist, automated species suggestion helps even novice users contribute occurrence records to GBIF. Perhaps most importantly, GBIF has implemented uniform (and open) data licensing, established guidelines on citation and provided consistent methods for tracking data use through the Digital Object Identifiers (DOI) citation chain. GBIF would like to build on the lessons learned in these activities while striving to assist with this technology research and increase its power and availability. We envisage an approach as follows: To assist in developing and refining machine vision models, GBIF plans to provide training datasets, taking effort to ensure license and citation practice are respected. The training datasets will be issued with a DOI, and the contributing datasets will be linked through the DOI citation graph. To assist application developers, Google and Visipedia plan to build and publish openly-licensed models and tutorials for how to adapt them for localized use. Together we will strive to ensure that data is being used responsibly and transparently, to close the gap between machine vision scientists, application developers, and users and to share taxonomic trees capturing the taxon rank to which machine vision models can identify with confidence based on an image’s visual characteristics.
Journal Article
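A minimal sketch of the license-respecting filtering such a training dataset would need: keep only openly licensed occurrence records that carry images, and emit (image, label) pairs. The field names and license strings below are illustrative, not the exact GBIF occurrence schema.

```python
# Hypothetical occurrence records; field names (scientificName, license,
# media) are illustrative stand-ins, not the real GBIF API schema.
records = [
    {"scientificName": "Danaus plexippus", "license": "CC_BY_4_0",
     "media": ["img1.jpg"]},
    {"scientificName": "Danaus plexippus", "license": "CC_BY_NC_4_0",
     "media": []},
    {"scientificName": "Apis mellifera", "license": "ALL_RIGHTS_RESERVED",
     "media": ["img2.jpg"]},
    {"scientificName": "Apis mellifera", "license": "CC0_1_0",
     "media": ["img3.jpg", "img4.jpg"]},
]

OPEN_LICENSES = {"CC0_1_0", "CC_BY_4_0"}

def training_examples(records):
    """Yield (image, label) pairs only from openly licensed records that
    have media, so license and citation practice can be respected
    downstream."""
    for rec in records:
        if rec["license"] in OPEN_LICENSES and rec["media"]:
            for image in rec["media"]:
                yield image, rec["scientificName"]

dataset = list(training_examples(records))
print(dataset)  # → three (image, species) pairs from the two open records
```

In a real pipeline the provenance of each contributing dataset would travel with the training set so the DOI citation graph described above can be maintained.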
View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose
2021
Recognition of human poses and actions is crucial for autonomous systems to interact smoothly with people. However, cameras generally capture human poses in 2D as images and videos, which can have significant appearance variations across viewpoints that make the recognition tasks challenging. To address this, we explore recognizing similarity in 3D human body poses from 2D information, which has not been well-studied in existing works. Here, we propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses. Input ambiguities of 2D poses from projection and occlusion are difficult to represent through a deterministic mapping, and therefore we adopt a probabilistic formulation for our embedding space. Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views, in comparison with 3D pose estimation models. We also show that by training a simple temporal embedding model, we achieve superior performance on pose sequence retrieval and largely reduce the embedding dimension from stacking frame-based embeddings for efficient large-scale retrieval. Furthermore, in order to enable our embeddings to work with partially visible input, we further investigate different keypoint occlusion augmentation strategies during training. We demonstrate that these occlusion augmentations significantly improve retrieval performance on partial 2D input poses. Results on action recognition and video alignment demonstrate that using our embeddings without any additional training achieves competitive performance relative to other models specifically trained for each task.
Epsilon-VAE: Denoising as Visual Decoding
2025
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space. For high-dimensional visual data, it reduces redundancy and emphasizes key features for high-quality generation. Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input. In this work, we offer a new perspective by proposing denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder. We evaluate our approach by assessing both reconstruction (rFID) and generation quality (FID), comparing it to state-of-the-art autoencoding approaches. By adopting iterative reconstruction through diffusion, our autoencoder, namely Epsilon-VAE, achieves high reconstruction quality, which in turn enhances downstream generation quality by 22% at the same compression rates or provides 2.3x inference speedup through increasing compression rates. We hope this work offers new insights into integrating iterative generation and autoencoding for improved compression and generation.
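The decoding-by-refinement idea reduces to this toy loop: instead of reconstructing the image from the latent in one decoder pass, start from pure noise and repeatedly nudge it toward an image consistent with the latent. The random linear encoder/decoder and the fixed refinement rule below are stand-ins for Epsilon-VAE's learned diffusion process, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, latent_dim = 32, 8

# Random stand-in for a trained encoder; its transpose plays "decoder".
E = rng.normal(size=(dim, latent_dim)) / np.sqrt(dim)

def iterative_decode(z, steps):
    """Denoising as decoding: start from pure noise and repeatedly refine
    it toward an image consistent with the latent z, rather than emitting
    a reconstruction in a single decoder pass."""
    target = z @ E.T              # the image the latent points to
    x = rng.normal(size=dim)      # start from pure noise
    for _ in range(steps):
        x = x + 0.5 * (target - x)   # one small refinement step
    return x

z = rng.normal(size=dim) @ E      # encode a toy "image"
target = z @ E.T
err_1 = np.linalg.norm(iterative_decode(z, steps=1) - target)
err_10 = np.linalg.norm(iterative_decode(z, steps=10) - target)
print(err_1, err_10)
```

The point of the toy is only the control flow: more refinement steps bring the output closer to what the latent encodes, which is the trade the abstract describes between reconstruction quality and inference cost.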
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
2023
We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model. 2) Sparsification with MoE on a single modality-agnostic encoder substantially improves the performance, outperforming dense models that use modality-specific encoders or additional fusion layers and greatly mitigates the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks including video classification, image classification, image-text, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L variant focusing on video tasks that achieves new state-of-the-art in zero-shot video classification: 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 68.3% on Kinetics-700, improving the previous state-of-the-art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of their total training computational cost.
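The alternating-gradient-descent half of the recipe can be shown on a toy multi-task problem: one shared parameter vector, several losses, and each optimization step uses the gradient of a single task only, cycling through them. The tasks here are synthetic least-squares problems, not real modalities, and the MoE half is omitted.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_per_task, n_tasks = 8, 30, 3

# Synthetic multi-task setup: every "task" (a stand-in for one
# modality/loss/resolution combination) is a least-squares problem
# over the same shared parameter vector.
w_true = rng.normal(size=d)
tasks = []
for _ in range(n_tasks):
    A = rng.normal(size=(n_per_task, d))
    b = A @ w_true + 0.1 * rng.normal(size=n_per_task)
    tasks.append((A, b))

def total_loss(w):
    return sum(np.mean((A @ w - b) ** 2) for A, b in tasks)

# Alternating gradient descent: each step updates the shared parameters
# with the gradient of ONE task only, cycling through the tasks.
w = np.zeros(d)
lr = 0.01
losses = [total_loss(w)]
for step in range(1000):
    A, b = tasks[step % n_tasks]              # alternate task/modality
    grad = 2 * A.T @ (A @ w - b) / n_per_task
    w -= lr * grad
    losses.append(total_loss(w))

print(losses[0], losses[-1])
```

Because no step ever needs all tasks in memory at once, this update pattern scales to heterogeneous inputs (different resolutions, losses, modalities) that could not share a single batched forward pass.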
SANPO: A Scene Understanding, Accessibility and Human Navigation Dataset
2024
Vision is essential for human navigation. The World Health Organization (WHO) estimates that 43.3 million people were blind in 2020, and this number is projected to reach 61 million by 2050. Modern scene understanding models could empower these people by assisting them with navigation, obstacle avoidance and visual recognition capabilities. The research community needs high quality datasets for both training and evaluation to build these systems. While datasets for autonomous vehicles are abundant, there is a critical gap in datasets tailored for outdoor human navigation. This gap poses a major obstacle to the development of computer vision based Assistive Technologies. To overcome this obstacle, we present SANPO, a large-scale egocentric video dataset designed for dense prediction in outdoor human navigation environments. SANPO contains 701 stereo videos of 30+ seconds captured in diverse real-world outdoor environments across four geographic locations in the USA. Every frame has a high resolution depth map, and 112K frames were annotated with temporally consistent dense video panoptic segmentation labels. The dataset also includes 1961 high-quality synthetic videos with pixel accurate depth and panoptic segmentation annotations to balance the noisy real world annotations with the high precision synthetic annotations. SANPO is publicly available at https://google-research-datasets.github.io/sanpo_dataset/ and is already being used by mobile applications like Project Guideline to train mobile models that help low-vision users go running outdoors independently.
Video Creation by Demonstration
2024
We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts from the demonstration. To enable this capability, we present δ-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls that are based on explicit signals, we adopt the form of implicit latent control for the maximal flexibility and expressiveness required by general videos. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process with minimal appearance leakage. Empirically, δ-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations, and demonstrates potential towards interactive world simulation. Sampled video generation results are available at https://delta-diffusion.github.io/.
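A cartoon of the appearance-bottleneck intuition, not δ-Diffusion itself: if frame embeddings are a static "appearance" component plus accumulated motion, then frame-to-frame differences keep the action and cancel the appearance, and replaying those differences from a new context frame transfers the action to a new scene. All embeddings below are synthetic vectors.

```python
import numpy as np

rng = np.random.default_rng(4)
n_frames, dim = 8, 16

# Synthetic frame embeddings for a demonstration video: a fixed
# "appearance" vector plus accumulated per-frame "action" steps.
appearance = rng.normal(size=dim)
action_steps = 0.3 * rng.normal(size=(n_frames, dim))
demo = appearance + np.cumsum(action_steps, axis=0)

# Crude appearance bottleneck: frame-to-frame differences keep the
# motion and cancel the static appearance component entirely.
action_latents = np.diff(demo, axis=0)

# "Generation": replay the extracted action from a new context frame
# belonging to a different scene (different appearance).
context = rng.normal(size=dim)
generated = context + np.vstack([np.zeros(dim),
                                 np.cumsum(action_latents, axis=0)])
print(generated.shape)
```

The real system conditions a learned diffusion model on action latents extracted by a video foundation model; this toy only shows why a bottleneck that discards static content limits appearance leakage from the demonstration.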
Video Foundation Models for Animal Behavior Analysis
2024
Computational approaches leveraging computer vision and machine learning have transformed the quantification of animal behavior from video. However, existing methods often rely on task-specific features or models, which struggle to generalize across diverse datasets and tasks. Recent advances in machine learning, particularly the emergence of vision foundation models, i.e., large-scale models pre-trained on massive, diverse visual repositories, offer a way to tackle these challenges. Here, we investigate the potential of frozen video foundation models across a range of behavior analysis tasks, including classification, retrieval, and localization. We use a single, frozen model to extract general-purpose representations from video data, and perform extensive evaluations on diverse open-sourced animal behavior datasets. Our results demonstrate that features with minimal adaptation from foundation models achieve competitive performance compared to existing methods specifically designed for each dataset, across species, behaviors, and experimental contexts. This highlights the potential of frozen video foundation models as a powerful and accessible backbone for automated behavior analysis, with the ability to accelerate research across diverse fields, from neuroscience to ethology to ecology.
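The frozen-backbone-plus-minimal-adaptation recipe looks like this in miniature: a fixed feature extractor (here just a random projection standing in for a pre-trained video model) and a nearest-centroid classifier fit on the frozen features, with no backbone training at all. The two-behavior dataset is synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)
raw_dim, feat_dim = 100, 16

# Stand-in for a frozen video foundation model: a fixed random
# projection. In practice this would be a large pre-trained network
# used without any fine-tuning.
W_frozen = rng.normal(size=(raw_dim, feat_dim)) / np.sqrt(raw_dim)

def extract(video):
    """General-purpose representation from the frozen backbone."""
    return video @ W_frozen

# Synthetic behavior dataset: two "behaviors" as shifted clusters.
def make_clips(label, n=30):
    center = np.zeros(raw_dim)
    center[:10] = 4.0 if label == 1 else -4.0
    return center + rng.normal(size=(n, raw_dim))

train = [(make_clips(0), 0), (make_clips(1), 1)]

# Minimal adaptation: a nearest-centroid classifier on frozen features;
# only the two centroids are "learned".
centroids = {label: extract(clips).mean(axis=0) for clips, label in train}

def classify(video):
    feats = extract(video)
    return min(centroids, key=lambda c: np.linalg.norm(feats - centroids[c]))

test_clip = make_clips(1, n=1)[0]
print(classify(test_clip))
```

Swapping the nearest-centroid head for a linear probe is the other common "minimal adaptation"; either way, the backbone's weights never change, which is what makes the approach accessible across species and datasets.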
Unified Visual Relationship Detection with Vision and Language Models
2023
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. Merging labels spanning different datasets could be challenging due to inconsistent taxonomies. The issue is exacerbated in visual relationship detection when second-order visual semantics are introduced between pairs of objects. To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs). VLMs provide well-aligned image and text embeddings, where similar relationships are optimized to be close to each other for semantic unification. Our bottom-up design enables the model to enjoy the benefit of training with both object detection and visual relationship datasets. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model. UniVRD achieves 38.07 mAP on HICO-DET, outperforming the current best bottom-up HOI detector by 14.26 mAP. More importantly, we show that our unified detector performs as well as dataset-specific models in mAP, and achieves further improvements when we scale up the model. Our code will be made publicly available on GitHub.
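The semantic-unification step can be sketched with toy text embeddings: labels from two datasets are linked when their embeddings (synthetic here, where UniVRD would use a VLM text encoder) are close in cosine similarity, so "person"/"human" and "bicycle"/"bike" merge without hand-built taxonomy mapping.

```python
import numpy as np

rng = np.random.default_rng(6)
dim = 32

# Synthetic text embeddings standing in for a VLM text encoder:
# synonymous labels from different datasets land near each other.
base = {"person": rng.normal(size=dim), "bicycle": rng.normal(size=dim)}
dataset_a = {name: vec + 0.05 * rng.normal(size=dim)
             for name, vec in base.items()}
dataset_b = {"human": base["person"] + 0.05 * rng.normal(size=dim),
             "bike": base["bicycle"] + 0.05 * rng.normal(size=dim)}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Semantic unification: link labels across datasets whose embeddings
# are close, instead of reconciling the taxonomies by hand.
matches = [(a, b) for a, va in dataset_a.items()
                  for b, vb in dataset_b.items()
                  if cosine(va, vb) > 0.9]
print(matches)
```

For relationship labels the same trick applies to subject-predicate-object phrases embedded as text, which is what lets one detector train over the union of several datasets' label spaces.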