Catalogue Search | MBRL
Explore the vast range of titles available.
322 result(s) for "Zisserman, Andrew"
Synthetic Humans for Action Recognition from Unseen Viewpoints
by Laptev, Ivan; Varol, Gül; Schmid, Cordelia
in Human activity recognition, Human motion, Human performance
2021
Although synthetic training data has been shown to be beneficial for tasks such as human pose estimation, its use for RGB human action recognition is relatively unexplored. Our goal in this work is to answer the question of whether synthetic humans can improve the performance of human action recognition, with a particular focus on generalization to unseen viewpoints. We make use of the recent advances in monocular 3D human body reconstruction from real action sequences to automatically render synthetic training videos for the action labels. We make the following contributions: (1) we investigate the extent of variations and augmentations that are beneficial to improving performance at new viewpoints. We consider changes in body shape and clothing for individuals, as well as more action-relevant augmentations such as non-uniform frame sampling, and interpolating between the motion of individuals performing the same action; (2) we introduce a new data generation methodology, SURREACT, that allows training of spatio-temporal CNNs for action classification; (3) we substantially improve the state-of-the-art action recognition performance on the NTU RGB+D and UESTC standard human action multi-view benchmarks; and finally, (4) we extend the augmentation approach to in-the-wild videos from a subset of the Kinetics dataset to investigate the case when only one-shot training data is available, and demonstrate improvements in this case as well.
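The non-uniform frame sampling mentioned in contribution (1) is easy to illustrate. Below is a minimal Python sketch of the idea, assuming a clip sampler that jitters evenly spaced frame indices; all names and parameters are illustrative, not taken from the SURREACT code.

```python
import numpy as np

def nonuniform_clip(num_frames: int, clip_len: int = 16, rng=None) -> np.ndarray:
    """Pick `clip_len` frame indices from a video of `num_frames` frames,
    jittering the spacing so the clip is sampled non-uniformly in time."""
    rng = np.random.default_rng() if rng is None else rng
    # Start from evenly spaced anchor points...
    anchors = np.linspace(0, num_frames - 1, clip_len)
    # ...then perturb each anchor by up to half the inter-anchor gap.
    gap = (num_frames - 1) / max(clip_len - 1, 1)
    jitter = rng.uniform(-0.5 * gap, 0.5 * gap, size=clip_len)
    idx = np.clip(np.round(anchors + jitter), 0, num_frames - 1)
    return np.sort(idx).astype(int)  # keep temporal order

# Example: indices for a 16-frame training clip from a 120-frame video.
print(nonuniform_clip(120))
```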
Journal Article
A Statistical Approach to Texture Classification from Single Images
2005
Issue Title: Special Issue on Texture Analysis and Synthesis
We investigate texture classification from single images obtained under unknown viewpoint and illumination. A statistical approach is developed where textures are modelled by the joint probability distribution of filter responses. This distribution is represented by the frequency histogram of filter response cluster centres (textons). Recognition proceeds from single, uncalibrated images and the novelty here is that rotationally invariant filters are used and the filter response space is low dimensional. Classification performance is compared with the filter banks and methods of Leung and Malik [IJCV, 2001], Schmid [CVPR, 2001] and Cula and Dana [IJCV, 2004], and it is demonstrated that superior performance is achieved here. Classification results are presented for all 61 materials in the Columbia-Utrecht texture database. We also discuss the effects of various parameters on our classification algorithm, such as the choice of filter bank and rotational invariance, the size of the texton dictionary, and the number of training images used. Finally, we present a method of reliably measuring relative orientation co-occurrence statistics in a rotationally invariant manner, and discuss whether incorporating such information can enhance the classifier's performance.
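The texton pipeline the abstract describes (cluster filter responses, histogram the cluster assignments, classify by histogram distance) can be sketched as follows. This is a minimal illustration assuming precomputed per-pixel filter responses; the function names and the chi-squared nearest-neighbour choice follow the abstract's description, not the authors' released code.

```python
import numpy as np
from sklearn.cluster import KMeans

def texton_dictionary(response_arrays, k=40):
    """Cluster pooled filter responses into k texton centres.
    `response_arrays`: one (n_pixels, n_filters) array per training image."""
    pooled = np.vstack(response_arrays)
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(pooled)

def texton_histogram(responses, kmeans):
    """Normalised frequency histogram of texton assignments for one image."""
    labels = kmeans.predict(responses)
    hist = np.bincount(labels, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()

def chi2(h1, h2, eps=1e-10):
    """Chi-squared distance between two normalised histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def classify(query_hist, model_hists, model_labels):
    """Nearest-neighbour classification under the chi-squared distance."""
    d = [chi2(query_hist, m) for m in model_hists]
    return model_labels[int(np.argmin(d))]
```

Training then reduces to building the dictionary from the training images and storing one histogram per model image.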
Journal Article
Automated detection, labelling and radiological grading of clinical spinal MRIs
2024
Spinal magnetic resonance (MR) scans are a vital tool for diagnosing the cause of back pain for many diseases and conditions. However, interpreting clinically useful information from these scans can be challenging, time-consuming and hard to reproduce across different radiologists. In this paper, we alleviate these problems by introducing a multi-stage automated pipeline for analysing spinal MR scans. This pipeline first detects and labels vertebral bodies across several commonly used sequences (e.g. T1w, T2w and STIR) and fields of view (e.g. lumbar, cervical, whole spine). Using these detections, it then performs automated diagnosis for several spinal disorders, including intervertebral disc degenerative changes in T1w and T2w lumbar scans, and spinal metastases, cord compression and vertebral fractures. To achieve this, we propose a new method of vertebrae detection and labelling, using vector fields to group together detected vertebral landmarks and a language-modelling inspired beam search to determine the corresponding levels of the detections. We also employ a new transformer-based architecture to perform radiological grading which incorporates context from multiple vertebrae and sequences, as a real radiologist would. The performance of each stage of the pipeline is tested in isolation on several clinical datasets, each consisting of 66 to 421 scans. The outputs are compared to manual annotations of expert radiologists, demonstrating accurate vertebrae detection across a range of scan parameters. Similarly, the model’s grading predictions for various types of disc degeneration and detection of spinal metastases closely match those of an expert radiologist. To aid future research, our code and trained models are made publicly available.
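The "language-modelling inspired beam search" over vertebral levels can be illustrated with a toy version: detections ordered down the spine, per-detection level scores, and a constraint that levels strictly increase (gaps standing in for missed detections). This is a sketch of the general idea only; the paper's scoring and transition model are richer.

```python
import numpy as np

def label_levels(scores: np.ndarray, beam_width: int = 5):
    """Toy beam search over vertebral-level sequences.
    `scores[i, l]`: log-probability that detection i (ordered down the
    spine) is level l."""
    n_det, n_levels = scores.shape
    # Each beam entry: (total log-prob, sequence of level indices).
    beams = sorted([(scores[0, l], [l]) for l in range(n_levels)],
                   reverse=True)[:beam_width]
    for i in range(1, n_det):
        candidates = []
        for logp, seq in beams:
            # Levels must strictly increase down the spine; skipping a
            # level stands in for a missed detection.
            for nxt in range(seq[-1] + 1, n_levels):
                candidates.append((logp + scores[i, nxt], seq + [nxt]))
        beams = sorted(candidates, reverse=True)[:beam_width]
    return beams[0][1] if beams else None

# Example: 6 detections, 10 candidate levels, random score matrix.
rng = np.random.default_rng(0)
print(label_levels(np.log(rng.dirichlet(np.ones(10), size=6))))
```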
Journal Article
The Pascal Visual Object Classes Challenge: A Retrospective
2015
The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008–2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community’s progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
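Of the evaluation methods listed, the bootstrapping test is the most self-contained to sketch. Below is a generic paired bootstrap over per-image scores, assuming each algorithm yields one score per test image; the exact VOC procedure may differ in detail.

```python
import numpy as np

def bootstrap_significance(per_image_a, per_image_b, n_boot=10000, rng=None):
    """Paired bootstrap over test images: resample images with
    replacement and count how often algorithm A beats algorithm B."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(per_image_a, dtype=float)
    b = np.asarray(per_image_b, dtype=float)
    n = len(a)
    wins = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample the test set
        if a[idx].mean() > b[idx].mean():
            wins += 1
    # Two-sided p-value estimate for "A and B perform the same".
    return 2 * min(wins, n_boot - wins) / n_boot
```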
Journal Article
The Pascal Visual Object Classes (VOC) Challenge
by Winn, John; Van Gool, Luc; Zisserman, Andrew
in Artificial Intelligence, Benchmarking, Categories
2010
The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset have become accepted as the benchmark for object detection.
This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three-year history of the challenge, and proposes directions for future improvement and extension.
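The challenge's headline metric is average precision per class. A minimal, uninterpolated version (VOC used interpolated variants in some years) looks like this, assuming detections already matched to ground truth:

```python
import numpy as np

def average_precision(scores, labels):
    """Area under the precision/recall curve for one class:
    rank detections by confidence, then average the precision
    at each true positive.  `labels` are 1 for true positives."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / (np.arange(len(labels)) + 1)
    n_pos = labels.sum()
    return precision[labels.astype(bool)].sum() / n_pos if n_pos else 0.0

# Four detections ranked by confidence; three are true positives.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1]))  # ~0.806
```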
Journal Article
Automatic and Efficient Human Pose Estimation for Sign Language Videos
by Charles, James; Pfister, Tomas; Zisserman, Andrew
in Artificial Intelligence, Automation, Computer Imaging
2014
We present a fully automatic arm and hand tracker that detects joint positions over continuous sign language video sequences of more than an hour in length. To achieve this, we make contributions in four areas: (i) we show that the overlaid signer can be separated from the background TV broadcast using co-segmentation over all frames with a layered model; (ii) we show that joint positions (shoulders, elbows, wrists) can be predicted per-frame using a random forest regressor given only this segmentation and a colour model; (iii) we show that the random forest can be trained from an existing semi-automatic, but computationally expensive, tracker; and (iv) we introduce an evaluator to assess whether the predicted joint positions are correct for each frame. The method is applied to 20 videos of signing footage with changing backgrounds, challenging imaging conditions, and different signers. Our framework outperforms the state-of-the-art long term tracker by Buehler et al. (International Journal of Computer Vision 95:180–197, 2011), does not require the manual annotation of that work, and, after automatic initialisation, performs tracking in real-time. We also achieve superior joint localisation results to those obtained using the pose estimation method of Yang and Ramanan (Proceedings of the IEEE conference on computer vision and pattern recognition, 2011).
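Contribution (ii), per-frame joint regression with a random forest, can be sketched with scikit-learn. The features and targets below are random stand-ins for the segmentation/colour-model features and the semi-automatic tracker's joint positions described in the abstract.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data: per-frame features derived from the
# signer segmentation and a colour model, with (x, y) wrist positions
# from a slower semi-automatic tracker as training targets.
rng = np.random.default_rng(0)
X_train = rng.random((500, 64))   # 500 frames, 64-D features
y_train = rng.random((500, 2))    # wrist (x, y) per frame

# Train a per-frame regressor, then predict joints frame by frame.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.predict(rng.random((1, 64))))  # wrist (x, y) for a new frame
```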
Journal Article
Audio-visual modelling in a clinical setting
by Noble, J. Alison; Jiao, Jianbo; Zisserman, Andrew
2024
Auditory and visual signals are two primary perception modalities that are usually present together and correlate with each other, not only in natural environments but also in clinical settings. However, audio-visual modelling in the latter case can be more challenging, due to the different sources of audio/video signals and the noise (both signal-level and semantic-level) in auditory signals, usually speech audio. In this study, we consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations that benefit various clinical tasks, without relying on dense supervisory annotations from human experts for the model training. A simple yet effective multi-modal self-supervised learning framework is presented for this purpose. The proposed approach is able to help find standard anatomical planes, predict the focusing position of a sonographer’s eyes, and localise anatomical regions of interest during ultrasound imaging. Experimental analysis on a large-scale clinical multi-modal ultrasound video dataset shows that the proposed novel representation learning method provides good transferable anatomical representations that boost the performance of automated downstream clinical tasks, even outperforming fully-supervised solutions. Being able to learn such medical representations in a self-supervised manner will contribute to several aspects including a better understanding of obstetric imaging, training new sonographers, more effective assistive tools for human experts, and enhancement of the clinical workflow.
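The abstract does not spell out the self-supervised objective, but a common choice for paired audio-visual clips is a symmetric contrastive (InfoNCE-style) loss, sketched below with random stand-in embeddings; treat it as a generic illustration, not the paper's exact framework.

```python
import numpy as np

def _log_softmax(x):
    x = x - x.max(axis=1, keepdims=True)  # numerical stability
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def infonce_loss(video_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired clips: the i-th video
    embedding should match the i-th audio embedding; all other pairs
    in the batch serve as negatives."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature        # (B, B) cosine similarities
    targets = np.arange(len(v))
    loss_v2a = -_log_softmax(logits)[targets, targets].mean()
    loss_a2v = -_log_softmax(logits.T)[targets, targets].mean()
    return 0.5 * (loss_v2a + loss_a2v)

# Example with random stand-in embeddings for a batch of 8 clips.
rng = np.random.default_rng(0)
print(infonce_loss(rng.normal(size=(8, 128)), rng.normal(size=(8, 128))))
```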
Journal Article
Geometric Latent Dirichlet Allocation on a Matching Graph for Large-scale Image Datasets
by Zisserman, Andrew; Philbin, James; Sivic, Josef
in Allocations, Applied sciences, Artificial Intelligence
2011
Given a large-scale collection of images our aim is to efficiently associate images which contain the same entity, for example a building or object, and to discover the significant entities. To achieve this, we introduce the Geometric Latent Dirichlet Allocation (gLDA) model for unsupervised discovery of particular objects in unordered image collections. This explicitly represents images as mixtures of particular objects or facades, and builds rich latent topic models which incorporate the identity and locations of visual words specific to the topic in a geometrically consistent way. Applying standard inference techniques to this model enables images likely to contain the same object to be probabilistically grouped and ranked.
Additionally, to reduce the computational cost of applying the gLDA model to large datasets, we propose a scalable method that first computes a matching graph over all the images in a dataset. This matching graph connects images that contain the same object, and rough image groups can be mined from this graph using standard clustering techniques. The gLDA model can then be applied to generate a more nuanced representation of the data. We also discuss how “hub images” (images representative of an object or landmark) can easily be extracted from our matching graph representation.
We evaluate our techniques on the publicly available Oxford buildings dataset (5K images) and show examples of automatically mined objects. The methods are evaluated quantitatively on this dataset using a ground truth labeling for a number of Oxford landmarks. To demonstrate the scalability of the matching graph method, we show qualitative results on two larger datasets of images taken of the Statue of Liberty (37K images) and Rome (1M+ images).
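The matching-graph stage can be illustrated independently of gLDA: verified pairwise image matches define a sparse graph, and rough object groups fall out as connected components. A minimal sketch with hypothetical match data:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical input: verified pairwise matches between images,
# e.g. pairs that survive spatial verification of local features.
n_images = 8
matches = [(0, 1), (1, 2), (3, 4), (5, 6), (6, 7)]

# Build a sparse, symmetric matching graph over the image set.
rows, cols = zip(*(matches + [(j, i) for i, j in matches]))
graph = csr_matrix((np.ones(len(rows)), (rows, cols)),
                   shape=(n_images, n_images))

# Rough object-level groups are connected components; a finer model
# (the gLDA of the abstract) can then refine each group.
n_groups, group_id = connected_components(graph, directed=False)
print(n_groups, group_id)   # 3 groups: {0,1,2}, {3,4}, {5,6,7}
```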
Journal Article
Detect+Track: robust and flexible software tools for improved tracking and behavioural analysis of fish
by Taylor, Graham K.; Newport, Cait; Pérez-Campanero, Natalia
in animal behaviour, animal movement, Computer Science and Artificial Intelligence
2025
We introduce a novel video processing method called Detect+Track that combines a deep learning-based object detector with a template-based, object-agnostic tracker to significantly enhance the accuracy and robustness of animal tracking. Applied to a behavioural experiment involving Picasso triggerfish (Rhinecanthus aculeatus) navigating a randomized array of cylindrical obstacles, the method accurately localizes fish centroids across challenging conditions including occlusion, variable lighting, body deformation and surface ripples. Virtual gates between adjacent obstacles and between obstacles and tank boundaries are computed using Voronoi tessellation and planar homology, enabling detailed analysis of gap selection behaviour. Fish speed, movement direction and a more precise estimate of body centroid (key metrics for behavioural analyses) are estimated using an optical flow method. The modular workflow is adaptable to new experimental designs, supports manual correction and retraining for new object classes, and allows efficient large-scale batch processing. By addressing key limitations of existing tracking tools, Detect+Track provides a flexible and generalizable solution for researchers studying movement and decision-making in complex environments. A detailed tutorial is provided, together with all the data and code required to reproduce our results and enable future innovations in behavioural tracking and analysis.
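The "virtual gates" construction can be sketched with SciPy: a Voronoi tessellation of the obstacle centres makes adjacent obstacles explicit, and each adjacent pair defines a gate. The layout below is hypothetical, and the paper's planar-homology step and tank-boundary gates are omitted.

```python
import numpy as np
from scipy.spatial import Voronoi

# Hypothetical obstacle layout: (x, y) centres of cylinders in the tank.
obstacles = np.array([[1.0, 1.0], [3.0, 1.2], [2.0, 2.5],
                      [1.2, 3.4], [3.1, 3.3]])
vor = Voronoi(obstacles)

# Each Voronoi ridge separates two adjacent obstacles, so each pair in
# `ridge_points` defines one "virtual gate" between those obstacles.
for (i, j) in vor.ridge_points:
    midpoint = 0.5 * (obstacles[i] + obstacles[j])
    width = np.linalg.norm(obstacles[i] - obstacles[j])
    print(f"gate {i}-{j}: midpoint {midpoint}, centre spacing {width:.2f}")
```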
Journal Article