10 result(s) for "human–object interaction classification"
Automated Parts-Based Model for Recognizing Human–Object Interactions from Aerial Imagery with Fully Convolutional Network
Advanced aerial images have led to the development of improved human–object interaction (HOI) recognition methods for use in surveillance, security, and public monitoring systems. Despite the ever-increasing rate of research being conducted in the field of HOI, the existing challenges of occlusion, scale variation, fast motion, and illumination variation continue to attract more researchers. In particular, accurate identification of human body parts, the involved objects, and robust features is the key to effective HOI recognition systems. However, identifying different human body parts and extracting their features is a tedious and rather ineffective task. Based on the assumption that only a few body parts are usually involved in a particular interaction, this article proposes a novel parts-based model for recognizing complex human–object interactions in videos and images captured using ground and aerial cameras. Gamma correction and non-local means denoising techniques have been used for pre-processing the video frames, and Felzenszwalb’s algorithm has been utilized for image segmentation. After segmentation, twelve human body parts have been detected and five of them have been shortlisted based on their involvement in the interactions. Four kinds of features have been extracted and concatenated into a large feature vector, which has been optimized using the t-distributed stochastic neighbor embedding (t-SNE) technique. Finally, the interactions have been classified using a fully convolutional network (FCN). The proposed system has been validated on the ground and aerial videos of the VIRAT Video, YouTube Aerial, and SYSU 3D HOI datasets, achieving average accuracies of 82.55%, 86.63%, and 91.68% on these datasets, respectively.
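As a rough illustration of the gamma-correction pre-processing step named in the abstract above, the following Python sketch applies a power-law mapping to an 8-bit grayscale frame. The function name `gamma_correct`, the list-of-rows frame layout, and the gamma value are assumptions made for illustration; the abstract does not state the authors' implementation or parameters.

```python
def gamma_correct(frame, gamma):
    """Apply power-law (gamma) correction to an 8-bit grayscale frame.

    `frame` is a list of rows of integer intensities in [0, 255]
    (a hypothetical layout chosen to keep the sketch dependency-free).
    Convention used here: out = 255 * (in / 255) ** gamma, so gamma < 1
    brightens dark regions and gamma > 1 darkens them.
    """
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    # Precompute a 256-entry lookup table so each pixel is a single index.
    lut = [round(255 * (v / 255) ** gamma) for v in range(256)]
    return [[lut[p] for p in row] for row in frame]


# Illustrative 2x3 frame; the values are made up.
frame = [[0, 64, 128], [192, 255, 32]]
brightened = gamma_correct(frame, 0.5)
```

Black and white stay fixed under this mapping, while mid-range intensities shift; a real pipeline would typically apply the same lookup table to every video frame before denoising.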
Diagnosing Human-Object Interaction Detectors
We have witnessed significant progress in human-object interaction (HOI) detection. However, relying solely on mAP (mean Average Precision) scores as a summary metric does not provide sufficient insight into the nuances of model performance (e.g., why one model outperforms another), which can hinder further innovation in this field. To address this issue, we introduce a diagnosis toolbox in this paper to offer a detailed quantitative breakdown of HOI detection models, inspired by the success of object detection diagnosis tools. We first conduct a holistic investigation into the HOI detection pipeline. By defining a set of errors and using oracles to fix each one, we quantitatively analyze the significance of different errors based on the mAP improvement gained from fixing them. Next, we explore the two key sub-tasks of HOI detection: human-object pair localization and interaction classification. For the pair localization task, we compute the coverage of ground-truth human-object pairs and assess the noisiness of the localization results. For the classification task, we measure a model’s ability to distinguish between positive and negative detection results and to classify actual interactions when human-object pairs are correctly localized. We analyze eight state-of-the-art HOI detection models, providing valuable diagnostic insights to guide future research. For instance, our diagnosis reveals that the state-of-the-art model RLIPv2 outperforms others primarily due to its significant improvement in multi-label interaction classification accuracy. Our toolbox is applicable across various methods and datasets and is available at https://neu-vi.github.io/Diag-HOI/.
Human-object interaction detection based on cascade multi-scale transformer
Human-object interaction (HOI) detection is an advanced computer vision task for detecting the relationship between humans and surrounding objects. Several methods accomplish this task with impressive results but have certain limitations. We analyze in detail the advantages and disadvantages of different paradigms and propose a cascade multi-scale transformer (CMST). CMST comprises three key components: a shared encoder, a human-object pair decoder, and an interaction decoder. These three components are responsible for extracting contextual features, localizing the human-object pairs, and classifying the specific interactions in pairs, respectively. CMST decouples the tasks of object detection and interaction classification while still maintaining an end-to-end detection pipeline. Furthermore, we aim to address the issues of high computational complexity and slow convergence associated with the transformer architecture. To achieve this, we propose two novel attention mechanisms: multi-scale human-object pair attention and multi-scale interaction attention. By incorporating these attentions, we introduce multi-scale features, making our model well-suited for complex scenes involving instances of varying scales. The effectiveness of our approach is demonstrated on widely used benchmarks, where we achieve improved results. The experimental results demonstrate that CMST has great potential for real-time applications and complex scene detection.
Comprehensive context learning for two-stage human-object interaction detection
Human-object interaction (HOI) detection aims to localize humans and objects and infer their interactions in an image. Specifically, transformer-based two-stage methods exhibit outstanding training and performance advantages. However, these methods often utilize object features lacking fine-grained context for HOI classification, neglect pose and orientation information, and suffer from insufficient relation context for relational reasoning over HOI triplets. In this paper, we propose a two-stage transformer-based model to address these issues. Firstly, we introduce a novel explicit query construction method for the decoder, leveraging spatial and content priors from the object detector along with human pose information to initialize these queries. This enables the decoder to effectively identify the type of interaction. Additionally, we improve the cross-attention in the decoding process to better reintroduce image features and propose parallel branch decoders to separately perform interaction classification and optimize instance detection. Between the two decoders, we employ a novel attentive fusion module to generate and propagate the relation context, assisting the model in relational reasoning. Extensive experiments conducted on two widely used public benchmarks demonstrate the effectiveness of our approach. The results show that our model surpasses other methods and achieves state-of-the-art performance.
Human–object interaction detection based on disentangled axial attention transformer
Human–object interaction (HOI) detection aims to localize and infer interactions between humans and objects in an image. Recent work has proposed transformer encoder–decoder architectures for HOI detection with exceptional performance, but these architectures have certain drawbacks: they do not employ a complete disentanglement strategy to learn more discriminative features for different sub-tasks; they cannot achieve sufficient contextual exchange within each branch, which is crucial for accurate relational reasoning; and their transformer models suffer from high computational costs and large memory usage due to complex attention calculations. In this work, we propose a disentangled transformer network that disentangles both the encoder and decoder into three branches for human detection, object detection, and interaction classification. Then we propose a novel feature unify decoder to associate the predictions of each disentangled decoder, and introduce a multiplex relation embedding module and an attentive fusion module to perform sufficient contextual information exchange among branches. Additionally, to reduce the model’s computational cost, a position-sensitive axial attention is incorporated into the encoder, allowing our model to achieve a better accuracy-complexity trade-off. Extensive experiments are conducted on two public HOI benchmarks to demonstrate the effectiveness of our approach. The results indicate that our model outperforms other methods, achieving state-of-the-art performance.
Human–object interaction recognition based on interactivity detection and multi-feature fusion
Human–object interaction (HOI) recognition is a computer vision task that detects the relationship between humans and surrounding objects. Though recent methods have yielded impressive results, suppressing non-interactive human–object pairs remains challenging and needs to be tackled. In this work, we propose a novel interactivity detection method to calculate an interactivity score for each pair by exploiting human intention. We select human gaze to represent human intention. Pairs with scores below the interactivity threshold we set are considered non-interactive and are filtered out. In addition, to extract more discriminative HOI classification features and boost detection performance, we design human–object pair-level contextual features and three-component human pose features. These two features, together with appearance features and spatial location features, constitute our classification features. A multi-stream classification module is proposed to extract them and conduct HOI classification. The effectiveness of our method is validated on widely used benchmarks, where we achieve decent improvements over state-of-the-art methods.
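The thresholding step described in the abstract above (discarding pairs whose interactivity score falls below a set threshold) can be sketched in a few lines of Python. The `Pair` record, its field names, the score range, and the threshold value are all hypothetical; the abstract does not specify how the gaze-based score is computed or which threshold the authors use.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """A candidate human-object pair (hypothetical structure)."""
    human_id: int
    object_id: int
    interactivity: float  # assumed to be a score in [0, 1]


def filter_interactive(pairs, threshold):
    """Keep only pairs whose score is at or above the threshold.

    Pairs scoring below the threshold are treated as non-interactive
    and dropped before any interaction classification would run.
    """
    return [p for p in pairs if p.interactivity >= threshold]


# Made-up candidates and an assumed threshold of 0.5.
candidates = [
    Pair(human_id=0, object_id=1, interactivity=0.91),
    Pair(human_id=0, object_id=2, interactivity=0.12),
    Pair(human_id=1, object_id=2, interactivity=0.50),
]
kept = filter_interactive(candidates, threshold=0.5)
```

Filtering before classification means the downstream classifier sees fewer candidate pairs, which is the stated motivation for suppressing non-interactive pairs.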
Integrating Pose Features and Cross-Relationship Learning for Human–Object Interaction Detection
Background: The main challenge in human–object interaction (HOI) detection is how to accurately reason about ambiguous, complex, and difficult-to-recognize interactions. The model structure of the existing methods is relatively simple, and the image input may be occluded, preventing accurate recognition. Methods: In this paper, we design a Pose-Aware Interaction Network (PAIN) based on transformer architecture and human posture to address these issues through two innovations. First, a new feature fusion method is proposed, which fuses human pose features and image features early, before the encoder, to improve the feature expression ability; the individual motion-related features are additionally strengthened by adding them to the human branch. Second, the Cross-Attention Relationship fusion Module (CARM) better fuses the three-branch output and captures the detailed relationship information of HOI. Results: The proposed method achieves 64.51% AP_role#1 and 66.42% AP_role#2 on the public dataset V-COCO and 30.83% AP on HICO-DET, recognizing HOI instances more accurately.
Human Pose Estimation and Object Interaction for Sports Behaviour
In the new era of technology, daily human activities are becoming more challenging in terms of monitoring complex scenes and backgrounds. To understand the scenes and activities from human life logs, human-object interaction (HOI) is important in terms of visual relationship detection and human pose estimation. Activity understanding and interaction recognition between humans and objects, along with pose estimation and interaction modeling, are explained. Some existing algorithms and feature extraction procedures are complicated by challenges including the accurate detection of rare human postures, occluded regions, and the unsatisfactory detection of objects, especially small-sized objects. The existing HOI detection techniques are instance-centric (object-based), where interaction is predicted between all the pairs. Such estimation depends on appearance features and spatial information. Therefore, we propose a novel approach to demonstrate that the appearance features alone are not sufficient to predict the HOI. Furthermore, we detect the human body parts by using the Gaussian Mixture Model (GMM), followed by object detection using YOLO. We predict the interaction points, which directly classify the interaction, and pair them with densely predicted HOI vectors by using the interaction algorithm. The interactions are linked with the human and object to predict the actions. The experiments have been performed on two benchmark HOI datasets demonstrating the proposed approach.
Egocentric visual scene description based on human-object interaction and deep spatial relations among objects
Visual scene interpretation has been one of the major areas of research in the recent past. Recognition of human-object interaction is a fundamental step towards understanding visual scenes. Videos can be described via a variety of human-object interaction scenarios, such as when both human and object are static (static-static), one is static while the other is dynamic (static-dynamic), and both are dynamic (dynamic-dynamic). This paper presents a unified framework for the explanation of these interactions between humans and a variety of objects using deep learning as a pivot methodology. Human-object interaction is extracted through native machine learning techniques, while spatial relations are captured by training a model through a convolutional neural network. We also address the recognition of human posture in detail to provide egocentric visual description. After extracting visual features, sequential minimal optimization is employed for training our model. Extracted interaction, spatial relations, and posture information are fed into a natural language generation module along with the interacting object label to generate scene understanding. The proposed framework is evaluated on two state-of-the-art datasets, MSCOCO and the MSR3D Daily Activity dataset, where the achieved results are 78% and 91.16% accurate, respectively.