Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
4,619 result(s) for "pose estimation"
A systematic survey on human pose estimation: upstream and downstream tasks, approaches, lightweight models, and prospects
2025
In recent years, human pose estimation has been widely studied as a branch task of computer vision, and it plays an important role in medicine, fitness, virtual reality, and other fields. Early human pose estimation relied on traditional hand-crafted modeling methods; more recently, the field has developed rapidly with deep learning. This study not only reviews the foundational research on human pose estimation but also summarizes the latest cutting-edge techniques. Beyond systematically summarizing the core technology, this article covers the upstream and downstream tasks of human pose estimation, showing more intuitively where the technology sits in the broader pipeline. In particular, in view of the computational resource constraints and model performance challenges the field faces, lightweight human pose estimation models and transformer-based human pose estimation models are summarized in this paper. In general, this article classifies human pose estimation techniques by type of method, 2D or 3D output representation, number of people, number of views, and use of temporal information. Classic datasets and specialized datasets are covered, along with the metrics applied to them. Finally, we summarize the current challenges and possible future developments of human pose estimation.
Journal Article
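A note on representation: many of the 2D methods this survey covers output per-joint heatmaps and decode keypoints from their peaks. Below is a minimal NumPy sketch of that decoding step; the shapes and names are illustrative, not drawn from any specific paper.

```python
import numpy as np

def decode_keypoints(heatmaps: np.ndarray) -> np.ndarray:
    """Decode (num_joints, H, W) heatmaps into (num_joints, 3) rows of (x, y, confidence)."""
    num_joints, h, w = heatmaps.shape
    keypoints = np.zeros((num_joints, 3))
    for j in range(num_joints):
        # Take the argmax of each joint's heatmap as its 2D location.
        y, x = np.unravel_index(np.argmax(heatmaps[j]), (h, w))
        keypoints[j] = (x, y, heatmaps[j, y, x])
    return keypoints

# Toy example: plant a peak for joint 0 and recover it.
rng = np.random.default_rng(0)
maps = rng.random((17, 64, 48)) * 0.1
maps[0, 20, 10] = 1.0
print(decode_keypoints(maps)[0])  # -> [10. 20.  1.]
```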
Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review
2021
This paper presents a comprehensive survey of vision-based robotic grasping. We identify three key tasks in vision-based robotic grasping: object localization, object pose estimation, and grasp estimation. In detail, object localization covers localization without classification, object detection, and object instance segmentation; this task provides the regions of the target object in the input data. Object pose estimation mainly refers to estimating the 6D object pose and includes correspondence-based, template-based, and voting-based methods, which enables the generation of grasp poses for known objects. Grasp estimation includes 2D planar grasp methods and 6DoF grasp methods, where the former is constrained to grasping from one direction. Different combinations of these three tasks can accomplish robotic grasping: many object pose estimation methods skip a separate localization step and perform localization and pose estimation jointly, and many grasp estimation methods skip both localization and pose estimation, predicting grasps in an end-to-end manner. Both traditional methods and the latest deep learning-based methods operating on RGB-D image inputs are reviewed in detail in this survey. Related datasets and comparisons between state-of-the-art methods are summarized as well. In addition, challenges in vision-based robotic grasping and future directions for addressing them are pointed out.
Journal Article
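To make the three-task decomposition described in the abstract above concrete, here is a schematic sketch of the pipeline's interfaces; the dataclasses and function names are hypothetical scaffolding, not an API from the survey.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Detection:
    bbox: tuple      # (x, y, w, h) region of the target object
    class_id: int

@dataclass
class ObjectPose:
    rotation: np.ndarray     # (3, 3) rotation matrix
    translation: np.ndarray  # (3,) translation in metres

@dataclass
class Grasp:
    pose: ObjectPose  # 6-DoF gripper pose
    width: float      # parallel-gripper opening

def localize(rgbd: np.ndarray) -> Detection: ...
def estimate_pose(rgbd: np.ndarray, det: Detection) -> ObjectPose: ...
def estimate_grasp(rgbd: np.ndarray, pose: ObjectPose) -> Grasp: ...

# As the survey notes, some methods fuse stages: end-to-end grasp
# estimators map the RGB-D input directly to a Grasp, skipping the
# explicit localization and pose estimation steps.
```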
DeepIM: Deep Iterative Matching for 6D Pose Estimation
2020
Estimating the 6D poses of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression from images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimate, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
Journal Article
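The render-and-compare loop DeepIM describes can be sketched in a few lines; `render`, `predict_delta`, and `apply_delta` below are hypothetical stand-ins for the renderer, the matching network, and the disentangled pose update, not the paper's actual code.

```python
def refine_pose(observed_img, initial_pose, render, predict_delta, apply_delta,
                num_iters: int = 4):
    """Iteratively refine a 6D pose estimate against the observed image."""
    pose = initial_pose
    for _ in range(num_iters):
        rendered = render(pose)                        # object rendered at the current pose
        delta = predict_delta(rendered, observed_img)  # relative pose transformation
        pose = apply_delta(pose, delta)                # disentangled rotation/translation update
    return pose
```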
A comprehensive survey on human pose estimation approaches
2023
Human pose estimation is a significant problem that has received attention in the computer vision community for decades. It is a vital step toward understanding people in videos and still images. In simple terms, a human pose estimation model takes an image or video and estimates the positions of a person's skeletal joints in either 2D or 3D space. Several studies of human pose estimation can be found in the literature; however, each centers on a specific class of approach, for instance model-based methodologies or human motion analysis. Later, various Deep Learning (DL) algorithms emerged to overcome the difficulties of earlier approaches. In this study, an exhaustive review of human pose estimation (HPE) is carried out, covering both milestone work and recent advances. This survey discusses two-dimensional (2D) and three-dimensional (3D) human pose estimation techniques, along with the classical and deep learning approaches that address the associated computer vision problems. Moreover, the paper considers the different deep learning models used in pose estimation and analyzes 2D and 3D datasets. Evaluation metrics used for estimating human poses are also discussed. By recovering the orientation of individuals, HPE opens the road to several real-life applications, some of which are discussed in this study.
Journal Article
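As a concrete illustration of the 2D versus 3D output representations this survey contrasts, consider the toy example below; the joint names follow a common COCO-style convention and the coordinates are purely illustrative.

```python
import numpy as np

JOINTS = ["nose", "l_shoulder", "r_shoulder", "l_hip", "r_hip"]

# 2D pose: per-joint (x, y) pixel coordinates in the image plane.
pose_2d = np.array([[320, 110], [290, 180], [350, 180],
                    [300, 330], [340, 330]], dtype=float)

# 3D pose: per-joint (x, y, z) camera-space coordinates in metres.
pose_3d = np.array([[0.00, -0.60, 2.1], [-0.18, -0.35, 2.1],
                    [0.18, -0.35, 2.1], [-0.12, 0.10, 2.1],
                    [0.12, 0.10, 2.1]])

for name, p2, p3 in zip(JOINTS, pose_2d, pose_3d):
    print(f"{name:<10} 2D={p2} 3D={p3}")
```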
VTP: volumetric transformer for multi-view multi-person 3D pose estimation
2023
This paper presents the Volumetric Transformer Pose Estimator (VTP), the first 3D volumetric transformer framework for multi-view multi-person 3D human pose estimation. VTP aggregates features from 2D keypoints across all camera views and directly learns the spatial relationships in the 3D voxel space in an end-to-end fashion. The aggregated 3D features are passed through 3D convolutions before being flattened into sequential embeddings and fed into a transformer. A residual structure is designed to further improve performance. In addition, sparse Sinkhorn attention is employed to reduce the memory cost, a major bottleneck for volumetric representations, while still achieving excellent performance. The output of the transformer is again concatenated with the 3D convolutional features through a residual design. The proposed VTP framework integrates the high performance of transformers with volumetric representations and can serve as a good alternative to convolutional backbones. Experiments on the Shelf, Campus, and CMU Panoptic benchmarks show promising results in terms of both Mean Per Joint Position Error (MPJPE) and Percentage of Correctly estimated Parts (PCP). Our code will be made available.
Journal Article
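A compressed sketch of the data flow the VTP abstract describes: a 3D feature volume is convolved, flattened into a token sequence for a transformer, and merged back through a residual connection. The dimensions are arbitrary, and standard dense attention stands in for the paper's sparse Sinkhorn attention.

```python
import torch
import torch.nn as nn

C, D = 32, 16  # feature channels, voxel grid side length (arbitrary)

conv3d = nn.Conv3d(C, C, kernel_size=3, padding=1)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=C, nhead=4, batch_first=True),
    num_layers=2)

volume = torch.randn(1, C, D, D, D)        # aggregated multi-view 3D features
feats = conv3d(volume)                     # local 3D context via convolution
tokens = feats.flatten(2).transpose(1, 2)  # flatten voxels into a (1, D^3, C) sequence
out = encoder(tokens)                      # global voxel-to-voxel attention
fused = feats + out.transpose(1, 2).reshape(1, C, D, D, D)  # residual merge
print(fused.shape)  # torch.Size([1, 32, 16, 16, 16])
```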
AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild
2021
Occlusion is probably the biggest challenge for human pose estimation in the wild. Typical solutions often rely on intrusive sensors such as IMUs to detect occluded joints. To make the task truly unconstrained, we present AdaFuse, an adaptive multiview fusion method that enhances the features in occluded views by leveraging those in visible views. The core of AdaFuse is determining the point-point correspondence between two views, which we solve effectively by exploiting the sparsity of the heatmap representation. We also learn an adaptive fusion weight for each camera view to reflect its feature quality, reducing the chance that good features are undesirably corrupted by "bad" views. The fusion model is trained end-to-end with the pose estimation network and can be applied directly to new camera configurations without additional adaptation. We extensively evaluate the approach on three public datasets: Human3.6M, Total Capture, and CMU Panoptic. It outperforms the state of the art on all of them. We also create a large-scale synthetic dataset, Occlusion-Person, which allows numerical evaluation on occluded joints, as it provides occlusion labels for every joint in the images. The dataset and code are released at https://github.com/zhezh/adafuse-3d-human-pose.
Journal Article
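A simplified sketch of the weighted-fusion idea in the abstract above: per-view heatmaps are warped into a reference view and blended with per-view quality weights so that occluded views contribute less. `warp_to_reference` is a hypothetical placeholder for the point-point correspondence step the paper solves via heatmap sparsity.

```python
import numpy as np

def fuse_heatmaps(heatmaps, weights, warp_to_reference):
    """Blend per-view heatmaps with per-view quality weights.

    heatmaps: list of (J, H, W) arrays, one per camera view.
    weights:  per-view scalars reflecting feature quality.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the fusion is a convex blend
    return sum(w * warp_to_reference(h, view=i)
               for i, (w, h) in enumerate(zip(weights, heatmaps)))

# Example with an identity warp (views assumed pre-aligned):
views = [np.random.rand(17, 64, 64) for _ in range(3)]
fused = fuse_heatmaps(views, weights=[0.9, 0.2, 0.6],
                      warp_to_reference=lambda h, view: h)
print(fused.shape)  # (17, 64, 64)
```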
Human Pose Estimation Using Deep Learning: A Systematic Literature Review
by Al Ghamdi, Mohammed A.; Alghamdi, Manal; Samkari, Esraa
in 2D person pose estimation; Accuracy; Analysis
2023
Human Pose Estimation (HPE) is the task of predicting the locations of human joints in images and videos. It is used in many applications, such as sports analysis and surveillance systems. Recently, several studies have embraced deep learning to improve the performance of HPE. However, building an efficient HPE model is difficult; many challenges, such as crowded scenes and occlusion, must be handled. This paper followed a systematic procedure to comprehensively review different HPE models. About 100 articles on deep learning-based HPE published since 2014 were selected using several selection criteria. Methods for both image and video data were investigated, and both single-person and multi-person HPE methods were reviewed. In addition, the available datasets, the different loss functions used in HPE, and pretrained feature extraction models were all covered. Our analysis revealed that Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are the most widely used in HPE. Moreover, occlusion and crowded scenes remain the main problems affecting model performance. The paper therefore presents various solutions to address these issues. Finally, it highlights potential opportunities for future work on this task.
Journal Article
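Among the loss functions such reviews survey, the most common for heatmap-based CNN pose estimators is a per-joint mean squared error against Gaussian target heatmaps. A minimal PyTorch version, with illustrative shapes:

```python
import torch

def heatmap_mse_loss(pred: torch.Tensor, target: torch.Tensor,
                     visible: torch.Tensor) -> torch.Tensor:
    """pred/target: (B, J, H, W) heatmaps; visible: (B, J) mask zeroing occluded joints."""
    per_joint = ((pred - target) ** 2).mean(dim=(2, 3))       # (B, J) per-joint MSE
    return (per_joint * visible).sum() / visible.sum().clamp(min=1)

pred = torch.rand(2, 17, 64, 48)
target = torch.rand(2, 17, 64, 48)
visible = torch.ones(2, 17)
print(heatmap_mse_loss(pred, target, visible))
```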
Sign2Pose: A Pose-Based Approach for Gloss Prediction Using a Transformer Model
2023
Word-level sign language recognition (WSLR) is the backbone of continuous sign language recognition (CSLR), which infers glosses from sign videos. Finding the relevant gloss in a sign sequence and detecting explicit gloss boundaries in sign videos is a persistent challenge. In this paper, we propose a systematic approach to gloss prediction in WSLR using the Sign2Pose gloss prediction transformer model. The primary goal of this work is to enhance WSLR's gloss prediction accuracy with reduced time and computational overhead. The proposed approach uses hand-crafted features rather than automated feature extraction, which is computationally expensive and less accurate. A modified key frame extraction technique is proposed that uses histogram difference and Euclidean distance metrics to select informative frames and drop redundant ones. To enhance the model's generalization ability, pose vector augmentation using perspective transformation along with joint angle rotation is performed. Further, for normalization, we employ YOLOv3 (You Only Look Once) to detect the signing space and track the hand gestures of the signers across frames. In experiments on the WLASL datasets, the proposed model achieved top-1 recognition accuracies of 80.9% on WLASL100 and 64.21% on WLASL300, surpassing state-of-the-art approaches. The integration of key frame extraction, augmentation, and pose estimation improved the gloss prediction model by increasing its precision in locating minor variations in body posture. We observed that introducing YOLOv3 improved gloss prediction accuracy and helped prevent model overfitting. Overall, the proposed model showed a 17% performance improvement on the WLASL100 dataset.
Journal Article
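A rough sketch of key frame selection by histogram difference, in the spirit of the extraction step described above; the paper's exact metrics and threshold differ, and the 64-bin histogram and 0.15 threshold here are assumptions for illustration.

```python
import numpy as np

def frame_histogram(frame: np.ndarray) -> np.ndarray:
    """L2-normalized 64-bin grayscale intensity histogram."""
    hist, _ = np.histogram(frame, bins=64, range=(0, 256))
    hist = hist.astype(float)
    return hist / (np.linalg.norm(hist) + 1e-8)

def select_keyframes(frames, threshold: float = 0.15):
    """Keep a frame only if its histogram is far enough (Euclidean) from the last kept one."""
    kept, last_hist = [], None
    for i, frame in enumerate(frames):
        hist = frame_histogram(frame)
        if last_hist is None or np.linalg.norm(hist - last_hist) > threshold:
            kept.append(i)
            last_hist = hist
    return kept

# Frames 1 and 3 are near-duplicates of their predecessors and get dropped.
frames = [np.full((120, 160), v, dtype=np.uint8) for v in (10, 11, 200, 201)]
print(select_keyframes(frames))  # -> [0, 2]
```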
Enhancing 3D hand pose estimation using SHaF: synthetic hand dataset including a forearm
2024
Currently, there is an increased need for training images in 3D hand pose estimation and a heavy reliance on computationally intensive 3D mesh annotations for 3D coordinate estimation. To address this, this study introduces a new hand image dataset called the Synthetic Hand Dataset Including a Forearm (SHaF) and an efficient transformer-based three-dimensional (3D) hand pose estimation model tailored to extract hand postures from hand images. The proposed dataset comprises diverse synthetic hand posture images across various cameras and environmental settings, generated using the Unity 3D hand model. It differs from existing artificial hand datasets in that its synthetic images include the forearm. Given that real-world hand images often capture both the hand and the forearm, our dataset bolsters the accuracy of hand pose estimation in practical scenarios. The proposed model uses a pose graph module (PGM) and an auxiliary pose estimation module (APEM), thereby offering efficient 3D hand pose estimation without requiring 3D mesh information. Through comparative experiments with established datasets and models in hand pose estimation, as well as various ablation studies, we confirmed the efficacy of our dataset and the superior performance of the estimation model over other methods.
Journal Article
PosturePose: Optimized Posture Analysis for Semi-Supervised Monocular 3D Human Pose Estimation
2023
One motivation for studying semi-supervised techniques for human pose estimation is to compensate for the lack of variety in curated 3D human pose datasets by combining labeled 3D pose data with readily available unlabeled video data—effectively, leveraging the annotations of the former and the rich variety of the latter to train more robust pose estimators. In this paper, we propose a novel, fully differentiable posture consistency loss that is unaffected by camera orientation and improves monocular human pose estimators trained with limited labeled 3D pose data. Our semi-supervised monocular 3D pose framework combines biomechanical pose regularization with a multi-view posture (and pose) consistency objective function. We show that posture optimization was effective at decreasing pose estimation errors when applied to a 2D–3D lifting network (VPose3D) and two well-studied datasets (H36M and 3DHP). Specifically, the proposed semi-supervised framework with multi-view posture and pose loss lowered the mean per-joint position error (MPJPE) of leading semi-supervised methods by up to 15% (−7.6 mm) when camera parameters of unlabeled poses were provided. Without camera parameters, our semi-supervised framework with posture loss improved semi-supervised state-of-the-art methods by 17% (−15.6 mm decrease in MPJPE). Overall, our pose models compete favorably with other high-performing pose models trained under similar conditions with limited labeled data.
Journal Article
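MPJPE, the metric quoted in the abstract above, is simply the mean Euclidean distance between predicted and ground-truth joint positions. A reference implementation in NumPy, with illustrative shapes:

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error for (frames, joints, 3) arrays, e.g. in millimetres."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

pred = np.zeros((1, 17, 3))
gt = np.full((1, 17, 3), [3.0, 4.0, 0.0])  # every joint offset by a (3, 4, 0) vector
print(mpjpe(pred, gt))  # 5.0
```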