Catalogue Search | MBRL
Explore the vast range of titles available.
12,467 result(s) for "video recognition"
Background modeling and foreground detection for video surveillance
Background modeling and foreground detection are important steps in video processing used to robustly detect moving objects in challenging environments. This requires effective methods for dealing with dynamic backgrounds and illumination changes as well as algorithms that must meet real-time and low memory requirements. Incorporating both established and new ideas, Background Modeling and Foreground Detection for Video Surveillance provides a complete overview of the concepts, algorithms, and applications related to background modeling and foreground detection.
Efficient Transformer-Based Compressed Video Modeling via Informative Patch Selection
2022
Recently, Transformer-based video recognition models have achieved state-of-the-art results on major video recognition benchmarks. However, their high inference cost significantly limits research speed and practical use. In video compression, methods considering small motions and residuals that are less informative and assigning short code lengths to them (e.g., MPEG4) have successfully reduced the redundancy of videos. Inspired by this idea, we propose Informative Patch Selection (IPS), which efficiently reduces the inference cost by excluding redundant patches from the input of the Transformer-based video model. The redundancy of each patch is calculated from motions and residuals obtained while decoding a compressed video. The proposed method is simple and effective in that it can dynamically reduce the inference cost depending on the input without any policy model or additional loss term. Extensive experiments on action recognition demonstrated that our method could significantly improve the trade-off between the accuracy and inference cost of the Transformer-based video model. Although the method does not require any policy model or additional loss term, its performance approaches that of existing methods that do require them.
Journal Article
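The patch-selection idea described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring rule (motion magnitude plus residual magnitude), the `threshold` parameter, and the function name `select_informative_patches` are all assumptions made for illustration.

```python
def select_informative_patches(motion_mag, residual_mag, threshold=0.5):
    """Return the indices of patches worth passing to the video model.

    motion_mag and residual_mag are per-patch magnitudes (hypothetical inputs
    that would come from decoding a compressed video). The score and the
    threshold are illustrative assumptions, not values from the paper.
    """
    # Score each patch by how much motion and residual energy it carries.
    scores = [m + r for m, r in zip(motion_mag, residual_mag)]
    # Patches at or below the threshold are treated as redundant and dropped,
    # so the number of kept patches varies with the input.
    return [i for i, s in enumerate(scores) if s > threshold]


# Four patches: only patches 1 and 3 carry noticeable motion or residual.
kept = select_informative_patches([0.0, 2.0, 0.1, 0.0], [0.0, 0.5, 0.0, 1.0])
print(kept)  # [1, 3]
```

Because the cut is a threshold rather than a fixed count, a nearly static clip keeps few patches and a busy clip keeps many, which mirrors the input-dependent cost reduction the abstract describes.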
Your face belongs to us : a secretive startup's quest to end privacy as we know it
by Hill, Kashmir, author
in Clearview AI (Software company) History; Human face recognition (Computer science) Social aspects; Data privacy.
2023
"In this riveting feat of reporting, Kashmir Hill illuminates the improbable rise of Clearview AI and how Hoan Ton-That, a computer engineer and Richard Schwartz, a Giuliani associate, launched a terrifying facial recognition app with society-altering potential. They were assisted by a cast of controversial characters, including conservative provocateur Charles Johnson and billionaire Trump backer Peter Thiel. The app can scan a blurry portrait, and, in just seconds, collect every instance of a person's online life. It can find your name, your social media profiles, your friends and family, even your home address (as well as photos of you that you may not even have known existed). The story of Clearview AI opens up a window into a larger, more urgent one about our tortured relationship to technology, the way it entertains and seduces us even as it steals our privacy and lays us bare to bad actors in politics, criminal justice, and tech. This technology has been quietly growing more powerful for decades. Ubiquitous in China and Russia, it was also developed by American companies, including Google and Facebook, who decided it was too radical to release. That did not stop Clearview. They gave demos of the tech to interested private investors and contracted it out to hundreds of law enforcement agencies around the country. American law enforcement, including the Department of Homeland Security, has already used it to arrest people for everything from petty theft to assault. Without regulation it could expand the reach of policing-as it has in China and Russia-to a terrifying, dystopian level"-- Provided by publisher.
PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition
2024
In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks. However, their inherently dense computational operators, such as self-attention and token-mixing layers, pose significant challenges when applied to spatio-temporal video data. To address this gap, we propose PosMLP-Video, a lightweight yet powerful MLP-like backbone for video recognition. Instead of dense operators, we use efficient relative positional encoding to build pairwise token relations, leveraging small-sized parameterized relative position biases to obtain each relation score. Specifically, to enable spatio-temporal modeling, we extend the image PosMLP’s positional gating unit to temporal, spatial, and spatio-temporal variants, namely PoTGU, PoSGU, and PoSTGU, respectively. These gating units can be feasibly combined into three types of spatio-temporal factorized positional MLP blocks, which not only decrease model complexity but also maintain good performance. Additionally, we enrich relative positional relationships by using channel grouping. Experimental results on three video-related tasks demonstrate that PosMLP-Video achieves competitive speed-accuracy trade-offs compared to the previous state-of-the-art models. In particular, PosMLP-Video pre-trained on ImageNet1K achieves 59.0%/70.3% top-1 accuracy on Something-Something V1/V2 and 82.1% top-1 accuracy on Kinetics-400 while requiring much fewer parameters and FLOPs than other models. The code is released at https://github.com/zhouds1918/PosMLP_Video.
Journal Article
A vision-based deep learning approach for independent-users Arabic sign language interpretation
by Hassan, Muhammed; Salama, Mohamed; Emad, Eslam
in Artificial neural networks; Communication; Computer Communication Networks
2023
More than 5% of people around the world are deaf and have severe difficulties communicating with normal people, according to the World Health Organization (WHO). They face a real challenge in expressing anything without an interpreter for their signs. Nowadays, there are many studies on Sign Language Recognition (SLR) that aim to reduce this gap between deaf and normal people, as it can replace the need for an interpreter. However, sign recognition systems face many challenges, such as low accuracy, complicated gestures, high-level noise, and the ability to operate under variant circumstances with the ability to generalize or to be locked to such limitations. Hence, many researchers have proposed different solutions to overcome these problems. Each language has its own signs, and it can be very challenging to cover all the languages’ signs. The objectives of the current study are (i) presenting a dataset of 20 Arabic words, and (ii) proposing a deep learning (DL) architecture that combines a convolutional neural network (CNN) and a recurrent neural network (RNN). The suggested architecture reported 98% accuracy on the presented dataset. It also reported 93.4% and 98.8% for the top-1 and top-5 accuracies on the UCF-101 dataset.
Journal Article
VLG: General Video Recognition with Web Textual Knowledge
2024
Video recognition (action recognition) in an open world is quite challenging, as we need to handle different settings such as closed-set, long-tail, few-shot, and open-set. The majority of existing works often address each individual setting separately using various frameworks. However, these separate investigations would ignore the possibility of knowledge sharing across different settings, and stymie progress in video recognition as well as its application in the real world. By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we focus on the general video recognition (GVR) task of solving recognition problems of different settings within a unified framework. The core contribution of this paper is twofold. First, we build a comprehensive video recognition benchmark to facilitate the research of GVR, called Kinetics-Text. This dataset covers the mentioned four common settings, and provides multi-source text descriptions for all action classes for utilizing external textual knowledge from the Internet. Second, inspired by the flexibility of language representation, we analyse the correspondence between the video and text descriptions of its category and present a unified visual-linguistic framework (VLG) to solve the problem of GVR with an effective two-stage training paradigm. Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then devises a flexible bi-modal attention head to collaborate high-level semantic concepts under different settings. Extensive results show that our VLG obtains the state-of-the-art performance under four settings, and the superior performance demonstrates the effectiveness and generalization ability of our proposed framework. We hope our work makes a step towards the general video recognition and could serve as a baseline for future research. Code and datasets have been released in https://github.com/MCG-NJU/VLG.
Journal Article
Deep Insights into Convolutional Networks for Video Recognition
by Feichtenhofer, Christoph; Wildes, Richard P; Pinz, Axel
in Computer vision; Human motion; Object motion
2020
As the success of deep models has led to their deployment in all areas of computer vision, it is increasingly important to understand how these representations work and what they are capturing. In this paper, we shed light on deep spatiotemporal representations by visualizing the internal representation of models that have been trained to recognize actions in video. We visualize multiple two-stream architectures to show that local detectors for appearance and motion objects arise to form distributed representations for recognizing human actions. Key observations include the following. First, cross-stream fusion enables the learning of true spatiotemporal features rather than simply separate appearance and motion features. Second, the networks can learn local representations that are highly class specific, but also generic representations that can serve a range of classes. Third, throughout the hierarchy of the network, features become more abstract and show increasing invariance to aspects of the data that are unimportant to desired distinctions (e.g. motion patterns across various speeds). Fourth, visualizations can be used not only to shed light on learned representations, but also to reveal idiosyncrasies of training data and to explain failure cases of the system.
Journal Article
Vision transformer-powered conversational agent for real-time Indian Sign Language e-governance accessibility
2025
This paper presents a vision transformer-based, context-aware, real-time Indian Sign Language (ISL) conversational agent designed to enhance digital accessibility for India’s Deaf and Hard-of-Hearing community within e-governance services. The system supports continuous sign language recognition, from isolated words to full ISL sentences, and delivers information and services directly in sign language. A domain-specific ISL video dataset, incorporating diverse signing styles and environments, addresses ISL’s low-resource challenges, enabling robust, scalable real-world deployment. The hybrid architecture, combining convolutional neural networks with vision transformer models, effectively handles ISL’s spatial–temporal complexities, maintaining reliability even with complex queries. A dynamic context-response mapping engine uses contextual data to increase accuracy, particularly for ambiguous inputs. The modular design ensures efficient scalability, facilitating seamless integration of new services. System evaluations, including stress testing and usability studies, confirmed its effectiveness in enabling real-time, inclusive digital interactions. Under optimized conditions, the system achieved 97.5% accuracy, a mean response time of 6.53 s, and an average system usability score of 71.83, significantly advancing digital inclusion in India.
Journal Article
Campus Violence Detection Based on Artificial Intelligent Interpretation of Surveillance Video Sequences
by Ferdinando, Hany; Seppänen, Tapio; Ye, Liang
in acoustics; algorithms; artificial intelligence
2021
Campus violence is a common social phenomenon all over the world, and is the most harmful type of school bullying events. As artificial intelligence and remote sensing techniques develop, there are several possible methods to detect campus violence, e.g., movement sensor-based methods and video sequence-based methods. Sensors and surveillance cameras are used to detect campus violence. In this paper, the authors use image features and acoustic features for campus violence detection. Campus violence data are gathered by role-playing, and 4096-dimension feature vectors are extracted from every 16 frames of video images. The C3D (Convolutional 3D) neural network is used for feature extraction and classification, and an average recognition accuracy of 92.00% is achieved. Mel-frequency cepstral coefficients (MFCCs) are extracted as acoustic features, and three speech emotion databases are involved. The C3D neural network is used for classification, and the average recognition accuracies are 88.33%, 95.00%, and 91.67%, respectively. To solve the problem of evidence conflict, the authors propose an improved Dempster–Shafer (D–S) algorithm. Compared with existing D–S theory, the improved algorithm increases the recognition accuracy by 10.79%, and the recognition accuracy can ultimately reach 97.00%.
Journal Article
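The abstract above fuses video and audio evidence with an improved Dempster–Shafer algorithm whose details are not given in this record. As background only, here is a minimal sketch of the classical Dempster rule of combination that such methods build on. The hypothesis names and the mass values are hypothetical, and this is not the authors' improved algorithm.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with the classical Dempster rule.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1],
    with the masses summing to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing evidence supports the intersection of the two sets.
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Disjoint sets contribute to the conflict mass.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    # Renormalise by the non-conflicting mass.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}


V = frozenset({"violence"})
N = frozenset({"normal"})
U = V | N  # "either": mass left unassigned by a source

video = {V: 0.8, N: 0.1, U: 0.1}  # illustrative values only
audio = {V: 0.6, N: 0.3, U: 0.1}  # illustrative values only
fused = dempster_combine(video, audio)
print(round(fused[V], 3))  # 0.886
```

With these illustrative masses the conflict is 0.30, and the fused belief in "violence" is 0.62 / 0.70. When the conflict approaches 1 the classical rule becomes unstable, which is the kind of evidence-conflict problem the abstract says the improved algorithm targets.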
Measurement of level of consciousness by AVPU scale assessment system based on automated video and speech recognition technology
2023
To develop an alert/verbal/painful/unresponsive (AVPU) scale assessment system based on automated video and speech recognition technology (AVPU-AVSR) that can automatically assess a patient's level of consciousness and evaluate its performance through clinical simulation.
We developed an AVPU-AVSR system with a whole-body camera, face camera, and microphone. The AVPU-AVSR system automatically extracted essential audiovisual features to assess the AVPU score from the recorded video files. Arm movement, pain stimulus, and eyes-open state were extracted using a rule-based approach using landmarks estimated from pre-trained pose and face estimation models. Verbal stimuli were extracted using a pre-trained speech-recognition model. Simulations of a physician examining the consciousness of 12 simulated patients for 16 simulation scenarios (4 for each of “Alert”, “Verbal”, “Painful”, and “Unresponsive”) were conducted under the AVPU-AVSR system. The accuracy, sensitivity, and specificity of the AVPU-AVSR system were assessed.
A total of 192 cases with 12 simulated patients were assessed using the AVPU-AVSR system with a multi-class accuracy of 0.95 (95% confidence interval [CI], 0.92–0.98). The sensitivity and specificity (95% CIs) for detecting impaired consciousness were 1.00 (0.97–1.00) and 0.88 (0.75–0.95), respectively. The sensitivity and specificity of each extracted feature ranged from 0.88 to 1.00 and 0.98 to 1.00.
The AVPU-AVSR system showed good accuracy in assessing consciousness levels in a clinical simulation and has the potential to be implemented in clinical practice to automatically assess mental status.
Journal Article
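For readers unfamiliar with the metrics reported in the abstract above, sensitivity and specificity can be computed from a binary confusion matrix as sketched below. The counts are illustrative and are not taken from the study; the function name is likewise hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from binary confusion counts.

    tp/fn: impaired cases that were detected / missed.
    tn/fp: non-impaired cases that were correctly cleared / wrongly flagged.
    """
    sensitivity = tp / (tp + fn)  # share of impaired cases detected
    specificity = tn / (tn + fp)  # share of non-impaired cases cleared
    return sensitivity, specificity


# Illustrative counts only (not from the paper).
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=40, fp=10)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# sensitivity=0.90, specificity=0.80
```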