Catalogue Search | MBRL

by Redmon, Joseph in Accuracy , Algorithms , Approximation

2021

The last half-decade ushered in a new era of vision research. Computer vision now works on real images, in natural environments, solving hard problems. But the technology is far from ubiquitous and many researchers are most concerned with getting the best performance on a handful of datasets. This hyperfocus on accuracy has largely turned vision into a numbers game and research tends toward complex, finely-tuned systems that are brittle and impractical in the real world. I focus on aspects of research that are often neglected in vision: speed, scalability, usability. I design new vision systems and algorithms from the ground up with the goal of making them useful in the real world. This involves high-level algorithm improvements, mid-level architectural design, and low-level optimization and approximation. It also involves educating the next generation of vision experts and giving them the tools to solve the problems they care about.

Dissertation

Share this book

Add to My Shelf

YOLOv3: An Incremental Improvement

by Farhadi, Ali , Redmon, Joseph

2018

We present some updates to YOLO! We made a bunch of little design changes to make it better. We also trained this new network that's pretty swell. It's a little bigger than last time but more accurate. It's still fast though, don't worry. At 320x320 YOLOv3 runs in 22 ms at 28.2 mAP, as accurate as SSD but three times faster. When we look at the old .5 IOU mAP detection metric YOLOv3 is quite good. It achieves 57.9 mAP@50 in 51 ms on a Titan X, compared to 57.5 mAP@50 in 198 ms by RetinaNet, similar performance but 3.8x faster. As always, all the code is online at https://pjreddie.com/yolo/

Paper

Share this book

Add to My Shelf

High-Resolution Live Fuel Moisture Content (LFMC) Maps for Wildfire Risk from Multimodal Earth Observation Data

by Johnson, Patrick Alan , Zhang, Yawen , Sjahli, Virginia in Fuels , High resolution , Moisture content

2025

Wildfires are increasing in intensity and severity at an alarming rate. Recent advances in AI and publicly available satellite data enable monitoring critical wildfire risk factors globally, at high resolution and low latency. Live Fuel Moisture Content (LFMC) is a critical wildfire risk factor and is valuable for both wildfire research and operational response. However, ground-based LFMC samples are both labor intensive and costly to acquire, resulting in sparse and infrequent updates. In this work, we explore the use of a pretrained, highly-multimodal earth-observation model for generating large-scale spatially complete (wall-to-wall) LFMC maps. Our approach achieves significant improvements over previous methods using randomly initialized models (20 reduction in RMSE). We provide an automated pipeline that enables rapid generation of these LFMC maps across the United States, and demonstrate its effectiveness in two regions recently impacted by wildfire (Eaton and Palisades).

Paper

Share this book

Add to My Shelf

OlmoEarth v1.1: A more efficient family of OlmoEarth models

by Wilhelm, Christopher , Sablon, Hadrien , Johnson, Patrick Alan

2026

We present a set of improvements to the OlmoEarth family. These improvements allow us to cut compute costs during training (\\(1.7 \\) reduction in GPU hours required to train our Base models) and inference (\\(2.9\\) reductions in MACs on Sentinel-2 tasks), while maintaining the models' overall performance. All training code is available at github.com/allenai/olmoearth_pretrain.

Paper

Share this book

Add to My Shelf

YOLO9000: Better, Faster, Stronger

by Farhadi, Ali , Redmon, Joseph in Classification , Object recognition , Real time

2016

We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like Faster RCNN with ResNet and SSD while still running significantly faster. Finally we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don't have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for more than 9000 different object categories. And it still runs in real-time.

Paper

Share this book

Add to My Shelf

Real-Time Grasp Detection Using Convolutional Neural Networks

by Redmon, Joseph , Angelova, Anelia in Artificial neural networks , Frames per second , Mathematical models

2015

We present an accurate, real-time approach to robotic grasp detection based on convolutional neural networks. Our network performs single-stage regression to graspable bounding boxes without using standard sliding window or region proposal techniques. The model outperforms state-of-the-art approaches by 14 percentage points and runs at 13 frames per second on a GPU. Our network can simultaneously perform classification so that in a single step it recognizes the object and finds a good grasp rectangle. A modification to this model predicts multiple grasps per object by using a locally constrained prediction mechanism. The locally constrained model performs significantly better, especially on objects that can be grasped in a variety of ways.

Paper

Share this book

Add to My Shelf

OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation

by Wilhelm, Christopher , Shelhamer, Evan , Wood, Sebastian in Data collection , Data management , Machine learning

2025

Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present OlmoEarth: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings OlmoEarth achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy OlmoEarth as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The OlmoEarth Platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world's biggest problems. OlmoEarth source code, training data, and pre-trained weights are available at \\(https://github.com/allenai/olmoearth_pretrainhttps://github.com/allenai/olmoearth_pretrain\\).

Paper

Share this book

Add to My Shelf

Who Let The Dogs Out? Modeling Dog Behavior From Visual Data

by Mottaghi, Roozbeh , Farhadi, Ali , Bagherinezhad, Hessam in Computer vision , Dogs , Domains

2018

We introduce the task of directly modeling a visually intelligent agent. Computer vision typically focuses on solving various subtasks related to visual intelligence. We depart from this standard approach to computer vision; instead we directly model a visually intelligent agent. Our model takes visual information as input and directly predicts the actions of the agent. Toward this end we introduce DECADE, a large-scale dataset of ego-centric videos from a dog's perspective as well as her corresponding movements. Using this data we model how the dog acts and how the dog plans her movements. We show under a variety of metrics that given just visual input we can successfully model this intelligent agent in many situations. Moreover, the representation learned by our model encodes distinct information compared to representations trained on image classification, and our learned representation can generalize to other domains. In particular, we show strong results on the task of walkable surface estimation by using this dog modeling task as representation learning.

Paper

Share this book

Add to My Shelf

IQA: Visual Question Answering in Interactive Environments

by Rastegari, Mohammad , Farhadi, Ali , Fox, Dieter in Computer simulation , Conditioning , Control methods

2018

We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, like: \"Are there any apples in the fridge?\" The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question. Popular reinforcement learning approaches with a single controller perform poorly on IQA owing to the large and diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction. To evaluate HIMN, we introduce IQUAD V1, a new dataset built upon AI2-THOR, a simulated photo-realistic environment of configurable indoor scenes with interactive objects (code and dataset available at https://github.com/danielgordon10/thor-iqa-cvpr-2018). IQUAD V1 has 75,000 questions, each paired with a unique scene configuration. Our experiments show that our proposed model outperforms popular single controller based methods on IQUAD V1. For sample questions and results, please view our video: https://youtu.be/pXd3C-1jr98

Paper

Share this book

Add to My Shelf

XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

by Rastegari, Mohammad , Farhadi, Ali , Redmon, Joseph in Artificial neural networks , Classification , Neural networks

2016

We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32x memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58x faster convolutional operations and 32x memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is only 2.9% less than the full-precision AlexNet (in top-1 measure). We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy.

Paper

Share this book

Add to My Shelf

Language Selector

MBRLGlobalSearch

Language Selector

Catalogue Search | MBRL

Search Results Heading

Explore the vast range of titles available.

MBRLSearchResults

MBRLHappinessMeter