123 result(s) for "Te-Lin, Wu"
Grounded-Knowledge-Enhanced Instruction Understanding for Multimodal Assistant Applications
With the recent advancements in artificial intelligence (AI), researchers are making endeavours towards building an AI that can understand humans, collaborate with humans, and help or guide them to accomplish everyday chores. Actualizing such an assistant AI poses several challenges, including planning (over certain events), comprehending human instructions, multimodal understanding, and grounded conversational ability. Imagine a scenario in which one wishes to perform a task, such as "making a plate of fried rice" or "purchasing a suitable sofa bed", which can require multiple steps of actions and the manipulation of certain objects. How would an assistant AI collaborate with humans to accomplish such desired tasks? One crucial aspect of the system is understanding how and when to take a certain action, which is often learned by interpreting and following guidance: a resource that encompasses knowledge about accomplishing the task and, potentially, the events that will occur during task completion.
The guidance can come from human verbal interactions (e.g., in the form of a conversation or a question) or from static written instructional manuals. In the first part of this thesis, I decompose the proposed system framework into three foundational components: (1) task-step sequencing/planning, where the AI needs to understand the appropriate sequential procedure of performing each sub-task to accomplish the whole task, especially when the task knowledge is learned from instructional resources online, which can be numerous and do not always come consolidated with proper ordering; (2) action-dependency understanding, where an agent should be able to infer the dependencies of performing an action and the outcomes of executing a particular action, in order to assess the situation and adjust the plan for accomplishing the task; and (3) multimodal grounding and active perception, where we equip the AI with the ability to actively ground the visually perceived surroundings to the textual instructions (or verbal interactions) and to reason over multimodal information throughout task completion. In the second part of this thesis, I introduce two newly curated resources that foresee the next-phase challenges of building a strong and helpful assistive AI. One resource focuses on counterfactual reasoning, a capability humans frequently rely on when making complex decisions, while the other presents a comprehensive suite of multimodal capabilities an assistive AI needs to function in a virtually created world. Combining the two parts, the foundational components as well as the newly established challenging benchmarks, this thesis aims to provide a comprehensive research road map for next-generation (multimodal) AI assistants.
Four New Names in Chinese and Vietnamese Zingiberaceae
During preparation of the account of Zingiberaceae for the Flora of China, volume 24, it was noticed that four species are illegitimately named, being later homonyms: Amomum thyrsoideum Gagnepain (1903), not Ruiz and Pavón (1798), A. aurantiacum H. T. Tsai & S. W. Zhao (1979), not Ridley (1920), Hedychium carneum Y. Y. Qian (1994), not Loddiges (1823), and Zingiber truncatum S. Q. Tong (1987), not Stokes (1812). Therefore, the following new names (nomina nova) are proposed here, respectively: A. gagnepainii T. L. Wu, K. Larsen & Turland, A. neoaurantiacum T. L. Wu, K. Larsen & Turland, H. neocarneum T. L. Wu, K. Larsen & Turland, and Z. neotruncatum T. L. Wu, K. Larsen & Turland.
Validation of the Name Zingiber koshunense (Zingiberaceae), a Species Endemic to Taiwan
During preparation of the account of Zingiberaceae for the Flora of China, volume 24, it was noticed that one species, Zingiber koshunense Hayata, reported from Taiwan in 1930, was invalidly named because no description was provided. The species was later described in 1978 as Z. koshunense Hayata ex C. T. Moo, but the name remained invalid because, although two specimens were cited, there was no indication of a type. The name is here validated, with one of these specimens designated as the holotype.
Notes on the Lowiaceae, Musaceae, and Zingiberaceae for the Flora of China
A new combination in the Lowiaceae, Orchidantha chinensis T. L. Wu var. longisepala (D. Fang) T. L. Wu, as well as a new species, Alpinia jianganfeng T. L. Wu, and two new combinations, Amomum petaloideum (S. J. Tong) T. L. Wu and Roscoea cautleoides Gagnepain var. pubescens (Z. Y. Zhu) T. L. Wu, in the Zingiberaceae are proposed. Eight species names in the Lowiaceae, Musaceae, and Zingiberaceae are reduced to synonymy.
Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually. One important step towards this goal is to localize and track key active objects that undergo major state change as a consequence of human actions/interactions with the environment, without being told exactly what/where to ground (e.g., localizing and tracking the `sponge` in video from the instruction "Dip the `sponge` into the bucket."). While existing works approach this problem from a pure vision perspective, we investigate to what extent the textual modality (i.e., task instructions) and its interaction with the visual modality can be beneficial. Specifically, we propose to improve phrase grounding models' ability to localize the active objects by: (1) learning the role of `objects undergoing change` and extracting them accurately from the instructions, (2) leveraging pre- and post-conditions of the objects during actions, and (3) recognizing the objects more robustly with descriptional knowledge. We leverage large language models (LLMs) to extract the aforementioned action-object knowledge, and design a per-object aggregation masking technique to effectively perform joint inference on object phrases and symbolic knowledge. We evaluate our framework on the Ego4D and Epic-Kitchens datasets. Extensive experiments demonstrate the effectiveness of our proposed framework, which leads to >54% improvements in all standard metrics on the TREK-150-OPE-Det localization + tracking task, >7% improvements in all standard metrics on the TREK-150-OPE tracking task, and >3% improvements in average precision (AP) on the Ego4D SCOD task.
VDebugger: Harnessing Execution Feedback for Debugging Visual Programs
Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems. However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex visual programs remains a major bottleneck for visual reasoning. To address this, we introduce VDebugger, a novel critic-refiner framework trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline that injects errors into correct visual programs using a novel mask-best decoding technique. Evaluations on six datasets demonstrate VDebugger's effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger's ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task. Code, data and models are made publicly available at https://github.com/shirley-wu/vdebugger/
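The critic-refiner loop described in the abstract can be sketched generically. This is a minimal illustration under assumed interfaces, not the actual VDebugger implementation: `execute`, `critic`, and `refiner` are hypothetical stand-ins for the trained components, and the toy program below is invented.

```python
def debug_program(program, execute, critic, refiner, max_rounds=3):
    """Generic critic-refiner loop: run the program, let the critic
    localize a faulty step from the execution trace, and let the
    refiner rewrite that step. Stops once the critic finds no error."""
    for _ in range(max_rounds):
        trace = execute(program)         # step-by-step execution feedback
        faulty = critic(program, trace)  # index of the faulty step, or None
        if faulty is None:
            return program
        program = refiner(program, trace, faulty)
    return program

# Toy stand-ins: a "program" is a list of step names, and the injected
# bug is a step spelled "cont" instead of "count".
execute = lambda prog: ["error" if s == "cont" else "ok" for s in prog]
critic = lambda prog, trace: trace.index("error") if "error" in trace else None
refiner = lambda prog, trace, i: prog[:i] + ["count"] + prog[i + 1:]

fixed = debug_program(["detect", "cont", "answer"], execute, critic, refiner)
```

In the paper's setting the critic and refiner are trained models and the feedback comes from executing real visual programs; the loop here only mirrors that high-level control flow.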
InSpaceType: Dataset and Benchmark for Reconsidering Cross-Space Type Performance in Indoor Monocular Depth
Indoor monocular depth estimation helps home automation, including robot navigation and AR/VR for surrounding perception. Most previous methods primarily experiment with the NYUv2 Dataset and concentrate on overall performance in their evaluation. However, their robustness and generalization to diverse unseen types or categories of indoor spaces (space types) have yet to be examined. Researchers may empirically find degraded performance in a released pretrained model on custom data or less-frequent types. This paper studies a common but easily overlooked factor, space type, and characterizes a model's performance variance across spaces. We present the InSpaceType Dataset, a high-quality RGBD dataset for general indoor scenes, and benchmark 13 recent state-of-the-art methods on InSpaceType. Our examination shows that most of them suffer from performance imbalance between head and tail types, and some top methods are affected even more severely. The work reveals and analyzes the underlying bias in detail for transparency and robustness. We extend the analysis to a total of 4 datasets and discuss best practice in synthetic data curation for training indoor monocular depth. Further, a dataset ablation is conducted to find the key factor in generalization. This work marks the first in-depth investigation of performance variance across space types and, more importantly, releases useful tools, including datasets and code, to closely examine your pretrained depth models. Data and code: https://depthcomputation.github.io/DepthPublic/
ARMADA: Attribute-Based Multimodal Data Augmentation
In Multimodal Language Models (MLMs), the cost of manually annotating high-quality image-text pair data for fine-tuning and alignment is extremely high. While existing multimodal data augmentation frameworks propose ways to augment image-text pairs, they either suffer from semantic inconsistency between texts and images, or generate unrealistic images, causing a knowledge gap with real-world examples. To address these issues, we propose Attribute-based Multimodal Data Augmentation (ARMADA), a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes of the mentioned entities. Specifically, we extract entities and their visual attributes from the original text data, then search for alternative values for the visual attributes under the guidance of knowledge bases (KBs) and large language models (LLMs). We then utilize an image-editing model to edit the images with the extracted attributes. ARMADA is a novel multimodal data generation framework that: (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation, (ii) generates visually similar images of disparate categories using neighboring entities in the KB hierarchy, and (iii) uses the commonsense knowledge of LLMs to modulate auxiliary visual attributes such as backgrounds for a more robust representation of the original entities. Our empirical results over four downstream tasks demonstrate the efficacy of our framework in producing high-quality data and enhancing model performance. This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
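The text-side attribute substitution can be illustrated with a toy sketch (the image-editing step is omitted). The tiny knowledge base, entity, and caption below are invented for illustration and are not ARMADA's actual data or interface.

```python
# Toy attribute-based caption augmentation: swap one visual attribute
# value for its alternatives listed in a small, hand-made knowledge base.
KB = {"apple": {"color": ["red", "green", "yellow"]}}

def augment_caption(caption, entity, attribute, value):
    """Return captions with `value` replaced by each alternative KB value."""
    alternatives = [v for v in KB[entity][attribute] if v != value]
    return [caption.replace(value, alt) for alt in alternatives]

variants = augment_caption("a red apple on a table", "apple", "color", "red")
```

The KB-guided swap is what keeps the augmented caption semantically plausible, in contrast to free-form text perturbation.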
InSpaceType: Reconsider Space Type in Indoor Monocular Depth Estimation
Indoor monocular depth estimation has attracted increasing research interest. Most previous works have focused on methodology, primarily experimenting with the NYU-Depth-V2 (NYUv2) Dataset, and concentrated only on overall performance over the test set. However, little is known regarding robustness and generalization when it comes to applying monocular depth estimation methods to real-world scenarios, where highly varying and diverse functional space types, such as library or kitchen, are present. A performance breakdown into space types is essential for understanding a pretrained model's performance variance. To facilitate our investigation of robustness and address the limitations of previous works, we collect InSpaceType, a high-quality and high-resolution RGBD dataset for general indoor environments. We benchmark 12 recent methods on InSpaceType and find they severely suffer from performance imbalance concerning space types, which reveals their underlying bias. We extend our analysis to 4 other datasets, 3 mitigation approaches, and the ability to generalize to unseen space types. Our work marks the first in-depth investigation of performance imbalance across space types for indoor monocular depth estimation, drawing attention to potential safety concerns when deploying models without considering space types, and shedding light on potential ways to improve robustness. See https://depthcomputation.github.io/DepthPublic for data and the supplementary document. The benchmark list on the GitHub project page is kept updated with the latest monocular depth estimation methods.
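The per-space-type performance breakdown that both InSpaceType abstracts argue for amounts to grouping per-image errors by space type before averaging, rather than reporting one overall mean. A minimal sketch, with invented records:

```python
from collections import defaultdict
from statistics import mean

def per_type_error(records):
    """Average per-image depth errors separately for each space type,
    exposing head-vs-tail imbalance that an overall mean would hide.
    `records` is an iterable of (space_type, error) pairs."""
    by_type = defaultdict(list)
    for space_type, error in records:
        by_type[space_type].append(error)
    return {t: mean(errs) for t, errs in by_type.items()}

stats = per_type_error([("kitchen", 1.0), ("kitchen", 3.0), ("library", 8.0)])
```

A model with a low overall mean can still fare poorly on tail types such as `library` here, which is exactly the imbalance the benchmark surfaces.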
Learning Action Conditions from Instructional Manuals for Instruction Understanding
The ability to infer pre- and postconditions of an action is vital for comprehending complex instructions, and is essential for applications such as autonomous instruction-guided agents and assistive AI that supports humans in performing physical tasks. In this work, we propose a task dubbed action condition inference, and collect a high-quality, human-annotated dataset of preconditions and postconditions of actions in instructional manuals. We propose a weakly supervised approach to automatically construct large-scale training instances from online instructional manuals, and curate a densely human-annotated and validated dataset to study how well current NLP models can infer action-condition dependencies in instruction texts. We design two types of models that differ in whether contextualized and global information is leveraged, as well as various combinations of heuristics to construct the weak supervision. Our experimental results show a >20% F1-score improvement from considering the entire instruction context and a >6% F1-score benefit from the proposed heuristics.
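A weakly supervised labeling pass of the kind the abstract mentions could, for illustration, key on lexical cues that introduce condition clauses. The cue lists and the example step below are assumptions made for this sketch, not the paper's actual heuristics.

```python
import re

# Hypothetical cue phrases: clauses opening with PRE_CUES are treated as
# preconditions; clauses containing POST_CUES as postconditions.
PRE_CUES = ("before", "make sure", "once")
POST_CUES = ("until", "so that")

def weak_label(step):
    """Return (label, clause) pairs for condition clauses in one step."""
    labels = []
    for clause in re.split(r"[,;]", step.lower()):
        clause = clause.strip()
        if clause.startswith(PRE_CUES):
            labels.append(("precondition", clause))
        elif any(cue in clause for cue in POST_CUES):
            labels.append(("postcondition", clause))
    return labels

labels = weak_label("Once the pan is hot, fry the rice until golden")
```

Such noisy labels are what would make large-scale training instances cheap to produce; the dense human annotation then serves as the clean evaluation set.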