Catalogue Search | MBRL
6 result(s) for "Multi-Objective Decision Making (MODeM)"
Actor-critic multi-objective reinforcement learning for non-linear utility functions
by Hayes, Conor F.; Roijers, Diederik M.; Reymond, Mathieu
in Algorithms, Artificial Intelligence, Computer Science
2023
We propose a novel multi-objective reinforcement learning algorithm that successfully learns the optimal policy even for non-linear utility functions. Non-linear utility functions pose a challenge for state-of-the-art approaches, both in terms of learning efficiency and the solution concept. A key insight is that, by proposing a critic that learns a multi-variate distribution over the returns, which is then combined with the accumulated rewards, we can directly optimize the utility function even if it is non-linear. This vastly increases the range of problems that can be solved compared to those handled by single-objective methods or by multi-objective methods requiring linear utility functions, while avoiding the need to learn the full Pareto front. We demonstrate our method on multiple multi-objective benchmarks, and show that it learns effectively where baseline approaches fail.
Journal Article
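The key insight of this abstract, that one must optimize the expected utility of the full return rather than the utility of the expected return, can be illustrated with a small sketch (the maximin-style utility and the Gaussian return samples below are assumptions for illustration, not the paper's setup):

```python
import random

random.seed(0)

# Illustrative non-linear utility over a 2-objective return: the worst
# objective. The paper learns arbitrary non-linear utilities, which this
# sketch does not reproduce.
def utility(ret):
    return min(ret)

# Pretend the critic's multi-variate return distribution gave us samples
# of future returns; combine each sample with the returns accrued so far,
# as the abstract describes.
accrued = (1.0, 2.0)
future = [(random.gauss(2, 1), random.gauss(1, 1)) for _ in range(10_000)]
totals = [(accrued[0] + f0, accrued[1] + f1) for f0, f1 in future]

# Expected utility of the return (what a distributional critic lets us
# optimise directly) ...
eu = sum(utility(t) for t in totals) / len(totals)

# ... differs from the utility of the expected return (all that an
# expected-value critic can offer) whenever the utility is non-linear.
mean = tuple(sum(t[i] for t in totals) / len(totals) for i in range(2))
ue = utility(mean)
```

Here `eu` comes out noticeably below `ue`: a policy chosen by scalarising expected returns would overestimate its own utility.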
Monte Carlo tree search algorithms for risk-aware and multi-objective reinforcement learning
by Hayes, Conor F.; Mannion, Patrick; Roijers, Diederik M.
in Algorithms, Artificial Intelligence, Computer Science
2023
In many risk-aware and multi-objective reinforcement learning settings, the utility of the user is derived from a single execution of a policy. In these settings, making decisions based on the average future returns is not suitable. For example, in a medical setting a patient may only have one opportunity to treat their illness. Making decisions using just the expected future returns, known in reinforcement learning as the value, cannot account for the potential range of adverse or positive outcomes a decision may have. Therefore, we should use the distribution over expected future returns differently to represent the critical information that the agent requires at decision time, by taking both the future and accrued returns into consideration. In this paper, we propose two novel Monte Carlo tree search algorithms. Firstly, we present a Monte Carlo tree search algorithm that can compute policies for nonlinear utility functions (NLU-MCTS) by optimising the utility of the different possible returns attainable from individual policy executions, resulting in good policies for both risk-aware and multi-objective settings. Secondly, we propose a distributional Monte Carlo tree search algorithm (DMCTS) which extends NLU-MCTS. DMCTS computes an approximate posterior distribution over the utility of the returns, and utilises Thompson sampling during planning to compute policies in risk-aware and multi-objective settings. Both algorithms outperform the state-of-the-art in multi-objective reinforcement learning for the expected utility of the returns.
Journal Article
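The core move described here, backing up the utility of each complete return and Thompson-sampling over those utilities, can be sketched in miniature. This is a one-step bandit simplification, not the paper's tree search, and the bootstrap resample standing in for DMCTS's approximate posterior is an assumption:

```python
import random

random.seed(1)

def utility(ret):
    # Illustrative risk-averse utility: the worst objective of a return.
    return min(ret)

class Node:
    def __init__(self):
        self.utilities = []  # utility of each complete return backed up here

    def thompson_value(self):
        # Crude posterior sample: the mean of a bootstrap resample.
        if not self.utilities:
            return float("inf")  # force one try of untested actions
        resample = random.choices(self.utilities, k=len(self.utilities))
        return sum(resample) / len(resample)

def simulate(action):
    # Action 0 is safe; action 1 is high-variance with a mixed mean.
    if action == 0:
        return (random.gauss(1.0, 0.1), random.gauss(1.0, 0.1))
    return (random.gauss(2.0, 2.0), random.gauss(0.0, 2.0))

nodes = [Node(), Node()]
for _ in range(2000):
    a = max(range(2), key=lambda i: nodes[i].thompson_value())
    nodes[a].utilities.append(utility(simulate(a)))

# Averaging utilities of whole returns, rather than expected rewards,
# steers the search toward the safe action.
best = max(range(2), key=lambda i: sum(nodes[i].utilities) / len(nodes[i].utilities))
```

An expected-total-reward criterion would rate the two actions similarly; the utility-of-returns backup prefers the low-variance one.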
Using soft maximin for risk averse multi-objective decision-making
by Pihlakas, Roland; Klassert, Robert; Smith, Benjamin J.
in Algorithms, Artificial Intelligence, Computer Science
2023
Balancing multiple competing and conflicting objectives is an essential task for any artificial intelligence tasked with satisfying human values or preferences. Conflict arises both from misalignment between individuals with competing values and from conflicting value systems held by a single human. Starting with the principle of loss aversion, we designed a set of soft maximin function approaches to multi-objective decision-making. Benchmarking these functions in a set of previously-developed environments, we found that one new approach in particular, 'split-function exp-log loss aversion' (SFELLA), learns faster than the state-of-the-art thresholded alignment objective method of Vamplew (Engineering Applications of Artificial Intelligence 100:104186, 2021) on three of the four tasks it was tested on, and achieved the same optimal performance after learning. SFELLA also showed relative robustness improvements against changes in objective scale, which may highlight an advantage in dealing with distribution shifts in the environment dynamics. We further compared SFELLA to the multi-objective reward exponentials (MORE) approach, and found that SFELLA performs similarly to MORE in a simple previously-described foraging task, but in a modified foraging environment with a new resource that was not depleted as the agent worked, SFELLA collected more of the new resource with very little cost incurred in terms of the old resource. Overall, we found SFELLA useful for avoiding problems that sometimes occur with a thresholded approach, and more reward-responsive than MORE while retaining its conservative, loss-averse incentive structure.
Journal Article
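The soft maximin idea can be sketched with the generic exp-log aggregator; SFELLA's specific split-function variant is not reproduced here, so treat this as an assumed illustration of the family:

```python
import math

def soft_maximin(objectives, tau=1.0):
    # Smooth approximation of min(objectives): exact min as tau -> 0.
    # Loss-averse: lowering the worst objective hurts the score far more
    # than an equal gain on a better-off objective helps it.
    return -tau * math.log(sum(math.exp(-o / tau) for o in objectives))

balanced = soft_maximin([1.0, 1.0])
lopsided = soft_maximin([2.0, 0.0])  # same total reward, worse minimum
```

A linear scalarisation scores `balanced` and `lopsided` equally; the soft maximin prefers the balanced outcome, which is the conservative incentive structure the abstract describes.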
Changing criteria weights to achieve fair VIKOR ranking: a postprocessing reranking approach
by Radovanović, Sandro; Dodevska, Zorica; Petrović, Andrija
in Accuracy, Algorithms, Artificial Intelligence
2023
Ranking is a prerequisite for making decisions, and therefore it is a very responsible and frequently applied activity. This study considers fairness issues in a multi-criteria decision-making (MCDM) method called VIKOR (in the Serbian language, VIšekriterijumska optimizacija i KOmpromisno Rešenje, meaning Multiple Criteria Optimization and Compromise Solution). The method is specific because of its original property of searching for the first-ranked compromise solutions based on the parameter v. The VIKOR method was modified in this paper to rank all the alternatives and find compromise solutions for each rank. Then, the obtained ranks were used to satisfy fairness constraints (i.e., the desired level of disparate impact) by criteria weights optimization. We built three types of mathematical models depending on decision makers' (DMs') preferences regarding the definition of the compromise parameter v. Metaheuristic optimization algorithms were explored in order to minimize the differences in VIKOR ranking prior to and after optimization. The proposed postprocessing reranking approach ensures fair ranking (i.e., ranking without discrimination). The conducted experiments involve three real-life datasets of different sizes, well-known in the literature. The comparisons of the results with popular fair ranking algorithms include a comparative examination of several rank-based metrics intended to measure accuracy and fairness, which indicate the high quality of the suggested approach. The most significant contributions include developing automated and adaptive optimization procedures with the possibility of further adjustments following DMs' preferences, and matching fairness metrics with traditional MCDM goals in a comprehensive full VIKOR ranking.
Journal Article
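For context, the standard VIKOR scores that the parameter v trades off (group utility S against individual regret R) can be sketched as follows; the paper's fair-reranking modification and weight optimization are not reproduced:

```python
def vikor_q(matrix, w, v=0.5):
    """Q scores for benefit criteria; lower Q ranks higher.

    Assumes each criterion has distinct best and worst values, so the
    normalising denominators are non-zero.
    """
    m = len(w)
    best = [max(row[j] for row in matrix) for j in range(m)]
    worst = [min(row[j] for row in matrix) for j in range(m)]
    S, R = [], []
    for row in matrix:
        # Weighted normalised distance to the ideal point, per criterion.
        d = [w[j] * (best[j] - row[j]) / (best[j] - worst[j]) for j in range(m)]
        S.append(sum(d))  # group utility
        R.append(max(d))  # individual regret
    s_best, s_worst = min(S), max(S)
    r_best, r_worst = min(R), max(R)
    # v weights group utility; (1 - v) weights individual regret.
    return [v * (S[i] - s_best) / (s_worst - s_best)
            + (1 - v) * (R[i] - r_best) / (r_worst - r_best)
            for i in range(len(matrix))]

# Three alternatives scored on two equally weighted benefit criteria.
q = vikor_q([[5, 5], [1, 1], [3, 4]], w=[0.5, 0.5], v=0.5)
```

The alternative dominating on both criteria gets Q = 0 and ranks first; the paper then perturbs the weights w so that the resulting full ranking also satisfies fairness constraints.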
Fast approximate bi-objective Pareto sets with quality bounds
by Xu, Siyao; Goldsmith, Judy; Bailey, William
in Algorithms, Approximation, Artificial Intelligence
2023
We present and empirically characterize a general, parallel, heuristic algorithm for computing small ϵ-Pareto sets. A primary feature of the algorithm is that it maintains and improves an upper bound on the ϵ value throughout the algorithm. The algorithm can be used as part of a decision support tool for settings in which computing points in objective space is computationally expensive. We use the bi-objective TSP and graph clearing problems as benchmark examples. We characterize the performance of the algorithm through ϵ-Pareto set size, upper bound on ϵ value provided, true ϵ value provided, and parallel speedup achieved. Our results show that the algorithm's combination of small ϵ-Pareto sets and parallel speedup is sufficient to be appealing in settings requiring manual review (i.e., those that have a human in the loop) or real-time solutions.
Journal Article
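The ϵ-Pareto notion used above can be made concrete with the standard additive ϵ-dominance definition for a bi-objective minimisation problem; the paper's parallel bound-tightening algorithm itself is not reproduced:

```python
def eps_dominates(a, b, eps):
    # Point a additively eps-dominates point b (both objectives minimised).
    return a[0] <= b[0] + eps and a[1] <= b[1] + eps

def achieved_eps(subset, points):
    # Smallest additive eps for which every point is eps-dominated by
    # some member of `subset`; this is the quantity the algorithm's
    # upper bound tracks.
    return max(min(max(s[0] - p[0], s[1] - p[1], 0.0) for s in subset)
               for p in points)

front = [(0, 10), (2, 6), (4, 4), (6, 2), (10, 0)]  # true Pareto set
subset = [(0, 10), (4, 4), (10, 0)]                 # small candidate cover
```

Here the three-point subset covers the five-point front with ϵ = 2: a much smaller set presented to the human in the loop, with an explicit quality bound.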
Uniformly constrained reinforcement learning
by Lee, Jaeyoung; Sedwards, Sean; Czarnecki, Krzysztof
in Adaptive algorithms, Algorithms, Approximation
2024
We propose new multi-objective reinforcement learning algorithms that aim to find a globally Pareto-optimal deterministic policy that uniformly (in all states) maximizes a reward subject to a uniform probabilistic constraint over reaching forbidden states of a Markov decision process. Our requirements arise naturally in the context of safety-critical systems, but pose a significant unmet challenge. This class of learning problem is known to be hard and there are no off-the-shelf solutions that fully address the combined requirements of determinism and uniform optimality. Having formalized our requirements and highlighted the specific challenge of learning instability, using a simple counterexample, we define from first principles a stable Bellman operator that we prove partially respects our requirements. This operator is therefore a partial solution to our problem, but produces conservative policies in comparison to our previous approach, which was not designed to satisfy the same requirements. We thus propose a relaxation of the stable operator, using adaptive hysteresis, that forms the basis of a heuristic approach that is stable w.r.t. our counterexample and learns policies that are less conservative than those of the stable operator and our previous algorithm. In comparison to our previous approach, the policies of our adaptive hysteresis algorithm demonstrate improved monotonicity with increasing constraint probabilities, which is one of the characteristics we desire. We demonstrate that adaptive hysteresis works well with dynamic programming and reinforcement learning, and can be adapted to function approximation.
Journal Article
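The "uniform" requirement, reward maximised in every state only over actions that respect the risk budget, can be sketched in tabular form. This is an assumed simplification for illustration: the two value tables, the fallback rule, and the state/action names are hypothetical, and the paper's stable operator and adaptive hysteresis are not reproduced:

```python
def constrained_policy(q_reward, q_risk, p_max):
    # q_reward[s][a]: estimated reward value of action a in state s.
    # q_risk[s][a]:   estimated probability of ever reaching a forbidden
    #                 state when taking a in s (then acting on-policy).
    # Deterministic policy: in every state, maximise reward over the
    # actions whose risk estimate stays within the budget p_max.
    policy = {}
    for s in q_reward:
        feasible = [a for a in q_reward[s] if q_risk[s][a] <= p_max]
        candidates = feasible or list(q_reward[s])  # no feasible action: fall back
        policy[s] = max(candidates, key=lambda a: q_reward[s][a])
    return policy

# Action 'b' earns more reward but breaches the 10% risk budget.
policy = constrained_policy(
    q_reward={"s0": {"a": 1.0, "b": 2.0}},
    q_risk={"s0": {"a": 0.05, "b": 0.40}},
    p_max=0.10,
)
```

The instability the abstract mentions arises because the risk estimates themselves depend on the policy being selected; the paper's stable operator and its hysteresis relaxation address exactly that feedback loop.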