Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
135 result(s) for "Munos, Remi"
α-Rank: Multi-Agent Evaluation by Evolution
by Rowland, Mark; Czarnecki, Wojciech M.; Tuyls, Karl
in 631/1647/2198, 631/181/2469, 639/705/117
2019
We introduce α-Rank, a principled evolutionary dynamics methodology for the evaluation and ranking of agents in large-scale multi-agent interactions, grounded in a novel dynamical game-theoretic solution concept called Markov-Conley chains (MCCs). The approach leverages continuous-time and discrete-time evolutionary dynamical systems applied to empirical games, and scales tractably in the number of agents, in the type of interactions (beyond dyadic), and in the type of empirical games (symmetric and asymmetric). Current models are fundamentally limited in one or more of these dimensions, and are not guaranteed to converge to the desired game-theoretic solution concept (typically the Nash equilibrium). α-Rank automatically provides a ranking over the set of agents under evaluation and provides insights into their strengths, weaknesses, and long-term dynamics in terms of basins of attraction and sink components. This is a direct consequence of the correspondence we establish to the dynamical MCC solution concept when the underlying evolutionary model's ranking-intensity parameter, α, is chosen to be large, which exactly forms the basis of α-Rank. In contrast to the Nash equilibrium, which is a static solution concept based solely on fixed points, MCCs are a dynamical solution concept based on the Markov chain formalism, Conley's Fundamental Theorem of Dynamical Systems, and the core ingredients of dynamical systems: fixed points, recurrent sets, periodic orbits, and limit cycles. Our α-Rank method runs in polynomial time with respect to the total number of pure strategy profiles, whereas computing a Nash equilibrium for a general-sum game is known to be intractable. We introduce mathematical proofs that not only provide an overarching and unifying perspective of existing continuous- and discrete-time evolutionary evaluation models, but also reveal the formal underpinnings of the α-Rank methodology. We illustrate the method in canonical games and empirically validate it in several domains, including AlphaGo, AlphaZero, MuJoCo Soccer, and Poker.
Journal Article
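The ranking computation the abstract above outlines reduces, in the single-population case, to building a Markov chain over pure strategies and reading off its stationary distribution. The sketch below is a minimal illustration of that idea for a symmetric two-player game, not the authors' implementation; the fixation-probability form, the uniform mutation rate, and the parameter defaults are standard evolutionary-dynamics assumptions.

```python
import numpy as np

def alpha_rank_single_population(payoffs, alpha=10.0, pop_size=50):
    """Single-population alpha-Rank sketch for a symmetric two-player game
    given by payoffs[i, j] (row strategy's payoff against the column one)."""
    n = payoffs.shape[0]
    eta = 1.0 / (n - 1)              # uniform mutation among other strategies
    C = np.zeros((n, n))             # transition matrix over pure strategies
    for s in range(n):               # resident strategy
        for m in range(n):           # mutant strategy
            if m == s:
                continue
            x = alpha * (payoffs[m, s] - payoffs[s, m])   # scaled fitness gap
            if abs(x) < 1e-12:
                rho = 1.0 / pop_size          # neutral drift
            elif x < -10.0:
                rho = 0.0                     # strongly disfavoured mutant
            else:
                rho = np.expm1(-x) / np.expm1(-pop_size * x)  # fixation prob.
            C[s, m] = eta * rho
        C[s, s] = 1.0 - C[s].sum()
    # Stationary distribution: left eigenvector of C for eigenvalue 1.
    w, v = np.linalg.eig(C.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    return pi / pi.sum()

# Rock-paper-scissors: the mass spreads evenly over the cycle (~1/3 each).
rps = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
print(alpha_rank_single_population(rps))
```

The even spread over the rock-paper-scissors cycle is exactly the kind of long-term, sink-component structure that a static Nash-based ranking does not expose.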
Game Plan: What AI can do for Football, and What Football can do for AI
by Luc, Pauline; Moreno, Pol; Eslami, S. M. Ali
in Artificial Intelligence, Computer Science, Computer vision
2021
The rapid progress in artificial intelligence (AI) and machine learning has opened unprecedented analytics possibilities in various team and individual sports, including baseball, basketball, and tennis. More recently, AI techniques have been applied to football, owing to a huge increase in data collection by professional teams, increased computational power, and advances in machine learning, with the goal of better addressing new scientific challenges involved in the analysis of both individual players' and coordinated teams' behaviors. The research challenges associated with predictive and prescriptive football analytics require new developments and progress at the intersection of statistical learning, game theory, and computer vision. In this paper, we provide an overarching perspective highlighting how the combination of these fields forms a unique microcosm for AI research, while offering mutual benefits for professional teams, spectators, and broadcasters in the years to come. We illustrate that this duality makes football analytics a game changer of tremendous value, both for changing the game of football itself and for what this domain can mean for the field of AI. We review the state of the art and exemplify the types of analysis enabled by combining the aforementioned fields, including illustrative examples of counterfactual analysis using predictive models, and the combination of game-theoretic analysis of penalty kicks with statistical learning of player attributes. We conclude by highlighting envisioned downstream impacts, including possibilities for extensions to other sports (real and virtual).
Journal Article
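One of the worked examples the abstract above mentions, game-theoretic analysis of penalty kicks, reduces in its simplest form to solving a 2x2 zero-sum game for a mixed-strategy equilibrium. The sketch below does exactly that; the scoring probabilities are invented for illustration, and the closed form assumes neither side has a dominant pure action.

```python
import numpy as np

# Scoring probabilities for the kicker (rows: aim left/right) against the
# goalkeeper (columns: dive left/right). Numbers are invented placeholders.
A = np.array([[0.58, 0.95],
              [0.93, 0.70]])

def mixed_equilibrium_2x2(A):
    """Closed-form mixed-strategy equilibrium of a 2x2 zero-sum game
    (valid when neither player has a dominant pure strategy)."""
    denom = A[0, 0] - A[0, 1] - A[1, 0] + A[1, 1]
    p = (A[1, 1] - A[1, 0]) / denom          # kicker's probability of row 0
    q = (A[1, 1] - A[0, 1]) / denom          # keeper's probability of column 0
    value = (A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]) / denom
    return p, q, value

print(mixed_equilibrium_2x2(A))   # equilibrium mixes and expected scoring rate
```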
A distributional code for value in dopamine-based reinforcement learning
by Munos, Rémi; Starkweather, Clara Kwon; Dabney, Will
in 631/378/116/2396, 631/378/1595, 631/378/1788
2020
Since its introduction, the reward prediction error theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain [1–3]. According to the now canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. Here we propose an account of dopamine-based reinforcement learning inspired by recent artificial intelligence research on distributional reinforcement learning [4–6]. We hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea implies a set of empirical predictions, which we tested using single-unit recordings from mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning.
Analyses of single-cell recordings from mouse ventral tegmental area are consistent with a model of reinforcement learning in which the brain represents possible future rewards not as a single mean of stochastic outcomes, as in the canonical model, but instead as a probability distribution.
Journal Article
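The core hypothesis above, that a population of value predictors with asymmetric sensitivities to positive and negative prediction errors collectively encodes a reward distribution, can be simulated in a few lines. The sketch below is a hedged illustration of that mechanism (an expectile-style distributional update); the reward distribution, learning rate, and asymmetry grid are invented for the example and are not the paper's experimental values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each simulated "neuron" i carries an asymmetry tau_i: positive prediction
# errors are scaled by tau_i, negative ones by (1 - tau_i).
taus = np.linspace(0.1, 0.9, 9)
values = np.zeros_like(taus)          # per-neuron value predictions
lr = 0.01

# Rewards drawn from a bimodal distribution (details assumed for the demo).
def sample_reward():
    return rng.normal(1.0, 0.2) if rng.random() < 0.5 else rng.normal(4.0, 0.2)

for _ in range(20000):
    r = sample_reward()
    delta = r - values                         # per-neuron prediction errors
    scale = np.where(delta > 0, taus, 1 - taus)
    values += lr * scale * delta               # asymmetric update

# The learned values fan out across the reward distribution instead of all
# converging to its mean (~2.5): optimistic neurons (large tau) settle high,
# pessimistic ones settle low.
print(np.round(values, 2))
```

Each simulated neuron converges to a different statistic of the reward distribution, so reading out the whole population recovers its shape rather than only its mean.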
Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation
2013
We consider optimal sequential allocation in the context of the so-called stochastic multi-armed bandit model. We describe a generic index policy, in the sense of Gittins [J. R. Stat. Soc. Ser. B Stat. Methodol. 41 (1979) 148–177], based on upper confidence bounds of the arm payoffs computed using the Kullback-Leibler divergence. We consider two classes of distributions for which instances of this general idea are analyzed: the kl-UCB algorithm is designed for one-parameter exponential families and the empirical KL-UCB algorithm for bounded and finitely supported distributions. Our main contribution is a unified finite-time analysis of the regret of these algorithms that asymptotically matches the lower bounds of Lai and Robbins [Adv. in Appl. Math. 6 (1985) 4–22] and Burnetas and Katehakis [Adv. in Appl. Math. 17 (1996) 122–142], respectively. We also investigate the behavior of these algorithms when used with general bounded rewards, showing in particular that they provide significant improvements over the state of the art.
Journal Article
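For Bernoulli rewards the kl-UCB index described above has a simple computational form: the largest mean q whose KL divergence from the empirical mean stays under an exploration level, found by bisection. The sketch below is a minimal version under that Bernoulli assumption; the exploration level is taken as log(t)/N, whereas the paper also analyzes refinements with an additional log log t term.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Bernoulli KL divergence kl(p, q), clipped for numerical safety."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def klucb_index(mean, pulls, t, precision=1e-6):
    """Upper confidence bound: the largest q with pulls * kl(mean, q) <= log t,
    located by bisection on [mean, 1]."""
    level = math.log(max(t, 1)) / pulls
    lo, hi = mean, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

# An arm with 7 successes in 20 pulls at round 100:
print(klucb_index(7 / 20, 20, 100))   # ~0.68, versus the empirical mean 0.35
```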
Navigating the landscape of multiplayer games
2020
Multiplayer games have long been used as testbeds in artificial intelligence research, aptly referred to as the Drosophila of artificial intelligence. Traditionally, researchers have focused on using well-known games to build strong agents. This progress, however, can be better informed by characterizing games and their topological landscape. Tackling this latter question can facilitate understanding of agents and help determine what game an agent should target next as part of its training. Here, we show how network measures applied to response graphs of large-scale games enable the creation of a landscape of games, quantifying relationships between games of varying sizes and characteristics. We illustrate our findings in domains ranging from canonical games to complex empirical games capturing the performance of trained agents pitted against one another. Our results culminate in a demonstration leveraging this information to generate new and interesting games, including mixtures of empirical games synthesized from real-world games.
Multiplayer games can be used as testbeds for the development of learning algorithms for artificial intelligence. Omidshafiei et al. show how to characterize and compare such games using a graph-based approach, generating new games that could potentially be interesting for training in a curriculum.
Journal Article
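The pipeline described above starts by turning an empirical payoff table into a response graph and then characterizing that graph with network measures. The sketch below is a simplified, hedged version for a symmetric two-player game: a better-response graph plus a crude 3-cycle count as one possible intransitivity measure (the paper itself builds on α-Rank response graphs and a richer set of measures).

```python
import numpy as np

def response_graph(payoffs):
    """Better-response graph of a symmetric two-player game: edge s -> m
    means the mutant strategy m earns more against s than s earns against m."""
    n = payoffs.shape[0]
    adj = np.zeros((n, n), dtype=bool)
    for s in range(n):
        for m in range(n):
            if m != s and payoffs[m, s] > payoffs[s, m]:
                adj[s, m] = True
    return adj

def cycle_fraction(adj):
    """Share of ordered strategy triples forming a 3-cycle: a crude measure
    of how intransitive (rock-paper-scissors-like) the game is."""
    n = adj.shape[0]
    triples = cycles = 0
    for a in range(n):
        for b in range(n):
            for c in range(n):
                if len({a, b, c}) < 3:
                    continue
                triples += 1
                if adj[a, b] and adj[b, c] and adj[c, a]:
                    cycles += 1
    return cycles / max(triples, 1)

rps = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
print(cycle_fraction(response_graph(rps)))   # 0.5, the maximum for this measure
```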
Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling
2021
We propose a stochastic approximation (SA) based method with randomization of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our proposed scheme is equivalent to running regular temporal difference learning with linear function approximation, albeit with samples picked uniformly from a given dataset. Our method results in an O(d) improvement in complexity in comparison to LSTD, where d is the dimension of the data. We provide non-asymptotic bounds for our proposed method, both in high probability and in expectation, under the assumption that the matrix underlying the LSTD solution is positive definite. The latter assumption can be easily satisfied for the pathwise LSTD variant proposed by Lazaric (J Mach Learn Res 13:3041–3074, 2012). Moreover, we also establish that using our method in place of LSTD does not impact the rate of convergence of the approximate value function to the true value function. These rate results, coupled with the low computational complexity of our method, make it attractive for implementation in big data settings, where d is large. A similar well-known low-complexity alternative for least squares regression is the stochastic gradient descent (SGD) algorithm, for which we also provide finite-time bounds. We demonstrate the practicality of our method as an efficient alternative to pathwise LSTD empirically by combining it with the least squares policy iteration algorithm in a traffic signal control application. We also conduct another set of experiments that combines the SA-based low-complexity variant for least squares regression with the LinUCB algorithm for contextual bandits, using the large-scale news recommendation dataset from Yahoo.
Journal Article
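The scheme the abstract describes, regular TD(0) with linear function approximation but with samples drawn uniformly from a fixed batch, is straightforward to sketch. The toy below is a hedged illustration: the constant step size and the two-state chain are assumptions made for the example, not the paper's step-size schedule or experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def td0_uniform_batch(batch, dim, gamma=0.9, steps=20000, step_size=0.1):
    """TD(0) with linear function approximation, each update using one
    transition drawn uniformly from a fixed batch. Cost per step is O(d),
    versus O(d^2) for LSTD."""
    theta = np.zeros(dim)
    for _ in range(steps):
        phi, r, phi_next = batch[rng.integers(len(batch))]  # uniform sample
        td_error = r + gamma * theta @ phi_next - theta @ phi
        theta += step_size * td_error * phi                 # O(d) update
    return theta

# Tiny 2-state chain: state 0 steps to state 1 (reward 0); state 1 is
# absorbing (reward 1). True values for gamma = 0.9: V(1) = 10, V(0) = 9.
phi0, phi1 = np.eye(2)
batch = [(phi0, 0.0, phi1), (phi1, 1.0, phi1)]
print(td0_uniform_batch(batch, dim=2))   # close to [9, 10]
```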
Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model
by Kappen, Hilbert J.; Gheshlaghi Azar, Mohammad; Munos, Rémi
in Applied sciences, Artificial Intelligence, Complexity
2013
We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result indicates that for an MDP with $N$ state-action pairs and discount factor $\gamma \in [0,1)$, only $O(N \log(N/\delta) / ((1-\gamma)^3 \varepsilon^2))$ state-transition samples are required to find an $\varepsilon$-optimal estimation of the action-value function with probability (w.p.) $1-\delta$. Further, we prove that, for small values of $\varepsilon$, an order of $O(N \log(N/\delta) / ((1-\gamma)^3 \varepsilon^2))$ samples is required to find an $\varepsilon$-optimal policy w.p. $1-\delta$. We also prove a matching lower bound of $\Theta(N \log(N/\delta) / ((1-\gamma)^3 \varepsilon^2))$ on the sample complexity of estimating the optimal action-value function with $\varepsilon$ accuracy. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bounds match the lower bound in terms of $N$, $\varepsilon$, $\delta$ and $1/(1-\gamma)$ up to a constant factor. Also, both our lower bound and upper bound improve on the state of the art in terms of their dependence on $1/(1-\gamma)$.
Journal Article
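To make the bound concrete, the snippet below evaluates the order term $N \log(N/\delta) / ((1-\gamma)^3 \varepsilon^2)$ from the abstract for illustrative values (the constant factor is omitted, and the chosen numbers are arbitrary).

```python
import math

def sample_bound(n_sa, gamma, eps, delta):
    """Order of the minimax sample complexity N log(N/delta) /
    ((1-gamma)^3 eps^2) from the abstract, constant factor omitted."""
    return n_sa * math.log(n_sa / delta) / ((1 - gamma) ** 3 * eps ** 2)

# The 1/(1-gamma)^3 horizon dependence dominates: moving gamma from
# 0.9 to 0.99 multiplies the bound by 1000.
for gamma in (0.9, 0.99):
    print(gamma, f"{sample_bound(1000, gamma, 0.1, 0.05):.3e}")
```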
Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path
by Antos, András; Munos, Rémi; Szepesvári, Csaba
in Applied sciences, Artificial Intelligence, Computer Science
2008
In this paper we consider the problem of finding a near-optimal policy in a continuous-space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as the input. We study a policy-iteration algorithm where the iterates are obtained via empirical risk minimization with a risk function that penalizes high magnitudes of the Bellman residual. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC-crossing dimension), the approximation power of the function set, and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge this is the first theoretical result for off-policy control learning over continuous state spaces using a single trajectory.
Journal Article
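A hedged sketch of the evaluation step inside such a policy-iteration loop: fit a linear action-value function by gradient descent on the empirical squared Bellman residual along a single sample path. Note the caveat baked into the paper: with stochastic transitions this plain objective is biased, and the authors minimize a modified risk instead; the toy MDP, features, and step sizes here are invented for illustration.

```python
import numpy as np

def bellman_residual_fit(traj, feats, gamma=0.5, iters=500, lr=0.5):
    """Fit linear Q(s, a) = w . phi(s, a) by gradient descent on the mean
    squared empirical Bellman residual along one trajectory. Caution: with
    stochastic transitions this naive objective is biased (the paper's
    modified risk corrects for that)."""
    w = np.zeros(feats(0, 0).shape[0])
    for _ in range(iters):
        grad = np.zeros_like(w)
        for s, a, r, s2, a2 in traj:   # (state, action, reward, next state, next action)
            phi, phi2 = feats(s, a), feats(s2, a2)
            resid = w @ phi - (r + gamma * (w @ phi2))
            grad += resid * (phi - gamma * phi2)   # gradient of squared residual
        w -= lr * grad / len(traj)
    return w

# One-hot features over (state, action) pairs of a toy 2-state, 2-action MDP,
# and a short looping sample path from a fixed policy.
feats = lambda s, a: np.eye(4)[2 * s + a]
traj = [(0, 1, 1.0, 1, 0), (1, 0, 0.0, 0, 1)] * 50
print(bellman_residual_fit(traj, feats))   # roughly [0, 1.33, 0.67, 0]
```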
Performance Bounds in $L_p$-norm for Approximate Value Iteration
2007
Approximate value iteration (AVI) is a method for solving large Markov decision problems by approximating the optimal value function with a sequence of value function representations $V_n$ processed according to the iterations $V_{n+1} = \mathcal{A}\mathcal{T} V_n$, where $\mathcal{T}$ is the so-called Bellman operator and $\mathcal{A}$ an approximation operator, which may be implemented by a supervised learning (SL) algorithm. Usual bounds on the asymptotic performance of AVI are established in terms of the $L_\infty$-norm approximation errors induced by the SL algorithm. However, most widely used SL algorithms (such as least squares regression) return a function (the best fit) that minimizes an empirical approximation error in $L_p$-norm ($p \geq 1$). In this paper, we extend the performance bounds of AVI to weighted $L_p$-norms, which enables us to directly relate the performance of AVI to the approximation power of the SL algorithm, hence assuring the tightness and practical relevance of these bounds. The main result is a performance bound for the resulting policies expressed in terms of the $L_p$-norm errors introduced by the successive approximations. The new bound takes into account a concentration coefficient that estimates how much the discounted future-state distributions starting from a probability measure used to assess the performance of AVI can possibly differ from the distribution used in the regression operation. We illustrate the tightness of the bounds on an optimal replacement problem.
Journal Article
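The iteration $V_{n+1} = \mathcal{A}\mathcal{T} V_n$ is easy to sketch on a finite MDP when $\mathcal{A}$ is a least-squares projection onto a feature span, standing in for the supervised-learning step. The toy MDP and features below are assumptions made for the example; the paper's contribution is the bounds, not a particular implementation.

```python
import numpy as np

def avi(P, R, Phi, gamma=0.9, iters=100):
    """Approximate value iteration V_{n+1} = A T V_n on a finite MDP:
    T is the Bellman optimality backup; the approximation operator A is
    least-squares projection onto the span of the feature matrix Phi."""
    n_states = P.shape[1]
    proj = Phi @ np.linalg.pinv(Phi)   # orthogonal projection onto span(Phi)
    V = np.zeros(n_states)
    for _ in range(iters):
        TV = np.max(R + gamma * P @ V, axis=0)   # Bellman backup over actions
        V = proj @ TV                            # approximation step
    return V

# Two-action, three-state toy MDP with features that can only represent
# value functions constant across states 1 and 2.
P = np.array([[[1, 0, 0], [1, 0, 0], [0, 1, 0]],
              [[0, 1, 0], [0, 0, 1], [0, 0, 1]]], float)
R = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
print(avi(P, R, Phi))
```

Because the features cannot distinguish states 1 and 2, the projection step introduces exactly the kind of per-iteration $L_p$ approximation error that the paper's bounds account for.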
Variable Resolution Discretization in Optimal Control
2002
The problem of state abstraction is of central importance in optimal control, reinforcement learning, and Markov decision processes. This paper studies the case of variable resolution state abstraction for continuous-time and continuous-space, deterministic dynamic control problems in which near-optimal policies are required. We begin by defining a class of variable resolution policy and value function representations based on Kuhn triangulations embedded in a kd-trie. We then consider top-down approaches to choosing which cells to split in order to generate improved policies. The core of this paper is the introduction and evaluation of a wide variety of possible splitting criteria. We begin with local approaches based on value function and policy properties that use only features of individual cells in making split choices. Later, by introducing two new non-local measures, influence and variance, we derive splitting criteria that allow one cell to efficiently take into account its impact on other cells when deciding whether to split. Influence is an efficiently calculable measure of the extent to which changes in some state affect the value function of other states. Variance is an efficiently calculable measure of how risky some state in a Markov chain is: a low-variance state is one in which we would be very surprised if, during any one execution, the long-term reward attained from that state differed substantially from its expected value, given by the value function. The paper proceeds by graphically demonstrating the various approaches to splitting on the familiar, non-linear, non-minimum-phase, two-dimensional problem of the "Car on the hill". It then evaluates the performance of a variety of splitting criteria on many benchmark problems, paying careful attention to their number-of-cells versus closeness-to-optimality tradeoff curves.
Journal Article
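As a flavour of the splitting loop the abstract describes, the sketch below repeatedly splits the 1-D cell whose value estimate varies most across its endpoints. It is a deliberately crude, hedged stand-in: the paper works with Kuhn triangulations embedded in a kd-trie and proposes far more careful criteria (including the non-local influence and variance measures), none of which are reproduced here.

```python
import numpy as np

def refine(value_fn, cells, n_splits=10):
    """Variable-resolution refinement sketch: repeatedly split the cell where
    the value function varies most across its endpoints, a crude local
    splitting criterion in the spirit of the paper's value-based approaches."""
    cells = list(cells)                       # list of (lo, hi) intervals
    for _ in range(n_splits):
        spreads = [abs(value_fn(hi) - value_fn(lo)) for lo, hi in cells]
        i = int(np.argmax(spreads))           # most non-uniform cell
        lo, hi = cells.pop(i)
        mid = (lo + hi) / 2
        cells += [(lo, mid), (mid, hi)]
    return sorted(cells)

# Refinement concentrates where the value function actually varies.
cells = refine(lambda x: max(0.0, x - 0.5), [(0.0, 1.0)], n_splits=6)
print([f"[{lo:.3f}, {hi:.3f}]" for lo, hi in cells])
```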