Catalogue Search | MBRL
Explore the vast range of titles available.
850 result(s) for "631/378/1788"
Mastering the game of Go with deep neural networks and tree search
2016
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
A computer Go program based on deep neural networks defeats a human professional player to achieve one of the grand challenges of artificial intelligence.
AlphaGo computer beats Go champion
The victory in 1997 of the chess-playing computer Deep Blue in a six-game series against the then world champion Garry Kasparov was seen as a significant milestone in the development of artificial intelligence. An even greater challenge remained: the ancient game of Go. Despite decades of refinement, until recently the strongest computers were still playing Go at the level of human amateurs. Enter AlphaGo. Developed by Google DeepMind, this program uses deep neural networks to mimic expert players, and further improves its performance by learning from games played against itself. AlphaGo has achieved a 99% win rate against the strongest other Go programs, and defeated the reigning European champion Fan Hui 5–0 in a tournament match. This is the first time that a computer program has defeated a human professional player in even games, with no handicap, on a full 19 × 19 board.
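The abstract describes a search that combines Monte Carlo simulation with policy and value networks. Below is a minimal sketch of the kind of selection rule such a search can use, in the usual PUCT formulation; the `Node` class, `c_puct` constant, and toy priors are illustrative assumptions, not AlphaGo's published implementation:

```python
import math

class Node:
    """Toy search-tree node holding per-move statistics."""
    def __init__(self, priors):
        self.priors = priors                 # policy-network move probabilities
        self.visits = [0] * len(priors)      # simulation counts per move
        self.values = [0.0] * len(priors)    # mean value-network evaluation per move

    def select(self, c_puct=1.0):
        """Pick the move balancing learned value (exploitation) against
        high-prior, rarely explored moves (exploration)."""
        total = sum(self.visits)
        def score(i):
            u = c_puct * self.priors[i] * math.sqrt(total + 1) / (1 + self.visits[i])
            return self.values[i] + u
        return max(range(len(self.priors)), key=score)

node = Node([0.5, 0.3, 0.2])
print(node.select())  # with no visits yet, the highest-prior move wins
```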
Journal Article
Mastering the game of Go without human knowledge
2017
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
Starting from zero knowledge and without human data, AlphaGo Zero was able to teach itself to play Go and to develop novel strategies that provide new insights into the oldest of games.
AlphaGo Zero goes solo
To beat world champions at the game of Go, the computer program AlphaGo has relied largely on supervised learning from millions of human expert moves. David Silver and colleagues have now produced a system called AlphaGo Zero, which is based purely on reinforcement learning and learns solely from self-play. Starting from random moves, it can reach superhuman level in just a couple of days of training and five million games of self-play, and can now beat all previous versions of AlphaGo. Because the machine independently discovers the same fundamental principles of the game that took humans millennia to conceptualize, the work suggests that such principles have some universal character, beyond human bias.
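Both summaries turn on the same loop: search produces training targets, and the network trained on them strengthens the next round of search. A minimal sketch of how one finished self-play game becomes training examples; the trajectory format and function name are assumptions for illustration, not DeepMind's code:

```python
def self_play_targets(trajectory, winner):
    """Convert one finished self-play game into (state, policy, value) examples.

    As described above, the network learns to predict both the search's move
    distribution and the eventual winner from every position it visited.
    `trajectory` is a list of (state, search_policy, player) with player in
    {+1, -1}; `winner` is +1, -1, or 0 for a draw.
    """
    examples = []
    for state, search_policy, player in trajectory:
        z = winner * player  # game outcome from the mover's perspective
        examples.append((state, search_policy, z))
    return examples
```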
Journal Article
What does dopamine mean?
2018
Dopamine is a critical modulator of both learning and motivation. This presents a problem: how can target cells know whether increased dopamine is a signal to learn or to move? It is often presumed that motivation involves slow (‘tonic’) dopamine changes, while fast (‘phasic’) dopamine fluctuations convey reward prediction errors for learning. Yet recent studies have shown that dopamine conveys motivational value and promotes movement even on subsecond timescales. Here I describe an alternative account of how dopamine regulates ongoing behavior. Dopamine release related to motivation is rapidly and locally sculpted by receptors on dopamine terminals, independently from dopamine cell firing. Target neurons abruptly switch between learning and performance modes, with striatal cholinergic interneurons providing one candidate switch mechanism. The behavioral impact of dopamine varies by subregion, but in each case dopamine provides a dynamic estimate of whether it is worth expending a limited internal resource, such as energy, attention, or time.
Journal Article
Sucrose preference test for measurement of stress-induced anhedonia in mice
by Liu, Meng-Ying; Zhou, Qi-Gang; Luo, Chun-Xia
in Dependence, Hedonic response, Mental depression
2018
Anhedonia is the inability to experience pleasure from rewarding or enjoyable activities and is a core symptom of depression in humans. Here, we describe a protocol for the measurement of anhedonia in mice, in which anhedonia is measured by a sucrose preference test (SPT) based on a two-bottle choice paradigm. A reduction in the sucrose preference ratio in experimental relative to control mice is indicative of anhedonia. To date, inconsistent and variable results have been reported following the use of the SPT by different groups, probably due to the use of different protocols and equipment. In this protocol, we describe how to set up a clearly defined apparatus for the SPT and provide a detailed protocol to ensure greater consistency when carrying out the SPT. This optimized protocol is highly sensitive, reliable, and adaptable for evaluation of chronic stress-related anhedonia, as well as morphine-induced dependence. The whole SPT, including adaptation, baseline measurement, and testing, takes 8 days.
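The test's readout reduces to one ratio. A minimal sketch of the computation; the example intakes and the rough control-versus-anhedonic ranges in the comments are assumptions, not values from this protocol:

```python
def sucrose_preference(sucrose_intake_g, water_intake_g):
    """Sucrose preference (%) in a two-bottle choice test:
    sucrose consumed as a fraction of total fluid consumed."""
    total = sucrose_intake_g + water_intake_g
    if total == 0:
        raise ValueError("no fluid consumed; the test is uninterpretable")
    return 100.0 * sucrose_intake_g / total

print(sucrose_preference(4.5, 0.6))  # ~88%: strong preference, typical of controls
print(sucrose_preference(2.1, 2.0))  # ~51%: near-chance drinking, consistent with anhedonia
```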
Journal Article
The hippocampus as a predictive map
by Stachenfeld, Kimberly L; Botvinick, Matthew M; Gershman, Samuel J
in 631/378/1595, 631/378/1595/1554, 631/378/1788
2017
The authors show how predictive representations are useful for maximizing future reward, particularly in spatial domains. They develop a predictive-map model of hippocampal place cells and entorhinal grid cells that captures a wide variety of effects from human and rodent literature.
A cognitive map has long been the dominant metaphor for hippocampal function, embracing the idea that place cells encode a geometric representation of space. However, evidence for predictive coding, reward sensitivity and policy dependence in place cells suggests that the representation is not purely spatial. We approach this puzzle from a reinforcement learning perspective: what kind of spatial representation is most useful for maximizing future reward? We show that the answer takes the form of a predictive representation. This representation captures many aspects of place cell responses that fall outside the traditional view of a cognitive map. Furthermore, we argue that entorhinal grid cells encode a low-dimensionality basis set for the predictive representation, useful for suppressing noise in predictions and extracting multiscale structure for hierarchical planning.
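The predictive representation in question is the successor representation, which has a closed form given a state-transition matrix. A minimal sketch follows; the ring-shaped toy environment and discount factor are assumptions for illustration, and the eigenvectors at the end stand in for the low-dimensional basis the abstract attributes to grid cells:

```python
import numpy as np

def successor_representation(T, gamma=0.95):
    """M = sum_t gamma^t T^t = (I - gamma T)^{-1};
    M[s, s'] is the expected discounted future occupancy of s' starting from s."""
    return np.linalg.inv(np.eye(T.shape[0]) - gamma * T)

# Toy environment: an unbiased random walk on a ring of 10 states.
n = 10
T = np.zeros((n, n))
for s in range(n):
    T[s, (s - 1) % n] = 0.5
    T[s, (s + 1) % n] = 0.5

M = successor_representation(T)
# A low-dimensional basis for M: its leading eigenvectors form the smooth,
# multiscale functions that play the grid-cell role in the model.
eigvals, eigvecs = np.linalg.eig(M)
```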
Journal Article
Spontaneous behaviour is structured by reinforcement without explicit reward
2023
Spontaneous animal behaviour is built from action modules that are concatenated by the brain into sequences [1, 2]. However, the neural mechanisms that guide the composition of naturalistic, self-motivated behaviour remain unknown. Here we show that dopamine systematically fluctuates in the dorsolateral striatum (DLS) as mice spontaneously express sub-second behavioural modules, despite the absence of task structure, sensory cues or exogenous reward. Photometric recordings and calibrated closed-loop optogenetic manipulations during open field behaviour demonstrate that DLS dopamine fluctuations increase sequence variation over seconds, reinforce the use of associated behavioural modules over minutes, and modulate the vigour with which modules are expressed, without directly influencing movement initiation or moment-to-moment kinematics. Although the reinforcing effects of optogenetic DLS dopamine manipulations vary across behavioural modules and individual mice, these differences are well predicted by observed variation in the relationships between endogenous dopamine and module use. Consistent with the possibility that DLS dopamine fluctuations act as a teaching signal, mice build sequences during exploration as if to maximize dopamine. Together, these findings suggest a model in which the same circuits and computations that govern action choices in structured tasks have a key role in sculpting the content of unconstrained, high-dimensional, spontaneous behaviour.
Photometric recordings and optogenetic manipulation show that dopamine fluctuations in the dorsolateral striatum in mice modulate the use, sequencing and vigour of behavioural modules during spontaneous behaviour.
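A minimal sketch of the teaching-signal interpretation in the summary: dopamine measured at module onset nudges up the propensity to reuse that module. The softmax choice rule, learning rate, and random stand-in for the dopamine trace are all assumptions, not the study's analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
n_modules = 8
value = np.zeros(n_modules)   # learned propensity to emit each behavioural module
alpha = 0.1                   # learning rate (assumed)

def choose_module(value, beta=2.0):
    """Softmax choice over modules (assumed selection rule)."""
    p = np.exp(beta * (value - value.max()))
    p /= p.sum()
    return rng.choice(len(value), p=p)

for step in range(1000):
    m = choose_module(value)
    dopamine = rng.normal()        # stand-in for the measured DLS fluctuation
    value[m] += alpha * dopamine   # fluctuation reinforces the module just used
```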
Journal Article
Dopamine reward prediction-error signalling: a two-component response
2016
Phasic signalling by midbrain dopamine neurons is thought to contribute to reward processing by encoding a reward prediction error. Schultz describes recent work suggesting that there are two distinct components of the phasic dopamine response and considers the probable functional role of each response component.
Environmental stimuli and objects, including rewards, are often processed sequentially in the brain. Recent work suggests that the phasic dopamine reward prediction-error response follows a similar sequential pattern. An initial brief, unselective and highly sensitive increase in activity unspecifically detects a wide range of environmental stimuli, then quickly evolves into the main response component, which reflects subjective reward value and utility. This temporal evolution allows the dopamine reward prediction-error signal to optimally combine speed and accuracy.
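The prediction error referred to here is the temporal-difference error of reinforcement learning. A minimal tabular sketch; the two-state task and learning rate are illustrative assumptions:

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference step. `delta` is the reward prediction error
    that phasic dopamine is proposed to encode: positive when outcomes beat
    expectations, negative when they fall short."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

V = {"cue": 0.0, "reward": 0.0}
# With repeated cue -> reward pairings, the error migrates from the reward
# to the predictive cue, as in classic dopamine recordings.
for _ in range(100):
    td_update(V, "cue", 0.0, "reward")
    td_update(V, "reward", 1.0, "cue", gamma=0.0)  # terminal step: no bootstrap
```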
Journal Article
Circuits and functions of the lateral habenula in health and in disease
2020
The past decade has witnessed exponentially growing interest in the lateral habenula (LHb) owing to new discoveries relating to its critical role in regulating negatively motivated behaviour and its implication in major depression. The LHb, sometimes referred to as the brain’s ‘antireward centre’, receives inputs from diverse limbic forebrain and basal ganglia structures, and targets essentially all midbrain neuromodulatory systems, including the noradrenergic, serotonergic and dopaminergic systems. Its unique anatomical position enables the LHb to act as a hub that integrates value-based, sensory and experience-dependent information to regulate various motivational, cognitive and motor processes. Dysfunction of the LHb may contribute to the pathophysiology of several psychiatric disorders, especially major depression. Recently, exciting progress has been made in identifying the molecular and cellular mechanisms in the LHb that underlie negative emotional state in animal models of drug withdrawal and major depression. A future challenge is to translate these advances into effective clinical treatments.

The lateral habenula (LHb) has received increasing attention in part because dysfunction of this region may play a part in several psychiatric disorders, notably depression. In this Review, Hu et al. examine the neural circuits, physiological functions and potential pathophysiological roles of the LHb.
Journal Article
Dorsal anterior cingulate cortex and the value of control
by Shenhav, Amitai; Botvinick, Matthew M; Cohen, Jonathan D
in 631/378/1788, 631/378/2649/1409, 631/378/2649/2150
2016
The authors propose that dorsal anterior cingulate cortex (dACC) performs a cost/benefit analysis to specify how best to allocate cognitive control. They describe why this theory accounts well for dACC’s role in decision-making, motivation and cognitive control, including its observed role in foraging choice settings.
Debates over the function(s) of dorsal anterior cingulate cortex (dACC) have persisted for decades. So too have demonstrations of the region's association with cognitive control. Researchers have struggled to account for this association and, simultaneously, dACC's involvement in phenomena related to evaluation and motivation. We describe a recent integrative theory that achieves this goal. It proposes that dACC serves to specify the currently optimal allocation of control by determining the overall expected value of control (EVC), thereby licensing the associated cognitive effort. The EVC theory accounts for dACC's sensitivity to a wide array of experimental variables, and their relationship to subsequent control adjustments. Finally, we contrast our theory with a recent theory proposing a primary role for dACC in foraging-like decisions. We describe why the EVC theory offers a more comprehensive and coherent account of dACC function, including dACC's particular involvement in decisions regarding foraging or otherwise altering one's behavior.
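The EVC computation the abstract summarizes is a cost/benefit maximization over candidate control intensities. A minimal sketch; the payoff and cost curves are illustrative assumptions, not the theory's fitted forms:

```python
import numpy as np

def best_control(intensities, payoff, cost):
    """EVC(i) = expected payoff of allocating control intensity i, minus its
    effort cost; the theory holds that dACC selects the maximizing intensity."""
    evc = payoff(intensities) - cost(intensities)
    return intensities[np.argmax(evc)], evc

intensities = np.linspace(0.0, 1.0, 101)
payoff = lambda i: 10.0 * (1.0 - np.exp(-3.0 * i))  # diminishing returns (assumed)
cost = lambda i: 4.0 * i ** 2                       # accelerating effort cost (assumed)
i_star, evc = best_control(intensities, payoff, cost)
```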
Journal Article
A distributional code for value in dopamine-based reinforcement learning
by Munos, Rémi; Starkweather, Clara Kwon; Dabney, Will
in 631/378/116/2396, 631/378/1595, 631/378/1788
2020
Since its introduction, the reward prediction error theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain [1–3]. According to the now canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. Here we propose an account of dopamine-based reinforcement learning inspired by recent artificial intelligence research on distributional reinforcement learning [4–6]. We hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea implies a set of empirical predictions, which we tested using single-unit recordings from mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning.
Analyses of single-cell recordings from mouse ventral tegmental area are consistent with a model of reinforcement learning in which the brain represents possible future rewards not as a single mean of stochastic outcomes, as in the canonical model, but instead as a probability distribution.
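A minimal sketch of the distributional mechanism the recordings support: value predictors that weight positive and negative errors asymmetrically converge to different expectiles of the reward distribution, so the population spans the distribution rather than collapsing to its mean. The reward distribution, learning rate, and range of asymmetries are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
taus = np.linspace(0.1, 0.9, 9)   # per-"neuron" optimism: weight on positive errors
V = np.zeros_like(taus)           # one value predictor per tau
alpha = 0.02

for _ in range(20000):
    r = rng.choice([0.0, 1.0, 5.0])   # stochastic reward (assumed distribution)
    delta = r - V
    # Asymmetric update: positive errors scaled by tau, negative by (1 - tau),
    # so each predictor settles at a different expectile of the distribution.
    V += alpha * np.where(delta > 0, taus, 1.0 - taus) * delta

print(V)  # spans low-to-high outcomes instead of the single mean (~2.0)
```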
Journal Article