Catalogue Search | MBRL
Explore the vast range of titles available.
850 result(s) for "631/378/1788"
Mastering the game of Go with deep neural networks and tree search
2016
The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. Here we introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.
A computer Go program based on deep neural networks defeats a human professional player to achieve one of the grand challenges of artificial intelligence.
AlphaGo computer beats Go champion
The victory in 1997 of the chess-playing computer Deep Blue in a six-game series against the then world champion Garry Kasparov was seen as a significant milestone in the development of artificial intelligence. An even greater challenge remained: the ancient game of Go. Despite decades of refinement, until recently the strongest computers were still playing Go at the level of human amateurs. Enter AlphaGo. Developed by Google DeepMind, this program uses deep neural networks to mimic expert players, and further improves its performance by learning from games played against itself. AlphaGo has achieved a 99% win rate against the strongest other Go programs, and defeated the reigning European champion Fan Hui 5–0 in a tournament match. This is the first time that a computer program has defeated a human professional player in even games, with no handicap, on a full 19 × 19 board.
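The abstract describes a search that combines Monte Carlo simulation with policy and value networks. Below is a minimal sketch of the kind of selection rule such a search can use, in the usual PUCT formulation; the `Node` class, `c_puct` constant, and toy priors are illustrative assumptions, not AlphaGo's published implementation:

```python
import math

class Node:
    """Toy search-tree node holding per-move statistics."""
    def __init__(self, priors):
        self.priors = priors                 # policy-network move probabilities
        self.visits = [0] * len(priors)      # simulation counts per move
        self.values = [0.0] * len(priors)    # mean value-network evaluation per move

    def select(self, c_puct=1.0):
        """Pick the move balancing learned value (exploitation) against
        high-prior, rarely explored moves (exploration)."""
        total = sum(self.visits)
        def score(i):
            u = c_puct * self.priors[i] * math.sqrt(total + 1) / (1 + self.visits[i])
            return self.values[i] + u
        return max(range(len(self.priors)), key=score)

node = Node([0.5, 0.3, 0.2])
print(node.select())  # with no visits yet, the highest-prior move wins
```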
Journal Article
Mastering the game of Go without human knowledge
2017
A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
Starting from zero knowledge and without human data, AlphaGo Zero was able to teach itself to play Go and to develop novel strategies that provide new insights into the oldest of games.
AlphaGo Zero goes solo
To beat world champions at the game of Go, the computer program AlphaGo has relied largely on supervised learning from millions of human expert moves. David Silver and colleagues have now produced a system called AlphaGo Zero, which is based purely on reinforcement learning and learns solely from self-play. Starting from random moves, it can reach superhuman level in just a couple of days of training and five million games of self-play, and can now beat all previous versions of AlphaGo. Because the machine independently discovers the same fundamental principles of the game that took humans millennia to conceptualize, the work suggests that such principles have some universal character, beyond human bias.
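Both summaries turn on the same loop: search produces training targets, and the network trained on them strengthens the next round of search. A minimal sketch of how one finished self-play game becomes training examples; the trajectory format and function name are assumptions for illustration, not DeepMind's code:

```python
def self_play_targets(trajectory, winner):
    """Convert one finished self-play game into (state, policy, value) examples.

    As described above, the network learns to predict both the search's move
    distribution and the eventual winner from every position it visited.
    `trajectory` is a list of (state, search_policy, player) with player in
    {+1, -1}; `winner` is +1, -1, or 0 for a draw.
    """
    examples = []
    for state, search_policy, player in trajectory:
        z = winner * player  # game outcome from the mover's perspective
        examples.append((state, search_policy, z))
    return examples
```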
Journal Article
What does dopamine mean?
2018
Dopamine is a critical modulator of both learning and motivation. This presents a problem: how can target cells know whether increased dopamine is a signal to learn or to move? It is often presumed that motivation involves slow (‘tonic’) dopamine changes, while fast (‘phasic’) dopamine fluctuations convey reward prediction errors for learning. Yet recent studies have shown that dopamine conveys motivational value and promotes movement even on subsecond timescales. Here I describe an alternative account of how dopamine regulates ongoing behavior. Dopamine release related to motivation is rapidly and locally sculpted by receptors on dopamine terminals, independently from dopamine cell firing. Target neurons abruptly switch between learning and performance modes, with striatal cholinergic interneurons providing one candidate switch mechanism. The behavioral impact of dopamine varies by subregion, but in each case dopamine provides a dynamic estimate of whether it is worth expending a limited internal resource, such as energy, attention, or time.
Journal Article
Sucrose preference test for measurement of stress-induced anhedonia in mice
by Liu, Meng-Ying; Zhou, Qi-Gang; Luo, Chun-Xia
in Dependence, Hedonic response, Mental depression
2018
Anhedonia is the inability to experience pleasure from rewarding or enjoyable activities and is a core symptom of depression in humans. Here, we describe a protocol for the measurement of anhedonia in mice, in which anhedonia is measured by a sucrose preference test (SPT) based on a two-bottle choice paradigm. A reduction in the sucrose preference ratio in experimental relative to control mice is indicative of anhedonia. To date, inconsistent and variable results have been reported following the use of the SPT by different groups, probably due to the use of different protocols and equipment. In this protocol, we describe how to set up a clearly defined apparatus for the SPT and provide a detailed protocol to ensure greater consistency when carrying out the SPT. This optimized protocol is highly sensitive, reliable, and adaptable for evaluation of chronic stress-related anhedonia, as well as morphine-induced dependence. The whole SPT, including adaptation, baseline measurement, and testing, takes 8 days.
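The test's readout reduces to one ratio. A minimal sketch of the computation; the example intakes and the rough control-versus-anhedonic ranges in the comments are assumptions, not values from this protocol:

```python
def sucrose_preference(sucrose_intake_g, water_intake_g):
    """Sucrose preference (%) in a two-bottle choice test:
    sucrose consumed as a fraction of total fluid consumed."""
    total = sucrose_intake_g + water_intake_g
    if total == 0:
        raise ValueError("no fluid consumed; the test is uninterpretable")
    return 100.0 * sucrose_intake_g / total

print(sucrose_preference(4.5, 0.6))  # ~88%: strong preference, typical of controls
print(sucrose_preference(2.1, 2.0))  # ~51%: near-chance drinking, consistent with anhedonia
```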
Journal Article
The hippocampus as a predictive map
by Stachenfeld, Kimberly L; Botvinick, Matthew M; Gershman, Samuel J
in 631/378/1595, 631/378/1595/1554, 631/378/1788
2017
The authors show how predictive representations are useful for maximizing future reward, particularly in spatial domains. They develop a predictive-map model of hippocampal place cells and entorhinal grid cells that captures a wide variety of effects from human and rodent literature.
A cognitive map has long been the dominant metaphor for hippocampal function, embracing the idea that place cells encode a geometric representation of space. However, evidence for predictive coding, reward sensitivity and policy dependence in place cells suggests that the representation is not purely spatial. We approach this puzzle from a reinforcement learning perspective: what kind of spatial representation is most useful for maximizing future reward? We show that the answer takes the form of a predictive representation. This representation captures many aspects of place cell responses that fall outside the traditional view of a cognitive map. Furthermore, we argue that entorhinal grid cells encode a low-dimensionality basis set for the predictive representation, useful for suppressing noise in predictions and extracting multiscale structure for hierarchical planning.
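The predictive representation in question is the successor representation, which has a closed form given a state-transition matrix. A minimal sketch follows; the ring-shaped toy environment and discount factor are assumptions for illustration, and the eigenvectors at the end stand in for the low-dimensional basis the abstract attributes to grid cells:

```python
import numpy as np

def successor_representation(T, gamma=0.95):
    """M = sum_t gamma^t T^t = (I - gamma T)^{-1};
    M[s, s'] is the expected discounted future occupancy of s' starting from s."""
    return np.linalg.inv(np.eye(T.shape[0]) - gamma * T)

# Toy environment: an unbiased random walk on a ring of 10 states.
n = 10
T = np.zeros((n, n))
for s in range(n):
    T[s, (s - 1) % n] = 0.5
    T[s, (s + 1) % n] = 0.5

M = successor_representation(T)
# A low-dimensional basis for M: its leading eigenvectors form the smooth,
# multiscale functions that play the grid-cell role in the model.
eigvals, eigvecs = np.linalg.eig(M)
```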
Journal Article
Spontaneous behaviour is structured by reinforcement without explicit reward
2023
Spontaneous animal behaviour is built from action modules that are concatenated by the brain into sequences [1, 2]. However, the neural mechanisms that guide the composition of naturalistic, self-motivated behaviour remain unknown. Here we show that dopamine systematically fluctuates in the dorsolateral striatum (DLS) as mice spontaneously express sub-second behavioural modules, despite the absence of task structure, sensory cues or exogenous reward. Photometric recordings and calibrated closed-loop optogenetic manipulations during open field behaviour demonstrate that DLS dopamine fluctuations increase sequence variation over seconds, reinforce the use of associated behavioural modules over minutes, and modulate the vigour with which modules are expressed, without directly influencing movement initiation or moment-to-moment kinematics. Although the reinforcing effects of optogenetic DLS dopamine manipulations vary across behavioural modules and individual mice, these differences are well predicted by observed variation in the relationships between endogenous dopamine and module use. Consistent with the possibility that DLS dopamine fluctuations act as a teaching signal, mice build sequences during exploration as if to maximize dopamine. Together, these findings suggest a model in which the same circuits and computations that govern action choices in structured tasks have a key role in sculpting the content of unconstrained, high-dimensional, spontaneous behaviour.
Photometric recordings and optogenetic manipulation show that dopamine fluctuations in the dorsolateral striatum in mice modulate the use, sequencing and vigour of behavioural modules during spontaneous behaviour.
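A minimal sketch of the teaching-signal interpretation in the summary: dopamine measured at module onset nudges up the propensity to reuse that module. The softmax choice rule, learning rate, and random stand-in for the dopamine trace are all assumptions, not the study's analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
n_modules = 8
value = np.zeros(n_modules)   # learned propensity to emit each behavioural module
alpha = 0.1                   # learning rate (assumed)

def choose_module(value, beta=2.0):
    """Softmax choice over modules (assumed selection rule)."""
    p = np.exp(beta * (value - value.max()))
    p /= p.sum()
    return rng.choice(len(value), p=p)

for step in range(1000):
    m = choose_module(value)
    dopamine = rng.normal()        # stand-in for the measured DLS fluctuation
    value[m] += alpha * dopamine   # fluctuation reinforces the module just used
```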
Journal Article
Dopamine reward prediction-error signalling: a two-component response
2016
Phasic signalling by midbrain dopamine neurons is thought to contribute to reward processing by encoding a reward prediction error. Schultz describes recent work suggesting that there are two distinct components of the phasic dopamine response and considers the probable functional role of each response component.
Environmental stimuli and objects, including rewards, are often processed sequentially in the brain. Recent work suggests that the phasic dopamine reward prediction-error response follows a similar sequential pattern. An initial brief, unselective and highly sensitive increase in activity unspecifically detects a wide range of environmental stimuli, then quickly evolves into the main response component, which reflects subjective reward value and utility. This temporal evolution allows the dopamine reward prediction-error signal to optimally combine speed and accuracy.
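The prediction error referred to here is the temporal-difference error of reinforcement learning. A minimal tabular sketch; the two-state task and learning rate are illustrative assumptions:

```python
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference step. `delta` is the reward prediction error
    that phasic dopamine is proposed to encode: positive when outcomes beat
    expectations, negative when they fall short."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha * delta
    return delta

V = {"cue": 0.0, "reward": 0.0}
# With repeated cue -> reward pairings, the error migrates from the reward
# to the predictive cue, as in classic dopamine recordings.
for _ in range(100):
    td_update(V, "cue", 0.0, "reward")
    td_update(V, "reward", 1.0, "cue", gamma=0.0)  # terminal step: no bootstrap
```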
Journal Article
Circuits and functions of the lateral habenula in health and in disease
2020
The past decade has witnessed exponentially growing interest in the lateral habenula (LHb) owing to new discoveries relating to its critical role in regulating negatively motivated behaviour and its implication in major depression. The LHb, sometimes referred to as the brain’s ‘antireward centre’, receives inputs from diverse limbic forebrain and basal ganglia structures, and targets essentially all midbrain neuromodulatory systems, including the noradrenergic, serotonergic and dopaminergic systems. Its unique anatomical position enables the LHb to act as a hub that integrates value-based, sensory and experience-dependent information to regulate various motivational, cognitive and motor processes. Dysfunction of the LHb may contribute to the pathophysiology of several psychiatric disorders, especially major depression. Recently, exciting progress has been made in identifying the molecular and cellular mechanisms in the LHb that underlie negative emotional state in animal models of drug withdrawal and major depression. A future challenge is to translate these advances into effective clinical treatments.

The lateral habenula (LHb) has received increasing attention in part because dysfunction of this region may play a part in several psychiatric disorders, notably depression. In this Review, Hu et al. examine the neural circuits, physiological functions and potential pathophysiological roles of the LHb.
Journal Article
Dorsal anterior cingulate cortex and the value of control
by Shenhav, Amitai; Botvinick, Matthew M; Cohen, Jonathan D
in 631/378/1788, 631/378/2649/1409, 631/378/2649/2150
2016
The authors propose that dorsal anterior cingulate cortex (dACC) performs a cost/benefit analysis to specify how best to allocate cognitive control. They describe why this theory accounts well for dACC’s role in decision-making, motivation and cognitive control, including its observed role in foraging choice settings.
Debates over the function(s) of dorsal anterior cingulate cortex (dACC) have persisted for decades. So too have demonstrations of the region's association with cognitive control. Researchers have struggled to account for this association and, simultaneously, dACC's involvement in phenomena related to evaluation and motivation. We describe a recent integrative theory that achieves this goal. It proposes that dACC serves to specify the currently optimal allocation of control by determining the overall expected value of control (EVC), thereby licensing the associated cognitive effort. The EVC theory accounts for dACC's sensitivity to a wide array of experimental variables, and their relationship to subsequent control adjustments. Finally, we contrast our theory with a recent theory proposing a primary role for dACC in foraging-like decisions. We describe why the EVC theory offers a more comprehensive and coherent account of dACC function, including dACC's particular involvement in decisions regarding foraging or otherwise altering one's behavior.
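The EVC computation the abstract summarizes is a cost/benefit maximization over candidate control intensities. A minimal sketch; the payoff and cost curves are illustrative assumptions, not the theory's fitted forms:

```python
import numpy as np

def best_control(intensities, payoff, cost):
    """EVC(i) = expected payoff of allocating control intensity i, minus its
    effort cost; the theory holds that dACC selects the maximizing intensity."""
    evc = payoff(intensities) - cost(intensities)
    return intensities[np.argmax(evc)], evc

intensities = np.linspace(0.0, 1.0, 101)
payoff = lambda i: 10.0 * (1.0 - np.exp(-3.0 * i))  # diminishing returns (assumed)
cost = lambda i: 4.0 * i ** 2                       # accelerating effort cost (assumed)
i_star, evc = best_control(intensities, payoff, cost)
```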
Journal Article
A distributional code for value in dopamine-based reinforcement learning
by Munos, Rémi; Starkweather, Clara Kwon; Dabney, Will
in 631/378/116/2396, 631/378/1595, 631/378/1788
2020
Since its introduction, the reward prediction error theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain [1–3]. According to the now canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. Here we propose an account of dopamine-based reinforcement learning inspired by recent artificial intelligence research on distributional reinforcement learning [4–6]. We hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea implies a set of empirical predictions, which we tested using single-unit recordings from mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning.
Analyses of single-cell recordings from mouse ventral tegmental area are consistent with a model of reinforcement learning in which the brain represents possible future rewards not as a single mean of stochastic outcomes, as in the canonical model, but instead as a probability distribution.
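A minimal sketch of the distributional mechanism the recordings support: value predictors that weight positive and negative errors asymmetrically converge to different expectiles of the reward distribution, so the population spans the distribution rather than collapsing to its mean. The reward distribution, learning rate, and range of asymmetries are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
taus = np.linspace(0.1, 0.9, 9)   # per-"neuron" optimism: weight on positive errors
V = np.zeros_like(taus)           # one value predictor per tau
alpha = 0.02

for _ in range(20000):
    r = rng.choice([0.0, 1.0, 5.0])   # stochastic reward (assumed distribution)
    delta = r - V
    # Asymmetric update: positive errors scaled by tau, negative by (1 - tau),
    # so each predictor settles at a different expectile of the distribution.
    V += alpha * np.where(delta > 0, taus, 1.0 - taus) * delta

print(V)  # spans low-to-high outcomes instead of the single mean (~2.0)
```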
Journal Article