A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
It is able to do this by using a novel form of reinforcement learning, in which AlphaGo Zero becomes its own teacher. The system starts off with a neural network that knows nothing about the game of Go. It then plays games against itself, by combining this neural network with a powerful search algorithm. As it plays, the neural network is tuned and updated to predict moves, as well as the eventual winner of the games.
This updated neural network is then recombined with the search algorithm to create a new, stronger version of AlphaGo Zero, and the process begins again. In each iteration, the performance of the system improves by a small amount, and the quality of the self-play games increases, leading to more and more accurate neural networks and ever stronger versions of AlphaGo Zero.
After just three days of self-play training, AlphaGo Zero emphatically defeated the previously published version of AlphaGo – which had itself defeated 18-time world champion Lee Sedol – by 100 games to 0. After 40 days of self training, AlphaGo Zero became even stronger, outperforming the version of AlphaGo known as “Master”, which has defeated the world’s best players and world number one Ke Jie.
The new AlphaGo Zero is not impressive just because it uses no data set to become the world leader at what it does. It’s impressive also because it achieves the goal at a pace no human will ever be able to match.
A cognitive map has long been the dominant metaphor for hippocampal function, embracing the idea that place cells encode a geometric representation of space. However, evidence for predictive coding, reward sensitivity and policy dependence in place cells suggests that the representation is not purely spatial. We approach this puzzle from a reinforcement learning perspective: what kind of spatial representation is most useful for maximizing future reward? We show that the answer takes the form of a predictive representation. This representation captures many aspects of place cell responses that fall outside the traditional view of a cognitive map. Furthermore, we argue that entorhinal grid cells encode a low-dimensionality basis set for the predictive representation, useful for suppressing noise in predictions and extracting multiscale structure for hierarchical planning.
Our insights were derived from reinforcement learning, the subdiscipline of AI research that focuses on systems that learn by trial and error. The key computational idea we drew on is that to estimate future reward, an agent must first estimate how much immediate reward it expects to receive in each state, and then weight this expected reward by how often it expects to visit that state in the future. By summing up this weighted reward across all possible states, the agent obtains an estimate of future reward.
Similarly, we argue that the hippocampus represents every situation – or state – in terms of the future states which it predicts. For example, if you are leaving work (your current state) your hippocampus might represent this by predicting that you will likely soon be on your commute, picking up your kids from school or, more distantly, at home. By representing each current state in terms of its anticipated successor states, the hippocampus conveys a compact summary of future events, known formally as the “successor representation”. We suggest that this specific form of predictive map allows the brain to adapt rapidly in environments with changing rewards, but without having to run expensive simulations of the future.
I wonder what Jeff Hawkins thinks about this new theory.
Google’s AI subsidiary DeepMind is getting serious about ethics. The UK-based company, which Google bought in 2014, today announced the formation of a new research group dedicated to the thorniest issues in artificial intelligence. These include the problems of managing AI bias; the coming economic impact of automation; and the need to ensure that any intelligent systems we develop share our ethical and moral values.
DeepMind Ethics & Society (or DMES, as the new team has been christened) will publish research on these topics and others starting early 2018. The group has eight full-time staffers at the moment, but DeepMind wants to grow this to around 25 in a year’s time. The team has six unpaid external “fellows” (including Oxford philosopher Nick Bostrom, who literally wrote the book on AI existential risk) and will partner with academic groups conducting similar research, including The AI Now Institute at NYU, and the Leverhulme Centre for the Future of Intelligence.
Great effort. I’d love to attend a conference arranged by groups like this one.