
DeepMind's Agent57 Outperforms Humans on All Atari 2600 Games



Researchers at Google's DeepMind have produced a reinforcement-learning (RL) system called Agent57 that has scored above the human benchmark on all 57 Atari 2600 games in the Arcade Learning Environment. Agent57 is the first system to outperform humans on even the hardest games in the suite.

The researchers described the system and a set of experiments in a paper published on arXiv. Agent57 builds on DeepMind's previous RL work on the Never Give Up (NGU) algorithm. The underlying architecture consists of a neural network that encodes a family of policies, ranging from purely exploratory to purely exploitative, along with an adaptive mechanism that prioritizes different policies throughout training. Additional improvements address the long-term credit assignment problem by increasing training stability. With these improvements, Agent57 achieved a higher median score than NGU across all games. In addition, Agent57 outperformed human scores on games that previous AI systems could not play at all.

Although much of DeepMind's research has focused on AI for playing games, including classic board games such as Go as well as video games, according to the team their goal is "to use games as a stepping stone for developing systems that learn to excel at a broad set of challenges." Researchers consider the suite of Atari 2600 games a good benchmark for RL performance because each game is interesting enough to represent a practical challenge, and the entire suite contains enough variety to present a general challenge. Despite years of research and several improvements to Deep Q-Networks, the first system to achieve human-level performance on several games, "all deep reinforcement learning agents have consistently failed to score in four games: Montezuma’s Revenge, Pitfall, Solaris and Skiing." Success at these games requires the system to solve two hard problems in RL: the exploration-exploitation problem and the long-term credit assignment problem.

The exploration-exploitation tradeoff is the balance an agent must strike between exploiting strategies it has already learned and exploring new ones. Games such as Pitfall and Montezuma's Revenge require agents to explore the game "world" extensively before earning any reward. Agent57's predecessor, NGU, generated an intrinsic reward by detecting novel game states, then learned a family of policies spanning exploration and exploitation. Agent57 improves on this by using a multi-armed bandit meta-controller that adjusts the exploration-exploitation trade-off during training.
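The meta-controller idea can be sketched with a simple sliding-window upper-confidence-bound (UCB) bandit: each "arm" stands in for one of the network's policies, and arms whose recent episodes returned higher scores are selected more often. This is an illustrative toy, not DeepMind's exact implementation; the class name, window size, and bonus weight here are assumptions.

```python
import math

class SlidingWindowUCB:
    """Toy bandit meta-controller: each arm is a policy with a different
    exploration/exploitation mix; arms with higher recent episode
    returns are chosen more often (illustrative sketch only)."""

    def __init__(self, num_arms, window=50, beta=1.0):
        self.num_arms = num_arms
        self.window = window    # only the most recent episodes count
        self.beta = beta        # weight of the exploration bonus
        self.history = []       # list of (arm, episode_return) pairs

    def select_arm(self):
        recent = self.history[-self.window:]
        counts = [0] * self.num_arms
        sums = [0.0] * self.num_arms
        for arm, ret in recent:
            counts[arm] += 1
            sums[arm] += ret
        # Try any arm with no recent data before trusting the estimates.
        for arm in range(self.num_arms):
            if counts[arm] == 0:
                return arm
        total = len(recent)
        # Mean recent return plus a UCB bonus for rarely-tried arms.
        ucb = [sums[a] / counts[a]
               + self.beta * math.sqrt(math.log(total) / counts[a])
               for a in range(self.num_arms)]
        return max(range(self.num_arms), key=lambda a: ucb[a])

    def update(self, arm, episode_return):
        self.history.append((arm, episode_return))
```

In Agent57 each arm would correspond to a particular setting of the discount factor and the intrinsic-reward weight, so the system effectively learns, per task and per stage of training, how exploratory it should be.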

Long-term credit assignment problems arise when the actions an agent takes have a delayed reward. For example, in the game Skiing, no score is given until the end of the game, so systems cannot easily learn the effects of actions taken near the beginning. Agent57's improvement on NGU is to split the agent's neural network into two parts: one that learns to predict the intrinsic reward of an action, and another that predicts the extrinsic reward. The researchers found that this "significantly" increased training stability.
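The split can be sketched as two value heads whose outputs are combined as Q = Q_extrinsic + beta * Q_intrinsic, where beta is the intrinsic-reward weight of the policy being followed. The linear "networks," dimensions, and names below are placeholder assumptions; Agent57 itself uses recurrent deep networks trained on separate reward streams.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, NUM_ACTIONS = 8, 4

# Two independent parameter sets: one head estimates the extrinsic
# (game-score) value, the other the intrinsic (novelty) value.
w_extrinsic = rng.normal(size=(STATE_DIM, NUM_ACTIONS))
w_intrinsic = rng.normal(size=(STATE_DIM, NUM_ACTIONS))

def q_values(state, beta):
    """Combine the two heads: Q = Q_ext + beta * Q_int."""
    q_ext = state @ w_extrinsic
    q_int = state @ w_intrinsic
    return q_ext + beta * q_int

state = rng.normal(size=STATE_DIM)
action = int(np.argmax(q_values(state, beta=0.3)))
```

Because each head is trained against its own reward signal, a noisy intrinsic reward no longer corrupts the estimate of the true game score, which is the stability gain the paper describes.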

The DeepMind team compared Agent57's performance with several other systems, including NGU, Recurrent Replay Distributed DQN (R2D2), and MuZero. Although MuZero has the highest mean and median score across the suite, it "catastrophically" fails to play some games, achieving a score no better than a random policy. Agent57 earned the best score on the hardest 20% of the games and is the only system to exceed human performance on all games.

In a Hacker News discussion about Agent57, one user noted:

This whole evolution looks more and more like expert systems from 1980s where people kept adding more and more complexity to "solve" a specific problem. For RL, we started with simple DQN that was elegant but now the new algorithms looks like a massive hodge podge of band aids. NGU, as it is, [is] extraordinarily complex and looks [like an] adhoc mix of various patches. Now on the top of NGU, we are also throwing in meta-controller and even bandits among other things to complete the proverbial kitchen sink.

DeepMind was launched as a startup in 2010 and was acquired by Google in 2014. DeepMind developed the AlphaGo AI that defeated one of the best human Go players in 2016.
