Facebook Open-Sources Game Playing AI ReBeL

Facebook AI Research (FAIR) published a paper on Recursive Belief-based Learning (ReBeL), their new AI for playing imperfect-information games that can defeat top human players in poker. The algorithm combines reinforcement learning (RL) with state-space search and converges to a Nash equilibrium for any two player zero sum game. Code for training the algorithm to play Liar's Dice has been open-sourced.

FAIR researchers Noam Brown and Anton Bakhtin gave an overview of the system in a blog post. ReBeL is a general-purpose algorithm for use in any two player zero sum game, even those with imperfect information, where players do not have complete knowledge of the game state. By modeling the probability distribution of the beliefs that players may have about game state, ReBeL can apply AI techniques used in perfect-information games. The algorithm is proven to converge to the optimal policy for the game, and FAIR's implementation for Heads-Up No-Limit (HUNL) Texas Hold 'Em poker outperformed previous AI benchmarks and defeated a top human professional poker player. Brown and Bakhtin note:

[W]e view this as a major step toward developing universal techniques for multiagent interactions, and thus as a step toward complex real-world applications like fraud detection and cybersecurity.

Modeling a game for AI automation typically involves encoding the state of the game: for example, the positions of all the pieces on a chess board. The AI agent then uses an algorithm (or policy) to choose its next move, which then updates the game's state. In many games where agents have perfect information about the current state of the game, such as chess, a common implementation involves searching the game's state space: that is, simulating many possible sequences of moves by both players to find the best next move, where "best" is measured using a value function. However, for games where the state space is very large, such as Go, search alone becomes impractical. Instead, researchers for these games have turned to RL, where the agent plays the game and updates its policy based on the results of its moves. DeepMind's AlphaGo Zero combines both RL and search, using a learned policy to guide search by narrowing down the search space.

For games with imperfect information, such as card games where the players' cards are hidden from each other, agents cannot accurately determine the game's full state. While RL techniques do have some success, they lack a guarantee that the policy learned will approach the Nash equilibrium. Because none of the players in the game have full knowledge of the game state, they form beliefs about the unknown state. Furthermore, to determine the likely actions of other players, an agent must form beliefs about the other players' beliefs. DeepStack, the first AI to defeat human professionals at HUNL, uses such recursive reasoning to produce inputs to a neural network which is used to assign values to game states in conjunction with state search.

ReBeL also uses this recursive process to develop probability distributions about the unknown state of the game, and formalizes the process as public belief states (PBS). A PBS is formed by using Bayesian techniques and observations of players' actions to update an initial uniform distribution; thus all information required to produce a PBS is available to all players. By reformulating imperfect-information games to a continuous-state perfect-information based on PBS, ReBeL can use search in the PBS space. The value function for evaluating states in the search is trained using RL during self-play, which reduces the need for expert domain knowledge and makes the algorithm applicable to a wider range of games. While DeepStack also used PBS and search, its value function was trained by randomly generating PBSs from a distribution built using expert knowledge instead of by self-play. FAIR's previous poker-playing AI, Pluribus, does use self-play, but does not adapt its strategy based on observations of other players.

The FAIR team proved that ReBeL converges to a Nash equilibrium policy, and tested the algorithm on both HUNL Texas Hold 'Em poker and Liar's Dice. In poker, ReBeL outperformed two benchmark bots, BabyTartanian8 and Slumbot, and also defeated Dong Kim, a professional poker player. Although FAIR has not released the code for their poker implementation, the source code and model checkpoints for the Liar's Dice experiment are available on GitHub.

Topics

Pitfalls of Unified Memory Models in GPUs

Beyond Platform Thinking at RB Global – Build Things No One Expects, in a Place No One Expects It

Generally AI - Season 2 - Episode 4: Coordinate Systems in AI and the Physical World

Adaptive Responses to Resiliently Handle Hard Problems in Software Operations

Proactive Approaches to Securing Linux Systems and Engineering Applications

Helpful links

Choose your language

Write for InfoQ

Rate this Article

This content is in the AI, ML & Data Engineering topic

Related Topics:

Related Editorial

Related Sponsored Content

Popular across InfoQ

Microsoft Introduces Drasi: Open-Source System for Real-Time Event Processing and Automation

OpenAI Releases Swarm, an Experimental Open-Source Framework for Multi-Agent Orchestration

Generally AI - Season 2 - Episode 3: Surviving the AI Winter

Podman Desktop 1.13 Launches with Hyper-V Support and Additional Enhancements

Taking Advantage of Cell-Based Architectures to Build Resilient and Fault-Tolerant Systems

Generally AI - Season 2 - Episode 4: Coordinate Systems in AI and the Physical World

Challenges and Lessons Porting Code from C to Rust

Copilot Now Available in OneDrive: AI-Powered Features for Streamlined Document Management

Ephemeral IDs: Cloudflare's Latest Tool for Fraud Detection

Beyond Platform Thinking at RB Global – Build Things No One Expects, in a Place No One Expects It

Evolving Trainline Architecture for Scale, Reliability and Productivity

Taking Advantage of Cell-Based Architectures to Build Resilient and Fault-Tolerant Systems

Managing High-Performing Software Teams

Adaptive Responses to Resiliently Handle Hard Problems in Software Operations

Mastering Observability: Unlocking Customer Insights with Gojko Adzic

Distill Your LLMs and Surpass Their Performance: spaCy's Creator at InfoQ DevSummit Munich

Generally AI - Season 2 - Episode 4: Coordinate Systems in AI and the Physical World

University Researchers Publish Analysis of Chain-of-Thought Reasoning in LLMs

Google Cloud Adds Scalable Vector Search to Memorystore for Valkey & Redis Cluster

Podman Desktop 1.13 Launches with Hyper-V Support and Additional Enhancements

Uber Completes Major MySQL Fleet Upgrade, Boosting Performance and Security

QCon San Francisco

QCon London

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?

Facebook Open-Sources Game Playing AI ReBeL

Write for InfoQ

Rate this Article

This content is in the AI, ML & Data Engineering topic

Related Topics:

Related Editorial

Related Sponsored Content

Popular across InfoQ

The InfoQ Newsletter