
The First AI to Beat Pros in 6-Player Poker, Developed by Facebook and Carnegie Mellon


Facebook AI Research's Noam Brown and Carnegie Mellon professor Tuomas Sandholm recently announced Pluribus, the first artificial-intelligence program able to beat elite human players at six-player no-limit Texas Hold'em poker. Over the years, computers have progressively improved, beating humans at checkers, chess, Go, and the TV quiz show Jeopardy!.

Libratus, Pluribus's predecessor, beat humans at the two-player variant of Hold'em poker back in 2017. Poker poses a challenge that the above-mentioned games don't -- that of hidden information. Not knowing the opponents' cards introduces the element of bluffing, which is generally not something algorithms are good at.

In many ways, multiplayer poker resembles real-life situations better than any other board game. In real life, there are usually more than two actors, some information is hidden from one or more of the participants, and the interaction is not a zero-sum game.

Many of the games in which AI has beaten elite human players in the past, such as chess, checkers, and StarCraft II, are two-player zero-sum games. In these games, computing and playing the Nash equilibrium guarantees that, statistically and in the long run, the player cannot lose, no matter what the opponent does. In six-player hold'em poker, computing a Nash equilibrium is generally intractable, and even playing one would not carry the same guarantee.

Hidden information was addressed in Pluribus's predecessor, Libratus, by combining a search procedure for imperfect-information games with self-play algorithms based on counterfactual regret minimization (CFR). This technique worked for the two-player variant of the game, but could not scale to six players, even with 10,000 times as much computing power. In contrast, Pluribus plays many games against itself, improving its strategies against earlier versions of itself, in order to compute a blueprint strategy.

Using iterative Monte Carlo CFR, the algorithm improves the decisions of one player, called the 'traverser', on each iteration by examining every action the traverser could have taken and the hypothetical outcome of each. This is possible because the AI is playing against itself, so it can ask itself what the non-traverser players would have done in response to a different action taken by the traverser.
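
The core of this update can be sketched for a single decision point of the traverser. In the real algorithm the action values come from recursively playing out the rest of the self-played game; here they are supplied directly, and all names are illustrative rather than taken from Pluribus:

```python
def update_regrets(regrets, strategy, action_values):
    """Accumulate counterfactual regret for each action the traverser
    could have taken at this decision point.

    regrets       -- dict action -> accumulated regret so far
    strategy      -- dict action -> probability under the current strategy
    action_values -- dict action -> hypothetical payoff of taking that action
    """
    # Expected value of the current (mixed) strategy at this decision point.
    node_value = sum(strategy[a] * action_values[a] for a in strategy)
    # Regret = how much better each action would have done than the strategy.
    for a in regrets:
        regrets[a] += action_values[a] - node_value
    return node_value

regrets = {"fold": 0.0, "call": 0.0, "raise": 0.0}
strategy = {"fold": 1/3, "call": 1/3, "raise": 1/3}
values = {"fold": 0.0, "call": 1.0, "raise": -1.0}  # hypothetical outcomes
update_regrets(regrets, strategy, values)
# "call" now carries positive regret, so future strategies shift toward it.
```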

At the end of each training iteration, the counterfactual regrets for the traverser's strategy are updated. To reduce complexity and storage requirements, similar actions and decisions are bucketed together and treated as identical. The blueprint strategy computed during training is further refined during play using a search procedure, but it is not adapted in real time to the observed tendencies of the adversaries.
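
The two ideas above -- deriving the next strategy from accumulated regrets ("regret matching" in the CFR literature), and bucketing similar actions -- can be sketched as follows; the bet fractions are hypothetical, not Pluribus's actual abstraction:

```python
def regret_matching(regrets):
    """Derive the next iteration's strategy from accumulated regrets:
    each action is played in proportion to its positive regret."""
    positives = {a: max(r, 0.0) for a, r in regrets.items()}
    total = sum(positives.values())
    if total == 0:
        # No positive regret anywhere: fall back to a uniform strategy.
        n = len(regrets)
        return {a: 1.0 / n for a in regrets}
    return {a: p / total for a, p in positives.items()}

def bucket_bet(bet, pot):
    """Illustrative action abstraction: snap a raw bet size to the nearest
    coarse bucket (a fraction of the pot), so that similar bets share a
    single entry in the stored strategy."""
    fractions = [0.5, 1.0, 2.0]  # hypothetical allowed pot fractions
    return min(fractions, key=lambda f: abs(f * pot - bet))

print(regret_matching({"fold": 0.0, "call": 3.0, "raise": 1.0}))
# call and raise split 3:1; fold, with no positive regret, is never chosen
print(bucket_bet(bet=95, pot=100))  # 95 is closest to a pot-sized bet
```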

Training the algorithm took eight days on a 64-core server with less than 512GB of RAM and no GPUs -- less than $150 in cloud-computing costs at the time of publication. In contrast, training algorithms for other two-player zero-sum games has cost in the range of thousands to millions of dollars.

In order to keep the blueprint strategy at a manageable size, it is coarse-grained. At play time, the algorithm performs a search to identify the best fine-grained strategy for the current situation. The search assumes that each opponent may adopt one of four strategies: sticking to the blueprint, tending to fold, tending to call, or tending to raise. "Tend to" in this context means that the strategy is probabilistically biased toward that action. This way, the search produces a balanced strategy that is difficult for human players to counteract. By contrast, a strategy that, for example, always assumed the opponent would fold would be easily identified by a human opponent, who would adapt by tending not to fold.
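
The four assumed continuation strategies can be sketched as the blueprint itself plus three biased copies. The multiply-and-renormalize biasing scheme and the bias factor here are assumptions for illustration, not Pluribus's exact formula:

```python
def bias_toward(strategy, action, factor=5.0):
    """Return a copy of `strategy` with its probability mass shifted
    toward `action`, renormalized to sum to 1."""
    biased = dict(strategy)
    biased[action] *= factor
    total = sum(biased.values())
    return {a: p / total for a, p in biased.items()}

blueprint = {"fold": 0.2, "call": 0.5, "raise": 0.3}
continuations = {
    "blueprint":    blueprint,
    "fold-biased":  bias_toward(blueprint, "fold"),
    "call-biased":  bias_toward(blueprint, "call"),
    "raise-biased": bias_toward(blueprint, "raise"),
}
# Each biased variant still mixes all actions, just with skewed odds --
# which is what keeps the resulting search strategy hard to exploit.
```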

Bluffing is another interesting part of the algorithm. Pluribus bluffs by calculating, according to its own strategy, the probability that it would have reached the current situation with each possible hand. It then computes its reaction with every possible hand, balancing the strategy across all hands, and finally executes the chosen action with the cards it is actually holding.
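
This range-balancing idea can be sketched as follows: an action distribution is computed for every hand the player could plausibly hold at this point (weighted by how likely its own strategy is to reach the spot with that hand), and only then is the distribution for the actual hand sampled. The hands and numbers below are purely illustrative:

```python
import random

range_strategy = {
    # hand: (probability of reaching this spot, {action: probability})
    "AA": (0.9, {"bet": 0.8, "check": 0.2}),  # strong hand: mostly value-bets
    "72": (0.1, {"bet": 0.3, "check": 0.7}),  # weak hand: occasional bluff
}

def observer_bet_frequency(range_strategy):
    """From an opponent's point of view: how often a bet comes from each
    hand. Because weak hands also bet sometimes, seeing a bet does not
    reveal strength -- the strategy is balanced."""
    weights = {h: reach * acts["bet"]
               for h, (reach, acts) in range_strategy.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

def act(range_strategy, actual_hand, rng=random.Random(0)):
    """Sample an action from the precomputed distribution for the hand
    actually held."""
    _, acts = range_strategy[actual_hand]
    return rng.choices(list(acts), weights=list(acts.values()))[0]
```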

Pluribus was able to beat elite human players in both five-humans-plus-one-AI and five-AIs-plus-one-human settings. The techniques used by Pluribus are not specific to poker and don't require any expert domain knowledge. The work on Pluribus will be used to further research into building a general AI that can cope with multi-agent environments, in collaboration with humans or other AI agents.


Community comments

  • Error in article

    by Alexander Zhidkov


    6-max poker is still a zero-sum game. The difference with heads-up is that player playing Nash equilibrium strategy still might lose in 6-max but not in heads-up.
