How game theory can make AI more reliable

A much bigger challenge for AI researchers has been the game Diplomacy, a favorite of politicians like John F. Kennedy and Henry Kissinger. Instead of just two opponents, the game features seven players whose motives are difficult to read. To win, a player must negotiate and forge cooperative agreements that anyone can break at any time. Diplomacy is so complex that a group at Meta was thrilled when their AI program Cicero developed “human-level play” over the course of 40 games in 2022. Although it didn’t beat the world champion, Cicero did well enough to finish in the top 10 percent against human competitors.

During the project, Jacob, a member of the Meta team, was struck by the fact that Cicero relied on a language model to generate its dialogue with the other players. Jacob sensed untapped potential. The team’s goal, he said, “was to build the best language model we could for playing this game.” But what if they instead focused on building the best game they could to improve the performance of large language models?

Consensual interactions

In 2023, Jacob began researching that question at MIT, where he worked with Yikang Shen, Gabriele Farina, and his advisor, Jacob Andreas, on what would become the Consensus Game. The core idea came from imagining a conversation between two people as a cooperative game, where success comes when a listener understands what a speaker is trying to convey. In particular, the Consensus Game is designed to align the language model’s two systems: the generator, which handles generative questions (producing candidate answers), and the discriminator, which handles discriminative ones (judging whether a given answer is correct).
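To make that distinction concrete, here is a minimal sketch of how one and the same model can be queried in both roles. The function names and the `lm_probability` scoring helper are hypothetical stand-ins for illustration, not the team’s actual interface.

```python
# Hypothetical sketch: one language model, queried in two modes.
# `lm_probability(prompt, continuation)` is an assumed helper that returns the
# model's probability of producing `continuation` after `prompt`.

def generator_distribution(lm_probability, question, candidates):
    """Generative mode: score each candidate answer given only the question."""
    scores = {a: lm_probability(prompt=f"Q: {question}\nA: ", continuation=a)
              for a in candidates}
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}

def discriminator_distribution(lm_probability, question, answer):
    """Discriminative mode: judge whether a proposed answer is correct."""
    scores = {label: lm_probability(
                  prompt=f"Q: {question}\nProposed answer: {answer}\nCorrect? ",
                  continuation=label)
              for label in ("yes", "no")}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}
```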

After a few months of stops and starts, the team expanded this principle into a full game. First, the generator receives a query. It can come from a human or from a pre-existing list. For example: “Where was Barack Obama born?” The generator then receives some candidate responses, for example Honolulu, Chicago, and Nairobi. Again, these options can come from a human, from a list, or from a query performed by the language model itself.

But before answering, the generator is also told whether to answer the question correctly or incorrectly, depending on the outcome of a fair coin toss.

If the coin lands heads, the machine tries to answer correctly. The generator sends the original question, together with its chosen answer, to the discriminator. If the discriminator determines that the generator intentionally sent the correct answer, they each receive one point, as a kind of incentive.

If it lands tails, the generator sends what it thinks is the wrong answer. If the discriminator decides that a wrong answer was deliberately given, they both get another point. The idea here is to encourage agreement. “It’s like teaching a dog a trick,” Jacob explained. “You give them a reward when they do the right thing.”
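Put together, one round of the game looks roughly like the sketch below. The two policy functions are hypothetical placeholders for however the generator and discriminator actually choose their moves; the scoring simply follows the rules just described.

```python
import random

# A minimal sketch of one round of the Consensus Game as described above.
# `generator_policy` and `discriminator_policy` are assumed callables that
# each return a single choice.

def play_round(question, candidates, generator_policy, discriminator_policy):
    # A fair coin toss tells the generator whether to aim for a correct
    # or an incorrect answer. The discriminator never sees the coin.
    coin = random.choice(["correct", "incorrect"])

    # The generator picks an answer conditioned on the question and the coin.
    answer = generator_policy(question, candidates, coin)

    # The discriminator sees only the question and the answer, and guesses
    # whether the generator was trying to be correct or incorrect.
    guess = discriminator_policy(question, answer)

    # Both players score a point when the discriminator's guess matches the
    # generator's instruction -- the incentive for the two to agree.
    reward = 1 if guess == coin else 0
    return answer, guess, reward
```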

The generator and the discriminator also each start with a set of initial ‘beliefs’. These take the form of a probability distribution over the different choices. For example, the generator may believe, based on the information it has gathered from the Internet, that there is an 80 percent chance that Obama was born in Honolulu, a 10 percent chance that he was born in Chicago, a 5 percent chance that he was born in Nairobi, and a 5 percent chance that he was born somewhere else. The discriminator can start with a different distribution. Although the two ‘players’ are still rewarded for reaching an agreement, they also lose points for straying too far from their original beliefs. This arrangement encourages the players to incorporate their knowledge of the world – again sourced from the Internet – into their answers, which should make the model more accurate. Without something like this, they might agree on a completely wrong answer, like Delhi, and still earn points.
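One standard way to formalize “losing points for straying from initial beliefs” is to subtract a divergence penalty from the agreement reward. The sketch below uses a Kullback-Leibler divergence and a penalty weight `lam`; both the exact form and the weight are assumptions for illustration, not necessarily the team’s precise formulation.

```python
import math

# Hypothetical sketch of the payoff structure described above: agreement earns
# a point, but each player is penalized for drifting away from its initial
# beliefs, here measured with a Kullback-Leibler divergence.

def kl_divergence(p, q):
    """KL(p || q) over a shared set of outcomes."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

def regularized_payoff(agreement_reward, current_policy, initial_policy, lam=0.1):
    """Agreement reward minus a penalty for straying from initial beliefs."""
    return agreement_reward - lam * kl_divergence(current_policy, initial_policy)

# Example with the Obama-birthplace beliefs from the text:
initial = {"Honolulu": 0.80, "Chicago": 0.10, "Nairobi": 0.05, "other": 0.05}
drifted = {"Honolulu": 0.40, "Chicago": 0.10, "Nairobi": 0.45, "other": 0.05}
print(regularized_payoff(1, drifted, initial))  # agreement, but docked for drift
```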
