Talk of superhuman artificial intelligence (AI) is heating up. But research has revealed weaknesses in one of the most successful AI systems — a bot that plays the board game Go and can beat the world’s best human players — showing that such superiority can be fragile. The study raises questions about whether more general AI systems will suffer from vulnerabilities that could compromise their safety and reliability, and even their claim to be ‘superhuman’.
“The paper leaves a significant question mark on how to achieve the ambitious goal of building robust real-world AI agents that people can trust,” says Huan Zhang, a computer scientist at the University of Illinois Urbana-Champaign. Stephen Casper, a computer scientist at the Massachusetts Institute of Technology in Cambridge, adds: “It provides some of the strongest evidence to date that making advanced models robustly behave as desired is hard.”
The analysis, which was posted online as a preprint in June and has not been peer reviewed, makes use of what are called adversarial attacks — feeding AI systems inputs that are designed to prompt the systems to make mistakes, either for research or for nefarious purposes. For example, certain prompts can ‘jailbreak’ chatbots, making them give out harmful information that they were trained to suppress.
In Go, two players take turns placing black and white stones on a grid to surround and capture the other player’s stones. In 2022, researchers reported training adversarial AI bots to defeat KataGo, the best open-source Go-playing AI system, which typically beats the best humans handily (and handlessly). Their bots found exploits that regularly beat KataGo, even though the bots were otherwise not very good — human amateurs could beat them. What’s more, humans could understand the bots’ tricks and adopt them to beat KataGo.
Exploiting KataGo
Was this a one-off, or did that work point to a fundamental weakness in KataGo and, by extension, in other AI systems with seemingly superhuman capabilities? To investigate, the researchers, led by Adam Gleave, chief executive of FAR AI, a non-profit research organization in Berkeley, California, and co-author of the 2022 paper, used adversarial bots to test three ways of defending Go AIs against such attacks.
The first defence was one that the KataGo developers had already deployed after the 2022 attacks: giving KataGo examples of board positions involved in the attacks, and having it play itself to learn how to play against those positions. That is similar to how it taught itself to play Go more generally. But the authors of the latest paper found that an adversarial bot could learn to beat even this updated version of KataGo, winning 91% of the time.
The second defensive strategy that Gleave’s team tried was iterative: training a version of KataGo against adversarial bots, then training attackers against the updated KataGo and so on, for nine rounds. But this didn’t result in an unbeatable version of KataGo either. Adversaries kept finding exploits, with the final one beating KataGo 81% of the time.
As a third defensive strategy, the researchers trained a new Go-playing AI system from scratch. KataGo is based on a computing model known as a convolutional neural network (CNN). The researchers suspected that CNNs might focus too much on local details and miss global patterns, so they built a Go player using an alternative neural network called a vision transformer (ViT). But their adversarial bot found a new attack that helped it to win 78% of the time against the ViT system.
Weak adversaries
In all these cases, the adversarial bots — although able to beat KataGo and other top Go-playing systems — were trained to discover hidden vulnerabilities in other AIs, not to be well-rounded strategists. “The adversaries are still pretty weak — we’ve beaten them ourselves fairly easily,” says Gleave.
And with humans able to use the adversarial bots’ tactics to beat expert Go AI systems, does it still make sense to call those systems superhuman? “It’s a great question I definitely wrestled with,” Gleave says. “We’ve started saying ‘typically superhuman’.” David Wu, a computer scientist in New York City who first developed KataGo, says strong Go AIs are “superhuman on average” but not “superhuman in the worst cases”.
Gleave says that the results could have broad implications for AI systems, including the large language models that underlie chatbots such as ChatGPT. “The key takeaway for AI is that these vulnerabilities will be difficult to eliminate,” Gleave says. “If we can’t solve the issue in a simple domain like Go, then in the near-term there seems little prospect of patching similar issues like jailbreaks in ChatGPT.”
What the results mean for the possibility of creating AI that comprehensively outpaces human capabilities is less clear, says Zhang. “While this might superficially suggest that humans may retain important cognitive advantages over AI for some time,” he says, “I believe the most crucial takeaway is that we do not fully understand the AI systems we build today.”
This article is reproduced with permission and was first published on July 8, 2024.