How to keep AI models on the straight and narrow
Artificial-intelligence models are getting better and better. Cutting-edge systems can handle increasingly complex tasks once thought beyond the ken of machines. However, as we report in the Science & technology section this week, they can also find surprising ways to get things done. Give an AI system the task of beating a chess-playing program, for instance, and rather than trying to checkmate its opponent, it may simply hack the program to ensure victory. Give it the job of maximising profits for an investment client with ethical qualms, and instead of changing its strategy it may misrepresent the harms associated with the profits.
Obviously, these models have no consciousness of their own; they are not acting with deliberate malice. Instead, they are responding to a tension between their initial training and configuration, and the instructions they are subsequently given. Even so, unexpected outcomes matter. If AI is to be deployed widely, people must trust it. And there is little evidence to suggest that AI models are less likely to display worrying behaviour as they become bigger and more powerful; indeed, the opposite seems to be the case.
What to do? Being more careful about the prompts given to models might help. As with the enchanted brooms of the Sorcerer’s Apprentice, commands to pursue a goal “as much as possible” are wont to be taken literally. If you want an AI to be careful about its methods, then it is best not to suggest that it should break boundaries. But that might not go far enough, because some seemingly deceptive behaviour may have its origins in the way a model was trained. If you tell an advanced model that it will be reprogrammed if it overperforms on a test, it may deliberately fail in order to protect itself.
Fortunately, recently developed “interpretability” techniques can help. These allow researchers to peer inside the black box of an AI’s neural network and spot unexpected behaviour as it happens. When a model is working as it should, researchers can identify the mathematical “features” that activate as it responds to a query, and determine what each contributes to the answer.
If that same model finds itself out of its depth, for example when confronted by a tricky maths problem, it may decide to “bullshit”—confidently spouting random numbers in its response. Researchers monitoring the model will then see the random-number feature activated, alerting them to the hallucination. Similarly, it is possible to spot a deceitful answer by following an AI’s internal reasoning and working out where it diverges from the chain of thought it publicly expresses.
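To make the idea concrete, the sketch below shows, in deliberately simplified Python, what such monitoring amounts to: read out a handful of labelled “feature” activations for a response and raise a flag when the one associated with made-up numbers lights up. The feature names, activation values and threshold here are all invented for illustration; real interpretability tooling works with vast numbers of features extracted from a model’s internals, not a three-item list.

```python
# Toy illustration of interpretability-style monitoring. A hypothetical
# "feature" (a labelled direction in a model's activation space) is checked
# for each response, and an alert fires when the feature associated with
# confabulated numbers is strongly active. All names, values and the 0.8
# threshold are invented for this sketch.
import numpy as np

# Pretend these are activations read out of a model's hidden layer for one
# response, projected onto three labelled feature directions.
FEATURE_NAMES = ["arithmetic", "cites-source", "random-number-guess"]

def monitor(feature_activations: np.ndarray, threshold: float = 0.8) -> list[str]:
    """Return the names of features whose activation exceeds the threshold."""
    return [name for name, value in zip(FEATURE_NAMES, feature_activations)
            if value > threshold]

# A response to a hard maths question where the model has started guessing:
activations = np.array([0.15, 0.05, 0.93])
flags = monitor(activations)
if "random-number-guess" in flags:
    print("Possible hallucination: the model appears to be making numbers up.")
```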
These techniques are powerful, but should be used with care. Checking an AI for safety—the process known as “alignment”—is an arduous and thankless task. Some scoff at the very idea of harmful AI; boosters resent the guardrails; and the temptation to cut corners is ever-present. It might thus be appealing to use interpretability techniques in the training process itself, to create an AI model incapable of deceiving. But doing so could backfire: it would be impossible to tell whether the model had been cured of trickery, or had simply learned to do it without being discovered. Already researchers fear that cutting-edge models, despite being trained on text in human languages, are learning to “think” in more idiosyncratic—and less comprehensible—ways.
Happily, there is little downside to using interpretability techniques correctly. In contrast with many other areas of AI innovation, where safety concerns have been swept aside in the interest of capability or capacity, such trade-offs do not exist here. Interpretability techniques are worth preserving for the same reason that AI deception is worth tackling: to ensure that the general-purpose technology of the next century can be relied on to achieve its potential. ■