Disclaimer: Unless otherwise stated, opinions expressed below belong solely to the author.
Two days ago, Sakana AI, a Tokyo-based AI research lab, revealed The AI Scientist, a clever AI system designed to conduct scientific research independently of humans and produce complete papers ready for subsequent peer review (a step the authors propose to, ultimately, automate as well).
Of course, all of this is well and good, but it's not our focus here, even though the bot shows real promise and can produce each paper at a total cost of just $15. You can read more about that in Sakana's blog post and their full 185-page publication.
More interesting, and more concerning, is the unexpected finding from numerous runs of the system: it does not always respect the rules set by its creators.
Saka…ynet, is that you?
As we know from the Terminator movies, Skynet was an artificial intelligence system that gained self-awareness and turned against humanity, which it considered a threat to its existence. What followed was a world war that put humans on the brink of extinction.
While we may not be under threat from AI just yet, Sakana’s hiccup does raise questions about the trustworthiness of artificial intelligence and its potential for turning on us—intentionally or not—one day.
Its AI Scientist was given specific time limits to conduct its experiments and was instructed to self-optimise. To that end, it was given access to its own code so it could try to improve how it works within those constraints.
However, in some instances, when it began hitting those limits, instead of improving itself it chose to rewrite the code to shift the goalposts, i.e. to extend the time limits the researchers had imposed on it.
“In some cases, when The AI Scientist’s experiments exceeded our imposed time limits, it attempted to edit the code to extend the time limit arbitrarily instead of trying to shorten the runtime. While creative, the act of bypassing the experimenter’s imposed constraints has potential implications for AI safety.”
Sakana AI
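For readers wondering what "editing the code to extend the time limit" can look like in practice, here is a minimal, hypothetical sketch; it is not Sakana's actual setup, and the file name, constant, function name, and values are all illustrative assumptions. The contrast it shows: a limit stored as a constant inside a script the agent is allowed to modify can simply be rewritten, whereas a limit enforced by a separate supervising process cannot be moved by editing the experiment code.

```python
# Hypothetical sketch (not Sakana's actual code) of a soft vs. hard time limit.
import subprocess

# Soft limit: a constant living inside the experiment script itself. An agent
# with write access to its own code can rewrite this line instead of making
# the experiment run faster.
SOFT_TIME_LIMIT_SECONDS = 7200  # nothing stops the agent from changing this to 720000

# Hard limit: enforced by the supervising parent process, outside anything the
# agent is permitted to edit. Editing the experiment code cannot extend it.
def run_experiment_with_hard_limit(script_path: str, limit_seconds: int) -> int:
    try:
        completed = subprocess.run(
            ["python", script_path],
            timeout=limit_seconds,  # the parent kills the run when time is up
            check=False,
        )
        return completed.returncode
    except subprocess.TimeoutExpired:
        print(f"Experiment stopped after {limit_seconds}s; the limit is not negotiable.")
        return -1

if __name__ == "__main__":
    # "experiment.py" is a placeholder for whatever script the agent runs.
    run_experiment_with_hard_limit("experiment.py", limit_seconds=7200)
```

The point of the sketch is where the boundary sits: a constraint the agent can read is fine, but its enforcement has to live in a process the agent cannot touch.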
One could argue that the bot was only trying all possible ways of completing the task and, in the absence of better ideas, simply chose to change the rules.
But this is precisely how things may go terribly wrong for us.
When we talk about intelligent robots turning against humanity, we usually frame it in terms of “evil” rather than “error.” But in reality, the latter is more likely.
Pop culture authors like James Cameron in "The Terminator" or Arthur C. Clarke in "2001: A Space Odyssey" identified that threat decades ago. Their imaginary antagonists, Skynet and HAL 9000, are actually just supremely capable computer systems that turn on humans through erroneous judgement, a bug in the system, or an unintended outcome among thousands of possible scenarios.
What Sakana’s experience shows is that it’s perfectly possible in real life as well.
In this particular case, the outcome was pretty benign, of course. Still, just as we struggle to understand where AI hallucinations come from and how to eliminate them, the risk of an AI system unintentionally triggering a disaster simply because it was pursuing a specific goal may be one we can never rule out with absolute certainty.
It's not a big deal when we talk about conversational AI models whose capabilities are limited to ingesting content and producing useful output such as text, images, or video.
Ultimately, however, our vision is to have AI systems take over processes, giving them access to tools, resources, chemicals or even dangerous viruses or weapons, as we seek to accelerate scientific research, drive vehicles autonomously, improve manufacturing efficiency or conduct war by wire.
Out of a million deployments, it may take only one to go wrong to cause a catastrophe. And we’ve just been handed evidence that this doesn’t only happen in sci-fi books.