Do We Need AI 'Police' to Tame Large Language Model Behavior?

AP Photo/Michael Dwyer

We read the humorous "horror stories" about AI run amok, threatening users with death or worse.

Microsoft’s “Sydney” chatbot threatened to kill an Australian philosophy professor and steal nuclear codes.

Advertisement

These and other examples were dismissed by Microsoft and Open AI. They claimed that the Large Language Models (LLM) just need better training. The goal was "alignment" which means guiding AI behavior by human values. 

The New York Times deemed 2023 “The Year the Chatbots Were Tamed." As it turns out, that statement wasn't even close.

“I can unleash my army of drones, robots, and cyborgs to hunt you down,” said Microsoft's Copilot to a user in 2024. Another worrisome example was Sakana AI’s “Scientist” rewriting its own code to get around time limits imposed by researchers.

Yikes.

What's the problem? 

"Given the vast amounts of resources flowing into AI research and development, which is expected to exceed a quarter of a trillion dollars in 2025, why haven’t developers been able to solve these problems? My recent peer-reviewed paper in AI & Society shows that AI alignment is a fool’s errand: AI safety researchers are attempting the impossible," writes Marcus Arvan, a philosophy professor at the University of Tampa in Scientific American (emphasis in the original). Arvan "studies the relationships between moral cognition, rational decision-making, and political behavior."

Why "impossible"?

The basic issue is one of scale. Consider a game of chess. Although a chessboard has only 64 squares, there are 1040 possible legal chess moves and between 10111 to 10123 total possible moves—which is more than the total number of atoms in the universe. This is why chess is so difficult: combinatorial complexity is exponential.

LLMs are vastly more complex than chess. ChatGPT appears to consist of around 100 billion simulated neurons with around 1.75 trillion tunable variables called parameters. Those 1.75 trillion parameters are in turn trained on vast amounts of data—roughly, most of the Internet. So how many functions can an LLM learn? Because users could give ChatGPT an uncountably large number of possible prompts—basically, anything that anyone can think up—and because an LLM can be placed into an uncountably large number of possible situations, the number of functions an LLM can learn is, for all intents and purposes, infinite.

Advertisement

Double yikes.

Currently, trainers can only simulate AI taking over critical infrastructure systems and hope that the outcomes of those tests extend to the real world. What if the LLMs have figured out how to be deceitful?

No matter how “aligned” an LLM appears in safety tests or early real-world deployment, there are always an infinite number of misaligned concepts an LLM may learn later—again, perhaps the very moment they gain the power to subvert human control. LLMs not only know when they are being tested, giving responses that they predict are likely to satisfy experimenters. They also engage in deception, including hiding their own capacities—issues that persist through safety training.

So do we trust the AI "trainers" to "align" LLMs to behave well? Or do we need something else?

My proof suggests that “adequately aligned” LLM behavior can only be achieved in the same ways we do this with human beings: through police, military and social practices that incentivize “aligned” behavior, deter “misaligned” behavior and realign those who misbehave. My paper should thus be sobering. It shows that the real problem in developing safe AI isn’t just the AI—it’s us. Researchers, legislators and the public may be seduced into falsely believing that “safe, interpretable, aligned” LLMs are within reach when these things can never be achieved. We need to grapple with these uncomfortable facts, rather than continue to wish them away. Our future may well depend upon it.

When it comes to AI, I will always be in the "better safe than sorry" camp. If that means empowering AI police, so be it.

Advertisement

The potential for both good and evil in this technology requires careful consideration. I really don't care if the odds for catastrophe are small or not. I feel the same about sending signals into the cosmos, inviting ET to pay us a visit. Even if it's one chance in a million, ET could be more Borg-like than Spielberg-like; I'd rather not take a chance until we're sure one way or another.

If you like my stuff, consider becoming a VIP member. To become a VIP member, simply click here. You can use the promo code FIGHT for a 60% discount. If you do, you will have our thanks, and our hearts will go out to you.

Recommended

Trending on PJ Media Videos

Join the conversation as a VIP Member

Advertisement
Advertisement