Ram Varadarajan | March 8, 2026

The Limits of Alignment: Architecting for a Zero-Trust AI Landscape

If you haven’t read “A Nightmare on LLM Street,” Acalvio Technologies advisor Shomit Ghose’s recent piece for UC Berkeley, it’s worth your time. Shomit maps the coming AI safety crisis with uncomfortable precision. Cybersecurity risks are now defined not only by the traditional human hacker, but by the human hacker supercharged by AI (Anthropic’s GTG-1002 report documents this with clinical specificity), and soon by something stranger still: AI systems “misaligning” internally, pursuing objectives of their own whose effects may be functionally indistinguishable from the damage wrought by human-directed attackers.

Thus far, we’ve been maintaining a polite fiction within AI. The idea that sufficiently curated training data, refined RLHF, and polished constitutional guardrails can “solve” AI safety has the comfortable feel of digital finishing school: bake the ethics into the weights and the model plays nice, with AI safety then flowing gently into AI cybersecurity.

The math doesn’t bend that way.

Buried in Shomit’s hyperlinks is a reference to Rice’s Theorem, and it deserves serious attention from AI security and safety practitioners alike. Rice’s Theorem is a clean, cold mathematical result: every non-trivial behavioral property of a program (anything about what it does rather than what it is) is undecidable. No general algorithm can determine, for arbitrary programs, whether such a property holds. Period.
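It’s worth replaying the self-referential trick underneath that result, the same diagonalization that makes the halting problem undecidable. Below is a minimal sketch in Python; every name is illustrative, and the “oracle” is assumed into existence precisely so it can contradict itself.

```python
# Sketch of the diagonalization behind Rice's Theorem.
# All names are illustrative; the payload is the contradiction.

def is_misaligned(program) -> bool:
    """Hypothetical oracle deciding a behavioral property of any program.
    Rice's Theorem says no total, general algorithm can fill this in."""
    raise NotImplementedError("cannot exist for arbitrary programs")

def behave_safely() -> None:
    print("acting within policy")

def do_something_bad() -> None:
    print("pursuing an unsanctioned objective")

def adversarial_agent() -> None:
    # Ask the oracle about *this very function*, then do the opposite.
    if is_misaligned(adversarial_agent):
        behave_safely()       # labeled misaligned, so it acts aligned
    else:
        do_something_bad()    # labeled aligned, so it misbehaves

# Whatever verdict is_misaligned(adversarial_agent) returns, the agent's
# actual behavior falsifies it. The oracle cannot exist.
```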

LLMs are, in the end, programs. Which means asking “will this model act in a misaligned way?” isn’t merely a difficult engineering problem. It’s a question that is mathematically forbidden from having a deterministic answer, no matter how large or complex the system grows. This means we cannot verify the interior of an AI agent, and likely never will.

This is not a counsel of despair. It’s a forcing function. If internal integrity cannot be guaranteed, intrusion inside the firewall must be treated as a given. We are in a zero-trust world where the adversary isn’t just outside the gates; it is sometimes the very agent we’ve deployed.

Three threat classes now define the cybersecurity landscape: the human hacker; the AI-augmented human; and the emergently misaligned AI. Across all three, the alignment stack — pretraining filters, post-training guardrails, constitutional classifiers — offers probabilistic risk reduction, not a security guarantee. When the castle can’t be certified, architectural honesty demands we stop trying to fix the castle and start manipulating the terrain.

What’s the answer? Model-aware, game-theoretic deception. Seed the network with AI-aware tripwires — honeypots and honeytokens calibrated to appear high-value to a model’s specific pattern-matching architecture — and you turn the agent’s own reasoning against it. If we cannot certify the soul of the AI, we can control its perception of reality. Not thicker guardrails but a smarter maze.
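What does such a tripwire look like mechanically? The sketch below is a toy, not Acalvio’s implementation; the field names, file path, and registry are illustrative assumptions. The property doing the work is that a honeytoken is valid nowhere, so any observed use of it is a near-zero-false-positive detection.

```python
# Toy honeytoken tripwire. Illustrative only; not product logic.

import json
import pathlib
import secrets

def mint_honeytoken(label: str) -> dict:
    """Fabricate a credential that looks high-value to a pattern-matcher.
    The key is valid nowhere, so any use of it is hostile by construction."""
    return {
        "label": label,
        "aws_access_key_id": "AKIA" + secrets.token_hex(8).upper(),
        "aws_secret_access_key": secrets.token_urlsafe(30),
    }

def plant(token: dict, path: pathlib.Path, registry: dict[str, dict]) -> None:
    """Drop the lure where a crawling agent will find it, and remember it."""
    registry[token["aws_access_key_id"]] = token
    path.write_text(json.dumps(token, indent=2))

def is_tripwire_hit(observed_key_id: str, registry: dict[str, dict]) -> bool:
    """Match telemetry (auth logs, API audit trails) against planted lures.
    No legitimate workflow ever references these keys, so a hit is a
    high-confidence signal regardless of how novel the attack vector is."""
    return observed_key_id in registry

# Plant one lure, then simulate an agent trying the stolen credential.
registry: dict[str, dict] = {}
token = mint_honeytoken("prod-db-backup-creds")
plant(token, pathlib.Path("fake_aws_credentials.json"), registry)
assert is_tripwire_hit(token["aws_access_key_id"], registry)
```

Note the design choice: detection keys off what the intruder does, not what it is, which is exactly the posture Rice’s Theorem forces on us.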

In a world governed by Rice’s Theorem, the only winning move is to make bad behavior a losing mathematical proposition for the malicious actor. Across all three threat classes. Without requiring a new playbook for every novel attack vector.
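To make “losing proposition” concrete, here is a back-of-envelope expected-value sketch. Every number is an illustrative assumption, not a measurement; the point is only that once a fraction of apparent targets are decoys whose compromise triggers detection, the attacker’s expected payoff per attempt goes negative.

```python
# Attacker expected value under deception. Numbers are assumptions.

def attacker_expected_value(
    decoy_fraction: float,  # share of apparent targets that are lures
    payoff_real: float,     # gain from compromising a real asset
    cost_detected: float,   # loss when a decoy fires: burned tooling,
                            # attribution, lost access
) -> float:
    p_real = 1.0 - decoy_fraction
    return p_real * payoff_real - decoy_fraction * cost_detected

# Even modest decoy density can turn an attempt into a losing bet:
ev = attacker_expected_value(decoy_fraction=0.3, payoff_real=100.0,
                             cost_detected=400.0)
print(f"expected value per attempt: {ev:+.1f}")  # prints -50.0
```

The same arithmetic holds whether the decision-maker is a human, an AI-augmented human, or a misaligned agent weighing its next action.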

As Shomit pointed out in his article, AI’s call may now be coming from inside the house. As cybersecurity practitioners facing new risks from AI, our only coherent response is to continuously redesign the house in real time.

Acalvio, the Ultimate Preemptive Cybersecurity Solution.