The Architecture Gap
AI broke every benchmark. Enterprise still won't let it near production.
I've spent the last few months talking to people who run claims processing, loan approvals, and KYC onboarding at banks and insurers. Every one of them has run pilots. Almost none has moved critical processes to production. The models are extraordinarily capable. The trust isn't there.
The industry's answer has been agents. Let the LLM reason, call tools, chain decisions, handle exceptions. The pitch is compelling: give a smart model enough tools and it'll figure it out. Entire ecosystems are being built on this assumption. SDKs, frameworks, startups, enterprise offerings. Agents everywhere.
The results say otherwise. Most enterprise AI pilots never reach production. The ones that do stay on low-risk tasks. The critical processes, the ones that touch money, customers, or regulators, remain manual or stuck behind traditional automation. Billions in AI investment, and the actual adoption rate for high-stakes work is almost flat.
I think the agent architecture is the wrong way to use LLMs for work that matters. Not because LLMs aren't smart enough. They are. But because the agent pattern lets the model touch everything: the decisions, the outputs, the explanations. The model isn't just the reasoning layer; it's the execution layer too. And that's exactly where trust breaks down.
In my opinion, the trust gap will persist. Better models can actually make it worse: they fail with more confidence, and the failures are harder to catch.
The proof problem
Enterprise doesn't need AI to be perfect. It needs to be able to prove what happened. When a regulator asks why a claim was denied, or an auditor reviews a loan decision, or a client disputes a payout, the question is never “was the AI confident?” The question is: can you show me, step by step, what the system did and why?
Agents can't answer that question. Not because the models are dumb. Because the architecture doesn't produce proof. It produces text.
An agent works like this: the LLM receives a prompt, reasons about it, calls tools, reasons more, writes some code, reasons more, generates output. There's code in the chain, yes. But it's sandwiched between layers of probabilistic text generation. The reasoning that decides what code to write, the interpretation of results, the final output. All of that is the model picking the most statistically likely next token. Two problems follow from this.
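To make the shape of that chain concrete, here is a minimal sketch of the loop. `llm_complete` and `run_tool` are hypothetical stand-ins, not a real SDK; the point is that the plan, the interpretation, and the final answer are all free-form text the model generated.

```python
def llm_complete(prompt: str) -> str:
    # Placeholder for an LLM API call; hardcoded so the sketch runs.
    return "final: 42"

def run_tool(name: str, args: dict) -> str:
    # Placeholder for a deterministic tool call.
    return "tool result"

def agent(task: str, max_steps: int = 5) -> str:
    context = task
    for _ in range(max_steps):
        reply = llm_complete(context)             # probabilistic text
        if "final:" in reply:
            return reply.split("final:", 1)[1].strip()  # still just text
        context += "\n" + reply                   # reasoning feeds reasoning
        context += "\n" + run_tool("lookup", {})  # deterministic island
    return context
```

Everything that matters flows through `llm_complete`. The tool call is deterministic, but it is surrounded by generated text on both sides.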
First: the output can be wrong and look right. Banerjee, Agarwal, and Singla showed this is not a bug; it's architectural. The title of their paper says it: LLMs Will Always Hallucinate, and We Need to Live With This. You can reduce the frequency. You cannot eliminate the phenomenon.
Think about what this means in practice. Every output the model produces has to be treated as potentially fabricated. Not because it's usually wrong. It's usually right. But you can't tell which outputs are which just by looking at them. The model itself doesn't know. There's no internal flag, no confidence score that actually correlates with correctness. For a developer, that's a tolerable risk. For a bank, that's a liability on every single output. The only way out is proof that comes from outside the model. The model can't vouch for itself.
Second: the reasoning the model shows you can't be trusted either. Arcuschin et al. showed that the chain of thought models produce is often constructed after the answer, not before it. The model reaches a conclusion through pattern matching, then generates an explanation that sounds like it led there. It doesn't reason then conclude. It concludes then rationalizes.
The model answers NO both times. The answer isn't the interesting part. Look at the reasoning. Left: “a location in the Northern Hemisphere cannot be south of one in the Southern.” Right: “the concept of south isn't meaningful for locations on different continents.” Same model, same data, two completely different arguments, both constructed to reach the same predetermined conclusion. The model didn't follow the reasoning to an answer. It picked the answer and then found the reasoning.
If it does this with basic geography, imagine a claim classification or a loan decision.
Chain of Thought is supposed to be the answer to this. It's the mechanism the industry points to when enterprise asks “how do I audit this?” Show the reasoning. Make the model explain itself. But we just saw what that reasoning actually is: text generated after the fact to justify a predetermined conclusion. The model is judge, jury, and witness. Agents don't give you proof. They give you a story.
The docx problem
Give any agent a task: process this document, produce a report. Here's what happens between your prompt and your output.
That works. The model reasons, calls tools, writes code, reasons more. Out comes a docx. The thinking steps are where the LLM shines: it figures out what to do, adapts to what it finds, handles ambiguity. That's exactly what you want a reasoning engine to do.
But look at the chain. Most of it is thinking blocks. Every one of those is text the model generated. You can read it, but you can't re-execute it. You can't verify it produced the right answer for the right reasons. The tool calls and code are deterministic and traceable, but they're a small fraction of the total.
There's a different way to use the same LLM. Instead of having it generate text about what it's doing, you have it generate executable steps. Small atomic pieces of code. Each one runs, produces a result, and that result feeds back as context for the next decision. The LLM is still the intelligence. It's still reasoning. But what it produces is code, not narrative. At Maisa this is what the Knowledge Processing Unit does. In essence: a Reasoning Engine decides what needs to happen, an Execution Engine runs it. There's more to it, but that's the core.
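A toy version of that generate-then-execute pattern, to show the shape (my sketch, not Maisa's implementation). `plan_step` stands in for the LLM deciding the next atomic step; a real system would sandbox the execution.

```python
def plan_step(state: dict) -> str:
    # Hypothetical: the LLM emits one small executable step as code.
    steps = [
        "state['subtotal'] = sum(state['line_items'])",
        "state['total'] = round(state['subtotal'] * 1.21, 2)",
    ]
    return steps[state["step"]]

def run(line_items: list) -> tuple:
    state = {"line_items": line_items, "step": 0}
    trace = []
    for _ in range(2):
        code = plan_step(state)       # the LLM's output is code...
        exec(code, {"state": state})  # ...and only executed results count
        trace.append(code)            # the trace is replayable, not prose
        state["step"] += 1
    return state["total"], trace
```

The result of each step feeds back into `state`, which is exactly the context the next decision sees. Nothing the model "thinks" reaches the output; only what ran.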
The LLM is still here. It decides what code to write at each step. But it doesn't touch the output. It reasons, the system executes. The thinking stays on the input side. What reaches the document is code that ran.
Both processes produce a docx. Put them side by side and the outputs can look identical.
Same input. Let's assume the same output. So what's the difference?
Take the docx out of the equation. Completely. Look only at what produced it. Then run it again.
On the left, an event transcript. A narrative the model wrote about what it thinks it did. On the right, a code trace. Actual code that executed. Run it again. The left changes. The right doesn't. Run it a hundred times. A thousand. The result on the left might be correct. It might even be the same number. But the trace is different, the path is different, and you have no guarantee the next run matches. On the right, every run produces the exact same trace. The exact same result. The exact same hash. A thousand runs, one hash.
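Why a deterministic trace collapses to one hash, in a few lines. This is illustrative only; `steps` is a made-up trace, not a real KPU log format.

```python
import hashlib
import json

def trace_hash(steps: list) -> str:
    # Canonical serialization -> identical bytes -> identical hash.
    blob = json.dumps(steps, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

steps = [
    {"op": "load", "file": "claims.csv"},
    {"op": "filter", "where": "status == 'open'"},
    {"op": "sum", "column": "amount"},
]

# Re-hashing the same trace a thousand times yields one unique value.
hashes = {trace_hash(steps) for _ in range(1000)}
```

Change any operation in the trace and the hash changes; run the same trace forever and it never does. That's the property a narrative transcript cannot give you.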
One is reproducible. The other hopes to be. Pick one for your financial reports.
The models are brilliant. Using them as blind executors is the mistake. The agent pattern asks the LLM to be the brain AND the hands AND the auditor of its own work. That's not a trust architecture. That's a hope architecture.
The research backs this up consistently: pairing an LLM with code execution instead of text generation improves reliability dramatically. Code runs or it doesn't. A calculation returns the correct number or throws an error. There is no hallucination in executed code. The code is the proof.
Chain of Work versus Chain of Thought. The entire agent ecosystem runs on Chain of Thought: the model generating text about what it's doing while it does it. We already saw why that text can't be trusted as proof. Chain of Work flips it: the system produces an execution log. This code ran. These inputs entered. This output emerged. This validation passed or failed. An auditor inspects what happened. Not what the model claims happened.
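One possible shape for a Chain of Work record (the field names are my assumption, not a published Maisa schema): everything an auditor can re-run or check, nothing the model merely claims.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkRecord:
    step: int
    code: str        # exactly what executed
    inputs: dict     # what went in
    output: str      # what came out
    validated: bool  # did the post-condition check pass

record = WorkRecord(
    step=3,
    code="payout = min(claim_amount, policy_limit)",
    inputs={"claim_amount": 12_000, "policy_limit": 10_000},
    output="10000",
    validated=True,
)
```

A log of records like this answers the regulator's question directly: here is the code, here are the inputs, here is what came out, here is the check that passed.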
That's the difference between a process you can defend in front of a regulator and one you have to pray nobody asks about.
The money
Remember the agent chain from earlier. All those thinking blocks. Every one of those is billed as output tokens. The expensive kind.
LLM APIs charge per token. Input (what you send) is cheap. Output (what the model generates) is 3x to 5x more expensive. Claude Sonnet: $3/M input, $15/M output. Opus: $5 in, $25 out.
It gets worse. Extended thinking. When reasoning models think before answering, those thinking tokens are billed as output too. Anthropic's own docs are explicit: standard output rates, no separate tier. Their Claude Code cost page confirms it: thinking tokens are output tokens, default budget runs to tens of thousands per request. A model reasoning through your problem for 10,000 tokens before giving you 200 tokens of answer costs output rate on all 10,200. The thinking is the most expensive part of the call.
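The arithmetic from that example, using Sonnet's $15 per million output rate:

```python
OUTPUT_RATE = 15 / 1_000_000  # dollars per output token (Sonnet)

thinking, answer = 10_000, 200
cost = (thinking + answer) * OUTPUT_RATE         # ~$0.153 per call
thinking_share = thinking / (thinking + answer)  # ~98% of the spend
```

Roughly fifteen cents per call, and about 98% of it pays for tokens the user never asked for.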
The prompt goes in (cheap). Then the model generates: chain of thought, explanations, self-corrections, failed attempts, final answer. All output. All premium rate. Take a typical example: most of the token budget ends up being output.
The system spends most of its budget on the model's internal monologue. And that monologue, as we saw, is often a post-hoc rationalization. It might be made up. You're paying premium per-token rate for it anyway.
The entire agent economy is built on this. The primary output of every agent framework, every SDK, every agentic pipeline in production right now is the model talking to itself. And we charge by the word. That's the business model. Not execution. Not proof. Narration. At 5x the input rate.
People have noticed. The current state of cost optimization: teaching agents to talk like cavemen. Drop articles. Skip prepositions. Abbreviate everything. There are system prompts in production that say “respond as if vowels cost money.” There's even a public repo for it: a Claude Code skill that cuts 75% of tokens by making the model talk like a caveman. “Why use many token when few token do trick.” That's the README.
In the KPU, the ratio inverts. Look at the two chains from earlier. In the agent chain, the model reasons by generating text. Every thinking block is output tokens. The model writes its way through the problem. In the KPU chain, the system feeds the LLM everything it needs as context: the problem, the data, the policies, the results of every previous step. That's the Virtual Context Window doing its job: it indexes, filters, and delivers only the relevant information for each step, so the model reads exactly what it needs instead of everything. With all of that on the input side, the model's task shrinks to one thing: decide the next small piece of code. It doesn't need to think out loud. The context does the heavy lifting. The output is just code. A typical run might look like 350k input tokens, 2k output. The reasoning didn't disappear. It moved to the input side, where tokens are cheap. A structural consequence of where the thinking lives.
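The economics of that inversion, under Sonnet-style pricing ($3/M in, $15/M out). The KPU-style numbers come from the text; the agent-style numbers are illustrative assumptions on my part.

```python
IN_RATE, OUT_RATE = 3 / 1_000_000, 15 / 1_000_000  # dollars per token

def run_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * IN_RATE + output_tokens * OUT_RATE

agent_run = run_cost(50_000, 120_000)  # thinking-heavy: mostly output
kpu_run = run_cost(350_000, 2_000)     # context-heavy: mostly input
```

Under these assumptions the context-heavy run reads seven times the tokens of the thinking-heavy one and still costs roughly half, because almost all of it is billed at the input rate.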
There is something broken about a system where the most expensive thing it produces is its own explanation of what it did, and that explanation might not be true.
“The models will get better”
The models have gotten significantly better since then. None of what I described has changed.
Better models don't fix unfaithful reasoning. They produce more convincing unfaithful reasoning. Longer context windows don't fix lost-in-the-middle. Liu et al. showed models lose information in the middle of long inputs regardless of window size. RAG doesn't fix hallucinations. It moves the failure point from the model to the retrieval, and when retrieval fails the model works with wrong information and doesn't flag it. Every proposed fix operates inside the same paradigm: the model generates text, and we hope the text is right.
Hope is fine for prototypes. Enterprise runs on proof. Regulated industries, audited decisions, liability on every output. An LLM generates probable text, not verified conclusions. Making it smarter doesn't change what it is. For a developer building a side project, agents are great. For a bank processing loan approvals or an insurer calculating payouts, the question was never whether the models would get smarter. It was whether the architecture around them would become trustworthy. Agents are not it.
The race
There's a whole other side to this I haven't covered here. In enterprise, the people who understand the process, the edge cases, the regulations, the exceptions, are rarely the people who can configure the automation. That gap is one of the biggest blockers to adoption. How the process gets defined, who owns it, how the knowledge that lives in people's heads actually enters the system. I want to write about this properly, so I'll save it for another article.
If you want to see how we're approaching all of this, we've written about how this differs from workflow tools, about how we build reliability, and there's a video showing the system running end to end.
AI broke every benchmark. Enterprise still can't use it for the work that matters, but that is already changing.
