Axiomatic Reasoning for LLMs

What is Lacking in AI: Intelligence or System?

1. Observed Performance Ceiling

Contemporary large language model agents demonstrate competence in isolated tasks—code generation, single-step reasoning, and tool invocation. Yet, when composed into swarms tasked with open-ended design of complete software or hardware systems, success rates plateau below the threshold required for autonomous operation. This plateau persists across model scales and training data expansions. The bottleneck resides not in the raw predictive capacity of the underlying neural networks but in the systemic architecture that organizes perception, action, and evaluation.

2. Structural Discontinuities in Current Agent Architectures

Three architectural gaps separate present agent implementations from closed-loop cognitive systems.

2.1 Prediction–Execution Decoupling (Cartesian Cut)

Component	Function	Integration Limitation
LLM Core	Token sequence prediction	Trained exclusively on static corpora; lacks exposure to action-outcome contingencies.
External Runtime	Tool calling, code execution	Executes model outputs without modifying internal representations of the model.

The LLM generates instructions based on statistical regularities. The runtime executes them and returns textual logs. The model ingests these logs as input tokens rather than as parameter updates. This arrangement forms an open-loop control system. Errors detected during execution do not reshape the predictive distribution except through prompt concatenation.

2.2 Semantic Grounding Gap

Models generate syntactically valid artifacts—Verilog modules, mechanical designs, architectural specifications—that violate physical or logical constraints undetectable from surface form alone. The following failure modes appear consistently across electronic design automation (EDA) benchmarks:

Timing violations in generated hardware description language code that pass syntax checks.
Thermal or power density infeasibility in floorplan proposals.
Logical contradictions in multi-component interface definitions.

Human engineers circumvent these failures through internal simulation: activation of sensorimotor and spatial reasoning circuits that predict physical outcomes before formal verification. The model possesses no equivalent mechanism; it relies on external simulators and static analysis tools as post-hoc filters.

2.3 Absence of Recursive Self-Evaluation

Empirical assessments of self-critique reveal a performance degradation effect. In high-accuracy coding tasks, requesting a model to review and revise its own output reduces success rates from 98% to 57%. The degradation originates from the model’s inability to maintain a distinct evaluative stance toward its own generations. The same predictive machinery that produced the initial output applies identical heuristics during critique, amplifying initial biases rather than correcting them.

A closed-loop improvement structure requires:

A differentiable separation between actor and critic.
A persistent memory of prior errors and their resolutions.
A mechanism to update the actor’s policy based on critic feedback.

Present systems implement none of these natively; they rely on external orchestration frameworks to simulate iterative refinement through prompt chaining.

3. Evidence Summary

Observation	Structural Implication
Multi-agent coordination fails at rates exceeding 68% for self-organizing topologies.	Communication protocols and shared memory coherence are external to model architecture.
Agent performance on tool-intensive tasks degrades as agent count increases.	Coordination overhead scales superlinearly without native consensus mechanisms.
Models with larger parameter counts exhibit lower spontaneous collaboration rates.	Increased predictive self-sufficiency reduces reliance on external feedback loops.
Autonomous chip design workflows remain dependent on human sign-off at verification stages.	Deterministic correctness requirements cannot be satisfied by probabilistic token generation alone.
Long-horizon planning accuracy remains below 50% for tasks exceeding 20 steps.	State tracking and value propagation are not integrated into the generation process.

4. Conclusion

The limitations observed in agentic systems originate from architectural choices, not from insufficient model capacity. Scaling model size, training data, or inference compute does not resolve the three structural discontinuities:

Open-loop control between prediction and execution.
Absence of grounded semantics linking symbols to physical constraints.
Missing recursive self-modification pathways.

These discontinuities define the boundary between current language model agents and closed-loop cognitive systems capable of autonomous design iteration. Future progress toward autonomous system construction requires architectural integration—embedding action feedback into model parameters, grounding generation in physical simulators, and implementing native meta-cognitive loops—rather than incremental expansion of pretraining scale.

This site is open source. Improve this page.