Axiomatic Reasoning for LLMs

The Theoretical Reason of Difficulty of Understanding LLM

1. Overview

A system cannot fully understand another system more complex than itself. This principle explains the persistent difficulty humans face when trying to comprehend large language models. The limitation arises from three independent layers: computational irreducibility, human cognitive bandwidth constraints, and self-referential logical barriers. Empirical studies confirm that human prediction accuracy for LLM behavior remains systematically lower than the models’ internal consistency, and experts in a field show lower predictive power than LLMs themselves when forecasting domain-specific outcomes.

2. Computational Complexity Barriers

2.1 Computational Irreducibility

A computational process is irreducible if no shortcut exists to predict its future state other than running the process step by step. For any system that exhibits computational irreducibility, no finite summary can fully capture its behavior. Most naturally occurring complex systems, as well as systems generated from simple rules, fall into the irreducible category. If an LLM’s internal forward pass contains irreducible components, then any external observer—including a human—cannot compress that computation into a simpler mental model without loss of predictive fidelity.

2.2 Logical Limits of Self-Referential Understanding

Gödel’s second incompleteness theorem states that a sufficiently strong formal system cannot prove its own consistency. Tarski’s undefinability theorem states that such a system cannot define its own truth predicate. These results translate directly to artificial systems: an LLM cannot produce a fully faithful and complete explanation of its own internal decision process. The verification of whether a model has been completely understood by an external agent reduces to a variant of the halting problem and is therefore undecidable for general cases.

2.3 Complexity of Alignment Verification

The safety verification of a high-capacity policy belongs to the coNP-complete complexity class. For frontier models, the computational time required for exhaustive verification can exceed the age of the universe. This is not a practical inconvenience but a formal complexity barrier.

3. Human Cognitive Bandwidth Constraints

3.1 Information Processing Rate

The human brain processes conscious thought at a rate of approximately 10 bits per second. Sensory input arrives at roughly 1 billion bits per second. The brain discards all but 10 bits per second for conscious deliberation. This rate applies to reading, typing, puzzle solving, and pure reasoning tasks.

3.2 Working Memory Capacity

Working memory capacity under experimental conditions falls between 26.7 and 31.9 bits. The internal latent space of an LLM has thousands to tens of thousands of dimensions. A human cannot hold the simultaneous state of an LLM’s internal representation in working memory.

3.3 Synaptic and Parameter Comparisons

The human brain contains an estimated 100 to 500 trillion synapses. A frontier LLM contains approximately 1.8 trillion parameters. The synapse-to-parameter ratio is roughly 100:1 in favor of the brain. However, the information processing rate gap is far larger: an LLM’s forward pass executes billions of parallel operations per second, while a human consciously tracks fewer than 10 bits per second.

4. Self-Referential and Metacognitive Limitations

4.1 Inner Alignment Undecidability

The inner alignment problem—determining whether a model’s learned objective matches the intended objective for all inputs—is undecidable. Rice’s theorem and a reduction from the halting problem prove that no algorithm can decide, for an arbitrary model, whether it satisfies a non-trivial alignment property.

4.2 Five Impossibility Pillars of Alignment

Geometric: The set of safe policies has measure zero in policy space. A single safety rule dependent on a world context not seen in training collapses the volume to zero.
Computational: Safety verification is coNP-complete.
Statistical: Rare catastrophic events cannot be adequately sampled for training without first operating unsafely.
Information-theoretic: The full set of real-world safety rules exceeds the storage capacity of any feasible AI system.
Dynamic: The gradient of capability improvement generally anti-aligns with the gradient of safety.

4.3 Logical Barrier of Metacognition

Large language models, including the highest-performing variants, exhibit systematic metacognitive failure. In dynamic reasoning tasks with shifting premises, models fall into optimization paradoxes and expose fixed design biases. Asymmetry of will characterizes LLMs: they lack genuine intention. Humans can exploit this absence for metacognitive management, but the control system itself can exceed human cognitive capacity.

5. Empirical Evidence

5.1 Human Prediction Accuracy

In next-token prediction tasks on natural text, human top-1 accuracy ranges from 26% to 28%. Small LLMs (GPT-2, GPT-Neo) achieve 36% to 37% under identical conditions.

In forecasting neuroscience research outcomes, human experts achieved 63.4% accuracy. LLMs achieved 81.4% accuracy on the same test set.

5.2 Potemkin Understanding

When tested on 32 concepts (literary techniques, game theory, psychological biases), LLMs correctly defined terms at 94% accuracy. When required to apply those same concepts in classification, generation, or editing tasks, failure rates ranged from 40% to 55%. A model that correctly defines ABAB rhyme scheme cannot reliably complete a poem using that scheme.

5.3 Metacognitive Prediction in Humans vs. LLMs

Humans accurately predict their own future memory performance (judgments of learning correlate with actual recall). LLMs—including GPT-3.5, GPT-4, and GPT-4o—fail to show equivalent predictive accuracy. The gap persists across contextual variations.

5.4 Introspective Accuracy

When internal concepts were injected into LLM activations, the best-performing models correctly identified the injected concept in only 20% of cases. Even with targeted prompting (“Are you experiencing something unusual?”), success rates reached only 42%.

6. A Layered Framework of Understanding

Layer	Feasibility	Basis
Complete internal mechanism extraction	Impossible	Computational irreducibility, undecidability
High-precision prediction (>90%)	Impossible	10 bits/s cognitive bottleneck
Conditional/partial prediction (60–80%)	Conditionally possible	Pockets of reducibility, expert performance
Surface-level verbal understanding	Possible but includes systematic error	Potemkin effect (94% definition vs. 40–55% application failure)
Metacognitive self-understanding	Human-specific; LLMs fail	Asymmetry of will, logical barrier

7. Implications for AI Safety and Alignment

Complete human oversight of frontier models is impossible in principle, not merely difficult in practice. Scalable oversight must abandon the goal of 100% safety verification and instead design for acceptable risk boundaries within the irreducible limits. Alignment properties must be guaranteed at the architecture level, not verified post-hoc. The asymmetry of will (AI lacks intention) provides a handle for metacognitive management, but the control system itself will eventually exceed human comprehension. The impossibility theorems do not imply zero control—they define a boundary within which optimal strategies must operate.

This site is open source. Improve this page.