A system cannot fully understand another system more complex than itself. This principle explains the persistent difficulty humans face when trying to comprehend large language models. The limitation arises from three independent layers: computational irreducibility, human cognitive bandwidth constraints, and self-referential logical barriers. Empirical studies are consistent with this picture: humans predict the next token of natural text less accurately than even small LLMs, and domain experts forecast research outcomes in their own field less accurately than LLMs do.
A computational process is irreducible if no shortcut exists to predict its future state other than running the process step by step. For a system that exhibits computational irreducibility, no summary substantially shorter than the computation itself can fully capture its behavior. Many naturally occurring complex systems, as well as many systems generated from simple rules, are conjectured to be irreducible. If an LLM’s internal forward pass contains irreducible components, then no external observer, human or otherwise, can compress that computation into a simpler mental model without loss of predictive fidelity.
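Rule 30, a one-dimensional cellular automaton, is a common illustration of a simple rule with suspected irreducible behavior. The sketch below is an added illustration, not something the argument above depends on, and the function names are hypothetical. It computes the center column the only way currently known to work in general: by running every step, since no shortcut formula for that column is known.

```python
def rule30_step(cells):
    """Apply one Rule 30 update: new = left XOR (center OR right)."""
    padded = [0] + cells + [0]  # cells outside the window are treated as 0
    return [padded[i - 1] ^ (padded[i] | padded[i + 1])
            for i in range(1, len(cells) + 1)]


def center_column(steps):
    """Return the center cell's value for the first `steps` rows."""
    width = 2 * steps + 1   # wide enough that the boundary never affects the center
    cells = [0] * width
    cells[steps] = 1        # start from a single live cell in the middle
    column = []
    for _ in range(steps):
        column.append(cells[steps])
        cells = rule30_step(cells)
    return column


if __name__ == "__main__":
    print(center_column(20))
```

Predicting row n of the center column this way costs work that grows with n. That cost is what "no shortcut" means in practice.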
Gödel’s second incompleteness theorem states that a consistent, effectively axiomatized formal system strong enough to express basic arithmetic cannot prove its own consistency. Tarski’s undefinability theorem states that such a system cannot define its own truth predicate. Applied by analogy to artificial systems, these results suggest that an LLM cannot produce a fully faithful and complete explanation of its own internal decision process. Verifying whether an external agent has completely understood a model is at least as hard as the halting problem, which reduces to it, and is therefore undecidable in the general case.
Safety verification of a high-capacity policy is coNP-complete, so no polynomial-time algorithm for it exists unless P = NP. For frontier models, the time required for exhaustive verification can exceed the age of the universe. This is a formal complexity barrier, not a practical inconvenience.
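The age-of-the-universe claim can be checked with back-of-the-envelope arithmetic. The sketch below counts only brute-force enumeration of prompts, which is an illustration of the scale and not the coNP-completeness argument itself. The vocabulary size, prompt length, and checking rate are assumed numbers chosen for illustration, and the function name is hypothetical.

```python
# Roughly 13.8 billion years, expressed in seconds (about 4.35e17).
AGE_OF_UNIVERSE_S = 13.8e9 * 365.25 * 24 * 3600


def exhaustive_check_seconds(vocab_size, prompt_len, checks_per_second):
    """Time to test every possible prompt of exactly `prompt_len` tokens."""
    n_prompts = vocab_size ** prompt_len  # exact integer arithmetic
    return n_prompts / checks_per_second


if __name__ == "__main__":
    # Assumed values: 50,000-token vocabulary, 20-token prompts,
    # and a verifier that checks 10**18 prompts per second.
    seconds = exhaustive_check_seconds(50_000, 20, 1e18)
    print(f"{seconds:.2e} s = {seconds / AGE_OF_UNIVERSE_S:.2e} ages of the universe")
```

Even with these generous assumptions and prompts of only 20 tokens, the enumeration would take on the order of 10^58 times the age of the universe.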
The human brain processes conscious thought at approximately 10 bits per second, while sensory input arrives at roughly 1 billion bits per second. The brain therefore discards all but about one part in 100 million of its input before conscious deliberation. The 10 bits per second figure holds across reading, typing, puzzle solving, and pure reasoning tasks.
Working memory capacity under experimental conditions falls between 26.7 and 31.9 bits. The internal latent space of an LLM has thousands to tens of thousands of dimensions, each holding a real-valued activation. A human therefore cannot hold the full state of an LLM’s internal representation in working memory at once.
The human brain contains an estimated 100 to 500 trillion synapses. A frontier LLM contains approximately 1.8 trillion parameters. The synapse-to-parameter ratio is therefore between about 55:1 and 280:1, or on the order of 100:1, in favor of the brain. However, the gap in information processing rate runs the other way and is far larger: an LLM’s forward pass executes billions of parallel operations per second, while a human consciously tracks about 10 bits per second.
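The scale comparisons in the last three paragraphs reduce to a few divisions, sketched below. The synapse, parameter, bandwidth, and working-memory figures are the ones quoted in the text. The hidden dimension and bits per value are assumptions added for illustration, and raw storage bits overstate the usable information in an activation vector, so the last factor is a loose upper-end comparison only.

```python
# Figures quoted in the text above.
SYNAPSES_LOW, SYNAPSES_HIGH = 100e12, 500e12
LLM_PARAMETERS = 1.8e12
SENSORY_BITS_PER_S = 1e9
CONSCIOUS_BITS_PER_S = 10
WORKING_MEMORY_BITS_MAX = 31.9

# Assumptions for illustration only (not from the text).
ASSUMED_HIDDEN_DIM = 4096
ASSUMED_BITS_PER_VALUE = 16


def synapse_parameter_ratios():
    """Low and high estimates of synapses per LLM parameter."""
    return SYNAPSES_LOW / LLM_PARAMETERS, SYNAPSES_HIGH / LLM_PARAMETERS


def sensory_compression_factor():
    """How many sensory bits arrive per consciously processed bit."""
    return SENSORY_BITS_PER_S / CONSCIOUS_BITS_PER_S


def latent_to_working_memory_factor():
    """Raw bits in one assumed activation vector per bit of working memory."""
    raw_bits = ASSUMED_HIDDEN_DIM * ASSUMED_BITS_PER_VALUE
    return raw_bits / WORKING_MEMORY_BITS_MAX


if __name__ == "__main__":
    low, high = synapse_parameter_ratios()
    print(f"synapses per parameter: {low:.0f} to {high:.0f}")
    print(f"sensory-to-conscious factor: {sensory_compression_factor():.0e}")
    print(f"latent-to-working-memory factor: {latent_to_working_memory_factor():.0f}")
```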
The inner alignment problem, which asks whether a model’s learned objective matches the intended objective for all inputs, is undecidable in general. Rice’s theorem, itself proved by a reduction from the halting problem, shows that no algorithm can decide, for an arbitrary program, whether its behavior satisfies a given non-trivial semantic property, and an alignment property that must hold on all inputs is such a property.
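The diagonal construction behind such undecidability results can be shown in miniature. The sketch below is a toy, not a proof of Rice’s theorem: "safe" is a stand-in label, the function names are hypothetical, and the program reaches the decider through a closure. Given any decider that claims to classify programs as safe or unsafe, it builds a program that consults the decider about itself and then does the opposite.

```python
def make_contrarian(decider):
    """Build a program that does the opposite of what `decider` predicts."""
    def contrarian():
        # Ask the decider about this very program, then contradict it.
        return "unsafe" if decider(contrarian) else "safe"
    return contrarian


def decider_is_wrong(decider):
    """Return True if `decider` misjudges its own contrarian program."""
    program = make_contrarian(decider)
    predicted_safe = decider(program)
    actually_safe = program() == "safe"
    return predicted_safe != actually_safe


if __name__ == "__main__":
    optimist = lambda program: True    # declares every program safe
    pessimist = lambda program: False  # declares every program unsafe
    print(decider_is_wrong(optimist), decider_is_wrong(pessimist))
```

Every deterministic decider that only inspects the program is wrong on its own contrarian. A decider that instead tries to run the program would recurse without bound here, which is the other horn of the same argument.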
Large language models, including the highest-performing variants, exhibit systematic metacognitive failure. In dynamic reasoning tasks with shifting premises, models fall into optimization paradoxes and expose fixed design biases. LLMs are also characterized by an asymmetry of will: unlike humans, they lack genuine intention. Humans can exploit this absence for metacognitive management, but the control system needed to do so can itself exceed human cognitive capacity.
In next-token prediction tasks on natural text, human top-1 accuracy ranges from 26% to 28%. Small LLMs (GPT-2, GPT-Neo) achieve 36% to 37% under identical conditions.
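For reference, top-1 accuracy is the fraction of positions at which the single most likely guess equals the token that actually came next. The sketch below is a minimal version with a hypothetical function name. The tokens in the usage example are made up to show the calculation and are not data from any study.

```python
def top1_accuracy(predictions, targets):
    """Fraction of positions where the top guess equals the true next token."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one prediction is required")
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)


if __name__ == "__main__":
    # Made-up tokens for illustration only.
    guesses = ["the", "cat", "sat", "on"]
    actual = ["the", "dog", "sat", "in"]
    print(top1_accuracy(guesses, actual))
```

The same function scores a human guesser and a model alike, which is what makes the two figures above comparable.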
In forecasting neuroscience research outcomes, human experts achieved 63.4% accuracy. LLMs achieved 81.4% accuracy on the same test set.
When tested on 32 concepts (literary techniques, game theory, psychological biases), LLMs defined the terms correctly 94% of the time. When required to apply those same concepts in classification, generation, or editing tasks, they failed 40% to 55% of the time. For example, a model that correctly defines the ABAB rhyme scheme may still fail to complete a poem using that scheme. This gap between definition and application is the Potemkin effect.
Humans accurately predict their own future memory performance (judgments of learning correlate with actual recall). LLMs—including GPT-3.5, GPT-4, and GPT-4o—fail to show equivalent predictive accuracy. The gap persists across contextual variations.
When internal concepts were injected into LLM activations, the best-performing models correctly identified the injected concept in only 20% of cases. Even with targeted prompting (“Are you experiencing something unusual?”), success rates reached only 42%.
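The text does not describe the injection mechanism. One common way to inject a concept, assumed here, is to add a scaled concept direction to a hidden activation vector. The sketch below uses tiny hand-written vectors and hypothetical function names. It is not the procedure of the experiment described above, and only shows what an external probe sees when an injection is present.

```python
import math


def inject(activation, concept, strength):
    """Add a scaled concept direction to an activation vector."""
    return [a + strength * c for a, c in zip(activation, concept)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm


if __name__ == "__main__":
    activation = [1.0, 0.0, 0.0]  # toy hidden state
    concept = [0.0, 1.0, 0.0]     # toy concept direction
    steered = inject(activation, concept, strength=2.0)
    print(cosine(activation, concept), cosine(steered, concept))
```

In this toy setting an external probe that knows the concept direction detects the injection from the similarity score alone. The figures above concern a harder question: whether the model itself can report the change.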
| Layer | Feasibility | Basis |
|---|---|---|
| Complete internal mechanism extraction | Impossible | Computational irreducibility, undecidability |
| High-precision prediction (>90%) | Impossible | 10 bits/s cognitive bottleneck |
| Conditional/partial prediction (60–80%) | Conditionally possible | Pockets of reducibility, expert performance |
| Surface-level verbal understanding | Possible, but with systematic error | Potemkin effect (94% definition accuracy vs. 40–55% application failure rate) |
| Metacognitive self-understanding | Human-specific; LLMs fail | Asymmetry of will, logical barrier |
Complete human oversight of frontier models is impossible in principle, not merely difficult in practice. Scalable oversight must abandon the goal of 100% safety verification and instead design for acceptable risk boundaries within the irreducible limits. Alignment properties must be guaranteed at the architecture level, not verified post-hoc. The asymmetry of will (AI lacks intention) provides a handle for metacognitive management, but the control system itself will eventually exceed human comprehension. The impossibility theorems do not imply zero control—they define a boundary within which optimal strategies must operate.
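One way to make "acceptable risk boundaries" concrete is a statistical bound in place of exhaustive verification. The sketch below is an added illustration with hypothetical function names, not a method the text prescribes. It assumes independent test trials drawn from the same distribution the system will face in deployment, which is a strong assumption. Under it, observing zero failures in n trials gives an exact one-sided upper bound on the true failure rate.

```python
import math


def failure_rate_upper_bound(n_trials, confidence=0.95):
    """Upper bound on the failure rate after n failure-free independent trials."""
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1 - confidence
    # Largest p for which n straight successes still has probability >= alpha.
    return 1 - alpha ** (1 / n_trials)


def trials_needed(max_failure_rate, confidence=0.95):
    """Failure-free trials required to bound the failure rate at the given level."""
    if not 0 < max_failure_rate < 1:
        raise ValueError("max_failure_rate must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_failure_rate))


if __name__ == "__main__":
    print(failure_rate_upper_bound(300))  # close to the rule-of-three value 3/300
    print(trials_needed(0.01))
```

The bound never reaches zero for any finite number of trials, which matches the point above: testing can tighten a risk boundary but cannot deliver 100% verification.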