Large language models (LLMs) exhibit two qualitatively different forms of reasoning that current performance metrics routinely conflate.
Explicit reasoning refers to literal, step‑by‑step logical operations applied to perfectly specified inputs. It requires applying the stated rules exactly, treating the prompt literally, and producing the same answer under any meaning‑preserving rephrasing.
Implicit reasoning refers to intent understanding and grounded inference. It requires working out what an under‑specified prompt is actually asking for, grounding the request in background knowledge, and filling in unspecified requirements plausibly.
The distinction matters because state‑of‑the‑art benchmarks measure a mixture of these two modes, but users experience them as a single “smartness” score.
Taxonomic analyses of popular benchmarks (MMLU, GSM8K, HellaSwag, BIG‑bench) show a concentration on the lower levels of cognitive complexity: memorisation and basic understanding. Higher‑order skills (analysis, evaluation, creation) are underrepresented.
Consequently, a high benchmark score does not guarantee the ability to perform literal explicit reasoning on tasks that require no background inference. Conversely, a model may appear weak on explicit reasoning tests while still delivering high user satisfaction on implicit tasks.
Some benchmarks suffer from severe construct validity problems. For example, a commonsense reasoning benchmark can be solved without reading the question – more than 65% of predictions remain unchanged when the question is removed. This indicates that the intended measurement (implicit reasoning about events) diverges from the actual measurement (pattern matching on surface statistics).
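The ablation behind this finding can be sketched as a tiny harness: feed each item to a predictor with and without its question and count unchanged predictions. The predictor below is a deliberately question‑blind toy (it picks the longest choice), standing in for a real model; the items are hypothetical.

```python
def unchanged_fraction(predict, items):
    """Fraction of items whose prediction is identical when the
    question is deleted from the prompt (construct-validity ablation)."""
    same = 0
    for question, choices in items:
        with_q = predict(question, choices)
        without_q = predict("", choices)  # question removed
        same += with_q == without_q
    return same / len(items)

# deliberately question-blind toy predictor: picks the longest choice
toy = lambda question, choices: max(choices, key=len)

items = [
    ("What happens next?", ["he runs", "he sits down very quietly"]),
    ("Why did she leave?", ["rain", "a long and elaborate excuse"]),
]
print(unchanged_fraction(toy, items))  # 1.0 for a question-blind predictor
```

A high unchanged fraction is exactly the signature of surface pattern matching: the question contributes nothing to the prediction.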
Other widely used benchmarks mix memory retrieval with reasoning in a way that cannot be disentangled. A model may answer correctly by recalling a training example rather than by performing any logical operation on the prompt.
| Benchmark | Implicit reasoning load | Explicit reasoning load | Primary measured construct |
|---|---|---|---|
| MMLU | medium to high | medium (domain‑dependent) | mixed knowledge + inference |
| GSM8K | medium | high | multi‑step arithmetic with reading comprehension |
| HellaSwag | low (actual) / high (intended) | low | surface pattern matching (flawed) |
| BIG‑bench | variable | variable | heterogeneous – no single load |
| ARC | medium | medium | science knowledge + reasoning |
The difference in performance between a perfect prompt (highly precise, unambiguous, literal) and a standard prompt (natural, slightly ambiguous, under‑specified) reveals the implicit reasoning requirement of a task.
Minor wording changes cause accuracy fluctuations of up to 15 percentage points. Proprietary models are more sensitive to prompt formulation than open‑source models, which suggests that higher instruction‑following capability (a form of implicit reasoning) comes with greater variability.
Meaning‑preserving paraphrases systematically lower accuracy. The drop is not uniform: some models and tasks are more vulnerable. This indicates that models rely on surface‑level lexical cues rather than on the underlying semantic structure.
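A minimal way to quantify this sensitivity, assuming per‑paraphrase accuracies have already been measured, is to report the spread between the best and worst meaning‑preserving paraphrase; a purely semantic model would show a spread of zero. The numbers below are hypothetical.

```python
def paraphrase_spread(accuracies):
    """Accuracy spread across meaning-preserving paraphrases of one task;
    a model keyed on semantics rather than surface form would show 0."""
    return max(accuracies) - min(accuracies)

# hypothetical per-paraphrase accuracies for a single task
accs = [0.78, 0.71, 0.63, 0.75]
print(round(paraphrase_spread(accs), 2))  # 0.15, i.e. a 15-point swing
```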
When a prompt leaves a requirement unspecified, models can sometimes infer it (in 41.1% of cases), but the behaviour is fragile: switching the model, or changing a single word, can roughly double the usual regression, producing accuracy drops of more than 20 percentage points.
Adding an “expert persona” to a prompt degrades factual accuracy on memory‑ and logic‑dependent tasks (e.g., humanities, mathematics, coding) while improving performance on tasks where tone and structure matter more (e.g., writing, extraction, technical explanation). This reversal shows that prompt perfectness is task‑specific: a prompt that is perfect for one type of reasoning may be imperfect for another.
Several controlled tests expose the gap between benchmark performance and pure explicit reasoning ability.
Models achieve 74–100% numerical accuracy on two‑integer addition. Closer probing, however, shows that this numerical accuracy comes from pattern matching, not from rule induction.
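A probe for this distinction fits in a few lines: generate fresh two‑integer sums and score exact answers. The `exact` solver below is a stand‑in that follows the addition rule and therefore scores 1.0 at any operand width; a real model would be plugged in through the same `model(prompt)` interface, which is an assumed, hypothetical signature.

```python
import random

def addition_accuracy(model, n_items=200, digits=6, seed=0):
    """Exact-match accuracy on freshly generated two-integer sums;
    a system that induced the addition rule scores 1.0 at any width."""
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    correct = 0
    for _ in range(n_items):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        correct += model(f"{a} + {b} =") == a + b
    return correct / n_items

# stand-in rule-follower; a pattern matcher would degrade as `digits` grows
exact = lambda prompt: eval(prompt.rstrip("= "))
print(addition_accuracy(exact))  # 1.0
```

Sweeping `digits` separates the two hypotheses: rule induction is width‑invariant, memorised patterns are not.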
A model trained on “A is B” does not infer “B is A” (the reversal curse). For real‑world facts, GPT‑4 answers “Who is Tom Cruise’s mother?” with 79% accuracy, but “Who is Mary Lee Pfeiffer’s son?” with only 33% accuracy. The literal logical symmetry of the relation fails because the training order does not match the test order.
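The forward/reverse gap can be measured with a paired probe. The `toy` model below is a stub that has memorised only the forward direction, which is already enough to reproduce the asymmetry; the pair list is illustrative.

```python
def directional_accuracy(model, pairs):
    """Accuracy on forward questions vs. their reversed counterparts."""
    fwd = sum(model(q) == a for q, a, _, _ in pairs) / len(pairs)
    rev = sum(model(q) == a for _, _, q, a in pairs) / len(pairs)
    return fwd, rev

# one fact, asked in both directions
pairs = [("Who is Tom Cruise's mother?", "Mary Lee Pfeiffer",
          "Who is Mary Lee Pfeiffer's son?", "Tom Cruise")]

# stub that memorised only the training (forward) direction
memorised = {"Who is Tom Cruise's mother?": "Mary Lee Pfeiffer"}
toy = lambda question: memorised.get(question, "unknown")

print(directional_accuracy(toy, pairs))  # (1.0, 0.0)
```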
A straightforward logic puzzle (“Alice has N brothers and M sisters. How many sisters does Alice’s brother have?”) is answered incorrectly by most state‑of‑the‑art models. Even GPT‑4o reaches only 60% accuracy on the basic version, and near 0% on a harder variant. Yet the same models score highly on MMLU.
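For reference, the puzzle has a one‑line first‑principles solution: each of Alice's brothers has all of Alice's sisters plus Alice herself.

```python
def sisters_of_alices_brother(n_brothers, m_sisters):
    """Each of Alice's brothers has Alice's m sisters plus Alice herself;
    the brother count is a distractor in the puzzle."""
    return m_sisters + 1

print(sisters_of_alices_brother(3, 2))  # 3
```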
Changing a single phrase in a primary‑school‑level reasoning problem causes cutting‑edge models (including o1 and DeepSeek‑R1) to lose most of their accuracy. The behaviour is consistent with recitation of memorised solutions rather than first‑principles reasoning.
Benchmarks that appear saturated (e.g., GSM, MMLU) still hide large reasoning gaps. When evaluated with methods that probe the reasoning process rather than the final answer, even the best models drop by more than 40 percentage points.
The confusion between implicit and explicit reasoning is amplified when moving from researchers to everyday users.
Users who solve logic problems with AI assistance overestimate their own accuracy by about 4 questions (e.g., reporting 17 correct out of 20 when the actual number is 13). This overestimation does not occur in the no‑AI condition.
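The miscalibration statistic is simply reported minus actual correct answers, averaged per condition. The session data below is hypothetical and merely mirrors the magnitudes quoted above.

```python
def overestimation(reported, actual):
    """Per-session miscalibration: positive means overconfidence."""
    return reported - actual

mean = lambda xs: sum(xs) / len(xs)

# hypothetical (reported, actual) correct answers out of 20, per session
ai_assisted = [(17, 13), (18, 14), (17, 13)]
no_ai = [(14, 14), (12, 13), (15, 15)]

print(mean([overestimation(r, a) for r, a in ai_assisted]))  # 4.0
print(mean([overestimation(r, a) for r, a in no_ai]))        # about -0.33
```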
Higher self‑reported AI literacy correlates with stronger overconfidence. The classic Dunning‑Kruger effect (low ability → high overestimation) disappears; instead, higher literacy predicts larger miscalibration.
Users consistently overestimate the reliability of LLM outputs. A common bias is the belief that longer explanations imply higher accuracy, although length is uncorrelated with actual correctness. LLMs themselves are overconfident, especially on hard tasks, and their confident phrasing reinforces user overestimation.
Analysis of 1.5 million real conversations shows that users rarely encounter tasks demanding pure explicit reasoning. Their satisfaction therefore rests on implicit‑reasoning tasks, where current LLMs perform adequately. This creates an illusion of general intelligence.
Metacognitive laziness (offloading cognitive effort to the AI) reduces the user’s ability to assess output quality. Each successful interaction increases confidence without correcting the underlying miscalibration. The loop strengthens over time.
Benchmark scores should report at least two numbers: an explicit‑reasoning score obtained under perfect, literal prompts, and an implicit‑reasoning score obtained under natural, under‑specified prompts.
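A sketch of how the two numbers could travel together through a reporting pipeline; the class and field names are illustrative, not an existing API.

```python
from dataclasses import dataclass

@dataclass
class ReasoningScore:
    """Two-number report: one score per reasoning mode."""
    explicit: float  # accuracy under perfect, literal prompts
    implicit: float  # accuracy under natural, under-specified prompts

    @property
    def prompt_gap(self) -> float:
        """How far literal performance exceeds natural-prompt performance."""
        return self.explicit - self.implicit

score = ReasoningScore(explicit=0.92, implicit=0.74)
print(round(score.prompt_gap, 2))  # 0.18
```

Reporting the gap alongside the two scores makes prompt sensitivity a first‑class metric instead of a hidden confound.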
A prompt that is “perfect” for one model or one task may be suboptimal for another. Standardised prompt banks with controlled ambiguity levels are needed for fair comparisons.
General users require indicators that reflect their actual usage patterns. Grounded, everyday tasks (e.g., GAIA‑style problems) should complement abstract benchmarks. Overconfidence can be reduced by showing explicit‑reasoning failure examples, not just by increasing AI literacy.
Improving implicit reasoning (instruction following, intent guessing) may increase prompt sensitivity and reduce performance on literal explicit tasks. Optimising for one type of smartness can degrade the other. The trade‑off must be explicitly documented.
Current assessments of AI smartness conflate two distinct capabilities: implicit reasoning (intent understanding, grounded inference) and explicit reasoning (literal computation under perfect prompts). Benchmarks, prompt sensitivity experiments, and discrepancy cases all show that high scores on popular tests do not guarantee basic explicit reasoning. Among users, this confusion combines with systematic overconfidence, creating a perception gap that is far larger than the one found in the research community. Separating the two reasoning modes in evaluation and communication is a necessary step toward accurate measurement of AI capabilities.