Axiomatic Reasoning for LLMs

How do you evaluate a prompt?

1. Evaluation as a Multi-Dimensional Problem

Prompt evaluation requires separating two distinct axes of instruction following: output format adherence and reasoning process adherence.

Empirical findings show an inversion: output format adherence improves task‑solving performance (1–6% relative gains when format is decoupled from task content), while reasoning process adherence degrades instruction‑following accuracy (up to 75% of reasoning traces fail to follow constraints). The two axes are not opposed; they interact through the structural properties of the prompt.
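The decoupling behind those gains can be illustrated with a minimal sketch: the prompt carries only the task, and the output format is enforced by code afterwards rather than by in‑prompt instructions. `call_model` is a hypothetical stand‑in for any LLM API.

```python
# Sketch of format/task decoupling: the model solves the task in free form,
# and a separate post-processing step enforces the output format.
import json

def call_model(prompt: str) -> str:
    # Placeholder: a real version would call an LLM endpoint.
    return "The answer is 42."

def solve_then_format(task_prompt: str) -> str:
    """Run the task without format constraints, then impose JSON afterwards."""
    raw = call_model(task_prompt)                # task-only prompt, no format rules
    return json.dumps({"answer": raw.strip()})  # format enforced by code, not prompt

result = solve_then_format("What is 6 * 7?")
```

Because the format lives outside the prompt, format errors become parser errors rather than silent task failures.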

2. Structural Properties That Enable Evaluation

A prompt’s evaluability depends on how it organizes instructions. Three structural properties correlate with successful measurement:

| Property | Definition | Evaluation test |
|---|---|---|
| Step atomicity | Each instruction performs one logical operation. | Can a single step be removed without breaking other steps? |
| Causal traceability | Output of each step is explicitly used in subsequent steps. | Does changing a step’s output change the final answer? |
| Verification hooks | Steps can be tested independently. | Can a step be truncated or counterfactually substituted? |

When these properties are present, a user can apply the tests in the table above: remove individual steps, trace each step’s effect on the final answer, and substitute counterfactual step outputs.
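The step‑removal (truncation) test can be sketched as a simple harness. `run_prompt` is a deterministic toy stand‑in for the model call, chosen so the example is reproducible; a real version would rerun the LLM on the truncated prompt.

```python
# Sketch of a step-necessity (truncation) test over a list of prompt steps.

def run_prompt(steps):
    # Toy stand-in: only steps tagged "needed" affect the "answer".
    # A real version would send the steps to an LLM and return its output.
    return sum(1 for s in steps if "needed" in s)

def necessary_steps(steps):
    """Return indices of steps whose removal changes the final output."""
    baseline = run_prompt(steps)
    needed = []
    for i in range(len(steps)):
        truncated = steps[:i] + steps[i + 1:]   # drop exactly one step
        if run_prompt(truncated) != baseline:   # output changed -> step is causal
            needed.append(i)
    return needed

steps = ["needed: parse input", "restate the question", "needed: add totals"]
print(necessary_steps(steps))  # → [0, 2]
```

Steps absent from the returned index list are candidates for the scaffolding or decorative modes described below.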

3. Faithfulness Modes of Chain‑of‑Thought

Chain‑of‑thought (CoT) traces are not literal records of internal computation. Three operational modes exist, each requiring different evaluation criteria:

| Mode | Definition | Evaluation approach |
|---|---|---|
| Genuine | Steps are necessary; removing them changes output. | Truncation test: output changes when steps are removed. |
| Scaffolding | Steps provide structural support but are not causally required. | Output unchanged after removal, but generation fluency degrades. |
| Decorative | Steps are post‑hoc rationalizations; removal does not change output. | Output unchanged; removal has no effect on fluency. |

Evaluators must distinguish these modes. Decorative steps inflate perceived reasoning quality without contributing to correctness.
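The three modes reduce to a two‑probe decision rule. In this sketch, `output_changed` and `fluency_degraded` are assumed to come from hypothetical probes: the first reruns the task with the step removed, the second measures generation quality (e.g., perplexity) without it.

```python
# Decision rule mapping the two probe results to a faithfulness mode.

def classify_step(output_changed: bool, fluency_degraded: bool) -> str:
    """Classify one CoT step given the results of two ablation probes."""
    if output_changed:
        return "genuine"       # causally required: removal changes the answer
    if fluency_degraded:
        return "scaffolding"   # structural support only
    return "decorative"        # post-hoc rationalization; no measurable effect

print(classify_step(False, False))  # → decorative
```

Only steps classified as genuine should count toward perceived reasoning quality.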

4. Task and Model Dependencies

Prompt evaluation is not task‑agnostic. Structured prompts yield measurable gains only in tasks requiring multi‑step logical transformation.

For pattern recognition, factual recall, or simple classification, evaluating prompt structure beyond basic format instructions provides no advantage over direct output evaluation.

Model architecture also modifies the evaluation criteria.

5. Quantitative and Qualitative Metrics

A complete prompt evaluation uses multiple metrics:

| Metric | Measurement | Interpretation |
|---|---|---|
| Output format accuracy | Percentage of responses matching specified syntax. | High values indicate format adherence but do not guarantee semantic correctness. |
| Semantic coherence | Logical consistency of content independent of format. | Measured via counterfactual substitution or verification chains. |
| Silent commitment failure rate | Proportion of confident, fluent outputs that contain undetectable errors. | Requires separate verification channel; correlates with over‑trust in plausible explanations. |
| Deep‑thinking token ratio (DTR) | Proportion of tokens that cause large shifts in the model’s predictive distribution. | Higher DTR correlates with reasoning depth (r = 0.82 with accuracy). |
| Overthinking / underthinking balance | Ratio of redundant reasoning steps to missing necessary steps. | Optimal balance depends on task difficulty; uniform length constraints degrade performance. |
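The DTR metric can be sketched concretely, assuming access to the model’s next‑token distribution before and after each generated token. The KL‑divergence probe and the 0.5‑nat threshold here are illustrative assumptions, not a fixed definition.

```python
# Sketch of the deep-thinking token ratio (DTR): the fraction of generated
# tokens whose emission shifts the model's predictive distribution by more
# than a threshold, measured as KL divergence between consecutive states.
import math

def kl(p, q):
    """KL divergence KL(p || q) over two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def deep_thinking_token_ratio(dists, threshold=0.5):
    """DTR over a sequence of per-token predictive distributions."""
    shifts = [kl(dists[i + 1], dists[i]) for i in range(len(dists) - 1)]
    deep = sum(1 for s in shifts if s > threshold)
    return deep / len(shifts) if shifts else 0.0

# Toy distributions over a 3-token vocabulary, one per generated token:
dists = [
    [0.60, 0.30, 0.10],
    [0.58, 0.31, 0.11],   # tiny shift  -> not a deep-thinking token
    [0.05, 0.05, 0.90],   # large shift -> deep-thinking token
]
dtr = deep_thinking_token_ratio(dists)  # 1 of 2 transitions exceeds threshold
```

Low DTR over a long output is the quantitative signature of decorative reasoning.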

6. Practical Evaluation Procedure

To evaluate a prompt, execute the following steps:

  1. Separate format from task – Run the prompt with and without output format constraints. If format constraints degrade task performance, the prompt needs decoupling (e.g., moving format specifications to a separate parser or post‑processing step).

  2. Test step necessity – For each reasoning step in the prompt, remove it and compare outputs. Steps whose removal changes the final answer are genuine; steps whose removal does not are scaffolding or decorative.

  3. Measure token efficiency – Compute the DTR for the model’s response. Low DTR with long output indicates decorative or lazy reasoning. Short output with high DTR indicates efficient reasoning.

  4. Check for silent commitment – Use a verification chain (same task, no original CoT visible) on a subset of outputs. Compare results. Disagreements indicate silent commitment failures.

  5. Adapt to task complexity – For simple tasks, remove all reasoning instructions. For complex tasks, use adaptive thinking budgets (allocate fewer tokens to easy sub‑steps, more to hard sub‑steps).
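The silent‑commitment check in step 4 can be sketched as follows. `solve` and `verify` are hypothetical model calls: the first is the main chain, the second re‑solves the same task without seeing the original CoT; here both are deterministic toys so the example is reproducible.

```python
# Sketch of the silent-commitment check: re-solve each task through an
# independent verification chain and count disagreements with the main chain.

def solve(task):
    # Toy main chain: confidently wrong on hard tasks.
    return {"answer": task["gold"] if task["easy"] else "wrong-but-fluent"}

def verify(task):
    # Toy verification chain: same task, original CoT not visible.
    return {"answer": task["gold"]}

def silent_commitment_rate(tasks):
    """Fraction of outputs where the main and verification chains disagree."""
    disagreements = sum(
        1 for t in tasks if solve(t)["answer"] != verify(t)["answer"]
    )
    return disagreements / len(tasks)

tasks = [
    {"gold": "4", "easy": True},
    {"gold": "7", "easy": False},   # confident, fluent, and wrong
]
print(silent_commitment_rate(tasks))  # → 0.5
```

Each disagreement marks an output that looked fluent on the main chain but fails the independent check.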

7. Summary of Evaluation Criteria

A well‑evaluated prompt meets these conditions: