Axiomatic Reasoning for LLMs

How do you evaluate a prompt?

1. Evaluation as a Multi-Dimensional Problem

Prompt evaluation requires separating two distinct axes of instruction following: output format adherence and reasoning process adherence.

Empirical findings show an inversion: output format adherence improves task‑solving performance (1–6% relative gains when format is decoupled from task content), while reasoning process adherence degrades instruction‑following accuracy (up to 75% of reasoning traces fail to follow constraints). The two axes are not opposed; they interact through the structural properties of the prompt.
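The decoupling behind those gains can be illustrated with a minimal sketch: the prompt carries only the task, and the output format is enforced by code afterwards rather than by in‑prompt instructions. `call_model` is a hypothetical stand‑in for any LLM API.

```python
# Sketch of format/task decoupling: the model solves the task in free form,
# and a separate post-processing step enforces the output format.
import json

def call_model(prompt: str) -> str:
    # Placeholder: a real version would call an LLM endpoint.
    return "The answer is 42."

def solve_then_format(task_prompt: str) -> str:
    """Run the task without format constraints, then impose JSON afterwards."""
    raw = call_model(task_prompt)                # task-only prompt, no format rules
    return json.dumps({"answer": raw.strip()})  # format enforced by code, not prompt

result = solve_then_format("What is 6 * 7?")
```

Because the format lives outside the prompt, format errors become parser errors rather than silent task failures.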

2. Structural Properties That Enable Evaluation

A prompt’s evaluability depends on how it organizes instructions. Three structural properties correlate with successful measurement:

| Property | Definition | Evaluation test |
|---|---|---|
| Step atomicity | Each instruction performs one logical operation. | Can a single step be removed without breaking other steps? |
| Causal traceability | Output of each step is explicitly used in subsequent steps. | Does changing a step’s output change the final answer? |
| Verification hooks | Steps can be tested independently. | Can a step be truncated or counterfactually substituted? |

When these properties are present, a user can apply the tests in the table above: remove individual steps, trace each step’s effect on the final answer, and substitute counterfactual step outputs.
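The step‑removal (truncation) test can be sketched as a simple harness. `run_prompt` is a deterministic toy stand‑in for the model call, chosen so the example is reproducible; a real version would rerun the LLM on the truncated prompt.

```python
# Sketch of a step-necessity (truncation) test over a list of prompt steps.

def run_prompt(steps):
    # Toy stand-in: only steps tagged "needed" affect the "answer".
    # A real version would send the steps to an LLM and return its output.
    return sum(1 for s in steps if "needed" in s)

def necessary_steps(steps):
    """Return indices of steps whose removal changes the final output."""
    baseline = run_prompt(steps)
    needed = []
    for i in range(len(steps)):
        truncated = steps[:i] + steps[i + 1:]   # drop exactly one step
        if run_prompt(truncated) != baseline:   # output changed -> step is causal
            needed.append(i)
    return needed

steps = ["needed: parse input", "restate the question", "needed: add totals"]
print(necessary_steps(steps))  # → [0, 2]
```

Steps absent from the returned index list are candidates for the scaffolding or decorative modes described below.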

3. Faithfulness Modes of Chain‑of‑Thought

Chain‑of‑thought (CoT) traces are not literal records of internal computation. Three operational modes exist, each requiring different evaluation criteria:

| Mode | Definition | Evaluation approach |
|---|---|---|
| Genuine | Steps are necessary; removing them changes output. | Truncation test: output changes when steps are removed. |
| Scaffolding | Steps provide structural support but are not causally required. | Output unchanged after removal, but generation fluency degrades. |
| Decorative | Steps are post‑hoc rationalizations; removal does not change output. | Output unchanged; removal has no effect on fluency. |

Evaluators must distinguish these modes. Decorative steps inflate perceived reasoning quality without contributing to correctness.
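The three modes reduce to a two‑probe decision rule. In this sketch, `output_changed` and `fluency_degraded` are assumed to come from hypothetical probes: the first reruns the task with the step removed, the second measures generation quality (e.g., perplexity) without it.

```python
# Decision rule mapping the two probe results to a faithfulness mode.

def classify_step(output_changed: bool, fluency_degraded: bool) -> str:
    """Classify one CoT step given the results of two ablation probes."""
    if output_changed:
        return "genuine"       # causally required: removal changes the answer
    if fluency_degraded:
        return "scaffolding"   # structural support only
    return "decorative"        # post-hoc rationalization; no measurable effect

print(classify_step(False, False))  # → decorative
```

Only steps classified as genuine should count toward perceived reasoning quality.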

4. Task and Model Dependencies

Prompt evaluation is not task‑agnostic. Structured prompts yield measurable gains only in tasks requiring multi‑step logical transformation.

For pattern recognition, factual recall, or simple classification, evaluating prompt structure beyond basic format instructions provides no advantage over direct output evaluation.

Model architecture also modifies the evaluation criteria.

5. Quantitative and Qualitative Metrics

A complete prompt evaluation uses multiple metrics:

| Metric | Measurement | Interpretation |
|---|---|---|
| Output format accuracy | Percentage of responses matching specified syntax. | High values indicate format adherence but do not guarantee semantic correctness. |
| Semantic coherence | Logical consistency of content independent of format. | Measured via counterfactual substitution or verification chains. |
| Silent commitment failure rate | Proportion of confident, fluent outputs that contain undetectable errors. | Requires separate verification channel; correlates with over‑trust in plausible explanations. |
| Deep‑thinking token ratio (DTR) | Proportion of tokens that cause large shifts in the model’s predictive distribution. | Higher DTR correlates with reasoning depth (r = 0.82 with accuracy). |
| Overthinking / underthinking balance | Ratio of redundant reasoning steps to missing necessary steps. | Optimal balance depends on task difficulty; uniform length constraints degrade performance. |
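The DTR metric can be sketched concretely, assuming access to the model’s next‑token distribution before and after each generated token. The KL‑divergence probe and the 0.5‑nat threshold here are illustrative assumptions, not a fixed definition.

```python
# Sketch of the deep-thinking token ratio (DTR): the fraction of generated
# tokens whose emission shifts the model's predictive distribution by more
# than a threshold, measured as KL divergence between consecutive states.
import math

def kl(p, q):
    """KL divergence KL(p || q) over two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def deep_thinking_token_ratio(dists, threshold=0.5):
    """DTR over a sequence of per-token predictive distributions."""
    shifts = [kl(dists[i + 1], dists[i]) for i in range(len(dists) - 1)]
    deep = sum(1 for s in shifts if s > threshold)
    return deep / len(shifts) if shifts else 0.0

# Toy distributions over a 3-token vocabulary, one per generated token:
dists = [
    [0.60, 0.30, 0.10],
    [0.58, 0.31, 0.11],   # tiny shift  -> not a deep-thinking token
    [0.05, 0.05, 0.90],   # large shift -> deep-thinking token
]
dtr = deep_thinking_token_ratio(dists)  # 1 of 2 transitions exceeds threshold
```

Low DTR over a long output is the quantitative signature of decorative reasoning.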

6. Practical Evaluation Procedure

To evaluate a prompt, execute the following steps:

  1. Separate format from task – Run the prompt with and without output format constraints. If format constraints degrade task performance, the prompt needs decoupling (e.g., moving format specifications to a separate parser or post‑processing step).

  2. Test step necessity – For each reasoning step in the prompt, remove it and compare outputs. Steps whose removal changes the final answer are genuine; steps whose removal does not are scaffolding or decorative.

  3. Measure token efficiency – Compute the DTR for the model’s response. Low DTR with long output indicates decorative or lazy reasoning. Short output with high DTR indicates efficient reasoning.

  4. Check for silent commitment – Use a verification chain (same task, no original CoT visible) on a subset of outputs. Compare results. Disagreements indicate silent commitment failures.

  5. Adapt to task complexity – For simple tasks, remove all reasoning instructions. For complex tasks, use adaptive thinking budgets (allocate fewer tokens to easy sub‑steps, more to hard sub‑steps).
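The silent‑commitment check in step 4 can be sketched as follows. `solve` and `verify` are hypothetical model calls: the first is the main chain, the second re‑solves the same task without seeing the original CoT; here both are deterministic toys so the example is reproducible.

```python
# Sketch of the silent-commitment check: re-solve each task through an
# independent verification chain and count disagreements with the main chain.

def solve(task):
    # Toy main chain: confidently wrong on hard tasks.
    return {"answer": task["gold"] if task["easy"] else "wrong-but-fluent"}

def verify(task):
    # Toy verification chain: same task, original CoT not visible.
    return {"answer": task["gold"]}

def silent_commitment_rate(tasks):
    """Fraction of outputs where the main and verification chains disagree."""
    disagreements = sum(
        1 for t in tasks if solve(t)["answer"] != verify(t)["answer"]
    )
    return disagreements / len(tasks)

tasks = [
    {"gold": "4", "easy": True},
    {"gold": "7", "easy": False},   # confident, fluent, and wrong
]
print(silent_commitment_rate(tasks))  # → 0.5
```

Each disagreement marks an output that looked fluent on the main chain but fails the independent check.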

7. Summary of Evaluation Criteria

A well‑evaluated prompt meets these conditions: