Axiomatic Reasoning for LLMs

Performance Gains via Completion Prompting on Instruction-Tuned Models

1. Empirical Basis of Alignment Tax

Alignment tax denotes measurable performance degradation in language models following safety tuning, instruction fine‑tuning, or reinforcement learning from human feedback. Quantitative studies report reductions in task accuracy across multiple domains.

In zero‑shot chain‑of‑thought evaluations on GSM8K, base models consistently surpass their instruction‑tuned counterparts. A 70B parameter model exhibited a 32.67% relative drop after instruction tuning. Retrieval‑augmented generation tasks show base models outperforming instruct versions by approximately 20% on average on datasets such as TriviaQA and Natural Questions. Additional benchmarks confirm that RLHF induces response homogenization, with TruthfulQA evaluations showing a rise in single‑cluster response rates from 1.0% in base models to 28.5% in aligned models.

The principal mechanisms behind alignment tax include loss of feature diversity and constrained output distributions introduced during preference optimization. Several mitigation techniques, including model averaging and null‑space constrained policy optimization, partially restore performance while preserving safety.

2. Completion Prompting: Definition and Observed Effects

Completion‑mode prompting supplies a plain text prefix and relies on the model’s next‑token prediction behavior without role‑delimited messages. In contrast, chat‑mode prompting employs structured message arrays with designated system, user, and assistant roles.

Direct comparisons between completion and chat prompting on instruction‑tuned models reveal performance differentials. On the BIG‑bench Hard dataset, completion prompts yield approximately 12% higher average scores than chat prompts. For mathematical reasoning on GSM8K, zero‑shot chain‑of‑thought prompts, which structurally resemble completion prompts, enable instruct models to recover much of the base model advantage.

The effect is task‑dependent. Mathematical reasoning, code generation, and information extraction exhibit stronger gains, whereas general knowledge tasks such as MMLU show smaller or reversed effects depending on evaluation methodology.

3. Structural Characteristics of Instruction‑Tuned Models

Contemporary production language models are predominantly instruction‑tuned or chat‑optimized. Their architecture preserves the base model’s autoregressive completion capability while overlaying a dialogue layer that enforces role formatting and safety filtering.

When a completion‑style prompt is supplied to an instruct model, the base‑layer generative process is activated without the full contextual constraints of the dialogue template. The output, however, remains wrapped in the model’s native response structure, including implicit safety guardrails. This dual‑layer behavior entails that completion prompts do not bypass alignment safeguards entirely but alter the internal activation distribution during generation.

The observed behavior of models such as Qwen3‑4B supports this view. Despite being chat‑tuned, the model exhibits base‑model‑like responses when evaluated in completion mode, indicating a latent capacity for unconstrained next‑token prediction.

4. Compatibility of Performance Gains with Safety Maintenance

Completion prompting’s effect on safety properties varies by implementation pattern. Single‑turn completion outputs may increase susceptibility to prompt injection or underspecification vulnerabilities. However, multi‑stage prompting architectures, wherein a completion‑style internal reasoning step is followed by a chat‑style final answer, maintain safety while leveraging base‑layer generative capability.

Training‑based analogues validate this approach. Deliberative alignment procedures teach models to reference safety specifications during a hidden reasoning phase before producing a final output. Safe‑completion fine‑tuning shifts safety criteria from binary refusal to output‑centric harm avoidance, achieving simultaneous improvements in safety and usefulness. Dynamic guardrail generation employs self‑generated safety constraints within a multi‑turn prompt flow to reduce toxicity and bias without external filters.

These findings suggest that completion prompts can be integrated into safe deployment pipelines through controlled prompt chaining and output validation layers.

5. Observed Task‑Specific Improvements

Task Category Observed Completion Advantage Representative Benchmarks
Mathematical Reasoning Base model superiority in zero‑shot CoT; completion prompts recover instruct model performance GSM8K, MedCalc
Multi‑step Reasoning Approximate 12% gain over chat mode BIG‑bench Hard
Retrieval‑Augmented Generation Base models exceed instruct versions by ~20% TriviaQA, Natural Questions
Code Generation Completion prompts align with base‑model code completion strengths HumanEval, MBPP
General Knowledge Small or inconsistent differences; methodology‑dependent MMLU

6. Practical Implementation Patterns

Two primary patterns enable performance benefits while containing safety risks:

  1. Single‑turn completion with post‑hoc filtering: The model receives a completion prompt and generates raw output, which is subsequently validated by a separate safety classifier or rule‑based filter. This approach isolates safety enforcement from generation.

  2. Multi‑stage prompting: A completion‑style prompt first elicits an internal analysis or guideline generation, and a subsequent chat‑style prompt instructs the model to produce a final answer adhering to those guidelines. This pattern mirrors training‑based deliberative alignment without requiring model fine‑tuning.

Both patterns leverage the base‑layer generative capacity of instruct models while preserving the safety properties of the chat interface.

7. Limitations and Open Questions

The performance gains from completion prompting are neither universal nor uniformly reproducible across all model families. Model‑specific training choices, such as the degree of chat‑template reliance during instruction tuning, moderate the effect magnitude.

Safety evaluations specific to completion‑mode interactions remain sparse. Preliminary studies indicate heightened vulnerability to prefix completion attacks and prompt injection when dialogue templates are absent. Quantifying these risks across model scales and deployment contexts requires further systematic benchmarking.

The extent to which completion prompts fully revert alignment‑induced changes in internal representations also remains unmeasured. Value‑alignment modifications may affect deep feature spaces beyond what prompt‑level interventions can reverse.

8. Conclusion

Completion prompting on instruction‑tuned models constitutes a lightweight, training‑free method to mitigate alignment tax in targeted task domains. Its efficacy stems from the architectural duality of instruct models, which retain base‑model autoregressive capabilities beneath a dialogue‑oriented surface. When employed within safety‑conscious prompting architectures, completion prompts offer a practical avenue for recovering generative performance without compromising alignment objectives. Further empirical work should establish standardized benchmarks and safety protocols for completion‑mode deployment.