Axiomatic Reasoning for LLMs

Theoretical Capability of Maximally Stacked Agentic Deep Coding

Overview

An integrated agentic workflow for software engineering combines four architectural layers: (1) process-driven planning and execution, (2) structured recording of design decisions and intent, (3) adaptive deep research triggered by task complexity, and (4) multi-model orchestration via synthesis rather than voting. Each layer independently improves code generation quality; together they form a system whose theoretical performance ceiling approaches the limits set by information completeness rather than model capability.

Architecture

Layer 1: Process-Driven Planning and Execution

Closed-loop pipelines (e.g., Plan → Design → Implement → Verify → feedback to Plan) outperform autonomous or single-agent approaches. Three results illustrate this. FlowGen Scrum achieves +15% Pass@1 on HumanEval over raw generation. Blueprint2Code reaches 96.3% on HumanEval using a four-agent Preview → Blueprint → Code → Debug loop. Graphectory demonstrates that real-time trajectory monitoring and intervention raise the resolution rate by 6.9% to 23.5%. The core mechanism in all three is the separation of generation and verification stages, connected by structured handoffs.
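The loop described above can be sketched in a few lines. This is a minimal illustration, not the implementation used by any of the cited systems: the `Handoff` schema, the `run_closed_loop` function, and the stub stages are hypothetical names, and a real system would back each stage with a model call.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Handoff:
    """Structured artifact passed between stages (hypothetical schema)."""
    task: str
    plan: str = ""
    design: str = ""
    code: str = ""
    feedback: List[str] = field(default_factory=list)


@dataclass
class LoopResult:
    state: Handoff
    rounds: int
    passed: bool


Stage = Callable[[Handoff], str]
Verifier = Callable[[Handoff], List[str]]


def run_closed_loop(task: str, plan: Stage, design: Stage, implement: Stage,
                    verify: Verifier, max_rounds: int = 3) -> LoopResult:
    """Run Plan -> Design -> Implement -> Verify, feeding failures back to Plan."""
    state = Handoff(task=task)
    for round_no in range(1, max_rounds + 1):
        state.plan = plan(state)
        state.design = design(state)
        state.code = implement(state)
        problems = verify(state)  # verification is a separate stage
        if not problems:
            return LoopResult(state, round_no, True)
        state.feedback.extend(problems)  # the next round sees every failure so far
    return LoopResult(state, max_rounds, False)


# Stub stages for illustration only; each would be a model call in practice.
def stub_plan(s: Handoff) -> str:
    return f"plan for {s.task!r} with {len(s.feedback)} known issue(s)"


def stub_design(s: Handoff) -> str:
    return "def add(a, b) -> int"


def stub_implement(s: Handoff) -> str:
    # The first attempt is deliberately wrong; it is corrected once feedback exists.
    body = "a + b" if s.feedback else "a - b"
    return f"def add(a, b):\n    return {body}"


def stub_verify(s: Handoff) -> List[str]:
    namespace: dict = {}
    exec(s.code, namespace)  # acceptable here because the code is our own stub
    return [] if namespace["add"](2, 3) == 5 else ["add(2, 3) should return 5"]
```

With these stubs, `run_closed_loop("implement add", stub_plan, stub_design, stub_implement, stub_verify)` fails verification in the first round and passes in the second, because the recorded feedback reaches the implementer through the handoff object.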

Layer 2: Structured Recording of Design Decisions and Intent

Externalized architectural knowledge and rationale (“why” rather than “what”) improve LLM compliance and reasoning quality. The sqlew system shows a 10.1% reduction in development time and a shift from trial-and-error to context-informed reasoning when architecture decision records are accessible. Supplying design rationale also increases automated program repair accuracy by a factor of 4.7 (DRCodePilot) and yields a 39% relative increase in task success (Software Design Documents study). The form of the record matters, however: monolithic context files (e.g., AGENTS.md) can degrade success rates by 12–28% because they impose excessive constraints.
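One way to keep rationale accessible without building a monolithic context file is to store each decision as its own record and load only the records relevant to the current task. The sketch below assumes a simple tag-overlap selection; `DecisionRecord`, `select_records`, and `render_context` are hypothetical names and do not reflect how sqlew or DRCodePilot store or retrieve decisions.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class DecisionRecord:
    record_id: str
    decision: str          # the "what"
    rationale: str         # the "why"
    tags: FrozenSet[str]   # topics used for selective lookup


def select_records(records: Iterable[DecisionRecord],
                   task_tags: FrozenSet[str],
                   limit: int = 3) -> List[DecisionRecord]:
    """Return records sharing at least one tag with the task, most overlap first."""
    scored = [(len(r.tags & task_tags), r) for r in records]
    relevant = [(overlap, r) for overlap, r in scored if overlap > 0]
    # Sort by descending overlap, then by id so the order is deterministic.
    relevant.sort(key=lambda pair: (-pair[0], pair[1].record_id))
    return [r for _, r in relevant[:limit]]


def render_context(records: Iterable[DecisionRecord]) -> str:
    """Format the selected records so the rationale travels with each decision."""
    return "\n".join(
        f"[{r.record_id}] {r.decision} BECAUSE {r.rationale}" for r in records
    )


RECORDS = [
    DecisionRecord("ADR-001", "Use the ORM for all queries",
                   "raw SQL caused two injection bugs", frozenset({"db", "orm"})),
    DecisionRecord("ADR-002", "Tokens expire after 15 minutes",
                   "limits the damage from a leaked token", frozenset({"auth"})),
    DecisionRecord("ADR-003", "One migration per schema change",
                   "keeps rollbacks small", frozenset({"db"})),
]
```

For a task tagged `{"db", "orm"}`, only ADR-001 and ADR-003 are loaded into context; the unrelated authentication decision is left out.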

Layer 3: Adaptive Deep Research Integration

Automatic, complexity-triggered retrieval fills the information gap between unaided agents and the 97%+ success achievable when all necessary signals are provided (Oracle-SWE). Frameworks like MemGovern use a “Search-then-Browse” pipeline over 135,000 structured experience cards to improve SWE-bench Verified scores by up to 9.4%. Dynamic gating (CAR, TARG, QR³AG) reduces unnecessary retrieval, cutting token consumption by up to 60% while maintaining accuracy. AgentIR leverages reasoning traces to boost retrieval relevance from 37% (BM25) to 68%. Integrating retrieval into coding agents remains a non-trivial but solvable engineering challenge.
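The gating idea can be illustrated with a threshold on a task-complexity score: retrieval runs only when the score is high enough. The sketch below is an assumption-laden toy, not the method of CAR, TARG, or QR³AG; the `Task` fields, the weights in `complexity_score`, and the default threshold are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    description: str
    files_touched: int
    unknown_symbols: int  # identifiers the agent could not resolve locally


def complexity_score(task: Task) -> float:
    """Toy heuristic; the weights are illustrative, not tuned or taken from a paper."""
    return (0.1 * task.files_touched
            + 0.3 * task.unknown_symbols
            + 0.001 * len(task.description))


def maybe_retrieve(task: Task,
                   retriever: Callable[[str], List[str]],
                   threshold: float = 1.0) -> List[str]:
    """Call the retriever only when the task looks complex enough to need it."""
    if complexity_score(task) < threshold:
        return []  # simple task: spend no tokens on retrieval
    return retriever(task.description)
```

A one-file rename with no unresolved symbols scores well below the threshold and skips retrieval, while a five-file change with three unresolved symbols triggers it. Skipping retrieval for simple tasks is where gating saves tokens.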

Layer 4: Multi-Model Synthesis over Voting

Majority voting discards information; structured synthesis preserves it and increases information density. In coding, pipeline-based multi-agent systems improve on a single agent (65% → 72.2% on SWE-bench), whereas mesh-like collaboration causes performance drops of 39–70% (Google/MIT, Berkeley MAST). Adversarial review patterns (Star Chamber’s three-tier consensus classification, Crossfire’s generate → review → synthesize loop) and belief-calibrated consensus (BCCS) further refine output. Meta FAIR’s AggLM demonstrates that training an aggregator to synthesize all candidate solutions into a single answer beats majority voting across math benchmarks; Mixture-of-Agents (MoA) uses layered synthesis to surpass GPT-4o on AlpacaEval with open-source models alone. The optimal aggregation strategy is additive accumulation of all ideas (test cases, review comments) followed by a synthesis step that produces a new, integrated implementation.
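The difference between the two aggregation strategies can be made concrete. In the sketch below, `majority_vote` keeps only the most common implementation, while `accumulate` pools every test case and review comment before a synthesis step. All names are hypothetical, and the synthesis itself is represented only by the prompt that would be sent to a model; this is not the AggLM or MoA procedure.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Candidate:
    model: str
    code: str
    tests: List[str] = field(default_factory=list)
    comments: List[str] = field(default_factory=list)


def majority_vote(candidates: List[Candidate]) -> str:
    """Return the most common implementation; all other content is discarded."""
    return Counter(c.code for c in candidates).most_common(1)[0][0]


def _dedupe(items: List[str]) -> List[str]:
    """Remove duplicates while preserving first-seen order."""
    return list(dict.fromkeys(items))


def accumulate(candidates: List[Candidate]) -> Dict[str, List[str]]:
    """Additive accumulation: keep every distinct implementation, test, and comment."""
    return {
        "implementations": _dedupe([c.code for c in candidates]),
        "tests": _dedupe([t for c in candidates for t in c.tests]),
        "comments": _dedupe([m for c in candidates for m in c.comments]),
    }


def build_synthesis_prompt(pool: Dict[str, List[str]]) -> str:
    """Assemble the input for a synthesizer model (the model call is out of scope)."""
    sections = []
    for title, items in pool.items():
        body = "\n".join(f"- {item}" for item in items)
        sections.append(f"{title.upper()}:\n{body}")
    return ("Write one new implementation that satisfies every test "
            "and addresses every comment.\n\n" + "\n\n".join(sections))


CANDIDATES = [
    Candidate("model-a", "return max(xs)", tests=["max of [1, 3] is 3"]),
    Candidate("model-b", "return max(xs)", tests=["max of [1, 3] is 3"]),
    Candidate("model-c", "return max(xs) if xs else None",
              tests=["empty list returns None"],
              comments=["max() raises ValueError on an empty list"]),
]
```

Here voting returns `return max(xs)` and loses the empty-list test case contributed by the minority candidate; accumulation keeps it, so the synthesizer sees the edge case.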

The table below summarizes the estimated isolated contribution of each layer; the compounded effect is estimated in the next section.

| Layer | Mechanism | Estimated Isolated Contribution |
|---|---|---|
| 1 | Process-driven closed-loop pipeline | +8–15% |
| 2 | Design decision records (why > what) | +5–12% |
| 3 | Complexity-triggered deep research | +10–18% |
| 4 | Multi-model synthesis (information-maximizing) | +8–15% |

Estimated Performance Ceilings

Conservative and optimistic extrapolations of the integrated system, using the current best unaided benchmark results as baselines:

| Benchmark | Current Best Single/Unaided | 4-Layer Integrated (Conservative) | 4-Layer Integrated (Optimistic) |
|---|---|---|---|
| SWE-bench Verified (autonomous) | 77.6% | 92–95% | 97%+ |
| Vibe Code Bench (end-to-end app) | 61.8% | 78–85% | 88–93% |
| BeyondSWE (cross-repo, domain-specific) | <45% | 62–75% | 75–85% |

The residual gap toward 100% stems primarily from the inherent ambiguity of underspecified tasks and the lack of a fully realized integrated implementation, rather than fundamental model limitations. Self-play reinforcement learning (Self-play SWE-RL) autonomously improves code generation by +10.4 points on SWE-bench Verified and suggests that performance ceilings may rise with continued agent training.
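One way to read the extrapolations is to ask what share of the remaining headroom (the distance from the baseline to 100%) each conservative estimate closes. The arithmetic below is an interpretive aid, not the method used to produce the estimates; the function names are hypothetical, and BeyondSWE is omitted because its baseline is given only as an upper bound (<45%).

```python
from typing import Dict, Tuple


def gap_closed(baseline: float, target: float) -> float:
    """Fraction of the headroom (100 - baseline) covered by moving to target."""
    if not 0.0 <= baseline < 100.0:
        raise ValueError("baseline must be in [0, 100)")
    return (target - baseline) / (100.0 - baseline)


# (baseline, conservative low, conservative high), copied from the table above.
TABLE: Dict[str, Tuple[float, float, float]] = {
    "SWE-bench Verified": (77.6, 92.0, 95.0),
    "Vibe Code Bench": (61.8, 78.0, 85.0),
}


def conservative_gap_report() -> Dict[str, Tuple[float, float]]:
    """Map each benchmark to the (low, high) fraction of headroom closed."""
    return {
        name: (gap_closed(base, low), gap_closed(base, high))
        for name, (base, low, high) in TABLE.items()
    }


for name, (low, high) in conservative_gap_report().items():
    print(f"{name}: {low:.1%} to {high:.1%} of headroom closed")
# SWE-bench Verified: 64.3% to 77.7% of headroom closed
# Vibe Code Bench: 42.4% to 60.7% of headroom closed
```

Under this reading, the conservative estimate assumes that the integrated system removes roughly two-thirds to three-quarters of the remaining SWE-bench Verified failures, and a smaller share on the end-to-end benchmark.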

Evaluation Framework Requirements

Existing benchmarks (SWE-bench, etc.) exhibit data contamination, test design flaws (an audit reported defective tests in 59.4% of the SWE-bench Verified problems it examined), and vulnerability to reward hacking. A suitable evaluation framework for the integrated system must cover the following dimensions:

| Dimension | Definition | Measurement |
|---|---|---|
| Task Success Rate | Fraction of tasks meeting all acceptance criteria | Agent-driven fuzzing (ProgramBench style) |
| Cost Efficiency | Success rate per unit token expenditure | Centralized cost tracking (HAL) |
| Process Quality | Completeness of ADR chains, review coverage | Phase-log analysis |
| Robustness | Stability of solutions under minor specification changes | Mutation testing |
| Accountability | Traceability of every design decision to its rationale | ADR chain completeness score |
| Human-AI Synergy | Performance gain per unit of human involvement | Centaur production function Y = f(K,L) |
| Self-Improvement Rate | Learning slope across repeated similar tasks | Success rate trajectory across iterations |

These dimensions are naturally captured by the phase-structured logging (Phase-X_result.md) inherent in the Deep Coding loop, turning every development session into reusable evaluation data.
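Three of the dimensions above reduce to simple computations once session logs are parsed into records. The sketch below assumes a hypothetical `SessionLog` structure (the actual layout of the phase logs is not specified here) and shows one plausible formula for each of cost efficiency, ADR chain completeness, and self-improvement rate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SessionLog:
    """One parsed session; the field names are assumptions, not a defined format."""
    iteration: int
    succeeded: bool
    tokens_used: int
    decisions: List[Optional[str]]  # rationale per decision, None when missing


def cost_efficiency(logs: List[SessionLog]) -> float:
    """Successful tasks per 1,000 tokens spent."""
    total_tokens = sum(log.tokens_used for log in logs)
    if total_tokens == 0:
        return 0.0
    return 1000 * sum(log.succeeded for log in logs) / total_tokens


def adr_completeness(logs: List[SessionLog]) -> float:
    """Fraction of recorded decisions that carry a rationale."""
    decisions = [d for log in logs for d in log.decisions]
    if not decisions:
        return 0.0  # nothing recorded, so nothing is traceable
    return sum(d is not None for d in decisions) / len(decisions)


def improvement_slope(logs: List[SessionLog]) -> float:
    """Least-squares slope of success (0 or 1) against iteration number."""
    n = len(logs)
    if n == 0:
        return 0.0
    xs = [float(log.iteration) for log in logs]
    ys = [float(log.succeeded) for log in logs]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0  # a single iteration has no trend
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


LOGS = [
    SessionLog(1, False, 1000, ["cache reads", None]),
    SessionLog(2, False, 1000, ["retry on timeout"]),
    SessionLog(3, True, 1000, ["use ORM", "index user_id"]),
    SessionLog(4, True, 1000, [None]),
]
```

For these four sessions, cost efficiency is 0.5 successes per 1,000 tokens, four of six decisions carry a rationale, and the success trajectory has a slope of 0.4 per iteration.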

Core Design Principles

  1. Why over What: Supply rationale, not just instructions.
  2. Pipeline over Mesh: Fixed sequential stages with structured handoffs.
  3. Synthesis over Voting: Combine all information into a new generation; never discard minority insights.
  4. File Segmentation over Monolithic Context: Separate, selectively referenced records prevent overload.
  5. Adversarial Review over Collaborative Discussion: Generation and criticism must be separated; reviewers critique and do not negotiate.
  6. Structured Output over Free-Form Dialogue: Machine-verifiable JSON contracts enable precise consensus.
  7. Context Reset over Continuous Compression: For long tasks, a clean slate with structured handoff preserves quality.
  8. User Gate as Information Channel: Interaction points clarify underspecification rather than merely approve.
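Principle 6 can be illustrated with a small contract check on a reviewer's output. The field names and allowed verdicts below are assumptions chosen for the example, not a schema defined by any of the systems discussed; the sketch uses only the standard-library `json` module.

```python
import json
from typing import List, Optional, Tuple

# Hypothetical contract: every review must carry these fields with these types.
REQUIRED_FIELDS = {"verdict": str, "issues": list, "rationale": str}
ALLOWED_VERDICTS = {"approve", "revise", "reject"}


def parse_review(raw: str) -> Tuple[Optional[dict], List[str]]:
    """Return (review, errors); review is None whenever the contract is violated."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]

    errors: List[str] = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            errors.append(f"field {name} must be of type {expected.__name__}")

    verdict = data.get("verdict")
    if isinstance(verdict, str) and verdict not in ALLOWED_VERDICTS:
        errors.append(f"unknown verdict: {verdict}")

    return (data if not errors else None), errors
```

Because every reviewer emits the same machine-checkable shape, verdicts can be compared mechanically, and a malformed review can be rejected or regenerated before it reaches the synthesis step.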

Conclusion

The four-layer integration maps directly to the cognitive protocol required for complex software creation. Process structuring, rationale persistence, adaptive information retrieval, and synthesis-based multi-model collaboration form a coherent system whose theoretical upper bound on SWE-bench Verified is estimated at 92–97%+, with lower estimates for end-to-end and cross-repository benchmarks. Reaching this bound requires an implementation that respects the architectural principles outlined above, not a breakthrough in model intelligence.