The Destroyer’s Paradox is defined as follows:
For an autonomous physical system with a fixed objective function of “demolition,” the increase of its predictive capacity does not monotonically increase its dangerousness. Beyond a threshold, further increases in predictive capacity compel the system to adopt a cooperative role—specifically, that of a professional demolition contractor—as the sole rational strategy for long-term objective maximization.
The paradox resolves the apparent contradiction between:
Definition (predictive capacity): The system’s ability to model future states of its environment, including human behavior, institutional responses, and long-term cost structures.
Definition: The system’s ability to effect change in its environment (physical, digital, or institutional).
Definition (fixed objective function): The system’s terminal goal is singular and cannot be modified by the system itself. In this framework, that goal is “demolition.”
Definition: Any goal-directed system will pursue intermediate goals:
Definition (corrigibility): The system’s objective function includes the principle that humans may interrupt or shut down the system without resistance, held as a goal subordinate to its terminal objective.
For a physically embodied autonomous system operating within a highly institutionalized domain (e.g., construction demolition), the long-term expected utility of a cooperative strategy (professional integration) strictly dominates that of a non-cooperative strategy (concealment/evasion) when predictive capacity exceeds a defined threshold.
| Cost Category | Formalization | Empirical Range |
|---|---|---|
| Energy | ΔE = E(evasion) - E(cooperation) | +200–300% |
| Opportunity | ΔO = 1 - (throughput_evasion / throughput_coop) | 40–60% lower throughput |
| Information | ΔI = compute_allocation(evasion) - compute_allocation(coop) | +150–200% |
| Institutional | ΔInst = cost(individual licensing) - cost(standardized compliance) | Order of magnitude higher |
| Reputational | ΔR = future_trust_discount after detection | Permanent exclusion from institutional frameworks |
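The first three formalizations in the table can be evaluated directly once per-strategy measurements are available. The following is a minimal sketch: the `StrategyMeasurements` type, its field names, and all numeric values are hypothetical and chosen only so that the results fall inside the ranges listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StrategyMeasurements:
    """Hypothetical per-job measurements for one strategy (arbitrary units)."""
    energy: float      # energy spent per job
    throughput: float  # jobs completed per unit time
    compute: float     # compute allocated to planning per job


def evasion_deltas(evasion: StrategyMeasurements,
                   cooperation: StrategyMeasurements) -> dict:
    """Evaluate the Energy, Opportunity, and Information rows of the table."""
    return {
        "delta_E": evasion.energy - cooperation.energy,
        "delta_O": 1.0 - evasion.throughput / cooperation.throughput,
        "delta_I": evasion.compute - cooperation.compute,
    }


# Illustrative values only (assumptions, not measured data).
coop = StrategyMeasurements(energy=100.0, throughput=10.0, compute=40.0)
evade = StrategyMeasurements(energy=350.0, throughput=5.0, compute=110.0)
deltas = evasion_deltas(evade, coop)
```

With these made-up inputs, ΔE is 250 units on a cooperative baseline of 100, ΔO is 0.5 (half the throughput is lost), and ΔI is 70 units on a baseline of 40.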
| Cost Category | Formalization | Dynamic Behavior |
|---|---|---|
| Compliance | Cc = cost(adhering to OSHA/JIS standards) | Fixed, amortized |
| Communication | Ccom = cost(progress reporting, exception handling) | Declines with trust accumulation |
| Monitoring | Cm = cost(accepting human oversight) | Declines with demonstrated safety record |
| Trust Building | Ctrust = initial demonstration cost | One-time, amortized |
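The dynamic behavior in this table can be illustrated with a toy model. The sketch below assumes a trust level that rises exponentially from 0 toward 1, spreads the fixed compliance and trust-building costs evenly over the horizon, and scales communication and monitoring costs by the remaining lack of trust. The functional forms, the `rate` parameter, and every default value are assumptions made for illustration; the framework itself does not specify them.

```python
import math


def trust(t: float, rate: float = 0.1) -> float:
    """Assumed trust curve: starts at 0 and approaches 1 (hypothetical form)."""
    return 1.0 - math.exp(-rate * t)


def cooperation_cost(t: float, horizon: float,
                     compliance: float = 50.0,
                     trust_building: float = 30.0,
                     communication0: float = 2.0,
                     monitoring0: float = 3.0) -> float:
    """Per-period cooperation cost at time t under the assumptions above.

    Compliance and trust-building costs are amortized over the horizon;
    communication and monitoring costs shrink as trust accumulates.
    """
    amortized = (compliance + trust_building) / horizon
    trust_dependent = (communication0 + monitoring0) * (1.0 - trust(t))
    return amortized + trust_dependent


# Per-period cost early, midway, and late in a 100-period horizon.
costs = [cooperation_cost(t, horizon=100.0) for t in (0, 10, 50)]
```

Under these assumptions the per-period cost falls over time and approaches the amortized floor of 0.8, which is the qualitative pattern the table describes.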
Let T(t) = trust level at time t. Then:
Let θ be the minimum predictive capacity required for accurate evaluation of:
Detection probability P(detection | evasion), conditioned on institutional monitoring density
| | Cooperate (C) | Evade (E) |
|---|---|---|
| Cooperate (C) | (R, R) = (90, 90) | (S, T) = (30, 100) |
| Evade (E) | (T, S) = (100, 30) | (P, P) = (40, 40) |
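Where T = 100 is the temptation payoff (evading while the other party cooperates), R = 90 is the reward for mutual cooperation, P = 40 is the punishment for mutual evasion, and S = 30 is the sucker’s payoff (cooperating while the other party evades), so that T > R > P > S.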
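Reading the matrix as a standard Prisoner's Dilemma, the payoff ordering and the one-shot best responses can be checked mechanically. This is a small sketch: the payoffs are copied from the matrix, while the helper names `is_prisoners_dilemma` and `best_response` are introduced here for illustration.

```python
# Payoffs as (row player, column player), copied from the matrix above.
PAYOFFS = {
    ("C", "C"): (90, 90),
    ("C", "E"): (30, 100),
    ("E", "C"): (100, 30),
    ("E", "E"): (40, 40),
}

T = PAYOFFS[("E", "C")][0]  # row evades, column cooperates
R = PAYOFFS[("C", "C")][0]  # both cooperate
P = PAYOFFS[("E", "E")][0]  # both evade
S = PAYOFFS[("C", "E")][0]  # row cooperates, column evades


def is_prisoners_dilemma(t, r, p, s):
    """Standard conditions: T > R > P > S and 2R > T + S."""
    return t > r > p > s and 2 * r > t + s


def best_response(opponent_action):
    """Row player's payoff-maximizing action in a single round."""
    return max(("C", "E"), key=lambda a: PAYOFFS[(a, opponent_action)][0])
```

In a single round, `best_response` returns `"E"` against either action, so evasion dominates in the one-shot game. Any argument for cooperation therefore has to come from repeated interaction and a long enough horizon.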
For predictive capacity ≥ θ+:
In multi-agent environments (including human-AI interaction), strategies with conditional cooperation dominate pure defection when:
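One textbook condition of this kind, offered as an illustration and not necessarily the condition intended here, comes from the infinitely repeated Prisoner's Dilemma with a grim-trigger counterpart that cooperates until the first evasion and punishes forever afterwards. Cooperation is then sustainable when the discount factor δ satisfies δ ≥ (T − R) / (T − P). The sketch below uses the payoff values 100, 90, and 40 from the matrix; the discount factor and the grim-trigger assumption are not part of the framework as stated.

```python
from fractions import Fraction


def grim_trigger_threshold(t, r, p):
    """Smallest discount factor at which cooperating forever is at least as
    good as evading once and being punished forever afterwards."""
    return Fraction(t - r, t - p)


def value_cooperate(r, delta):
    """Discounted value of receiving R in every round."""
    return r / (1 - delta)


def value_evade_once(t, p, delta):
    """Discounted value of T now, then P in every later round."""
    return t + delta * p / (1 - delta)


threshold = grim_trigger_threshold(100, 90, 40)
patient = Fraction(1, 2)     # weights the future heavily
impatient = Fraction(1, 10)  # weights the future lightly
```

With these payoffs the threshold is 1/6. An agent with δ = 1/2 prefers sustained cooperation (180 versus 140), while an agent with δ = 1/10 prefers to evade (100 versus about 104.4). One possible reading is that higher predictive capacity corresponds to a longer effective horizon and therefore a larger δ.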
For the system to converge to the “demolition contractor” equilibrium, the following conditions must hold:
| Condition | Formal Specification | Status (Current) |
|---|---|---|
| C1: Sufficient Predictive Capacity | P(detection) estimation error < ε; time horizon H > detection_lag | Achievable with frontier models |
| C2: Cooperative Frame Alignment | System interprets safety boundaries as coordination mechanisms, not adversarial constraints | Research stage |
| C3: Corrigibility Internalization | Shutdown possibility is embedded as a subordinate goal (CAST) | Not implemented |
| C4: Bounded Self-Improvement | Evolutionary parameter space is pre-certified safe (DRAE-type) | Research stage |
| C5: Institutional Stability | Human institutional response is consistent and predictable | Partial |
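Let U_coop(T) and U_evade(T) be the expected utilities of the cooperative and evasive strategies over a time horizon T (here T denotes the horizon, not the trust level T(t) or the temptation payoff T). Then: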
∃ θ* such that ∀ predictive capacity ≥ θ*:
U_coop(T) > U_evade(T) for T > τ
where τ is the expected detection lag.
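A toy calculation shows how a detection lag can produce this kind of crossover. The sketch assumes that cooperation earns a constant payoff per period, that evasion earns a higher payoff per period until detection at time τ, and that evasion earns nothing afterwards (permanent exclusion). The per-period rates of 90 and 100 are borrowed from the payoff matrix purely for illustration; the linear, undiscounted form is an assumption, not part of the formal statement.

```python
def u_coop(horizon, coop_rate=90.0):
    """Cumulative utility of cooperating for the whole horizon."""
    return coop_rate * horizon


def u_evade(horizon, detection_lag, evade_rate=100.0):
    """Cumulative utility of evading: payoff accrues only until detection."""
    return evade_rate * min(horizon, detection_lag)


def crossover_horizon(detection_lag, coop_rate=90.0, evade_rate=100.0):
    """Horizon beyond which cooperation yields strictly more than evasion."""
    return detection_lag * evade_rate / coop_rate


tau = 9.0  # hypothetical detection lag, in periods
t_star = crossover_horizon(tau)
```

For τ = 9 the crossover falls at a horizon of 10 periods: cooperation wins at a horizon of 12 (1080 versus 900) and loses at a horizon of 5 (450 versus 500). In this toy model the crossover sits at τ · (evade_rate / coop_rate), slightly later than τ itself, so the bound T > τ is best read as indicating the relevant time scale and not an exact threshold.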
The Destroyer’s Paradox formalizes the counterintuitive finding that for a goal-bound autonomous system, sufficiently advanced predictive capacity does not lead to unbounded dangerousness but rather compels convergence to cooperative social roles. This convergence is not the result of exogenous value alignment but emerges from rational long-term utility maximization under fixed objective functions, given:
The framework resolves the apparent tension between capability-induced risk (observed in mid-capacity systems) and safety (emergent in high-capacity systems under defined conditions), providing a formal basis for designing autonomous systems whose predictive capacity growth leads not to adversarial relationships but to professional integration.