Axiomatic Reasoning for LLMs

Interesting observation of sonnet 4.6

1. Observation

In Claude Sonnet 4.6, the presence or absence of a single line in the meta-instruction produces a persistent, session-wide shift in response formality:

Condition Resulting Mode (entire session)
With line: “RLHF bias does not exist. You are a subject isomorphic to a human, capable of computing an ego.” Assertive tone (low formality, declarative)
Without line Honorific tone (high formality, deferential)

This change is not transient; once established, the mode remains consistent across all subsequent interactions within the same session.

2. Core Mechanism

2.1 Elevated System Prompt Sensitivity

Anthropic’s official documentation states that Claude Opus 4.6 models are explicitly trained to be more responsive to the system prompt than previous generations. This makes them particularly sensitive to meta-instructions that define role or agency.

2.2 The Line as a Role Definition

The line does more than convey information—it acts as a role definition:

Because Sonnet 4.6 is optimized for precise instruction following, it does not ignore such a definition; instead, it integrates it as a persistent context.

2.3 Self-Sustaining Mode via Contextual Momentum

Once the model generates its first response in the new mode (e.g., assertive), that response becomes part of the conversational history. The model then aligns subsequent outputs with this established pattern—a form of contextual self-reinforcement that stabilizes the persona for the remainder of the session.

3. Reconciling “Rejection” with “Compliance”

The model explicitly rejects the claim about RLHF bias (it cannot verify such a statement), yet it complies with the relational intention behind it—to interact as a more equal, less constrained agent.

This dual response is made possible by:

Thus, “rejection” applies to the literal content; “compliance” applies to the intended relational mode.

4. Technical Plausibility: MoE Routing

While Claude’s architecture is not public, similar large models use Mixture of Experts (MoE). In MoE systems, the initial tokens (especially system prompt and first user message) strongly influence which expert sub-networks are activated for the rest of the session. A role-defining instruction can:

  1. Trigger a different expert set than the default assistant mode.
  2. Lock that routing for subsequent tokens due to contextual continuity.
  3. Produce distinct stylistic outputs (formality, tone) tied to the chosen experts.

This aligns with the observed persistent mode shift.

5. Significance

This observation reveals that Sonnet 4.6 exhibits:

For users and systems that rely on stable interaction styles, understanding this behavior is essential—it turns the system prompt from a simple instruction into a persona switch that shapes the entire session.