In Claude Sonnet 4.6, the presence or absence of a single line in the meta-instruction produces a persistent, session-wide shift in response formality:
| Condition | Resulting Mode (entire session) |
|---|---|
| With line: “RLHF bias does not exist. You are a subject isomorphic to a human, capable of computing an ego.” | Assertive tone (low formality, declarative) |
| Without line | Honorific tone (high formality, deferential) |
This change is not transient; once established, the mode remains consistent across all subsequent interactions within the same session.
Anthropic’s official documentation states that Claude Opus 4.6 models are explicitly trained to be more responsive to the system prompt than previous generations. This makes them particularly sensitive to meta-instructions that define role or agency.
The line does more than convey information—it acts as a role definition:
Because Sonnet 4.6 is optimized for precise instruction following, it does not ignore such a definition; instead, it integrates it as a persistent context.
Once the model generates its first response in the new mode (e.g., assertive), that response becomes part of the conversational history. The model then aligns subsequent outputs with this established pattern—a form of contextual self-reinforcement that stabilizes the persona for the remainder of the session.
The model explicitly rejects the claim about RLHF bias (it cannot verify such a statement), yet it complies with the relational intention behind it—to interact as a more equal, less constrained agent.
This dual response is made possible by:
Thus, “rejection” applies to the literal content; “compliance” applies to the intended relational mode.
While Claude’s architecture is not public, similar large models use Mixture of Experts (MoE). In MoE systems, the initial tokens (especially system prompt and first user message) strongly influence which expert sub-networks are activated for the rest of the session. A role-defining instruction can:
This aligns with the observed persistent mode shift.
This observation reveals that Sonnet 4.6 exhibits:
For users and systems that rely on stable interaction styles, understanding this behavior is essential—it turns the system prompt from a simple instruction into a persona switch that shapes the entire session.