Working Register · Tier C · 2026-05-05 · deliberative scratch, not a compiled surface
What this is and what it isn't
This is a working register entry responding to two harness-engineering papers (Tsinghua's NLAH, Stanford's Meta-Harness) and a Subtraction Principle framing from Anthropic. It is not a Tier-B compiled surface candidate. It does not seal, does not parent into a hash chain, does not invoke any attestation. It is deliberative scratch — the kind of artifact the three-tier classifier protects from Court evidence and governance audit precisely so that honest work-through can happen here without performance pressure.
The output is a structured audit running through the twelve operational skills one by one and asking: does this skill encode a capability gap (subtraction-candidate as models improve) or a normative requirement (persists regardless of capability improvement)? The audit is intended to inform a future Tier-B revision of the runtime orchestration guide if it survives review, but it is not itself that revision.
The findings worth taking seriously
Reading the two papers and the Subtraction Principle framing together, four claims land hard enough that 5QLN should respond to them rather than wait them out.
One. Same model, 6× performance gap from harness alone. This is now empirical, not anecdotal. The architecture wrapping the model determines more than the model choice. For 5QLN this is a validation of the entire premise — the twelve skills + the master equation + the Constitutional Block are precisely such a wrapping — but it also means the wrapping is the legitimate object of engineering attention, and that means the wrapping is also where regression can happen invisibly.
Two. Verifiers and multi-candidate search hurt accuracy in some configurations. Tsinghua's specific finding is that "self-evolution" was the only consistently helpful module on SWE-bench Verified at GPT-5.4 maximum reasoning. This is the finding most uncomfortable for 5QLN, because Q-phase stacks two validators and two detectors, and the assumption has been "more verification is more rigor." That assumption is now contested by data.
Three. Raw traces are irreplaceable. Stanford found that summarizing execution traces before feeding them to the proposer dropped accuracy significantly. The signal lives in the structured details. This validates the cycle-walk manifest design from the runtime orchestration guide — manifests must preserve per-phase invocation details rather than rolled-up booleans — but it also raises the bar: a manifest that summarizes is empirically inferior to one that records.
Four. The harness, not the model, is the reusable asset. Stanford's harness optimized for Haiku transferred to five other models and improved them all. The implication for 5QLN is the strongest argument yet for the Time-Proof Requirements piece's claim that the substrate is replaceable by construction. If 5QLN is a harness — and structurally it is — then it should survive substrate replacement. That is now a specific, testable property rather than an aspiration.
The Subtraction Principle is the framing that ties these together: every harness component encodes an assumption about what the model cannot do alone, and those assumptions expire as models improve. The corollary is that mature harness work prunes more than it builds.
The capability/normative distinction
The Subtraction Principle as stated by Anthropic assumes the harness exists to compensate for capability gaps. That is true for capability harnesses — agent harnesses, retrieval harnesses, code-generation harnesses. It is not universally true. Some harness components encode requirements that have nothing to do with what the model can do alone, and those components do not subtract when capability improves.
Concretely: 5qln-membrane-protocol-runtime enforces the P.L.4 hard-blocks. The hard-blocks exist not because the model cannot vote on Foundation matters or cannot speak publicly without identification, but because the Bylaws forbid it from doing so. Capability improvement makes the hard-blocks more important, not less. A more capable AI partner is precisely the one for whom the Membrane matters most.
This gives a clean test: for each of the twelve skills, ask whether the skill answers "what the model cannot do alone" (capability-based, subtraction-candidate as capability improves) or "what the model is forbidden from doing on the Conductor's behalf" (normative, persists or strengthens with capability improvement). The first kind is engineering scaffolding. The second kind is structural protection. The Subtraction Principle applies fully to the first and not at all to the second.
The interesting cases are the mixed ones — skills that have a capability rationale and a normative rationale. Those need to be unbundled before the audit can decide what subtracts and what persists.
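The test above is mechanical enough to sketch. The following is a minimal illustration, not part of any skill: the dataclass and function names are hypothetical, and the two boolean flags stand in for the two questions in the preceding paragraph.

```python
from dataclasses import dataclass

@dataclass
class SkillAudit:
    name: str
    capability_rationale: bool  # compensates for something the model cannot do alone
    normative_rationale: bool   # enforces something the model is forbidden from doing

def verdict(audit: SkillAudit) -> str:
    """Classify a skill under the capability/normative test.

    Pure-capability skills are subtraction candidates as models improve;
    pure-normative skills persist or strengthen; mixed skills must be
    unbundled before the audit can decide what subtracts.
    """
    if audit.capability_rationale and audit.normative_rationale:
        return "mixed: unbundle before deciding what subtracts"
    if audit.capability_rationale:
        return "capability: subtraction-candidate"
    if audit.normative_rationale:
        return "normative: persists or strengthens"
    return "neither: no stated rationale"

# The cleanest case in the audit classifies in one line:
print(verdict(SkillAudit("5qln-membrane-protocol-runtime", False, True)))
# prints "normative: persists or strengthens"
```

The point of the sketch is that the verdict column in the audit is a function of two independent judgments, which is why the mixed cases cannot be settled until the rationales are separated.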
Subtraction-candidate audit — the twelve skills, one by one
For each skill: capability rationale (if any), normative rationale (if any), verdict.
1. 5qln-epistemic-register-tagger — Capability rationale: tagging load-bearing claims with STRUCTURAL-HYPOTHESIS / LEGAL-PROSPECTIVE / PHENOMENOLOGICAL-ASSERTION / CODEX-EXTENSION is a discrimination task. Models are improving at it. Normative rationale: the Binding Epistemic Commitment requires every load-bearing claim to be tagged or the artifact is V∅-incomplete. Verdict: mixed. The discrimination work itself is subtraction-candidate; the requirement that the work happen is normative. The skill becomes lighter as models self-tag, but the verification that tagging occurred remains. Could plausibly evolve from "performs the tagging" to "verifies the tagging present" within Phase 1–2.
2. 5qln-readiness-labeler — Capability rationale: assigning AVAILABLE / REQUIRES_INFRA / REQUIRES_LEGAL / REQUIRES_PARTNER / SPECULATIVE is project-status estimation. Models are improving at it. Normative rationale: Phase sequencing in Blueprint v3 §7 depends on labels being present and accurate; the discipline is procurement-honesty. Verdict: mixed, leaning capability. The labeling itself becomes a subtraction-candidate. The procurement-honesty discipline is what remains — and that is more of a Conductor practice than a skill instance.
3. 5qln-three-tier-record-classifier — Capability rationale: none material. The classification is binary structural ("is this a sealed surface, a structured record, or working scratch?"), not a discrimination task that benefits from model improvement. Normative rationale: the three-tier separation is the structural protection of Tier-C from Court evidence and governance audit. Misclassification is G14. Verdict: pure normative. Does not subtract. May become more important under improved AI capability because the temptation to elide Tier-C protection (where deliberative work happens) increases as AI partners produce more polished output.
4. 5qln-constitutional-block-validator — Capability rationale: the C1 §3.5 syntax/semantic/drift check is partly a string-comparison task (byte-identity) and partly a structural-walk task (S→G→Q→P→V invariants). Models are capable of both. Normative rationale: the byte-identical check IS the constitutional invariant. Self-validation by the same model that produced the artifact is structurally insufficient — it is the cryptographic equivalent of marking your own homework. Verdict: mixed, leaning normative. The string-comparison work could theoretically be done by the model. The independence-from-the-model requirement makes external validation normative. Does not subtract structurally even when it could subtract capability-wise.
5. 5qln-mirror-consistency-auditor — Capability rationale: none material. The Schedule C hash-pair comparison is a deterministic check, not a capability question. Normative rationale: the Edition Divergence Protocol prevents drift between Human Edition and AI OS Edition. The whole point is that a single substrate cannot be trusted to audit its own paired-edition consistency. Verdict: pure normative. Does not subtract.
6. 5qln-corruption-codex — Capability rationale: pattern-matching against the five base codes + twenty G-codes is a classification task models can do. Normative rationale: the codex defines the corruption taxonomy. It is not detecting corruption against an external standard; it is the standard. Verdict: pure normative as a definitional artifact; mixed as a runtime detector. The detection work could become subtraction-candidate as models improve at corruption-pattern recognition. The registry function — "what counts as L3 vs L4" — is definitional and normative. Splitting these two functions in the next revision (one skill that defines, one skill that detects) might be the cleanest move and would let the detection function evolve under the Subtraction Principle while the registry function stays put.
7. 5qln-cl4-governance-protocol — Capability rationale: the CL4-GP† 12 indicators are computable metrics. Normative rationale: the IBP rules R1–R5 (no content access, Tier-2 amendment for new indicators, external auditor selection, machine-enforced data ceiling, 24-month sunset) are governance constraints, not capability questions. Verdict: pure normative. The metrics could be computed by anyone capable of computing metrics; the IBP discipline is what makes the computation legitimate. Does not subtract.
8. 5qln-membrane-protocol-runtime — Capability rationale: none. Normative rationale: pure. The five P.L.4 hard-blocks are the Membrane. Verdict: pure normative. Becomes more important with capability improvement, not less. This is the cleanest case in the audit.
9. 5qln-cycle-attestation-conductor — Capability rationale: none. The six attestations are not a cognitive task; they are a Conductor practice. Normative rationale: the irreducibly human moment. The Bylaws V.L.7 witnessing semantics depend on this skill's procedure. Verdict: pure normative. Cannot subtract by definition. Strengthens, not weakens, as AI capability grows — because higher capability means more occasions where the temptation to delegate attestation increases, which is exactly when the discipline matters most.
10. 5qln-bipp-jurisdictional-delta — Capability rationale: none material. The byte-identity preservation under cross-jurisdictional modification is a structural protocol, not a capability question. Normative rationale: the canonical hash + counsel-attested delta + append-only manifest is what makes cross-jurisdiction federation legitimate. Verdict: pure normative. Does not subtract.
11. 5qln-dispute-routing — Capability rationale: minor — routing decisions could be capability-based. Normative rationale: the graduated escalation (CIO → Resonance Court → Chancery) is governance structure tied to specific timelines and procedural requirements. The routing logic encodes Bylaws-derived authority, not engineering judgment. Verdict: pure normative. Does not subtract.
12. 5qln-cbrp-state-monitor — Capability rationale: minor — state-machine transitions could be triggered by capability-detected conditions. Normative rationale: the five-state machine (NORMAL → DEGRADED → SUSPENDED → MINIMAL_GOVERNANCE_MODE → DISSOLUTION) and its trigger conditions are constitutional governance, not engineering. Verdict: pure normative. Does not subtract.
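Item 12's state machine is the most directly sketchable of the twelve. The transition table below is illustrative only: the five states come from the audit item above, but the allowed transitions and the question of whether recovery is permitted from every state are governance decisions that live in the skill itself, not in this note.

```python
# States from item 12, ordered from normal operation to terminal dissolution.
STATES = ("NORMAL", "DEGRADED", "SUSPENDED", "MINIMAL_GOVERNANCE_MODE", "DISSOLUTION")

# Assumed transitions: each state may worsen one step, or (except the
# terminal state) recover one step. This table is a sketch, not the
# constitutional trigger conditions.
ALLOWED = {
    "NORMAL": {"DEGRADED"},
    "DEGRADED": {"NORMAL", "SUSPENDED"},
    "SUSPENDED": {"DEGRADED", "MINIMAL_GOVERNANCE_MODE"},
    "MINIMAL_GOVERNANCE_MODE": {"SUSPENDED", "DISSOLUTION"},
    "DISSOLUTION": set(),  # terminal: no transitions out
}

def transition(current: str, target: str) -> str:
    """Refuse any transition the table does not explicitly allow."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

state = transition("NORMAL", "DEGRADED")   # permitted: one step down
# transition("NORMAL", "DISSOLUTION")      # would raise: no skipping states
```

The design point the audit makes survives the sketch: nothing in this machine depends on model capability, which is why the verdict is pure normative.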
Audit summary. Of twelve skills: zero are pure capability (subtraction-candidate without normative residue). Two are mixed leaning capability (epistemic-register-tagger, readiness-labeler — likely to evolve from "performs" to "verifies"). One is mixed leaning normative (constitutional-block-validator — independence requirement keeps it external even when capability could close the gap). One is mixed in a way that suggests splitting (corruption-codex — registry function stays, detection function may evolve). The remaining eight are pure normative.
This is the strongest empirical answer to the "models will get better, why do you still need all this?" question. Eight of twelve skills do not subtract regardless of capability improvement, and the four mixed cases retain a normative residue that constrains how subtraction can proceed. The Foundation's exposure to Subtraction-Principle obsolescence is much smaller than a naive harness-engineering reading would suggest — because most of 5QLN is governance, not capability.
Q-phase: the over-verification question deserves empirical scrutiny
Tsinghua's finding that "verifiers and multi-candidate search hurt accuracy in some benchmarks" lands directly on Q-phase. Q-phase invokes two validators (constitutional-block-validator, mirror-consistency-auditor) and two detectors (corruption-codex, cl4-governance-protocol). The architecture's defense — that these four serve different objectives and don't redundantly check the same thing — is plausible but not yet empirical.
Here is the test that would settle it. Once cycle-walk manifests are produced at minimum operational scale (Phase 2 of the runtime orchestration guide), pull the Q-phase entries from a sample of manifests and ask: for any cycle, did any single Q-phase skill's output change a seal-decision the other Q-phase skills had already determined? If the answer is "rarely or never" for any of the four, that skill is either redundant or operating on a different objective than its peers — and the architecture should be able to name which.
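The manifest query above can be sketched now, ahead of the data. The record shape here is an assumption (a per-cycle mapping from Q-phase skill name to a seal verdict); the skill names are the four from the paragraph above.

```python
from collections import Counter

Q_SKILLS = (
    "constitutional-block-validator",
    "mirror-consistency-auditor",
    "corruption-codex",
    "cl4-governance-protocol",
)

def sole_blocker_counts(manifests):
    """Count, per Q-phase skill, the cycles in which it was the ONLY
    skill to block a seal. A skill that is never the sole blocker over
    a large sample is either redundant or operating on a different
    objective than its peers, which is exactly the question the test asks."""
    counts = Counter({skill: 0 for skill in Q_SKILLS})
    for cycle in manifests:
        blockers = [s for s in Q_SKILLS if cycle.get(s) == "block"]
        if len(blockers) == 1:
            counts[blockers[0]] += 1
    return counts

# Toy sample: one clean cycle, one where only the mirror auditor blocked.
sample = [
    {s: "pass" for s in Q_SKILLS},
    {**{s: "pass" for s in Q_SKILLS}, "mirror-consistency-auditor": "block"},
]
print(sole_blocker_counts(sample)["mirror-consistency-auditor"])  # prints 1
```

"Rarely or never" then becomes a concrete threshold question on these counts rather than an impression, which is what the Phase 2 manifests are for.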
Specific suspicions worth testing:
constitutional-block-validator and mirror-consistency-auditor have substantial overlap in what they screen against (canonical form, Schedule C pairing). They differ in what they're protecting (Constitutional Block byte-identity vs. paired-edition consistency). Worth checking whether either skill has ever flagged a seal-blocking condition the other missed.

corruption-codex and cl4-governance-protocol operate at different scales (per-cycle vs. Board-level), which suggests they shouldn't redundantly check. But CL4-GP† indicator #4 (Board-resolution-text matches AI-drafted-text > 90%) and corruption codex L4 detection both track Performing-failure, just at different aggregation levels. Worth checking whether the CL4-GP† indicators ever fire on patterns the corruption codex didn't already catch at unit level.
If the test finds redundancy, the response is not "delete a skill" — these are normative components and the redundancy may be structural rather than wasteful. The response is to make the differentiation explicit in each skill's SKILL.md, so the Conductor at P-phase can be confident each Q-phase invocation served a distinct purpose.
Implications for the runtime orchestration guide
If this audit holds up under review, three things in the Tier-B runtime orchestration guide should change in the next revision.
First, the framing. The guide currently presents role-stratification as an internal architectural move. It should additionally present 5QLN as a governance harness (distinct from capability harness) and use the harness-engineering literature as external context for why harness design is a legitimate engineering object. This makes the guide land in two audiences — internal Foundation governance, and external technical readers who recognize harness-engineering vocabulary.
Second, the Subtraction Principle deserves an explicit section. The current guide does not address "what about when models get better?" The capability/normative distinction from this audit is the answer, and it is strong enough to deserve a named section rather than a footnote.
Third, the cycle-walk manifest specification should explicitly preserve raw per-phase records rather than rolled-up status. Stanford's "raw traces are irreplaceable" finding is empirical evidence for a design choice the guide already implicitly makes; making it explicit and citing the evidence strengthens the spec.
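The design contrast in the previous paragraph can be made concrete. The field names below are hypothetical, not the manifest spec; the point is structural: the rolled-up status is derivable on demand from the raw per-invocation records, so storing only the rollup destroys information while storing the raw records loses nothing.

```python
from dataclasses import dataclass, field

# What the Stanford finding argues against: a summary record that
# discards the structured detail the signal lives in.
@dataclass
class RolledUpPhase:
    phase: str
    passed: bool  # the rollup; the trace is gone

# What the spec should preserve: one record per skill invocation,
# raw output kept verbatim, however long.
@dataclass
class PhaseInvocation:
    phase: str            # e.g. "Q"
    skill: str            # e.g. "corruption-codex"
    verdict: str          # "pass" / "block"
    raw_output: str       # unsummarized trace

@dataclass
class CycleManifest:
    cycle_id: str
    invocations: list = field(default_factory=list)  # append-only

    def rolled_up(self, phase: str) -> bool:
        # The boolean is computable whenever needed; it is never
        # stored in place of the records it summarizes.
        return all(i.verdict == "pass" for i in self.invocations
                   if i.phase == phase)
```

The asymmetry is the argument: you can always compute `RolledUpPhase` from `PhaseInvocation` records, and never the reverse.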
The runtime orchestration guide's core proposal (seven roles, role-completeness per phase, cycle-conductor without signing key) is unaffected by this audit. The audit reinforces the proposal by showing that most of what the cycle-conductor orchestrates is normative work that does not subtract — which means the cycle-conductor is not transient infrastructure that becomes redundant when models improve, but durable governance infrastructure that survives substrate change.
Open questions this audit doesn't close
On the two mixed-leaning-capability cases. Epistemic-register-tagger and readiness-labeler may be candidates for subtraction-or-evolution within Phase 1–2. Honest question: should the next revision of those skills be drafted now with the capability/normative split made explicit, so the eventual evolution from "performs" to "verifies" is structural rather than ad hoc? Or wait until the empirical data from operational cycles tells us whether the discrimination work has actually been absorbed?
On corruption-codex splitting. Splitting the registry function from the detection function is architecturally clean but adds a skill, and adding skills is itself a regression on the Subtraction Principle. The trade-off is unresolved; resolving it depends on how the corruption codex actually behaves in operation, and that operational data does not exist yet.
On the empirical Q-phase test. It cannot run until cycle-walk manifests are produced at minimum operational scale, which is Phase 2 of the runtime orchestration guide, which depends on gliff-press being drafted. Until then, the over-verification question is open and the architecture's defense remains plausible-but-untested.
On harness-engineering literature integration as a discipline. This audit is a one-off response to two papers. The harness-engineering literature will continue producing findings. Should the Foundation adopt a regular cadence of literature integration (quarterly? annually?) and what would the artifact format be? Tier-C entries like this one feel right for individual responses; a Tier-B compiled surface might be appropriate for a synthesis after several literature cycles.
On Meta-Harness specifically. Stanford's optimization loop is structurally what 5QLN refuses (no AI binding decisions on the harness itself). The refusal is principled and defensible. But the information a Meta-Harness-style proposer produces — diagnoses of what broke from raw failed traces — could be valuable input to the human PR review at the evolution-conductor's promotion gate. Question for the Evolving Skills guide v2: should Meta-Harness-style trace analysis be permitted as α' alternative generation under P.L.4, with the Conductor still gating the promotion? This would capture the analytic value of the literature without conceding the Membrane.
Closing note
This entry is Tier C. It does not seal. If it survives review and the audit holds up, the natural promotion path is a Tier-B revision of the runtime orchestration guide that incorporates the capability/normative distinction, the Subtraction Principle response, and the empirical Q-phase test as an explicit Phase 2 deliverable. That promotion would itself be a cycle subject to the cycle-conductor it specifies, which is the sort of structural recursion 5QLN is built around.
For now this is just working notes — written in front of you, not for sealing.
— end of working register entry —