Methodology — Aculeus

1. What this document is

A working description of how the Aculeus engine produces predictions and how those predictions are validated. It exists to back every public claim on the marketing site, the partner one-pager, and the case-by-case scorecard with a single document a critic can read end-to-end.

If a claim appears on the product and is not derivable from this document, the claim is wrong and the document needs to be updated.

2. What the engine actually predicts

For a given audience and message, the engine produces:

Per-entity stance (SUPPORT / OPPOSE / NEUTRAL) for the named political actors in the brief, grounded in their record history.
Defense pattern ranking for a target entity facing a specific attack message: the top-3 most-likely responses, drawn from a corpus of historical attack/defense pairs.
Message-ranking on a paired comparison: given two messages aimed at the same audience, which is more likely to land.

What the engine does NOT predict:

Outright election outcomes. The engine produces audience-conditional stance + message-effect predictions; converting those to a vote prediction requires turnout assumptions the engine doesn't supply.
Voter-level individual behavior. The engine works at the demographic-cell level documented in lib/predictive/cells.ts.
Causal attribution of past results. Validation entries describe what was OBSERVED to win, not whether the message caused the outcome. Confounders are listed explicitly per entry.

3. Why paired-comparison framing

The original framing — "ensemble beats naive on raw accuracy at N=100" — is mathematically unreachable at the observed effect size: the 95% CI lower bound on a raw-accuracy claim cannot clear naive at N=100. The right test is paired:

Given an audience + message pair where the ensemble and a naive classifier disagree, is the ensemble systematically more often right?

This is McNemar's test on the discordant subset:

Readout

A correct, B wrong   → b
A wrong, B correct   → c
chi² (continuity)    = (|b - c| - 1)² / (b + c)        for b + c ≥ 25
p (exact binomial)   = 2 · P(X ≤ min(b, c) | n = b+c, p = 0.5)  for b + c < 25

The implementation lives at lib/predictive/statistical-reframe.ts:mcnemar. A smoke test (scripts/_smoke-statistical-reframe.mjs) verifies it against textbook values.
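The readout above can be sketched in a few lines. This is an illustrative sketch only, not the code in statistical-reframe.ts: the names mcnemarSketch, erfcApprox, and binomialCdfHalf are hypothetical, the chi-square p-value uses a standard polynomial approximation of erfc, and the exact p-value is capped at 1 (an assumption the readout does not state).

```typescript
// Hypothetical sketch; NOT the code in lib/predictive/statistical-reframe.ts.
interface McNemarSketchResult {
  discordant: number; // b + c, to be reported next to the p-value
  method: "chi2_continuity" | "exact_binomial";
  statistic: number | null; // chi² on the large-sample branch, else null
  pValue: number;
}

// erfc(x) for x >= 0 via the Abramowitz-Stegun 7.1.26 polynomial (error about 1.5e-7).
function erfcApprox(x: number): number {
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
  return poly * Math.exp(-x * x);
}

// P(X <= k) for X ~ Binomial(n, 0.5), accumulated term by term.
function binomialCdfHalf(k: number, n: number): number {
  let term = Math.pow(0.5, n); // P(X = 0)
  let sum = term;
  for (let i = 1; i <= k; i++) {
    term *= (n - i + 1) / i; // P(X = i) from P(X = i - 1)
    sum += term;
  }
  return sum;
}

function mcnemarSketch(b: number, c: number): McNemarSketchResult {
  const n = b + c;
  if (n === 0) {
    // No discordant pairs: nothing to test.
    return { discordant: 0, method: "exact_binomial", statistic: null, pValue: 1 };
  }
  if (n >= 25) {
    const chi2 = Math.pow(Math.abs(b - c) - 1, 2) / n;
    // Survival function of a chi-square variable with 1 degree of freedom.
    const p = erfcApprox(Math.sqrt(chi2 / 2));
    return { discordant: n, method: "chi2_continuity", statistic: chi2, pValue: p };
  }
  // Two-sided exact binomial p-value, capped at 1 (assumption).
  const p = Math.min(1, 2 * binomialCdfHalf(Math.min(b, c), n));
  return { discordant: n, method: "exact_binomial", statistic: null, pValue: p };
}
```

With b = 10 and c = 5 the sketch takes the exact branch and returns p ≈ 0.302 on 15 discordant items; with b = 30 and c = 10 it takes the chi-square branch with a statistic of 9.025.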

The paired framing has two honest consequences:

Headline accuracy numbers don't move much; the claim is about direction of error on the items where the two classifiers disagree, not about a higher overall percentage.
The discordant subset is smaller than the full validation set, so p-values look larger than a naive reader might expect. Reporting the discordant count alongside the p-value is mandatory in any external claim.

4. Confidence intervals + multi-run averaging

Every reported point estimate from the validation harness carries a bootstrap 95% CI (statistical-reframe.ts:bootstrapCi, default 1000 resamples). When a number lacks a CI, treat it as a debugging artifact, not a published claim.
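As a rough illustration of what a bootstrap 95% CI on an accuracy estimate involves, here is a minimal sketch. It assumes a percentile bootstrap over per-item 0/1 correctness scores, which may differ from what bootstrapCi actually does; bootstrapCiSketch and the seeded mulberry32 generator are hypothetical and are included only so the example is reproducible.

```typescript
// Hypothetical sketch; the real bootstrapCi may use a different bootstrap variant.

// Small seeded PRNG (mulberry32) so repeated calls give the same interval.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface CiSketch {
  point: number;
  lower: number;
  upper: number;
}

function bootstrapCiSketch(values: number[], resamples = 1000, seed = 1): CiSketch {
  const n = values.length;
  if (n === 0) throw new Error("need at least one value");
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    // Resample n items with replacement and record the resampled mean.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += values[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  // Percentile interval: read the 2.5% and 97.5% quantiles of the resampled means.
  const at = (q: number) => means[Math.min(means.length - 1, Math.floor(q * means.length))];
  const point = values.reduce((s, v) => s + v, 0) / n;
  return { point, lower: at(0.025), upper: at(0.975) };
}
```

Calling bootstrapCiSketch on 100 items of which 70 are correct gives a point estimate of 0.7 with an interval roughly nine points wide on either side, which is the kind of spread that makes raw-accuracy claims at this sample size hard to defend.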

The harness defaults to a multi-run pool (≥3 runs) per statistical-reframe.ts:multiRunAggregate. The pooled CI reflects both within-run noise (LLM sampling variance) and run-to-run drift (retry-mediated changes in tool-call ordering). Per-run point estimates are reported alongside the pooled CI so the reader can see the spread.

When multi-run isn't possible (for example, a single continuous-integration run on a regression check), the result is labeled single_run_provisional and a follow-up multi-run is queued before any external claim moves.
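One way to get a pooled interval that reflects both within-run noise and run-to-run drift is a two-level bootstrap: resample runs, then resample items inside each chosen run. The sketch below assumes that approach and is not a description of multiRunAggregate. The function name, the "multi_run" label, and the choice of pooling as the mean of per-run means are all hypothetical; only the single_run_provisional label and the three-run floor come from the text above.

```typescript
// Hypothetical sketch; NOT the code behind statistical-reframe.ts:multiRunAggregate.

// Seeded PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface MultiRunSketch {
  label: "multi_run" | "single_run_provisional"; // "multi_run" is an invented label
  perRun: number[]; // per-run point estimates, reported alongside the pooled CI
  pooled: number;
  lower: number;
  upper: number;
}

function multiRunAggregateSketch(runs: number[][], resamples = 1000, seed = 7, minRuns = 3): MultiRunSketch {
  if (runs.length === 0 || runs.some((r) => r.length === 0)) {
    throw new Error("need at least one non-empty run");
  }
  const rand = mulberry32(seed);
  const idx = (n: number) => Math.floor(rand() * n);
  const mean = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
  const stats: number[] = [];
  for (let r = 0; r < resamples; r++) {
    const runMeans: number[] = [];
    for (let k = 0; k < runs.length; k++) {
      const run = runs[idx(runs.length)]; // level 1: resample a run (run-to-run drift)
      let sum = 0;
      for (let i = 0; i < run.length; i++) sum += run[idx(run.length)]; // level 2: items (within-run noise)
      runMeans.push(sum / run.length);
    }
    stats.push(mean(runMeans));
  }
  stats.sort((x, y) => x - y);
  const perRun = runs.map(mean);
  return {
    label: runs.length >= minRuns ? "multi_run" : "single_run_provisional",
    perRun,
    pooled: mean(perRun),
    lower: stats[Math.floor(0.025 * resamples)],
    upper: stats[Math.min(resamples - 1, Math.floor(0.975 * resamples))],
  };
}
```

With a single run the first level has nothing to resample, so the interval reflects within-run noise only. That is the reason such a result carries the provisional label.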

5. Data sources

The engine consumes corpora loaded by:

scripts/calibration/anes/load-anes.mjs: ANES (American National Election Studies)
scripts/calibration/gss/load-gss.mjs: GSS (General Social Survey)
scripts/calibration/exit-polls/load-exit-polls.mjs: Edison/Mitofsky national exit polls 2010-2024
scripts/calibration/ppic/load-ppic.mjs: PPIC + CA Field Poll
scripts/calibration/pew/load-pew.mjs: Pew Research toplines (M1)

Each corpus lands in the M0 canonical shape (survey_questions + survey_responses + survey_question_embeddings) with a deterministic train / dev / test split via sha256(source_corpus | question_id) % 3. Splits are stable across re-runs and tied to the underlying source key, so a re-extracted CSV hits the same bucket.
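The split rule can be sketched as follows. Several details are assumptions rather than facts about the loaders: that the key is the two fields joined by a literal "|", that the full 256-bit digest is reduced modulo 3, and that buckets 0, 1, 2 map to train, dev, test in that order. assignSplitSketch is a hypothetical name.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch of the deterministic split; key format and bucket order are assumptions.
type Split = "train" | "dev" | "test";
const SPLITS: Split[] = ["train", "dev", "test"];

function assignSplitSketch(sourceCorpus: string, questionId: string): Split {
  const digest = createHash("sha256").update(`${sourceCorpus}|${questionId}`, "utf8").digest();
  // Reduce the digest (read as a big-endian integer) modulo 3 one byte at a time,
  // which avoids BigInt and keeps every intermediate value small.
  let bucket = 0;
  for (let i = 0; i < digest.length; i++) {
    bucket = (bucket * 256 + digest[i]) % 3;
  }
  return SPLITS[bucket];
}
```

Because the bucket depends only on the source key, re-extracting a CSV or re-running a loader reproduces the same assignment without any stored state.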

Note: the ANES + GSS loaders use real microdata; the Pew loader consumes a CSV intermediate populated from published toplines (the PDF-parsing question is M4 work).

6. Validation sets

Four validation sets are checked in:

Set	Path	Purpose	M1 floor
RED-7 message-ranking	`tests/predictive/fixtures/ranking-validation/red7-validation-set.json`	Paired-comparison ranking task	≥100 entries
RED-7 heldout	`tests/predictive/fixtures/ranking-validation/red7-heldout.json`	NEVER used for tuning; final-claim gate	≥10 entries
Entity-predictor	`tests/predictive/fixtures/entity-prediction/entity-pred-validation-set.json`	Per-entity stance + confidence band	≥30 entries
Defense-predictor	`tests/predictive/fixtures/defense-prediction/defense-pred-validation-set.json`	Top-3 defense pattern + text	≥30 entries

Schema audits:

scripts/validate-red7.mjs (enforced by .github/workflows/red7-validate.yml)
Entity- and defense-predictor sets have inline node -e audits in the CI workflow.

The heldout RED-7 entries are pulled aside at curation time and not shown to the engine during prompt tuning or persona iteration. Any public claim that "the engine generalizes" must report performance on the heldout split (with bootstrap CI + McNemar) separately from the main validation set.
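Two of the checks implied above, the M1 entry floors and the separation of heldout entries from the main set, are simple enough to sketch. This is not the content of scripts/validate-red7.mjs or the inline audits. The function names are hypothetical, and the existence of a per-entry string id is an assumption about the fixture schema.

```typescript
// Hypothetical sketch; fixture schema (a string id per entry) is assumed, not documented.
const M1_FLOORS: Record<string, number> = {
  "red7-validation-set": 100,
  "red7-heldout": 10,
  "entity-pred-validation-set": 30,
  "defense-pred-validation-set": 30,
};

// Returns the names of the sets whose entry count is below the M1 floor.
function floorViolationsSketch(counts: Record<string, number>): string[] {
  return Object.keys(M1_FLOORS).filter((name) => {
    const n = name in counts ? counts[name] : 0; // a missing set counts as empty
    return n < M1_FLOORS[name];
  });
}

// Returns heldout ids that also appear in the main validation set.
function heldoutLeaksSketch(mainIds: string[], heldoutIds: string[]): string[] {
  const main = new Set(mainIds);
  return heldoutIds.filter((id) => main.has(id));
}
```

A non-empty result from either function would be a reason to fail the audit before any number from the affected set is reported.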

7. Known limitations + confounders

Selection effects in RED-7: entries are curated from observable historical races where outcomes were close enough that message-frame attribution is plausible. Easy-outcome races are underweighted by construction.
Confounders are real and listed per-entry: turnout, incumbency, party id, money, demographics, ballot wording, religious mobilization, redistricting cycles. The confounders field on every entry is required and is read carefully by the labelers. The engine cannot separate message effects from these confounders; it predicts which frame historically beat which other frame, NOT what would have happened under a counterfactual frame swap.
CA bias by intent: roughly 20% of RED-7 entries are from California (CA), Aculeus's first market. Operators interpreting the engine's confidence outside CA should expect wider CIs.
Provider drift: DeepSeek is the primary reasoning model; Grok is wired as a fallback per RED-5. A future swap to Anthropic or OpenAI as primary will invalidate the existing RED-7 numbers and trigger both a re-run and a bump of this doc to v1.
Prompt versioning: every prompt under prompts/persona/, prompts/entity-prediction/, prompts/defense/ is checked in, versioned, and locked at run time by the artifact-freeze logic (RED-3). The prompt that produced a published claim is recoverable from the artifact record.

Render verification

Once a Case is approved, the Dispatch can render it for a specific audience — an op-ed, a social line, a newsletter, a press release. Verification of a render is narrower than the Case's own review, and deliberately so: a render is not allowed to assert anything the frozen Case did not.

The approved Case is first frozen into an immutable, content-hashed snapshot and decomposed into claim atoms, each bound to its own receipt. A render may only ever reference those atoms. The model chooses emphasis, order, frame, and format; it does not get to add a fact. Every rendered claim is bound back to its atom, and the attribution is injected by code from the receipt metadata — the model never writes a citation.

A cross-model verifier then checks the render against the snapshot in two tiers. The first is binding: a naked assertion with no atom behind it, or a line that drifts from the record, blocks the render outright. The second is a gestalt pass over the whole piece. What the layer refuses is as load-bearing as what it allows:

It will not drop the counter. The case against the call is a required inclusion on the frozen Case; a render cannot omit it to read cleaner.
It will not ship on a gestalt pass it could not run. When the whole-piece check does not clear, the render routes to human review rather than publishing on its own say-so.
It will not invent attribution. If a claim's receipt cannot be resolved, the binding is marked unresolved rather than papered over with a plausible-looking source.

The same discipline the Case holds — source it, or do not say it — is the discipline the render inherits. The emphasis moves between audiences; the substance stays locked to the record.
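The binding tier and the three refusals can be illustrated with a small data model. Every type, field, and function name below is hypothetical. The sketch also leaves out two things described above: the check for a line that drifts from the record, which needs the cross-model verifier, and the gestalt pass itself, which is represented only by a flag. Requiring every atom flagged as counter to appear in the render is an assumption about how the required inclusion is enforced.

```typescript
// Hypothetical data model; none of these names come from the Aculeus codebase.
interface Receipt { id: string; source: string }
interface ClaimAtom { id: string; text: string; receiptId: string; isCounter: boolean }
interface FrozenCase { contentHash: string; atoms: ClaimAtom[]; receipts: Receipt[] }
interface RenderedClaim { text: string; atomId: string | null }

type Verdict = "pass" | "block" | "human_review";
interface BindingReport {
  verdict: Verdict;
  nakedClaims: string[]; // rendered lines with no atom behind them
  counterDropped: boolean;
  attributions: { atomId: string; source: string | null }[]; // null means unresolved
}

function verifyBindingSketch(
  snapshot: FrozenCase,
  claims: RenderedClaim[],
  gestaltCleared: boolean | null, // null: the whole-piece check could not run
): BindingReport {
  const atoms = new Map<string, ClaimAtom>();
  snapshot.atoms.forEach((a) => atoms.set(a.id, a));
  const receipts = new Map<string, Receipt>();
  snapshot.receipts.forEach((r) => receipts.set(r.id, r));

  const nakedClaims: string[] = [];
  const used = new Set<string>();
  const attributions: { atomId: string; source: string | null }[] = [];
  for (const claim of claims) {
    const atom = claim.atomId === null ? undefined : atoms.get(claim.atomId);
    if (!atom) {
      nakedClaims.push(claim.text);
      continue;
    }
    used.add(atom.id);
    // Attribution is taken from receipt metadata by code, never from model output.
    const receipt = receipts.get(atom.receiptId);
    attributions.push({ atomId: atom.id, source: receipt ? receipt.source : null });
  }

  const counterDropped = snapshot.atoms.some((a) => a.isCounter && !used.has(a.id));
  let verdict: Verdict = "pass";
  if (nakedClaims.length > 0 || counterDropped) verdict = "block";
  else if (gestaltCleared !== true) verdict = "human_review";
  return { verdict, nakedClaims, counterDropped, attributions };
}
```

The ordering is deliberate in this sketch: binding failures block before the gestalt flag is consulted, and anything short of an explicit clear on the gestalt pass routes to human review.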

The action-intent ranking

A render can be ranked against a synthetic panel of the target audience. The panel reads the piece and reacts in free text, and each reaction is mapped onto a Likert distribution by embedding similarity — Semantic Similarity Rating (SSR), a published method (arXiv:2510.08338). The output is a relative ranking across drafts — which framing the panel reacted to more strongly for the action you named — surfaced as a rank and the panel's strongest and weakest reactions, never as a percentage or a likelihood number.

The score informs; it never gates. It is a ranking signal over drafts — used to choose between variants and to cap the revise-and-rescore loop — not a measurement of real-world persuasion, and it never moves a fact or overrides the substance lock. Until it is validated against real outcomes it stays uncalibrated, so we treat it as relative, not absolute. Scores are logged with each render so they can be calibrated against real results as those accrue.
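The mapping from a free-text reaction to a Likert distribution, and from there to a rank, can be sketched as below. This is a simplified illustration and is not claimed to match either the published SSR procedure or the production pipeline. It assumes one pre-computed anchor embedding per Likert point, turns cosine similarities into a distribution by subtracting the minimum and normalizing, and ranks drafts by mean expected rating. All names are hypothetical, and the embeddings are supplied by the caller.

```typescript
// Hypothetical sketch; the normalization and ranking rule are assumptions.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

// Distribution over Likert points 1..K, where anchors[k] embeds the statement for point k + 1.
function likertPmfSketch(reaction: number[], anchors: number[][]): number[] {
  const sims = anchors.map((a) => cosine(reaction, a));
  const min = Math.min(...sims);
  const shifted = sims.map((s) => s - min);
  const total = shifted.reduce((s, x) => s + x, 0);
  // If every anchor is equally similar, fall back to a uniform distribution.
  if (total === 0) return sims.map(() => 1 / sims.length);
  return shifted.map((x) => x / total);
}

function expectedRatingSketch(pmf: number[]): number {
  return pmf.reduce((s, p, k) => s + (k + 1) * p, 0);
}

interface DraftReactions { id: string; reactions: number[][] }

// Returns ranks only (1 = strongest); the underlying score is deliberately not exposed.
function rankDraftsSketch(drafts: DraftReactions[], anchors: number[][]): { id: string; rank: number }[] {
  const scored = drafts.map((d) => {
    const ratings = d.reactions.map((r) => expectedRatingSketch(likertPmfSketch(r, anchors)));
    const score = ratings.length === 0 ? 0 : ratings.reduce((s, x) => s + x, 0) / ratings.length;
    return { id: d.id, score };
  });
  scored.sort((x, y) => y.score - x.score);
  return scored.map((s, i) => ({ id: s.id, rank: i + 1 }));
}
```

Returning a rank without the score mirrors the constraint above that the output is relative across drafts and is never surfaced as a percentage or likelihood.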