1. What this document is
A working description of how the Aculeus engine produces predictions and how those predictions are validated. It exists to back every public claim on the marketing site, the partner one-pager, and the case-by-case scorecard with a single document a critic can read end-to-end.
If a claim appears on the product and is not derivable from this document, the claim is wrong and the document needs to be updated.
2. What the engine actually predicts
For a given audience and message, the engine produces:
- Per-entity stance (SUPPORT / OPPOSE / NEUTRAL) for the named
political actors in the brief, grounded in their record history.
- Defense pattern ranking for a target entity facing a specific
attack message — top-3 most-likely responses drawn from a corpus of historical attack/defense pairs.
- Message-ranking on a paired comparison: given two messages aimed
at the same audience, which is more likely to land.
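As a reading aid, the three outputs listed above might be typed roughly as follows. This is a hypothetical sketch: only the three stance labels come from this document, and every type name and field name is an assumption made for illustration, not the engine's actual schema.

```typescript
// Hypothetical shapes for the three prediction outputs.
// Only the stance labels are taken from this document; all names are assumed.
const STANCES = ["SUPPORT", "OPPOSE", "NEUTRAL"] as const;
type Stance = (typeof STANCES)[number];

interface EntityStancePrediction {
  entity: string; // named political actor from the brief
  stance: Stance;
}

interface DefenseRanking {
  targetEntity: string;
  attackMessage: string;
  // Top-3 most likely responses, most likely first.
  topResponses: [string, string, string];
}

interface MessageRanking {
  audience: string;
  messageA: string;
  messageB: string;
  moreLikelyToLand: "A" | "B";
}

// Runtime guard for stance labels that arrive as untyped JSON.
function isStance(value: unknown): value is Stance {
  return (
    typeof value === "string" &&
    (STANCES as readonly string[]).indexOf(value) !== -1
  );
}
```

A guard like `isStance` is useful mainly at the boundary where model output is parsed, so that an unexpected label fails loudly instead of being scored.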
What the engine does NOT predict:
- Outright election outcomes. The engine produces audience-conditional
stance + message-effect predictions; converting those to a vote prediction requires turnout assumptions the engine doesn't supply.
- Voter-level individual behavior. The engine works at the
demographic-cell level documented in lib/predictive/cells.ts.
- Causal attribution of past results. Validation entries describe what was OBSERVED to win, not which message caused the outcome. Confounders are listed explicitly per entry.
3. Why paired-comparison framing
The original framing ("ensemble beats naive on raw accuracy at N=100") is statistically out of reach at the observed effect size: the lower bound of the 95% CI on the ensemble's raw accuracy cannot clear the naive baseline at N=100. The right test is paired:
Given an audience + message pair where the ensemble and a naive classifier disagree, is the ensemble systematically more often right?
This is McNemar's test on the discordant subset:
A correct, B wrong → b
A wrong, B correct → c
chi² (continuity) = (|b - c| - 1)² / (b + c) for b + c ≥ 25
p (exact binomial) = 2 · P(X ≤ min(b, c) | n = b+c, p = 0.5) for b + c < 25
The implementation lives at lib/predictive/statistical-reframe.ts:mcnemar. A smoke test (scripts/_smoke-statistical-reframe.mjs) verifies it against textbook values.
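The two branches above can be sketched in TypeScript as follows. This is an illustrative sketch, not the code in statistical-reframe.ts: the result shape and field names are assumptions, the chi-squared tail uses a standard polynomial approximation to erfc, and the exact p-value is capped at 1 (which matters when b and c are equal or nearly so).

```typescript
// Illustrative sketch only; names and result shape are assumptions.
interface McNemarResult {
  b: number; // A correct, B wrong
  c: number; // A wrong, B correct
  discordant: number; // b + c, to be reported next to the p-value
  method: "chi2_continuity" | "exact_binomial";
  statistic: number | null; // chi-squared value, or null for the exact test
  pValue: number;
}

// erfc(x) for x >= 0, Abramowitz & Stegun 7.1.26 (absolute error about 1.5e-7).
function erfc(x: number): number {
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))));
  return poly * Math.exp(-x * x);
}

function mcnemar(b: number, c: number): McNemarResult {
  const n = b + c;
  if (n === 0) {
    // No discordant pairs: there is no evidence either way.
    return { b, c, discordant: 0, method: "exact_binomial", statistic: null, pValue: 1 };
  }
  if (n >= 25) {
    const d = Math.abs(b - c) - 1;
    const statistic = (d * d) / n;
    // Upper tail of chi-squared with 1 degree of freedom: erfc(sqrt(s / 2)).
    const pValue = erfc(Math.sqrt(statistic / 2));
    return { b, c, discordant: n, method: "chi2_continuity", statistic, pValue };
  }
  // Exact two-sided binomial: 2 * P(X <= min(b, c)) with p = 0.5, capped at 1.
  const k = Math.min(b, c);
  let term = Math.pow(0.5, n); // P(X = 0)
  let cdf = 0;
  for (let i = 0; i <= k; i++) {
    cdf += term;
    term = (term * (n - i)) / (i + 1); // P(X = i + 1) from P(X = i)
  }
  return { b, c, discordant: n, method: "exact_binomial", statistic: null, pValue: Math.min(1, 2 * cdf) };
}
```

For example, `mcnemar(10, 2)` takes the exact branch (12 discordant pairs) and returns a p-value of 158/4096, roughly 0.0386. Returning `discordant` in the result keeps the count next to the p-value wherever the result is reported.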
The paired framing has two honest consequences:
- Headline accuracy numbers don't move much; the claim is about
direction of error on the items where the two classifiers disagree, not about a higher overall percentage.
- The discordant subset is smaller than the full validation set, so
p-values look larger than a naive reader might expect. Reporting the discordant count alongside the p-value is mandatory in any external claim.
4. Confidence intervals + multi-run averaging
Every reported point estimate from the validation harness carries a bootstrap 95% CI (statistical-reframe.ts:bootstrapCi, default 1000 resamples). When a number lacks a CI, treat it as a debugging artifact, not a published claim.
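A minimal sketch of a percentile bootstrap for a 95% CI on a mean (for example, accuracy over 0/1 outcomes) is shown below. It assumes the percentile method and a caller-supplied random source; the actual method and signature of bootstrapCi are not documented here, so the function shape, the `mulberry32` helper, and the field names are all illustrative assumptions.

```typescript
// Sketch under stated assumptions; not the bootstrapCi in statistical-reframe.ts.
interface BootstrapCi {
  estimate: number;
  lower: number;
  upper: number;
  resamples: number;
}

// Small seeded PRNG (mulberry32) so a run can be reproduced exactly.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function bootstrapCi(
  outcomes: number[],
  resamples: number = 1000,
  rng: () => number = Math.random
): BootstrapCi {
  const n = outcomes.length;
  if (n === 0) throw new Error("bootstrapCi needs at least one outcome");

  let total = 0;
  for (let i = 0; i < n; i++) total += outcomes[i];

  // Resample with replacement and record the mean of each resample.
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += outcomes[Math.floor(rng() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  // Percentile method: take the 2.5th and 97.5th percentiles of the means.
  const lowerIdx = Math.floor(0.025 * resamples);
  const upperIdx = Math.min(resamples - 1, Math.ceil(0.975 * resamples) - 1);
  return { estimate: total / n, lower: means[lowerIdx], upper: means[upperIdx], resamples };
}
```

Passing a seeded generator such as `mulberry32(42)` makes the interval reproducible across runs, which is convenient when a published number has to be regenerated from an artifact.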
The harness defaults to a multi-run pool (≥3 runs) per statistical-reframe.ts:multiRunAggregate. The pooled CI reflects both within-run noise (LLM sampling variance) and run-to-run drift (retry-mediated changes in tool-call ordering). Per-run point estimates are reported alongside the pooled CI so the reader can see the spread.
When a multi-run pool isn't possible (for example, a single run in continuous integration for a regression check), the result is labeled single_run_provisional and a follow-up multi-run is queued before any external claim moves.
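The reporting rule above (per-run estimates shown next to the pooled figure, with a provisional label for single runs) can be sketched as follows. This is a hypothetical illustration, not multiRunAggregate itself: the input shape, the `multi_run` label, and the `meetsRunFloor` flag are assumptions, and the pooled CI is left out because this document does not specify how it is computed.

```typescript
// Hypothetical sketch; shapes and names other than single_run_provisional are assumed.
interface RunResult {
  runId: string;
  correct: boolean[]; // one entry per validation item
}

interface MultiRunSummary {
  label: "multi_run" | "single_run_provisional";
  meetsRunFloor: boolean; // true when at least MIN_RUNS runs were pooled
  perRun: { runId: string; accuracy: number }[];
  pooledAccuracy: number;
  spread: { min: number; max: number };
}

const MIN_RUNS = 3; // the ">= 3 runs" default described above

function summarizeRuns(runs: RunResult[]): MultiRunSummary {
  if (runs.length === 0) throw new Error("need at least one run");
  let totalCorrect = 0;
  let totalItems = 0;
  let min = Infinity;
  let max = -Infinity;
  const perRun: { runId: string; accuracy: number }[] = [];

  for (let r = 0; r < runs.length; r++) {
    const run = runs[r];
    if (run.correct.length === 0) throw new Error("run " + run.runId + " has no items");
    let hits = 0;
    for (let i = 0; i < run.correct.length; i++) if (run.correct[i]) hits++;
    const accuracy = hits / run.correct.length;
    perRun.push({ runId: run.runId, accuracy });
    totalCorrect += hits;
    totalItems += run.correct.length;
    min = Math.min(min, accuracy);
    max = Math.max(max, accuracy);
  }

  return {
    label: runs.length === 1 ? "single_run_provisional" : "multi_run",
    meetsRunFloor: runs.length >= MIN_RUNS,
    perRun,
    pooledAccuracy: totalCorrect / totalItems,
    spread: { min, max },
  };
}
```

Keeping `perRun` and `spread` in the summary lets a reader see run-to-run drift directly instead of inferring it from the width of a pooled interval.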
5. Data sources
The engine consumes corpora loaded by:
- scripts/calibration/anes/load-anes.mjs: ANES (American National Election Studies)
- scripts/calibration/gss/load-gss.mjs: GSS (General Social Survey)
- scripts/calibration/exit-polls/load-exit-polls.mjs: Edison/Mitofsky national exit polls 2010-2024
- scripts/calibration/ppic/load-ppic.mjs: PPIC + CA Field Poll
- scripts/calibration/pew/load-pew.mjs: Pew Research toplines (M1)
Each corpus lands in the M0 canonical shape (survey_questions + survey_responses + survey_question_embeddings) with a deterministic train / dev / test split via sha256(source_corpus | question_id) % 3. Splits are stable across re-runs and tied to the underlying source key, so a re-extracted CSV hits the same bucket.
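The split rule can be sketched as below. Several details are assumptions, since the document only gives the formula: the key is taken to be the two fields joined with a literal "|", the modulus is taken over the full 256-bit digest, and buckets 0, 1, 2 are mapped to train, dev, test in that order. The function name `assignSplit` is hypothetical.

```typescript
import { createHash } from "node:crypto";

type Split = "train" | "dev" | "test";

// Assumed bucket order; the document does not say which remainder maps to which split.
const SPLITS: Split[] = ["train", "dev", "test"];

function assignSplit(sourceCorpus: string, questionId: string): Split {
  // Assumed key format: the two fields joined by a literal "|".
  const digest = createHash("sha256")
    .update(sourceCorpus + "|" + questionId, "utf8")
    .digest();

  // Remainder of the full 256-bit digest mod 3, computed byte by byte
  // so no big-integer arithmetic is needed.
  let remainder = 0;
  for (let i = 0; i < digest.length; i++) {
    remainder = (remainder * 256 + digest[i]) % 3;
  }
  return SPLITS[remainder];
}
```

Because the bucket depends only on the corpus name and question id, re-running a loader or re-extracting a CSV cannot move a question between splits, as long as those two keys are unchanged.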
Note: the ANES + GSS loaders use real microdata; the Pew loader consumes a CSV intermediate populated from published toplines (the PDF-parsing question is M4 work).
6. Validation sets
Four validation sets are checked in (RED-7 has a main set and a separate heldout set):
| Set | Path | Purpose | M1 floor |
|---|---|---|---|
| RED-7 message-ranking | tests/predictive/fixtures/ranking-validation/red7-validation-set.json | Paired-comparison ranking task | ≥100 entries |
| RED-7 heldout | tests/predictive/fixtures/ranking-validation/red7-heldout.json | NEVER used for tuning; final-claim gate | ≥10 entries |
| Entity-predictor | tests/predictive/fixtures/entity-prediction/entity-pred-validation-set.json | Per-entity stance + confidence band | ≥30 entries |
| Defense-predictor | tests/predictive/fixtures/defense-prediction/defense-pred-validation-set.json | Top-3 defense pattern + text | ≥30 entries |
Schema audits:
- scripts/validate-red7.mjs (enforced by .github/workflows/red7-validate.yml)
- Entity- and defense-predictor sets have inline node -e audits in the CI workflow.
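To make the M1 floors in the table concrete, a floor check might look like the sketch below. It is not the content of scripts/validate-red7.mjs or of the inline audits: the function name is hypothetical, and it assumes each fixture parses to a top-level JSON array of entries, which this document does not confirm.

```typescript
// Hypothetical floor audit; assumes each fixture is a top-level array of entries.
// Floors are copied from the "M1 floor" column of the table above.
const M1_FLOORS: { [fileName: string]: number } = {
  "red7-validation-set.json": 100,
  "red7-heldout.json": 10,
  "entity-pred-validation-set.json": 30,
  "defense-pred-validation-set.json": 30,
};

// Returns a list of human-readable problems; an empty list means the floor is met.
function auditFloor(fileName: string, parsed: unknown): string[] {
  const floor = M1_FLOORS[fileName];
  if (floor === undefined) {
    return ["no M1 floor recorded for " + fileName];
  }
  if (!Array.isArray(parsed)) {
    return [fileName + ": expected a top-level array of entries"];
  }
  if (parsed.length < floor) {
    return [fileName + ": " + parsed.length + " entries, M1 floor is " + floor];
  }
  return [];
}
```

In a CI step, the caller would read and `JSON.parse` each fixture, pass the result to `auditFloor`, and fail the job if any problems come back.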
The heldout RED-7 entries are pulled aside at curation time and not shown to the engine during prompt tuning or persona iteration. Any public claim that "the engine generalizes" must report performance on the heldout split (with bootstrap CI + McNemar) separately from the main validation set.
7. Known limitations + confounders
- Selection effects in RED-7: entries are curated from
observable historical races where outcomes were close enough that message-frame attribution is plausible. Easy-outcome races are underweighted by construction.
- Confounders are real and listed per-entry: turnout, incumbency,
party id, money, demographics, ballot wording, religious mobilization, redistricting cycles. The confounders field on every entry is required and is read carefully by the labelers. The engine cannot separate message effects from these confounders; it predicts which frame historically beat which other frame, NOT what would have happened under a counterfactual frame swap.
- CA bias by intent: roughly 20% of RED-7 entries are CA (Aculeus's
first market). Operators interpreting the engine's confidence outside CA should expect wider CIs.
- Provider drift: DeepSeek is the primary reasoning model; Grok is wired as a fallback per RED-5. A future swap to Anthropic or OpenAI as primary will invalidate the existing RED-7 numbers and trigger both a re-run and a bump of this document to v1.
- Prompt versioning: every prompt under prompts/persona/, prompts/entity-prediction/, and prompts/defense/ is checked in, versioned, and locked at run time by the artifact-freeze logic (RED-3). The prompt that produced a published claim is recoverable from the artifact record.