Evaluation Surfaces And Runners

Evaluation uses explicit surfaces, presets, fixtures, and runners.

  • Map keys: promise.evaluation, rule.packet-freshness, rule.cost-and-proof-freshness, rule.host-owned-execution.
  • Evidence path: deterministic plus fixture-backed eval.
  • Evidence status: open gap.
  • Next action: keep normalizer, fixture, and public spec proof aligned with dev/repo, dev/skill, app/chat, and app/prompt.
  • Terms covered here: dev surface, app surface, repo preset, skill preset, chat preset, prompt preset, fixture composition, runner readiness, evaluate fixture, evaluate observation, scenario normalization.

Maintainer Promise

The top-level evaluation surfaces are dev and app, and the shipped presets are repo, skill, chat, and prompt. Each fixture declares its surface and preset so that a reader can tell what kind of behavior is under test.

Subclaims

  • evaluate fixture and evaluate observation keep a uniform CLI shape across all four shipped presets.
  • Each shipped preset is fixture-backed so a reviewer can reopen the input, observed output, and summary packets.
  • Fixtures declare their surface and preset; mismatched declarations fail rather than silently routing to a different preset.
  • Scenario normalizers and proposal-input pipelines stay aligned with the shipped surface vocabulary.
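
The fail-on-mismatch subclaim can be illustrated with a minimal sketch. It assumes a fixture is loaded as a dictionary with "surface" and "preset" keys. Those key names, the FixtureDeclarationError class, and the validate_declaration helper are hypothetical and are not taken from the shipped code. The only part drawn from this section is the pairing dev/repo, dev/skill, app/chat, and app/prompt.

```python
from dataclasses import dataclass

# Surface/preset pairs named in this section:
# dev/repo, dev/skill, app/chat, app/prompt.
SHIPPED_PRESETS = {
    "dev": ("repo", "skill"),
    "app": ("chat", "prompt"),
}


class FixtureDeclarationError(ValueError):
    """Hypothetical error for an unknown or mismatched declaration."""


@dataclass(frozen=True)
class FixtureDeclaration:
    surface: str
    preset: str


def validate_declaration(raw: dict) -> FixtureDeclaration:
    """Return the declared surface/preset, or raise if they do not match.

    The "surface" and "preset" key names are assumptions made for
    illustration; the real fixture schema may differ.
    """
    surface = raw.get("surface")
    preset = raw.get("preset")
    if surface not in SHIPPED_PRESETS:
        raise FixtureDeclarationError(f"unknown surface: {surface!r}")
    if preset not in SHIPPED_PRESETS[surface]:
        # Fail loudly instead of routing to whichever surface owns the preset.
        raise FixtureDeclarationError(
            f"preset {preset!r} is not shipped under surface {surface!r}"
        )
    return FixtureDeclaration(surface=surface, preset=preset)
```

In this sketch a fixture that declares surface app with preset repo raises an error and is not re-routed to dev/repo, which is the behavior the third subclaim describes.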

Evidence