Evaluation Surfaces
Cautilus exposes two top-level evaluation surfaces, distinguished by whether the thing under test needs a workspace runtime or a messaging runtime.
This spec replaces the prior split across instruction-surface test/evaluate, skill test/evaluate, mode evaluate chatbot mode, and the chatbot/skill/workflow archetype layer.
Status: redesign in progress. The implementation contract is live; prior public commands are removed without aliases as the slice that replaces each one lands.
Problem
A user opening Cautilus today is asked to choose among instruction-surface, skill, mode evaluate chatbot, and three archetypes (chatbot, skill, workflow).
Those choices mix two unrelated axes:
- whether the evaluation needs a workspace (real instruction files, real plugins, real codebase) or just an LLM API call with a system prompt and messages
- whether the user pins a specific model or rides the operator's habitual coding-agent setup
Workflow as a first-class archetype is also more an artifact of past framing than a real third axis: a workflow is a multi-step chat or prompt with state. The redesign collapses these into two surfaces (each with a small preset set) and one fixture schema that carries composition primitives instead of needing a third archetype.
Current Slice
Define the two-surface contract, the four presets, and the four fixture composition primitives.
The first impl slice ports instruction-surface test/evaluate to the new shape under surface: repo, preset: whole-repo.
Subsequent slices port skill, chat, and prompt presets one at a time.
Fixed Decisions
Two surfaces
- `repo`: workspace runtime. The runner drives a coding-agent harness (Claude Code, Codex) with a real or fixture workspace. Provider stance: ride the operator's habitual setup. Acceptance checks live in tools the harness provides (file system, tool calls, first-routing decisions).
- `app`: messaging runtime. The runner sends system prompt + messages to an LLM provider and inspects the response. Provider stance: pin model in fixture. Provider can be direct API (anthropic/openai/...) or a coding-agent CLI in messaging mode (`claude -p` with `--system-prompt` override, `codex exec` equivalent), so a user who only has CLI tooling installed can still run app fixtures.
Four presets
Each preset belongs to one surface and locks fixture shape so an adapter can validate offline:
| surface | preset | fixture shape | answers |
|---|---|---|---|
| repo | whole-repo | workspace + prompt | does the agent obey this repo's stated contract? |
| repo | skill | workspace (real or fixture) + skill reference | does this one skill trigger / execute / validate? |
| app | chat | system prompt + messages: [...] | does this multi-turn behavior match expectations? |
| app | prompt | system prompt + input + expected | does this single-turn I/O match expectations? |
repo / skill covers both repo-local skills (workspace points at the consuming repo) and portable plugins (workspace points at a plugin-authored fixture workspace).
The --workspace CLI flag is the source of truth for workspace resolution and defaults to the operator's cwd; the fixture body itself does not carry a workspace path.
Skill identity also defaults from suiteId; the fixture may override it through the optional top-level skillId field.
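The two defaults above can be pictured as a pair of small resolvers. This is a minimal sketch, not the Cautilus implementation: the function names are hypothetical, and it assumes `suiteId` and `skillId` are top-level keys of a parsed fixture.

```python
import os
from typing import Optional


def resolve_workspace(cli_workspace: Optional[str]) -> str:
    """Return the --workspace value if one was given, else the operator's cwd."""
    return os.path.abspath(cli_workspace) if cli_workspace else os.getcwd()


def resolve_skill_id(fixture: dict) -> str:
    """Return the optional top-level skillId, falling back to suiteId."""
    return fixture.get("skillId") or fixture["suiteId"]
```

Because the fixture body never carries a workspace path, the same fixture can be pointed at a consuming repo or at a plugin-authored fixture workspace by changing only the flag.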
Per-surface preset axes are different on purpose:
- `repo` preset axis = scope (`whole-repo` = open-ended agent behavior, `skill` = one bounded capability).
- `app` preset axis = turn shape (`chat` = multi-turn conversation, `prompt` = single-turn I/O).
This is acknowledged mixed-axis usage at the surface level, not within one surface.
Adding a cross-axis value (e.g., repo / chat or app / skill) is rejected by the schema validator and by lint:eval-surfaces (a new lint added with this spec).
A new value on either axis must pass the taxonomy-axis checkpoint and update the lint.
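The cross-axis rejection can be sketched as a lookup keyed by surface. This is an illustration only: `ALLOWED_PRESETS` and `validate_surface_preset` are hypothetical names, and the real schema validator and `lint:eval-surfaces` may be structured differently.

```python
# Hypothetical offline check; the allowed pairs mirror the four presets above.
ALLOWED_PRESETS = {
    "repo": {"whole-repo", "skill"},  # preset axis: scope
    "app": {"chat", "prompt"},        # preset axis: turn shape
}


def validate_surface_preset(surface: str, preset: str) -> None:
    """Raise ValueError for an unknown surface or a cross-axis preset."""
    if surface not in ALLOWED_PRESETS:
        raise ValueError(f"unknown surface: {surface!r}")
    if preset not in ALLOWED_PRESETS[surface]:
        raise ValueError(f"preset {preset!r} is not valid on surface {surface!r}")
```

Keeping the pairs in one table means a new axis value is a single, reviewable edit, which is where the taxonomy-axis checkpoint would apply.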
Uniform CLI
- `cautilus eval test --fixture <file>`: produce observed input packet
- `cautilus eval evaluate --input <file>`: score the packet against expectations
- Preset is read from the fixture, not from the CLI. Operators learn one command shape.
Fixture composition (all required)
- C1 (multi-case suite): one fixture file holds N cases under `cases: [...]`. Carries N evaluations through the same surface/preset and shares fields like `system` or `workspace` at the suite level.
- C2 (inheritance): `extends: <path>` deep-merges a base fixture before applying overrides. Lets users compare model variants, prompt variants, or workspace variants without copy-paste.
- C3 (multi-step composition): `steps: [...]` carries a sequence where each step references another fixture (`$ref`) or is inline. Step output is addressable in later steps via a single placeholder grammar:
  - shape: `${steps[<index>].output(.<dotted.path>)?}`
  - `<index>` is 0-based and refers to a previously executed step in the same suite.
  - `output` is the literal output node of the previous step's result packet.
  - `<dotted.path>` is an optional dotted JSON path into that node (e.g., `${steps[0].output.text}` or `${steps[1].output.evidence[0].title}`).
  - Bare `${steps[0]}` is an error; `output` is required.
  - Forward references (`${steps[N]}` where N is the current or later step) are an error.
  - String interpolation in any string-shaped fixture field; non-string substitutions stay JSON-typed only when the entire field is the placeholder.

  This replaces the dropped workflow archetype.
- C4 (snapshot baselines): `expected.snapshot: <path>` compares current output against a golden file. Snapshots are explicit artifacts; no auto-init by default.
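The C3 placeholder grammar can be sketched as a small resolver. This is one possible reading of the rules above, not the shipped parser: the function and pattern names are hypothetical, `outputs` is assumed to hold the `output` node of each executed step, and `current` is the 0-based index of the step being resolved.

```python
import json
import re
from typing import Any, List

# ${steps[<index>].output} with an optional dotted path such as .evidence[0].title
PLACEHOLDER = re.compile(
    r"\$\{steps\[(\d+)\]\.output((?:\.[A-Za-z_]\w*(?:\[\d+\])*)*)\}"
)
# Any ${steps...} token, so malformed ones can be reported instead of ignored.
ANY_STEP_REF = re.compile(r"\$\{steps[^}]*\}")
PATH_TOKEN = re.compile(r"\.([A-Za-z_]\w*)|\[(\d+)\]")


def _lookup(outputs: List[Any], current: int, match: "re.Match[str]") -> Any:
    index = int(match.group(1))
    if index >= current:
        raise ValueError(f"forward or self reference: {match.group(0)}")
    node = outputs[index]
    # Walk the dotted path: names index into objects, [n] indexes into arrays.
    for key, pos in PATH_TOKEN.findall(match.group(2)):
        node = node[key] if key else node[int(pos)]
    return node


def resolve(field: str, outputs: List[Any], current: int) -> Any:
    """Substitute step-output placeholders in one string-shaped fixture field."""
    for token in ANY_STEP_REF.findall(field):
        if not PLACEHOLDER.fullmatch(token):
            raise ValueError(f"invalid placeholder: {token}")
    whole = PLACEHOLDER.fullmatch(field)
    if whole:
        # The entire field is the placeholder, so the value stays JSON-typed.
        return _lookup(outputs, current, whole)

    def as_text(match: "re.Match[str]") -> str:
        value = _lookup(outputs, current, match)
        return value if isinstance(value, str) else json.dumps(value)

    return PLACEHOLDER.sub(as_text, field)
```

Serializing non-string values with `json.dumps` during interpolation is an assumption of this sketch; the spec only fixes the typed behavior for whole-field placeholders.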
Result packet
Every result records {provider, model, harness, mode (workspace|messaging), durationMs, costUsd?} so cost / time / provider analysis works uniformly across surfaces and downstream tools (optimize prepare-input, report.json).
Cross-runtime equivalence rules (for the app surface running the same fixture under different harnesses):
- MUST be byte-equal across harnesses: `surface`, `preset`, `fixtureRef`, `caseId`.
- MUST be present and non-empty on every harness: `provider`, `model`, `harness`, `mode`, `durationMs`, `observed.messages`, `observed.finalText`.
- MAY differ across harnesses (and downstream consumers MUST treat as harness-tagged): token counts, `costUsd`, reasoning blocks (CLI in messaging mode often strips these), system-prompt prefix injections (CLI may add anti-jailbreak preludes), retry / rate-limit telemetry.
- A result that compares against another result MUST carry both harness tags so a consumer can decide whether the diff is real or harness-driven.

These rules are validated by an acceptance check in this spec, not invented at impl time.
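The first two equivalence rules can be sketched as a comparison over two result packets. This is a sketch under assumptions: the function name is hypothetical, and it assumes the listed fields sit at the top level of the packet with `observed` as a nested object.

```python
from typing import Any, Dict, List

BYTE_EQUAL = ("surface", "preset", "fixtureRef", "caseId")
REQUIRED = ("provider", "model", "harness", "mode", "durationMs",
            "observed.messages", "observed.finalText")


def _get(packet: Dict[str, Any], dotted: str) -> Any:
    """Read a dotted field such as observed.finalText; None when absent."""
    node: Any = packet
    for key in dotted.split("."):
        node = node.get(key) if isinstance(node, dict) else None
    return node


def equivalence_violations(a: Dict[str, Any], b: Dict[str, Any]) -> List[str]:
    """List rule violations between two packets for the same fixture case."""
    problems = []
    for field in BYTE_EQUAL:
        if _get(a, field) != _get(b, field):
            problems.append(f"{field} differs across harnesses")
    for name, packet in (("a", a), ("b", b)):
        for field in REQUIRED:
            if _get(packet, field) in (None, "", [], {}):
                problems.append(f"{field} missing or empty in packet {name}")
    return problems
```

Fields in the MAY-differ group, such as `costUsd`, are deliberately not compared; a consumer would read them alongside the `harness` tag instead.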
Workflow archetype and prior commands
The redesign cuts cleanly with no backward compatibility (no external users yet). Removed in this slice:
- `cautilus instruction-surface test/evaluate`: replaced by `cautilus eval test/evaluate` with `surface=repo, preset=whole-repo`.
- The `workflow` first-class archetype: multi-step composition (C3) replaces it.
- `cautilus.instruction_surface_*` schemas: replaced by `cautilus.evaluation_*` schemas.
- `instruction_surface_*` adapter fields: replaced by `eval_test_command_templates` and `evaluation_input_default`.
scenario normalize chatbot|skill|workflow are unchanged for now (they feed proposal-input normalization, not evaluation). They lose their archetype framing once the corresponding cautilus eval presets ship and proposal-input normalization is rescoped in a later slice.
Migration tracked at corca-ai/cautilus#32.
Probe Questions
Answers should land through implementation slices, not through more spec churn:
- deep-merge semantics for `extends`: arrays replace vs. concat, nested case-list merge rules, conflict detection. Start with deep-merge-replaces-arrays and tighten if real fixtures collide.
- snapshot diff granularity: byte-equal vs. structural-equal vs. semantic-similar. Default to structural-equal on parsed JSON / first-pass byte-equal on text; revisit when first preset hits real friction.
- app surface CLI runtime parity in practice: the cross-runtime equivalence rules in `Fixed Decisions § Result packet` define the contract. The probe is whether those rules are sufficient to keep consumer comparisons honest, or whether more fields need to be promoted to MUST-be-present. Track real-fixture cases that surface gaps and revisit the equivalence list, not the architectural decision.
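The starting point named in the first probe (deep-merge with arrays replaced) can be sketched as follows. This is a minimal illustration, not the shipped merge: `deep_merge` is a hypothetical name and the field values in the usage note are invented.

```python
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Merge override onto base: objects merge recursively, everything else replaces."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # scalars and arrays: the child wins outright
    return merged
```

For example, merging a child of `{"provider": {"model": "model-b"}, "cases": [3]}` onto a base that also sets `provider.name` keeps the base `name`, swaps the `model`, and replaces the whole `cases` array rather than concatenating it.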
Deferred Decisions
- migration plan for `scenario normalize chatbot|skill|workflow`: those commands feed proposals, not evaluations. The redesign here is on the evaluation surface; proposal-input normalization is unchanged. The deprecation of the archetype layer lives in a separate handoff slice.
- plugin author distinct CLI flow: stick with `--workspace` flag for now. Revisit if plugin authors need richer fixture-workspace tooling (linting plugin manifest from fixture, etc.).
- cross-surface composition: a `repo` step that invokes an `app` fixture, or vice versa. Out of v1.
- LLM-judge as a preset: `app / judge` could land later. Not in v1.
Non-Goals
- Replacing `optimize`, `evidence`, `report`, `review` packet flows. Those stay; the new evaluation packets feed them.
- Replacing `scenario propose` and the proposal pipeline.
- Adding a fifth preset before the four ship.
Deliberately Not Doing
- Keeping `workflow` as a first-class archetype. Multi-step composition (C3) covers stateful sequences with less ceremony. The current archetype-boundary spec will be retired or rescoped after migration.
- Per-preset CLI commands like `cautilus repo whole-repo test`. Preset lives in the fixture; CLI stays uniform. Otherwise the operator surface grows with every preset added.
- Direct API as the only app runtime. Coding-agent CLI in messaging mode (Claude `-p`, Codex `exec`) MUST be a supported runtime so users without API keys can still run app fixtures.
- A separate ideation document for this redesign. The 2026-04-25 chat ideation outcome is captured here directly.
Constraints
- All four composition features (C1–C4) MUST be expressible in the v1 fixture schema, even if impl wires them across multiple slices.
- `repo` surface MUST work with the operator's installed Claude Code or Codex without re-pinning model versions.
- `app` surface MUST work without an API key when the user has Claude or Codex CLI installed.
- Fixture shape is deterministic from `(surface, preset)` and validated offline.
- Result packet schema MUST stay stable across surfaces so downstream consumers (`optimize prepare-input`, `report.json`) treat all four presets uniformly.
- No backward compatibility for prior public commands or schemas. Cuts are clean.
Success Criteria
- A user with Claude Code installed and no other config can run `cautilus eval test --fixture <whole-repo.fixture.json>` against their own repo and get a usable result with provider, model, harness, durationMs, and costUsd recorded.
- A user with Claude CLI but no API key can run `cautilus eval test --fixture <chat.fixture.json>` and get a usable result.
- A fixture using `extends: ./base.fixture.json` correctly inherits base fields and applies overrides.
- A `steps: [...]` fixture executes each step in order, and step N can read step (N-1)'s output.
- A fixture with `expected.snapshot: ./golden.json` passes when output matches and fails with a diff when it doesn't.
- Adding a fifth preset value requires the same justification as the existing four (taxonomy axis checkpoint passes).
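The snapshot criterion (pass on match, fail with a diff) can be sketched with structural equality on parsed JSON, the default proposed in the probe questions. The function name is hypothetical, and the golden file is passed in as text so the sketch stays self-contained.

```python
import difflib
import json
from typing import Any, List


def snapshot_diff(observed: Any, golden_text: str) -> List[str]:
    """Return unified-diff lines; an empty list means the snapshot matches."""
    golden = json.loads(golden_text)
    if observed == golden:  # structural equality: key order and whitespace ignored
        return []

    def render(value: Any) -> List[str]:
        return json.dumps(value, indent=2, sort_keys=True).splitlines()

    return list(difflib.unified_diff(render(golden), render(observed),
                                     "snapshot", "observed", lineterm=""))
```

Rendering both sides with sorted keys before diffing keeps the reported diff limited to real value changes rather than formatting noise.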
Acceptance Checks
Each criterion has at least one executable check.
- C1 multi-case: a suite with `cases: [a, b]` evaluates both, fixture-backend driver.
- C2 extends: `child.fixture.json` extends `base.fixture.json`, asserts inherited and overridden fields.
- C3 multi-step: 2-step fixture where step 2 reads `${steps[0].output.text}` (dotted path) AND a separate fixture using bare `${steps[0].output}` (whole-output substitution); both pass, and an invalid placeholder (`${steps[0]}`, forward ref, missing index) errors with a parse-error.
- C4 snapshot: pass case (output matches snapshot) and mismatch case (output diff surfaces in result).
- Per-preset proof:
  - `repo / whole-repo`: cautilus's own AGENTS.md routing test (current self-dogfood, ported).
  - `repo / skill`: portable plugin probe in fixture workspace (port from current skill test fixtures).
  - `app / chat`: multi-turn fixture against fixture-backend (no real model).
  - `app / prompt`: single-turn fixture against fixture-backend.
- CLI runtime parity: `app / chat` fixture run via direct API and via `claude -p --system-prompt` produces packets where the MUST-be-byte-equal fields match, the MUST-be-present-and-non-empty fields are populated on both sides, and the MAY-differ fields are carried with a harness tag.
- Spec lint: `npm run lint:specs` accepts this spec; archetype lint scope follows the workflow-archetype removal.
- Taxonomy axis: per-surface preset enums kept narrow; mixed-axis values (e.g., adding `repo / chat`) rejected at the schema validator.
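The C1 check in the list above relies on suite-level fields reaching every case. A minimal sketch of that expansion follows; `expand_cases` is a hypothetical name, and letting a case-level key override the suite-level one is an assumption the spec does not state.

```python
from typing import Any, Dict, List


def expand_cases(suite: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one multi-case suite into one flat evaluation input per case."""
    shared = {key: value for key, value in suite.items() if key != "cases"}
    # Assumption: a key set on the case overrides the suite-level value.
    return [{**shared, **case} for case in suite.get("cases", [])]
```

A suite with `cases: [a, b]` therefore yields two inputs that share `surface`, `preset`, and any suite-level `system` field.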
Premortem
Bounded fresh-eye review delegated 2026-04-25 to a subagent reading this spec, instruction-surface.spec.md, archetype-boundary.spec.md, master-plan.md, and the taxonomy-axis-checkpoint reference.
Verdict: ready with named tightenings.
Five findings folded back into Fixed Decisions, Constraints, and Acceptance Checks above.
Anchored re-litigation notes:
- "`repo / skill` and `repo / whole-repo` look like duplicate tests at fixture level." They differ in scope, not runtime. `whole-repo` asks an open-ended question against the real repo; `skill` is bounded to one capability and the workspace knob distinguishes plugin-fixture from real-repo. The per-surface preset axes (scope for `repo`, turn-shape for `app`) are explicit in `Fixed Decisions`.
- "Coding-agent CLI in messaging mode is the same as direct API." It is not. `claude -p` may inject anti-jailbreak prefixes, may strip reasoning blocks, and may differ in token accounting from direct API. The cross-runtime equivalence rules in `Fixed Decisions § Result packet` make this concrete: which fields MUST be byte-equal, which MUST be present, which MAY differ with harness tag.
- "`extends` is just file include." It is a deep-merge with array-replace semantics; child scalar wins on conflict. Documented in `Fixed Decisions § Fixture composition`.
- "`workflow` archetype removal breaks existing host repos." No external host repos exist yet, so the workflow archetype and prior public commands are cut without aliases. Migration noted in a tracking issue rather than in code.
- "Taxonomy-axis checkpoint passed because the spec asserts it did." The per-surface axes are spelled out and the schema validator rejects cross-axis values. Adding a fifth value requires both the checkpoint pass and explicit reasoning in the next spec edit.
Canonical Artifact
docs/specs/evaluation-surfaces.spec.md.
This spec is the implementation contract. Per-slice impl notes live in commits and the migration tracking issue.
First Implementation Slice
repo / whole-repo preset shipped:
- `cautilus eval test`, `cautilus eval evaluate` accept `cautilus.evaluation_input.v1` fixtures.
- Schema validates `surface=repo, preset=whole-repo` only; C2/C3/C4 fields stub-error until their slices land.
- Self-dogfood fixture lives at `fixtures/eval/whole-repo/checked-in-agents-routing.fixture.json`; runner under `scripts/run-self-dogfood-eval.mjs`.
- Prior `cautilus instruction-surface test/evaluate`, `cautilus.instruction_surface_*` schemas, `instruction_surface_*` adapter fields, and the `workflow` archetype framing in this layer are removed without aliases.
Follow-up slices proceed in this order:
- ~~`repo / skill` preset: replace `cautilus skill test/evaluate`.~~ Shipped 2026-04-25. `cautilus.evaluation_input.v1` now translates `surface=repo, preset=skill` cases into the existing `cautilus.skill_test_cases.v1` shape, the `cautilus eval evaluate` handler dispatches to `BuildSkillEvaluationSummary` when the observed packet uses `cautilus.skill_evaluation_inputs.v1`, and the `cautilus skill test/evaluate` commands plus their adapter slots and example fixtures were cut without aliases.
- `app / chat` preset: replace `cautilus mode evaluate` chatbot mode. Additive accept shipped 2026-04-25: `cautilus.evaluation_input.v1` accepts `surface=app, preset=chat` and translates fixtures to `cautilus.app_chat_test_cases.v1`; `cautilus eval evaluate` dispatches `BuildAppChatEvaluationSummary` on `cautilus.app_chat_evaluation_inputs.v1`; the result packet enforces the cross-runtime equivalence rules (`provider` / `model` / `harness` / `mode=messaging` / `durationMs` / `observed.messages` / `observed.finalText` required) at the evaluator boundary. The matching cut shipped 2026-04-26: `cautilus mode evaluate`, the `iterate / held_out / comparison / full_gate` adapter slots, and the chatbot scenario init scaffold were removed without aliases. Optimize-search held-out and full-gate checkpoints now honest-skip with `status=skipped` and `skipReason=surface_unavailable` until the search layer is rewired onto the eval-test surface.
- ~~`app / prompt` preset: new.~~ Shipped 2026-04-26. `cautilus.evaluation_input.v1` now accepts `surface=app, preset=prompt` and translates fixtures to `cautilus.app_prompt_test_cases.v1`; `cautilus eval evaluate` dispatches `BuildAppPromptEvaluationSummary` on `cautilus.app_prompt_evaluation_inputs.v1`. The result packet keeps the app-surface runtime fields from `app/chat` (`provider`, `model`, `harness`, `mode=messaging`, `durationMs`, `observed.messages`, `observed.finalText`) and adds required `observed.input` for the single-turn I/O boundary.
- C2–C4 composition primitives, one per slice.
- Rescope `scenario normalize` proposal-input lineage. The `archetype-boundary.spec.md` retirement was absorbed into the `mode evaluate` cut slice (2026-04-26): the spec was removed, `lint:archetypes` was reframed as a runtime-completeness check for the surviving `scenario normalize` plumbing, and AGENTS.md / CLAUDE.md / README.md / master-plan.md were realigned to point at this spec instead.