Cautilus, Proven On Itself
You build agents — coding assistants, chatbots, skills, workflows — and you need to know they do what you intend. Cautilus is a standalone CLI and agent that finds the promises your project makes, proves the ones that matter against real evidence, and improves the behavior behind them. This page is Cautilus's own promises, proven on Cautilus itself.
How To Read This
Each promise below gives you what you get in one line, how far it's proven today, and a link to the executable spec that proves it. The promise stays honest because the badge is earned by what that spec actually runs, not by what this page asserts.
- proven — a checked-in executable spec runs the behavior live and asserts on the result
- declared — the evidence exists as a saved bundle, but the behavior has not been re-run live yet (named in Proof Debt)
- promised — stated, with no executable proof attached yet (named in Proof Debt)
Today, two promises are proven live, four are declared, and one is promised. That split is the point: this page shows the real state of the work.
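The three badges above can be read as a small decision rule. The sketch below is illustrative only: the function name `badge_for` and its two boolean inputs are assumptions made for this example, not part of Cautilus.

```python
from enum import Enum


class Badge(Enum):
    PROVEN = "proven"
    DECLARED = "declared"
    PROMISED = "promised"


def badge_for(runs_live_and_asserts: bool, has_saved_bundle: bool) -> Badge:
    """Derive a badge from what the spec does, not from what the page says.

    Both parameters are hypothetical names chosen for this sketch.
    """
    if runs_live_and_asserts:
        return Badge.PROVEN      # a checked-in spec runs the behavior live
    if has_saved_bundle:
        return Badge.DECLARED    # evidence exists, but was not re-run live
    return Badge.PROMISED        # stated, no executable proof attached


print(badge_for(runs_live_and_asserts=False, has_saved_bundle=True).value)
```

The ordering matters: a live run outranks a saved bundle, so a promise with both still earns only what its spec actually executes.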
What Cautilus Does For You
Readiness — proven
Before your first evaluation, you can see exactly what's ready and what to set up next — and you'll get the same clear answer every run.
Proof: Readiness spec (builds the CLI, runs cautilus doctor on sample repos, asserts the JSON and exit codes).
CLI ↔ Agent: the CLI runs the checks and prints the verdict; the agent suggests which workflow to run next.
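As a rough illustration of what "asserts the JSON and exit codes" can look like, here is a minimal sketch of a check that runs a command, verifies its exit code, and parses its output. The helper `run_json_check` and the `ready` field are hypothetical. The real output shape of cautilus doctor is not described on this page, so a stand-in command is used in place of the CLI.

```python
import json
import subprocess
import sys


def run_json_check(argv, expected_exit=0):
    """Run a command, check its exit code, and parse stdout as JSON."""
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != expected_exit:
        raise AssertionError(
            f"exit code {result.returncode}, expected {expected_exit}"
        )
    return json.loads(result.stdout)


# Stand-in command so the sketch runs without Cautilus installed.
# A real spec would pass the CLI invocation here instead.
stand_in = [sys.executable, "-c", 'import json; print(json.dumps({"ready": True}))']

report = run_json_check(stand_in)
assert report["ready"] is True  # "ready" is an assumed field name
```

Checking the exit code before parsing keeps the two assertions separate, so a failure says which contract was broken.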
Claim Discovery — proven
Point Cautilus at your repo and you get back every promise your docs make, as a worklist — each one linked to the exact line that declares it, and marked proven only once real evidence backs it.
Proof: Claim Discovery spec (runs cautilus discover claims on a sample repo and asserts the routing output).
CLI ↔ Agent: the CLI reads your docs and drafts a first pass; the agent sharpens it, groups it, and decides what to do next.
Behavior Evaluation — declared
You see whether the agents in your repo behave the way you intend — the coding agent you build with (does it follow your repo's rules and pick the right skill?) and the app you ship (does its chatbot give the right answers across a multi-turn conversation, and does a single prompt return what it should?).
Proof: Behavior Evaluation spec. Today it projects a saved eval-surfaces bundle rather than running a live eval; see Proof Debt.
CLI ↔ Agent: the CLI runs your agent and records what it did; the agent turns that into scenarios and reads the result.
Bounded Improvement — declared
You hand Cautilus a prompt that's failing your eval and it rewrites the prompt until it passes — keeping the agent's intent intact, holding or lowering cost, and proving the win on held-out examples it was never tuned on. You approve every change before it ships.
Proof: Bounded Improvement spec. Today it projects a saved bundle rather than running a live improve loop; see Proof Debt.
CLI ↔ Agent: the CLI runs the gates and the comparison; the agent proposes the change and judges whether to keep it.
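The conditions named in this promise (a win on held-out examples, cost held or lowered, intent intact, your approval) can be pictured as one acceptance gate. This is a sketch under stated assumptions: `EvalResult`, `should_keep`, and the pass-rate metric are invented for illustration and do not describe how Cautilus scores a change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    held_out_pass_rate: float  # fraction of held-out examples passed, 0.0 to 1.0
    cost: float                # any consistent cost unit


def should_keep(baseline: EvalResult, candidate: EvalResult,
                intent_preserved: bool, approved: bool) -> bool:
    """Keep a rewritten prompt only if every gate holds."""
    wins_on_held_out = candidate.held_out_pass_rate > baseline.held_out_pass_rate
    cost_held_or_lower = candidate.cost <= baseline.cost
    return wins_on_held_out and cost_held_or_lower and intent_preserved and approved


baseline = EvalResult(held_out_pass_rate=0.50, cost=1.0)
candidate = EvalResult(held_out_pass_rate=0.75, cost=1.0)
print(should_keep(baseline, candidate, intent_preserved=True, approved=True))
```

The gates are combined with a plain conjunction, so a better score never compensates for higher cost or a missing approval.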
Why You Can Trust It
Cautilus marks each promise with how far it's actually proven, and keeps that mark in plain sight. So everything on this page is the real state of the work — which is what makes the proven marks worth anything.
Reviewable Artifacts — declared
You, the next teammate, or the next agent can reopen exactly what happened: every run leaves both a machine-readable record and a readable view to audit against.
Proof: Reviewable Artifacts spec — it projects saved packet bundles rather than regenerating them live; see Proof Debt.
Host Ownership — declared
Everything specific to you — your prompts, fixtures, and policy — stays in your repo and under your control; Cautilus brings only the generic workflow.
Proof: Host Ownership spec. A live onboarding smoke (npm run consumer:onboard:smoke) already exists and would upgrade this to proven; see Proof Debt.
A Testable Agent — promised
You get a clean, invokable runner for Cautilus to run against, the kind of entry point that makes any agent testable in the first place. The Cautilus agent helps you build it and check how testable your agent is, so the testability stays yours for good.
Proof: none yet — see Proof Debt. Background: runner readiness contract.
Proof Debt
What it would take to move each unproven promise to a live proven badge. This list is owned work, not a silent gap.
| Promise | Current | To reach proven |
|---|---|---|
| Behavior Evaluation | declared (projects evidence-current-eval-surfaces bundle; proofClass: declared-eval-runner) | a live cautilus evaluate run over an owner-confirmed scenario/fixture, asserted in the spec |
| Bounded Improvement | declared (projects a saved bundle) | a live cautilus improve loop on a held-out scenario, asserted in the spec |
| Reviewable Artifacts | declared (projects saved packets) | regenerate packets live in the spec and assert their shape |
| Host Ownership | declared (projects onboarding bundle) | wire the existing consumer:onboard:smoke live run into the spec |
| A Testable Agent | promised (no spec) | author a spec backed by the runner-readiness/verification contracts |
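The table above can also be held as data, which makes "owned work, not a silent gap" checkable. The sketch below copies the rows from this page. The `Promise` type and the two helper functions are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Promise:
    name: str
    badge: str                 # "proven", "declared", or "promised"
    to_reach_proven: str = ""  # next action; empty for proven promises


LEDGER = [
    Promise("Readiness", "proven"),
    Promise("Claim Discovery", "proven"),
    Promise("Behavior Evaluation", "declared",
            "a live cautilus evaluate run over an owner-confirmed scenario/fixture"),
    Promise("Bounded Improvement", "declared",
            "a live cautilus improve loop on a held-out scenario"),
    Promise("Reviewable Artifacts", "declared",
            "regenerate packets live in the spec and assert their shape"),
    Promise("Host Ownership", "declared",
            "wire the existing consumer:onboard:smoke live run into the spec"),
    Promise("A Testable Agent", "promised",
            "author a spec backed by the runner-readiness/verification contracts"),
]


def proof_debt(ledger):
    """Every promise that is not yet proven live."""
    return [p for p in ledger if p.badge != "proven"]


def silent_gaps(ledger):
    """Unproven promises with no named next action."""
    return [p.name for p in proof_debt(ledger) if not p.to_reach_proven]


print(len(proof_debt(LEDGER)), silent_gaps(LEDGER))
```

An empty result from `silent_gaps` is the property this section claims: every unproven promise names what would prove it.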
How Proof Works Here
Each subclaim on a spec page must be backed by evidence that runs the claimed behavior end-to-end and asserts on the produced packet, file, or audit artifact. These are not acceptable as the closing state of a subclaim:
- --help substring matches or other surface-existence checks that only assert a word appears in help output
- "command exists" probes that do not run the claimed scenario
- "Evidence is pending" placeholders left as the closing state
For each subclaim, either (a) add an executable check that runs the actual scenario and asserts on the produced packet/file, or (b) link a concrete existing evidence bundle, audit fixture, or packet path that proves that specific subclaim. If a subclaim genuinely has no evidence yet, log it as explicit proof debt above — an owned next-action that names the bundle to author — rather than letting it close as a silent omission.
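The rule above amounts to a classification of how each subclaim closes. The following sketch shows one way to encode it. The `Subclaim` fields and the string labels are assumptions made for this example, not names from the Cautilus contracts.

```python
from dataclasses import dataclass
from typing import Optional

# Check kinds that are not acceptable as a closing state (labels are hypothetical).
REJECTED_KINDS = {"help-substring", "command-exists", "pending-placeholder"}


@dataclass
class Subclaim:
    text: str
    check_kind: Optional[str] = None         # e.g. "scenario-run"
    evidence_path: Optional[str] = None      # existing bundle, fixture, or packet
    proof_debt_action: Optional[str] = None  # owned next action, if logged


def closing_state(sub: Subclaim) -> str:
    """Classify a subclaim as backed, proof-debt, rejected, or silent-omission."""
    if sub.check_kind == "scenario-run" or sub.evidence_path:
        return "backed"            # option (a) or option (b)
    if sub.proof_debt_action:
        return "proof-debt"        # explicit, owned gap
    if sub.check_kind in REJECTED_KINDS:
        return "rejected"          # surface check only
    return "silent-omission"       # nothing attached at all


print(closing_state(Subclaim("doctor reports readiness", check_kind="help-substring")))
```

Only the first two outcomes are acceptable closing states; the last two are what a review of a spec page would flag.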
Vocabulary
- A promise means something Cautilus currently says it can help with.
- A candidate claim means a possible promise found during Claim Discovery.
- Evidence means a packet, fixture, command result, or durable artifact that supports a promise.
- A gap means missing or weak evidence that stays visible.
- A cross-cutting rule means a rule or risk that applies across workflows, such as host ownership, evidence visibility, or packet freshness.
Go Deeper
Everything below serves the promises above.
- User Workflow: how people use the cautilus CLI and the cautilus-agent skill to discover, evaluate, and improve behavior against explicit evidence.
- Contracts: the command, packet, adapter, fixture, and evidence contracts that keep the workflow buildable and reviewable.
- Promise Ledger: which behavior claims Cautilus makes, how they relate, and which workflow or contract owns each.
- Cross-Cutting Rules: reviewability, ownership, vocabulary, freshness, cost, and resumability rules across workflow steps.
- Evidence State: which claims are supported now, which proof was selected instead of rerun, and which gaps remain open.
The whole pitch is simple: real evidence for every promise, and an honest mark of how far each one is proven.