Coverage

The agentic coverage harness turns “review this repo for X” from an unauditable one-shot into a persistent, accumulating, comparable record of what an agent examined, for which concerns, and what it concluded.

What coverage is

When a model performs a survey task — “review this repo for vulnerabilities”, “find every place we touch the database”, “audit for accessibility” — findings vary run-to-run, and you can’t tell “no issues found” apart from “the model never looked.” The coverage harness fixes that by making coverage explicit and durable:

On top of that foundation, the harness also lets you measure and reproduce variance: diff two runs, and replay a run to compare it against the original.

targets workspace × scope concerns catalog assertions (file × concern) → verdict ledger & audit coverage / gaps / findings
Targets and concerns produce per-cell assertions that accumulate into an append-only ledger and a pass / gap audit.

The model

Coverage data for a target lives under <workspace>/.rupu/coverage/<target_id>/ as append-only JSONL plus a catalog snapshot. The <target_id> is derived deterministically from (workspace, scope_name), so the same agent against the same repo accumulates into one target across runs, while different workspaces stay distinct.

EntityMeaning
TargetThe unit of accumulation — one per (workspace, scope_name). Runs against the same target merge into one record.
ConcernA single criterion the agent assesses against (e.g. an OWASP or CWE entry), drawn from the effective catalog.
AssertionA (concern, file) verdict with evidence: clean, finding, examined, or not_applicable.
FindingA reported issue, with severity, location, and remediation.
LedgerThe append-only JSONL files that make up a target’s record (below). Every row carries run_id + model + surface attribution.

The ledger files under each target directory:

FileContents
files.jsonlevery file touch (read / grep / glob / edit / cmd), with attribution
concerns.jsonlevery (concern, file) → verdict assertion
findings.jsonlevery reported issue
catalog.yamlthe effective concern catalog, snapshotted at run start
runs.jsonlone manifest per run (its defining inputs, for replay)

That per-row run_id + model + surface attribution is what makes cross-run and cross-model analysis possible.

Declaring coverage

An agent activates the harness by declaring a concerns: block in its frontmatter. The block is a list of catalog includes (with optional overrides, filters, and render mode). When present, the runtime flattens the catalog, snapshots it, injects the catalog into the system prompt, and registers the coverage tools.

---
name: security-assessor
permissionMode: readonly
concerns:
  - include: owasp-top10-2021
    mode: full
  - include: cwe-top25-2023
    mode: full
  - include: secrets-in-source
    mode: full
  - include: stride
    mode: full
  - include: cwe-software-development   # ~399 CWEs
    mode: index                          # one-line table, searched on demand
---
You are a security assessor. For each (file × concern) you assess, call
coverage_mark; for each issue, call report_finding…

Render modes. full inlines each concern’s body into the prompt; index renders a compact one-line-per-concern table that the agent searches on demand (use it for large catalogs like the full CWE list so the prompt stays small); auto (default) picks based on catalog size.

Bundled catalogs

rupu coverage templates list prints them all:

owasp-top10-2021        owasp-api-top10-2023
cwe-top25-2023          cwe-software-development   cwe-research
stride                  secrets-in-source          code-smells
web-security-default    api-security-default

User catalogs may live in .rupu/concerns/ (project) or ~/.rupu/concerns/ (global) and are discovered by name; project overrides global overrides builtin.

Coverage tools the agent gets

When concerns: is set, these tools are injected automatically — you do not list them in the agent’s tools::

ToolPurpose
coverage_markrecord a (concern, file) verdict + evidence
report_findingrecord an issue (severity, location, remediation)
coverage_remaininglist in-scope files still lacking an assertion
coverage_statussummary of assessed-vs-gap progress
coverage_concerns_search / coverage_concerns_detailsearch / fetch full bodies for index-mode catalogs

Running & auditing

All inspection commands take the global --format table|json|csv flag (table is the default; structured commands support table/json, tabular ones also csv).

rupu coverage list                          List targets under .rupu/coverage/
rupu coverage templates {list, show}        List bundled catalogs / print one's concerns
rupu coverage catalog <target>              Print the effective catalog snapshot
rupu coverage show <target>                 Derived view: files touched + assertions + findings
rupu coverage audit <target>                Full report: per-concern coverage, gaps, cross-model, findings
rupu coverage gap <target>                  Just the gaps (in-scope files lacking an assertion)
rupu coverage runs <target>                 List the runs recorded against a target
rupu coverage diff <target> [base compare]  What changed between two runs (defaults: previous latest)
rupu coverage rerun <target> <run_id>       Replay an agent run, appending a new run to the same target

Find a target id with rupu coverage list, then inspect:

rupu coverage audit a1b2c3d4e5f6          # human report
rupu --format json coverage audit a1b2c3d4e5f6   # machine-readable
rupu --format csv  coverage runs  a1b2c3d4e5f6

audit surfaces per-concern coverage, gaps, cross-model attribution, and serendipitous findings; the underlying findings and verdicts come straight from the target’s ledger. Because coverage is surface-uniform, runs dispatched through the control plane accumulate into the same target and show up in the same audit.

The rerun → diff loop

Model output varies run-to-run — sometimes usefully (different angles surface different bugs), but you could never evaluate the combination for completeness. Now you can:

rupu run security-assessor "assess this repo"   # run 1
rupu coverage runs <target>                      # grab run 1's id
rupu coverage rerun <target> <run_id>            # replay it → run 2, same target
rupu coverage diff <target> <run_id> latest      # what did run 2 do differently?

diff reports four dimensions: cell-coverage delta (newly / no-longer asserted), verdict flips (with clean → finding flagged), findings appeared / disappeared, and file-touch delta.

Note. v1 rerun dispatch covers the agent surface; session / workflow / autoflow runs are captured but rerun returns an explicit “not yet supported” error. A replay re-resolves provider / model / concerns from the agent’s current frontmatter (it goes through rupu run), so those manifest fields are a record of the original run, not replay inputs.

Determinism

Everything the harness controls about what the model sees — concern ordering, the catalog snapshot, and the live file list — is byte-stable and independent of the order catalog inputs are declared in. That makes the model the only source of run-to-run variance, which is exactly what diff measures. The harness does not claim byte-identical model output: prompt construction is deterministic; sampling is not.