Coverage
The agentic coverage harness turns “review this repo for X” from an unauditable one-shot into a persistent, accumulating, comparable record of what an agent examined, for which concerns, and what it concluded.
What coverage is
When a model performs a survey task — “review this repo for vulnerabilities”, “find every place we touch the database”, “audit for accessibility” — findings vary run-to-run, and you can’t tell “no issues found” apart from “the model never looked.” The coverage harness fixes that by making coverage explicit and durable:
- Coverage is explicit. For every
(file × concern)the agent assesses, it records a verdict —clean/finding/examined/not_applicable— with evidence. File reads, greps, and edits are tracked automatically. - Runs accumulate. A second pass (same model or a different one) picks up where the first left off; cross-model verdicts are attributed and merge-able.
- Industry-anchored catalogs. Ship OWASP Top 10, CWE Top 25, the full CWE weakness list, STRIDE, secrets, and more; include or extend them.
- Surface-uniform. Works the same whether the agent loop is driven by
rupu run, a workflow, an autoflow cycle, or an interactive session.
On top of that foundation, the harness also lets you measure and reproduce variance: diff two runs, and replay a run to compare it against the original.
The model
Coverage data for a target lives under <workspace>/.rupu/coverage/<target_id>/ as append-only JSONL plus a catalog snapshot. The <target_id> is derived deterministically from (workspace, scope_name), so the same agent against the same repo accumulates into one target across runs, while different workspaces stay distinct.
| Entity | Meaning |
|---|---|
| Target | The unit of accumulation — one per (workspace, scope_name). Runs against the same target merge into one record. |
| Concern | A single criterion the agent assesses against (e.g. an OWASP or CWE entry), drawn from the effective catalog. |
| Assertion | A (concern, file) verdict with evidence: clean, finding, examined, or not_applicable. |
| Finding | A reported issue, with severity, location, and remediation. |
| Ledger | The append-only JSONL files that make up a target’s record (below). Every row carries run_id + model + surface attribution. |
The ledger files under each target directory:
| File | Contents |
|---|---|
files.jsonl | every file touch (read / grep / glob / edit / cmd), with attribution |
concerns.jsonl | every (concern, file) → verdict assertion |
findings.jsonl | every reported issue |
catalog.yaml | the effective concern catalog, snapshotted at run start |
runs.jsonl | one manifest per run (its defining inputs, for replay) |
That per-row run_id + model + surface attribution is what makes cross-run and cross-model analysis possible.
Declaring coverage
An agent activates the harness by declaring a concerns: block in its frontmatter. The block is a list of catalog includes (with optional overrides, filters, and render mode). When present, the runtime flattens the catalog, snapshots it, injects the catalog into the system prompt, and registers the coverage tools.
--- name: security-assessor permissionMode: readonly concerns: - include: owasp-top10-2021 mode: full - include: cwe-top25-2023 mode: full - include: secrets-in-source mode: full - include: stride mode: full - include: cwe-software-development # ~399 CWEs mode: index # one-line table, searched on demand --- You are a security assessor. For each (file × concern) you assess, call coverage_mark; for each issue, call report_finding…
Render modes. full inlines each concern’s body into the prompt; index renders a compact one-line-per-concern table that the agent searches on demand (use it for large catalogs like the full CWE list so the prompt stays small); auto (default) picks based on catalog size.
Bundled catalogs
rupu coverage templates list prints them all:
owasp-top10-2021 owasp-api-top10-2023 cwe-top25-2023 cwe-software-development cwe-research stride secrets-in-source code-smells web-security-default api-security-default
User catalogs may live in .rupu/concerns/ (project) or ~/.rupu/concerns/ (global) and are discovered by name; project overrides global overrides builtin.
Coverage tools the agent gets
When concerns: is set, these tools are injected automatically — you do not list them in the agent’s tools::
| Tool | Purpose |
|---|---|
coverage_mark | record a (concern, file) verdict + evidence |
report_finding | record an issue (severity, location, remediation) |
coverage_remaining | list in-scope files still lacking an assertion |
coverage_status | summary of assessed-vs-gap progress |
coverage_concerns_search / coverage_concerns_detail | search / fetch full bodies for index-mode catalogs |
Running & auditing
All inspection commands take the global --format table|json|csv flag (table is the default; structured commands support table/json, tabular ones also csv).
rupu coverage list List targets under .rupu/coverage/ rupu coverage templates {list, show} List bundled catalogs / print one's concerns rupu coverage catalog <target> Print the effective catalog snapshot rupu coverage show <target> Derived view: files touched + assertions + findings rupu coverage audit <target> Full report: per-concern coverage, gaps, cross-model, findings rupu coverage gap <target> Just the gaps (in-scope files lacking an assertion) rupu coverage runs <target> List the runs recorded against a target rupu coverage diff <target> [base compare] What changed between two runs (defaults: previous latest) rupu coverage rerun <target> <run_id> Replay an agent run, appending a new run to the same target
Find a target id with rupu coverage list, then inspect:
rupu coverage audit a1b2c3d4e5f6 # human report rupu --format json coverage audit a1b2c3d4e5f6 # machine-readable rupu --format csv coverage runs a1b2c3d4e5f6
audit surfaces per-concern coverage, gaps, cross-model attribution, and serendipitous findings; the underlying findings and verdicts come straight from the target’s ledger. Because coverage is surface-uniform, runs dispatched through the control plane accumulate into the same target and show up in the same audit.
The rerun → diff loop
Model output varies run-to-run — sometimes usefully (different angles surface different bugs), but you could never evaluate the combination for completeness. Now you can:
rupu run security-assessor "assess this repo" # run 1 rupu coverage runs <target> # grab run 1's id rupu coverage rerun <target> <run_id> # replay it → run 2, same target rupu coverage diff <target> <run_id> latest # what did run 2 do differently?
diff reports four dimensions: cell-coverage delta (newly / no-longer asserted), verdict flips (with clean → finding flagged), findings appeared / disappeared, and file-touch delta.
rerun dispatch covers the agent surface; session / workflow / autoflow runs are captured but rerun returns an explicit “not yet supported” error. A replay re-resolves provider / model / concerns from the agent’s current frontmatter (it goes through rupu run), so those manifest fields are a record of the original run, not replay inputs.Determinism
Everything the harness controls about what the model sees — concern ordering, the catalog snapshot, and the live file list — is byte-stable and independent of the order catalog inputs are declared in. That makes the model the only source of run-to-run variance, which is exactly what diff measures. The harness does not claim byte-identical model output: prompt construction is deterministic; sampling is not.