Multi-host

Control rupu agents and workflows running on other machines from one central Control Plane — register remote hosts, launch work on the host you choose, and watch it stream back, all in a single browser tab.

What it is

Multi-host is a hub model. One central Control Plane federates out to a fleet of remote hosts, each running its own control plane (rupu cp serve). A Host is a top-level peer in rupu — it sits alongside your Projects, not underneath them. The machine the central CP runs on is itself just a host: host[0], "this host", always present and always first.

The remote is always the source of truth: the central CP keeps no authoritative copy of a host's run state, and it tags every run with the host it came from. How it reaches that state depends on the transport. HTTP hosts are live-queried on demand: open a run list and the CP fans out to each host, merges the results, and proxies straight through to a run's detail, event stream, and transcript. SSH, tunnel, and bucket hosts mirror their run artifacts back into the central run store as they change. Either way you get one merged fleet view, and no host's state is duplicated as a second source of truth.

No new inbound surface. In this release the central CP is purely a client: it adds no new way in. Each remote host's API is already token-gated by its own rupu cp serve, and that token is what the center authenticates with.

Figure: one central Control Plane reaching each host through whichever transport fits: in-process for host[0], HTTP and SSH for hosts it can reach directly, and the dial-home tunnel and dead-drop bucket for hosts it can't connect to directly.

Connection types

Every host is reached through a single transport port (HostConnector), so the Fleet surface, host attribution, launcher, and live streams work the same no matter how a host is wired up. Five transports ship today, covering hosts you can reach directly, hosts behind a NAT or firewall, and hosts you can't reach at all:

Transport	How the host is reached	How to register	Best for
Local	In-process — no network hop. This is host[0], "this host," always present.	Built in (the `local` host). Nothing to register.	The machine the central CP runs on.
HTTP (federation)	The central CP is a client of the remote's `rupu cp serve`, over its token-gated HTTP API.	`rupu host add <name> --url https://… --token …`	A remote that can expose a reachable, TLS-fronted server.
SSH	The CP connects out over SSH and runs `rupu` on the remote; auth is delegated to the system `ssh`.	`rupu host add <name> --ssh user@host`	A host you can already `ssh` to but that runs no server.
Tunnel (dial-home)	The remote dials home over an outbound WebSocket; the CP reaches it via its node registry. Works behind NAT / firewalls.	`rupu node enroll <name>`, then run `rupu node` on the box.	A NAT'd / firewalled box that can't expose any inbound port.
Bucket (pull / dead-drop)	The CP and the worker exchange jobs + results through an object store. Neither side connects to the other.	`rupu host add <name> --bucket s3://…`, then run `rupu node pull`.	Air-gapped / fully disconnected / batch hosts.
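To make the "single transport port" idea concrete, here is a minimal Python sketch of one interface sitting in front of several transports. It is illustrative only: the method names (`list_runs`, `launch`), the `Run` type, and the in-memory `LocalConnector` are assumptions made for this example, not rupu's actual HostConnector definition.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Run:
    host_id: str  # every run is tagged with the host it came from
    run_id: str
    status: str


class HostConnector(Protocol):
    """Hypothetical shape of the transport port (method names are assumed)."""

    host_id: str

    def list_runs(self) -> list[Run]: ...

    def launch(self, workflow: str) -> Run: ...


class LocalConnector:
    """In-process stand-in for host[0]: no network hop."""

    def __init__(self) -> None:
        self.host_id = "local"
        self._runs: list[Run] = []

    def list_runs(self) -> list[Run]:
        return list(self._runs)

    def launch(self, workflow: str) -> Run:
        run = Run(self.host_id, f"run-{len(self._runs) + 1}", "running")
        self._runs.append(run)
        return run


def launch_on(connector: HostConnector, workflow: str) -> Run:
    # Caller code depends only on the port, never on a concrete transport.
    return connector.launch(workflow)


local_host = LocalConnector()
started = launch_on(local_host, "review")
print(started)
```

An HTTP, SSH, tunnel, or bucket connector would implement the same two methods, which is why the surfaces built on top do not need to know how a host is wired up.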

Where hosts persist. Every host is a TOML record under ~/.rupu/hosts/<id>.toml holding only its name and transport metadata. HTTP bearer tokens live in the system keychain, referenced by host id; tunnel hosts store only the SHA-256 hash of the enrollment token; SSH and bucket hosts store no secret at all (auth is delegated to the system ssh and the cloud credential chain).

Local — host[0]

The machine running the central CP is itself a host: the built-in local entry, always present and always first. It needs no registration and can't be removed. Single-host setups never touch any of the transports below — everything just runs on local.

HTTP federation

Federation has two sides. On the remote machine, run the control plane so it's reachable from the center and guarded by a bearer token. Bind to an address other than loopback, and set --token so /api/* requires Authorization: Bearer <token>:

# on the REMOTE machine — serve the control plane, bound + token-guarded
$ rupu cp serve --bind 0.0.0.0:7878 --token "$RUPU_CP_TOKEN" --no-open

Put a TLS-terminating reverse proxy in front for a real https:// URL — remote base URLs are expected to be HTTPS so the token is protected in transit. Then, on the central machine, register that host with a display name, the remote's base --url, and the --token:

# on the CENTRAL machine — register the remote host
$ rupu host add prod --url https://host.example.com --token "$REMOTE_TOKEN"
host_01J9Z4W7Q0X8Y6V5K3M2N1P0R8

To keep the token out of your shell history and process list, read it from stdin instead with --token-stdin:

# pipe the token in on stdin (mutually exclusive with --token)
$ pass show rupu/prod | rupu host add prod --url https://host.example.com --token-stdin

SSH

The SSH transport targets a host that has a full rupu install but runs neither rupu cp serve nor a dial-home agent, and that you can already ssh to. The central CP connects out over SSH. It dispatches a run with ssh host rupu workflow run … (detached on the remote so it survives the SSH session), then keeps a long-lived ssh … tail -f pump that mirrors the run's artifact files back into the central run store, so the run shows up in the same host-aware lists, detail, and live events as any other. Register it with the --ssh destination:

# register an SSH host — destination is user@host or a ~/.ssh/config alias
$ rupu host add build-box --ssh deploy@build.example.com --port 22 --identity ~/.ssh/id_ed25519
host_01J9ZB3KQ8M2T0V4W6X8Y1A3C5

No secret stored. Authentication is delegated entirely to the system ssh — ssh-agent, ~/.ssh/config, default keys. The host record keeps only the destination, optional --port, and optional --identity path. Because the destination can be a ~/.ssh/config alias, ProxyJump / ControlMaster / keys all come for free from your SSH config.
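As a rough illustration of how a host record with only a destination, an optional port, and an optional identity path can become a system ssh invocation, consider the sketch below. The function name `ssh_argv` and the remote file path are hypothetical; only the standard OpenSSH flags `-p` and `-i` are relied on, and this is not rupu's actual implementation.

```python
import os
import shlex
from typing import Optional, Sequence


def ssh_argv(
    destination: str,
    remote_command: Sequence[str],
    port: Optional[int] = None,
    identity: Optional[str] = None,
) -> list[str]:
    """Build an argv for the system `ssh`; no secret is handled here."""
    argv = ["ssh"]
    if port is not None:
        argv += ["-p", str(port)]
    if identity is not None:
        argv += ["-i", os.path.expanduser(identity)]
    argv.append(destination)
    # ssh passes the remote command to the remote shell as a single string,
    # so each argument is quoted before joining.
    argv.append(shlex.join(remote_command))
    return argv


# Hypothetical artifact path, used only to show the quoting.
pump = ssh_argv(
    "deploy@build.example.com",
    ["tail", "-f", "/tmp/run events.jsonl"],
    port=22,
    identity="/keys/id_ed25519",
)
print(pump)
```

Because the destination is passed through untouched, a ~/.ssh/config alias works exactly as it would on the command line.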

Tunnel (dial-home)

When a host is behind a NAT or firewall and can't expose any inbound port, invert the direction: a lightweight rupu node agent on the box dials out to the CP over a persistent WebSocket, executes dispatched runs locally, and streams their artifacts back. The CP reaches the node through its node registry — no inbound ports, no reachable server. Start on the central machine by enrolling the node, which mints a one-time token and prints the exact command to run on the box:

# on the CENTRAL machine — mint a node + one-time token
$ rupu node enroll build-box-01 --cp-url wss://cp.example.com
enrolled: build-box-01 (host_01J9ZC4M0N1P2Q3R4S5T6U7V8W)

⚠  token shown ONCE — copy it to the node now:

  rupu node --cp-url wss://cp.example.com --token <token> --node-id host_01J9ZC4M0N1P2Q3R4S5T6U7V8W

Copy the token to the remote box and run the node agent there, reading the token from stdin so it never lands in shell history. The agent authenticates, stays connected, and reconnects with backoff if the link drops:

# on the REMOTE box — dial home and stay connected
$ rupu node --cp-url wss://cp.example.com --token-stdin --node-id host_01J9ZC4M0N1P2Q3R4S5T6U7V8W < token.txt
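The text says only that the agent reconnects with backoff; it does not specify the schedule. One common policy is capped exponential backoff with jitter, sketched here under that assumption. The function names and default values are hypothetical.

```python
import random
from typing import Callable, Iterable, Iterator, Optional


def backoff_delays(
    base: float = 1.0,
    cap: float = 60.0,
    attempts: int = 8,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield capped exponential delays with full jitter (an assumed policy)."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0, ceiling)


def connect_with_retry(
    connect: Callable[[], str],
    delays: Iterable[float],
    sleep: Callable[[float], None],
) -> str:
    """Call `connect` until it succeeds, sleeping between failures."""
    for delay in delays:
        try:
            return connect()
        except ConnectionError:
            sleep(delay)
    return connect()  # final attempt; any error now propagates


attempt_count = {"n": 0}


def flaky_connect() -> str:
    # Simulated link that fails twice before coming up.
    attempt_count["n"] += 1
    if attempt_count["n"] < 3:
        raise ConnectionError("link down")
    return "connected"


slept: list[float] = []
outcome = connect_with_retry(
    flaky_connect, backoff_delays(rng=random.Random(0)), slept.append
)
print(outcome, len(slept))
```

Passing `sleep` as a parameter keeps the sketch testable; a real agent would pass `time.sleep` or an async equivalent.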

The plaintext token is shown once. Only its SHA-256 hash is stored in the tunnel host record, so the CP verifies an inbound node connection against the hash without ever keeping the secret on disk. Tunnel-node runs are mirrored as first-class, host-attributed runs in the central store, so they can be observed and controlled exactly like local runs (launch, cancel, approve / reject / resume).
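The hash-only storage described above can be sketched in a few lines of Python. The token format, the hex encoding of the digest, and the constant-time comparison are assumptions of this example; the source states only that the SHA-256 hash of the enrollment token is what gets stored.

```python
import hashlib
import hmac
import secrets


def mint_token() -> tuple[str, str]:
    """Return (plaintext_token, sha256_hex); only the hash would be persisted."""
    token = secrets.token_urlsafe(32)
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return token, digest


def verify_token(presented: str, stored_hash: str) -> bool:
    """Hash the presented token and compare it to the stored hash."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(candidate, stored_hash)


plaintext, stored_hash = mint_token()
print(verify_token(plaintext, stored_hash))      # True
print(verify_token("wrong-token", stored_hash))  # False
```

The design consequence is that a leaked host record reveals no usable credential, since the hash cannot be replayed as the token.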

Bucket (pull / dead-drop)

The most decoupled transport: for a host that the CP can't reach and that can't reach the CP either. Both sides independently talk to a shared object-store bucket that acts as the gateway. The CP writes dispatched work into the bucket; the worker polls it, atomically claims a job, runs it locally, and writes results back; a CP-side poller reads those results and mirrors them in. Register the host on the central machine with a bucket URL and optional prefix:

# on the CENTRAL machine — register a bucket (dead-drop) host
$ rupu host add nightly --bucket s3://my-bucket --prefix rupu/host-1
host_01J9ZD5N1P2Q3R4S5T6U7V8W9X

On the worker box, run the pull agent against the same bucket and prefix. It claims jobs, runs them locally, and writes results back — no inbound or outbound connectivity to the CP required. By default it loops forever, polling on an interval; pass --once to drain whatever is queued a single time and exit (handy for cron / batch hosts):

# on the WORKER box — poll the bucket, claim + run jobs, write results back
$ rupu node pull --bucket s3://my-bucket --prefix rupu/host-1

# or drain once and exit (cron / batch)
$ rupu node pull --bucket s3://my-bucket --prefix rupu/host-1 --once

Object stores + no stored secrets. Bucket URLs may be s3://, gs://, or file:// (a shared filesystem on a local network). Credentials are resolved from the standard cloud credential chain / environment (AWS_*, GOOGLE_*) — rupu stores none. Control messages (cancel / approve / reject) are queued in the bucket and applied on the worker's next poll, since a dead-drop is inherently asynchronous.
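The "atomically claims a job" step is what keeps two workers from running the same job. For a file:// bucket on a single POSIX filesystem, one way to get that guarantee is an atomic rename, sketched below. The pending/ and claimed/ layout and the file names are hypothetical, and a cloud object store would need a different primitive (for example a conditional write), so treat this as an illustration of the idea rather than rupu's bucket protocol.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def claim_next(root: Path, worker: str) -> Optional[Path]:
    """Try to claim one pending job; return its claimed path, or None."""
    claimed_dir = root / "claimed"
    claimed_dir.mkdir(exist_ok=True)
    for job in sorted((root / "pending").glob("*.json")):
        target = claimed_dir / f"{worker}--{job.name}"
        try:
            os.rename(job, target)  # atomic on a single POSIX filesystem
            return target
        except FileNotFoundError:
            continue  # another worker claimed this job first
    return None


bucket_root = Path(tempfile.mkdtemp())
(bucket_root / "pending").mkdir()
(bucket_root / "pending" / "job-001.json").write_text('{"workflow": "nightly"}')

claim_a = claim_next(bucket_root, "worker-a")
claim_b = claim_next(bucket_root, "worker-b")
print(claim_a.name, claim_b)  # worker-a--job-001.json None
```

The second worker finds nothing left to claim and simply waits for its next poll, which matches the asynchronous nature of a dead-drop.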

Manage hosts

rupu host list shows every configured host. The built-in local host is always first and can't be removed — it's host[0], "this host." Each row is the host id, name, and a transport label (the base URL for HTTP, or ssh:… / tunnel:… / bucket:… for the others):

$ rupu host list
local                              local         local
host_01J9Z4W7Q0X8Y6V5K3M2N1P0R8   prod          https://host.example.com
host_01J9ZB3KQ8M2T0V4W6X8Y1A3C5   build-box     ssh:deploy@build.example.com:22
host_01J9ZC4M0N1P2Q3R4S5T6U7V8W   build-box-01  tunnel:host_01J9ZC4M0N1P2Q3R4S5T6U7V8W
host_01J9ZD5N1P2Q3R4S5T6U7V8W9X   nightly       bucket:s3://my-bucket

Remove a remote host by its id with rupu host remove. This deletes the TOML record and any keychain token; removing local is refused:

$ rupu host remove host_01J9Z4W7Q0X8Y6V5K3M2N1P0R8
removed host_01J9Z4W7Q0X8Y6V5K3M2N1P0R8

The same registry is exposed in the browser — adding and removing hosts in the Control Plane writes the exact same TOML records and keychain entries, so the CLI and the CP are always looking at one shared fleet.

In the Control Plane

With hosts registered, the central Control Plane becomes fleet-aware. Everything you can do on the local host you can now do on a remote one, whichever transport reaches it.

Fleet → Hosts. A new top-level surface listing every host (name, transport, status, version, active runs, last seen), with local pinned first and an Add Host form that registers HTTP, SSH, and bucket hosts right from the browser (tunnel nodes enroll via the CLI). Each host's status is probed live; a host's detail view scopes the run list to that one host plus its connection and health info.
Host attribution + filtering. Run lists gain a Host column and filter so you can see at a glance where each run is executing. Runs are addressed by (host_id, run_id), so deep-links stay stable; run detail shows the owning host. The default view stays "this host," so single-host setups are completely unchanged.
Host selector in the launcher. The launcher gains a host picker (defaulting to local). Launch a workflow run, an agent run, or a session on whichever host you choose — and send session turns to it — all proxied to that host's launch endpoints.
Observe a host's runs. List, detail, the live event stream, and the transcript of any remote run all surface in the central CP — proxied straight through for HTTP hosts, mirrored into the central run store for tunnel / SSH / bucket hosts. The same live view works whether the run is local or three hosts away.
Control remote runs. Approve, reject, and cancel a paused or in-flight run on the host that owns it — over whichever transport that host uses (queued in the bucket for dead-drop hosts).
Graceful degradation. When a host is unreachable, the run-list fan-out tolerates it per-host: that host shows an offline marker and the rest of the page stays intact. Launch or control against an offline host returns a clear error, and live streams auto-reconnect.
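The fan-out and per-host degradation described in the list above can be sketched as follows. This is a simplified model, not the CP's code: each host is represented by a hypothetical callable that returns its run ids, and a failing host is recorded as offline instead of failing the whole merge.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fan_out(hosts: dict[str, Callable[[], list[str]]]):
    """Query every host; return (merged runs, set of offline host ids)."""

    def query(item):
        host_id, list_runs = item
        try:
            return host_id, list_runs(), None
        except Exception as exc:  # one bad host must not break the page
            return host_id, [], exc

    merged: list[tuple[str, str]] = []
    offline: set[str] = set()
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        for host_id, run_ids, error in pool.map(query, hosts.items()):
            if error is not None:
                offline.add(host_id)
            # Runs are addressed by (host_id, run_id).
            merged.extend((host_id, run_id) for run_id in run_ids)
    return merged, offline


def unreachable() -> list[str]:
    raise ConnectionError("host unreachable")


fleet_runs, offline_hosts = fan_out({
    "local": lambda: ["run-1"],
    "prod": lambda: ["run-7", "run-8"],
    "build-box": unreachable,
})
print(fleet_runs)     # [('local', 'run-1'), ('prod', 'run-7'), ('prod', 'run-8')]
print(offline_hosts)  # {'build-box'}
```

Keying each run by the (host_id, run_id) pair means two hosts can reuse the same run id without colliding in the merged view.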

Driving the fleet needs the full runtime. Controlling remote hosts requires the central side to run the full rupu cp serve runtime — that's what installs the host registry with its transport connectors and the mirror / pull workers. The read-only rupu cp can show hosts but cannot control them.

Distribute work across the fleet

Beyond running a whole workflow on one host, a single workflow run can spread a for_each step's units across the fleet. Add a distribute: block naming the hosts; rupu assigns the units round-robin, dispatches each to its host, attributes the unit's run to that host, and aggregates the results back into the step. Partial failures are handled according to the step's continue_on_error setting. It works over every transport (HTTP / SSH / tunnel / bucket).

  - id: review_each
    agent: code-reviewer
    for_each: "{{ inputs.files }}"
    distribute:
      hosts: [gpu-box, build-box]
    prompt: "Review {{ item }}."

See Workflows for the step format. Host-placement of individual file-mutating steps (which needs cross-host workspace sync) is still in progress — see below.
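Round-robin assignment of for_each units to the hosts named in distribute: is simple to picture in code. The sketch below assumes assignment starts at the first listed host and follows list order; the source says only "round-robin", and the file names are made up for the example.

```python
from itertools import cycle
from typing import Sequence


def assign_round_robin(units: Sequence[str], hosts: Sequence[str]) -> list[tuple[str, str]]:
    """Pair each unit with a host, cycling through hosts in order."""
    if not hosts:
        raise ValueError("distribute needs at least one host")
    return list(zip(units, cycle(hosts)))


plan = assign_round_robin(
    ["a.rs", "b.rs", "c.rs"],
    ["gpu-box", "build-box"],
)
print(plan)  # [('a.rs', 'gpu-box'), ('b.rs', 'build-box'), ('c.rs', 'gpu-box')]
```

With three units and two hosts, the first host receives the extra unit; an uneven split of this kind is inherent to round-robin.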

Choosing a transport

All five transports satisfy the same port, so pick by how the host can be reached on the network — not by what features you want:

The CP can reach the host directly. If the host can run a reachable, TLS-fronted server, use HTTP. If it can't run a server but you can already ssh to it, use SSH — no daemon to deploy, auth comes from your SSH config.
The host is behind a NAT or firewall. It can't accept an inbound connection but can dial out — use the Tunnel: run rupu node on the box and it connects home.
Neither side can reach the other. Air-gapped, locked down, or batch / cron hosts that only share an object store — use a Bucket dead-drop and run rupu node pull on the worker.
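The three cases above amount to a small decision table. Written out as a Python function it looks like this; the parameter names are invented for the example and nothing here is part of rupu itself.

```python
def choose_transport(
    cp_can_reach_host: bool,
    host_can_run_server: bool,
    host_can_dial_out: bool,
) -> str:
    """Pick a transport purely from network reachability."""
    if cp_can_reach_host:
        # Directly reachable: HTTP if it can run a TLS-fronted server, else SSH.
        return "http" if host_can_run_server else "ssh"
    if host_can_dial_out:
        # Behind NAT or a firewall, but outbound connections work.
        return "tunnel"
    # Neither side can reach the other: exchange work through an object store.
    return "bucket"


print(choose_transport(True, True, True))     # http
print(choose_transport(True, False, True))    # ssh
print(choose_transport(False, False, True))   # tunnel
print(choose_transport(False, False, False))  # bucket
```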

Still coming

All five transports above are shipped today. The architecture was built around one transport port (HostConnector) so the remaining work slots in without disturbing the surfaces above. These are genuinely not done yet:

Coming later	What it adds
Per-step host placement & workspace sync	A `host:` on an individual (non-fan-out) step, plus cross-host workspace / file sync so file-mutating steps can run remotely. (Distributing a `for_each` step's units across hosts already ships — see Distribute work across the fleet above.)
Auto-placement	Capability-based scheduling — picking the host for a run automatically from its advertised backends and capabilities.
Remote sessions over the tunnel	Interactive, long-lived sessions dispatched to a tunnel node (a persistent session worker + reconnect-stable mapping). Today's tunnel covers workflow / agent runs and approve / reject / resume; sessions are deferred.
mTLS hardening	Mutual-TLS node enrollment (cert CN = node id) on top of today's token-hash auth — the frame envelope already carries an `auth` block so it slots in with no protocol change.
