Florete

Reasoning

Design rationale for the C0 control and management plane decisions

Captured justifications behind the decisions in the C0 control and management planes, organised by topic.

Why git for source-of-truth (but not for distribution)

For the operator's authoring workspace, git is the de facto standard (GitOps). Argo CD, Flux, Nix/NixOS, Tailscale ACLs, and WireGuard manual deployments all store declared state in git. It gives us history, blame, diff, PRs, signed commits, and offline operation for free.

An earlier draft also had nodes pulling directly from git (via an embedded gitoxide library with a read-only deploy key). Pressure-testing surfaced several problems with that approach:

  • Two access-control systems. Florete's own rete CA already authenticates every participant; nodes pulling from GitHub/Gitea/etc. introduces a second auth plane (deploy keys) with its own rotation story. That's one system too many.
  • Deploy-key revocation is awkward. A leaked bundle contains the deploy key; revoking it means rotating on the forge and re-issuing bundles to every legitimate node. Doing this at pilot pace is painful.
  • Binds us to a forge. Users with no GitHub account, self-hosted-only shops, and airgapped pilots all suffer.

Replacement: nodes fetch compiled artifacts from a Florete-protected config-server run on a management node, using the same mTLS auth path as every other service and the same SPIFFE identity model, with no new credential kind. See "Why a config-server, not direct git pull by nodes" below for the detail.

The git repo stays — as the operator's authoring workspace and shipment audit log. It holds YAML, ca.crt, enrollment.log, and the committed compiled tree. Nodes never see it.

Alternatives considered and rejected for the authoring workspace:

  • Shared filesystem (NFS, Syncthing, Dropbox) — no history, messy conflicts, no atomicity.
  • Object store (S3 + versioning) — loses diff/PR workflow.
  • Distributed KV (etcd, Consul) — that is a control plane; violates the premise of C0/C1's manual mode.
  • scp/rsync from operator — works, but zero audit trail.

Future upgrade worth considering early: require signed commits (or signed tags) for retectl publish to accept them. Trivial to add; big supply-chain-security win aligned with the Zero Trust posture.

Why YAML (not JSON5, TOML, or plain JSON)

  • Plain JSON — no comments. Killer; out.
  • TOML — clean for flat config, ugly for nested structures like services.
  • JSON5 — genuinely good (comments, trailing commas, no whitespace traps), but much smaller ecosystem: fewer schema validators, IDE plugins, Rust crates. Devops audience doesn't reach for it.
  • YAML — devops lingua franca. Anchors/aliases let us DRY repeated crypto params. Rich schema tooling (JSON Schema over YAML). Well-known footguns (Norway bug, sexagesimal, no = false) are manageable with strict schema validation, always-quoting strings, and YAML 1.2 parsers.

For machine-to-machine compiled artifacts, plain JSON is the right choice — comments aren't needed, parser simplicity matters for the agent.

Why a two-step pipeline (YAML → compiled JSON)

The full list of motivators:

  1. Per-node filtering: alpha should not see beta's ACL rows or user3's identity info. The compiler extracts only what each node needs.
  2. Agent simplicity — pre-resolved data. No schema validation at runtime. Smaller attack surface.
  3. Denormalization for performance — access matrix is pre-computed; the hot path does table lookups, not graph walks.
  4. Stable boundary — we can restructure YAML sources freely as long as the compiled schema is stable. Agent code decouples from operator ergonomics.
  5. Validation before distribution — a bad edit fails retectl validate locally; it never reaches nodes.
  6. Determinism — same-input same-output across machines; two operators independently compiling produce identical artifacts, detectable by diff.
  7. Committed artifacts = inspectable, auditable shipments. "What config did alpha actually get?" = git log .flor/compiled/alpha/. No need to reconstruct from YAML. Artifacts contain only path references (no secrets), so committing them is safe. The config-server is the live wire path; git is the audit trail.
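
The per-node filtering, denormalization, and determinism points can be sketched as follows. This is a minimal illustration, not retectl's actual compiler: the RETE dictionary layout, its field names, and compile_node are all hypothetical stand-ins for the real YAML schema and compiled-artifact schema.

```python
import hashlib
import json

# Hypothetical, heavily simplified rete state (the real YAML schema is not shown here).
RETE = {
    "services": {
        "ssh": {"node": "alpha", "roles": ["admin"]},
        "kafka": {"node": "beta", "roles": ["developer"]},
    },
    "principals": {
        "user/alice": ["admin"],
        "user/bob": ["developer"],
    },
}


def compile_node(rete: dict, node: str) -> bytes:
    """Return the compiled artifact for one node as deterministic JSON bytes."""
    # Per-node filtering: keep only the services hosted on this node.
    services = {
        name: svc for name, svc in rete["services"].items() if svc["node"] == node
    }
    # Denormalization: pre-compute the access rows so the agent only does lookups.
    acl = {
        name: sorted(
            principal
            for principal, roles in rete["principals"].items()
            if set(roles) & set(svc["roles"])
        )
        for name, svc in services.items()
    }
    artifact = {"node": node, "services": sorted(services), "acl": acl}
    # Determinism: sorted keys and fixed separators give byte-identical output.
    return json.dumps(artifact, sort_keys=True, separators=(",", ":")).encode()


alpha = compile_node(RETE, "alpha")
print(alpha.decode())
print(hashlib.sha256(alpha).hexdigest())
```

Two operators compiling the same input can compare the printed digests instead of diffing whole trees; nothing about beta or bob appears in alpha's output.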

This is the same pattern as nginx.conf → loaded state, Terraform HCL → plan.json, or kubectl apply → etcd. In C0 the per-node output is two artifacts (agent.json + vertices/flor.json); in C1 a third is added for the second flor vertex in the recursive stack. The split is between supervisor (agent) and data-plane (vertex), so each consumer's contract stays minimal.

Why CA-based identity (not pinned certs)

Two options were considered:

  • Pinned certs per peer — every node must hold every peer's cert. Adding one peer = updating every node's trust store. Tangles trust distribution with ACL distribution — two concerns, one file.
  • CA-signed certs (SPIFFE X.509-SVIDs) — nodes verify signatures, not exact certs. ACLs authorize by SPIFFE ID. Adding a peer = sign their cert, commit their name; existing nodes need no trust-store update.

Even at C0 scale, CA-based scales better — and by putting a SPIFFE URI in the SAN we produce real SPIFFE X.509-SVIDs, interoperable with SPIRE, Istio, Linkerd, Vault, and anything else in the SPIFFE ecosystem. Same amount of code as pinned-cert mode, forward-compatible with standards. Clean separation: authentication = CA signature at mTLS handshake; authorization = ACL check on SPIFFE ID.

CA private key stays on operator hardware (password-protected file or YubiKey PIV slot). Never in the repo. Revocation via ACL removal is sufficient at C0 scale; CRL/OCSP is a B1+ concern.

Why only the CA cert lives in the repo (not individual signed certs)

Signed identity certs are public material (public key + name + CA signature; nothing secret) — but they aren't semantically required in the rete repo. Each principal holds its own cert locally and presents it at mTLS handshake; verifiers only need ca.crt and the expected identity name.

Options considered:

  • Commit all certs to identities/*.crt — operational convenience (one delivery channel), but redundant with the holder's local copy, clutters diffs, and creates confusion when certs expire.
  • CA cert in repo + certs delivered via enrollment bundle — clean separation. Repo holds only what's needed for compile and audit.

Enrollment-event audit is preserved via enrollment.log — an append-only record of operator sign-actions. This is a strictly better audit than cert-in-repo, because it records the decision (operator, principal, role, timestamp, fingerprint) rather than just the resulting artifact.
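
A sketch of what one enrollment.log entry could look like, assuming a JSON-lines file. Only the five recorded fields (operator, principal, role, timestamp, fingerprint) come from the text above; the on-disk format, the SHA-256 fingerprint, and the helper names are assumptions.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone


def enrollment_record(operator, principal, role, cert_der, now=None):
    """Build one audit record describing an operator sign-action."""
    now = now or datetime.now(timezone.utc)
    return {
        "operator": operator,
        "principal": principal,
        "role": role,
        "timestamp": now.isoformat(),
        # Assumed fingerprint scheme: SHA-256 over the DER-encoded certificate.
        "fingerprint": "sha256:" + hashlib.sha256(cert_der).hexdigest(),
    }


def append_record(path, record):
    """Append one record as a single JSON line; existing lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record, sort_keys=True) + "\n")


# Demo against a throwaway directory; the cert bytes are a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "enrollment.log")
    for principal in ("user/alice", "service/alpha/ssh"):
        append_record(
            log_path,
            enrollment_record("op", principal, "developer", b"placeholder DER bytes"),
        )
    with open(log_path, encoding="utf-8") as log:
        lines = log.read().splitlines()

print(len(lines))
```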

Why principals are uniform (users and services)

Istio (AuthorizationPolicy + SPIFFE IDs), Linkerd (MeshTLSAuthentication), and Consul (intentions) all treat services as principals of the same kind as human workloads. A service calling another service is mTLS between two identities, not a special case.

For Florete, modelling both user and service as principals means a single role mechanism covers both "which users can reach which services" (ingress ACL) and "which services can call which other services" (egress from service perspective). The compiler evaluates the same matrix either way.

This also makes p2p service pairs trivial: two services that call each other just hold two overlapping role grants. No special "p2p" primitive needed. The per-node flor agent workload extends the same principle: it owns the node/<node> identity and is wired into the vertex's workloads list like any other principal, with a SOCKS5 inbound channel for its outbound calls to config-server. C1 extends the principle to mesh-vertices as a fourth kind for hop-by-hop mesh-layer mTLS, so all kinds — users, services, nodes, mesh-vertices — share the same principal mechanism.

Why flor agent is a separate principal from flor vertex

C0 could plausibly be one process per node (the data-plane vertex doing its own config-server talk on the side, with the node identity baked in). That works for C0's single-vertex case but breaks down the moment there's more than one vertex on a node — which of them owns the node identity and the config-server poll loop? With C1's recursive stack right around the corner, picking that answer per release would be churn.

Splitting up front avoids the issue: flor agent is a first-class workload that owns the node identity (spiffe://<rete>/node/<node>) and is the only thing that talks to the config-server. The flor vertex below it is a pure data-plane component, driven entirely by its compiled config, with no awareness of the rete control plane. The agent, like any other principal, doesn't hold the node cert+key — it delegates the mTLS handshake to the vertex via a SOCKS5 channel, exactly as alice delegates her mTLS to flor. Same mechanism, applied to a workload that happens to live alongside its serving vertex on the same machine.

This also maps cleanly to the HLD's Execution Plane Agent concept: the agent is the per-node supervisor of workloads, and the vertices it supervises are themselves workloads. Adding more vertex kinds (C1's link+mesh, future interrete layers) is just listing more entries under vertices in agent.json.

How the agent actually launches a vertex — direct fork, systemd unit, future local workload-runtime interface — is intentionally not pinned in C0. The runtime contract is just "a vertex starts when given its config"; the rest is platform detail. The shape resembles kubelet's relationship to its container runtime; that analogy is intentional, but the WRI design is deferred.

Why one workloads list with typed io[] channels

An earlier draft of the per-node payload had local_users (with socks5_proxy) and local_services (with upstream_addr, sometimes also socks5_proxy). The split conflated initiator-vs-target with workload kind, and started straining as soon as a second flor binary (link-flor in C1) needed to carry an entry that's both initiator and target on a single FlorIO channel.

Unifying to workloads[].io[] resolves the asymmetry: each channel's direction is intrinsic to its kind (socks5 inbound, tcp outbound, florio bidirectional), so a workload is an initiator if it has any inbound-kind channel, a target if it has any outbound-kind channel, both if both. Users, services, the agent's node/<node> principal, and (in C1) mesh-vertices are all rows in the same table; they differ only by SPIFFE-ID kind and by which io channels they expose. There is no separate direction field (it would be redundant with kind, and would admit illegal pairings like an inbound tcp); a new direction is a new kind. The schema also stops privileging SOCKS5 and TCP-upstream as the only northbound shapes; new io.kind values can be added without restructuring.
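
The "direction is intrinsic to kind" rule can be sketched as below. The three kinds and their directions come from the text; the workload dictionaries, the listen/upstream field names, and roles_of are hypothetical and do not reflect the real compiled schema.

```python
# Direction is a property of the channel kind; there is no direction field.
INBOUND_KINDS = {"socks5", "florio"}   # the workload initiates through these
OUTBOUND_KINDS = {"tcp", "florio"}     # the workload is reached through these


def roles_of(workload: dict) -> set:
    """Derive initiator/target status purely from the io channel kinds."""
    kinds = {channel["kind"] for channel in workload["io"]}
    unknown = kinds - INBOUND_KINDS - OUTBOUND_KINDS
    if unknown:
        raise ValueError(f"unknown io kind(s): {sorted(unknown)}")
    roles = set()
    if kinds & INBOUND_KINDS:
        roles.add("initiator")
    if kinds & OUTBOUND_KINDS:
        roles.add("target")
    return roles


# Hypothetical rows of the unified workloads table.
alice = {"id": "spiffe://example/user/alice",
         "io": [{"kind": "socks5", "listen": "127.0.0.1:1080"}]}
ssh = {"id": "spiffe://example/service/alpha/ssh",
       "io": [{"kind": "tcp", "upstream": "127.0.0.1:22"}]}
api = {"id": "spiffe://example/service/alpha/api",
       "io": [{"kind": "tcp", "upstream": "127.0.0.1:8080"},
              {"kind": "socks5", "listen": "127.0.0.1:1081"}]}

for workload in (alice, ssh, api):
    print(workload["id"], sorted(roles_of(workload)))
```

Because an "inbound tcp" cannot be expressed at all, the illegal pairing is rejected by construction rather than by a validation rule.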

Why per-service SOCKS5 outbound (no container namespaces)

Big service meshes rely on sidecars per pod + network namespaces to separate services on the same host. Florete at C0 doesn't have an execution environment and can't assume containers.

Options for distinguishing per-service egress on a shared host:

Mechanism                                 | Verdict
------------------------------------------+--------------------------------------------------
Source-port heuristic                     | Unreliable
SO_PEERCRED + OS user per service         | Works, but forces OS-level per-service users
Shared secret / token per service         | Bootstrap problem
Unix socket per service (app opens it)    | Requires flor-aware apps
Per-service SOCKS5 proxy, port→identity   | Works with any SOCKS5-aware client; no OS coupling

SOCKS5 is supported by virtually all HTTP clients, gRPC clients, and database drivers via ALL_PROXY env var. Each service gets a dedicated localhost SOCKS5 port that flor binds to exactly one identity. Simple, portable, works today.
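
A minimal sketch of the port-to-identity binding, assuming a static table. The port numbers, the example SPIFFE IDs, and both helper names are hypothetical; the socks5h:// scheme (proxy-side name resolution, as curl interprets it) is also an assumption about how a client would be pointed at the proxy.

```python
# Hypothetical table: each localhost SOCKS5 port is bound to exactly one identity.
PORT_TO_IDENTITY = {
    1081: "spiffe://example/service/alpha/api",
    1082: "spiffe://example/service/alpha/worker",
}


def identity_for(local_port: int) -> str:
    """Return the identity that an outbound connection on this port acts as."""
    try:
        return PORT_TO_IDENTITY[local_port]
    except KeyError:
        raise PermissionError(f"no identity bound to port {local_port}") from None


def proxy_env(local_port: int) -> dict:
    """Environment a service would be started with to use its dedicated proxy."""
    identity_for(local_port)  # refuse to hand out an unbound port
    return {"ALL_PROXY": f"socks5h://127.0.0.1:{local_port}"}


print(identity_for(1081))
print(proxy_env(1082))
```

The service itself needs no Florete awareness: whichever port it dials decides which identity the vertex presents on its behalf.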

Transparent outbound for non-SOCKS5 apps (iptables redirect, libc shim) is a later optimization, not a C0 blocker.

In C0, QUIC connections are established directly between service/user endpoints over UDP — there's no mesh-of-vertices in between. The single flor vertex at each node hosts QUIC endpoints for its local principals and services; there are no node-to-node mTLS links.

This means C0 has no mesh-vertex identity layer, no links.yaml, no paths.yaml — the hop-by-hop mesh crypto layer simply doesn't exist. Workload access is fully derivable from users + services + roles + groups without any routing configuration. The agent at each node carries its own principal identity (for mgmt-plane communication with the config-server and metrics services), but that's orthogonal to the workload data plane and to mesh transit.

Inter-node links appear only in C1, when mesh-flor is introduced on top and pairs of mesh-vertices need authenticated channels over which multiple service-to-service connections can multiplex. That's when mesh-vertex identity, links.yaml, and paths.yaml become necessary.

Why RBAC (not per-principal ACLs)

N principals × M roles is tractable; N principals × K services blows up fast. RBAC matches how teams already think about access. If someone needs an exception, create a new role (developer-kafka-only) rather than a principal-specific permission.

Applies uniformly whether the principal is a user or a service, which is what makes the "principals are uniform" principle work without extra mechanism.
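
As an illustration of the role indirection, the sketch below expands principals × roles into the pre-computed access matrix. The role names other than developer-kafka-only, the service names, and the function names are made up for the example; the real compiler also handles groups, which are omitted here.

```python
# Hypothetical role definitions: role -> services it grants.
ROLE_GRANTS = {
    "developer": {"git", "kafka", "postgres"},
    "developer-kafka-only": {"kafka"},   # the "exception" is just another role
}

# Users and services are the same kind of row: principal -> roles held.
PRINCIPAL_ROLES = {
    "user/alice": {"developer"},
    "user/carol": {"developer-kafka-only"},
    "service/alpha/api": {"developer-kafka-only"},
}


def access_matrix(principal_roles: dict, role_grants: dict) -> dict:
    """Expand principal -> roles -> services into principal -> services."""
    return {
        principal: set().union(*(role_grants[role] for role in roles))
        for principal, roles in principal_roles.items()
    }


def allowed(matrix: dict, principal: str, service: str) -> bool:
    """Hot-path check: a plain table lookup, default deny."""
    return service in matrix.get(principal, set())


MATRIX = access_matrix(PRINCIPAL_ROLES, ROLE_GRANTS)
print(allowed(MATRIX, "user/alice", "postgres"))
print(allowed(MATRIX, "service/alpha/api", "postgres"))
```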

Why operator-issued bundles (not user-initiated PRs)

An earlier design had users opening PRs with their CSRs. That failed real-pilot pressure-testing:

  • Private repo → users need forge accounts + collaborator access, which most pilots can't provide.
  • Non-technical users (sales, ops) shouldn't edit git.
  • Giving users write access (even scoped) to the rete repo inverts the trust model — they could tamper with state beyond their own enrollment.
  • Per-user install tokens with write access effectively become signed URLs, so the git-ness adds no value over plain HTTPS delivery.

Replacement: operator issues a bundle per node (covering the node's identity and every workload it hosts); delivery is out-of-band (Telegram, email, any channel); user runs flor enroll. Bundle contains the signed certs, CA cert, config-server URL, and expected config-server SPIFFE ID — no git credential, no deploy key.

This is how WireGuard / OpenVPN enterprise deployments work. Users never see git; operator is the only writer. For security-conscious users, Flow B (principal generates keypairs, sends CSRs out-of-band, operator returns signed certs) preserves key-hygiene.

Why a config-server, not direct git pull by nodes

Every node needs the operator's latest compiled artifact for its own machine. Options considered:

  • Nodes git-pull from the rete repo. Requires embedding a git client in flor plus a deploy-key credential in every enrollment bundle. Adds a second auth plane on top of Florete mTLS, binds us to a git forge, and makes bundle revocation painful (rotate deploy key + re-issue every legitimate bundle).
  • Operator runs scp of the per-node compiled directory to each node. No audit trail; operator can't do this from a phone; doesn't survive the management use-case of "user wants to sync on demand".
  • Nodes fetch from a Florete-protected HTTPS service run by the rete itself. Reuses the rete's existing mTLS + SPIFFE identity model. No second auth plane. Bundle revocation = remove principal from YAML and publish. Runs on any provider (or none — self-hosted). Also gives us a natural home for the metrics sink (POST /metrics/...) and future features (event streams, CRL distribution).

The third option — the config-server — wins on every axis that matters at our scale: fewer moving parts, one access-control system, no forge lock-in, no extra credential to rotate. The cost is that we now have one always-on server per rete (the management node), but pilots already need that machine for metrics anyway, and the config-server is just another Florete-published service living next to it.

Why HTTP inside Florete instead of a bespoke RPC. HTTP is already ubiquitous, trivially testable with curl, and has every client library any operator tool will want. Our mTLS happens at the Florete layer, not the HTTP layer: the HTTP endpoint binds 127.0.0.1 and is only reachable through the Florete tunnel, so HTTP is just the framing.

Why polling, not push. Pilot scale doesn't justify the complexity of server-push (long-polling, WebSocket, reconnect handling). Agents poll on a timer (default: every 10 minutes, configurable) and operators can always trigger an immediate flor agent sync for urgent rollouts. Push lands in B1 when the coordination server becomes first-class anyway.
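
A sketch of the poll loop, assuming the agent compares the envelope's monotonic version against the one it is running. The Poller class, its fetch/apply callables, and the artifact shape are hypothetical; only the 10-minute default and the "immediate sync on demand" behaviour come from the text.

```python
import threading


class Poller:
    """Timer-driven poll with an on-demand wake-up (hypothetical agent loop)."""

    def __init__(self, fetch, apply, interval_s=600.0):
        self.fetch = fetch            # callable returning {"version": int, ...}
        self.apply = apply            # callable taking the fetched artifact
        self.interval_s = interval_s  # default: every 10 minutes
        self.version = 0
        self._wake = threading.Event()
        self._stop = threading.Event()

    def poll_once(self) -> bool:
        """Fetch once; apply only if the version is strictly newer."""
        artifact = self.fetch()
        if artifact["version"] > self.version:
            self.apply(artifact)
            self.version = artifact["version"]
            return True
        return False

    def sync_now(self) -> None:
        """What an immediate, operator-triggered sync would call."""
        self._wake.set()

    def stop(self) -> None:
        self._stop.set()
        self._wake.set()

    def run(self) -> None:
        while not self._stop.is_set():
            self.poll_once()
            # Sleep until the timer fires or someone requests a sync.
            self._wake.wait(self.interval_s)
            self._wake.clear()
```

An unchanged or older version is a no-op, so polling more often than necessary costs one fetch and nothing else.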

Management-node bootstrap is the only inherent drawback. The config-server can't serve its own artifact before it's running. The one-shot manual bootstrap (scp the bundle + mgmt01.json, start the agent) is documented as a playbook step and handles this cleanly — once mgmt01 is up, everything else flows through it.

Why --commit-timeout by default (safety net)

Publishing SSH over Florete means a bad flor agent sync can remotely brick a server. This is the classic network-device failure mode, and the industry-standard mitigation is commit confirmed: activate the change, require explicit confirmation within a timeout, else auto-rollback.

Cisco IOS, Juniper Junos, and most serious network-equipment CLIs offer this pattern. For Florete's pilot use case, the same pattern:

  1. Costs ~50 lines of supervisor logic.
  2. Eliminates the highest-severity pilot failure mode.
  3. Trains operators into a safe habit from day one.

Combined with the "keep a non-Florete access path during pilots" recommendation, it bounds the blast radius of bad config.
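
The commit-confirmed logic can be sketched in a few lines. This is an illustration under assumptions, not the supervisor's real code: ConfirmedCommit, its method names, and the injected clock are hypothetical, and "config" here is just an opaque value.

```python
import time


class ConfirmedCommit:
    """Activate a config, then roll back unless confirmed before the deadline."""

    def __init__(self, current, timeout_s, clock=time.monotonic):
        self.active = current
        self._timeout_s = timeout_s
        self._clock = clock
        self._previous = None
        self._deadline = None

    def activate(self, new_config) -> None:
        """Switch to the new config and start the confirmation window."""
        self._previous = self.active
        self.active = new_config
        self._deadline = self._clock() + self._timeout_s

    def tick(self) -> None:
        """Called periodically by the supervisor; rolls back after the deadline."""
        if self._deadline is not None and self._clock() >= self._deadline:
            self.active = self._previous
            self._previous = self._deadline = None

    def confirm(self) -> bool:
        """Return True if the pending change was confirmed in time."""
        self.tick()
        if self._deadline is None:
            return False  # nothing pending, or it was already rolled back
        self._previous = self._deadline = None
        return True
```

If the new config cuts the operator off, the confirmation never arrives and the next tick restores the previous state without any remote access being needed.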

Why atomic updates with a version stamp (not deltas yet)

Deltas add significant complexity: sequence numbers, resync-on-miss, per-mutation delta encoding, exploded test matrix. At pilot scale (≤5 nodes, handful of services), the full artifact is small and re-sending it is fine.

Forward-compat measure taken now, cheaply: a monotonic version number in the compiled artifact envelope. Costs one field; unlocks two things later:

  • Delta mode: server asks "what version?" and sends deltas from there (or full re-push if too far behind).
  • Idempotency / rollback: operator can tell which version each node is running.

Why the compiled artifact is a public, versioned contract

The agent consumes a compiled artifact regardless of who produced it. This stays invariant across all evolution stages:

Stage                 | Producer                     | Distribution
----------------------+------------------------------+------------------------------
C0 Tended Tunnels     | retectl compile (local)      | bundle + config-server fetch
C1 Manual Mesh        | retectl compile (local)      | bundle + config-server fetch
B1 Cloud Control      | Central server, push         | WebSocket / gRPC stream
Distributed control   | Any agent, consensus-derived | Gossip + pull

Because the agent-facing contract stays constant, every stage is an additive capability tier. Manual mode is not a stepping stone we retire — it's a permanent power mode useful for hackers, airgapped environments, personal setups, and disaster-recovery fallback.

The practical consequence: treat the compiled-artifact JSON Schema as public API from C0 onward. Semver it. Document it. Don't let implementation details leak into it.

Why signed mgmt artifacts and policy-bounded autonomy

The rete's coordination point — config-server today, a managed cloud service in B1+ — is the obvious blast-radius target. If it's compromised and it can mint or modify access state, an attacker who hijacks it owns the rete. The defense is to keep the signing key off the coordination point entirely: the operator signs every mgmt artifact on their own hardware with a dedicated management-plane/<name> signing principal (a sign-only identity issued separately from any user, distinct from the operator's user/<op> TLS principal — see ADR-0005); agents verify locally against the corresponding public cert, delivered with ca.crt at enrollment. The coordination point only stores and serves. It cannot grant access, change topology, or replace identities — only refuse to serve.

This is the same posture as TUF / Sigstore / Tailscale's Tailnet Lock: sensitive state has a signed root of authority that lives outside the distribution path. We adopt it from C0 — even though C0 has no cloud component yet — because (a) verification logic is cheap to write once and grows with the system, (b) the operational story we sell to security-conscious buyers in B1+ ("you don't need to trust our cloud") is only credible if the property held from day one, and (c) retrofitting signature-checking into a deployed agent base is far harder than shipping it now.

In B1+ the system grows a control plane that makes dynamic decisions (path failover, NAT signaling, link selection, eventually placement). Two choices we explicitly don't make:

  • Sign every CP decision. Round-tripping every dynamic decision through an offline operator key is operationally impossible.
  • Put a signing key in the cloud. That defeats the entire posture.

Instead: the operator signs bounds — a grammar of allowed states (reachability matrix, topology subset, path constraints, resource caps) — and the unsigned CP picks any state inside the grammar. Each agent verifies CP decisions against the signed bounds locally; the CP holds no key and can be fully untrusted. This is policy-bounded autonomy / capability attenuation. Precedent runs deep: Macaroons (caveats narrow capabilities held by untrusted parties), RPKI/ROAs (signed origin authorizations bound dynamic BGP), OPA-style admission policies, BGP/FIB+FRR (control plane sets paths; data plane reroutes locally on failure). The principle nests one more level for sub-millisecond agent decisions (FRR-class fast reroute within reserved paths) — same delegation rule, finer timescale.
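
A toy version of the local bounds check, to make the shape concrete. Everything here is hypothetical, since the policy language is explicitly not designed yet: the BOUNDS structure, the decision shape, and within_bounds are stand-ins, and signature verification of the bounds themselves is assumed to have happened already.

```python
# Hypothetical operator-signed bounds (signature check assumed done elsewhere).
BOUNDS = {
    "reachability": {("user/alice", "service/alpha/ssh")},
    "links": {("alpha", "beta"), ("beta", "gamma"), ("alpha", "gamma")},
    "max_hops": 2,
}


def within_bounds(bounds: dict, decision: dict) -> bool:
    """Accept an unsigned CP decision only if it stays inside the signed grammar."""
    flow = tuple(decision["flow"])
    path = [tuple(link) for link in decision["path"]]
    return (
        flow in bounds["reachability"]
        and 0 < len(path) <= bounds["max_hops"]
        and all(link in bounds["links"] for link in path)
    )


ok = {"flow": ("user/alice", "service/alpha/ssh"),
      "path": [("alpha", "beta"), ("beta", "gamma")]}
new_link = {"flow": ("user/alice", "service/alpha/ssh"),
            "path": [("alpha", "delta")]}
new_access = {"flow": ("user/bob", "service/alpha/ssh"),
              "path": [("alpha", "gamma")]}

for decision in (ok, new_link, new_access):
    print(within_bounds(BOUNDS, decision))
```

The CP is free to pick either permitted route for alice, but it can neither invent a link nor grant bob access, whatever it sends.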

Why this beats the obvious alternatives:

  • vs. simple "everything in mgmt is signed, CP is access-irrelevant": works for B1 with minimal CP scope but degrades as features land. Service discovery, path selection, load balancing all sit in the grey zone — under simple split you re-litigate "does this leak access?" feature by feature. Under bounded-CP you write the bound once.
  • vs. signed CP with delegated intermediate key: equivalent in security iff the key is scoped to the same bounds — and you've put a hot signing key in the cloud, which is exactly what we're selling against. Intermediate keys reappear later as the implementation mechanism for delegation, not as a parallel trust chain.
  • vs. attested CP (TPM/SGX): wrong shape for cross-platform agents.

What we do now in C0/C1 (cheap, hard to retrofit):

  1. Sign every mgmt artifact. Single management-plane/<name> signing principal (issued separately from any user; primary by convention), file-backed, ed25519. Its cert is sign-only (no TLS extKeyUsage) so a leak can't be re-used to impersonate any operator's or workload's TLS identity. Agents verify on every fetch.
  2. Add a plane discriminator to the envelope ("mgmt" today; "ctrl" reserved). The two artifact streams will be parallel files with separate envelopes — not one envelope with mixed signed/unsigned regions, which would make signature scope ambiguous and partial verification fragile.
  3. Keep the monotonic version already in the envelope; in B1+ ctrl artifacts will reference the mgmt version they obey, and agents will reject ctrl referencing a stale mgmt — same rollback-attack defense applied to CP.
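
Points 2 and 3 can be sketched as two acceptance checks. The field names (plane, version, obeys_mgmt_version) and both functions are assumptions for illustration; the text fixes only that the envelope carries a plane discriminator and a monotonic version, and that ctrl will reference the mgmt version it obeys. How an agent treats a ctrl artifact that references a mgmt version it has not fetched yet is not specified; deferring is this sketch's choice.

```python
def accept_mgmt(envelope: dict, last_seen_version: int) -> bool:
    """Accept a (signature-verified) mgmt artifact only if it moves forward."""
    return envelope.get("plane") == "mgmt" and envelope["version"] > last_seen_version


def accept_ctrl(envelope: dict, current_mgmt_version: int) -> str:
    """Classify a ctrl artifact by the mgmt version it claims to obey."""
    if envelope.get("plane") != "ctrl":
        return "reject"
    obeys = envelope["obeys_mgmt_version"]
    if obeys < current_mgmt_version:
        return "reject"   # stale: produced under bounds that were since replaced
    if obeys > current_mgmt_version:
        return "defer"    # assumed behaviour: fetch the newer mgmt artifact first
    return "accept"


print(accept_mgmt({"plane": "mgmt", "version": 8}, last_seen_version=7))
print(accept_ctrl({"plane": "ctrl", "obeys_mgmt_version": 6}, current_mgmt_version=7))
```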

What we deliberately don't design now:

  • The policy language. Risk: sliding into Cedar/Rego-class complexity. Defense: introduce primitives only as B1+ features pull on them. Most policy will derive from YAML operators already write, with a small "freedoms" addition.
  • Intermediate-key delegation, multi-operator quorum, transparency log, HSM integration. All B1+ when concrete needs justify them.
  • Threat modeling, key ceremony, recovery procedures, emergency revocation. Operational/security-design phase, not the architectural-shape phase. Captured here as evolution direction, not commitment.
  • Agent fast-path autonomy primitives (FRR-class). C0/C1 have nothing dynamic to react to.

C0/C1 are the degenerate case of the model: empty CP, empty agent autonomy, full mgmt determines runtime state. The shape we lock in now is what lets B1+ add the lower decision layers additively without re-cutting envelopes or reissuing identities.

Why cert rotation stays manual in C0/C1, and the B1+ shape

Identity issuance in C0/C1 is operator-driven and offline: retectl issue-bundle runs on the operator's machine, mints certs from the offline CA, delivers them out-of-band. Validity is 90 days; renewal = the operator runs the same command before expiry. There's no automation because automation needs an online signer, and the whole zero-trust-cloud posture rests on "no online signing key."

This is fine at pilot scale (handful of principals, weekly cadence) and fits C0/C1's manual-mode framing. It becomes untenable at B1+ scale (hundreds of workloads, ephemeral instances), so a path forward is needed.

The B1+ direction: a delegated online intermediate CA, bounded by signed operator policy. Operator's offline root signs a policy of the form "intermediate I may issue certs for principal set {alice, api, kafka, …}, each cert valid ≤24h, until policy expiry T." The intermediate runs online (managed cloud, or self-hosted), refreshes certs at any cadence, and cannot create new principals — that still requires an operator-signed policy update. If the intermediate is breached, the attacker mints short-lived certs only for already-authorized principals during the policy window; they cannot expand the set, extend validity, or persist beyond the next operator-signed policy refresh. Same delegation principle as bounded-CP, applied to identity lifecycle.
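
The issuance bound described above reduces to a small predicate. This is a sketch under assumptions: the POLICY structure and may_issue are hypothetical, the principal names are shortened as in the text, and requiring certs to expire no later than the policy is a stricter reading chosen for the example rather than a stated rule.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical operator-signed issuance policy for intermediate I.
POLICY = {
    "principals": {"alice", "api", "kafka"},
    "max_validity": timedelta(hours=24),
    "expires": datetime(2030, 1, 1, tzinfo=timezone.utc),   # policy expiry T
}


def may_issue(policy, principal, not_before, not_after, now) -> bool:
    """Would a cert with this subject and validity stay inside the signed policy?"""
    validity = not_after - not_before
    return (
        now < policy["expires"]
        and principal in policy["principals"]
        and timedelta(0) < validity <= policy["max_validity"]
        and not_after <= policy["expires"]   # assumption: no cert outlives the policy
    )


now = datetime(2029, 6, 1, tzinfo=timezone.utc)
print(may_issue(POLICY, "alice", now, now + timedelta(hours=12), now))
print(may_issue(POLICY, "mallory", now, now + timedelta(hours=12), now))
print(may_issue(POLICY, "alice", now, now + timedelta(days=30), now))
```

Verifiers that hold the signed policy can run the same check, so a breached intermediate gains nothing by ignoring it.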

The trade-off is real and deserves to be named: rotation-via-online-intermediate defends against single-device key leaks (rotate often, leak window is short) but creates a hot signing target that, if compromised, exposes the full authorized principal set during the policy window. We don't claim this is strictly better than offline-only — it's a different point on the convenience/blast-radius curve, and the right point depends on customer risk appetite.

This drives two deployment tiers for B1+:

  • Managed issuer — we host the bounded intermediate. Setup-and-go UX. Compromise blast radius = authorized principal set × policy window. Right for SMBs and pilots.
  • BYO on-prem issuer — customer runs the bounded intermediate inside their own infrastructure (DMZ, HSM, hardware-secured server). Same protocol, same operator-signed policy, just a different host. Right for security-conscious enterprises and regulated industries.

Migration between tiers is a config flip, not a re-architecture, because the protocol and the operator-signed policy stay identical across both. Same rete, different host for the issuer.

A useful refinement available in either tier: per-principal-class issuance policy. High-risk principals (operator, infra services, anything reaching production data) keep long-lived offline-issued certs; low-risk workloads rotate frequently through the online intermediate. Same rete, two trust profiles, both expressed in operator-signed policy. This lets the operator pay the rotation/blast-radius cost only where convenience actually matters.

SPIFFE compatibility note. C0/C1 produce SPIFFE-format-conformant X.509-SVIDs but don't implement SPIFFE issuance (the SPIRE Workload API). We're a SPIFFE-compatible trust domain whose certs any SPIFFE consumer (Istio, Linkerd, Vault, Envoy) can verify, but our own workloads don't yet receive SVIDs from a Workload-API endpoint — they hold static cert+key files installed at enrollment. The bounded intermediate adds the Workload API as a natural by-product: agents can dispense SVIDs to local workloads via SPIRE-compatible UDS, with rotation driven by the intermediate. We become SPIFFE-issuance conformant without abandoning the zero-trust posture. Until then, "SPIFFE compat" means format and trust shape, not issuance protocol. Don't oversell.

What we do now in C0/C1: nothing on rotation automation. The operator-signed mgmt artifact already references certs by path, so future rotation can replace cert files in place without recompiling the artifact — the schema doesn't need changing.

Why authoring UI must be local (and what cloud UI can do)

Operator-signed mgmt artifacts mean the signing step itself is part of the trust boundary. A cloud-hosted authoring UI weakens that boundary in a specific, well-known way: a compromised cloud serves JavaScript that displays payload B but constructs payload A for signing. Operator signs A thinking it's B. No amount of HTTPS, auth, or MFA fixes this — the JavaScript itself is the attack surface. The crypto industry learned this and answered with hardware wallets that re-render the to-be-signed transaction independently; the same lesson applies here.

This is What You See Is What You Sign (WYSIWYS): the signature must cover the same artifact that the signer actually viewed. When the viewing surface is code served by an untrusted party, that party can disconnect the two.

So the rule for B1+ authoring UX:

  • Read-only views can live in the cloud. Rete status, who's connected, mgmt-plane history, ctrl-plane state, metrics, audit log of past signing operations. Even if the cloud lies, it can mislead but cannot make the operator sign anything.
  • Authoring + signing stay local. retectl is the primary surface; a desktop GUI (Tauri/Electron, or retectl serve on localhost) can be added later without changing the trust model. This is how 1Password Desktop, HashiCorp Vault Enterprise authoring tools, and serious wallet apps work — local app, cloud-synced data.
  • "Draft in cloud, sign locally" handoff recovers most of the cloud-UI ergonomics without compromising trust: cloud UI proposes a change (renders the canonical to-be-signed payload, generates a signing ticket); operator pulls the ticket into local retectl, which independently re-canonicalizes the payload, displays it, and signs. The signature covers the canonical form, not whatever the cloud displayed. Cosign / Sigstore container-image signing already works this way.

What we should not build: "sign with browser extension," "sign with cloud-stored key," or "delegate signing to a cloud agent." Those failure modes have a long, embarrassing history.

What we do now in C0/C1: nothing UI-related (retectl is the only authoring surface anyway). One choice we lock in that keeps every UI option open later: canonical-form-based signing — the signature covers a deterministic serialization of the envelope, and the envelope carries a key_id so any UI surface can independently re-derive and re-display "what is being signed." That property alone keeps cloud-UI, local-GUI, mobile, hardware-token, and quorum-co-signing flows reachable without re-cutting the artifact format.
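
To show why a deterministic serialization keeps every surface honest, here is a minimal sketch. The canonicalization rule (sorted keys, compact separators, UTF-8) is an assumption chosen for the example, not the project's specified canonical form, and the envelope fields other than key_id are hypothetical. Real signing with ed25519 is left out; the digest stands in for "the bytes that get signed".

```python
import hashlib
import json


def canonical_bytes(envelope: dict) -> bytes:
    """Deterministic serialization of everything except the signature itself."""
    unsigned = {k: v for k, v in envelope.items() if k != "signature"}
    return json.dumps(
        unsigned, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def signing_digest(envelope: dict) -> str:
    return hashlib.sha256(canonical_bytes(envelope)).hexdigest()


# The same envelope as two different surfaces might construct it.
from_cloud_ui = {"plane": "mgmt", "version": 12, "key_id": "primary",
                 "payload": {"services": ["ssh"], "node": "alpha"}}
from_local_cli = {"payload": {"node": "alpha", "services": ["ssh"]},
                  "key_id": "primary", "version": 12, "plane": "mgmt",
                  "signature": "placeholder"}

print(signing_digest(from_cloud_ui) == signing_digest(from_local_cli))
```

Key order and the presence of a signature field do not change the digest, so a local tool can re-derive and display exactly what a remote surface claims is being signed.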

Why SPIFFE IDs (not a bespoke flor:// scheme)

Two use cases pull in opposite directions:

  • Tool compatibility: users want to type ssh user@ssh.alpha.rete-lovers.rete; curl, ssh, psql, browsers need hostnames they can resolve. Argues for a DNS-like form.
  • Semantic cleanliness for identities: network identities shouldn't pretend to be DNS when they aren't — SPIFFE uses spiffe:// URIs precisely to avoid conflation. Argues for a URI form.

The resolution: both, interchangeably. Canonical form is the SPIFFE URI spiffe://<rete>/<kind>/<name> — used in certs (SAN) and compiled artifacts. Convenience form is <service>.[<node>.]<rete>.rete — resolved by a local flor resolver.

An earlier draft used a bespoke flor:// scheme. Switching to spiffe:// costs nothing and buys real compatibility: what we produce is a SPIFFE X.509-SVID by definition (trust domain in the authority, workload path in the URI, URI in the SAN). Any SPIFFE consumer — SPIRE agents, Istio, Linkerd, Vault, Envoy, third-party auditing — can ingest our identities without adapters. The brand loss is tiny; the optionality gain is large.

Path-kind namespacing (user/, service/, node/, vertex/<node>/) is layered inside the SPIFFE path. Under vertex/ the structure is node-scoped, mirroring service/<node>/<service> — the second segment is the host node, the third is the vertex name. SPIFFE itself leaves path structure to the implementer; we use these prefixes to prevent cross-kind collisions (alice-the-user vs. alice-the-service). The convenience hostname elides the service/ segment because services are the only kind of entity a workload ever dials.
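
The two forms can be mapped onto each other mechanically. The sketch below handles only the fully qualified <service>.<node>.<rete>.rete hostname and assumes rete, node, and service names contain no dots; how the node-less short form resolves is not covered here, and all function names are hypothetical.

```python
from urllib.parse import urlsplit

KINDS = {"user", "service", "node", "vertex"}


def parse_spiffe_id(uri: str):
    """Split spiffe://<rete>/<kind>/<name...> into (rete, kind, name segments)."""
    parts = urlsplit(uri)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {uri!r}")
    segments = parts.path.strip("/").split("/")
    kind, rest = segments[0], segments[1:]
    if kind not in KINDS or not rest or not all(rest):
        raise ValueError(f"unexpected SPIFFE path: {parts.path!r}")
    return parts.netloc, kind, tuple(rest)


def hostname_for_service(uri: str) -> str:
    """Canonical form to convenience form; the service/ segment is elided."""
    rete, kind, rest = parse_spiffe_id(uri)
    if kind != "service" or len(rest) != 2:
        raise ValueError("only service/<node>/<service> IDs have a hostname form")
    node, service = rest
    return f"{service}.{node}.{rete}.rete"


def spiffe_id_for_hostname(host: str) -> str:
    """Convenience form to canonical form (full four-label form only)."""
    labels = host.split(".")
    if len(labels) != 4 or labels[-1] != "rete" or not all(labels):
        raise ValueError("expected <service>.<node>.<rete>.rete")
    service, node, rete, _ = labels
    return f"spiffe://{rete}/service/{node}/{service}"


print(spiffe_id_for_hostname("ssh.alpha.rete-lovers.rete"))
print(hostname_for_service("spiffe://rete-lovers/service/alpha/ssh"))
```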

.rete is unregistered. It's not a good DNS citizen in the strict sense (RFC 6761 reserves specific names, not this one), but Florete's stub resolver intercepts before the OS resolver sees it, so queries never leave the host. Alternatives like .rete.test or URI-only were considered; .rete won on ergonomics. Can be revisited if IETF registers something better.

Why two binaries (flor + retectl)

Two audiences with non-overlapping needs:

  • End users and server admins install Florete to participate in a rete. They need a small, focused command set: enroll, apply, check status, run the daemon. They should never see commands they can't or shouldn't run (CA signing, compile, bundle issuance).
  • Operators additionally author rete state. They need CA operations, validation, compilation, and bundle issuance — on their own workstation, against a checked-out repo, with no running agent involved.

This is the same split as kubectl/kubelet, terraform/agent, salt-master/salt-minion, gh/forge-server. The authoring tool and the participant runtime have different install footprints, different update cadences, and different security boundaries; conflating them into one binary makes the user CLI noisier without a real payoff.

Concrete wins:

  • Smaller node surface. The node binary ships without CA code paths, signing primitives, or rete-repo schema. Fewer attack surfaces on every pilot machine.
  • Clearer docs and UX. flor --help on a user's laptop shows only what that user can do. retectl --help is the operator's reference.
  • Independent release cadence possible later. CA and compiler changes don't force a daemon rollout.

Cost is modest: two main.rs shims sharing a workspace of library crates. Both binaries reuse identical schema, crypto, and artifact-format code.

An earlier draft deferred this split to later milestones ("single binary is faster to ship, split post-C1"). Re-evaluating during CLI design: the split is a day of work, the UX gain is permanent, and putting it off creates avoidable churn for pilot users (commands that exist today would move or disappear later). Doing it up front is cheaper than migrating.

Why C0 before C1

The architecture was designed C1-first — the recursive vertex/principal model, the agent/vertex split, and the unified workloads/io[] schema all only fully earn their keep once there's more than one vertex per node. C0 ships first anyway, because it's a strict reduction: same agent, same single vertex with the same compiled-artifact shape, just no mesh-flor on top, no FlorIO, no recursion, no mesh-vertex layer.

A single-flor-vertex prototype (service-to-service QUIC over UDP, identity-based tunnels) is:

  • Smaller scope → faster to ship.
  • Useful on its own as a "better VPN" product category.
  • Dogfoodable in our own prod/staging immediately.
  • First-pilot surface: we can sell "identity-based VPN replacement" before we sell "service mesh".
  • De-risks C1: the agent + service-endpoint machinery is exercised in production before mesh-vertex lands on top.

C1 reuses the full CA + identity + enrollment + install + bundle machinery from this spec. C1 adds mesh-vertex identities, links.yaml, paths.yaml, a second compiled vertex artifact, and the mesh-flor instance on top — additive, not replacing.
