Management Plane

Context

C0 has only a management plane — operator-authored, signed rete state. There is no control plane yet: nothing makes dynamic decisions about traffic at runtime. State is declared manually in a git repo and distributed to every node through a lightweight operator workflow: declare → validate → compile → commit (audit) → publish → nodes sync. A control plane lands in B1+ as a separate, parallel artifact stream operating within the bounds set by signed mgmt state (see Decision authority layers below). Both C0 and C1 share this manual foundation; C0 is the reference design, and C1 layers the rete-specific additions on top.

The design must balance three forces:

Simple to build — small team, MVP ASAP.
Prod-ready for small pilots — 3-5 server nodes, 10-20 users, a handful of services, running a few weeks.
Manually manageable by a single operator during pilots.

This spec is also the boundary that the rest of C0 (agent internals, identity, forwarding) must meet. It should evolve naturally into C1 Manual Mesh (mesh-flor layer added on top via FlorIO, with the same CA and enrollment reused unchanged), then into B1 Cloud Control (a coordination server replaces the file-based source of truth), and beyond that into a fully distributed control plane.

Design Highlights

Source of truth is a git repo of YAML files, hand-edited. No embedded DB. No CLI mutators that round-trip through YAML. Git diffs are the audit log; YAML comments explain intent.
The CLI does a handful of things: CA operations, identity / bundle issuance, validation, per-node compilation, safe apply. It never mutates source YAML. Smallest surface that still gives us safety.
SPIFFE identities from day one. Rete has a root CA; principals get CSRs signed into X.509 certs whose SAN holds a SPIFFE URI (spiffe://<rete>/<kind>/<name>). Peers verify by CA signature, not pinned cert. Only ca.crt lives in the repo — individual certs are delivered to their holders and stored locally. Real SPIFFE X.509-SVIDs, no extra cost.
Principals are uniform. Users, services, and nodes are principals with rete-scoped identities. Roles attach to any principal. Users and services own workload identities (end-to-end mTLS); each node has a rete-member identity used by its flor-agent workload to reach the rete's own infrastructure services (config-server, metrics). Each principal's cert+key are held by a flor vertex acting on the principal's behalf — the agent included. (C1 adds mesh-vertices as a further, distinct principal kind for hop-by-hop mesh transit.)
Access is fully derived in C0. With no multi-hop routing, who-reaches-what is determined by users.yaml + services.yaml + roles.yaml + groups.yaml + service location. No paths.yaml, no label allocation — the compiler walks the ACL matrix directly.
Restart-with-rollback, not SIGHUP (for C0). flor agent sync fetches the pre-compiled artifacts from the config-server and restarts the supervised vertex; --commit-timeout auto-rolls-back if the operator doesn't confirm within the window. Full hot-reload is hard, so C0 does not attempt it. The agent should still be built with swappable ACL tables from day one, so that a near-term post-C0 ACL-only hot-reload (see Hot reload) can swap permission tables in place. That covers roughly 80% of day-to-day changes (add/remove user, change role membership) without dropping connections. Structural changes (new service, port change, identity rotation) still need a restart.
RBAC, not per-principal ACLs. Permissions attach to roles; principals get roles. Keeps the access control matrix smaller.
Atomic full-artifact state delivery, with a monotonic version stamp. Deltas are a later optimization; the version number is a zero-cost forward-compat anchor for them.
Mgmt artifact is signed by the operator; agents verify locally. The signing key is a SPIFFE principal in its own right: management-plane/<name> (convention: primary), a sign-only identity issued explicitly via retectl ca sign --kind management-plane and unrelated to any users.yaml entry (see ADR-0005). The operator holds three keys total: CA root, their own user/<op> TLS keypair (issued like any other user), and the rete's management-plane/<name> envelope-signing keypair. None of them leave operator hardware. The rete's coordination server (today the config-server, B1+ a managed cloud service) only stores and serves; it cannot mint or modify state. C0 is the degenerate case: every artifact is signed mgmt and there are no control-plane decisions yet. The envelope carries a plane: "mgmt" discriminator and a signature field from day one so a parallel plane: "ctrl" stream (signed by a control-plane/<name> principal) can land in B1+ without schema churn. See Decision authority layers below for the evolution target.
Operator-issued bundles, not user-initiated PRs. Users never need git write access or forge accounts. The operator is the only writer of rete state.
Stable boundary = the compiled per-node artifacts. C0 produces an agent.json plus one vertices/flor.json per node from YAML; C1 adds a second vertex artifact for the mesh-flor layer with the same envelope; B1 fetches them from a coordination server; distributed control plane later emits them by consensus. The agent and vertex consume the same shapes throughout.
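
The access derivation described above (permissions attach to roles, principals get roles, and the compiler walks the ACL matrix directly) can be sketched as follows. This is a minimal illustration, not the actual compiler: the dictionaries stand in for users.yaml, groups.yaml and roles.yaml, whose real schemas are not shown on this page, and the rete name `example`, the `allows` field, and both function names are hypothetical. Only the spiffe://<rete>/<kind>/<name> form is taken from the text. Service location, which the real compiler also consults, is left out.

```python
# Sketch only: the input shapes below are assumptions, not the real YAML schemas.

def spiffe_id(rete, kind, name):
    """Build an identity URI in the spiffe://<rete>/<kind>/<name> form."""
    return f"spiffe://{rete}/{kind}/{name}"


def compile_allow_lists(rete, users, groups, roles):
    """Return {service name: sorted list of user SPIFFE IDs allowed to reach it}.

    users:  {user name: [role names held directly]}
    groups: {group name: {"members": [user names], "roles": [role names]}}
    roles:  {role name: {"allows": [service names]}}   # hypothetical field
    """
    # Effective roles of every user: direct roles plus roles inherited via groups.
    effective = {name: set(direct) for name, direct in users.items()}
    for group in groups.values():
        for member in group["members"]:
            effective.setdefault(member, set()).update(group["roles"])

    # Walk role -> service permissions to build the per-service allow list.
    allow = {}
    for user, held in effective.items():
        for role in held:
            for service in roles[role]["allows"]:
                allow.setdefault(service, set()).add(spiffe_id(rete, "user", user))
    return {service: sorted(ids) for service, ids in allow.items()}


users = {"alice": ["admin"], "bob": []}
groups = {"devs": {"members": ["alice", "bob"], "roles": ["dev"]}}
roles = {"admin": {"allows": ["grafana", "git"]}, "dev": {"allows": ["git"]}}

acl = compile_allow_lists("example", users, groups, roles)
print(acl["git"])      # alice and bob, both via the devs group
print(acl["grafana"])  # alice only, via her direct admin role
```

Because the output is a pure function of the declared inputs, two compiles of the same commit yield the same allow lists, which is what makes git diffs usable as the audit log.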

Details

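Specific design topics are covered on separate pages: Source Layout, Identity & Naming, Validate & Compile, CLI Surface, Config-server, Enrollment, Distribution & Reload, and Reasoning.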

Multi-rete on a single node

A computing host may join more than one Florete rete — for example, a developer's laptop enrolled in both a personal rete and a work rete, or a server node managed by two independent operators. This is a first-class use case and the design accommodates it without any special mode.

Rete scope

Each rete a node joins gets its own rete scope: a per-rete namespace for runtime state and installed material on that host. The scope is identified locally by the rete name (from rete.yaml). The cryptographic anchor is the bundle's CA cert and operator pubkey — two retes with the same name can never be confused because their trust roots differ. The local name can be overridden at install time (flor enroll <bundle> --as <name>) if two independent retes happen to share one.

Per rete scope on the node:

Install root: ~/.flor/retes/<scope>/ — CA cert, all principal certs and keys for this rete, compiled artifacts (agent.json, vertices/*.json), agent control socket (agent.sock).
Runtime root: /run/flor/<scope>/ — FlorIO sockets and any other runtime files (C1+).
Process tree: one flor agent process supervising its per-mesh vertex graph.

All cert and socket paths in compiled artifacts are scope-relative (e.g. ca.crt, not ~/.flor/retes/<scope>/ca.crt). The agent resolves them against its install root at startup, so artifacts stay portable across nodes regardless of the local scope name.
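
A small sketch of that resolution step, under stated assumptions: the function name `resolve_in_scope` is hypothetical, and rejecting absolute paths and `..` segments is an extra safety check assumed here rather than something this page specifies. The example roots follow the ~/.flor/retes/<scope>/ layout given above, with a made-up home directory.

```python
from pathlib import PurePosixPath


def resolve_in_scope(install_root, relative):
    """Resolve a scope-relative artifact path against the install root.

    Absolute paths and paths containing ".." are rejected (an assumption of
    this sketch) so an artifact cannot point outside its own rete scope.
    """
    rel = PurePosixPath(relative)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"not a scope-relative path: {relative!r}")
    return PurePosixPath(install_root).joinpath(*rel.parts)


root_personal = "/home/dev/.flor/retes/personal"
root_work = "/home/dev/.flor/retes/work"

# The same artifact entry resolves inside whichever scope loaded it.
print(resolve_in_scope(root_personal, "ca.crt"))
print(resolve_in_scope(root_work, "vertices/flor.json"))
```

The artifact itself never mentions `personal` or `work`, so renaming a scope with --as at enroll time does not require recompiling anything.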

Node identity in multiple retes

A computing host that joins N retes has N separate node/<name> principals, one in each rete's trust domain. The names may differ across retes (each rete's operator chooses them independently). Each principal is used by a dedicated flor-agent process, and its cert and key are held by that rete's flor vertex. There is no shared node identity across retes; the trust domains are fully independent.

Local listen addresses

Two categories of listen addresses arise in a multi-rete setup, and they have different audiences and different answers.

External UDP listeners appear only on nodes that accept inbound connections — nodes with a fixed public or private IP:port declared in nodes.yaml. User devices (laptops, phones) sit behind NAT and use ephemeral source ports; they have no fixed UDP listen address. Therefore:

External UDP conflicts can only arise on server nodes, which are by definition admin-managed (datacenter, office LAN, home server). Two rete operators wanting the same external UDP port on the same physical server must coordinate with the IT admin responsible for that host. At C0/C1 scale this involves a small number of ports and technically capable people.
B1+ direction: node-side capability advertisement — the node declares which external addresses and ports it has available. No implementation in C0/C1.

Local SOCKS5 and per-service inbound ports exist on every node, including user laptops. Users cannot be asked to "adjust a port-override config." The agent must handle this on their behalf.

Model:

Compiled artifacts declare local listens as preferences, not requirements.
The agent at startup binds the preferred port if free; otherwise picks any available port and records the binding in agent state.
Apps that need the actual port query the agent control socket — already the authoritative source for per-rete runtime state.
An ergonomic default: each rete scope is assigned a local-port offset at install time, deterministic from the rete CA fingerprint (or operator-chosen at flor enroll time), so rete A's SOCKS5 lands at :1080, rete B's at :1180, and so on. Single-rete users get their expected default; multi-rete users get predictable separation without any manual configuration.
FlorIO sockets (C1+) are filesystem paths under /run/flor/<scope>/vertices/<vertex-name>.sock — no port space, no collision possible.

The user is responsible for nothing. They install the bundle; the agent resolves any local-resource conflicts; apps discover actual port values via the control socket if they care.
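
The "preference, not requirement" binding rule from the model above can be sketched with plain TCP sockets. This is an illustration only: `bind_preferred` is a hypothetical name, the real agent is not necessarily written this way, and recording the result in agent state is reduced here to returning the actual port.

```python
import socket


def bind_preferred(preferred_port, host="127.0.0.1"):
    """Bind the preferred TCP port if it is free, else any available port.

    Returns (listening socket, actual port). The caller would record the
    actual port in agent state so apps can query it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, preferred_port))
    except OSError:
        # Preferred port is taken (or not permitted): let the OS pick one.
        sock.close()
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind((host, 0))
    sock.listen()
    return sock, sock.getsockname()[1]


# Simulate two rete scopes on one host that both prefer the same port.
first, port_a = bind_preferred(0)        # port 0: take any free port
second, port_b = bind_preferred(port_a)  # same preference, already in use
print(port_a, port_b)                    # two different ports
first.close()
second.close()
```

The second scope still starts successfully; it just ends up on a port that apps have to learn from the agent rather than assume.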

Evolution

The compiled per-node artifact is the stable public contract. What changes across milestones is who produces it and how it's distributed — never the shape the agent consumes.

| Aspect | C0 Tended Tunnels | C1 Manual Mesh | B1 Cloud Control | Distributed CP (later) |
| --- | --- | --- | --- | --- |
| Scope | Service-to-service direct | Rete mesh, manual paths | Same rete, derived paths | Same rete |
| Protocol entities | Users, services, nodes | + Rete vertices | Same as C1 | Same as C1 |
| Binaries | `flor agent` + 1 `flor vertex` | `flor agent` + 2 `flor vertex` | Same as C1 | Same as C1 |
| Source of truth | YAML in git repo | YAML in git repo | Server (ctrl); operator git+compile (mgmt) | Consensus (ctrl); operator git+compile (mgmt) |
| Compile | Local, service-level only | Local, both layers | mgmt local+signed; ctrl server-side | mgmt local+signed; ctrl consensus-derived |
| Distribution | Bundle + config-server fetch | Bundle + config-server fetch | Server push (WebSocket/gRPC) | Gossip + pull |
| Reload | Restart w/ commit timeout | Restart w/ commit timeout | Hot reload | Hot reload |
| Consensus | Operator (1 person) | Operator (1 person) | Single-writer server | Raft / CRDT |
| Paths | N/A (direct forwards) | Manual `paths.yaml` | Derived from topology + access | Derived, per-agent resolution |
| State updates | Atomic (poll config-server) | Atomic (poll config-server) | Atomic push | Atomic; deltas in B2+ |
| Decision authority | Operator only (signed mgmt) | Operator only (signed mgmt) | + bounded CP (unsigned, within signed policy) | + agent fast-path autonomy (FRR-class) within CP/policy bounds |
| Artifact streams | `plane: "mgmt"` (signed) | `plane: "mgmt"` (signed) | + `plane: "ctrl"` (unsigned) | Same |

From B1 on, the Source of truth and Compile rows split by plane: mgmt stays authored, compiled, and signed on operator hardware — the coordination server relays it but cannot author it (see Coordinator) — while only ctrl moves to the server or to consensus. This bounds a compromised cloud to disruption, not destruction.

Preserving manual mode as a power mode: manual configuration (YAML in git + operator-run config-server) is a permanent capability tier, not a stepping stone. Hackers, personal setups, airgapped environments, and disaster-recovery fallback all depend on it. Every higher tier is additive — no C0/C1 capability is removed in B1 or beyond.

Decision authority layers

A second evolution axis, orthogonal to producer/distribution, is who is allowed to decide what. The model is one delegation rule applied at three points: at every layer, decisions operate within bounds the layer above signed. C0/C1 use only the top layer; B1+ activates the lower two.

| Layer | Speed | Authority | Scope of decision | How it's verified |
| --- | --- | --- | --- | --- |
| Operator (mgmt plane) | minutes–days | signed | grammar of permitted states (identities, reachability, topology, path constraints) | flors verify operator signature on the mgmt artifact |
| Control plane (B1+) | seconds–minutes | unsigned, bounded | pick a state within grammar (path failover, link selection, NAT signaling, later: placement) | flors check each ctrl decision against the signed mgmt policy locally |
| Agent fast-path (B1+) | µs–ms | unsigned, local | sub-millisecond reactions inside what CP/operator allowed (e.g. MPLS-style FRR over reserved paths, BFD-class liveness, queue mgmt) | self-bounded by the configured policy/CP state already in the agent |

This is policy-bounded autonomy (capability attenuation): the operator signs bounds, lower layers act freely within them, and verification stays local — no online operator key, no hot signing key in the cloud. Precedent: Macaroons (caveats narrow capabilities held by untrusted parties), RPKI/ROAs (signed origin authorizations bound dynamic BGP), BGP/FIB + ECMP/FRR (control plane sets paths; data plane reroutes locally on failure), OPA-style admission policies.

The CP can always narrow what policy allows (drain a node, rate-limit, circuit-break) but never broaden it. Identity issuance stays operator-only forever; CP and agents never mint identities.
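
The narrow-but-never-broaden rule reduces to a subset check if reachability is modelled as a set of permitted (source, destination) pairs. That modelling is an assumption made purely for illustration, since the policy language is explicitly not designed yet; the node names and the function name `within_policy` are hypothetical.

```python
def within_policy(ctrl_decision, mgmt_policy):
    """True if the ctrl decision uses only edges the signed policy permits."""
    return set(ctrl_decision) <= set(mgmt_policy)


# Operator-signed bounds: which (source, destination) pairs may ever be active.
policy = {("alpha", "beta"), ("alpha", "gamma"), ("beta", "gamma")}

# Draining gamma removes edges: a narrowing, so it stays within bounds.
drain_gamma = {("alpha", "beta")}

# Routing to an undeclared node adds an edge: a broadening, so it is rejected.
add_delta = {("alpha", "beta"), ("alpha", "delta")}

print(within_policy(drain_gamma, policy))
print(within_policy(add_delta, policy))
```

The check needs only the signed policy already on the node, which is why verification can stay local.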

C0/C1 are the degenerate case: empty CP, empty agent autonomy, and the mgmt artifact alone determines runtime state. The shape we lock in now (signed mgmt envelope, plane discriminator, monotonic versioning) is what lets B1+ add the lower layers additively. The policy language itself is not designed now; it grows feature by feature as B1+ capabilities pull on it.
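
How an agent might act on those three locked-in envelope properties can be sketched as below. Only the `plane` discriminator and the existence of a signature field and a version stamp come from this page; the key names `version`, `signature` and `payload`, the function name, and the order of checks are assumptions. Signature verification is passed in as a callable because the actual algorithm is not stated here, and the stub used in the example performs no real cryptography.

```python
def accept_envelope(envelope, last_version, verify_signature):
    """Return the new version if the envelope is acceptable, else raise ValueError.

    envelope keys other than "plane" are hypothetical names for this sketch.
    """
    if envelope.get("plane") != "mgmt":
        raise ValueError("unsupported plane")
    if not verify_signature(envelope["payload"], envelope["signature"]):
        raise ValueError("bad operator signature")
    if envelope["version"] <= last_version:
        raise ValueError("version is not newer than the applied one")
    return envelope["version"]


def stub_verify(payload, signature):
    # Stand-in only: a real agent would verify the operator's signature here.
    return signature == "stub-ok"


fresh = {"plane": "mgmt", "version": 8, "signature": "stub-ok", "payload": "{}"}
stale = dict(fresh, version=7)

print(accept_envelope(fresh, last_version=7, verify_signature=stub_verify))
try:
    accept_envelope(stale, last_version=7, verify_signature=stub_verify)
except ValueError as err:
    print("rejected:", err)
```

A later ctrl stream would add a second branch on `plane` without changing the envelope shape.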

Scope Checklist

Open Follow-ups

Not blockers for C0 release:

ACL-only hot reload — near-term post-C0; design in Hot reload. Requires the agent to hold ACL tables behind an Arc/similar indirection from day one (cheap).
L7 awareness as a general capability (C1+, design needed). Several planned features need the HTTP layer to know which Florete principal is calling: per-node isolation on config-server, HTTP-level authorization policies, per-principal metrics tagging, request-level access logs, per-principal rate limiting. All of them want something like an X-Florete-Peer-SpiffeID header (or equivalent out-of-band signal) derived from the mTLS peer identity. flor itself is deliberately L4/L5 (QUIC/mTLS + TCP bytes) and should stay that way at the C0 layer — promoting it to an L7 proxy would balloon scope, entangle buffering/framing concerns with identity concerns, and make the data-plane harder to reason about. Options to explore in C1+:
- A dedicated L7 sidecar process between flor and upstream HTTP services (flor keeps forwarding TCP bytes; the sidecar handles HTTP + identity augmentation).
- A side-channel lookup: flor exposes a Unix socket where upstream services can ask "which SPIFFE ID is on local socket X?" Upstream owns its own HTTP plumbing.
- A Florete-specific header-injection shim that's opt-in per service in YAML, so most services stay pure-L4 and only those that want L7 identity metadata get the extra path.
None of these is obviously right; the tradeoffs (process count, resource cost, API stability, who-owns-which-failure-mode) need honest exploration. Picking one now would lock us in prematurely.
Per-node isolation of config-server reads is the most immediate use-case driving the L7 question: replacing the two-service split with one service + L7 identity-aware authZ so alpha can only fetch alpha.json. Gated on the L7 design above. Pilot-scale metadata disclosure is tolerable until then.
Cert fingerprint allowlist — include the cert fingerprint alongside the SPIFFE ID in each allow entry so a rogue CA-signed cert for the same SPIFFE ID is rejected even if name-removal hasn't propagated yet. Adds a field to the compiled artifact schema (semver bump); deferred because name-removal revocation is already effective at pilot scale.
Full structural hot-reload (new services, cert rotation, port changes) with connection-preserving restart semantics.
CRL / OCSP for immediate mid-cert-lifetime revocation without a new compile cycle.
Automated cert rotation via a bounded online intermediate CA (B1+). C0/C1 have only manual retectl issue-bundle rotation, which is workable at pilot scale but untenable beyond it. The B1+ direction is a delegated intermediate CA whose authority (principal set, validity window) is operator-signed and narrow: it can refresh certs freely within bounds but cannot create new principals. Two deployment tiers are expected: managed (we host) and BYO on-prem (security-conscious customers run their own). Per-principal-class policy lets high-risk principals stay on long-lived offline-issued certs while low-risk workloads rotate frequently. Adds SPIFFE Workload-API conformance as a natural by-product. See reasoning.
Authoring UI surface (B1+). Cloud-hosted UI is fine for read-only views (status, audit, metrics), but signing operations must stay local on operator hardware to avoid UI-substitution attacks. Authoring goes in retectl and (later) a local desktop GUI; cloud dashboards are read-only or use a "draft in cloud, sign locally" handoff. See reasoning.
Per-rete local-port offset scheme — deterministic port offset derived from rete CA fingerprint (or operator-chosen at flor enroll time) so multi-rete installs get predictable port separation without manual config (e.g. rete A at :1080, rete B at :1180). Ergonomic nicety; single-rete users always see the default.
Port-query on agent control socket — extend flor agent status (or a dedicated subcommand) so apps can ask "what's the current SOCKS5 port for principal X?" instead of hard-coding the preferred port. Required for apps that must work correctly on multi-rete hosts; optional for single-rete users where the preferred port always wins.
B1+ node-side capability advertisement — node declares which external IP:port pairs it has available; rete mgmt-plane consults that at compile time rather than having the rete operator guess external UDP ports. Eliminates the remaining admin-coordination requirement for server nodes in multi-rete setups.
Transparent outbound for non-SOCKS5 apps (iptables / SO_PEERCRED / libc shim).
flor doctor diagnostics (ping peers, check access matrix, confirm config-server reachability).
Backend-hosted enrollment forms, Slack/Discord approval bots.
Config-server HA (two management nodes, replicated published state).
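
As an illustration of the cert fingerprint allowlist follow-up listed above, the check could look roughly like this. Everything about the shape is an assumption: the entry keys `spiffe_id` and `fingerprint`, SHA-256 over the DER-encoded cert as the fingerprint, and the fallback to name-only matching when no fingerprint is pinned. The byte strings stand in for real certificates.

```python
import hashlib


def fingerprint(cert_der):
    """SHA-256 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(cert_der).hexdigest()


def peer_allowed(allow_entries, peer_spiffe_id, peer_cert_der):
    """Allow a peer if its SPIFFE ID matches an entry and, where the entry
    pins a fingerprint, the presented cert matches that fingerprint too."""
    presented = fingerprint(peer_cert_der)
    for entry in allow_entries:
        if entry["spiffe_id"] != peer_spiffe_id:
            continue
        pinned = entry.get("fingerprint")
        if pinned is None or pinned == presented:
            return True
    return False


good_cert = b"placeholder bytes for alice's real cert"
rogue_cert = b"placeholder bytes for a rogue CA-signed cert"
alice = "spiffe://example/user/alice"

allow = [{"spiffe_id": alice, "fingerprint": fingerprint(good_cert)}]

print(peer_allowed(allow, alice, good_cert))   # same name, pinned cert
print(peer_allowed(allow, alice, rogue_cert))  # same name, different cert
```

The second call is the case the follow-up targets: the name is still valid, but the cert is not the one the operator issued.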
