Design rationale for the C0 control and management plane decisions
Captured justifications behind the decisions in the C0 control and management planes, organised by topic.
Why git for source-of-truth (but not for distribution)
For the operator's authoring workspace, git is the de facto standard (GitOps). Argo CD, Flux, Nix/NixOS, Tailscale ACLs, and WireGuard manual deployments all store declared state in git. It gives us history, blame, diff, PRs, signed commits, offline operation for free.
An earlier draft also had nodes pulling directly from git (via an embedded gitoxide library with a read-only deploy key). Pressure-testing surfaced several problems with that approach:
- Two access-control systems. Florete's own cluster CA already authenticates every participant; nodes pulling from GitHub/Gitea/etc. introduces a second auth plane (deploy keys) with its own rotation story. That's one system too many.
- Deploy-key revocation is awkward. A leaked bundle contains the deploy key; revoking it means rotating on the forge and re-issuing bundles to every legitimate node. Doing this at pilot pace is painful.
- Binds us to a forge. Users with no GitHub account, self-hosted-only shops, and airgapped pilots all suffer.
Replacement: nodes fetch compiled artifacts from a Florete-protected config-server run on a management node — same mTLS auth path as every other service, same SPIFFE identity model, no new credential kind. See "Why a config-server, not direct git pull by nodes" below for the detail.
The git repo stays — as the operator's authoring workspace and shipment audit log. It holds YAML, ca.crt, enrollment.log, and the committed compiled tree. Nodes never see it.
Alternatives considered and rejected for the authoring workspace:
- Shared filesystem (NFS, Syncthing, Dropbox) — no history, messy conflicts, no atomicity.
- Object store (S3 + versioning) — loses diff/PR workflow.
- Distributed KV (etcd, Consul) — that is a control plane; violates the premise of C0/C1's manual mode.
- scp/rsync from operator — works, but zero audit trail.
Future upgrade worth considering early: require signed commits (or signed tags) for florctl publish to accept them. Trivial to add; big supply-chain-security win aligned with the Zero Trust posture.
Why YAML (not JSON5, TOML, or plain JSON)
- Plain JSON — no comments. Killer; out.
- TOML — clean for flat config, ugly for nested structures like `services`.
- JSON5 — genuinely good (comments, trailing commas, no whitespace traps), but a much smaller ecosystem: fewer schema validators, IDE plugins, Rust crates. The devops audience doesn't reach for it.
- YAML — the devops lingua franca. Anchors/aliases let us DRY repeated crypto params. Rich schema tooling (JSON Schema over YAML). Well-known footguns (the Norway bug — unquoted `no` parsing as false — and sexagesimal numbers) are manageable with strict schema validation, always-quoting strings, and YAML 1.2 parsers.
For machine-to-machine compiled artifacts, plain JSON is the right choice — comments aren't needed, parser simplicity matters for the agent.
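A minimal illustration of those footguns and the discipline that defuses them — the keys here are invented:

```yaml
# Under a YAML 1.1 parser, unquoted scalars spring surprises:
risky:
  country: no      # parsed as boolean false — the Norway bug
  uptime: 22:34    # sexagesimal — some 1.1 parsers read this as the integer 1354
safe:
  country: "no"    # always-quote strings; YAML 1.2 parsers also drop these rules
  uptime: "22:34"  # strict schema validation catches whatever slips through
```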
Why a two-step pipeline (YAML → compiled JSON)
The full list of motivators:
- Per-node filtering — `alpha` should not see `beta`'s ACL rows or `user3`'s identity info. The compiler extracts only what each node needs.
- Agent simplicity — pre-resolved data. No schema validation at runtime. Smaller attack surface.
- Denormalization for performance — access matrix is pre-computed; the hot path does table lookups, not graph walks.
- Stable boundary — we can restructure YAML sources freely as long as the compiled schema is stable. Agent code decouples from operator ergonomics.
- Validation before distribution — a bad edit fails `florctl validate` locally; it never reaches nodes.
- Determinism — same input, same output across machines; two operators independently compiling produce identical artifacts, detectable by diff.
- Committed artifacts = inspectable, auditable shipments. "What config did `alpha` actually get?" = `git log .flor/compiled/alpha.json`. No need to reconstruct from YAML. Artifacts contain only path references (no secrets), so committing them is safe. The config-server is the live wire path; git is the audit trail.
This is the same pattern as nginx.conf → loaded state, Terraform HCL → plan.json, or kubectl apply → etcd. In C1 the pattern carries forward with two compiled artifacts per node (one per flor instance in the recursive stack) instead of one.
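As a sketch only — every field name below is hypothetical, not the real compiled schema — a per-node artifact for `alpha` might look like:

```json
{
  "version": 17,
  "node": "alpha",
  "services": { "postgres": { "listen_port": 5432 } },
  "allow": [
    ["spiffe://rete-lovers/users/user3", "postgres"]
  ],
  "ca_cert_path": "/etc/flor/ca.crt"
}
```

The `allow` rows are the pre-computed access matrix (table lookups, not graph walks), and the artifact carries a path reference rather than key material.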
Why CA-based identity (not pinned certs)
Two options were considered:
- Pinned certs per peer — every node must hold every peer's cert. Adding one peer = updating every node's trust store. Tangles trust distribution with ACL distribution — two concerns, one file.
- CA-signed certs (SPIFFE X.509-SVIDs) — nodes verify signatures, not exact certs. ACLs authorize by SPIFFE ID. Adding a peer = sign their cert, commit their name; existing nodes need no trust-store update.
Even at C0 scale, CA-based scales better — and by putting a SPIFFE URI in the SAN we produce real SPIFFE X.509-SVIDs, interoperable with SPIRE, Istio, Linkerd, Vault, and anything else in the SPIFFE ecosystem. Same amount of code as pinned-cert mode, forward-compatible with standards. Clean separation: authentication = CA signature at mTLS handshake; authorization = ACL check on SPIFFE ID.
CA private key stays on operator hardware (password-protected file or YubiKey PIV slot). Never in the repo. Revocation via ACL removal is sufficient at C0 scale; CRL/OCSP is a B1+ concern.
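A hypothetical end-to-end illustration with stock `openssl` — file names, subjects, validity periods, and the trust domain are all invented. The leaf carries a SPIFFE URI in its SAN (the X.509-SVID shape), and the verifier needs only `ca.crt`, never the leaf certs:

```shell
set -e
cd "$(mktemp -d)"
# CA keypair + self-signed CA cert (stays on operator hardware in practice)
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=florete-ca" \
  -keyout ca.key -out ca.crt -days 30 2>/dev/null
# Node keypair + CSR (Flow B: principal generates, operator signs)
openssl req -newkey rsa:2048 -nodes -subj "/CN=alpha" \
  -keyout alpha.key -out alpha.csr 2>/dev/null
# Sign with a SPIFFE URI SAN — this is what makes it an X.509-SVID
printf 'subjectAltName=URI:spiffe://rete-lovers/services/alpha\n' > san.cnf
openssl x509 -req -in alpha.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
  -days 7 -extfile san.cnf -out alpha.crt 2>/dev/null
# Authentication = CA signature; the verifier holds only ca.crt
openssl verify -CAfile ca.crt alpha.crt   # prints: alpha.crt: OK
```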
Why only the CA cert lives in the repo (not individual signed certs)
Signed identity certs are public material (public key + name + CA signature; nothing secret) — but they aren't semantically required in the cluster repo. Each principal holds its own cert locally and presents it at mTLS handshake; verifiers only need ca.crt and the expected identity name.
Options considered:
- Commit all certs to `identities/*.crt` — operational convenience (one delivery channel), but redundant with the holder's local copy, clutters diffs, and creates confusion when certs expire.
- CA cert in repo + certs delivered via enrollment bundle — clean separation. The repo holds only what's needed for compile and audit.
Enrollment-event audit is preserved via enrollment.log — an append-only record of operator sign-actions. This is a strictly better audit than cert-in-repo, because it records the decision (operator, principal, role, timestamp, fingerprint) rather than just the resulting artifact.
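For illustration only — the field layout below is invented; the text above fixes the content (operator, principal, role, timestamp, fingerprint), not the format — entries could look like:

```
2025-06-01T09:14:02Z operator=carol action=sign principal=users/alice role=developer fingerprint=sha256:9f2c44aa
2025-06-01T09:20:41Z operator=carol action=sign principal=services/web role=backend fingerprint=sha256:51e07bd3
```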
Why principals are uniform (users and services)
Istio (AuthorizationPolicy + SPIFFE IDs), Linkerd (MeshTLSAuthentication), and Consul (intentions) all treat services as principals of the same kind as human workloads. A service calling another service is mTLS between two identities, not a special case.
For Florete, modelling both user and service as principals means a single role mechanism covers both "which users can reach which services" (ingress ACL) and "which services can call which other services" (egress from service perspective). The compiler evaluates the same matrix either way.
This also makes p2p service pairs trivial: two services that call each other just hold two overlapping role grants. No special "p2p" primitive needed. C1 extends the same principle to cluster vertices as an additional principal kind for mesh-layer mTLS; nodes are already principals in C0 (for control-plane traffic), so the uniformity holds across all four kinds.
Why per-service SOCKS5 outbound (no container namespaces)
Big service meshes rely on sidecars per pod + network namespaces to separate services on the same host. Florete at C0 doesn't have an execution environment and can't assume containers.
Options for distinguishing per-service egress on a shared host:
| Mechanism | Verdict |
|---|---|
| Source-port heuristic | Unreliable |
| `SO_PEERCRED` + OS user per service | Works, but forces OS-level per-service users |
| Shared secret / token per service | Bootstrap problem |
| Unix socket per service (app opens it) | Requires flor-aware apps |
| Per-service SOCKS5 proxy, port→identity | Works with any SOCKS5-aware client; no OS coupling |
SOCKS5 is supported by virtually all HTTP clients, gRPC clients, and database drivers via the `ALL_PROXY` environment variable. Each service gets a dedicated localhost SOCKS5 port that flor binds to exactly one identity. Simple, portable, works today.
Transparent outbound for non-SOCKS5 apps (iptables redirect, libc shim) is a later optimization, not a C0 blocker.
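A sketch of the port→identity idea in Python — the mapping table, SPIFFE IDs, and function names are invented for illustration. Because each localhost listener is bound to exactly one identity, the identity is known before any bytes arrive; the SOCKS5 CONNECT request then only names the destination:

```python
import struct

# Invented example: one localhost SOCKS5 port per hosted service.
PORT_TO_IDENTITY = {
    1081: "spiffe://rete-lovers/services/web",
    1082: "spiffe://rete-lovers/services/worker",
}

def identity_for(listen_port: int) -> str:
    """The accepting port alone determines the egress identity — no heuristics."""
    return PORT_TO_IDENTITY[listen_port]

def parse_socks5_connect(payload: bytes) -> tuple[str, int]:
    """Parse a SOCKS5 CONNECT request carrying a domain name (RFC 1928, ATYP=0x03)."""
    ver, cmd, _rsv, atyp = payload[:4]
    if ver != 0x05 or cmd != 0x01 or atyp != 0x03:
        raise ValueError("expected SOCKS5 CONNECT with a domain-name address")
    n = payload[4]
    host = payload[5:5 + n].decode("ascii")
    (port,) = struct.unpack(">H", payload[5 + n:7 + n])
    return host, port
```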
Why C0 has no inter-node links (and no cluster-vertex identity)
In C0, QUIC connections are established directly between service/user endpoints over UDP — there's no mesh-of-vertices in between. The single flor at each node hosts QUIC endpoints for its local principals and services; there are no node-to-node mTLS links.
This means C0 has no cluster-vertex identity layer, no links.yaml, no paths.yaml — the hop-by-hop mesh crypto layer simply doesn't exist. Workload access is fully derivable from users + services + roles + groups without any routing configuration. Nodes do carry their own principal identity (for control-plane communication with the config-server and metrics services), but that's orthogonal to the workload data plane and to mesh transit.
Inter-node links appear only in C1, when cluster-flor is introduced on top and pairs of cluster vertices need authenticated channels over which multiple service-to-service connections can multiplex. That's when cluster vertex identity, links.yaml, and paths.yaml become necessary.
Why RBAC (not per-principal ACLs)
N principals × M roles is tractable; N principals × K services blows up fast. RBAC matches how teams already think about access. If someone needs an exception, create a new role (developer-kafka-only) rather than a principal-specific permission.
Applies uniformly whether the principal is a user or a service, which is what makes the "principals are uniform" principle work without extra mechanism.
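A sketch of the expansion the compiler performs, in Python with invented role data: role grants denormalize into a flat allow set, the hot path is a set lookup, and users and services expand through exactly the same code path:

```python
# Invented role data — members mix users and services deliberately.
roles = {
    "developer":            {"members": ["users/alice", "services/ci"],
                             "allow": ["postgres", "kafka"]},
    "developer-kafka-only": {"members": ["users/bob"],
                             "allow": ["kafka"]},
}

def expand(roles: dict) -> set[tuple[str, str]]:
    """Denormalize N principals × M roles into flat (principal, service) pairs."""
    return {(p, s)
            for role in roles.values()
            for p in role["members"]
            for s in role["allow"]}

ALLOW = expand(roles)

def can_reach(principal: str, service: str) -> bool:
    return (principal, service) in ALLOW  # hot path: one set lookup, no graph walk
```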
Why operator-issued bundles (not user-initiated PRs)
An earlier design had users opening PRs with their CSRs. That failed real-pilot pressure-testing:
- Private repo → users need forge accounts + collaborator access, which most pilots can't provide.
- Non-technical users (sales, ops) shouldn't edit git.
- Giving users write access (even scoped) to the cluster repo inverts the trust model — they could tamper with state beyond their own enrollment.
- Per-user install tokens with write access effectively become signed URLs, so the git-ness adds no value over plain HTTPS delivery.
Replacement: operator issues a bundle per node (covering the node's identity and every workload it hosts); delivery is out-of-band (Telegram, email, any channel); user runs flor enroll. Bundle contains the signed certs, CA cert, config-server URL, and expected config-server SPIFFE ID — no git credential, no deploy key.
This is how WireGuard / OpenVPN enterprise deployments work. Users never see git; operator is the only writer. For security-conscious users, Flow B (principal generates keypairs, sends CSRs out-of-band, operator returns signed certs) preserves key-hygiene.
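A sketch of such a bundle's shape — field names are invented; the contents follow the list above (signed certs, CA cert, config-server URL, expected SPIFFE ID, and nothing git-related):

```json
{
  "ca_crt": "-----BEGIN CERTIFICATE-----\n...",
  "certs": {
    "node": "-----BEGIN CERTIFICATE-----\n...",
    "services/web": "-----BEGIN CERTIFICATE-----\n..."
  },
  "config_server": {
    "url": "https://mgmt01.rete-lovers.rete",
    "spiffe_id": "spiffe://rete-lovers/services/config-server"
  }
}
```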
Why a config-server, not direct git pull by nodes
Every node needs the operator's latest compiled artifact for its own machine. Options considered:
- Nodes git-pull from the cluster repo. Requires embedding a git client in `flor` plus a deploy-key credential in every enrollment bundle. Adds a second auth plane on top of Florete mTLS, binds us to a git forge, and makes bundle revocation painful (rotate deploy key + re-issue every legitimate bundle).
- Operator runs `scp <node>.json` to each node. No audit trail; the operator can't do this from a phone; doesn't survive the management use-case of "user wants to sync on demand".
- Nodes fetch from a Florete-protected HTTPS service run by the cluster itself. Reuses the cluster's existing mTLS + SPIFFE identity model. No second auth plane. Bundle revocation = remove the principal from YAML and publish. Runs on any provider (or none — self-hosted). Also gives us a natural home for the metrics sink (`POST /metrics/...`) and future features (event streams, CRL distribution).
The third option — the config-server — wins on every axis that matters at our scale: fewer moving parts, one access-control system, no forge lock-in, no extra credential to rotate. The cost is that we now have one always-on server per cluster (the management node), but pilots already need that machine for metrics anyway, and the config-server is just another Florete-published service living next to it.
Why HTTP inside Florete instead of a bespoke RPC. HTTPS is already ubiquitous, trivially testable with curl, and has every client library any operator tool will want. Our mTLS happens at the Florete layer, not the HTTP layer — the HTTP endpoint binds 127.0.0.1 and is only reachable through the Florete tunnel, so HTTP is just the framing.
Why polling, not push. Pilot scale doesn't justify the complexity of server-push (long-polling, WebSocket, reconnect handling). Nodes poll on a timer (default: every 10 minutes, configurable) and operators can always trigger an immediate flor sync for urgent rollouts. Push lands in B1 when the coordination server becomes first-class anyway.
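The poll tick reduces to a small decision, sketched here in Python — function names and the transport are assumptions; mTLS and the HTTP fetch happen at the Florete layer, modelled here as an injected `fetch`:

```python
from typing import Callable

POLL_INTERVAL_S = 600  # default: every 10 minutes, configurable

def poll_once(fetch: Callable[[], dict], current_version: int,
              apply: Callable[[dict], None]) -> int:
    """Fetch the compiled artifact; apply it atomically only if strictly newer."""
    artifact = fetch()
    if artifact["version"] > current_version:
        apply(artifact)
        return artifact["version"]
    return current_version  # same or older version: no-op
```

An urgent `flor sync` is just this same tick triggered off-timer.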
Management-node bootstrap is the only inherent drawback. The config-server can't serve its own artifact before it's running. The one-shot manual bootstrap (scp the bundle + mgmt01.json, start the agent) is documented as a playbook step and handles this cleanly — once mgmt01 is up, everything else flows through it.
Why --commit-timeout by default (safety net)
Publishing SSH over Florete means a bad flor sync can remotely brick a server. This is the classic network-device failure mode, and the industry-standard mitigation is commit confirmed: activate the change, require explicit confirmation within a timeout, else auto-rollback.
Cisco IOS, Juniper JunOS, and most serious network-equipment CLIs default to this pattern. For Florete's pilot use case, the same pattern:
- Costs ~50 lines of supervisor logic.
- Eliminates the highest-severity pilot failure mode.
- Trains operators into a safe habit from day one.
Combined with the "keep a non-Florete access path during pilots" recommendation, it bounds the blast radius of bad config.
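A Python sketch of that supervisor logic — class and parameter names are invented, roughly the "~50 lines" estimated above minus plumbing: activate immediately, arm a timer, roll back unless confirmed:

```python
import threading

class CommitGuard:
    """Commit-confirmed: activate a config change, roll back unless confirmed."""

    def __init__(self, activate, rollback, timeout_s: float = 120.0):
        self._timer = threading.Timer(timeout_s, rollback)
        activate()              # apply the new config immediately...
        self._timer.start()     # ...but arm the dead-man switch

    def confirm(self) -> None:
        self._timer.cancel()    # operator confirmed within the window: keep it
```

An operator-facing flow would wrap this as an apply-with-timeout plus a separate explicit confirm command.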
Why atomic updates with a version stamp (not deltas yet)
Deltas add significant complexity: sequence numbers, resync-on-miss, per-mutation delta encoding, exploded test matrix. At pilot scale (≤5 nodes, handful of services), the full artifact is small and re-sending it is fine.
Forward-compat measure taken now, cheaply: a monotonic version number in the compiled artifact envelope. Costs one field; unlocks two things later:
- Delta mode: server asks "what version?" and sends deltas from there (or full re-push if too far behind).
- Idempotency / rollback: operator can tell which version each node is running.
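A sketch of what the one field unlocks later, in Python with an invented resync threshold: the server can choose between delta and full push purely from the node's reported version:

```python
MAX_DELTA_SPAN = 10  # assumed cutoff: beyond this gap, a full re-push is simpler

def respond(node_version: int, server_version: int) -> str:
    """Hypothetical delta-mode negotiation built on the monotonic version stamp."""
    if node_version == server_version:
        return "up-to-date"
    if node_version > server_version or server_version - node_version > MAX_DELTA_SPAN:
        return "full"  # node is ahead (shouldn't happen) or too far behind
    return f"deltas:{node_version}->{server_version}"
```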
Why the compiled artifact is a public, versioned contract
The agent consumes a compiled artifact regardless of who produced it. This stays invariant across all evolution stages:
| Stage | Producer | Distribution |
|---|---|---|
| C0 Tended Tunnels | florctl compile (local) | bundle + config-server fetch |
| C1 Manual Mesh | florctl compile (local) | bundle + config-server fetch |
| B1 Cloud Control | Central server, push | WebSocket / gRPC stream |
| Distributed control | Any agent, consensus-derived | Gossip + pull |
Because the agent-facing contract stays constant, every stage is an additive capability tier. Manual mode is not a stepping stone we retire — it's a permanent power mode useful for hackers, airgapped environments, personal setups, and disaster-recovery fallback.
The practical consequence: treat the compiled-artifact JSON Schema as public API from C0 onward. Semver it. Document it. Don't let implementation details leak into it.
Why SPIFFE IDs (not a bespoke flor:// scheme)
Two use cases pull in opposite directions:
- Tool compatibility: users want to type `ssh user@ssh.alpha.rete-lovers.rete`; curl, ssh, psql, and browsers need hostnames they can resolve. Argues for a DNS-like form.
- Semantic cleanliness for identities: network identities shouldn't pretend to be DNS when they aren't — SPIFFE uses `spiffe://` URIs precisely to avoid conflation. Argues for a URI form.
The resolution: both, interchangeably. Canonical form is the SPIFFE URI `spiffe://<cluster>/<kind>/<name>` — used in certs (SAN) and compiled artifacts. Convenience form is `<service>.[<node>.]<cluster>.rete` — resolved by a local flor stub resolver that doesn't leak queries to public DNS.
An earlier draft used a bespoke flor:// scheme. Switching to spiffe:// costs nothing and buys real compatibility: what we produce is a SPIFFE X.509-SVID by definition (trust domain in the authority, workload path in the URI, URI in the SAN). Any SPIFFE consumer — SPIRE agents, Istio, Linkerd, Vault, Envoy, third-party auditing — can ingest our identities without adapters. The brand loss is tiny; the optionality gain is large.
Path-kind namespacing (`users/`, `services/`, `cluster-vertices/`) is layered inside the SPIFFE path. SPIFFE itself leaves path structure to the implementer; we use these prefixes to prevent cross-kind collisions (alice-the-user vs. alice-the-service). The convenience hostname elides the `services/` segment because services are the only kind of entity a workload ever dials.
.rete is unregistered. It's not a good DNS citizen in the strict sense (RFC 6761 reserves specific names, not this one), but Florete's stub resolver intercepts before the OS resolver sees it, so queries never leave the host. Alternatives like .rete.test or URI-only were considered; .rete won on ergonomics. Can be revisited if IETF registers something better.
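The two forms round-trip mechanically. A Python sketch of the convenience-to-canonical direction — the grammar here is inferred from the text, not normative, and assumes cluster and service labels contain no dots:

```python
def to_spiffe(hostname: str) -> str:
    """<service>.[<node>.]<cluster>.rete → spiffe://<cluster>/services/<service>"""
    labels = hostname.removesuffix(".rete").split(".")
    service, cluster = labels[0], labels[-1]  # optional <node> label is elided
    return f"spiffe://{cluster}/services/{service}"
```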
Why two binaries (flor + florctl)
Two audiences with non-overlapping needs:
- End users and server admins install Florete to participate in a cluster. They need a small, focused command set: enroll, apply, check status, run the daemon. They should never see commands they can't or shouldn't run (CA signing, compile, bundle issuance).
- Operators additionally author cluster state. They need CA operations, validation, compilation, and bundle issuance — on their own workstation, against a checked-out repo, with no running agent involved.
This is the same split as kubectl/kubelet, terraform/agent, salt-master/salt-minion, gh/forge-server. The authoring tool and the participant runtime have different install footprints, different update cadences, and different security boundaries; conflating them into one binary makes the user CLI noisier without a real payoff.
Concrete wins:
- Smaller node surface. The node binary ships without CA code paths, signing primitives, or cluster-repo schema. Fewer attack surfaces on every pilot machine.
- Clearer docs and UX. `flor --help` on a user's laptop shows only what that user can do. `florctl --help` is the operator's reference.
- Independent release cadence possible later. CA and compiler changes don't force a daemon rollout.
Cost is modest: two `main.rs` shims sharing a workspace of library crates. Both binaries reuse identical schema, crypto, and artifact-format code.
An earlier draft deferred this split to later milestones ("single binary is faster to ship, split post-C1"). Re-evaluating during CLI design: the split is a day of work, the UX gain is permanent, and putting it off creates avoidable churn for pilot users (commands that exist today would move or disappear later). Doing it up front is cheaper than migrating.
Why C0 before C1
A single-flor prototype (service-to-service QUIC over UDP, identity-based tunnels, no FlorIO, no recursion, no cluster vertex layer) is:
- Smaller scope → faster to ship.
- Useful on its own as a "better VPN" product category.
- Dogfoodable in our own prod/staging immediately.
- First-pilot surface: we can sell "identity-based VPN replacement" before we sell "service mesh".
- De-risks C1: the service-endpoint machinery is exercised in production before cluster-vertex meshing lands on top.
C1 reuses the full CA + identity + enrollment + install + bundle machinery from this spec. C1 adds cluster vertex identities, links.yaml, paths.yaml, a second compiled artifact, and the cluster-flor instance on top — additive, not replacing.