Implemented surface
This page is the reverse roadmap: a detailed inventory of behavior that is currently wired, where it is exposed, and the conditions that decide whether an idea is actually done.
Use it when a feature sounds implemented but you need to answer a sharper question: which path is wired, which clients expose it, what blocks it, and what is still missing.
Reading This Page
Status meanings:
| Status | Meaning |
|---|---|
| Implemented | The main path is wired and has tests or direct operational surface. |
| Partial | A useful path exists, but important surfaces, limits, or polish remain. |
| Planned | The docs or design mention it, but users should not depend on it yet. |
| Out of scope | The behavior is intentionally not planned. |
Queue Identity
See also: core model and partition routing.
| Item | Status | Implemented surface |
|---|---|---|
| Topic plus optional group | Implemented | Broker, protocol, Rust client, TypeScript client, Python client, admin API, CLI |
| Default group normalization | Implemented | Empty group and default normalize to ungrouped on admin and CLI paths, and in Rust, TypeScript, and Python clients |
| Partitioned queue declaration | Implemented | DeclareQueue.partition_count, Rust client declare builder, config default, coordination catalogue |
| Producer partition routing | Implemented | Rust client topology cache, round-robin keyless routing, partition_key stable routing, version-fenced publish frames |
| Subscription fan-in | Implemented | Rust client opens per-partition subscriptions from topology and merges deliveries into one logical stream |
| Cluster catalogue (client) | Implemented | Rust, TypeScript, and Python clients expose the live set of declared queues and streams (with partition counts) via a snapshot accessor and a change-subscription, derived from topology and kept live by topology pushes (no extra round-trips) |
| Pattern subscribe / discovery routing (client) | Implemented | Opt-in client.routing() returns a routing view. subscribe_pattern (queues) and subscribe_stream_pattern (streams) fan in across every channel whose topic matches a *-glob and auto-attach channels that start matching later (driven by the catalogue feed). Manual and auto-ack variants. Rust, TypeScript, and Python. Client-side only, no broker changes |
| Operator-chosen partition id | Out of scope | Normal user-facing paths should choose queue and optional key, not a partition number |
Conditions and limits:
- A queue is addressed by topic plus optional group.
- Groups are namespaces, not consumer groups. They do not coordinate competing consumer membership by themselves.
- Declared queues can have more than one partition.
- Partition selection stays inside Fibril. Producers may provide a partition key for stable routing, or omit it for round-robin spread.
- If topology is unknown or standalone, clients conservatively use partition 0.
- Partition count is fixed at creation in standalone mode. In Ganglion mode an experimental live-repartition path can grow or shrink a queue’s partition count (see the experimental cluster surface below).
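The routing rules above can be sketched client-side as follows. This is a minimal illustration, not Fibril's implementation: the PartitionRouter class, the CRC32 hash, and the use of None for unknown topology are all assumptions made for the example.

```python
import itertools
import zlib
from typing import Optional


class PartitionRouter:
    """Hypothetical client-side router used only for illustration."""

    def __init__(self, partition_count: Optional[int]) -> None:
        # None models "topology unknown or standalone".
        self.partition_count = partition_count
        self._round_robin = itertools.count()

    def route(self, partition_key: Optional[bytes] = None) -> int:
        if not self.partition_count:
            # Conservative fallback: always partition 0.
            return 0
        if partition_key is not None:
            # Stable routing: the same key always maps to the same partition.
            # CRC32 is an assumed hash, chosen because it is deterministic.
            return zlib.crc32(partition_key) % self.partition_count
        # Keyless publishes spread round-robin across partitions.
        return next(self._round_robin) % self.partition_count
```

The key only selects a partition. The caller never names a partition number directly, which matches the out-of-scope row in the table.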
Durable Queue State
See also: reliability semantics and many idle queues.
| Item | Status | Implemented surface |
|---|---|---|
| Append-only message log | Implemented | Storage layer and broker publish path |
| Append-only event log | Implemented | Storage state changes and recovery |
| Snapshots | Implemented | Storage recovery and periodic snapshot path |
| Lazy startup indexing | Implemented | Existing queues can be indexed without materializing all queue state at startup |
| Queue recovery on first use | Implemented | First active operation can materialize and recover an indexed queue |
Conditions and limits:
- Durable messages and queue state live on disk.
- A queue can exist on disk without being loaded into memory.
- Loading a cold queue has a first-use cost.
- Single-node queue deletion is exposed through the admin API/dashboard. A coordinated multi-node delete is still pending.
- Message TTL (dropping individual messages by age) is implemented. Log retention by age (truncating old durable messages on a schedule) is not yet a user-facing feature.
Recovery Quarantine
See also: recovery quarantine.
| Item | Status | Implemented surface |
|---|---|---|
| Recovery reference verification | Implemented | Recovery checks each replayed event’s referenced message offset against the message log’s durable tail, and decodes every event record |
| recovery.on_mismatch policy | Implemented | Startup config: quarantine (default), refuse, or ignore |
| Per-partition quarantine | Implemented | A bad partition is parked (its ops error) while the rest of the broker stays up |
| Operator repair | Implemented | Admin quarantine banner + /admin/api/quarantine/repair truncate-to-valid, follower re-fetches the dropped suffix on next catch-up |
| Readiness health | Implemented | /readyz reflects quarantine state and the configured policy |
| Quarantine metric | Implemented | recovery.quarantined gauge and quarantines_total counter in the recovery snapshot |
Conditions and limits:
- A dangling event reference is only possible as a lost-tail suffix (events are written after their messages), so truncate-to-valid is a complete repair.
- A corrupt event record is the genuine mid-log failure and uses the same quarantine and truncate machinery.
- refuse is lazy today (a mismatch is caught when the partition is first used); an eager whole-disk variant at boot is a tracked follow-up.
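Because a dangling reference can only appear as a lost-tail suffix, truncate-to-valid reduces to finding the longest valid prefix of the event log. A minimal sketch, assuming events are represented by the message offset they reference, None stands for a record that fails to decode, and the durable tail is an exclusive bound (all three are assumptions for illustration):

```python
from typing import List, Optional

# One entry per event record, in log order: the message offset the event
# references, or None for a record that fails to decode (hypothetical shape).
EventRefs = List[Optional[int]]


def valid_prefix_len(event_refs: EventRefs, durable_tail: int) -> int:
    """Count the leading events that decode and reference a durable message.

    durable_tail is treated as an exclusive bound (an assumption): offsets
    strictly below it are considered durable.
    """
    for position, ref in enumerate(event_refs):
        if ref is None or ref >= durable_tail:
            return position
    return len(event_refs)


def truncate_to_valid(event_refs: EventRefs, durable_tail: int) -> EventRefs:
    """Drop the first invalid event and everything after it."""
    return event_refs[: valid_prefix_len(event_refs, durable_tail)]
```

The same function covers both failure kinds listed above: a reference past the durable tail and a corrupt record each end the valid prefix.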
Publish
See also: client usage and retries and delays.
| Item | Status | Implemented surface |
|---|---|---|
| Unconfirmed publish | Implemented | TCP protocol, Rust client, TypeScript client, Python client |
| Confirmed publish | Implemented | TCP protocol, Rust client, TypeScript client, Python client |
| Pipelined confirmation handles | Implemented | Rust publish_with_confirmation, TypeScript publishWithConfirmation, Python publish_with_confirmation |
| Delayed publish | Implemented | TCP protocol, broker, Rust client, TypeScript client, Python client |
| Content type metadata | Implemented | Protocol metadata, Rust client, TypeScript client, Python client, delivery path |
| Reserved metadata headers | Implemented | Broker protocol handler rejects fibril.* and stroma.* user headers |
| Partition key routing | Implemented | Rust client NewMessage::partition_key, protocol publish metadata, server per-partition publish routing |
| Partitioning-version fence | Implemented | Client stamps routed version, and server redirects stale publishes before appending |
| Message TTL (drop by age) | Implemented | Publish.ttl_ms + per-queue default_message_ttl_ms on declare, owner resolves an absolute deadline, expiry worker drops via the DLQ/discard pipeline. Rust Publisher::expiring + QueueConfig::default_message_ttl, TypeScript Publisher.expiring + QueueConfig.defaultMessageTtl, Python Publisher.expiring + QueueConfig.default_message_ttl (seconds-native) |
Conditions and limits:
- Confirmed publish returns the broker-assigned offset.
- Unconfirmed client calls only wait for the local client engine or command path, not for a broker-assigned offset.
- Delayed publish uses a distinct delayed-publish frame and a not_before deadline.
- Content type is stored outside the user header map for common cases.
- Manual content-type headers are interpreted as content type metadata by clients.
- Broker-side validation rejects reserved system header prefixes on normal and delayed publish.
- A partition key affects only partition selection. It is not a RabbitMQ-style routing key and is not part of the durable payload.
- Stale partitioning topology is handled by redirecting the client to refresh and retry.
- Message TTL drops a message that is not consumed before its deadline. A per-message ttl_ms wins over the queue’s default_message_ttl_ms; with neither set a message never expires. The owner resolves the deadline against its own clock at publish, so it survives recovery and replication. An expired message is never dropped while it is in flight, and the drop honors the queue’s dead-letter policy (discard when no DLQ is configured, otherwise dead-lettered with reason expired). This is per-message age-drop, not queue expiration (auto-deleting an idle queue), which is a separate, not-yet-implemented idea.
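The TTL precedence and drop rules can be sketched as two small functions. This is an illustration under stated assumptions, not broker code: the function names, the millisecond integer clock, and the choice to treat a message as expired once the clock reaches the deadline are all hypothetical.

```python
from typing import Optional


def resolve_deadline_ms(
    now_ms: int,
    ttl_ms: Optional[int],
    default_message_ttl_ms: Optional[int],
) -> Optional[int]:
    """Resolve an absolute deadline once, at publish time.

    The per-message TTL wins over the queue default. None means the
    message never expires.
    """
    effective = ttl_ms if ttl_ms is not None else default_message_ttl_ms
    if effective is None:
        return None
    return now_ms + effective


def expiry_action(
    deadline_ms: Optional[int], now_ms: int, in_flight: bool, has_dlq: bool
) -> str:
    """Decide what an expiry sweep does with one message."""
    if deadline_ms is None or now_ms < deadline_ms:
        return "keep"
    if in_flight:
        # An expired message is never dropped while it is in flight.
        return "keep"
    # The drop honors the dead-letter policy (reason: expired).
    return "dead_letter" if has_dlq else "discard"
```

Storing the absolute deadline rather than the relative TTL is what lets the value survive recovery and replication unchanged.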
Subscribe and Delivery
See also: backpressure and client usage.
| Item | Status | Implemented surface |
|---|---|---|
| Manual ack subscriptions | Implemented | TCP protocol, Rust client, TypeScript client, Python client |
| Auto ack subscriptions | Implemented | TCP protocol, Rust client, TypeScript client, Python client |
| Bounded prefetch | Implemented | Broker delivery path and clients |
| Backpressure | Implemented | Pull-based delivery bounded by prefetch |
| Unsubscribe redistribution | Implemented | Broker tests cover prefetched unacked messages returning to active subscribers |
| Competing consumers (default) | Implemented | Many consumers per queue, fair dispatch, unordered |
| Exclusive consumer groups | Partial | .exclusive()/consumer_target (Rust), consumerGroup()/consumerTarget() (TypeScript), and consumer_group()/consumer_target() (Python) + TCP protocol, per-partition gate, balanced+sticky assignment, soft consumer_target, assignment push, reconnect restore, one-cohort-per-queue guard |
| Cross-broker cohort coordination | Partial | Member identity + controller (aggregate→plan→publish) + owner apply all wired in cluster bootstrap, coordination-level multi-node rebalance test exists, fuller broker/client scenarios are still growing |
| Partition fan-in | Implemented | Both clients subscribe to all known partitions, merge deliveries while keeping per-partition settlement routing, and pick up partitions added by a live grow |
See consumer groups for the user-facing model.
Conditions and limits:
- Manual ack messages must be completed, failed, retried, or retried after a delay.
- Auto ack happens only after the delivery frame is successfully sent.
- Prefetch limits how many messages a subscription can hold at once.
- If a subscription ends with prefetched but unsettled messages, those messages are returned for redelivery.
- A message can be delivered more than once under failure, retry, or lease expiry conditions.
- Exclusive consumer groups are opt-in (Rust .exclusive(), TypeScript .consumerGroup(), Python .consumer_group()). Without them, consumers compete (no ordering). A queue has a single exclusive cohort. The clients expose the assignment-events stream (Rust assignment_events(), TypeScript onAssignmentChange, Python on_assignment_change).
- Cross-broker cohort balance is advisory/eventually-consistent. The per-partition delivery gate is always the correctness backstop. Single-node is fully covered by tests.
- Plain subscriptions fan in over known partitions and pick up partitions added by a live grow. A topology warm step at connect prevents pure consumers from staying on partition 0 when topology is available.
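The prefetch and redelivery rules in this list can be modeled as a bounded window of unsettled offsets. The sketch below is a toy model, not the broker delivery path: the PrefetchWindow class is hypothetical, and returning unsettled messages to the front of the ready queue is an assumption.

```python
from collections import deque
from typing import Deque, List, Set


class PrefetchWindow:
    """Toy model of one subscription's bounded prefetch."""

    def __init__(self, prefetch: int) -> None:
        self.prefetch = prefetch
        self.unsettled: Set[int] = set()

    def deliver_from(self, ready: Deque[int]) -> List[int]:
        """Deliver ready offsets until the window is full."""
        sent: List[int] = []
        while ready and len(self.unsettled) < self.prefetch:
            offset = ready.popleft()
            self.unsettled.add(offset)
            sent.append(offset)
        return sent

    def settle(self, offset: int) -> None:
        """Ack, fail, or retry frees one slot in the window."""
        self.unsettled.discard(offset)

    def close(self, ready: Deque[int]) -> None:
        """On unsubscribe, return unsettled offsets for redelivery."""
        for offset in sorted(self.unsettled, reverse=True):
            ready.appendleft(offset)
        self.unsettled.clear()
```

Delivery stalls once the window is full and resumes only when the consumer settles, which is the backpressure behavior the table describes.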
Plexus Streams
See also: Plexus streams and client usage.
| Item | Status | Implemented surface |
|---|---|---|
| Stream channel type (fan-out) | Implemented | Stroma StreamEngine (cursors/retention), broker fan-out actor, TCP protocol (DeclarePlexus/SubscribeStream, reuses Publish/Deliver/Ack), Rust + TypeScript + Python clients |
| Declare plexus (partitions, durability, retention, replication factor) | Implemented | declare_plexus/declarePlexus + StreamConfig in all three clients, including a per-stream replication-factor override that beats the cluster default |
| Durable named cursor | Implemented | Broker-side cursor per (channel, partition, name), resuming on restart and advancing on ack |
| Ephemeral start position | Implemented | latest / earliest / offset / n-back / by-time |
| Header filter | Implemented | AND of header == pattern with * glob, stream-only |
| Client-side fan-in across partitions | Implemented | Reuses the queue fan-in supervisor (failover resubscribe + live-grow pickup). Streams stay out of the reconnect-reconcile registry and resume via the cursor |
| Durability tiers (ephemeral/speculative/durable) | Implemented | Express lane wired: durable fsyncs before deliver/confirm, speculative delivers off the staged offset and defers the confirm until durable (with a fibril.speculative header), ephemeral delivers and confirms at staging with no fsync (AfterWrite). All log-backed |
| Cross-client wire vectors | Implemented | DeclarePlexus/DeclarePlexusOk/SubscribeStream pinned in clients/wire_vectors.json, asserted by Rust/Python/TS |
Conditions and limits:
- Every consumer of a stream sees every record. Partitioning a stream is for write throughput and per-key ordering, not consumer work-sharing.
- A durable name is single-active (last commit wins). Distinct names are independent fan-out consumers, each with its own per-partition cursor.
- A fresh durable name starts at the earliest retained record so it cannot silently miss data.
- Retention drops whole sealed segments (by age/bytes/records) and clamps a cursor that lags past the retained window.
- The durability tier changes delivery and confirm timing end to end: speculative and ephemeral deliver off the staged offset (no fsync wait), with speculative deferring the confirm until durable and ephemeral confirming immediately.
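The header filter row in the table describes an AND of header == pattern checks with a * glob. A minimal sketch of that matching rule, assuming string header values, that a missing header fails the filter, and that an empty filter matches every record (the function names are hypothetical):

```python
import re
from typing import Mapping


def glob_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a pattern where only * is special into a regex."""
    parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(parts), re.DOTALL)


def matches(headers: Mapping[str, str], header_filter: Mapping[str, str]) -> bool:
    """True when every filter entry matches the corresponding header."""
    return all(
        name in headers
        and glob_to_regex(pattern).fullmatch(headers[name]) is not None
        for name, pattern in header_filter.items()
    )
```

Escaping each literal segment keeps characters such as . from acting as regex wildcards, so only * has special meaning.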
Reconnects
See also: reconnects and reconnection grace internals.
| Item | Status | Implemented surface |
|---|---|---|
| Resume identity handshake | Implemented | TCP protocol, Rust client, TypeScript client, Python client |
| Reconnect grace window | Implemented | Runtime settings and TCP handler. On by default in the server seed (connection.reconnect_grace_ms, 5s); opt out with 0. Both manual and auto-ack subscriptions participate |
| Conservative subscription reconciliation | Implemented | Broker, Rust client, TypeScript client, Python client |
| Restore-client-subscriptions policy | Implemented | Broker, Rust client, TypeScript client, Python client |
| Reconnect observability | Implemented | Admin overview, TCP metrics log, structured reconciliation logs |
| Planned restart drain notice | Partial | POST /admin/api/drain broadcasts a GoingAway push (grace deadline + message) to connected clients. All three clients surface it as an app-observable event (going_away_events / onGoingAway / on_going_away) so the app can wind down, then rely on reconnect-on-close for the redirect. Broker-side graceful ownership handoff on drain (for true zero-downtime) is pending |
| Durable broker restart reconciliation | Planned | Design notes only |
| In-flight publish replay | Out of scope | Clients do not replay old in-flight protocol requests |
Conditions and limits:
- Reconnect grace only applies while the broker process stays alive.
- Resume requires the same client resume identity before grace expires.
- The conservative policy keeps matching subscriptions, drops server-only subscriptions, and closes client-side streams that the broker cannot prove are still valid.
- Restore mode can recreate missing server-side subscriptions reported by the client after a successful resume.
- Current Rust, TypeScript, and Python subscription receive APIs surface reconciliation-closed streams as end-of-stream rather than a typed close reason.
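The two reconciliation policies can be sketched as set operations over subscription identities. This is an illustration only: the string keys, the reconcile function, and the result field names are hypothetical, and dropping server-only subscriptions in restore mode is an assumption the text above does not confirm.

```python
from typing import Dict, List, Set


def reconcile(
    client_subs: Set[str], server_subs: Set[str], restore: bool = False
) -> Dict[str, List[str]]:
    """Classify subscriptions after a successful resume."""
    client_only = client_subs - server_subs
    return {
        # Known to both sides: the existing stream continues.
        "keep": sorted(client_subs & server_subs),
        # Restore mode recreates client-owned subscriptions server-side.
        "restore": sorted(client_only) if restore else [],
        # Conservative mode closes streams the broker cannot vouch for.
        "close_client_side": [] if restore else sorted(client_only),
        # The broker drops subscriptions the client no longer reports.
        "drop_server_side": sorted(server_subs - client_subs),
    }
```

With the conservative default, a client holding orders and emails against a broker holding orders and audit keeps orders, closes emails, and the broker drops audit.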
Settlement, Retry, and Leasing
See also: core model, reliability semantics, and retries and delays.
| Item | Status | Implemented surface |
|---|---|---|
| Ack | Implemented | TCP protocol, Rust client, TypeScript client, Python client |
| Nack without requeue | Implemented | TCP protocol, Rust client, TypeScript client, Python client |
| Immediate retry | Implemented | TCP protocol, Rust client, TypeScript client, Python client |
| Delayed retry | Implemented | TCP protocol, broker, Rust client, TypeScript client, Python client |
| Lease expiry | Implemented | Runtime delivery settings and broker/storage path |
Conditions and limits:
- Ack settles a delivered message.
- Fail means nack without requeue. Depending on queue policy, the message may be discarded or dead-lettered.
- Retry means nack with requeue.
- Delayed retry requires requeue=true and a not_before deadline.
- Expired leases can return inflight messages to ready.
- Lease timing is controlled by runtime settings.
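The settlement outcomes above form a small decision table. A minimal sketch, assuming a hypothetical Settlement record and state names, and leaving out retry counting and lease expiry:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Settlement:
    kind: str  # "ack" or "nack"
    requeue: bool = False
    not_before_ms: Optional[int] = None


def next_state(settlement: Settlement, has_dlq: bool) -> str:
    """Map one settlement to the message's next state."""
    if settlement.kind == "ack":
        return "settled"
    if settlement.kind != "nack":
        raise ValueError(f"unknown settlement kind: {settlement.kind}")
    if settlement.not_before_ms is not None and not settlement.requeue:
        # Delayed retry is only valid together with requeue=true.
        raise ValueError("not_before requires requeue=true")
    if not settlement.requeue:
        # Fail: discard or dead-letter, depending on queue policy.
        return "dead_lettered" if has_dlq else "discarded"
    # Retry: immediately ready, or parked until the deadline.
    return "delayed" if settlement.not_before_ms is not None else "ready"
```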
Dead Lettering
See also: dead lettering and metadata policy.
| Item | Status | Implemented surface |
|---|---|---|
| Per-queue DLQ policy | Implemented | Rust client, TypeScript client, Python client, fibrilctl, admin API |
| Global DLQ target | Implemented | Stroma-owned runtime state, admin UI/API, fibrilctl |
| Max retry routing | Implemented | Broker/storage path and tests |
| Dead-letter reasons | Implemented | retries_exhausted, terminal_nack, pending_recovery, expired (message TTL) |
| DLQ metadata | Implemented | Reserved stroma.dlq.* headers on dead-lettered messages |
| DLQ replay by selected offsets | Implemented | Broker/storage path, admin API, admin dashboard, fibrilctl |
| Bulk replay filters | Planned | Not yet implemented |
| Delete or ack DLQ items from replay | Planned | Replay copies back to source and leaves the DLQ message in place |
Conditions and limits:
- Queue policy can discard, use the global DLQ target, or use a custom queue-specific target.
- The global target is live persisted runtime state, not startup config.
- Replay requires active DLQ messages with source metadata.
- Replay strips system metadata from the replayed copy.
- Replay skips offsets that are missing, inactive, or missing required DLQ source metadata.
- Replay currently accepts at most 100 offsets per request through the admin API.
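A caller replaying more than 100 offsets has to split the work across requests. The helper below is a sketch of that batching only; it does not model the admin API request itself, and the function name and the de-duplication step are assumptions.

```python
from typing import Iterable, Iterator, List

MAX_REPLAY_OFFSETS = 100  # per-request limit stated above


def replay_batches(
    offsets: Iterable[int], limit: int = MAX_REPLAY_OFFSETS
) -> Iterator[List[int]]:
    """Yield sorted, de-duplicated offsets in batches within the limit."""
    unique = sorted(set(offsets))
    for start in range(0, len(unique), limit):
        yield unique[start : start + limit]
```

Since replay copies messages back to the source and leaves the DLQ message in place, re-sending a batch would replay the same messages again, so a caller may want to record which batches succeeded.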
Message Inspection
See also: admin dashboard and dead lettering.
| Item | Status | Implemented surface |
|---|---|---|
| Inspect active queue messages | Implemented | Admin API, admin dashboard, fibrilctl |
| Include settled offsets | Implemented | Admin API, admin dashboard, fibrilctl |
| Payload previews | Implemented | Admin API and dashboard, base64 encoded |
| Status filtering | Implemented | Admin API and dashboard |
| Paginated offset navigation | Implemented | Admin dashboard and API offset parameters |
| High-volume live polling view | Out of scope | Use metrics and logs instead |
Conditions and limits:
- By default, inspection returns messages still active in queue state.
- Active states include ready, inflight, delayed, and pending DLQ.
- Settled records can be included explicitly.
- Offsets without matching inspectable state or log records are skipped rather than shown as misleading partial rows.
- Inspecting a queue can load that queue into memory. If idle cleanup is enabled and no active publisher or subscriber exists, cleanup can unload it again.
- Default page size is 50.
- Current API hard cap is 5000 messages per request.
- Payload previews default to 4096 bytes and cap at 1048576 bytes.
- Inspection reads persisted data and can affect realtime performance on large requests.
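The defaults and caps above can be expressed as a small normalization step. This sketch assumes over-limit values are clamped rather than rejected, which the text does not state; the function names are hypothetical, while the numbers come from the list above.

```python
import base64
from typing import Optional, Tuple

DEFAULT_PAGE_SIZE = 50
MAX_PAGE_SIZE = 5000
DEFAULT_PREVIEW_BYTES = 4096
MAX_PREVIEW_BYTES = 1_048_576


def effective_limits(
    limit: Optional[int] = None, preview_bytes: Optional[int] = None
) -> Tuple[int, int]:
    """Apply defaults, then clamp to the caps (clamping is an assumption)."""
    page = DEFAULT_PAGE_SIZE if limit is None else max(1, min(limit, MAX_PAGE_SIZE))
    preview = (
        DEFAULT_PREVIEW_BYTES
        if preview_bytes is None
        else max(0, min(preview_bytes, MAX_PREVIEW_BYTES))
    )
    return page, preview


def payload_preview(payload: bytes, preview_bytes: int) -> str:
    """Truncate a payload and base64-encode it for transport."""
    return base64.b64encode(payload[:preview_bytes]).decode("ascii")
```

A consumer of the preview decodes it with base64.b64decode and should expect a truncated payload whenever the original exceeds the preview size.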
Sparse Queues and Idle Cleanup
See also: many idle queues and idle queue internals.
| Item | Status | Implemented surface |
|---|---|---|
| Lazy loading | Implemented | Storage and broker startup behavior |
| Idle queue cleanup | Implemented | Runtime settings, broker worker, Stroma unmaterialization |
| Publisher idle expiry | Implemented | Runtime settings and broker publisher cache cleanup on incoming connection frames |
| Cleanup race guard | Implemented | Broker prevents cleanup from racing with newly created publisher/subscriber leases |
| Active lease materialization | Implemented | Publisher and subscriber creation materialize storage before returning a usable handle |
| Inspection reload cleanup | Implemented | Queues loaded by admin inspection can be unloaded again by idle cleanup |
| Cleanup observability | Partial | Admin queues page, /admin/api/queues_debug, cumulative metrics counters |
| Exact cleanup timing | Out of scope | Cleanup is approximate and sweep based |
Conditions for a queue to be unloaded from memory:
- idle queue cleanup is enabled
- the broker knows the queue as a cleanup candidate
- no active subscribers remain
- no active publishers remain
- no messages are currently leased to consumers
- no pending settlement work is still draining
- no broker delivery tags remain active
- storage reports no inflight messages
- the configured idle window has elapsed
- the cleanup sweep reaches that queue
- storage accepts the unmaterialization attempt
- no publisher or subscriber lease is being created at the same time as cleanup
Conditions that keep a queue in memory or skip cleanup:
- active publisher or subscriber
- a new publisher or subscriber lease arriving while cleanup is trying to unload the queue
- not idle long enough
- pending settlements
- broker-tracked deliveries still active
- storage-tracked inflight messages
- storage race while another operation is materializing or changing the queue
- queue not tracked by the broker for cleanup
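The two lists above describe one predicate seen from both sides: a queue is unloaded only when no skip reason applies. A minimal sketch, assuming a hypothetical QueueActivity record and reason strings; the real sweep also depends on storage accepting the unmaterialization attempt, which is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueueActivity:
    tracked: bool = True
    active_publishers: int = 0
    active_subscribers: int = 0
    leased_messages: int = 0
    pending_settlements: int = 0
    active_delivery_tags: int = 0
    storage_inflight: int = 0
    lease_being_created: bool = False
    idle_ms: int = 0


def skip_reason(
    queue: QueueActivity, cleanup_enabled: bool, idle_window_ms: int
) -> Optional[str]:
    """Return why cleanup skips this queue, or None if it may be unloaded."""
    if not cleanup_enabled:
        return "cleanup_disabled"
    if not queue.tracked:
        return "not_tracked"
    if queue.active_publishers:
        return "active_publisher"
    if queue.active_subscribers:
        return "active_subscriber"
    if queue.lease_being_created:
        return "lease_race"
    if queue.leased_messages or queue.active_delivery_tags:
        return "deliveries_active"
    if queue.pending_settlements:
        return "pending_settlements"
    if queue.storage_inflight:
        return "storage_inflight"
    if queue.idle_ms < idle_window_ms:
        return "not_idle_long_enough"
    return None
```

Returning a reason instead of a bare boolean mirrors the per-queue skip reason that the observability list mentions.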
Operator-facing behavior:
- Unloading is not deletion.
- Durable messages, events, snapshots, and queue identity stay on disk.
- A later publish, subscribe, or admin operation can reload the queue.
- Creating a publisher or subscriber loads the queue before the handle is returned.
- Admin message inspection can load a queue without creating a publisher or subscriber lease.
- The first operation on a cold queue can pay reload cost.
- Long-lived producer connections can keep queues active unless publisher idle expiry is enabled.
- Automated cleanup does not keep rechecking storage after a queue is already unloaded unless new queue activity happens.
Observability currently includes:
- loaded versus indexed-only queue state on the admin queues page
- active publisher and subscriber counts
- idle time and last-used time when known by the current process
- last cleanup result or skip reason per tracked queue
- cumulative cleanup attempts and selected outcomes in broker metrics snapshots
Developer-facing note:
- Broker PublisherHandle is intentionally not cloneable. A broker publisher handle owns one sink task and one active-publisher lease. Creating another independently tracked publisher must go through get_publisher, which creates another sink and another lease.
Runtime Settings and Startup Config
See also: configuration, configuration policy, and configuration design.
| Item | Status | Implemented surface |
|---|---|---|
| TOML startup config | Implemented | Config crate and server binary |
| Env and CLI overrides | Implemented | Config crate and server binary |
| Admin auth startup config | Implemented | TOML, env, CLI, server wiring |
| Keratin fsync and segment startup config | Implemented | Config crate and server wiring |
| Runtime delivery settings | Implemented | Runtime settings manager and admin UI/API |
| Runtime idle cleanup settings | Implemented | Runtime settings manager, admin UI/API, broker worker |
| Runtime locks | Implemented | Locked groups reject admin edits |
| Global DLQ runtime state | Implemented | Stroma-owned versioned setting |
| Corrupt runtime settings recovery UI | Partial | Load issue reporting exists, richer operator reset flow is pending |
Conditions and limits:
- Startup config is loaded before the process starts serving.
- Startup precedence is defaults, TOML, environment, then CLI.
- Runtime seeds initialize persisted runtime settings only when no runtime settings exist.
- After runtime settings exist, persisted state owns those values unless a group is locked by startup config.
- Runtime updates use expected versions to detect concurrent edits.
- Locked runtime groups reject admin edits instead of silently applying changes.
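Two of the rules above lend themselves to a short sketch: startup precedence is a layered merge where later layers win, and runtime edits use an expected version to detect concurrent changes. The names below (resolve_startup, RuntimeGroup, VersionConflict) and the flat key layout are hypothetical and do not reflect the config crate's API.

```python
from typing import Any, Dict, Mapping


def resolve_startup(*layers: Mapping[str, Any]) -> Dict[str, Any]:
    """Merge layers in order: defaults, TOML, environment, CLI. Later wins."""
    merged: Dict[str, Any] = {}
    for layer in layers:
        merged.update(layer)
    return merged


class VersionConflict(Exception):
    """Raised when an edit was based on a stale settings version."""


class RuntimeGroup:
    """One hypothetical settings group with optimistic concurrency."""

    def __init__(self, seed: Mapping[str, Any], locked: bool = False) -> None:
        self.values = dict(seed)
        self.version = 1
        self.locked = locked

    def update(self, changes: Mapping[str, Any], expected_version: int) -> int:
        if self.locked:
            # Locked groups reject edits instead of silently applying them.
            raise PermissionError("group is locked by startup config")
        if expected_version != self.version:
            raise VersionConflict(
                f"expected {expected_version}, have {self.version}"
            )
        self.values.update(changes)
        self.version += 1
        return self.version
```

An editor that loses the race receives a conflict, reloads the current version, and retries, so neither writer overwrites the other unnoticed.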
Admin Surface
See also: admin dashboard.
| Item | Status | Implemented surface |
|---|---|---|
| Overview metrics | Implemented | Dashboard and API |
| Connections and subscriptions | Implemented | Dashboard and API |
| Queues page | Implemented | Dashboard and API, per-partition expand, follower-replication view, DLQ-policy column, hide-inactive toggle + search filter |
| Create queue | Implemented | POST /admin/api/queues + dashboard form (partition count, optional DLQ policy, optional default message TTL) |
| Delete queue (single-node) | Implemented | POST /admin/api/queues/delete + per-row dashboard button; refuses while messages are inflight (409) and in cluster mode (501) pending coordinated teardown |
| Streams page | Implemented | Dashboard and API (GET /admin/api/streams), per-topic partition rows with head/tail/retained, declared durability and retention, plus per-partition role and applied offset and a follower-replication view (GET /admin/api/streams_debug) |
| Create stream | Implemented | POST /admin/api/streams + dashboard form (partition count, durability tier, optional retention by records/bytes/age) |
| Settings page | Implemented | Dashboard and API, incl. replication and streaming-replication settings |
| Message inspection page | Implemented | Dashboard and API |
| DLQ replay controls | Implemented | Dashboard and API |
| Topology page | Implemented | Coordination nodes, per-partition ownership/epochs, consensus block |
| Repartition control | Partial | /admin/api/repartition and a topology-page form (Ganglion mode) |
| Coordination membership control | Partial | Add/remove a consensus voting member from the topology page and API |
| Cohort visibility | Partial | Per-broker exclusive-cohort membership on the subscriptions page and /admin/api/cohorts |
| Quarantine banner and repair | Implemented | Global banner, /admin/api/quarantine, repair endpoint, /readyz |
| Basic admin auth | Implemented | Login, logout, session cookie, auth-disabled mode |
| Fine-grained admin roles | Planned | Current model is admin access or auth disabled |
Conditions and limits:
- When admin auth is enabled, dashboard pages require login.
- When admin auth is disabled, dashboard pages are accessible directly and should be protected by network boundaries.
- Settings updates use version checks.
- The dashboard is for operational inspection, not continuous high-frequency monitoring.
CLI Surface
See also: source deployment and dead lettering.
| Item | Status | Implemented command |
|---|---|---|
| Queue declaration | Implemented | fibrilctl queue declare |
| Global DLQ get, set, clear | Implemented | fibrilctl admin global-dlq |
| Message inspection | Implemented | fibrilctl admin messages |
| DLQ replay | Implemented | fibrilctl admin dlq replay |
| Queue observability | Implemented | fibrilctl admin queues |
| Pub or sub from CLI | Planned | Useful for manual testing, not currently implemented |
Conditions and limits:
- CLI uses broker TCP for queue declaration.
- CLI uses admin HTTP for admin commands.
- By default it reads the same startup config path handling as the server config crate.
- Container usage assumes the CLI runs where it can reach the broker or admin surface.
Client Surface
See also: client usage, quickstart, and reconnection grace.
Server-side reconnect grace for inflight settles is implemented in the TCP
handler when connection.reconnect_grace_ms is configured. Clients now attempt
one automatic reconnect before a new operation when the old engine is already
closed. Clients and broker also exchange subscription metadata after a
successful resume. Active subscription streams continue when reconciliation
confirms that the subscription should be kept. An opt-in restore policy can ask
the broker to recreate subscriptions that the client still owns but the server
does not currently have.
| Item | Rust | TypeScript |
|---|---|---|
| Connect and auth | Implemented | Implemented |
| Publish unconfirmed | Implemented | Implemented |
| Publish confirmed | Implemented | Implemented |
| Pipelined confirmation handle | Implemented | Implemented |
| Delayed publish | Implemented | Implemented |
| Message TTL (expiring publisher + queue default_message_ttl) | Implemented | Implemented |
| Manual ack subscription | Implemented | Implemented |
| Auto ack subscription | Implemented | Implemented |
| Plexus stream declare + subscribe (durable cursor, filter, fan-in) | Implemented | Implemented |
| Exclusive consumer group | Implemented | Implemented |
| Assignment-change events | Implemented | Implemented (onAssignmentChange) |
| Resume identity handshake | Implemented | Implemented |
| Explicit reconnect outcome | Implemented | Implemented |
| Existing publishers after explicit reconnect | Implemented | Implemented |
| New subscriptions after explicit reconnect | Implemented | Implemented |
| Conservative automatic reconnect before new operation | Implemented | Implemented |
| Subscription reconciliation metadata exchange | Implemented | Implemented |
| Active subscription recovery after accepted resume | Implemented | Implemented |
| Opt-in subscription restore after accepted resume | Implemented | Implemented |
| Delayed retry | Implemented | Implemented |
| Queue declaration | Implemented | Implemented |
| Content type helpers | Implemented | Implemented |
| Default group normalization | Implemented | Implemented |
| Automatic reconnect and resubscribe | Planned | Planned |
Conditions and limits:
- Both clients expose msgpack, JSON, text, raw payloads, content type metadata, and custom user headers.
- Both clients treat content type separately from the header map.
- Both clients expose delayed publish and delayed retry.
- TypeScript uses bigint for protocol u64 values such as offsets.
- Explicit reconnect returns whether the broker accepted the resume identity.
- Publisher handles use the latest client engine after explicit or automatic reconnect.
- New subscriptions created after reconnect use the latest client engine.
- Automatic reconnect is bounded by client policy and defaults to one attempt before a new operation.
- After a successful resume, both clients send known subscription metadata and read the broker reconciliation result.
- When the broker returns keep, both clients route later deliveries for that subscription into the existing stream.
- If the broker keeps a subscription under a different server sub_id, both clients remap the existing stream to that server id.
- The default reconciliation policy is conservative. Client-only or mismatched subscriptions are closed client-side, while server-only subscriptions are dropped by the broker.
- The opt-in restore-client-subscriptions policy recreates client-owned subscriptions that are missing server-side, then keeps the existing client stream using the broker’s new subscription id.
- Operations already in flight when the socket fails are not replayed.
- Active subscriptions still need application-level handling when resume is rejected or reconciliation reports a mismatch.
- Late settlements after a short disconnect are accepted only when the client explicitly resumes before grace expires.
- Broker process restart reconciliation is not implemented. Current reconnect grace depends on the broker process keeping dormant connection state in memory.
Benchmarks and Operational Scripts
See also: benchmarks.
| Item | Status | Implemented surface |
|---|---|---|
| Throughput benchmark | Implemented | Rust bench helper and shell scripts |
| Steady-state rate benchmark | Implemented | Bench binary and scripts |
| Memory sampling during bench | Implemented | Bench scripts |
| Formal CI benchmark reporting | Planned | Local and informal docs exist, reproducible CI reporting is pending |
| One command pre-commit verification helper | Planned | Individual checks exist, unified helper is still pending |
Conditions and limits:
- Current benchmark numbers are local architecture checks.
- They are not a stable published performance contract.
- Hardware, storage, durability settings, payload size, queue depth, and batching strongly affect results.
Deployment Surface
See also: source deployment.
| Item | Status | Implemented surface |
|---|---|---|
| Website Docker deployment | Implemented | Compose file and Traefik labels |
| Broker server image | Implemented | Dockerfile and publish workflow |
| `fibrilctl` in image | Implemented | Server image includes CLI |
| Source deployment docs | Implemented | Docs site |
| Full production hardening guide | Partial | Basic guidance exists, deeper ops runbook is pending |
Conditions and limits:
- Server image exposes broker TCP and admin HTTP ports.
- Persistent broker data should be mounted under the configured data directory.
- Admin auth should be enabled outside local protected environments.
Experimental Cluster and Replication Surface
See also: clustering and replication.
| Item | Status | Implemented surface |
|---|---|---|
| Ganglion coordination mode | Partial | Startup config, embedded coordinator, TCP transport, broker self-registration, topology endpoint |
| Broker advertise address | Partial | broker.listener.advertise / FIBRIL_BROKER_ADVERTISE (priority-ordered list, peer-derived default), carried to clients as owner_endpoints; clients dial the first entry (Rust high-level client connects to socket-address entries only) |
| Queue catalogue and placement controller | Partial | Declared queues register partitions, controller assigns owners and followers, placement is stable and anti-churn |
| Partition ownership gate | Partial | Broker serves only assigned owners in Ganglion mode. Standalone mode owns all queues |
| Follower pull replication | Partial | Follower workers pull owner records over protocol, apply durably, install checkpoints when needed |
| Automatic failover | Partial | Dead owner can trigger epoch-bumped reassignment, follower promotion at local tail, stale owner demotion |
| Cold-restart orphan reconciliation | Partial | A partition reassigned away while a node was down is retained as inert on-disk cold storage after restart (ownership-gated serving means it is never served or materialized) and surfaced at startup. Reclaim of that disk is still manual |
| Epoch fencing | Implemented | Role transitions advance log epochs before serving or applying replicated batches |
| Replica-durable confirms | Partial | Owner waits for durable follower progress according to assignment policy, with timeout and ISR floor |
| Durable stream replication (Plexus) | Partial | Tier-gated: the durable tier replicates record + cursor logs to stream_replication_factor followers (express tiers stay owner-only), durable publishes confirm on replica durability, and a caught-up follower is promoted on owner failover. Reuses the queue follower-worker, confirm gate, and failover-candidate selection |
| `min_in_sync_replicas` | Implemented | Runtime setting, fail-fast publish refusal when healthy ISR is below floor |
| Live repartitioning | Partial | Grow or shrink a queue’s partition count in Ganglion mode (versioned routing, in-flight transition serialization, drain-and-retire on shrink); admin control + API |
| Topology visibility | Partial | Admin API/page (with repartition + coordination-membership controls) and fibrilctl topology. Cross-broker lag aggregation is pending |
| Live topology push | Partial | Broker pushes a TopologyUpdate to each connection when that connection’s routing content changes (not on every coordination metadata bump); the Rust, TypeScript, and Python clients apply it to their routing cache and ack the generation |
| Repartition cutover fencing | Partial | The controller fences a repartition’s finalize (retiring shrunk-away partitions and clearing the marker) on cluster-wide client adoption of the new routing, derived from topology acks, bounded by repartition_adoption_timeout_ms. Publish version-fencing remains the correctness backstop. See live routing and cutover |
| Cohort visibility (admin) | Partial | Per-broker exclusive-cohort membership on the subscriptions page and /admin/api/cohorts; cluster-wide cohort assignment is broker-local, not centrally committed |
| Multi-node cohort coordinator test | Partial | Coordination-level e2e covers cross-broker membership aggregation and rebalance. Full broker/client scenario coverage is still growing |
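The repartition cutover fencing row can be illustrated with a small predicate. This is a sketch under assumptions, not the controller code: `can_finalize` and its parameters are hypothetical, and it reads "bounded by `repartition_adoption_timeout_ms`" as meaning that finalize proceeds once the timeout elapses even if some connections have not acked.

```python
from typing import Dict


def can_finalize(
    target_generation: int,
    acked_generation: Dict[str, int],
    started_ms: int,
    now_ms: int,
    adoption_timeout_ms: int,
) -> bool:
    """Decide whether a repartition may retire shrunk-away partitions.

    acked_generation maps a connection id to the latest topology
    generation that connection has acknowledged (hypothetical shape).
    """
    # Assumed reading of the bound: stop waiting after the timeout and
    # rely on publish version-fencing to reject stale routing.
    if now_ms - started_ms >= adoption_timeout_ms:
        return True
    # Otherwise wait until every connection has adopted the new routing.
    return all(gen >= target_generation for gen in acked_generation.values())
```

The timeout branch is why publish version-fencing stays the correctness backstop: a client that never acks can still be cut over, and its stale publishes are rejected rather than misrouted.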
Conditions and limits:
- This surface is experimental on the replication/sharding branch.
- Ganglion mode is the active embedded-coordination path. The older etcd-shaped plan remains useful as a design reference, but the current implementation uses the same coordination trait with Ganglion underneath.
- Replication is follower-pull. Followers apply durably, then report progress through stamped replication reads.
- Failover safety relies on assignment epochs plus local Stroma promotion gates.
- Replica-durable confirms are meaningful only when the assignment durability policy requires more than the owner.
- Cross-broker topology lag and ISR aggregation into the topology page is still pending.
- More failure testing is needed before treating this as production-ready HA.
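The interaction between the ISR floor and replica-durable confirms can be sketched as two checks. This is a minimal illustration, not the broker's code: `Assignment`, `required_replicas`, `check_isr_floor`, and `is_confirmed` are hypothetical names, the healthy ISR count is assumed to include the owner, and the confirm timeout mentioned in the table is left out.

```python
from dataclasses import dataclass
from typing import Dict


class PublishRefused(Exception):
    """Raised when the healthy ISR is below the configured floor."""


@dataclass(frozen=True)
class Assignment:
    required_replicas: int     # durable copies needed, counting the owner (assumed)
    min_in_sync_replicas: int  # floor taken from the runtime setting


def check_isr_floor(assignment: Assignment, healthy_isr: int) -> None:
    """Fail fast before appending when too few replicas are in sync."""
    if healthy_isr < assignment.min_in_sync_replicas:
        raise PublishRefused(
            f"healthy ISR {healthy_isr} is below floor {assignment.min_in_sync_replicas}"
        )


def is_confirmed(
    assignment: Assignment,
    offset: int,
    owner_durable: int,
    follower_durable: Dict[str, int],
) -> bool:
    """Return True once `offset` is durable on enough replicas.

    Each durable value is the highest offset that replica has stored durably.
    """
    if owner_durable < offset:
        return False
    needed_followers = max(0, assignment.required_replicas - 1)
    caught_up = sum(1 for durable in follower_durable.values() if durable >= offset)
    return caught_up >= needed_followers
```

With `required_replicas=1` the follower check is a no-op, which matches the note above that replica-durable confirms only matter when the policy requires more than the owner.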
Not Implemented or Not Planned
| Item | Status | Notes |
|---|---|---|
| Transactions | Out of scope | Not planned |
| Production-ready clustered HA | Planned | Experimental coordination, replication, and failover are wired, but hardening and runbooks remain |
| Single-node queue deletion | Implemented | Admin API/dashboard; see Admin Surface |
| Multi-node queue deletion | Planned | Coordinated teardown across replicas is pending |
| Message TTL (drop by age) | Implemented | Per-message + per-queue default; see Publish |
| Queue expiration (auto-delete idle queue) | Planned | Distinct from message TTL; needs global/coordinated idle tracking |
| Log retention by age (truncate old messages) | Planned or undecided | Not currently exposed as a user feature |
| Message purge (empty a queue) | Planned | Re-scoped: needs a replicated reset, not in-memory only |
| Blocking Python client | Planned | A Python client is already wired (see Queue Identity); a blocking variant remains a future client priority |
| C# client | Planned | Future client priority after Python |
| Go client | Planned | Future client priority after C# |
| Java client | Planned | Future client priority after Go |