Benchmarks

Current benchmark numbers are informal architecture checks, not claims of production capacity.

Internal measurements with the current TCP transport and durable path on a single Ubuntu node have observed:

| Workload | Observation |
| --- | --- |
| Ingress | roughly 250k+ messages/sec |
| Egress | roughly 250k+ messages/sec |
| Payload | 1KB messages |
| Machine | Ryzen 5950X |

Memory usage during these runs ranged from a few hundred MB at lower load to roughly 1-2GB near peak throughput, depending on queue depth, batching, and inflight state.

These numbers are useful mostly as a sanity check:

  • the durable path is not obviously too slow
  • batching and the queue execution model are promising
  • memory behavior still needs tuning
  • larger payloads will shift bottlenecks toward memory, copying, storage, and network I/O

The project still needs:

  • broader payload-size sweeps across more hardware
  • durability-setting comparisons
  • richer latency histograms and structured output
  • restart/replay timing
  • multi-consumer fairness and backpressure scenarios

The current TCP-layer benchmark helper is e2e_c. It is still an early benchmark, but it now reports wall throughput, active receive throughput, sent and received counts, missing receive count, retry metadata observed on delivered messages, and latency percentiles from publish time to reader delivery.

For a quick local run, use the benchmark script:

MESSAGES=500000 CLIENTS=10 SIZE=1024 PREFETCH=16384 scripts/bench-e2e-c.sh

The script builds the release server and benchmark binary, waits for /healthz, runs a small warmup so lazy queue setup is out of the main measurement path, then starts a reader and writer for the measured run.

Useful knobs:

| Variable | Default | Meaning |
| --- | --- | --- |
| MESSAGES | 500000 | Messages per client in the measured run |
| CLIENTS | 10 | Parallel reader and writer client count |
| SIZE | 1024 | Raw payload size in bytes |
| PREFETCH | 16384 | Reader subscription prefetch |
| WARMUP_MESSAGES | 1000 | Warmup messages before the measured run |
| READY_SETTLE_SECONDS | 0.5 | Pause after reader readiness before starting the writer |
| IDLE_TIMEOUT_MS | 10000 | Reader idle timeout before reporting partial receive counts |
| CONFIRMED | 0 | Set 1 to wait for publish confirmations for correctness/debug checks |
| LOG_FILE | temporary file | Build, server, and noisy runtime logs |
| RESULTS_FILE | temporary file | Deterministic benchmark summary and queue snapshots |

The helper can also be run manually. Start the server in one terminal:

cargo run --release --bin fibril-server

Start the writer in another terminal:

cargo run --release --bin e2e_c -- -m 500000 -c 10 --writer --size 1024

Start the reader in a third terminal, as close to the writer start time as practical:

cargo run --release --bin e2e_c -- -m 500000 -c 10 --reader --prefetch 16384

The reader side prints latency percentiles when it receives messages. If the reader goes idle before receiving the target count, it reports the partial count and missing count instead of waiting indefinitely. The wall throughput includes any idle timeout tail, while active receive throughput uses the span between the first and last received message. Retry counts are read from Fibril’s reserved delivery metadata headers when present. Structured benchmark output and scenario tables are still future work.
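The difference between the two throughput figures can be sketched as follows. This is a minimal Python illustration, not the e2e_c implementation: the `ReaderRun` record and its field names are hypothetical, and all timestamps are assumed to be seconds on one monotonic clock.

```python
from dataclasses import dataclass


@dataclass
class ReaderRun:
    """Hypothetical summary of one reader run; all times in seconds."""
    run_start: float    # reader started waiting for messages
    run_end: float      # reader stopped (includes any idle-timeout tail)
    first_recv: float   # first message received
    last_recv: float    # last message received
    target: int         # messages the reader expected
    received: int       # messages actually received


def wall_throughput(run: ReaderRun) -> float:
    """Messages per second over the whole run, idle tail included."""
    return run.received / (run.run_end - run.run_start)


def active_throughput(run: ReaderRun) -> float:
    """Messages per second between the first and last received message."""
    return run.received / (run.last_recv - run.first_recv)


def missing(run: ReaderRun) -> int:
    """Messages that never arrived before the reader gave up."""
    return run.target - run.received


# A reader that gets 900 of 1000 messages over 9 s, then idles out 10 s later.
run = ReaderRun(run_start=0.0, run_end=19.5, first_recv=0.5, last_recv=9.5,
                target=1000, received=900)
```

In this example the idle tail more than halves the wall figure, which is why the active receive throughput is the better number to compare when a run ends with missing messages.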

The burst benchmark intentionally lets writers run as fast as possible. That is useful for saturation checks, but it can build backlog and make latency look larger as the message count grows.

For latency at a controlled offered load, use the steady-state helper:

WRITERS=10 READERS=10 RATE_PER_SEC=100000 WARMUP_SECS=3 DURATION_SECS=10 \
SIZE=1024 PREFETCH=16384 scripts/bench-steady-c.sh

The steady helper runs readers and writers in one coordinated process. It marks warmup messages separately, measures only the configured steady window, and prints both publish-to-delivery and server-receive-to-delivery latency. The wrapper also writes full server logs and full benchmark results to files, then prints a compact summary including publish and confirmation error counts plus server RSS average and peak sampled during the benchmark run.
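To make the two latency definitions concrete, here is a small sketch that derives both from three timestamps per message and summarizes them with nearest-rank percentiles. The tuple layout and function names are assumptions for illustration, and the helper may compute its percentiles differently.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, for integer 0 < p <= 100."""
    rank = (p * len(sorted_values) + 99) // 100  # integer ceiling of p/100 * n
    return sorted_values[max(rank, 1) - 1]


def latency_summary(samples):
    """samples: list of (publish_ms, server_recv_ms, deliver_ms) tuples.

    Returns (p50, p95, p99, max) for publish-to-delivery latency and for
    server-receive-to-delivery latency, in that order.
    """
    publish_to_deliver = sorted(d - p for p, _, d in samples)
    server_to_deliver = sorted(d - s for _, s, d in samples)

    def summarize(values):
        return (percentile(values, 50), percentile(values, 95),
                percentile(values, 99), values[-1])

    return summarize(publish_to_deliver), summarize(server_to_deliver)
```

The gap between the two summaries is the time a message spent between the writer's publish call and the server receiving it, so a widening gap points at the client or network side rather than the broker.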

The wrapper starts a local fibril-server on the default broker and admin ports by default. Run one wrapper benchmark at a time. A second run will fail if those ports are already occupied.

To target an already running server or a kept cluster, set START_SERVER=0, BROKER_ADDR, and ADMIN_ADDR. DURABILITY_LABEL is only a result label, but it is useful when comparing local, cluster-routed, and later replica-durable confirmed runs.

To reuse the real cluster lifecycle checks, run the steady benchmark through the tryout script:

CONFIRMED=1 RATE_PER_SEC=1000 WARMUP_SECS=2 DURATION_SECS=5 \
scripts/cluster-tryout.sh --ganglion --nodes 3 --steady-bench

cluster-tryout.sh starts the nodes, waits for the cluster, declares the benchmark topic, runs the normal data-plane smoke, then calls bench-steady-c.sh against the live cluster with START_SERVER=0. Use --bench-topic <topic> when you want the tryout script to declare and benchmark a different topic. This path currently measures clustered client routing. Treat it as replica-durable only once the assignment and post-run follower state prove the chosen follower applied the measured log range.

When CONFIRMED=1, writers still run with pipelined publish confirmations by default. Set CONFIRM_WINDOW=1 if you specifically want the older serial “publish, wait, publish” shape.
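Why the window size matters for throughput can be reasoned about with Little's law: a writer that allows at most W unconfirmed publishes in flight, with a mean confirmation round trip of T seconds, cannot publish faster than W / T. The sketch below is a back-of-the-envelope bound, not a model of Fibril's writer, and the 10 ms round trip is a hypothetical value chosen only for illustration.

```python
def max_confirmed_rate(writers: int, confirm_window: int, confirm_rtt_s: float) -> float:
    """Upper bound on aggregate confirmed publishes per second (Little's law)."""
    return writers * confirm_window / confirm_rtt_s


# Hypothetical 10 ms confirmation round trip with 10 writers.
serial = max_confirmed_rate(10, 1, 0.010)        # "publish, wait, publish"
pipelined = max_confirmed_rate(10, 1024, 0.010)  # default CONFIRM_WINDOW
```

Under these assumed numbers the serial shape tops out around a thousand messages per second, while the pipelined window moves the bound far above the rates the benchmark targets.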

Useful knobs:

| Variable | Default | Meaning |
| --- | --- | --- |
| WRITERS | 10 | Parallel writer clients |
| READERS | 10 | Parallel reader clients |
| RATE_PER_SEC | 100000 | Target aggregate publish rate |
| WARMUP_SECS | 5 | Warmup duration excluded from measured results |
| DURATION_SECS | 30 | Steady measurement duration |
| DRAIN_TIMEOUT_SECS | 10 | Time allowed for measured messages to drain |
| SIZE | 1024 | Raw payload size in bytes |
| PREFETCH | 16384 | Reader subscription prefetch |
| CONFIRMED | 0 | Set 1 to require broker publish confirmations |
| CONFIRM_WINDOW | 1024 | In-flight confirmations per writer when CONFIRMED=1 |
| TOPIC | topic1 | Queue topic used by the steady benchmark |
| START_SERVER | 1 | Set 0 to target an external server or cluster |
| BROKER_ADDR | 127.0.0.1:9876 | Broker TCP address passed to the benchmark client |
| ADMIN_ADDR | 127.0.0.1:8081 | Admin address used for health checks and queue snapshots |
| DURABILITY_LABEL | local | Label printed into results and summary tables |
| BUILD | 1 | Set 0 to skip rebuilding release binaries |
| LOG_FILE | temporary file | Build, server, and noisy runtime logs |
| RESULTS_FILE | temporary file | Deterministic benchmark summary and queue snapshots |

Memory numbers are sampled from the local fibril-server process RSS once per second during the wrapper benchmark. The average covers the whole sampled run period, so it is not limited to the warmup-excluded steady window.
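A once-per-second RSS sampler of this kind can be sketched as below. This is an illustration rather than the wrapper's actual code: it assumes a Linux host where `/proc/<pid>/status` exposes a `VmRSS` line in kB, and the function names are hypothetical.

```python
import time


def parse_vm_rss_kib(status_text: str) -> int:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    raise ValueError("no VmRSS line found")


def summarize_mib(samples_kib):
    """Return (average, peak) in MiB over every sample taken."""
    mib = [kib / 1024 for kib in samples_kib]
    return sum(mib) / len(mib), max(mib)


def sample_rss(pid: int, seconds: int, interval_s: float = 1.0):
    """Collect one RSS sample per interval for a running process (Linux only)."""
    samples = []
    for _ in range(seconds):
        with open(f"/proc/{pid}/status") as status:
            samples.append(parse_vm_rss_kib(status.read()))
        time.sleep(interval_s)
    return samples
```

Because every sample feeds the average, a sampler like this reports a mean over the full run rather than over the steady window alone, while the peak is unaffected by where the window boundaries fall.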

For repeatable local sweeps, use the matrix helper:

scripts/bench-matrix.sh smoke
scripts/bench-matrix.sh baseline confirmed
scripts/bench-matrix.sh throughput-1k payload

The matrix helper builds the release server and benchmark binary once, then runs named steady-state cases with one results file and one log file per case. It also writes summary.md, a Markdown table generated from those result files. Set OUT_DIR=... to choose where files go. Without arguments, it runs only the quick smoke scenario.

Available scenarios:

| Scenario | Purpose |
| --- | --- |
| smoke | Short low-rate sanity check |
| baseline | 1KB 50k/s and 150k/s unconfirmed |
| confirmed | 1KB 50k/s and 150k/s with pipelined confirmations |
| throughput-1k | Higher-rate 1KB exploratory sweep |
| payload | 8KB, 64KB, 512KB, and 1MB spot checks |
| large-backlog | Large-payload cases expected to build backlog |
| all | baseline, confirmed, throughput-1k, and payload |

To regenerate a table from existing result files:

scripts/bench-results-table.sh bench-results/steady-*/baseline-*.results.txt

The offered rate is the requested publish rate, not a guarantee that the broker or machine can keep up without queueing. When actual measured rate reaches the target but latency climbs, the run is usually showing backlog, not low-latency capacity. When actual measured rate falls below target, writers or the machine could not sustain the requested input rate.
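The reading rule above can be written down as a small classifier. The function name and both thresholds (a 1% rate tolerance and a 100 ms p99 budget) are illustrative assumptions, not values used by the benchmark scripts.

```python
def classify_run(target_rate: float, actual_rate: float, p99_ms: float,
                 rate_tolerance: float = 0.01, p99_budget_ms: float = 100.0) -> str:
    """Label a steady-state result row; both thresholds are illustrative."""
    if actual_rate < target_rate * (1 - rate_tolerance):
        # Writers or the machine could not sustain the requested input rate.
        return "below target"
    if p99_ms > p99_budget_ms:
        # Target reached, but latency shows messages queueing behind each other.
        return "backlog"
    return "clean"
```

For example, `classify_run(400_000, 400_000, 1096)` returns `"backlog"`: the rate was met, but only by letting messages queue.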

Measured missing should usually be zero. Non-zero values mean the benchmark stopped before every measured message was delivered, commonly because the drain timeout expired or the run failed.

RSS is sampled from the server process only. It excludes benchmark client memory and the operating system page cache, so it is useful for comparing local runs but not a full machine-memory budget.

Quick validation run from June 7, 2026, using WRITERS=10, READERS=10, SIZE=1024, PREFETCH=16384, WARMUP_SECS=2, and DURATION_SECS=5:

| Mode | Target rate | Actual measured rate | Missing | publish→deliver p50/p95/p99/max | server-receive→deliver p50/p95/p99/max | Errors | End queue |
| --- | --- | --- | --- | --- | --- | --- | --- |
| unconfirmed | 50k/s | 50,000/s | 0 | 15 / 22 / 25 / 58 ms | 10 / 16 / 18 / 20 ms | 0 | ready=0, inflight=0 |
| unconfirmed | 150k/s | 149,999/s | 0 | 11 / 15 / 17 / 56 ms | 9 / 13 / 14 / 17 ms | 0 | ready=0, inflight=0 |
| confirmed, window=1024 | 50k/s | 50,010/s | 0 | 14 / 19 / 21 / 28 ms | 11 / 15 / 16 / 21 ms | 0 | ready=0, inflight=0 |
| confirmed, window=1024 | 150k/s | 149,999/s | 0 | 12 / 16 / 17 / 21 ms | 10 / 13 / 15 / 18 ms | 0 | ready=0, inflight=0 |

Higher-rate exploratory sweep from the same run shape:

| Mode | Target rate | Actual measured rate | Missing | publish→deliver p50/p95/p99/max | Notes |
| --- | --- | --- | --- | --- | --- |
| unconfirmed | 250k/s | 250,017/s | 0 | 13 / 16 / 17 / 55 ms | Clean short run |
| unconfirmed | 350k/s | 350,002/s | 0 | 79 / 114 / 122 / 130 ms | Latency knee starts showing |
| unconfirmed | 400k/s | 400,000/s | 0 | 792 / 1060 / 1096 / 1107 ms | Drains, but backlog-driven |
| unconfirmed | 450k/s | 449,953/s | 0 | 1807 / 2038 / 2055 / 2066 ms | Drains, high latency |
| unconfirmed | 500k/s | 499,955/s | 0 | 2520 / 2783 / 2815 / 2833 ms | Drains, high latency |
| unconfirmed | 600k/s | 599,916/s | 0 | 3806 / 4546 / 4579 / 4600 ms | Drains, very high latency |

For this short local run, the practical low-latency region for 1KB messages ends somewhere between 250k/s and 350k/s: the 250k/s run is still clean, while 350k/s already shows the latency knee. Above that, the broker can still drain the run, but latency reflects backlog building during the measurement window.

Pipelined confirmed publishes follow the same pattern. With CONFIRM_WINDOW=1024, a 400k/s target reached about 385k/s. Raising the window to 4096 reached the 400k/s target, and 450k/s also reached target, but latency rose into the 1-2 second range. Larger windows are useful for saturating the path while preserving publish confirmation correctness. They are not a latency optimization.

Payload-size spot checks on the same SATA SSD development machine:

| Payload | Target rate | Actual measured rate | Missing | publish→deliver p50/p95/p99/max | Server RSS avg/peak | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 8KB | 50k/s | 50,010/s | 0 | 14 / 17 / 19 / 61 ms | not sampled | Clean short run |
| 8KB | 150k/s | 139,987/s | 0 | 2608 / 3117 / 3168 / 3245 ms | not sampled | Could not reach target, backlog-driven |
| 64KB | 10k/s | 10,000/s | 0 | 18 / 22 / 23 / 32 ms | not sampled | Clean short run |
| 64KB | 20k/s | 18,285/s | 0 | 1605 / 1891 / 2144 / 2277 ms | not sampled | Could not reach target, likely storage-bandwidth bound |
| 512KB | 1k/s | 999/s | 0 | 27 / 34 / 39 / 47 ms | 262.9 / 310.2 MiB | Clean short run |
| 512KB | 2k/s | 2,000/s | 0 | 1165 / 1669 / 1756 / 1841 ms | 951.4 / 1538.0 MiB | Drains, but backlog-driven |
| 1MB | 500/s | 498/s | 0 | 33 / 45 / 51 / 63 ms | ~290 / ~334 MiB | Clean short run. Reruns varied slightly |
| 1MB | 1k/s | 1,000/s | 0 | 1812 / 2539 / 2693 / 2801 ms | 847.0 / 1187.5 MiB | Drains, but backlog-driven |

For larger payloads, the bottleneck shifts away from message scheduling and toward memory copying, TCP throughput, and especially durable storage bandwidth. On this SATA SSD machine, 64KB at 10k/s is already roughly 625 MiB/s of application payload before protocol, replication within the durable path, and filesystem overhead. Treat payload-size numbers as hardware-specific. The large-payload memory samples also show the expected split: clean runs can stay in the low hundreds of MiB, while backlog-driven runs retain much more payload data in-process and can exceed 1 GiB RSS.
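The bandwidth figure in the paragraph above is simple arithmetic, sketched here so other payload and rate combinations can be checked the same way. The function name is hypothetical, and it treats "KB" as KiB, which is what the 625 MiB/s figure implies.

```python
def payload_mib_per_sec(payload_kib: float, messages_per_sec: float) -> float:
    """Application payload bandwidth in MiB/s, before protocol or storage overhead."""
    return payload_kib * messages_per_sec / 1024


bandwidth_64k = payload_mib_per_sec(64, 10_000)   # 64KB at 10k/s
bandwidth_1k = payload_mib_per_sec(1, 250_000)    # 1KB at 250k/s
```

The 64KB case works out to 625 MiB/s, against roughly 244 MiB/s for 1KB messages at 250k/s, which is why the larger payloads hit storage bandwidth long before message scheduling becomes the limit.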

Earlier local exploratory run, using WRITERS=10, READERS=10, SIZE=1024, PREFETCH=16384, WARMUP_SECS=3, and DURATION_SECS=10:

| Target rate | Actual measured rate | Missing | publish→deliver p50/p95/p99/max | server-receive→deliver p50/p95/p99/max | End queue |
| --- | --- | --- | --- | --- | --- |
| 50k/s | 49,962/s | 0 | 17 / 25 / 29 / 63 ms | 11 / 17 / 19 / 52 ms | ready=0, inflight=0 |
| 100k/s | 99,785/s | 0 | 13 / 18 / 62 / 136 ms | 10 / 14 / 57 / 130 ms | ready=0, inflight=0 |
| 150k/s | 149,831/s | 0 | 12 / 17 / 173 / 225 ms | 10 / 14 / 171 / 222 ms | ready=0, inflight=0 |
| 200k/s | 199,733/s | 0 | 13 / 78 / 259 / 294 ms | 11 / 76 / 258 / 292 ms | ready=0, inflight=0 |
| 250k/s | 249,591/s | 0 | 14 / 260 / 367 / 397 ms | 12 / 258 / 365 / 394 ms | ready=0, inflight=0 |
| 300k/s | 298,528/s | 0 | 18 / 578 / 623 / 659 ms | 16 / 577 / 613 / 655 ms | ready=0, inflight=0 |

These older results remain useful as a development checkpoint, but the newer short sweeps above are a better current summary: the broker can drain substantially higher short-run rates, while the practical low-latency region depends heavily on payload size, durable storage bandwidth, and whether backlog is allowed to build. Treat all tables here as reproducible local checkpoints, not capacity promises.

Plexus streams fan out: every live subscriber receives every matching record, so delivered throughput is readers × publish rate. Streams have three durability tiers (durable, speculative, ephemeral) that trade delivery and producer-confirm timing against the storage guarantee. The helper is scripts/bench-stream.sh, the stream counterpart of bench-steady-c.sh:

DURABILITY=ephemeral RATE_PER_SEC=150000 WRITERS=4 READERS=2 \
CONFIRMED=1 CONFIRM_WINDOW=4096 scripts/bench-stream.sh

DURABILITY selects the tier and DATA_DIR selects the data filesystem, so the same run can be compared on an SSD versus a tmpfs to isolate storage effects.

The table below uses 1KB records, four writers, two fan-out readers, one partition, and pipelined confirmations (window 4096), on the same SATA SSD development machine. Latency is publish to delivery:

| Tier | Target rate | Measured rate | deliver p50/p95/p99/max | Confirm p50 | RSS peak |
| --- | --- | --- | --- | --- | --- |
| durable | 50k/s | 49,952/s | 7 / 9 / 27 / 56 ms | 10 ms | 53 MiB |
| speculative | 50k/s | 50,000/s | 1 / 2 / 3 / 10 ms | 10 ms | 65 MiB |
| ephemeral | 50k/s | 49,986/s | 1 / 2 / 3 / 9 ms | 4 ms | 49 MiB |
| durable | 150k/s | 149,597/s | 49 / 67 / 95 / 128 ms | 50 ms | 103 MiB |
| speculative | 150k/s | 149,653/s | 2 / 16 / 36 / 57 ms | 55 ms | 140 MiB |
| ephemeral | 150k/s | 149,800/s | 1 / 2 / 10 / 28 ms | 2 ms | 78 MiB |
| durable | 250k/s | 235,306/s | 64 / 82 / 97 / 108 ms | 64 ms | 137 MiB |
| speculative | 250k/s | 236,722/s | 3 / 19 / 28 / 59 ms | 45 ms | 155 MiB |
| ephemeral | 250k/s | 249,256/s | 7 / 31 / 45 / 53 ms | 8 ms | 124 MiB |

The tiers separate as designed:

  • durable waits for the fsync before it delivers and confirms, so its delivery latency is the fsync latency. Strictest guarantee, highest and most predictable latency.
  • speculative delivers as soon as the record is staged and defers the producer confirm until it is durable. Delivery is near-instant while the confirm reflects real durability, and records carry a fibril.speculative header so a consumer knows they may still be rolled back.
  • ephemeral delivers and confirms at staging and persists in the background. It has the lowest confirm latency at every measured rate, delivery latency at or below the other tiers up to 150k/s, and the lightest memory footprint. It keeps a tight tail on a real disk because a background flush drains dirty pages on keratin’s fsync worker stage rather than letting them pile up until the kernel throttles the writer.

Every reader is an independent subscriber that receives the whole stream, so the delivered rate is readers × publish rate until a bottleneck bites. The tables below scale the reader count at 1KB, 100k/s offered, on one partition.

Ephemeral tier, cursorless readers reading from the live tail:

| Readers | Publish rate | Delivered rate | Per reader | deliver p50/p95/p99/max | RSS peak |
| --- | --- | --- | --- | --- | --- |
| 1 | 99,990/s | 99,990/s | 99,990/s | 0 / 2 / 3 / 17 ms | 58 MiB |
| 2 | 99,990/s | 199,980/s | 99,990/s | 1 / 2 / 4 / 20 ms | 60 MiB |
| 4 | 100,006/s | 400,023/s | 100,006/s | 0 / 2 / 3 / 17 ms | 66 MiB |
| 8 | 99,984/s | 799,866/s | 99,983/s | 1 / 3 / 17 / 52 ms | 80 MiB |
| 16 | 97,165/s | 1,546,066/s | 96,629/s | 3 / 52 / 171 / 233 ms | 169 MiB |
| 32 | 93,687/s | 850,976/s | 26,593/s | 969 / 1175 / 1269 / 1387 ms | 206 MiB |

Durable tier, auto-ack readers (each commits a durable cursor per record):

| Readers | Publish rate | Delivered rate | Per reader | deliver p50/p95/p99/max | RSS peak |
| --- | --- | --- | --- | --- | --- |
| 1 | 100,000/s | 99,570/s | 99,570/s | 59 / 85 / 115 / 173 ms | 87 MiB |
| 2 | 100,000/s | 199,256/s | 99,628/s | 60 / 93 / 143 / 178 ms | 88 MiB |
| 4 | 100,000/s | 396,174/s | 99,043/s | 64 / 90 / 115 / 150 ms | 89 MiB |
| 8 | 99,990/s | 788,713/s | 98,589/s | 62 / 97 / 127 / 176 ms | 87 MiB |
| 16 | 99,832/s | 1,105,715/s | 69,107/s | 196 / 613 / 646 / 732 ms | 140 MiB |
| 32 | 95,410/s | 759,266/s | 23,727/s | 1067 / 1250 / 1360 / 1505 ms | 201 MiB |

Both tiers fan out near-linearly while a single partition’s fan-out actor has headroom: every reader sees the full 100k/s up to eight readers. The ephemeral tier peaks around 1.5M frames/s at sixteen readers, where the single per-partition fan-out actor and its delivery tasks saturate. Past that knee one partition thrashes, so thirty-two readers delivers less aggregate than sixteen at backlog-driven latency. The durable tier holds the same near-linear shape to eight readers and pays a steady delivery-latency floor for the fsync-before-deliver guarantee, with its per-reader cursor work bringing the knee in a little earlier. Durable auto-ack at this fan-out is only viable because cursor commits are microbatched, coalesced per partition into one durable record and one actor message per window. Committing a cursor per record inline collapsed delivery to tens of records per second per reader at multi-second latency.
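One way to read the knee out of these tables is to compare the delivered rate with the ideal readers × publish rate. The helper below is a hypothetical sketch of that ratio, applied to three rows of the ephemeral-tier table.

```python
def fanout_efficiency(readers: int, publish_rate: float, delivered_rate: float) -> float:
    """Delivered frames as a fraction of the ideal readers x publish rate."""
    return delivered_rate / (readers * publish_rate)


# Rows taken from the ephemeral-tier table above.
at_8 = fanout_efficiency(8, 99_984, 799_866)
at_16 = fanout_efficiency(16, 97_165, 1_546_066)
at_32 = fanout_efficiency(32, 93_687, 850_976)
```

The ratio stays above 0.99 through sixteen readers and falls below 0.3 at thirty-two, which matches the description of a single partition thrashing past its knee. Note that the ratio alone hides the latency cost already visible at sixteen readers.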

Those two tables are bounded by delivery throughput: aggregate frames per second (readers times publish rate) saturating one partition’s fan-out actor. That is a separate question from how many readers a partition can fan out to when delivery throughput is not the bottleneck. Holding the rate low (1KB ephemeral, 1k/s offered) so the aggregate stays well under the ceiling, reader count scales cleanly:

| Readers | Publish rate | Delivered rate | Per reader | deliver p50/p95/p99/max | RSS peak |
| --- | --- | --- | --- | --- | --- |
| 32 | 1,000/s | 32,000/s | 1,000/s | 4 / 6 / 7 / 9 ms | 40 MiB |
| 64 | 1,000/s | 64,000/s | 1,000/s | 4 / 6 / 7 / 9 ms | 47 MiB |
| 128 | 1,000/s | 128,000/s | 1,000/s | 4 / 6 / 7 / 9 ms | 61 MiB |
| 256 | 1,000/s | 256,000/s | 1,000/s | 4 / 7 / 7 / 11 ms | 84 MiB |

All 256 readers on a single partition keep up at a flat few-millisecond latency, and memory grows gently with connection count. Fan-out reach is cheap: the limit is total delivered frames per second, not the number of readers, so one partition serves many readers as long as readers times rate stays under its delivery ceiling.

The lever past the single-partition delivery ceiling is partitions: each partition has its own fan-out actor and reader connections, so spreading a stream across partitions scales delivery throughput horizontally.
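As a rough planning aid, the partition count needed for a given fan-out can be estimated from the per-partition delivery ceiling. This is a sketch under stated assumptions: the stream spreads evenly across partitions, delivery scales linearly with partition count (not measured here), and the ceiling is taken as roughly 1.5M frames/s from the ephemeral peak on this one machine. The function name is hypothetical.

```python
def partitions_needed(readers: int, publish_rate: int, per_partition_ceiling: int) -> int:
    """Smallest partition count keeping each partition under its delivery ceiling."""
    total_frames = readers * publish_rate
    # Integer ceiling division, with a floor of one partition.
    return max(1, -(-total_frames // per_partition_ceiling))


# 32 readers at 100k/s offered, assuming ~1.5M frames/s per partition.
estimate = partitions_needed(32, 100_000, 1_500_000)
```

Because latency was already degrading at the measured peak, a real deployment would likely want headroom below the ceiling rather than sizing exactly to it.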

These are reproducible local checkpoints, not capacity promises, the same as the queue numbers above. The per-reader delivered rate matching the publish rate, as it does in every row below the saturation knee, is how the bench shows each reader received the whole stream rather than a thinned subset. ephemeral confirms before the record is durable (best effort), so a crash can lose the most recent unflushed records. That is the tradeoff for its latency, and the durable tier exists for when that is not acceptable.