Skip to content

Optimization log

This page tracks low-level optimization ideas before they become code changes. The rule is simple: define the starting point, explain why the change should help, measure before and after, then keep the result even when the idea does not work.

Optimization work should not start from taste. It should start from a concrete cost model and a benchmark that can prove the result.

Fibril already has useful end-to-end benchmarks:

  • scripts/bench-steady-c.sh measures controlled-rate publish and delivery, latency percentiles, missing messages, publish errors, confirmation errors, server RSS, and queue snapshots.
  • scripts/bench-matrix.sh runs repeatable steady-state scenarios and writes one results file plus one log file per case.
  • scripts/bench-e2e-c.sh measures burst-style ingress and egress saturation.
  • scripts/bench-results-table.sh turns steady benchmark result files into a Markdown table.

The current microbenchmark surface is still small, but crates/protocol/benches/encode.rs is wired as a Criterion encode/decode bench for publish frames. Before changing serialization, headers, batching, or queue-handle lookup behavior, add a targeted microbench or an explicit end-to-end benchmark case that isolates the expected effect.

Before changing a low-level path, record:

  • Git commit and whether the Stroma source is local or git-sourced.
  • Rust version and build mode.
  • Machine, CPU governor if known, disk type, and ulimit -n.
  • Benchmark command, full environment variables, and output directory.
  • Median result across at least three runs when the run is short enough.
  • p50, p95, p99, max latency, missing messages, errors, server RSS avg and peak, and queue end state.
  • Runtime warnings and errors from the benchmark log.

Use the same machine, same data directory class, same payload size, same durability settings, same client count, same prefetch, and same benchmark duration when comparing before and after.

Quick smoke:

Terminal window
OUT_DIR=/tmp/fibril-opt-smoke scripts/bench-matrix.sh smoke

Short 1KB baseline:

Terminal window
OUT_DIR=/tmp/fibril-opt-baseline WARMUP_SECS=2 DURATION_SECS=5 \
scripts/bench-matrix.sh baseline confirmed

Latency knee check:

Terminal window
OUT_DIR=/tmp/fibril-opt-throughput WARMUP_SECS=2 DURATION_SECS=5 \
scripts/bench-matrix.sh throughput-1k

Payload-size check:

Terminal window
OUT_DIR=/tmp/fibril-opt-payload WARMUP_SECS=2 DURATION_SECS=5 \
scripts/bench-matrix.sh payload

For noisy low-level work, prefer running one scenario three times over running many scenarios once. A one-run improvement is a clue, not a conclusion.

Use these as defaults unless the experiment says otherwise:

  • A change aimed at throughput should improve measured throughput or the latency knee by at least 10% without increasing p95 or p99 latency in the normal-load cases.
  • A change aimed at latency should improve p95 or p99 by at least 10% without lowering sustainable throughput.
  • A change aimed at memory should reduce server RSS peak or steady RSS by at least 10% without adding missing messages, errors, or a clear latency regression.
  • A change that adds hot-path instrumentation must show that the instrumentation overhead is small enough to keep enabled.
  • A change that moves bottlenecks between CPU, storage, and memory must say which bottleneck moved and why.

Small cleanups can still land without a visible benchmark win, but they should not be described as optimizations unless the numbers support it.

Hypothesis: the delivery loop rebuilds a Vec<Arc<ConsumerState>>, scans capacity, clones queue keys, allocates delivery records, and wakes the queue for every delivered message. Reducing per-message work here could improve 1KB throughput and the high-rate latency knee.

Why it might help: this path runs once per delivered message and its cost is mostly CPU, atomics, map operations, and allocation. These costs matter most for small payloads where storage bandwidth is not yet the bottleneck.

Benchmark needed: baseline, confirmed, and throughput-1k. A microbench for round-robin consumer selection would help if the implementation is localized enough.

Risk: batching delivery too aggressively can hurt fairness, backpressure, and tail latency.

Status: tuned once, then revisited in the 2026-07 hot-path audit (PERF_AUDIT_HOT_PATHS.md at the repo root). The per-message costs are now reduced: the delivery pump moves owned fields into the wire frame instead of cloning (including the payload), metrics and activity tracking are per poll batch, the loop no longer notifies itself per message, and the reverse tag-lookup map is gone from the delivery and settle paths. Measured effect on the standard scenarios was neutral and the 1KB latency knee did not move (still between 350k/s and 400k/s), so the knee is set elsewhere, most likely the engine pipeline or the per-message channel hop. The broker-to-pump hop was then batched (a Vec per consumer per poll batch) and turned out to be the knee itself, moving saturation onset from 350-400k/s to 500-600k/s.

Hypothesis: the current publish sink uses immediate draining plus a 1ms coalesce window for small batches. Tuning batch size, coalesce behavior, or confirmation handoff could improve throughput under high publisher concurrency.

Why it might help: fewer storage append calls and fewer completion handoffs can reduce scheduler and channel pressure.

Benchmark needed: baseline, confirmed with multiple confirmation windows, and burst bench-e2e-c. Include low-rate steady runs to catch latency regressions.

Risk: coalescing can make low-rate latency worse. Confirmation behavior is part of client-visible correctness, so errors and missing confirms matter as much as throughput.

Status: idea only.

Hypothesis: frame encode/decode and header maps may cost too much for small messages, especially when most publishes have no user headers.

Why it might help: fewer allocations and fewer map entries can reduce CPU and memory traffic. The current design already stores common metadata positionally, which should be preserved unless a benchmark proves otherwise.

Benchmark needed: a properly wired Criterion bench for encode/decode with:

  • 1KB payload, no headers.
  • 1KB payload, content type only.
  • 1KB payload, several user headers.
  • 64KB payload, no headers.

Then run baseline to confirm the microbench result matters end to end.

Risk: custom serialization changes can make the protocol harder to evolve and can lose against rmp-serde on real data.

Status: measured on the current hand-rolled positional codec (2026-07-03): 1KB publish encode ~0.8us, decode ~0.11us, round trip under 1us. The codec is not a meaningful bottleneck at current throughput ceilings. The earlier ~6.3-6.7us figure predates the current codec shape.

Hypothesis: staging encode, write, fsync, and notification as a deeper pipeline could improve durable write throughput.

Why it might help: the writer may be serializing work that could overlap, especially for larger payloads where encoding, file I/O, and fsync do not need to block the same stage.

Benchmark needed: payload and large-backlog scenarios, plus Keratin-level writer microbenchmarks in the Keratin repository.

Risk: this lives below Fibril and must preserve durability ordering. It can also improve large payloads while doing little or nothing for small payloads.

Status: idea only, belongs mostly in Keratin.

Hypothesis: a bounded memory cache of likely next deliverable messages could reduce log reads on hot queues.

Why it might help: Stroma knows ready offsets. When the queue is draining in order, reading ahead can improve locality and reduce per-message read overhead.

Benchmark needed: read-heavy steady runs, burst egress runs, and memory ceiling tests with multiple payload sizes.

Risk: memory growth and cache churn can easily make this worse. This should not start without a strict byte budget and visibility in metrics.

Status: idea only.

Hypothesis: replacing repeated String clones with validated topic and group types backed by compact or shared storage could reduce allocation and memory traffic.

Why it might help: topic and group keys are cloned into records, maps, messages, and admin snapshots.

Benchmark needed: microbench for publish and deliver record creation, plus steady 1KB throughput runs. Also test many sparse queues because the memory benefit may show there before it shows in single-queue throughput.

Risk: interners and shared strings can add locks or lifetime complexity. A bad implementation can be slower than cloning small strings.

Status: idea only.

Hypothesis: replica-durable confirmed publish latency is currently bounded by the follower pull worker’s caught-up poll interval. When a follower is already caught up, the default next poll delay is 1000ms, so the first confirmed publish after an idle period may wait for the follower to poll again before the owner can observe durable follower progress.

Why it might help: changing this path to owner notification, long-poll replication reads, or a lower adaptive delay could reduce cluster confirmed publish latency without changing standalone behavior.

Benchmark needed: one owner and one follower with replica-durable confirms enabled. Run low offered loads to expose idle penalty, then a small knee sweep. Record confirmed publish p50, p95, p99, timeout/error count, follower lag, and server RSS.

Risk: lowering the poll interval globally can add idle CPU and network wakeups. Notification or long-poll machinery can add complexity and must not hide owner failure.

Status: audit finding, benchmark not yet written.

Hypothesis: the current protocol replication resolver keeps one serialized connection per owner. This is simple and conservative, but it may bottleneck a node that follows many queues from the same owner.

Why it might help: per-queue peers, a small peer pool, or request multiplexing could increase catch-up throughput if the one-connection design becomes the limit.

Benchmark needed: one follower catching up many queues from one owner, with both small and medium payloads. Compare catch-up wall time, owner CPU, follower CPU, connection count, and error behavior.

Risk: more connections or multiplexing can make failure handling harder and can increase owner pressure. Keep the current design unless the benchmark proves it is the bottleneck.

Status: audit finding, benchmark not yet written.

Hypothesis: partitioned publish routing, not-owner redirects, and multi-partition subscription fan-in can regress independently of standalone publish/delivery benchmarks.

Why it might help: cluster routing adds topology lookup, redirect retry, multiple client engines, and merged subscription streams. These costs should be visible before TypeScript parity or further routing polish.

Benchmark needed: Rust client scenarios for keyed publish, keyless round-robin, redirected publish, and fan-in subscriptions across multiple partitions.

Risk: optimizing routing too early can make the client API more complex. Measure first, then only change the pieces that show up.

Status: audit finding, benchmark not yet written.

Add entries here when an optimization is actually tried.

DateCommitExperimentExpected effectBenchmarkResultDecision
2026-06-08investigationEstablish benchmark-first optimization processAvoid unmeasured low-level changesExisting benchmark inventoryNo runtime changeKeep this log updated
2026-06-08local experimentBatch delivery metrics and queue activity touch per poll batchReduce per-message time reads, atomics, and queue wakes in the delivery loopthroughput-1k, 250k-500k/s, 2s warmup, 5s steadyNeutral at 250k/s, worse p95/p99 and RSS at 350k/s, roughly neutral to worse above thatReverted. The existing per-message updates stay until a better-targeted change proves itself
2026-06-08local experimentWire protocol encode/decode Criterion bench and add content-type/header/payload-size casesMake protocol/header optimization measurable before changing wire layoutcargo bench -p fibril-protocol --bench encode -- --sample-size 201KB encode+decode about 6.3-6.7µs. Three small user headers were only slightly slower than no headers. 64KB encode+decode about 373µsKeep the bench. Header layout is not the first obvious bottleneck for current end-to-end runs
2026-06-08local experimentReduce publish sink small-batch coalesce window from 1ms to 250µsKeep useful batching while reducing artificial wait in normal and near-saturation runsbaseline, confirmed, and throughput-1k, 2s warmup, 5s steadyLow-rate 50k/150k was neutral. High-rate p95 improved from 471ms to 385ms at 350k/s, 1598ms to 1533ms at 400k/s, and 3304ms to 3267ms at 500k/s. RSS was also lower in high-rate runsKeep 250µs. It improves the knee without increasing the window
2026-06-08local experimentRemove publish sink small-batch wait entirelyAvoid any coalescing delay and rely only on immediate channel drainingthroughput-1k, 2s warmup, 5s steadyGood at 250k/s, but worse than 250µs at 350k/s and 400k/s. Similar at 500k/sReverted. Some short coalescing still helps around the backlog knee
2026-06-08local experimentStart the publish sink coalesce window at the previous flush instead of after the next message arrivesAvoid artificial sparse-message latency while keeping burst batching when traffic is already flowingbaseline, confirmed, and throughput-1k, 2s warmup, 5s steadyLow-rate 50k/150k stayed neutral. High-rate p95 improved from the 250µs always-wait run: 385ms to 337ms at 350k/s, 1533ms to 1418ms at 400k/s, and 3267ms to 3152ms at 500k/sKeep. This gives the sparse case immediate flush behavior and improves the measured saturation knee
2026-06-15toolingLet the steady benchmark target an external broker/admin pair and reuse cluster-tryout.sh for live cluster-routed runsMake cluster-routed and later replica-durable confirm runs comparable without changing the benchmark client shapecargo check -p fibril-benches --bin steady_c --locked, shell syntax checks, 3-node Ganglion tryout smoke at 500/sThe tryout-backed run delivered 500 measured messages with 0 missing and p99 publish-to-deliver latency of 15ms. A later audit showed the follower had not applied the measured range, so these are cluster-routed numbers, not replica-durable numbers. External targets report RSS as unavailable because the wrapper does not own the server processKeep. Next step is an operator or test harness path that can require follower catch-up and replica-durable confirmation
2026-06-15tuningAdd replica-durable latency split and follower read-budget runtime settingsDistinguish client/confirm backlog from broker delivery time, then test whether follower batch limits were the replica-durable throughput ceiling3-node Ganglion tryout, replica_durable:2, 1 KiB payload, 50k/s target, 50k confirm window, heartbeat 1s, liveness TTL 30s, follower budgets 1024 messages/read, 1024 events/read, 16 iterations/tickMeasured about 45.6k/s with zero missing and zero errors. Follower tail reached benchmark writes and owner/follower cursors matched. publish-to-server-receive p99 was about 3.9s, while server-receive-to-deliver p99 was about 141msKeep the knobs. The first ceiling was partly follower-read-budget limited. The next question is confirm-window and client-side backpressure behavior
2026-06-15investigationAdd temporary protocol codec timing logs gated by FIBRIL_PROTOCOL_CODEC_TIMINGCheck whether large-payload replicated runs are starving the server, the benchmark client, or both during encode/decode/frame-copy work3-node Ganglion tryout, replica_durable:2, 1 MiB payloads, explicit codec timing enabledThe probe showed large encode/decode costs on both sides, and especially expensive follower-side decode of large ReplicationReadOk frames. Moving large replication response decode to spawn_blocking protects Tokio workers but does not reduce deserialization timeRefactor or remove the logs before merge. If this signal remains useful, convert it to sampled or aggregated metrics rather than per-frame logs. Real speedup likely needs a different replication wire shape or dedicated decode/apply pipeline
2026-06-15fixAdd byte-aware owner replication read capsStop large-payload followers from requesting giant single ReplicationReadOk responses when record-count limits allow too much payload data256 KiB replica-durable run exposed responses growing into tens or hundreds of MiB, with follower decode taking seconds and publish confirms timing outWith max_bytes_per_read = 8388608, the same 256 KiB, 170/s, confirm-window 100 run completed with 0 publish errors, 0 confirm errors, 0 missing messages, p99 publish-to-deliver about 69ms, and matching owner/follower cursorsKeep. Longer-term transport may split payload and event streams so size limits are natural rather than envelope-level
2026-06-15investigationSweep confirm windows for high-rate 1 KiB replica-durable publishesSeparate sustainable low-latency throughput from throughput achieved by allowing a large confirmation backlog3-node Ganglion tryout, replica_durable:2, 1 KiB payload, 50k/s target, confirm windows 5k, 7.5k, 10k, 20k, 50k5k-7.5k windows stayed low latency at about 30k-32k/s with p99 publish-to-deliver 17ms. 10k reached about 34.8k/s but p99 rose to 841ms. 20k-50k reached about 38.5k-46.9k/s but p99 stayed around 2.8s. Raising event/message read limits at the 7.5k window did not improve throughputKeep the split latency metrics. The next target is owner ingest/confirm scheduling, client confirm pacing, or storage isolation, not just larger follower read batches
2026-06-15investigationCompare tmpfs-backed and disk-backed local cluster data dirsCheck whether the current replica-durable low-latency ceiling is mostly physical disk contentionSame 3-node tryout, 1 KiB payload, 50k/s target, 7.5k confirm window. Default /tmp is tmpfs. Disk comparison used CLUSTER_TRYOUT_RUN_ROOT under the repo filesystemTmpfs reached about 31.8k/s with p99 publish-to-deliver 17ms. Disk-backed reached about 28.2k/s with p99 28ms. Storage placement matters, but tmpfs still has the same broad ceiling shapeKeep the script run-root override. Prioritize scheduler, protocol, confirm pacing, and single-partition owner-path analysis before assuming this is only disk throughput
2026-06-15investigationSplit replica-durable publish-only and consume-drain paths, then inspect protocol codec logsFind whether the current ceiling is publish replication, consume/ack, client windowing, roundtrips, or serialization3-node Ganglion tryout on tmpfs, replica_durable:2, 1 KiB payload, 50k/s publish-only target. Tuned follower budgets: 4096 messages/read, 4096 events/read, 16 MiB/read, 64 iterations/tick. Larger-batch variant used 16k records/read and 64 MiB/readConsume-drain reached about 105k/s after preload, so consume/ack was not the bottleneck. Publish-only stayed around 27k/s. Lowering retry poll to 1ms did not help. Larger batches barely changed throughput but raised confirm latency from about 360ms p50 to about 1.36s p50. Codec logs showed owner ReplicationReadOk encode around 13-16ms for ~4.8 MiB frames, while follower decode took about 117-130ms. Larger ~19 MiB frames decoded in roughly 460-590msCurrent primary bottleneck is follower-side bulk ReplicationReadOk MessagePack decode, not physical disk or simple roundtrip count. Next useful work is a replication-specific wire shape or a bounded decode/apply pipeline
2026-07-03df89e33Release consumer credit at settle accept instead of after the settle fsyncRemove the ack group-commit latency from the consumer flow-control cycleconfirmed at 50k/150k with PREFETCH=64 and default 16384, 3 runs each sideAt prefetch 64 and 150k/s, publish-to-deliver collapsed from p50 ~800ms / p99 ~1290ms to p50 10ms / p99 14ms, matching the prefetch-16k profile. Default-prefetch runs unchangedKeep. The credit-bound regime is gone and at-least-once semantics are unchanged
2026-07-0386fdbe7Remove the per-publish ConfirmStream offset send and drain tasksRemove one channel hop and task wakeup per confirmed publish plus a bounded-channel stall hazardbaseline plus confirmed at 50k/150k, 3 runs each sideLatency-neutral, all percentiles within run noiseKeep as overhead and hazard removal, not claimed as a latency win
2026-07-033179dae, 150a65a, b7ef762Delivery pump field moves, per-batch delivery bookkeeping, batched ack settle path, reverse tag map removal, publish-arm cache-first ordering, cached transport sinkCut per-message allocations, atomics, map ops, and channel awaits on the delivery and settle pathsbaseline, confirmed, throughput-1k (250k-500k), 2 runs per stepSub-knee scenarios neutral, and the 1KB knee did not move (still between 350k/s and 400k/s). This supersedes the reverted 2026-06-08 metrics-batching trial: bundled with the payload move and wake-path changes it showed no 350k/s regressionKeep. The knee is set elsewhere, most likely the engine pipeline or the per-message broker-to-pump channel hop
2026-07-039d1efb9Batch the broker-to-pump delivery hop (one Vec per consumer per poll batch)Amortize the per-message channel send and task wakeup on the delivery pathbaseline, confirmed, throughput-1k, low-prefetch confirmed, plus 600k/700k rate probesThis hop was the knee. 400k/s went from p50 ~750ms to 16ms and 500k/s from p50 ~2.4s to 26-27ms at full rate with zero missing. Saturation onset moved from 350-400k/s to 500-600k/s and high-rate RSS roughly halved. Sub-knee and low-prefetch unchangedKeep. The next ceiling sits past 500k/s. A codec re-measure puts 1KB encode at ~0.8us and decode at ~0.11us, so the likely remaining bottleneck is the engine pipeline and scheduling, not the wire codec
2026-07-03investigationCompare tmpfs and SATA-SSD (ext4, ~92% full) data dirs on the post-audit buildCheck how much of the measured single-node profile is storage-independentconfirmed plus throughput-1k, TMPDIR routed to the SSD, 2 runsOn the SSD the sustainable 1KB rate is roughly 250k-330k/s (400k+ targets fall short) with zero missing, vs 500k+ on tmpfs, and confirmed p50 ranged 15-36ms in the better run. Run-to-run variance on the SSD was large (the worse run showed 80-120ms p50 at low rates), consistent with a nearly full drive throttling writes. The pre-batching delivery-hop knee (350-400k/s) sat above what this disk sustains, so that win shows on fast storageKeep both profiles in mind: tmpfs measures the architecture ceiling, this SSD measures this drive. Disk-backed numbers on healthier storage are worth collecting before publishing throughput claims
2026-07-03keratin dda8555Self-clock keratin group commit when the fsync worker is idleStop durability acks from waiting out the fsync interval tick when no fsync is in flightbaseline, confirmed, throughput-1k on tmpfs, plus back-to-back disk A/B on the SATA SSDConfirm p50 fell from 11ms to 4-5ms and publish-to-deliver p50 to 2-4ms sub-knee. 400k/s went from p50 16ms to 3-4ms and 500k/s from 27ms to 8-9ms at full rate, zero missing. The disk A/B was within drive variance with the new policy slightly ahead. A min_fsync_interval_ms floor (default 0) exists for storage where fsync rate is expensive, exposed through fibril configKeep. The end-to-end p50 floor is no longer interval-bound
2026-06-16investigationAdd broker replication timing metrics to the cluster tryout outputSeparate client-observed confirm latency from owner replica-gate wait, owner read cost, follower owner-read wait, follower apply cost, and whole follower tick time3-node Ganglion tryout, replica_durable:2, 1 KiB payload, 50k/s publish-only target, confirm window 1024, 4096 messages/events per read, 16 MiB/readActual rate was about 49k/s with confirm p50/p95/p99/max = 204/243/244/246ms. Owner replica_confirm_wait averaged about 0.033ms over 144k samples, while owner reads averaged about 1ms, follower owner-read await averaged about 6.3ms, follower apply averaged about 19.2ms, and whole follower ticks averaged about 552msKeep the metrics. This run says the visible latency is not mostly waiting after local append for follower progress. Next isolate local append completion latency and follower tick batching before changing durability semantics or poll cadence
  • End-to-end benchmarks are good enough to catch large regressions, but they do not isolate small serialization or allocation changes.
  • Low-rate steady runs are as important as saturation runs because batching can improve peak throughput while making normal latency worse.
  • Reducing obvious per-message metric work did not improve the measured high-rate TCP path in the first trial. Scheduler behavior and delivery-loop wake timing need more precise measurement before changing this path.
  • For 1KB publish frames, MessagePack encode/decode is measurable but not large enough by itself to explain the current backlog knee.
  • For internal replication reads, MessagePack was the wrong shape. The biggest replicated-throughput win so far came from replacing the hot ReplicationReadOk response with a raw, easy-to-parse binary frame carrying fixed metadata, raw message header/payload bytes, raw event bytes, and explicit offsets. That removed a large nested decode/re-encode cost from follower catch-up.
  • Record-count limits are not enough for replication batches. Payload size can dominate, so byte caps must be part of the protocol and must rewrite progress metadata to the actual returned frontier.
  • The publish sink benefits from a much shorter coalescing window than the original 1ms. Removing the wait entirely lost useful batching around the 350k-400k/s knee. Measuring the window from the previous flush gives sparse traffic immediate sends while preserving useful batching under load.
  • Large payload results are hardware-specific on the current SATA SSD machine. They are still useful for detecting memory and storage-regression direction.
  • For replica-durable runs, publish-to-deliver latency can be misleading by itself. The current 50k/s tuned run shows most latency before server receive, while server-receive-to-deliver remains much lower. Keep reporting both.
  • The first broker timing metrics show that replica-durable confirm latency is not automatically the same thing as post-local-append follower wait. In the 50k/s window-1024 run, owner replica_confirm_wait averaged about 0.033ms while client-observed confirmation p50 was about 204ms. The next useful split is local append completion latency versus client/window backlog and follower tick/apply work.
  • Replica-durable replication is now usable enough to benchmark seriously, but it is not a finished performance story. Current runs prove correctness and useful throughput, while the latency caveat is still real: higher throughput currently depends on enough outstanding confirms, and that backlog becomes visible client latency.
  • Compared with mature replicated queues and logs, the main open performance questions are architectural rather than micro-optimizations: committed visibility watermarking, long-poll or push-hinted follower replication, cheaper watermark-based confirmation, better separation of payload transfer from event/progress movement, whole-frame versus streaming decode, and raw Keratin range replication that avoids decode/re-encode work where safe.
  • Local multi-node benchmarks are useful stress tests, but they are not a full deployment model. They share CPU, page cache, and one drive, while also hiding real network cost.
  • The next narrow measurement should split local append completion, confirm sink backlog, replica-gate wait, and follower apply/tick time before making more protocol or scheduling changes.
  • Temporary timing probes are acceptable during an audit, but they should not remain as ad hoc logs. Promote useful signals to cheap aggregated metrics or sampled diagnostics, then remove the raw per-frame logging.
  • Operator metrics should stay cheap and useful. Percentile-heavy measurement belongs in benchmark tools unless there is a clear production need.