We Benchmarked Everything: NOX Performance Analysis
Criterion microbenchmarks, throughput tests, latency distributions, and entropy measurements — every number we quote, we measured.
We Benchmarked Everything (Because Nobody Else Does)
NOX benchmarks: 33 data files. 36 chart pairs. Six tiers of measurement from cryptographic primitives to full DeFi pipeline economics. And an uncomfortable truth about transparency in privacy infrastructure.
Part 5 of 6 by Xythum Labs
We went looking for Nym's benchmark data.
They run 550+ nodes across 64 countries. They've raised $94.5 million. They've been in production since 2021. Surely, somewhere, they've published throughput numbers, latency distributions, privacy metrics -- the basic data you'd want before trusting a network with your metadata.
We found two benchmark functions in their codebase. Two. In the nymtech/sphinx repository: bench_new_no_surb and bench_unwrap. The results? Never published. We dug through their git history and found a cpu-cycles/ directory with vendored libcpucycles (amd64-tsc backend) -- scaffolding for CPU-cycle measurement that was never connected to any Rust code. Their documentation references performance in vague, qualitative terms -- "low latency," "high throughput" -- without a single number attached.
Katzenpost is better. They publish per-operation micro-benchmarks for 18 cipher suites in their nightly CI pipeline, with a 110% regression detection threshold via benchmark-action/github-action-benchmark. That's a real CI setup. But their coverage stops at Sphinx creation and unwrapping. No end-to-end throughput. No latency distributions. No privacy analytics.
Tor publishes metrics, to their credit -- bandwidth, relay counts, circuit latencies from six global vantage points via OnionPerf, continuously since 2009. It's the gold standard for real-world anonymity network monitoring. But the anonymity properties are studied primarily by external researchers, not by the Tor Project itself. And Tor's threat model is fundamentally different from a mix network's.
We think this is a problem. Privacy infrastructure asks users to trust it with their most sensitive data. You'd think they'd at least show their work. Numbers. Reproducible data. Something.
So we benchmarked everything. Every layer of NOX from Sphinx per-hop processing to full DeFi round-trips with reproducible Criterion data and published artifacts.
Why We Benchmark: The Reproducibility Crisis in Mixnet Research
There's a dirty secret in privacy networking research: most performance claims are unfalsifiable.
The Loopix paper (USENIX Security 2017) reports ">300 messages per second" per relay. The measurement? A Python implementation on AWS EC2 instances, circa 2017. What instance type? What region? What Linux kernel version? What Python version? The paper doesn't say, and the original test environment is long gone. You cannot reproduce this number. You have to take it on faith.
This isn't unusual. It's the norm. Survey the academic mixnet literature from the last decade and you'll find a pattern: simulation results on unnamed hardware, throughput claims without methodology, latency numbers without distributions. The numbers exist to pass peer review. They were never designed to be reproduced by anyone outside the original lab.
We think this matters especially for privacy systems, where the performance characteristics directly affect the security properties. A mixnet that's too slow gets low adoption. Low adoption means small anonymity sets. Small anonymity sets mean bad privacy. The performance-privacy feedback loop is real, and you can't reason about it without actual data.
Here's what we consider the minimum bar for credible mixnet benchmarking:
- Hardware specification. CPU model, core count, RAM, OS version, compiler version. Not "a commodity server" or "a cloud instance."
- Release-mode compilation. Debug builds in Rust can be 10-50x slower due to disabled optimizations, bounds checking, and debug assertions. If your benchmarks aren't compiled with `--release`, they're measuring your debug scaffolding, not your system.
- Warm-up runs. The first invocation of any function is dominated by cold caches, JIT compilation (if applicable), and page faults. Criterion.rs does 100 warm-up iterations by default before measurement begins. If your benchmark framework doesn't warm up, your numbers are dominated by startup artifacts.
- Statistical significance. One measurement is an anecdote. Five thousand measurements with confidence intervals and outlier detection is data. We use Criterion.rs, which computes 95% confidence intervals, detects outliers (mild and severe), and reports if performance has regressed between runs.
- Raw data publication. Percentiles are summaries. Summaries hide information. We publish the raw measurement arrays (4,916 individual latency samples in one file alone) so anyone can compute their own statistics.
- Reproducibility script. `run_all.sh`. One command. All 33 data files regenerated. All 36 chart pairs re-rendered. If you can't reproduce it, it's not science -- it's marketing.
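Raw-data publication is the criterion with teeth: if the raw arrays are there, anyone can recompute any summary. A minimal sketch of recomputing percentiles from a published sample array (the JSON field name here is illustrative, not NOX's actual schema):

```python
def percentile(samples, p):
    """Nearest-rank percentile over raw latency samples (no interpolation)."""
    ordered = sorted(samples)
    k = max(1, -(-len(ordered) * p // 100))  # integer ceil of p/100 * n, 1-indexed
    return ordered[k - 1]

# Hypothetical raw latency array in microseconds, standing in for the
# published measurement files.
raw = {"latency_us": [30, 31, 29, 35, 62, 30, 33, 31, 90, 32]}
samples = raw["latency_us"]

p50 = percentile(samples, 50)
p99 = percentile(samples, 99)
print(p50, p99)
```

The point of publishing the array rather than the percentiles: a reader who distrusts nearest-rank can swap in interpolated percentiles, trimmed means, or anything else, without asking us.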
This is not a high bar. It's the absolute minimum for any system that asks people to trust it with their metadata. And yet, of the five major mixnet implementations (Nym, Katzenpost, Tor, Loopix, NOX), only one publishes data that meets all six criteria.
It's us. That's not a brag -- it's an indictment of the field.
Warm-up matters more than people realize. Our Sphinx hop benchmark without Criterion's warm-up phase averaged 147 microseconds for the first 10 iterations -- 4.9x slower than the warmed-up mean of 30.17 microseconds. The culprit is the instruction cache: process_sphinx_packet is ~12KB of compiled x86-64 code that must be fetched from main memory on first call. A production node processing hundreds of packets per second will always be cache-hot. The warmed Criterion number is the correct number for system design.
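The warm-up-then-measure pattern is simple to state precisely. Here is a generic sketch of it (not Criterion's actual implementation; the workload is a stand-in, and the iteration counts mirror the ones quoted above):

```python
import time

def bench(fn, warmup=100, iters=5000):
    """Run `fn` warmup times with results discarded, then time `iters` iterations."""
    for _ in range(warmup):
        fn()  # heats instruction/data caches, branch predictors, page tables
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return sum(samples) / len(samples), samples

# Stand-in workload; a real harness would call the Sphinx peel here.
mean, samples = bench(lambda: sum(range(100)))
```

The warmed mean is the number to report for steady-state design questions; the discarded iterations are the cold-start story, which is a separate (and also interesting) measurement.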
Publication Bias and Why We Publish Everything
When a project publishes only the numbers that look good, the absence of other numbers is informative. Katzenpost publishes per-hop microbenchmarks but no throughput data. Nym publishes nothing -- $94.5M in funding, zero published benchmarks. We publish everything, including the embarrassing parts: our n-1 attack succeeds 100% of the time, our intersection attack converges in 20 epochs, our forward secrecy is nonexistent. The research community needs honest data more than we need flattering optics.
Someone will object: "Microbenchmarks don't reflect real-world performance." They're right, in part. But if you can't measure the pieces, you can't understand the whole. Our 6-tier framework bridges this gap: Tier 1 (Criterion microbenchmarks) gives the theoretical floor, Tier 2 (in-process integration) the realistic per-operation cost, Tier 3 (multi-process) the system throughput, and Tiers 4-6 privacy, security, and economics. The gaps between tiers are themselves informative -- the 2x gap between Criterion (30.17 us) and integration (61.6 us) quantifies the system infrastructure overhead.
The Benchmark Suite
NOX ships with a 6-tier benchmark framework covering the full stack from microsecond-level cryptographic operations to dollar-denominated relayer economics:
| Tier | What It Measures | Tool | Data Files |
|---|---|---|---|
| 1. Cryptographic Primitives | Per-hop Sphinx processing, ECDH, blinding, MAC, AES | Criterion.rs + custom integration bench | per_hop_breakdown.json |
| 2. Network Performance | Multi-node throughput, latency distributions, CDF curves | nox_bench (7 subcommands) | throughput_sweep.json, latency_cdf.json, latency_vs_delay.json, surb_rtt.json |
| 3. Multi-Process Scale | Cross-process E2E throughput, horizontal scaling | nox_multiprocess_bench (4 subcommands) | mp_throughput_sweep.json, scaling.json |
| 4. Anonymity Metrics | Shannon entropy, unlinkability, cover traffic analysis | nox_privacy_analytics (7+ subcommands) | entropy.json, unlinkability.json, cover_traffic.json, cover_analysis.json |
| 5. Attack Resistance | n-1 attack, intersection attack, compromised nodes, PoW DoS, replay | Custom adversary models | attack_sim.json, pow_dos.json, replay_detection.json |
| 6. DeFi Pipeline | Full deposit-to-withdraw pipeline, gas profile, relayer economics | micro_mainnet_sim + analytical model | defi_pipeline.json, gas_profile.json, economics.json, fec_recovery.json, fec_vs_arq.json |
Total: 33 JSON data files, 36 publication-quality chart pairs (PNG + SVG), all reproducible from a single run_all.sh script. The raw data is in the repository. You can regenerate every chart yourself.
Hardware and Environment
CPU: AMD Ryzen 7 9800X3D 8-Core Processor (5 physical cores, 9 logical threads under WSL2). This is a 2024 desktop CPU with AMD's 3D V-Cache technology -- a gaming chip, not a server processor. We mention this because it matters: the 96MB L3 cache on this chip means our Sphinx processing benefits from cache locality that a typical cloud VM won't have. Our numbers would be somewhat worse on an AWS c5.xlarge.
Memory: 30 GB RAM. More than enough that memory pressure never enters the picture for our test configurations (up to 50 nodes, max observed usage 295 MB).
OS: Ubuntu 24.04.2 LTS running under WSL2 on Windows 11. WSL2 uses a real Linux kernel (not translation), so syscall overhead is minimal, but networking between WSL2 processes has slightly different characteristics than bare-metal Linux. Our multi-process benchmarks use TCP sockets between OS processes, which under WSL2 go through the Hyper-V virtual switch -- adding perhaps 10-20 microseconds of latency per hop compared to bare metal. This is noise at the millisecond-scale mixing delays we test, but we mention it because honest methodology requires it.
Rust version: 1.93.0 (stable, 2026-01-19). All benchmarks compiled with --release (optimizations enabled, debug assertions disabled, LTO enabled via Cargo profile).
Git commit: db6ea3c (2026-03-02) for all benchmark data files. Every JSON file includes the exact commit hash so you can check out the same code and reproduce.
What we don't control: CPU frequency scaling. The cpu_governor field in our data files reads "unknown" because WSL2 doesn't expose the Linux CPU frequency governor. On bare metal, you'd pin the governor to performance to eliminate frequency scaling jitter. Under WSL2, Windows manages CPU frequency. This could introduce up to ~5% variance in micro-benchmark timings. Criterion's outlier detection and large sample sizes (5,000 iterations) mitigate this, but it's a known imperfection.
Benchmark Binaries
The benchmark suite is implemented as several Rust binaries, each targeting a different measurement tier:
- `cargo bench --bench crypto_bench`: Criterion.rs microbenchmarks for cryptographic primitives. Measures Sphinx hop peel, Sphinx packet creation (1/2/3 hops), ECDH shared secret, key blinding, AES-CTR encryption, HMAC computation. Each benchmark: 100 warmup + 5,000 measurement iterations.
- `nox_bench` (7 subcommands): In-process network benchmarks. Spawns N nodes as async tasks within one process. Subcommands: `throughput` (injection rate sweep), `latency` (CDF generation), `latency-vs-delay` (delay parameter sweep), `surb-rtt` (round-trip measurement), `operational` (startup/memory/disk), `per-hop` (instrumented breakdown), `concurrency` (parallelism tuning).
- `nox_multiprocess_bench` (4 subcommands): Multi-process network benchmarks. Spawns each node as a separate OS process. Subcommands: `throughput` (E2E injection rate sweep), `latency` (CDF generation), `scaling` (5/10/25/50 node sweep), `surb-rtt` (multi-process RTT).
- `nox_privacy_analytics` (7+ subcommands): Privacy metric measurement. Subcommands: `entropy` (Shannon entropy vs delay), `entropy-vs-users` (scaling with anonymity set size), `unlinkability` (KS test), `cover-traffic` (bandwidth overhead), `cover-analysis` (active/idle distinguishability), `timing-correlation` (input-output timing), `attack-sim` (n-1, intersection, compromised nodes).
- `nox_economics` (3 subcommands): DeFi pipeline measurement. Requires a running Anvil instance and deployed contracts. Subcommands: `gas-profile` (per-circuit gas and proof gen), `economics` (profitability analysis across price/gas scenarios), `fec` (recovery rate and ARQ comparison).
- `micro_mainnet_sim`: Full end-to-end simulation. Deploys contracts to Anvil, spawns a 5-node mixnet, creates wallets with funded accounts, runs the complete deposit → split → join → transfer → withdraw pipeline through both direct and paid mixnet transport. This is the integration test that ties everything together.
All binaries write structured JSON to scripts/bench/data/. A Python script (scripts/bench/charts/generate_all.py) reads these JSON files and produces paired PNG + SVG charts using matplotlib.
Tier 1: Cryptographic Primitives
The fundamental question of any Sphinx mixnet is: how fast can you peel an onion layer? Everything else -- throughput, latency, scalability -- is downstream of this number.
How Criterion.rs Works (and Why It Matters)
Before we show numbers, a brief detour on measurement methodology, because it's the difference between "we timed it once" and "we know the distribution."
Criterion.rs is a statistical benchmarking framework for Rust. Here's what happens when you run cargo bench --bench crypto_bench -- sphinx_hop_peel:
- Warm-up phase: 100 iterations of the benchmark function, results discarded. This heats up instruction caches, data caches, branch predictors, and page tables. Without warm-up, the first few iterations are dominated by cache misses and page faults, giving you numbers that reflect OS behavior more than algorithmic performance.
- Measurement phase: 5,000 iterations, each timed with `std::time::Instant` (which maps to `clock_gettime(CLOCK_MONOTONIC)` on Linux -- nanosecond resolution, no wall-clock drift). The framework automatically groups iterations into batches to amortize timing overhead.
- Statistical analysis: Criterion computes the mean, standard deviation, median, and median absolute deviation. It estimates the 95% confidence interval for the mean using bootstrapping (100,000 bootstrap resamples by default). It classifies outliers as "mild" (1.5-3x the IQR from the quartiles) or "severe" (>3x).
- Regression detection: If you've run the benchmark before, Criterion compares the current run to the cached baseline and reports whether performance has changed with statistical significance. A >5% regression triggers a warning. This is how you catch accidental performance regressions before they land in production.
The key insight: a single timing is meaningless. What matters is the distribution of timings across thousands of runs, because that distribution tells you about cache behavior, OS scheduling, and thermal throttling -- all of which affect real-world performance.
Sphinx Per-Hop Processing: 61.6 Microseconds End-to-End
Every packet in a Sphinx mix network is peeled one layer at each hop. That peel involves public-key cryptography (ECDH key agreement), key blinding (so the next hop can't recompute previous shared secrets), MAC verification (integrity), routing header decryption, and payload body decryption. The per-hop cost is the fundamental throughput bottleneck of any Sphinx-based mixnet.
We measured this two ways:
Criterion micro-benchmark (isolated): 30.17 microseconds per hop. This is the raw Sphinx processing time -- no event bus, no logging, no backpressure. Measured with Criterion.rs: 100 warmup iterations, 5,000 measurement iterations, outlier detection enabled. The function under test receives a pre-constructed Sphinx packet and peels one layer. Nothing else.
Integration benchmark (realistic): 61.6 microseconds per hop. This is the end-to-end per-hop cost measured inside a running 5-node mixnet processing 500 packets. The 2x gap reflects real-world overhead: bus event dispatch, context switching, tracing instrumentation, and backpressure management.
The integration number is the one that matters for system design. Here's where that time goes:
| Phase | Mean (us) | Min (us) | Max (us) | p50 (us) | p99 (us) | Share | Notes |
|---|---|---|---|---|---|---|---|
| ECDH (X25519) | 35.8 | 28.0 | 211.9 | 28.8 | 72.7 | 58.2% | Shared secret computation |
| Key Blinding | 22.9 | 0.0 | 97.8 | 28.1 | 64.8 | 37.2% | Group exponentiation for next hop |
| MAC Verify | 0.82 | 0.33 | 10.6 | 0.77 | 1.63 | 1.3% | HMAC-SHA256 integrity check |
| Routing Decrypt | 0.59 | 0.39 | 11.4 | 0.53 | 1.12 | 1.0% | AES-CTR on routing header |
| Body Decrypt | 0.53 | 0.27 | 106.8 | 0.38 | 1.09 | 0.9% | AES-CTR on 32KB payload |
| Key Derivation | 0.49 | 0.19 | 2.2 | 0.48 | 1.04 | 0.8% | HKDF for symmetric keys |
| Total | 61.6 | 29.4 | 223.9 | 59.3 | 146.8 | 100% | 1,449 hop samples |
Public-key operations (ECDH + blinding) account for 95.4% of per-hop cost. Symmetric operations -- including AES-CTR over the entire 32KB payload -- are negligible at 3.0% combined. This means that optimizing symmetric ciphers is a waste of time. The path to faster Sphinx is batched ECDH, hardware-accelerated X25519, or KEM-based designs that eliminate blinding factors entirely.
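A quick sanity check on the share column, using the rounded phase means from the table (small rounding drift against the quoted 95.4% is expected, since the table values are themselves rounded):

```python
# Phase means in microseconds, copied from the per-hop breakdown table
phases = {
    "ecdh": 35.8, "blinding": 22.9, "mac": 0.82,
    "routing": 0.59, "body": 0.53, "kdf": 0.49,
}
total = 61.6  # integration-benchmark mean per hop

# Public-key vs symmetric split
pk_share = (phases["ecdh"] + phases["blinding"]) / total * 100
sym_share = (phases["mac"] + phases["routing"] + phases["body"]) / total * 100
print(round(pk_share, 1), round(sym_share, 1))
```

The split lands at roughly 95% public-key vs 3% symmetric, which is the whole optimization story in two numbers.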
Why X25519 and Not Something Else
A reasonable question: why X25519 for ECDH? There are alternatives.
P-256 (NIST curve): Widely deployed, hardware-accelerated on some CPUs (Intel's SHA-NI extensions can accelerate the underlying field arithmetic). But P-256 implementations have a nasty history of timing side-channels. The curve was designed by NIST with coefficients that appear random but were never explained, leading to persistent (if unproven) concerns about potential backdoors. For a privacy system, the optics alone are disqualifying.
Curve448: Stronger security margin than X25519 (224 bits vs 128 bits). About 2-3x slower. Given that our ECDH is already 35.8 microseconds and we need sub-second latency for DeFi operations, the 2-3x penalty would push us toward 100+ microsecond per-hop times. The security margin of 128 bits is more than adequate for our threat model (the weakest link is the mixing delay, not the key exchange).
X25519: Daniel Bernstein's Curve25519 in Montgomery form. Clean design, no unexplained constants, extensive analysis, constant-time implementations available in every language. The x25519-dalek crate we use is one of the most audited Rust cryptography libraries. The 35.8 microsecond cost is dominated by the finite field arithmetic -- about 1,000 field multiplications for a scalar multiplication.
Post-quantum (ML-KEM-768): The Katzenpost project has shown this is viable at ~243 microseconds per hop for their Xwing hybrid (X25519 + ML-KEM-768). That's about 4x our current cost. We don't do post-quantum yet -- it's on the roadmap but not a launch priority. The honest reason: post-quantum Sphinx is an active research area, and we'd rather wait for the designs to stabilize than ship something we'll need to replace. Katzenpost is ahead of us here, and we give them credit for it.
The Body Decryption Surprise
The body decryption number deserves special attention: 0.53 microseconds for 32KB of AES-CTR. Let's think about what that means.
32,768 bytes in 530 nanoseconds is roughly 60 GB/s of AES-CTR throughput. That's not algorithmic cleverness -- it's hardware. Modern x86 CPUs with AES-NI (Intel's AES New Instructions, also supported by AMD since Bulldozer) encrypt and decrypt AES in hardware at roughly one cycle per byte for a naive serial implementation. At the 9800X3D's 4.7 GHz boost clock, that alone would give about 4.7 GB/s per core -- but AES-NI pipelines multiple 16-byte blocks through its rounds in parallel, lifting effective throughput an order of magnitude above that naive serial estimate.
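The back-of-envelope arithmetic, spelled out (the clock figure is the boost clock quoted in the hardware section):

```python
payload_bytes = 32_768
decrypt_s = 530e-9   # 0.53 microseconds from the breakdown table
clock_hz = 4.7e9     # 9800X3D boost clock

throughput_gbps = payload_bytes / decrypt_s / 1e9   # bytes/sec -> GB/s
cycles_per_byte = clock_hz * decrypt_s / payload_bytes
print(round(throughput_gbps, 1), round(cycles_per_byte, 3))
```

The observed rate works out to well under 0.1 cycles per byte -- far below the one-cycle-per-byte serial figure, which is the pipelining showing up directly in the measurement.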
The practical consequence: NOX processes 16x larger payloads than Katzenpost and Nym (32 KB vs their 2 KB) while spending less than 1% of per-hop time on payload decryption. Packet size is essentially free in the symmetric domain. If you're designing a mixnet for DeFi payloads (which include ZK proofs at 2-4KB, encrypted UTXO notes at 208 bytes, and Merkle proofs at variable size), there's no performance reason to artificially constrain packet size.
This is why we chose 32KB packets. The DeFi payload easily fits. The per-hop cost barely changes. And we avoid the fragmentation and reassembly complexity that smaller packets would require.
The Broader Cryptographic Primitive Landscape
Beyond Sphinx processing, we use several other cryptographic primitives in the system. While we haven't run Criterion on all of them (the Sphinx per-hop is the critical hot-path operation), here's the broader picture:
Poseidon2 Hash (BN254 field): Used for Merkle tree commitments, nullifier derivation, and intent hashing. Poseidon is an arithmetic-friendly hash designed for ZK circuits -- it has minimal multiplicative complexity, which means fewer constraints in the Noir circuit. In native Rust, Poseidon2 over two BN254 field elements completes in low microseconds. The cost is negligible on the node side; what matters is the constraint count inside the ZK circuit (which affects proof generation time, not network throughput).
AES-128-CBC with PKCS#7 Padding: Used for encrypting UTXO notes. A 192-byte note plaintext produces a 208-byte ciphertext (192 bytes + 16 bytes PKCS#7 padding block). The encryption must produce byte-identical results across TypeScript, Noir, and Rust -- any mismatch makes notes unspendable because the Merkle leaf commitment is Poseidon2(packed_ciphertext). We enforce cross-language parity using vector generation scripts (packages/wallets/scripts/gen_*.ts) that produce test vectors consumed by all three implementations.
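The 192 → 208 byte expansion follows from PKCS#7 always adding at least one padding byte: a plaintext that is already block-aligned gets a full extra block. A minimal sketch of the padding rule (not our actual implementation, which must match the TypeScript and Noir code byte-for-byte):

```python
def pkcs7_pad(data: bytes, block: int = 16) -> bytes:
    """PKCS#7: append N bytes of value N; a full block when already aligned."""
    n = block - (len(data) % block)
    return data + bytes([n]) * n

def pkcs7_unpad(data: bytes, block: int = 16) -> bytes:
    n = data[-1]
    assert 1 <= n <= block and data[-n:] == bytes([n]) * n, "bad padding"
    return data[:-n]

note = b"\x00" * 192          # 192-byte note plaintext, already 16-aligned
padded = pkcs7_pad(note)      # gains a full 16-byte padding block -> 208 bytes
print(len(padded))
```

Because the Merkle leaf commits to the ciphertext, even this padding block must be produced identically in all three languages; a one-byte disagreement makes the note unspendable.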
BabyJubJub Scalar Multiplication: Used for ECDH on the BabyJubJub curve (not to be confused with the X25519 ECDH used in Sphinx). BabyJubJub is a twisted Edwards curve embedded in the BN254 scalar field, which means BabyJubJub operations can be efficiently verified inside BN254-based ZK circuits. The scalar multiplication is used for deriving shared secrets between sender and recipient for note encryption. In native Rust, this is fast (microseconds). Inside the ZK circuit, it's one of the most expensive operations (dominating the gate count).
X25519 Key Exchange (~30 microseconds): Used in Sphinx packet processing (both ECDH shared secret and key blinding). This is the hot-path operation that determines per-hop throughput, as we've discussed extensively.
The pattern is clear: the on-chain / ZK circuit costs dominate over network-level cryptographic costs by 4-6 orders of magnitude. A proof generation takes 9,000,000 microseconds. A Sphinx hop takes 62 microseconds. Optimizing the Sphinx hop by 2x (saving 31 microseconds) improves end-to-end latency by 0.0003%. Optimizing the proof generation by 2x (saving 4.5 seconds) improves end-to-end latency by 45%. The leverage is entirely on the circuit/prover side.
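The leverage claim in numbers. The ~10 s end-to-end budget here is illustrative (9 s proof generation plus mixing, transport, and on-chain steps), matching the order of magnitude used in the paragraph above:

```python
proof_s = 9.0        # proof generation, seconds
hop_s = 62e-6        # one Sphinx hop, seconds
total_s = 10.0       # assumed end-to-end budget for illustration

hop_saving = (hop_s / 2) / total_s * 100      # halve the Sphinx hop cost
proof_saving = (proof_s / 2) / total_s * 100  # halve proof generation
print(round(hop_saving, 4), round(proof_saving, 1))
```

Six orders of magnitude of leverage difference: any engineering hour spent below the prover is, for latency purposes, wasted.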
Why ChaCha20 at All, When AES-CTR Is This Fast
Wait -- we just praised AES-CTR, so why does our Sphinx use AES-CTR for routing headers but ChaCha20 for some symmetric operations in the broader system?
The answer is context-dependent: AES-CTR is faster on hardware with AES-NI. ChaCha20 is faster on hardware without it (ARM processors, older CPUs). For the Sphinx packet format specifically, we use AES-CTR because mix nodes are expected to run on modern x86 servers where AES-NI is universal. For client-side operations (running on phones, browsers, embedded devices), ChaCha20 would be the better choice. We haven't made the client-side decision yet because we don't have a mobile client.
Competitive Comparison: Per-Hop Processing
| Implementation | Per-Hop (us) | Relative | Language | Crypto Suite | Payload |
|---|---|---|---|---|---|
| NOX (Criterion) | 30.17 | 1.0x | Rust | X25519 ECDH + AES-CTR + HMAC-SHA256 | 32 KB |
| NOX (integration) | 61.6 | 2.0x | Rust | Same (with system overhead) | 32 KB |
| Katzenpost KEM | 55.7 | 1.8x | Go | X25519 KEM + stream + MAC + SPRP | 2 KB |
| Katzenpost NIKE | 144.1 | 4.8x | Go | X25519 NIKE + blinding + stream + MAC + SPRP | 2 KB |
| Katzenpost Xwing | 172.6 | 5.7x | Go | ML-KEM-768 + X25519 hybrid | 2 KB |
| Loopix (reference) | ~1,500 | ~50x | Python | Sphinx (sphinxmix library) | N/A |
| Nym | N/P | -- | Rust | X25519 ECDH + AES-CTR + HMAC | 2 KB |
Sources: NOX numbers from Criterion + integration benchmarks (commit db6ea3c, 2026-03-02). Katzenpost numbers from their nightly CI (sphinx_benchmark_test.go, BenchmarkSphinxUnwrap). Loopix from the USENIX Security 2017 paper, Section 7.
Important caveats before anyone screams:
These are cross-hardware, cross-language comparisons. Katzenpost numbers come from their CI benchmarks running on different hardware (likely cloud VMs). Loopix numbers are from the 2017 paper running on AWS EC2. We're comparing Criterion measurements on a 2024 desktop CPU against 2017-era cloud instances and 2024 Go benchmarks. Don't tattoo the ratios on your arm.
That said, the ratios are large enough that hardware differences don't fully explain them. A 50x gap against the Python implementation is largely language overhead. The 4.8x gap against Katzenpost NIKE -- both doing the same cryptographic operations -- reflects three factors:
- Language runtime: Rust (LLVM, zero-cost abstractions, no GC) vs Go (garbage collector pauses, less aggressive optimization). Typical Rust/Go ratios for crypto workloads are 1.5-3x.
- NIKE vs KEM path: Katzenpost's NIKE variant requires blinding factor computation (expensive group exponentiation) that their KEM variant avoids. NOX uses ECDH with blinding, which is similar in cost to the NIKE path, but our Rust implementation is faster at the underlying field arithmetic.
- SPRP overhead: Katzenpost applies a wide-block cipher (SPRP/LIONESS) per hop for decryption oracle resistance. NOX uses standard AES-CTR, which is faster but doesn't provide this property. This is a known security tradeoff documented in our comparison report. The SPRP is a legitimate security feature we lack, not unnecessary overhead.
The most interesting comparison is Katzenpost KEM at 55.7 microseconds. Their KEM path eliminates blinding factors entirely -- instead of computing g^(ab) and then blinding the public key for the next hop, KEM encapsulates a fresh shared secret per hop. This is why their KEM is 2.6x faster than their NIKE (55.7 vs 144.1). It's also close to our Criterion number (30.17 vs 55.7), and the remaining gap is mostly Rust-vs-Go.
Nym? No published numbers. They have the benchmark code -- bench_unwrap in nymtech/sphinx -- but the results have never been made public. We literally cannot compare. The benchmark function exists in their repository. It uses Criterion. They run CI. But no results appear anywhere -- not in docs, not in blog posts, not in issues.
Sphinx Packet Creation: The Client-Side Cost
Per-hop processing is the server-side bottleneck. But there's also a client-side cost: constructing the Sphinx packet in the first place. The sender must compute ECDH with each hop's public key, derive all symmetric keys, encrypt the routing header and payload layer by layer, and construct the MACs.
| Hops | Creation Time (us) | Per-Hop Marginal Cost |
|---|---|---|
| 1 | 76 | 76 |
| 2 | 148 | 72 |
| 3 | 222 | 74 |
Linear scaling confirmed: each additional hop costs about 74 microseconds. For a 3-hop path, the client spends 222 microseconds building the packet. This is negligible compared to proof generation (9-11 seconds) and mixing latency (50-300 milliseconds). Packet construction will never be the bottleneck for DeFi operations.
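The marginal-cost column falls out of successive differences in the table; the average marginal cost for additional hops lands at 73 microseconds, within rounding of the quoted ~74:

```python
# Packet creation times in microseconds, from the table above
creation_us = {1: 76, 2: 148, 3: 222}

# Cost of each additional layer of onion encryption
marginals = [creation_us[h] - creation_us.get(h - 1, 0) for h in (1, 2, 3)]
avg_marginal = (creation_us[3] - creation_us[1]) / 2
print(marginals, avg_marginal)
```

The near-constant marginals are what "linear scaling" means operationally: each extra hop adds one ECDH, one key schedule, and one layer of encryption, and nothing super-linear creeps in.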
Tier 2: Network Performance
Individual operation speed matters, but what happens when you assemble a full network and push packets through it?
Throughput: What It Means and How We Measured It
Throughput in a mixnet is not the same as throughput in a web server. In a web server, you measure requests per second -- how many independent operations the system can handle. In a mixnet, every packet traverses multiple hops, consuming resources at each node. A packet that enters the network touches 3 nodes (for 3-hop paths). So "100 PPS system throughput" means each of the 3 nodes on the packet's path processes that packet -- roughly 300 node-level operations per second for the nodes in the path.
We measured throughput using two architectures:
In-process mode: All nodes run as async tasks within a single OS process. They share CPU cores, memory, and the Tokio runtime. This mode shows the contention-limited ceiling -- what happens when all nodes compete for the same resources.
Multi-process mode: Each node runs as a separate OS process with its own memory space and Tokio runtime. Packets are serialized across process boundaries via TCP. This mode reflects a more realistic deployment where nodes have independent resources, but adds IPC overhead.
In-Process Throughput: The Contention Ceiling
| Target PPS | Achieved PPS | Loss Rate | Mean Latency |
|---|---|---|---|
| 100 | 86 | 0.0% | 113 ms |
| 500 | 142 | 5.5% | 80 ms |
| 1,000 | 178 | 14.4% | 101 ms |
| 2,000 | 211 | 24.5% | 108 ms |
| 5,000 | 234 | 30.1% | 121 ms |
The system saturates at 234 PPS with 5 nodes. Notice the degradation pattern: from 100 to 234 PPS, loss rate climbs from 0% to 30%. This isn't packet corruption -- it's queue overflow. When injection rate exceeds processing capacity, the Poisson mixing queues fill up, and packets arriving at full queues are dropped.
The mean latency actually decreases from 113ms at 100 PPS to 80ms at 500 PPS before climbing again. This is because at low injection rates, most of the latency is mixing delay (waiting in the Poisson queue). At higher rates, packets spend less time in queues because queues drain faster -- but at even higher rates, contention overhead dominates and latency climbs.
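In a Poisson mix, each packet independently waits an exponentially distributed holding delay before departure; that per-packet delay is what dominates latency at low injection rates. A sketch of the mechanism (the mean delay here is illustrative, not NOX's production setting):

```python
import random

def poisson_mix_delays(n_packets: int, mean_delay_ms: float, seed: int = 7):
    """Sample per-packet exponential holding delays, as a Poisson mix does."""
    rng = random.Random(seed)
    # expovariate takes the rate (1/mean); mean of samples converges to mean_delay_ms
    return [rng.expovariate(1.0 / mean_delay_ms) for _ in range(n_packets)]

delays = poisson_mix_delays(10_000, mean_delay_ms=50.0)
empirical_mean = sum(delays) / len(delays)
print(round(empirical_mean, 1))
```

The exponential's memorylessness is what makes the mix analyzable: an observer watching a packet enter learns nothing about when it will leave beyond the rate parameter.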
The 234 PPS ceiling is not a hard architectural limit. It's a contention artifact of running 5 nodes on 5 physical CPU cores within one process. Each node's Sphinx processing is CPU-bound (61.6 microseconds per hop), and when all nodes share cores, the OS scheduler introduces context-switch overhead that doesn't exist in a distributed deployment.
Multi-Process Throughput: The Real Number
| Target PPS | Achieved PPS | Loss Rate | Mean Latency | p99 Latency |
|---|---|---|---|---|
| 50 | 38 | 0.0% | 104 ms | 203 ms |
| 100 | 71 | 0.0% | 190 ms | 285 ms |
| 200 | 127 | 0.0% | 174 ms | 305 ms |
| 500 | 251 | 0.0% | 119 ms | 204 ms |
| 1,000 | 369 | 0.0% | 103 ms | 202 ms |
This is the headline result. 369 PPS at zero loss in multi-process mode with 10 nodes.
The zero-loss result is significant. Multi-process mode has real IPC overhead (packets serialized across process boundaries via TCP), real OS scheduling, and real resource contention between processes. Despite this, every packet gets through. The multi-process architecture is actually faster than in-process (369 vs 234 PPS) because each node gets its own OS process with dedicated CPU scheduling, eliminating the intra-process contention that limits the in-process mode.
Let's put this in perspective. At 369 PPS with 3-hop paths, the network is processing 1,107 hop operations per second across 10 nodes. Each hop operation costs 61.6 microseconds of CPU time. That's 68 milliseconds of CPU time per second -- about 6.8% of one CPU core. We're nowhere near the theoretical per-hop throughput limit (~16,000 hops/sec at 61.6us/hop). The current bottleneck is the Poisson mixing delay (packets wait in queues for their scheduled departure time) and the inter-process TCP transport, not the cryptographic processing.
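The CPU-budget arithmetic behind that claim, using the figures quoted in this section:

```python
pps = 369          # achieved system throughput, packets/sec
hops = 3           # hops per packet
hop_cost_s = 61.6e-6  # integration-measured per-hop cost

hop_ops_per_s = pps * hops               # total hop operations per second
cpu_s_per_s = hop_ops_per_s * hop_cost_s # CPU-seconds consumed per wall-second
ceiling_hops = 1 / hop_cost_s            # single-core per-hop throughput ceiling
print(hop_ops_per_s, round(cpu_s_per_s * 1000, 1), int(ceiling_hops))
```

Roughly 68 ms of crypto work per second against a ~16,000 hops/sec single-core ceiling: the cryptography is idling, and the latency budget is being spent in the mixing queues and the TCP transport.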
Does throughput scale linearly with nodes? We tested this:
| Nodes | Achieved PPS | Per-Node PPS | Mean Latency | Loss Rate |
|---|---|---|---|---|
| 5 | 209 | 41.8 | 170 ms | 9.1% |
| 10 | 217 | 21.7 | 173 ms | 9.1% |
| 25 | 200 | 8.0 | 190 ms | 9.1% |
| 50 | 279 | 5.6 | 127 ms | 9.1% |
Throughput stays stable from 5 to 25 nodes (200-217 PPS), then increases at 50 nodes (279 PPS). The per-node throughput decreases with more nodes because the same total traffic is distributed across more paths -- each individual node handles fewer packets. The 50-node spike (279 PPS) suggests that the wider topology provides enough parallel paths to exceed the contention ceiling seen at smaller scales.
The constant 9.1% loss rate across all configurations is a test harness artifact, not a scaling limitation. It comes from the warmup/cooldown window: packets injected during the first and last seconds of the test may not complete their full path before the test ends. The fact that this rate is identical across all configurations confirms it's a measurement artifact, not congestion.
This is encouraging for production deployment: adding nodes increases route diversity and throughput without degrading latency. The consistent loss rate across configurations confirms that the loss mechanism is test infrastructure, not network congestion. A production deployment with 100+ nodes would likely see similar or better throughput characteristics, with route diversity providing better anonymity as a bonus.
Multi-Process End-to-End: True Delivery Verification
The numbers above use in-process throughput as the saturation metric. But the multi-process results tell an even more compelling story when you look at true end-to-end delivery -- from packet injection through full mixnet traversal to exit-node payload decryption and response delivery back to the sender.
We ran multi-process E2E benchmarks with 10 nodes, 1ms mixing delay, 3 hops, and measured real delivery (not just "packet was forwarded," but "the response came back via the SURB path"):
| Target PPS | Achieved PPS | Success Count | Loss Rate | p50 Latency (ms) | p99 Latency (ms) | Std Dev (ms) |
|---|---|---|---|---|---|---|
| 50 | 37.7 | 380 | 0.0% | 101.6 | 202.8 | 16.3 |
| 100 | 70.6 | 719 | 0.0% | 202.6 | 285.3 | 37.2 |
| 200 | 127.1 | 1,308 | 0.0% | 202.3 | 304.5 | 60.5 |
| 500 | 250.6 | 2,575 | 0.0% | 101.5 | 203.9 | 39.6 |
| 1,000 | 368.8 | 3,798 | 0.0% | 101.5 | 201.7 | 10.8 |
Zero loss at all tested rates. This is the multi-process headline: not a single packet was lost across 8,780 total deliveries. Every Sphinx packet completed its 3-hop forward path, the exit node decrypted the payload, and the response traversed the 3-hop SURB return path successfully.
The latency pattern is interesting. At low injection rates (50 PPS), p50 is ~101ms -- essentially one 100ms polling interval. At medium rates (100-200 PPS), p50 rises to ~202ms (two polling intervals), because queue pressure means packets spend more time waiting at intermediate nodes. But at the highest rates (500-1000 PPS), p50 drops back down to ~101ms. Why? Because at high rates, there are always packets ready to depart at every mixing interval, so the Poisson scheduling waste (time spent waiting for the next packet to process) disappears. The node is never idle, and packets flow through at the maximum rate the Poisson scheduler allows.
The standard deviation tells the same story. At 1,000 PPS target, the std dev is only 10.8ms -- remarkably tight for a system with random mixing delays. At 200 PPS, it's 60.5ms -- much more spread because packets are experiencing variable wait times in partially-filled queues.
Separate latency-focused measurement (500 packets, 50 concurrent, 10 nodes): mean latency of 176ms, p50 of 202ms, p99 of 305ms, with a maximum of 405ms. The p99/p50 ratio of 1.5x indicates predictable tail behavior -- no wild outliers from queue buildup or routing pathologies.
The 50-Node Throughput Jump: What's Going On?
The jump from 200 PPS at 25 nodes to 279 PPS at 50 nodes deserves closer examination, because it tells you something about the relationship between topology width and throughput.
With 5 nodes and 3 hops, there are 5 x 5 x 5 = 125 possible paths through the network. Each node participates in roughly 25 of these paths (each node is a possible choice at each layer). So the traffic is distributed across 125 paths, but each node is processing traffic for many paths simultaneously.
With 50 nodes and 3 hops, there are 50 x 50 x 50 = 125,000 possible paths. Each node participates in a smaller fraction of total paths. More importantly: with 50 nodes competing for CPU on a 5-core machine, you'd expect worse performance from contention, not better.
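The counting argument can be verified directly (a sketch of the combinatorics, assuming any node may be chosen independently at each hop):

```python
from itertools import product

def path_stats(n, hops=3):
    """Total 3-hop paths, and how many contain one fixed node at least once,
    assuming any of the n nodes may be chosen independently at each hop."""
    total = n ** hops
    containing = total - (n - 1) ** hops  # complement: paths avoiding the node
    return total, containing, containing / total

# Brute-force cross-check for the small case.
brute = sum(1 for p in product(range(5), repeat=3) if 0 in p)
assert brute == path_stats(5)[1]

print(path_stats(5))    # 5 nodes: 125 paths, 61 touch any given node (~49%)
print(path_stats(50))   # 50 nodes: 125,000 paths, under 6% touch a given node
```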
So why is 50 nodes faster? The answer is in the latency distribution. At 50 nodes, the p50 latency drops to 103ms (from 190ms at 25 nodes) and the max drops to 207ms (from 305ms at 25 nodes). The p90 is 204ms versus 304ms. The entire distribution compresses downward.
This happens because with 50 nodes, each individual node processes fewer packets. The per-node queue depth is lower. Lower queue depth means packets spend less time waiting in the Poisson mixing queue. Less waiting means lower latency. Lower latency means the sender can inject the next packet sooner. Higher injection rate means higher measured PPS.
It's a virtuous cycle: more nodes → lower per-node load → shorter queues → faster drain → higher system throughput. This is exactly the scaling behavior you'd want in a production network. The concerning scenario -- more nodes means more contention means worse performance -- doesn't materialize.
The caveat: our 50-node test uses only 200 packets (fewer than the 1,000+ packets in other tests). With more traffic, the per-node load advantage of more nodes might be partially offset by increased inter-process communication overhead. A proper production scaling test would run 50 nodes under sustained load for minutes, not our brief 200-packet burst. That test is on our roadmap.
Concurrency Tuning: Finding the Right Parallelism
How many packets should you process concurrently? Too few, and you underutilize the hardware. Too many, and you create contention that degrades throughput. We swept across six concurrency levels at a fixed 200 PPS target rate with 10 nodes:
| Concurrency | Achieved PPS | Mean Latency (ms) | p50 Latency (ms) | p99 Latency (ms) | Packet Loss |
|---|---|---|---|---|---|
| 10 | 61.6 | 128 | 102 | 203 | 0% |
| 25 | 67.5 | 290 | 304 | 305 | 0% |
| 50 | 127.5 | 176 | 202 | 305 | 0% |
| 100 | 128.1 | 170 | 202 | 304 | 0% |
| 200 | 128.0 | 173 | 202 | 304 | 0% |
| 500 | 128.7 | 175 | 202 | 304 | 0% |
The data tells a clear story. At concurrency 10, we're bottlenecked by parallelism -- only 10 packets can be in-flight simultaneously, limiting throughput to 61.6 PPS despite targeting 200. The jump from 10 to 50 doubles throughput (61.6 → 127.5 PPS). But above 50, additional concurrency provides zero benefit: 100, 200, and 500 all achieve the same ~128 PPS.
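The plateau is consistent with Little's law: sustained throughput cannot exceed the number of in-flight packets divided by mean latency. A quick check against the table's numbers (a sketch, not the harness):

```python
def little_ceiling(concurrency, mean_latency_ms):
    """Little's law (L = lambda * W): max sustainable PPS with C in flight."""
    return concurrency / (mean_latency_ms / 1000.0)

print(round(little_ceiling(10, 128)))  # ~78 PPS bound; measured 61.6
print(round(little_ceiling(50, 176)))  # ~284 PPS bound; measured only 127.5
# At concurrency 50+, measured throughput (~128 PPS) sits far below the
# Little's-law bound, so parallelism is no longer the limit -- the mixing
# pipeline is.
```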
The zero packet loss across all configurations is noteworthy. This is the multi-process architecture proving its worth -- each node in its own OS process with dedicated resources. The in-process benchmark loses 30% of packets at high throughput. The multi-process benchmark loses nothing.
The mean latency shows an interesting anomaly at concurrency 25: 290ms, much higher than at 50 (176ms) or even 10 (128ms). This is because at concurrency 25, enough packets are in-flight to create queue buildup, but not enough parallelism to drain those queues quickly. The queues fill to the point where packets sit for a full Poisson mixing delay cycle before being processed. At concurrency 50+, the higher parallelism drains queues faster, keeping latency down.
Practical recommendation: Concurrency of 50 is the sweet spot for a 10-node deployment. It achieves maximum throughput (128 PPS) without the diminishing returns of higher concurrency settings. The Tokio runtime's task scheduler handles 50 concurrent async tasks efficiently; going to 500 just creates more scheduler overhead without processing more packets.
The Multi-Process Architecture
A brief word on why multi-process mode exists and how it works, because it has implications for deployment architecture.
In production, each mix node runs as a separate OS process (potentially on a separate machine). The multi-process benchmark simulates this by spawning each node as a child process, communicating via TCP sockets that simulate real network links.
The architecture of each node process:
- Ingress server (HTTP): Accepts Sphinx packets from clients via HTTP POST. Validates PoW, checks replay detection, and publishes to the internal event bus.
- Mix engine (Tokio task): Subscribes to incoming packet events. Peels one Sphinx layer, applies Poisson mixing delay, and forwards to the next hop via libp2p (or in the benchmark, via direct TCP).
- Exit handler (Tokio task): For packets where the current node is the final hop, decrypts the payload and processes the request (in DeFi mode: runs contract simulation, generates proof, submits transaction).
- SURB response handler (Tokio task): For SURB responses, packs the response into SURB return packets and publishes them as outgoing packets.
- P2P overlay (libp2p/gossipsub): Maintains mesh connections to other nodes. Handles topology discovery, epoch transitions, and gossip-based key distribution.
Each of these components runs as an async task within a Tokio runtime. The event bus (tokio::sync::broadcast) connects them. This architecture means each node is a single-process, multi-task system with lock-free communication between components.
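The real node wires these tasks together with tokio::sync::broadcast in Rust. As a language-neutral illustration of the fan-out pattern (not NOX code -- a Python asyncio analogue with hypothetical names):

```python
import asyncio

async def main():
    # One queue per subscriber approximates a broadcast channel:
    # every component independently sees every published event.
    names = ["mix_engine", "exit_handler", "surb_handler"]
    queues = {name: asyncio.Queue() for name in names}
    seen = []

    async def publish(event):
        for q in queues.values():
            await q.put(event)

    async def component(name):
        event = await queues[name].get()
        seen.append((name, event))

    tasks = [asyncio.create_task(component(n)) for n in names]
    await publish("incoming-packet")
    await asyncio.gather(*tasks)
    return seen

received = asyncio.run(main())
print(sorted(received))  # all three components observed the same event
```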
The multi-process benchmark validates that this architecture performs well when nodes are resource-isolated. The 369 PPS result with zero loss demonstrates that the inter-process communication overhead (TCP serialization, OS scheduling) doesn't create a bottleneck.
For context: the original Loopix paper reports ">300 messages per second" per relay. Our 369 PPS is total system throughput for a 10-node network, so the comparison isn't direct. But it puts us in the same performance class as the academic reference while running on a single machine with all nodes contending for the same resources. Distribute these nodes across actual data centers and the bottleneck shifts from CPU contention to network latency -- the numbers would look very different.
Neither Nym nor Katzenpost publishes any throughput numbers. Zero. Let that sink in. The two most prominent Sphinx-based mix networks in 2026 have never publicly stated how many packets per second they can process. For Nym -- a production network with 550+ nodes handling real user traffic -- this is extraordinary. They must have internal throughput data; every production system monitors its throughput. They just haven't shared it.
We can estimate Nym's theoretical throughput from first principles: if their per-hop processing is similar to ours (they use the same crypto suite: X25519 + AES-CTR + HMAC, implemented in Rust), it's probably in the 30-100 microsecond range per hop. With 550 nodes and 3-layer stratification, that gives a theoretical system throughput ceiling in the tens of thousands of PPS -- far above any current demand. But theoretical throughput and measured throughput under production conditions with real traffic patterns, real network latency, and real node heterogeneity can be very different things. Our in-process ceiling (234 PPS) vs multi-process ceiling (369 PPS) illustrates how architectural choices affect throughput independent of raw crypto speed.
Latency: Sub-100ms at the 50th Percentile
We measured end-to-end latency across six different mixing delay configurations (5 nodes, 3 hops, 2,000 packets each):
| Mix Delay | p50 (ms) | p90 (ms) | p95 (ms) | p99 (ms) | Std Dev |
|---|---|---|---|---|---|
| 0 ms | 89 | 171 | 208 | 261 | 68 ms |
| 1 ms | 84 | 180 | 211 | 269 | 67 ms |
| 5 ms | 50 | 167 | 208 | 271 | 63 ms |
| 10 ms | 124 | 240 | 260 | 299 | 76 ms |
| 50 ms | 181 | 332 | 388 | 505 | 101 ms |
| 100 ms | 364 | 637 | 745 | 990 | 188 ms |
The 0ms and 1ms configurations behave nearly identically -- at these delays, system overhead dominates. At 50ms per-hop mean delay (3 hops = 150ms expected mixing contribution), the p50 reaches 181ms. At 100ms (300ms expected mixing), the p50 is 364ms.
The Poisson distribution creates a characteristic long right tail. This is by design: the exponential distribution means most packets are released quickly, but some are held much longer. The p99 at 100ms delay is nearly 1 second. An adversary trying to correlate input and output timing has to contend with this unpredictable delay distribution -- any given packet might exit in 100ms or in 900ms, and the adversary can't tell which.
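That tail is reproducible from first principles: the sum of three independent exponential delays is Erlang-distributed, with a p99 far above its median. A Monte Carlo sketch of the pure mixing contribution at 100ms per-hop mean (transport overhead excluded):

```python
import random

random.seed(7)
MEAN_MS = 100.0   # per-hop mean of the Poisson (exponential) mixing delay
HOPS = 3

samples = sorted(
    sum(random.expovariate(1 / MEAN_MS) for _ in range(HOPS))
    for _ in range(20_000)
)
p50 = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
print(round(p50), round(p99))  # Erlang(3, mean 300ms): p50 ~270ms, p99 ~840ms
```

The roughly 3x spread between p50 and p99 is inherent to the exponential draws -- an observer cannot tell from exit timing whether a packet was held briefly or long.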
Latency CDF: 5,000-Packet Distribution
Our most detailed latency measurement: 5,000 packets through a 5-node mixnet at 1ms Poisson delay. The CDF is included as raw data (4,916 individual measurements) in latency_cdf.json for anyone who wants to plot it themselves.
Key percentiles from the distribution:
| Metric | Value |
|---|---|
| Minimum | 3.3 ms |
| p10 | ~7 ms |
| p25 | ~12 ms |
| p50 (median) | 69.7 ms |
| p75 | ~110 ms |
| p90 | 197.8 ms |
| p95 | 215.8 ms |
| p99 | 267.1 ms |
| Maximum | 327.1 ms |
| Mean | 87.2 ms |
| Std Dev | 71.2 ms |
The distribution is not Gaussian -- it's right-skewed with a long tail, characteristic of Poisson mixing. The 3.3ms minimum reflects packets that drew near-zero mixing delays at all three hops. The 327ms maximum reflects packets that drew high delays at multiple hops. This tail behavior is a feature, not a bug: it makes timing analysis harder for adversaries.
The mean (87.2ms) is higher than the median (69.7ms) -- a hallmark of right-skewed distributions. If someone reports only the mean for a Poisson-mixed system, they're overstating the typical user experience. The median is the better summary statistic for interactive latency.
The multimodal structure. Looking at the raw distribution more carefully, there's a visible clustering pattern. About 40% of packets complete in under 50ms (fast mode), 35% between 50-200ms (medium mode), and 25% above 200ms (slow mode). These modes correspond to how many hops drew "short" vs "long" mixing delays from the exponential distribution. A packet that drew short delays at all 3 hops (probability (1 - e^{-λt})^3 for a threshold t, since each exponential draw falls below t with probability 1 - e^{-λt}) ends up in the fast mode. A packet with one long delay lands in the medium mode. A packet with two or three long delays lands in the slow mode.
This multimodal structure has privacy implications that are underappreciated. An adversary who observes exit timing doesn't see a smooth exponential distribution -- they see clusters. If they can classify which "mode" a packet belongs to (fast, medium, slow), they reduce the effective anonymity set by a factor of 2-3. This is why publishing the raw CDF matters: percentile summaries hide the modal structure.
For DeFi applications, the fast mode is good news: 40% of packets arrive in under 50ms, which is fast enough for most interactive operations. The slow mode (25% above 200ms) is the cost of Poisson mixing. You can guarantee sub-200ms delivery for 75% of packets, but the remaining 25% will be slower. If your application requires a hard latency bound (e.g., an auction closing in 500ms), you need to account for the p99 (267ms) rather than the p50 (70ms).
0ms Delay vs 1ms Delay: The CDF Comparison
We ran identical 5,000-packet benchmarks at 0ms and 1ms mixing delay to isolate the delay's effect on the latency distribution:
| Metric | 0ms Delay | 1ms Delay | Difference |
|---|---|---|---|
| Packets | 4,921 | 4,916 | - |
| Min | 3.7 ms | 3.3 ms | -0.4 ms |
| p50 | 98.1 ms | 69.7 ms | -28.4 ms (-29%) |
| p90 | 170.1 ms | 197.8 ms | +27.7 ms (+16%) |
| p95 | 209.3 ms | 215.8 ms | +6.5 ms (+3%) |
| p99 | 259.7 ms | 267.1 ms | +7.4 ms (+3%) |
| Max | 310.3 ms | 327.1 ms | +16.8 ms (+5%) |
| Mean | 91.5 ms | 87.2 ms | -4.3 ms (-5%) |
| Std Dev | 65.7 ms | 71.2 ms | +5.5 ms (+8%) |
| Loss Rate | 2.55% | 2.65% | ~same |
| Throughput | 80.2 PPS | 79.6 PPS | ~same |
Something unexpected: 0ms delay has a higher p50 than 1ms delay (98.1ms vs 69.7ms). How can adding mixing delay decrease median latency?
The answer lies in the queue dynamics. At 0ms delay, packets are forwarded immediately upon arrival. With 50 concurrent senders injecting packets, downstream nodes experience bursty arrivals -- packets arrive in clusters whenever multiple senders happen to inject simultaneously. These bursts cause queuing at intermediate nodes: packets arrive faster than they can be processed, creating a convoy effect where later packets in the burst wait behind earlier ones.
At 1ms delay, the Poisson mixing provides natural jitter. Each hop holds the packet for a random delay (mean 1ms), which spreads the bursts across time. The downstream node sees a smoother arrival pattern, which means less queuing, which means lower median latency. The 1ms delay trades a tiny per-hop cost (1ms expected) for a much larger reduction in queuing delay (28ms at the median).
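The queueing side of this argument can be shown with a toy FIFO server. The jitter here (20ms mean) is deliberately exaggerated to make the effect visible, and the model captures only the queueing delay, not the closed-loop injection feedback of the real benchmark -- all numbers are illustrative:

```python
import random

random.seed(1)
SERVICE_MS = 2.0  # fixed per-packet processing time at the downstream node

def mean_wait(arrival_times):
    """FIFO single server: each packet waits until the server frees up."""
    free_at, waits = 0.0, []
    for t in sorted(arrival_times):
        start = max(t, free_at)
        waits.append(start - t)
        free_at = start + SERVICE_MS
    return sum(waits) / len(waits)

# 50 senders fire simultaneously every 100 ms for one second (bursty),
# versus the same packets each delayed by a small exponential jitter.
bursty = [float(w) for w in range(0, 1000, 100) for _ in range(50)]
jittered = [t + random.expovariate(1 / 20.0) for t in bursty]

print(round(mean_wait(bursty), 1))    # synchronized bursts pile up: 49.0 ms
print(round(mean_wait(jittered), 1))  # spread arrivals queue noticeably less
```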
The p90 and tail tell the opposite story. At 1ms delay, some packets draw unfavorable Poisson delays (3+ ms at each of 3 hops) and end up in the slow tail, pushing p90 from 170ms to 198ms. The total distribution is more spread (std dev 71.2ms vs 65.7ms) because the random mixing delays add variance.
The practical implication: 1ms mixing delay is strictly better than 0ms for DeFi applications. It reduces median latency (better typical UX), has the same throughput and loss rate, and costs only a small increase in tail latency. The "free" timing jitter is a bonus for privacy, even if 1ms alone isn't enough for strong timing unlinkability. There is no reason to run a production mixnet at 0ms delay.
Backpressure and Queue Depth
One thing our top-level throughput and latency numbers don't show is what happens under the hood when the system approaches saturation. The Poisson mixing model requires each node to maintain a queue of pending packets, each with a scheduled departure time drawn from an exponential distribution. When injection rate exceeds departure rate, these queues grow.
We instrumented queue depth during our throughput sweep (in-process mode, 5 nodes, 10 seconds per injection rate):
At 100 PPS (well below saturation): queue depth stays near zero. Packets arrive, get a short Poisson delay, and depart before the next packet arrives. The system is bored.
At 500 PPS (above 234 PPS saturation ceiling): queue depth grows linearly. Each node's queue accumulates ~30 packets per second. The 5.5% loss rate is caused by queue overflow when depth exceeds the configured maximum (1,024 packets). The backpressure mechanism kicks in: when a node's queue is full, arriving packets are silently dropped. This is the correct behavior for a mixnet -- you don't want unbounded queue growth consuming memory, and you can't apply TCP-style backpressure to Sphinx packets (there's no sender to notify, by design).
At 5,000 PPS (extreme overload): queue depth hits the maximum almost immediately and stays there. The 30.1% loss rate reflects the fraction of packets that arrive at full queues. Interestingly, the system doesn't crash or deadlock -- it gracefully degrades by dropping excess packets while continuing to process and deliver the packets it accepts.
The queue depth dynamics explain why our multi-process throughput (369 PPS) exceeds in-process (234 PPS). In multi-process mode, each node has its own CPU allocation, so queues drain faster. The departure rate per node is higher, which means queues stay shorter, which means fewer drops, which means higher effective throughput. It's not that multi-process is architecturally superior for cryptographic operations -- it's that resource isolation prevents the contention-induced queue buildup that limits in-process mode.
Latency vs Mixing Delay: The Privacy-Performance Tradeoff
The mixing delay is the single most important tuning parameter in a mixnet. Higher delays mean packets are held longer at each node, making timing correlation harder -- but they directly increase end-to-end latency. We swept six delay configurations to map this tradeoff precisely (5 nodes, 2,000 packets, 3 hops, concurrency 50):
| Mix Delay (ms) | p50 Latency (ms) | p90 Latency (ms) | p99 Latency (ms) | Mean (ms) | Throughput (PPS) | Loss Rate |
|---|---|---|---|---|---|---|
| 0 | 89 | 171 | 261 | 88 | 31.8 | 6.6% |
| 1 | 84 | 180 | 269 | 88 | 65.7 | 4.4% |
| 5 | 50 | 167 | 271 | 74 | 66.1 | 4.4% |
| 10 | 124 | 240 | 299 | 132 | 32.2 | 5.5% |
| 50 | 181 | 332 | 505 | 199 | 237.7 | 2.4% |
| 100 | 364 | 637 | 990 | 390 | 123.3 | 2.4% |
Several patterns emerge:
Latency scales sub-linearly with delay at low values, linearly at high values. Going from 0ms to 5ms delay doesn't raise p50 at all -- it actually falls from 89ms to 50ms, because the 5ms delay gives the system a natural pacing mechanism that reduces queue contention. Going from 50ms to 100ms delay roughly doubles p50 (181ms to 364ms), which is what you'd expect from a 3-hop path where each hop adds the configured delay.
The 0ms delay anomaly. At zero delay, throughput drops to 31.8 PPS and loss rate jumps to 6.6%. This seems paradoxical -- no mixing delay should mean faster processing, right? But with zero delay, packets are forwarded as fast as they're processed, creating burst patterns that overwhelm downstream nodes. The 1ms delay acts as a natural rate limiter, smoothing traffic and actually improving throughput (65.7 PPS) and loss rate (4.4%). This is Poisson mixing working as designed: the random delay doesn't just provide privacy, it provides flow control.
The 50ms inflection point. At 50ms delay, throughput spikes to 237.7 PPS while loss rate drops to 2.4%. This is the "sweet spot" where the Poisson delays provide enough buffering that the system operates like a well-tuned pipeline: packets arrive, queue briefly, and depart in a steady stream. The longer per-hop delay actually allows the system to accept more total traffic because individual hops have time to drain their queues between departures.
The p99 explosion at 100ms. At 100ms delay, p99 reaches 990ms -- nearly a full second. For DeFi applications where users expect sub-2-second responsiveness, this means 100ms per-hop delay is the practical upper limit. At 50ms, p99 is 505ms -- aggressive but still within a "private swap" UX budget. At 10ms, p99 is 299ms -- comfortable for all interactive DeFi operations.
Our production recommendation: 10ms mixing delay provides meaningful timing obfuscation (the timing correlation weakens markedly from the 0.9999 Pearson coefficient measured at 1ms) while keeping p99 under 300ms. For time-critical operations (MEV protection, auction sniping), a fast lane with 1ms delay provides sub-270ms p99 at the cost of weaker timing privacy.
Comparison with Tor
How does this compare to Tor? Tor Metrics (OnionPerf, February 2026) reports circuit RTTs from three vantage points:
| Location | p25 | p50 | p75 | Max |
|---|---|---|---|---|
| Germany (EU) | 62 ms | 85 ms | 132 ms | 232 ms |
| United States | 236 ms | 260 ms | 299 ms | 395 ms |
| Hong Kong | 450 ms | 510 ms | 590 ms | 803 ms |
NOX's p50 of 69.7ms (one-way, 1ms mixing delay) is competitive with Tor EU at 85ms (round-trip). But this comparison requires heavy caveats: NOX is measured on localhost, while Tor traverses real internet links across continents. NOX's SURB RTT of 228ms is a fairer comparison point for round-trip scenarios -- still faster than Tor US (260ms) and less than half of Tor HK (510ms).
The architectural difference matters more than the raw numbers: NOX provides Poisson mixing, cover traffic between nodes, and sender-receiver unlinkability. Tor provides none of these. The latency comparison is useful for user experience assessment, not for security comparison. They're solving different problems at different security levels.
SURB Round-Trip Times
SURB (Single-Use Reply Block) round-trips are the critical metric for interactive applications. A SURB allows a recipient to respond without knowing the sender's network address -- the response packet follows a pre-built return path embedded in the original message. But SURBs double the hop count: 3 hops forward, 3 hops back.
| Percentile | SURB RTT (ms) |
|---|---|
| p50 | 228 |
| p90 | 381 |
| p95 | 412 |
| p99 | 470 |
The SURB RTT is approximately 2.7x the one-way forward latency, reflecting 6 total hops (3 forward + 3 return) plus exit node processing time. You'd expect exactly 2x if forward and return paths had identical characteristics, but the SURB path includes additional overhead: the exit node must process the request, generate a response, and pack it into the SURB return packet.
The SURB loss rate was 12.7%. One in eight SURB responses never arrives. This is the problem that motivated our FEC system (Tier 6).
Loopix reports "seconds" for overall message latency. NOX demonstrates that with a modern Rust implementation and careful engineering, Loopix-class privacy can be achieved at sub-100ms p50 latency. The improvement -- well over an order of magnitude -- is not from algorithmic innovation; we implement the same Loopix mixing strategy. It's from implementation quality: compiled Rust vs interpreted Python, optimized Sphinx vs reference implementation, async I/O vs blocking sockets.
SURB Round-Trip With FEC: The Latency Improvement
Our FEC system doesn't just improve reliability -- it also improves latency. This seems paradoxical (FEC sends more packets, which should take more time), but the mechanism is subtle.
We compared SURB round-trip times with and without FEC (5 nodes, 100 packets, 1ms delay, 30% FEC ratio):
| Metric | Without FEC | With FEC (30%) | Improvement |
|---|---|---|---|
| p50 RTT | 329 ms | 264 ms | 1.25x faster |
| p90 RTT | 460 ms | 373 ms | 1.23x faster |
| p95 RTT | 480 ms | 409 ms | 1.17x faster |
| p99 RTT | 513 ms | 422 ms | 1.22x faster |
| Mean RTT | 329 ms | 258 ms | 1.28x faster |
| Std Dev | 101 ms | 89 ms | Less variance |
| Duration | 3.76 s | 3.06 s | 1.23x faster |
FEC makes SURBs 1.2-1.3x faster at every percentile. How?
The key is that with FEC, the client doesn't need to wait for all fragments -- it just needs any K of N. Without FEC, the response consists of exactly 1 SURB packet, and the client waits for that one packet. If it's delayed (drew a high Poisson delay at one of its 3 return hops), there's nothing to do but wait. With FEC, the response is split into multiple shards sent via separate SURBs on independent return paths. The client completes as soon as K shards arrive -- it effectively takes the K-th order statistic of N independent path latencies.
Think of it like taking multiple taxis from the same origin to the same destination. You arrive as soon as the first taxi gets there, not the last. With FEC, you arrive as soon as enough taxis (K of N) arrive. The more taxis you dispatch, the lower the minimum arrival time tends to be.
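The taxi analogy is an order statistic: completion time with FEC is the K-th smallest of N independent path latencies. A sketch with hypothetical parameters (3-of-6 shards, exponential path delays) -- chosen for illustration, not the NOX shard configuration:

```python
import random
import statistics

random.seed(42)
MEAN_MS = 300.0  # illustrative single-path SURB round-trip mean

def path_rtt():
    return random.expovariate(1 / MEAN_MS)

def kofn_completion(k, n):
    """Time until any k of n parallel shards have arrived."""
    return sorted(path_rtt() for _ in range(n))[k - 1]

single = statistics.median(path_rtt() for _ in range(5000))
fec = statistics.median(kofn_completion(3, 6) for _ in range(5000))
print(round(single), round(fec))  # the 3-of-6 completion beats the lone packet
```

The size of the win depends on the real delay distribution and on K/N; the measured NOX improvement reflects its own latency distribution and 30% FEC ratio.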
The FEC encode/decode overhead is negligible:
| Operation | Without FEC | With FEC | Overhead |
|---|---|---|---|
| Encode | 27 us p50 | 49 us p50 | +22 us |
| Decode | 213 us p50 | 431 us p50 | +218 us |
The extra 240 microseconds of FEC codec overhead is invisible compared to the 65ms of latency savings. FEC pays for itself multiple times over in latency alone, before you even consider the reliability improvement.
This is a result we haven't seen discussed in the mixnet literature. FEC is typically motivated by reliability ("what if packets get lost?"), but the latency benefit from parallel path diversity may be equally important for interactive applications like DeFi.
The Nym Latency Gap
This section will frustrate you, because it frustrated us.
Nym is the closest architectural comparison to NOX: Loopix-derived, Sphinx-based, stratified topology, Poisson mixing, production deployment. If any project should have published latency data, it's Nym. They run 550+ nodes across 64 countries. They have real-world latency data from their production network. They just... don't publish it.
The closest thing to an official Nym latency number comes from a 2023 talk by Halpin at DEF CON, where he mentioned "about 500ms" for a message traversing the network. But this wasn't from a systematic benchmark -- it was an off-the-cuff remark during a presentation. No methodology, no percentiles, no mixing delay configuration, no packet count.
There's one external measurement worth noting. Researchers from COSIC (KU Leuven) measured Nym's production network as part of the LAMP project (Lower-Latency Anonymity from Mix Packet formats) in 2024. They reported approximately 20ms per-hop latency contribution. With 3 hops, that's ~60ms one-way -- which, if accurate, would be comparable to our 69.7ms p50 at 1ms mixing delay. But the LAMP paper focuses on packet format comparisons, not Nym-specific latency analysis, and the measurement methodology is described only briefly.
We mention this not to attack Nym -- they're building a real production network, which is hard -- but to illustrate the broader point: if even the most mature mixnet implementation doesn't publish systematic latency data, how can the community compare approaches? How can users make informed decisions about which privacy tool fits their latency requirements?
Operational Metrics
Before moving to anonymity, a quick note on operational characteristics that affect deployment cost:
| Metric | Value | Context |
|---|---|---|
| Startup time | 8 ms | 3 nodes, parallel spawn |
| Startup time | 10 ms | 5 nodes |
| Startup time | 38 ms | 10 nodes |
| Memory (idle) | 40 MB | 3 nodes |
| Memory (idle) | 76 MB | 5 nodes |
| Memory (idle) | 136 MB | 10 nodes |
| Memory (100 PPS load) | 141 MB | 10 nodes |
| Memory (1,000 PPS load) | 170 MB | 10 nodes |
| Memory (5,000 PPS load) | 295 MB | 10 nodes |
| Disk writes | 15.7K ops/sec | sled, 256B values |
| Disk reads | 18.1K ops/sec | sled, 256B values |
| Mesh join time | 0.1 ms | All configurations |
| Failure recovery | 3,000 ms | gossipsub heartbeat + topology TTL |
Memory usage scales linearly with node count (~13 MB per idle node) and increases sub-linearly with load (136 MB idle to 295 MB at 5,000 PPS for 10 nodes). The 3-second failure recovery time is dominated by gossipsub's 1-second heartbeat interval and the 2-second topology TTL expiry.
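The ~13 MB/node figure can be checked by fitting a line through the three idle measurements (a sketch using the table's values):

```python
# Least-squares slope through the idle-memory measurements.
nodes = [3, 5, 10]
mem_mb = [40, 76, 136]

n = len(nodes)
mean_x = sum(nodes) / n
mean_y = sum(mem_mb) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(nodes, mem_mb))
         / sum((x - mean_x) ** 2 for x in nodes))
intercept = mean_y - slope * mean_x

print(round(slope, 1), round(intercept, 1))  # ~13.4 MB/node plus a small base
print(round(slope * 100 + intercept))        # ~1.3 GB idle for 100 nodes
```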
The practical implication: a NOX node runs comfortably on a 1-vCPU, 2GB RAM VPS ($5-10/month from most cloud providers). The minimum viable relayer infrastructure is remarkably cheap. The expensive part is not the hardware -- it's the gas for on-chain transactions.
The failure recovery time (3 seconds) deserves discussion. When a node goes offline, the gossipsub mesh detects the failure via missed heartbeats (1-second interval). The topology layer then evicts the dead node after the topology TTL expires (2 seconds). Other nodes reroute traffic around the dead node. For packets already in transit to the dead node, they're lost -- there's no retry at the network layer (retries would compromise anonymity by revealing which packets were destined for which nodes). The 12.7% SURB loss rate partly reflects this: some losses are caused by transient node failures during the benchmark, not just queue overflow.
A 3-second recovery time is fast enough for DeFi operations (proof generation takes 9-11 seconds anyway), but slow enough that rapid node oscillation (up/down/up/down) would cause packet loss. In production, nodes should be deployed on reliable infrastructure with 99.9%+ uptime. A node with 99% uptime (15 minutes of downtime per day) would cause approximately 1% additional packet loss for packets routed through it -- manageable with FEC, but not ideal.
Memory Scaling: What Does a 100-Node Production Network Cost?
The memory data above lets us extrapolate. At idle, memory scales at ~13 MB per node (linearly). Under load, the per-node overhead increases but the growth is sub-linear because shared data structures (routing tables, peer databases) amortize across nodes:
| Scenario | Per-Node Memory | Total (100 nodes) | Cloud Cost Estimate |
|---|---|---|---|
| Idle | ~13.6 MB | ~1.4 GB | 1x 2GB VPS ($5/mo) |
| 100 PPS per node | ~14.1 MB | ~1.4 GB | 1x 2GB VPS ($5/mo) |
| 1,000 PPS per node | ~17.0 MB | ~1.7 GB | 1x 2GB VPS ($5/mo) |
| 5,000 PPS per node | ~29.5 MB | ~2.9 GB | 1x 4GB VPS ($10/mo) |
Even at extreme load (5,000 PPS per node, which is 500,000 total PPS across 100 nodes -- roughly 10x more traffic than Nym's entire network handles today), the memory footprint fits in a single $10/month VPS. The memory is not the problem. The CPU is not the problem. The network bandwidth is not the problem. The problem is gas, which is why the economics section matters more than the systems engineering section.
The disk I/O numbers (15.7K writes/sec, 18.1K reads/sec) are for sled, our embedded key-value store used for replay tag persistence and state recovery. At 369 PPS throughput, each packet generates ~3 sled writes (replay tag insert, routing state, delivery confirmation). That's 1,107 writes/sec -- about 7% of sled's capacity. Disk is not a bottleneck at any realistic throughput level.
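The write-budget arithmetic generalizes to other throughput levels. A quick sanity check (the helper name and defaults are ours for illustration, not our codebase):

```python
def sled_write_budget(pps, writes_per_packet=3, capacity_wps=15_700):
    """Fraction of the measured sled write capacity a given throughput consumes.

    writes_per_packet: replay tag insert + routing state + delivery confirmation.
    capacity_wps: our measured sled write throughput (15.7K writes/sec).
    """
    wps = pps * writes_per_packet
    return wps, wps / capacity_wps

# At our measured 369 PPS mixnet throughput:
wps, frac = sled_write_budget(369)   # 1,107 writes/sec, ~7% of capacity
```

Even a 10x throughput jump would consume only about 70% of the measured write capacity, which is why disk never shows up as a bottleneck in our data.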
Tier 3: Anonymity Metrics
Shannon Entropy: Route Diversity Beats Mixing Delays
This was genuinely surprising to us. It might be a novel finding. We've searched the mixnet literature and haven't found it stated explicitly elsewhere.
We measured Shannon entropy -- the standard information-theoretic metric for anonymity set quality -- across nine mixing delay configurations (10 nodes, 1,000 packets each):
| Mix Delay | Shannon Entropy (bits) | Normalized (H/H_max) | Effective Anonymity Set |
|---|---|---|---|
| 0 ms | 3.246 | 97.7% | 9.48 / 10 |
| 0.5 ms | 3.255 | 98.0% | 9.55 / 10 |
| 1 ms | 3.219 | 96.9% | 9.31 / 10 |
| 2 ms | 3.232 | 97.3% | 9.40 / 10 |
| 5 ms | 3.234 | 97.4% | 9.41 / 10 |
| 10 ms | 3.226 | 97.1% | 9.35 / 10 |
| 20 ms | 3.235 | 97.4% | 9.42 / 10 |
| 50 ms | 3.209 | 96.6% | 9.25 / 10 |
| 100 ms | 3.233 | 97.3% | 9.40 / 10 |
Maximum possible entropy for 10 senders is 3.322 bits (log2(10)). NOX achieves 96.6% to 98.0% of this maximum across all delay settings.
Read that again. The entropy barely changes between 0ms and 100ms delay. At 0ms -- no mixing delay at all, packets are forwarded immediately -- the entropy is 3.246 bits (97.7% of maximum). At 100ms per-hop delay (300ms of total mixing contribution), entropy is 3.233 bits (97.3% of maximum). That's less entropy at 100ms than at 0ms. The difference is within noise.
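The entropy computation behind these tables is simple enough to sketch. A minimal stand-in (plain Python; the `counts` vector is hypothetical, not our benchmark data):

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (bits) of an empirical sender distribution."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def anonymity_metrics(counts):
    """Entropy, normalized entropy (H / H_max), and effective anonymity set (2^H)."""
    h = shannon_entropy(counts)
    h_max = math.log2(len(counts))   # uniform over all senders
    return h, h / h_max, 2 ** h

# 10 senders, 1,000 packets: a slightly uneven empirical distribution
counts = [108, 97, 102, 95, 104, 99, 101, 98, 103, 93]
h, norm, eff = anonymity_metrics(counts)
```

The effective anonymity set is 2^H, which is how a raw entropy like 3.246 bits becomes the "9.48 / 10" figures in the table.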
The Novel Route Diversity Finding
Here's what we think is happening, and why we believe this is a novel observation:
In a stratified topology with uniform route selection, the route diversity itself provides strong anonymity. With 10 nodes per layer and 3 layers, there are 10 x 10 x 10 = 1,000 equally likely paths through the network. Even without any timing perturbation -- at 0ms mixing delay, where packets are forwarded as soon as they arrive -- an observer can't determine which path a particular packet took. The packet exiting node 7 at layer 3 could have come from any of 10 nodes at layer 2, and before that from any of 10 nodes at layer 1. The combinatorial explosion of possible paths is what creates the anonymity, not the temporal reordering.
Think of it this way. Mixing delay works by making the timing of outputs unpredictable. Route diversity works by making the path of outputs unpredictable. In a sufficiently wide stratified topology, the path uncertainty alone provides near-maximum entropy. Mixing delay is icing, not the cake.
Why hasn't this been noted before? Possibly because most mixnet research focuses on theoretical analysis with abstract mixing functions, not empirical measurement of actual stratified topologies. The Loopix paper analyzes mixing in terms of Poisson processes and derives theoretical bounds on unlinkability as a function of mixing delay -- the analysis implicitly assumes that without mixing delay, the system provides no anonymity. But that analysis treats the route selection as fixed (the adversary knows the path). In practice, with uniform random route selection across a wide stratified topology, the adversary doesn't know the path, and the combinatorial path diversity provides most of the entropy.
We want to be careful about overclaiming here. This result is from a 10-node simulation on localhost. Several caveats apply:
- 10 nodes is small. With 10 nodes per layer, there are 1,000 paths. With 100 nodes per layer, there would be 1,000,000 paths. The entropy would be even more dominated by route diversity at larger scales.
- Localhost eliminates geographic correlation. In a real deployment, nodes in the same datacenter would have correlated timing characteristics. An adversary who knows the datacenter topology might use geographic timing to narrow the path set, reducing the effective route diversity.
- This is entropy, not unlinkability. Shannon entropy measures the uniformity of the output distribution. It doesn't measure timing correlation. More on this below.
- The LLMix challenge applies. Recent work by Mavroudis & Elahi (2025) shows that entropy metrics can miss cumulative information leakage that ML-based adversaries can exploit. Our high entropy numbers don't mean an ML adversary couldn't deanonymize traffic over time.
But the core insight stands: in stratified topologies, route diversity is the primary source of entropy, and mixing delay is a secondary contributor. If this holds at larger scales and with geographic diversity (which we plan to test), it has significant implications for system design: you can use much shorter mixing delays (improving latency for DeFi applications) without significantly degrading anonymity, as long as your topology provides sufficient route diversity.
Practical implication: You can run the mixnet with very low delays (great for DeFi latency requirements) without significantly sacrificing anonymity, as long as your topology has enough route diversity. Topology eats mixing delay for breakfast.
What Route Diversity Means for DeFi System Design
If this finding holds at scale (and we caveat heavily that it needs larger-scale validation), it changes the design calculus for privacy-preserving DeFi:
Traditional design assumption: "We need X milliseconds of mixing delay for Y bits of anonymity." This leads to a latency budget that limits which DeFi operations are feasible through a mixnet. If you need 100ms per-hop delay for adequate anonymity, that's 300ms one-way, 600ms round-trip, plus proof generation and chain confirmation. Certain time-sensitive operations (MEV protection, auction participation) become difficult.
Route-diversity-informed design: "We need Z nodes per layer for Y bits of anonymity." This decouples anonymity from latency. You achieve privacy through topology width (more nodes per layer = more possible paths) rather than temporal mixing (holding packets longer). The mixing delay can be tuned independently for timing-unlinkability without being the primary anonymity mechanism.
Concretely: a 100-node-per-layer stratified topology with 3 layers provides 1,000,000 equally likely paths. Even at 0ms mixing delay, an adversary observing only inputs and outputs would have entropy of log2(1,000,000) = ~20 bits -- far more than needed. The mixing delay would only need to be high enough to prevent timing correlation (our unlinkability data suggests 50ms per hop suffices), not to provide the bulk of the anonymity.
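The route-diversity arithmetic is a one-liner; a sketch with hypothetical layer widths:

```python
import math

def path_entropy_bits(nodes_per_layer, layers=3):
    """Path count and entropy (bits) of a uniformly chosen path
    through a stratified topology with `layers` layers."""
    paths = nodes_per_layer ** layers
    return paths, math.log2(paths)

# 10 nodes/layer -> 1,000 paths (~10 bits of path uncertainty)
# 100 nodes/layer -> 1,000,000 paths (~20 bits)
for width in (10, 100):
    paths, bits = path_entropy_bits(width)
```

Note this is path entropy under the assumption of uniform route selection; the sender anonymity set is still bounded by the number of concurrent senders.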
This is, to be clear, a hypothesis based on our 10-node measurement. The jump from 10 nodes to 100 nodes per layer involves assumptions about uniform route selection, node reliability, and adversary capabilities that we haven't validated. But the direction is exciting: topology is the easy part to scale (just add nodes), while mixing delay is the hard part to add (it directly costs latency).
Entropy Scaling With Anonymity Set Size
A natural question: how does entropy scale as more users join the network? If route diversity provides most of the anonymity, does adding users help, or is it all topology?
We tested this with 15 nodes, 1,000 packets, 1ms delay, and varying the number of concurrent senders:
| Concurrent Users | Shannon Entropy (bits) | Max Entropy (bits) | Normalized | Effective Anonymity Set | Delivery Rate |
|---|---|---|---|---|---|
| 2 | 0.983 | 1.000 | 98.3% | 1.98 / 2 | 85.3% |
| 3 | 1.563 | 1.585 | 98.6% | 2.95 / 3 | 87.1% |
| 5 | 2.279 | 2.322 | 98.1% | 4.85 / 5 | 88.0% |
| 8 | 2.908 | 3.000 | 96.9% | 7.51 / 8 | 86.6% |
| 10 | 3.234 | 3.322 | 97.3% | 9.41 / 10 | 87.4% |
| 15 | 3.715 | 3.907 | 95.1% | 13.13 / 15 | 86.8% |
Several observations:
Normalized entropy stays above 95% across all user counts. Even with just 2 users, the system achieves 98.3% of maximum entropy. This is remarkably high -- with only 2 possible senders, the system is nearly as uncertain as theoretically possible about which one sent a given packet.
Effective anonymity set tracks closely with actual user count. At 10 users, the effective anonymity set is 9.41 (out of 10 theoretical maximum). At 15 users, it's 13.13 (out of 15). The "anonymity efficiency" (effective / actual) is consistently 87-98%.
Delivery rate is stable. Whether there are 2 or 15 concurrent users, the delivery rate hovers around 85-88%. More users don't cause more packet loss -- the mixing infrastructure handles the increased traffic gracefully.
The slight normalized entropy decrease at 15 users (95.1% vs 98.6% at 3 users) is likely due to a subtle effect: with more users, the random route selection is more likely to create "hot paths" where multiple users' packets traverse the same nodes. These shared paths create correlations that slightly reduce entropy from the theoretical maximum. The effect is small (3.5 percentage points) but consistent across our measurements.
The practical takeaway: the system provides near-maximum anonymity efficiency at all tested user counts. The absolute anonymity (in bits) grows logarithmically with users (as expected -- max entropy is log2(N)), but the system achieves nearly all of the available anonymity at every scale. Adding users helps because it increases the theoretical maximum, and the system reliably captures 95%+ of that maximum.
Traffic Rate Independence: A Surprising Stability Result
We measured entropy across seven different traffic injection rates (10 nodes, 1,000 packets, 1ms delay):
| Target PPS | Achieved PPS | Shannon Entropy (bits) | Normalized | Mean Latency (ms) |
|---|---|---|---|---|
| 5 | 3.5 | 3.247 | 97.7% | 175 |
| 10 | 6.2 | 3.244 | 97.7% | 171 |
| 25 | 11.7 | 3.252 | 97.9% | 165 |
| 50 | 16.2 | 3.240 | 97.5% | 164 |
| 100 | 19.9 | 3.250 | 97.8% | 171 |
| 200 | 23.2 | 3.228 | 97.2% | 169 |
| 500 | 25.7 | 3.240 | 97.5% | 154 |
Entropy is essentially constant across a 100x range of traffic rates (3.228 to 3.252 bits, all within 97.2-97.9% of maximum). Whether 3.5 or 25.7 packets per second are flowing through the network, the anonymity properties are unchanged.
This is significant for DeFi applications where traffic is inherently bursty. A market event might cause a surge from 5 PPS to 200 PPS. Our data shows this surge doesn't degrade anonymity -- the stratified topology provides consistent route diversity regardless of load.
Notice also that the achieved PPS saturates well below the target at high rates. At target 500 PPS, the system achieves only 25.7 PPS. This is the in-process contention ceiling manifesting (with 10 nodes sharing CPU cores). But the privacy properties don't care about the gap between target and achieved -- they depend on the topology, not the throughput.
Mean latency actually decreases slightly at higher traffic rates (175ms at 5 PPS → 154ms at 500 PPS achieved). At low traffic rates, packets spend more time in the Poisson queue waiting for their scheduled departure. At higher rates, the queue drain rate is faster, and packets spend less time waiting. This is a nice property: higher load doesn't mean worse latency, at least until you hit the saturation point.
Combined Anonymity: Where ZK-UTXOs Meet Mixnets
Here's a question nobody else in the mixnet literature is asking, because nobody else is building a mixnet bolted to a ZK-UTXO system: what is the combined anonymity of the two layers?
In Xythum, privacy comes from two independent mechanisms. The UTXO pool provides on-chain anonymity -- an observer cannot determine which UTXO is being spent because the ZK proof reveals nothing about the input note. The mixnet provides metadata anonymity -- a network observer cannot link the transaction sender to their IP address. These are orthogonal dimensions of privacy, and we wanted to know how they combine.
We modeled three scenarios (15 nodes, 500 packets, 3 hops, 1ms delay):
Independent combination (theoretical maximum): if the UTXO anonymity set and mixnet anonymity set are fully independent, the combined entropy is their sum. With a 10,000-UTXO pool (13.3 bits) and 10-node mixnet (3.14 bits), you get 16.4 bits -- an effective anonymity set of ~88,000.
Correlated combination (pessimistic floor): if an adversary can fully correlate the two layers (e.g., by timing when a UTXO is spent and when a mixnet packet appears), the combined entropy equals just the mixnet entropy. With 10 nodes, that's 3.14 bits -- an effective anonymity set of ~8.8.
Partial correlation (realistic): our measured results with a 0.1 correlation factor.
| UTXO Pool Size | Mixnet Nodes | H_UTXO (bits) | H_Mixnet (bits) | H_Combined Independent | H_Combined Partial | Effective Set (Independent) | Effective Set (Correlated) |
|---|---|---|---|---|---|---|---|
| 100 | 3 | 6.64 | 1.52 | 8.17 | 7.17 | 288 | 2.9 |
| 1,000 | 5 | 9.97 | 2.29 | 12.26 | 11.26 | 4,893 | 4.9 |
| 10,000 | 10 | 13.29 | 3.14 | 16.43 | 15.43 | 88,151 | 8.8 |
| 100,000 | 10 | 16.61 | 3.14 | 19.75 | 18.75 | 881,507 | 8.8 |
| 1,000,000 | 15 | 19.93 | 3.51 | 23.44 | 22.44 | 11,360,652 | 11.4 |
The numbers reveal something important: the correlated case is terrible. If an adversary can correlate the on-chain and network layers, the million-UTXO pool is worthless -- the effective anonymity set drops to 11.4. This is the entire argument for why a mixnet is necessary for DeFi: without metadata privacy, the on-chain anonymity set is a fiction. An adversary who can link "this IP submitted this transaction" reduces the million-user anonymity set to the mixnet anonymity set.
Conversely, the independent case is extraordinary. A 10,000-UTXO pool combined with a 10-node mixnet provides an effective anonymity set of 88,000 -- larger than either layer alone. This is the multiplier effect: privacy layers that protect different dimensions compound rather than overlap.
The practical implication for Xythum is that mixnet quality directly multiplies the value of the UTXO pool. Every additional bit of mixnet entropy doubles the effective combined anonymity set: improving the mixnet from 3 to 10 nodes (1.52 to 3.14 bits) multiplies the combined independent set by 2^1.62 ≈ 3x at any fixed pool size. (The 288 → 88,151 jump between those table rows also reflects a 100x larger UTXO pool, so don't read it as a pure mixnet effect.) This is why we obsess over fractional improvements in route diversity: they have outsized effects on the total system.
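The independent and fully-correlated bounds in the table reduce to a couple of lines of arithmetic. A sketch (the partial-correlation model with the 0.1 factor is not reproduced here):

```python
import math

def combined_anonymity(utxo_pool, mixnet_entropy_bits):
    """Bounds on combined anonymity across the two privacy layers.

    Independent: entropies add, effective sets multiply.
    Fully correlated: the on-chain set collapses to the mixnet set.
    """
    h_utxo = math.log2(utxo_pool)
    h_independent = h_utxo + mixnet_entropy_bits
    return h_independent, 2 ** h_independent, 2 ** mixnet_entropy_bits

# 10,000-UTXO pool (13.29 bits) + 10-node mixnet (3.14 bits)
h_ind, eff_independent, eff_correlated = combined_anonymity(10_000, 3.14)
```

The gap between the two returned effective sets (~88,000 vs ~8.8) is the entire argument of this section in one function call.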
Mixnet normalized entropy decreases slightly at larger topologies (96.2% at 3 nodes, 98.6% at 5 nodes, 94.5% at 10 nodes, 89.7% at 15 nodes). This is the "hot path" effect we noted in the entropy scaling section: with more nodes, random route selection creates more uneven path distributions. The absolute entropy still increases (1.52 to 3.51 bits), but the efficiency relative to theoretical maximum decreases. This suggests an optimal topology size exists -- probably in the 8-12 nodes per layer range where you get most of the route diversity benefit without the hot-path penalty.
Unlinkability: KS Test p-values
Entropy tells you about the anonymity set size. Unlinkability tells you whether an adversary can statistically distinguish real traffic from uniform random traffic. These are different properties.
We measured unlinkability using the Kolmogorov-Smirnov (KS) test: the null hypothesis is that packet timing is indistinguishable from uniform distribution. A high p-value means the adversary cannot distinguish real from random. A p-value below 0.05 means the traffic is statistically distinguishable at the 95% confidence level.
| Mix Delay | KS Statistic | KS p-value | Interpretation |
|---|---|---|---|
| 0.5 ms | 0.217 | 6.5e-81 | Trivially distinguishable |
| 1 ms | 0.210 | 2.7e-75 | Trivially distinguishable |
| 5 ms | 0.195 | 1.4e-65 | Trivially distinguishable |
| 10 ms | 0.215 | 8.4e-80 | Trivially distinguishable |
| 50 ms | 0.016 | 0.706 | Strong unlinkability |
At 50ms mixing delay, the KS p-value is 0.706 -- the adversary cannot distinguish real traffic from uniform random at any reasonable significance level. Below 50ms, traffic patterns remain statistically distinguishable from uniform. The p-values at low delays are astronomically small (6.5e-81 is essentially zero) -- at these delays, the timing structure is trivially detectable.
We also computed chi-squared statistics, which test whether the distribution of exit times across time bins matches the expected uniform distribution. The chi-squared results tell the same story: at 0.5ms delay, chi-squared = 2,148 (p ≈ 0); at 50ms delay, chi-squared = 81.3 (p = 0.0025). The chi-squared test is actually more sensitive than KS for this type of analysis because it captures deviations in specific time bins rather than just the overall distribution shape. The fact that even at 50ms delay the chi-squared p-value is 0.0025 (below 0.05) while the KS p-value is 0.706 (above 0.05) tells us something subtle: the timing distribution at 50ms passes the KS test for overall uniformity but fails the chi-squared test for bin-level uniformity. Some time bins have slightly more packets than expected.
What does this mean in practice? A sophisticated adversary who bins exit timestamps and runs chi-squared analysis could detect non-uniformity even at 50ms delay. But the deviation is small (chi-squared = 81.3 with many degrees of freedom), and exploiting it would require observing thousands of packets from the same sender. For single-transaction privacy (the typical DeFi use case), the KS-passing uniformity at 50ms is sufficient. For repeated-activity privacy (the same wallet sending daily transactions for months), the chi-squared weakness suggests 50ms is the minimum delay, and higher values should be considered.
This creates a tension with the entropy result. At 0ms delay, entropy is 97.7% of maximum (excellent anonymity set), but timing unlinkability is zero (the adversary can distinguish real from random traffic trivially). How do you reconcile these?
The reconciliation is that they measure different things. Entropy measures how uniformly traffic is distributed across possible senders -- "given a packet exiting the network, how uncertain am I about which of the 10 senders sent it?" Unlinkability measures whether the timing pattern of exits matches what you'd expect from uniform random traffic -- "does the exit timing look random, or does it have structure?"
An adversary who can observe all inputs and outputs (a global passive adversary) uses both signals. High entropy means the path diversity makes it hard to guess the sender based on the exit point. Low unlinkability means the timing pattern leaks information that could, over many observations, erode that path diversity protection.
The takeaway: route diversity provides high entropy (anonymity set quality) at all delay settings, but timing unlinkability requires sufficient mixing delay. These are complementary properties. For DeFi applications where timing correlation is a concern (e.g., transaction submission patterns), 50ms per-hop delay achieves strong unlinkability while keeping p50 latency at 181ms -- well within interactive thresholds.
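The KS-based distinguishability check can be approximated with a stdlib-only sketch. The timings here are simulated, not our benchmark data, and we compare the D statistic against the asymptotic 5% critical value (1.36/sqrt(n)) rather than computing exact p-values:

```python
import math
import random

def ks_statistic_uniform(samples):
    """KS D statistic of samples (scaled to [0, 1]) against the uniform CDF F(x) = x."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # max deviation between the empirical CDF steps and the uniform CDF
        d = max(d, abs((i + 1) / n - x), abs(x - i / n))
    return d

random.seed(7)
n = 2000
well_mixed = [random.random() for _ in range(n)]        # uniform-looking exit times
clumped = [random.random() ** 3 for _ in range(n)]      # structured, front-loaded timing

crit_05 = 1.36 / math.sqrt(n)   # asymptotic 5% critical value for D
d_mixed = ks_statistic_uniform(well_mixed)      # small D: consistent with uniform
d_clumped = ks_statistic_uniform(clumped)       # large D: trivially distinguishable
```

A D statistic well above the critical value corresponds to the astronomically small p-values in the low-delay rows; a D below it corresponds to the 0.706 result at 50ms.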
Timing Correlation: The Raw Numbers
We measured input-output timing correlation directly: for each packet entering the mixnet, we recorded the entry timestamp and the exit timestamp, then computed statistical correlation between these two series. This is the most direct test of whether an adversary can link packets by timing alone.
At 1ms mixing delay with 10 nodes, 2,000 packets, and 50 concurrent senders:
| Metric | Value | Interpretation |
|---|---|---|
| Pearson correlation | 0.9999 | Near-perfect linear correlation |
| Spearman rank corr | 0.9996 | Near-perfect monotonic correlation |
| Mutual information | 4.40 bits | High shared information |
| Sample count | 1,964 | 36 packets lost in transit |
This looks terrible. A Pearson r of 0.9999 means the entry and exit times are almost perfectly correlated. Doesn't this mean the mixing is useless?
Not exactly. Here's what's happening: at 1ms Poisson delay, the mixing contribution is tiny relative to the overall transit time. A packet entering at time T will exit approximately at T + (processing time + 3 hop delays + mixing delays) ≈ T + 100-300ms. The ordering is mostly preserved because the 1ms mixing delay is too short to reorder packets that are spaced more than a few milliseconds apart.
But this doesn't mean the adversary wins. The correlation tells you about timing -- packets that enter early tend to exit early. What it doesn't tell you is which entry corresponds to which exit. With 50 concurrent senders, packets from all 50 senders arrive at roughly the same times and exit at roughly the same times. The high correlation says "early packets exit early" but not "Alice's packet is the third one out."
This is the distinction between timing correlation and sender-receiver linkability. Timing correlation is necessary but not sufficient for deanonymization. You also need to resolve which sender's packet is which among all the packets that entered in the same time window.
That said, at 1ms delay, the timing is almost certainly exploitable by a sophisticated adversary. The Pearson r of 0.9999 means the packets are barely shuffled -- they exit in almost the same order they entered. A global passive adversary with microsecond-resolution timing could likely link packets at this delay setting.
At 50ms delay (where our KS test shows strong unlinkability), the timing correlation would be dramatically lower. We measured the timing correlation at 1ms specifically to demonstrate that this is what inadequate mixing looks like. The right question isn't "is correlation high at 1ms?" (yes, trivially) but "at what delay does correlation drop enough to defeat practical adversaries?" Our KS test data says 50ms. A more rigorous answer would require correlation measurements across the full delay range, which is on our benchmark roadmap.
This is another case where publishing embarrassing numbers is more useful than hiding them. The 0.9999 correlation at 1ms delay is an honest statement of what the system provides at that configuration. If you need timing resistance, use 50ms+ delay; route-diversity entropy alone does not protect you against a timing-capable adversary at 1ms.
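The mechanism behind the 0.9999 figure is easy to reproduce in simulation. A sketch with synthetic timings (the window length, transit constant, and exponential jitter model are illustrative assumptions, not our harness):

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

random.seed(1)
# 2,000 packets entering over a 2-second window; exit = entry + fixed transit
# (~150ms) + three per-hop Poisson (exponential) mixing delays
entries = sorted(random.uniform(0.0, 2.0) for _ in range(2000))

def exits(mean_delay):
    return [t + 0.15 + sum(random.expovariate(1 / mean_delay) for _ in range(3))
            for t in entries]

r_1ms = pearson(entries, exits(0.001))    # jitter tiny vs. entry spread: r ~ 1
r_50ms = pearson(entries, exits(0.050))   # larger jitter: correlation drops
```

The point is structural: Pearson r measures how much the mixing jitter perturbs the entry ordering, and at 1ms the jitter is negligible relative to the spacing of entries.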
Entropy Is Not Enough: The LLMix Challenge
Our entropy numbers look great. 97.7% of maximum at 0ms delay. But a recently published attack makes us nervous about relying on entropy as the sole privacy metric.
In June 2025, Mavroudis and Elahi published "LLMix" (arXiv:2506.08918), which uses generative language models (specifically, transformer-based sequence models) to deanonymize mix network traffic. The key insight: entropy metrics assume the adversary analyzes each observation independently. LLMix's adversary doesn't. It builds a sequence model over time, learning temporal patterns that accumulate across many observations.
Consider: if you send a transaction every Tuesday at 3pm, each individual observation might be buried in a high-entropy anonymity set. But over weeks of observation, the pattern becomes detectable. A generative model can learn "user X sends between 2:50pm and 3:10pm on Tuesdays" and use that to filter the anonymity set from 10 possible senders to 2 or 3.
LLMix showed that for continuous stop-and-go mixes (the theoretical model that Loopix approximates), the cumulative information leakage can be substantial even when per-observation entropy is high. The attack succeeds because the adversary has unbounded observation time and can correlate patterns across thousands of mixing epochs.
What does this mean for NOX?
- Our entropy measurements are necessary but not sufficient. They demonstrate that the route diversity provides good instantaneous anonymity. They don't prove that a patient adversary with ML capabilities can't deanonymize users over weeks of observation.
- Client-side cover traffic is critical. If the client sends real and dummy traffic at a constant rate (regardless of actual activity), the temporal pattern disappears. We haven't implemented client-side cover traffic yet -- it's the highest priority missing feature (see Part 6).
- We need longitudinal privacy analysis. Our benchmarks measure snapshots. We need to measure how privacy degrades over time under sustained observation. This is a hard measurement to do and nobody in the field does it well, but it's where the real privacy failures live.
- The MOCHA concern. Rahimi's MOCHA simulator (2025) shows that message-level anonymity metrics can overestimate client-level protection. If 10 messages come from 10 senders, message-level entropy is log2(10) = 3.32 bits. But if one sender sends 8 of those messages, the client-level entropy is much lower. Our benchmarks measure message-level entropy. Client-level analysis requires modeling realistic usage patterns, which we haven't done.
We publish these entropy numbers because they're real data and they demonstrate meaningful properties of the system. But we want to be clear about what they don't show: they don't show resilience against patient adversaries with ML capabilities, and they don't show client-level privacy under realistic usage patterns.
Cover Traffic: Bandwidth Cost of Indistinguishability
Cover traffic masks real communication patterns by injecting dummy packets. But it's not free. We measured the tradeoff between cover traffic rate and its impact:
| Cover Rate (PPS) | Total Packets | Bandwidth Overhead | Normalized Entropy |
|---|---|---|---|
| 0 | 2,000 | 1.0x | 100.0% |
| 0.5 | 2,306 | 1.15x | 99.8% |
| 1.0 | 2,645 | 1.32x | 99.4% |
| 2.0 | 3,188 | 1.59x | 98.7% |
| 5.0 | 5,176 | 2.59x | 96.5% |
| 10.0 | 8,358 | 4.18x | 94.2% |
There's a counterintuitive result here: entropy actually decreases slightly as cover traffic increases. This is because cover traffic makes the overall traffic pattern more predictable (more total traffic to analyze), even though it successfully masks which packets are real. The entropy drop is small (100% to 94.2%), but it means cover traffic isn't a free lunch for entropy -- its real value is in masking activity patterns (preventing the adversary from knowing when a user sends), not in increasing anonymity set size.
The Active/Idle Distinguishability Problem
Our deeper cover analysis measured something more important than entropy: can an adversary tell which nodes are actively handling real traffic and which are idle? This is the cover traffic's actual job. We used the Kolmogorov-Smirnov test against a Poisson distribution, with 5 active nodes and 5 idle nodes across 15 seconds of observation:
| Cover Rate (PPS) | KS Statistic | KS p-value | Chi-Squared p-value | Bandwidth Overhead | CPU Time (s) |
|---|---|---|---|---|---|
| 0 | 0.284 | 0.009 | 0.0007 | 1.0x | 0.62 |
| 0.5 | 0.211 | 8.5e-8 | 0.005 | 2.09x | 1.19 |
| 1.0 | 0.152 | 3.9e-6 | 0.0004 | 3.01x | 1.60 |
| 2.0 | 0.044 | 0.282 | 0.383 | 5.15x | 2.41 |
| 5.0 | 0.045 | 0.020 | 0.174 | 11.46x | 4.33 |
| 10.0 | 0.072 | 2.8e-10 | 0.010 | 21.96x | 6.90 |
| 20.0 | 0.080 | 1.7e-23 | 1.9e-8 | 42.34x | 10.69 |
The sweet spot is crystal clear: 2.0 PPS cover traffic. At this rate, the KS p-value jumps to 0.282 and the chi-squared p-value to 0.383 -- the adversary cannot distinguish active from idle nodes at any standard significance level. Below 2.0 PPS, the traffic patterns are statistically distinguishable. Above 2.0 PPS, something surprising happens: the p-values actually worsen.
Why does more cover traffic make things less private? At high cover rates, the Poisson process generating cover traffic becomes the dominant traffic source. But the cover traffic rate is configured per-node, so all nodes generate cover at the same rate. The real traffic is then a small perturbation on top of the cover baseline. A sophisticated adversary can subtract the expected cover traffic pattern and look at the residual -- which is the real traffic signal. At 2 PPS cover, the cover and real traffic are similar in magnitude, making them hard to separate. At 20 PPS cover, the real traffic is a tiny bump on a large cover baseline, and statistical tests can detect the bump.
This is a subtlety that most cover traffic analyses miss. More cover isn't always better. The optimal rate depends on the real traffic rate -- you want cover and real traffic to be similar in volume.
The bandwidth cost at the sweet spot is 5.15x -- for every real packet, you send about 4 cover packets. That's significant but not catastrophic: at 5x overhead, a relayer whose baseline bandwidth costs $5/month pays roughly $25/month. The CPU cost is also manageable: 2.41 seconds of CPU time per 15-second measurement window, or about 16% of one core.
For a DeFi relayer, the practical question is whether 5x bandwidth overhead is worth the active/idle indistinguishability. For high-value operations (institutional privacy, compliance flows), probably yes. For everyday transactions competing on cost, probably not. This is another reason why privacy will initially be a premium service rather than a default feature.
Tier 4: Attack Resistance
We didn't just measure performance -- we simulated attacks. Our privacy analytics binary runs several adversary models against a running mixnet. These results are not flattering. We publish them anyway, because somebody should.
n-1 Attack: 100% Success (This Is Expected)
The adversary controls all but one message in a mixing batch.
| Metric | Value |
|---|---|
| Attack type | n-1 (flood attack) |
| Adversary nodes | 14/15 |
| Success probability | 100% |
| Entropy under attack | 0.0 bits |
| Entropy reduction | 100% |
This is a fundamental limitation of the Loopix mixing model, not a bug in our implementation. The adversary floods a target node's queue with their own messages. When the node flushes the batch, all messages except one belong to the adversary -- the remaining message must be the target's. Defenses: rate limiting, PoW admission, and large anonymity sets (more real users). There is no cryptographic mitigation.
Every Sphinx-based mix network has this vulnerability. We state it because nobody else publishes their n-1 attack results.
Intersection Attack: 20 Epochs to Full Deanonymization
The adversary observes which users are online across multiple epochs and correlates presence with message delivery.
| Epochs Observed | Activity Probability | Success Rate | Entropy Remaining |
|---|---|---|---|
| 1 | 0.7 | 0% | 3.91 bits (full baseline) |
| 2 | 0.7 | 0% | 3.91 bits |
| 5 | 0.7 | 0% | 3.91 bits |
| 10 | 0.7 | 70% | 2.17 bits |
| 20 | 0.7 | 100% | 0.0 bits |
At p=0.7 activity probability (each user is online 70% of the time), the attack requires 20 epochs to fully converge. The math is straightforward: after k epochs, the adversary expects each non-target user to be absent at least once with probability 1-(0.7)^k. At k=20, 1-(0.7)^20 = 0.9992 -- nearly certain that every non-target user has been observed absent at least once, leaving only the true sender.
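The convergence math, as a sketch:

```python
def absence_observed(p_online, epochs):
    """Probability a given non-target user has been observed absent at least
    once after `epochs` epochs, when each user is online with prob p_online."""
    return 1 - p_online ** epochs

# At p = 0.7: slow for the first few epochs, near-certain by epoch 20
probs = {k: absence_observed(0.7, k) for k in (1, 2, 5, 10, 20)}
```

At k=20 this evaluates to about 0.9992, matching the table's full-deanonymization row.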
The defense is client-side cover traffic -- making it impossible to distinguish active from inactive users. If every client sends traffic at a constant rate regardless of actual activity, the adversary can't determine who is "online" and the intersection attack fails. We haven't implemented client-side cover traffic yet (see Part 6). This is the highest-priority missing defense.
Compromised Nodes: Graceful Degradation
When an adversary controls a fraction of nodes in the network:
| Compromised | Total Nodes | Full Deanonymization | Entropy Reduction | Partial Observation |
|---|---|---|---|---|
| 1 | 15 | 0.0% | 0.0% | 20.0% of traffic |
| 2 | 15 | 0.0% | 0.0% | 26.7% of traffic |
| 3 | 15 | 6.7% | 6.7% | 26.7% of traffic |
With 3 out of 15 nodes compromised (20% of the network), full path deanonymization succeeds only 6.7% of the time. The theoretical expectation for 3-hop paths with independent uniform selection is (3/15)^3 = 0.8%, but layered topology effects increase this. Even so, at 20% compromise, the adversary only fully deanonymizes 1 in 15 flows.
"Partial observation" means the adversary sees traffic at one or two hops but not all three. Observing 2 of 3 hops gives the adversary significant information (they know the entry and middle, or middle and exit), but doesn't fully reveal the sender-receiver link.
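The uniform baseline quoted above is one line of arithmetic. This sketch computes only the independent-uniform-selection model; as the measurements show, layered topology pushes the real rate higher (6.7% vs the 0.8% baseline).

```python
def full_deanon_prob(compromised: int, total: int, hops: int = 3) -> float:
    """Baseline model: each hop drawn independently and uniformly;
    full deanonymization requires all hops to be compromised."""
    return (compromised / total) ** hops

# 3 of 15 nodes compromised, 3-hop paths:
print(f"{full_deanon_prob(3, 15):.3%}")  # uniform baseline, vs 6.7% measured
```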
This highlights why Sybil resistance matters: an adversary who can cheaply add nodes can compromise a larger fraction. NOX plans staking-based admission to mitigate this. Currently, the NoxRegistry smart contract controls admission, but without economic stake requirements.
PoW Anti-Spam: Asymmetric Defense
We measured the cost of our SHA-256 Hashcash PoW across difficulty levels:
| Difficulty (bits) | Mean Solve Time | p99 Solve Time | Verify Time | Asymmetry Ratio |
|---|---|---|---|---|
| 0 | 0 us | 0 us | 17.7 ns | -- |
| 8 | 22 us | 174 us | 64.8 ns | 343x |
| 12 | 213 us | 602 us | 67.6 ns | 3,149x |
| 16 | 788 us | 2.3 ms | 65.7 ns | 11,987x |
| 20 | 9.9 ms | 69.9 ms | 74.3 ns | 133,596x |
| 24 | 164.7 ms | 807.9 ms | 78.4 ns | 2,101,660x |
At difficulty 16, a spammer must spend 788 microseconds per packet while the node verifies in 65.7 nanoseconds -- a 12,000x asymmetry. At difficulty 20, that's 10 milliseconds of computation per spam packet. For a legitimate user sending one packet, 10ms is negligible. For an adversary attempting to flood the network at thousands of PPS, it's prohibitive.
The verify throughput remains flat at ~15 million verifications per second regardless of difficulty, which means PoW checking never becomes a bottleneck. The asymmetry property is what makes PoW useful for anti-spam: the cost grows exponentially for the attacker but remains constant for the defender.
We currently run difficulty 16 in our benchmarks. For production, this would be dynamically adjusted based on load -- higher difficulty during detected flooding, lower during normal operation.
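For readers who want to see the mechanism, here is a minimal Hashcash-style sketch in Python. The NOX implementation is Rust; the `solve`/`verify` names and the 8-byte nonce encoding here are ours, for illustration only.

```python
import hashlib
import os

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(payload: bytes, difficulty: int) -> int:
    """Brute-force a nonce so SHA-256(payload || nonce) has `difficulty`
    leading zero bits. Expected work: ~2^difficulty hashes."""
    nonce = 0
    while True:
        digest = hashlib.sha256(payload + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce
        nonce += 1

def verify(payload: bytes, nonce: int, difficulty: int) -> bool:
    """One hash regardless of difficulty -- the source of the asymmetry."""
    digest = hashlib.sha256(payload + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty

packet = os.urandom(32)
nonce = solve(packet, 12)  # ~4,096 hashes expected
assert verify(packet, nonce, 12)
```

The asymmetry falls out of the structure: `solve` loops, `verify` hashes exactly once.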
This motivates an adaptive difficulty design. Consider the economics: at difficulty 16, a spammer can generate ~1,269 packets per second per CPU core (1/788us per packet). To flood a node processing 369 PPS, the spammer needs less than one core. That's not a high bar. At difficulty 20, the spammer can generate only ~100 packets per second per core, so sustaining a 369 PPS flood requires 4 CPU cores. Still feasible, but the cost scales.
The key insight from the data is in the p99 solve times. At difficulty 16, the p99 is 2.3ms -- occasional proof-of-work computations take 3x the mean. At difficulty 20, the p99 is 69.9ms -- 7x the mean. The exponential distribution of hash-based PoW means high-difficulty puzzles have extreme variance. A spammer targeting consistent flood rates at difficulty 20+ would need to pre-compute PoW solutions in batches to smooth out the variance, adding memory overhead and implementation complexity.
For a dynamic system: at normal load, use difficulty 12 (213 microseconds mean solve, barely noticeable for legitimate clients). When the node detects incoming packet rate exceeding 2x the expected rate, bump to difficulty 16. At 5x expected rate, bump to 20. At 10x, bump to 24 (165ms mean solve -- punishing for bulk spam, still tolerable for a single legitimate request). The asymmetry ratios (3,149x at difficulty 12 up to 2.1M at difficulty 24) ensure that the node's verification cost stays negligible even as the attacker's cost skyrockets.
We haven't implemented the adaptive difficulty yet. It's straightforward engineering (the PoW verify already takes difficulty as a parameter), but the tuning -- what rates trigger what difficulty levels, how quickly to escalate and de-escalate -- requires real-world traffic data that we don't have from localhost benchmarks.
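The escalation schedule above reduces to a small step function. A sketch with illustrative thresholds -- as noted, this is not implemented in NOX yet, and the real trigger rates would come from production traffic data:

```python
def difficulty_for_load(observed_pps: float, expected_pps: float) -> int:
    """Map observed load ratio to PoW difficulty, following the
    escalation schedule sketched in the text (thresholds illustrative)."""
    ratio = observed_pps / expected_pps
    if ratio >= 10.0:
        return 24  # punishing for bulk spam
    if ratio >= 5.0:
        return 20
    if ratio >= 2.0:
        return 16
    return 12      # barely noticeable for legitimate clients

print(difficulty_for_load(369.0, 369.0))    # normal load
print(difficulty_for_load(4_000.0, 369.0))  # >10x flood
```

The missing (harder) half is hysteresis: how quickly to escalate and de-escalate without oscillating.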
Replay Detection: Zero False Negatives
| Implementation | Insert Throughput | Check Throughput | False Positives | False Negatives |
|---|---|---|---|---|
| Rotational Bloom Filter | 5.6M ops/sec | 8.0M ops/sec | 0 / 10,000 | 0 / 10,000 |
| Sled (disk-backed) | 398K ops/sec | 3.3M ops/sec | 0 / 10,000 | 0 / 10,000 |
The Bloom filter achieves 14x higher insert throughput and 2.4x higher check throughput than disk-backed detection. The zero false-positive count in this test run is slightly lucky -- the configured false positive rate is 0.1%, so over 10,000 checks you'd expect ~10 false positives. We got zero. Over millions of operations, the 0.1% rate would manifest.
The Bloom filter is the production choice for hot-path replay detection; Sled provides persistence for durability across restarts. We use a rotational design: two Bloom filters, swapped at epoch boundaries, so the memory footprint stays bounded regardless of how many packets the node has ever seen.
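To make the rotational design concrete, here is a minimal Python sketch. The production filter is Rust and sized very differently; the class names and SHA-256-based hashing here are ours, for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, bits: int, hashes: int):
        self.bits, self.hashes = bits, hashes
        self.data = bytearray(bits // 8 + 1)

    def _positions(self, tag: bytes):
        # Derive `hashes` bit positions from independent SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(i.to_bytes(4, "big") + tag).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def insert(self, tag: bytes):
        for pos in self._positions(tag):
            self.data[pos // 8] |= 1 << (pos % 8)

    def contains(self, tag: bytes) -> bool:
        return all(self.data[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(tag))

class RotationalReplayDetector:
    """Two filters: tags are checked against both but inserted into the
    active one. On rotation the stale filter is discarded, so memory
    stays bounded no matter how many packets the node has ever seen."""
    def __init__(self, bits: int = 1 << 20, hashes: int = 10):
        self.bits, self.hashes = bits, hashes
        self.active = BloomFilter(bits, hashes)
        self.previous = BloomFilter(bits, hashes)

    def seen_before(self, tag: bytes) -> bool:
        if self.active.contains(tag) or self.previous.contains(tag):
            return True  # replay (or a rare false positive)
        self.active.insert(tag)
        return False

    def rotate(self):
        # Epoch boundary: drop the oldest filter, start a fresh one.
        self.previous, self.active = self.active, BloomFilter(self.bits, self.hashes)
```

A tag survives exactly one rotation in the draining filter, then ages out -- which is safe because Sphinx replay tags are only valid within an epoch anyway.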
Let's unpack the Bloom filter memory budget, because this is the kind of operational detail that matters for node operators. The configured capacity is 10 million tags at 0.1% false positive rate. Using the optimal Bloom filter sizing formula (m = -n*ln(p) / (ln(2)^2)), this requires approximately 18 MB of memory per filter. With two filters in the rotational design (one active, one draining from the previous epoch), the total memory cost is ~36 MB -- about a quarter of the idle node's 136 MB footprint.
At our measured 8.0M checks/sec, a node can verify replay status in 125 nanoseconds per packet. For context, the Sphinx per-hop processing costs 61.6 microseconds. Replay detection adds 0.2% overhead. It's essentially free.
The 100,000 unique tags and 10,000 replay attempts in this benchmark represent a concentrated attack scenario: 10% of all traffic is replay attempts. In production, replay attacks would be a much smaller fraction of total traffic (and each replay attempt requires a valid PoW solution, making bulk replay prohibitively expensive). The zero false negative rate is the critical safety property: no replayed packet ever slips through. False positives (legitimate packets incorrectly flagged as replays) are tolerable at 0.1% -- the client simply re-sends with a new tag.
What the Attack Results Tell Us About System Design
The attack simulation results, taken together, paint a coherent picture of the threat landscape:
n-1 attack (100% success). Mitigated by: large anonymity sets, rate limiting, PoW. This is the theoretical worst case -- an adversary who controls almost all traffic at a node. In practice, this requires either (a) an extremely low-traffic node where the adversary can flood it, or (b) Sybil attacks to inject many adversary-controlled messages. PoW makes (b) expensive. Large anonymity sets make (a) unlikely. But neither eliminates the attack -- they just raise the cost.
Intersection attack (20 epochs). Mitigated by: client-side cover traffic. This is the practical worst case for a patient adversary. Twenty observation epochs (where an "epoch" might be an hour or a day, depending on configuration) is not a long time for a motivated adversary. This is our highest-priority security gap. Until we implement client-side cover traffic, any user who is not online 100% of the time is vulnerable to intersection analysis.
Compromised nodes (6.7% at 20%). Mitigated by: Sybil resistance (staking), larger networks. This is actually reasonable. A 20% compromise rate is severe, and even then only 6.7% of flows are fully deanonymized. In a network with proper Sybil resistance (economic stake for admission), maintaining 20% compromise is prohibitively expensive.
The combined picture: The system's weakest points are (1) low-traffic nodes (n-1 attack) and (2) intermittent users (intersection attack). Both are mitigated by the same thing: more users sending more traffic more consistently. Privacy is a collective resource -- individual privacy improves when the network has more participants. This creates the positive feedback loop that all privacy systems depend on and all privacy systems struggle to bootstrap.
Tier 5: DeFi Pipeline
This is where NOX diverges from every other mixnet. No other system has a native DeFi pipeline. The data below comes from our micro_mainnet_sim -- real ZK proof generation (bb.js UltraHonk), real Sphinx packets through a real mixnet, real Anvil chain with real Solidity contracts, real contract execution and event parsing.
The DeFi Latency Budget
Before diving into specific measurements, let's map where time actually goes in a private DeFi operation. Understanding the latency budget tells you what to optimize and what to ignore.
A private swap on Xythum involves these sequential stages:
[Proof Generation] → [Sphinx Construction] → [Mixnet Transit] → [Simulation] → [On-Chain Execution] → [Block Confirmation]
~9-11 s → ~222 µs → ~100-300 ms → ~50 ms → ~2-5 s → ~12 s
Stage 1: Proof Generation (9-11 seconds). The client generates a ZK proof using bb.js (Barretenberg's JavaScript UltraHonk backend). This is the dominant cost. The circuit has 13-31 public inputs depending on operation type. The proof must demonstrate knowledge of the note's plaintext, the spending key, the Merkle inclusion path, and the nullifier derivation -- all without revealing any of these to anyone.
In the paid mixnet path, this cost is offloaded to the relayer. The client sends the witness (inputs) through the mixnet, and the relayer generates the proof server-side. The relayer's prover is pre-warmed (332ms initialization, amortized across hundreds of requests), and the server has dedicated compute resources. This is why the paid mixnet path is actually faster than direct submission.
Stage 2: Sphinx Packet Construction (222 microseconds). Negligible. The client builds a 3-hop Sphinx packet containing the proof and transaction data. At 222 microseconds for a 3-hop packet, this is invisible in the latency budget.
Stage 3: Mixnet Transit (100-300ms). The packet traverses 3 mix nodes, each adding Poisson-distributed delay. At 1ms mean delay, the p50 one-way transit is ~70ms. At 50ms mean delay (which is what you'd want for timing unlinkability), it's ~180ms. The SURB response traverses 3 more hops for confirmation, adding another 100-200ms.
Stage 4: Contract Simulation (eth_simulateV1, ~50ms). Before submitting the transaction, the relayer simulates it using eth_simulateV1 -- a stateless RPC call that returns the execution result, gas used, and emitted events without actually modifying chain state. This catches reverts before they waste gas. The simulation takes about 50ms on Anvil (local testnet); it would be faster on a production node with a warm state trie.
Stage 5: On-Chain Execution (2-5 seconds). Transaction submission, mempool processing, and block inclusion. On mainnet, this varies wildly based on gas price and network congestion. On Anvil (our test environment), blocks are mined immediately, so this is just transaction propagation and execution time.
Stage 6: Block Confirmation (12 seconds per block). After inclusion, the user typically waits for 1-2 block confirmations for finality. On Ethereum mainnet, that's 12-24 seconds. On an L2 like Arbitrum or Base, confirmations are near-instant.
Total end-to-end: 25-40 seconds for a direct submission on L1. 15-25 seconds via paid mixnet (proof generation offloaded). On an L2, this drops to 5-15 seconds because block confirmations are faster.
The optimization priorities are clear: proof generation dominates in direct mode, and block confirmation dominates in both modes. Mixnet transit at 100-300ms is a rounding error. Faster Sphinx processing (e.g., going from 61.6us to 30us per hop) would save about 100 microseconds -- invisible in the overall budget. The paths to faster private DeFi are: faster proof generation (circuit optimization, native provers instead of bb.js), L2 deployment (faster confirmations), and proof aggregation (amortize verification gas across multiple operations).
To make this concrete: if we cut Sphinx per-hop time from 62us to 1us (a 62x improvement, basically impossible), the end-to-end latency for a private swap would decrease from ~15 seconds to ~14.9998 seconds. If we cut proof generation from 10 seconds to 2 seconds (a realistic 5x improvement from native prover), the end-to-end latency would decrease from ~15 seconds to ~7 seconds. The leverage isn't close.
This is actually a liberating realization for mixnet engineers. The Sphinx processing is not the bottleneck for DeFi applications. As long as per-hop processing stays under ~1 millisecond (giving you >1,000 hops/second throughput, far beyond anything we need), you can make crypto-algorithmic choices for security (e.g., SPRP, post-quantum KEM) without worrying about the DeFi latency impact. The latency budget has room.
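The mixing-delay component of the transit budget is easy to sanity-check: it's the sum of three independent exponential delays. Note this sketch models mixing delay only -- the ~180 ms p50 quoted earlier also includes network propagation between nodes:

```python
import random

random.seed(7)  # deterministic sketch

def mixing_delay_ms(hops: int = 3, mean_ms: float = 50.0) -> float:
    """Sum of independent exponential (Poisson-process) per-hop delays.
    Network propagation between nodes is not modeled here."""
    return sum(random.expovariate(1.0 / mean_ms) for _ in range(hops))

samples = sorted(mixing_delay_ms() for _ in range(100_000))
p50 = samples[len(samples) // 2]
print(f"p50 mixing delay at 50 ms mean, 3 hops: {p50:.0f} ms")
```

The sum of three exponentials is Erlang-distributed with a median around 134 ms at a 50 ms per-hop mean; link latency accounts for the rest of the observed transit time.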
Gas Profile: What Privacy Costs On-Chain
Each circuit type has a different gas footprint, determined by the number of public inputs, Merkle tree insertions, and nullifiers checked:
| Circuit | Gas Used | Proof Gen (ms) | Public Inputs | Merkle Inserts | Nullifiers |
|---|---|---|---|---|---|
| Deposit | 5.03M | 9,036 | 13 | 1 | 0 |
| Split | 5.58M | 10,697 | 24 | 2 | 1 |
| Join | 4.44M | 10,302 | 16 | 1 | 2 |
| Transfer | 6.09M | 10,914 | 31 | 2 | 1 |
| Withdraw | 4.16M | 9,247 | 18 | 1 | 1 |
| Public Claim | 4.34M | 9,225 | 13 | 1 | 0 |
| Gas Payment | 5.03M | 9,036 | 18 | 1 | 1 |
| Public Transfer | 0.56M | 0 | 0 | 1 | 0 |
Transfer is the most expensive (6.09M gas, 31 public inputs) because it creates two new notes (sender change + recipient) and spends one nullifier. A public transfer (no ZK proof, direct insertion) costs only 0.56M gas -- the ~10x premium for privacy is the cost of ZK verification on-chain.
Proof generation times are 9-11 seconds per circuit, dominated by the UltraHonk backend (bb.js). This is the user-facing latency cost: before any network activity, the wallet must generate the proof. The variation between circuits (9,036ms for deposit vs 10,914ms for transfer) reflects circuit complexity -- more gates and more public inputs mean more prover work.
The gas profile drives the economics. Each Merkle insertion costs roughly 500K gas (storage writes to update the tree). Each nullifier check costs roughly 300K gas (storage read to check the nullifier set, storage write to mark as spent). The fixed overhead of proof verification (the UltraHonk verifier contract) is about 3.5M gas. So a deposit (1 insert, 0 nullifiers, 13 inputs) = 3.5M + 0.5M + overhead = ~5M gas. A transfer (2 inserts, 1 nullifier, 31 inputs) = 3.5M + 1.0M + 0.3M + more input overhead = ~6M gas.
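The decomposition reduces to a one-line estimator. A sketch using the component figures stated above -- these are approximate averages, so measured totals land a few hundred K to either side (calldata decoding and public-input handling add overhead; some circuits, like withdraw, measure below the naive sum):

```python
VERIFIER_GAS = 3_500_000   # UltraHonk verifier, fixed per proof
INSERT_GAS = 500_000       # per Merkle tree insertion
NULLIFIER_GAS = 300_000    # per nullifier check + write

def estimate_gas(inserts: int, nullifiers: int) -> int:
    """Rough sum of the stated components, before input handling
    and contract plumbing overhead."""
    return VERIFIER_GAS + inserts * INSERT_GAS + nullifiers * NULLIFIER_GAS

print(estimate_gas(1, 0))  # deposit estimate  (measured total: 5.03M)
print(estimate_gas(2, 1))  # transfer estimate (measured total: 6.09M)
```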
Where the Gas Goes: Anatomy of a ZK Verification
The gas numbers in the table above are totals. Let's decompose them to understand what drives the cost:
The UltraHonk verifier contract: ~3.5M gas, fixed across all circuits. This is the dominant cost. The verifier performs elliptic curve pairings on BN254 -- specifically, it checks that the proof polynomial commitments satisfy the UltraHonk relation. The EVM's ecPairing precompile (at address 0x08) costs 34,000 gas per pair plus 45,000 base cost. An UltraHonk verification involves multiple pairings plus extensive field arithmetic in the verifier contract's Solidity code.
Why 3.5M and not less? UltraHonk proofs are more efficient to generate than older Plonk variants (the prover is 2-3x faster), but the verification cost is slightly higher because of additional polynomial evaluations. We accept this tradeoff because proof generation time is the user-facing latency (9-11 seconds), while verification gas is a one-time on-chain cost that the user pays in ETH.
Merkle tree insertion: ~500K gas per insertion. Our contract uses the Lean IMT (Incremental Merkle Tree), which stores only the non-zero path from the leaf to the root. Each insertion updates O(log N) storage slots (where N is the number of leaves). At depth 20, that's up to 20 storage slots at 5,000 gas each (SSTORE to non-zero slot) = 100K gas just for storage, plus the hash computation per level. We use keccak256 for the on-chain tree (not Poseidon -- keccak is native to the EVM at 30 gas + 6 gas/word, while Poseidon would require a library deployment and cost 10-50x more on-chain).
Nullifier check: ~300K gas per nullifier. Reading the nullifier set (SLOAD: 2,100 gas for cold access, 100 for warm), verifying non-existence, then writing the nullifier (SSTORE: 20,000 gas for zero-to-nonzero). The 300K total includes the calldata decoding, ABI parameter handling, and the modifier checks in the contract.
Public input handling: Variable, ~50-150K gas depending on input count. Each public input is a BN254 field element (32 bytes) that must be decoded from calldata (16 gas per byte for non-zero calldata). The transfer circuit with 31 public inputs has notably higher input handling overhead than the deposit circuit with 13.
Stack depth management overhead: Our contract splits large operations into internal helper functions (_processChange, _processTransferMemo) because Solidity with viaIR: false has a stack depth limit of 16 local variables. These function calls add a small gas overhead (~2-5K per call) but are necessary to compile at all without viaIR: true. We can't enable viaIR because it breaks the auto-generated UltraHonk verifier contracts -- one of the more painful compatibility constraints in the system.
The Privacy Premium
How much more does a private operation cost compared to its public equivalent?
| Operation | Private Gas | Public Gas | Premium Ratio |
|---|---|---|---|
| ERC-20 transfer -> Private transfer | 6.09M | 65K | 93.6x |
| ERC-20 transfer -> Private deposit | 5.03M | 65K | 77.4x |
| Uniswap V3 swap -> Private withdraw | 4.16M | 180K | 23.1x |
A private transfer costs 93.6x more gas than a public ERC-20 transfer. At $3,000 ETH and 10 gwei gas, that's $182.58 vs $1.95 -- a $180.63 premium for privacy. For a Uniswap swap, the premium is 23.1x ($124.80 vs $5.40).
This is the honest cost of on-chain privacy today. It's expensive. There's no way to sugarcoat a 93x gas premium.
But the comparison isn't "private vs public transfer" -- it's "private transfer vs no privacy at all." If your address holds serious money, a ~$180 fee is cheap insurance against targeted theft, a wrench attack, or regulatory exposure. For users whose threat model justifies the cost, the premium is worth it; for $50 transfers, it obviously isn't.
The paths to cost reduction:
- L2 deployment. Arbitrum/Base gas prices are 10-100x lower than L1. The same 6M gas transfer that costs ~$180 on L1 would cost roughly $1.80-$18.
- Proof aggregation. Batch multiple operations into a single proof verification. If 10 transfers share one verification, the per-transfer gas drops from 6M to ~1.5M.
- Circuit optimization. Smaller circuits with fewer constraints generate smaller proofs that cost less to verify. This is active engineering work.
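The premium table converts to dollars with one line of arithmetic. A sketch using the rounded 6.09M/65K gas figures at an assumed $3,000 ETH and 10 gwei:

```python
def gas_cost_usd(gas: int, gas_price_gwei: float, eth_usd: float) -> float:
    """Convert a gas amount to USD at a given gas price and ETH price."""
    return gas * gas_price_gwei * 1e-9 * eth_usd

private = gas_cost_usd(6_090_000, 10, 3_000)  # private transfer
public = gas_cost_usd(65_000, 10, 3_000)      # public ERC-20 transfer
print(f"${private:.2f} vs ${public:.2f} -> {private / public:.1f}x premium")
```

With the rounded 6.09M figure this prints a few cents above the quoted $182.58; the ratio is the same ~93.7x.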
Direct vs Paid Mixnet: The Counterintuitive Latency Result
We measured end-to-end latency for DeFi operations through two transport modes:
| Operation | Direct E2E (ms) | Paid Mixnet E2E (ms) | Winner |
|---|---|---|---|
| Split | 10,667 | 5,933 | Mixnet (1.8x faster) |
| Join | -- | 5,475 | -- |
| Withdraw | 9,295 | 4,757 | Mixnet (2.0x faster) |
| Deposit | 9,005 (direct only) | -- | -- |
The paid mixnet path is faster than direct submission. This seems paradoxical -- the mixnet adds network hops and mixing delay, so shouldn't it be slower?
Here's why it isn't: in direct mode, the client generates the ZK proof locally (9-11 seconds), then submits. In paid mixnet mode, the client sends the raw witness through the mixnet, and the exit node (relayer) generates the proof using its own prover. The relayer's prover is pre-warmed (332ms initialization, amortized across requests), runs on server-grade hardware with dedicated compute, and the proof generation happens server-side where there's no browser/client overhead.
The timeline:
- Direct: Client generates proof (10s) + Client submits tx (50ms) + Chain confirms (~instant on Anvil) = ~10s
- Mixnet: Client builds Sphinx packet (0.2ms) + Mixnet transit (100ms) + Relayer generates proof (5s) + Relayer submits tx (50ms) + SURB response (200ms) = ~5.5s
The mixnet path saves 4-5 seconds by offloading proof generation to the relayer. The mixnet transit adds only 100-300ms. Net savings: ~45%.
This is a genuine architectural advantage of the relayer model: users with constrained devices (mobile, browser) get faster total latency by offloading proof generation to the relayer. You pay for it in gas (the paid mixnet transaction includes a gas payment proof, roughly doubling the gas), but the latency improvement is real.
The Profitability Engine
Every relayer must answer one question before submitting a transaction: "Will I make money on this?"
The profitability engine runs in the hot path between receiving a client request and submitting the transaction. It must be fast -- adding seconds of economic analysis to a latency-sensitive pipeline would defeat the purpose. Here's how it works and how fast it is:
- Contract simulation (eth_simulateV1): The relayer simulates the transaction against current chain state. This returns the gas that would be consumed and the events that would be emitted. The critical event is RewardsDeposited, which tells the relayer how much the client is paying for the service. Simulation time: ~50ms on Anvil.
- Price oracle query: The relayer fetches the current ETH/USD price from its oracle service (Binance, CoinGecko, or aggregate). This is cached with a 30-second TTL, so most calls hit the cache. Cache hit time: <1ms.
- Profitability calculation: fee_revenue_usd / cost_usd >= (1 + min_margin). The relayer computes gas_used * gas_price * eth_price = cost in USD, and compares it to the fee revenue (decoded from the RewardsDeposited event) in USD. If the ratio exceeds 1.10 (10% minimum margin), the transaction is profitable. Calculation time: <1ms (arithmetic on f64).
- Decision: If profitable, submit. If not, reject with an error message to the client (delivered via SURB response). The client can retry with a higher fee or try another relayer.
Total profitability check latency: ~50ms, dominated by the eth_simulateV1 call. The economic computation itself is sub-microsecond.
The eth_simulateV1 API deserves a note. It's a relatively new Ethereum RPC method that performs stateless transaction simulation -- it doesn't modify any state, doesn't consume nonces, and returns full execution traces including emitted events. This is critical for the relayer because it needs to see the RewardsDeposited event before spending gas on the actual transaction. On Anvil, eth_simulateV1 also works around a limitation of debug_traceCall that ignores the withLog flag, making it the only way to get event logs from simulation.
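The profitability step itself is a one-line comparison. A hedged sketch -- the function name and parameters are illustrative, and the real engine decodes the fee from the RewardsDeposited event rather than taking it as an argument:

```python
def is_profitable(gas_used: int, gas_price_gwei: float, eth_usd: float,
                  fee_usd: float, min_margin: float = 0.10) -> bool:
    """Hot-path decision: fee revenue must beat the simulated gas cost
    by at least the configured minimum margin."""
    cost_usd = gas_used * gas_price_gwei * 1e-9 * eth_usd
    return fee_usd / cost_usd >= 1.0 + min_margin

# A deposit at $2,000 ETH / 1 gwei costs $10.06 in gas; a fee carrying
# the 12% premium ($11.27) clears the 10% minimum margin.
print(is_profitable(5_030_000, 1, 2_000, fee_usd=11.27))
```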
Relayer Economics: Break-Even at 1.4 Transactions Per Day
At the 12% premium (1,200 basis points) that our profitability engine enforces:
| ETH Price | Gas Price | Avg Profit/TX | Break-Even (TXs/day) | Monthly Revenue (100 TX/day) |
|---|---|---|---|---|
| $2,000 | 1 gwei | $1.19 | 1.4 | $3,556 |
| $2,000 | 10 gwei | $11.85 | 0.14 | $35,563 |
| $3,000 | 10 gwei | $17.78 | 0.09 | $53,345 |
| $5,000 | 50 gwei | $148.18 | 0.01 | $444,541 |
A relayer running on a $50/month VPS breaks even at 1.4 transactions per day under the worst conditions in our model ($2,000 ETH, 1 gwei). At 10 gwei gas, a single transaction per day covers the infrastructure cost with profit left over. At high gas prices (50 gwei, $5,000 ETH), a relayer handling 100 transactions per day would see $444K monthly revenue.
The economics scale with gas prices because the relayer's 12% premium is proportional to gas cost. Higher gas prices mean higher absolute margins per transaction. This is a fundamentally different economic model from Nym's token-based approach, where relay rewards depend on NYM token price rather than network utility. Our relayers are paid in proportion to the actual service they provide (gas execution), denominated in the currency of that service (ETH), with no intermediary token.
The 12% margin is configurable. A competitive market would push it down. A relayer accepting a 5% margin at 10 gwei gas and $3,000 ETH would earn about $7.40 per transaction -- still profitable, but the lower margin raises the break-even point (to about 0.22 transactions per day).
The Gas Payment Mechanism
How does the relayer actually get paid? This deserves a brief technical explanation because it's unusual.
In most relayer systems (like Tornado Cash's relayers), the client specifies a relayer fee as part of the ZK proof, and the smart contract transfers the fee to the relayer during execution. NOX does something similar but with an important difference: the gas payment is itself a ZK-proven operation.
The client generates a separate gas_payment proof that authorizes spending from one of their private notes to the relayer's address. This proof is bundled with the main operation proof (e.g., split, transfer, withdraw) in a single transaction. The smart contract verifies both proofs atomically. If either proof is invalid, the entire transaction reverts. If both succeed, the relayer receives the gas payment from the UTXO pool while the main operation executes.
The gas payment proof costs 5.03M gas (roughly matching a deposit proof in complexity). This is added to the main operation's gas, which is why paid mixnet transactions are approximately 2x the gas of direct transactions (e.g., split: 5.58M direct vs 9.76M paid mixnet -- the difference is the gas payment proof verification plus the additional Merkle insertion).
The advantage of ZK-proven gas payments: the relayer's payment is guaranteed by the smart contract. The relayer can verify (via simulation) that the payment proof is valid before spending gas on the transaction. There's no trust assumption -- if the simulation succeeds, the relayer will get paid.
Revenue Sensitivity Analysis
The profitability analysis above uses fixed scenarios. Let's look at how revenue responds to the key variables:
Gas price sensitivity: Revenue is linearly proportional to gas price. At 1 gwei (and $2,000 ETH), a transfer generates $1.46 in relayer revenue; at 10 gwei, $14.60. At 100 gwei (rare in 2026, common in 2021), it would generate $146. Relayers benefit from high gas prices -- which is counterintuitive, since high gas is bad for users. The 12% premium means relayers capture 12% of whatever the user pays in gas, so higher gas prices = higher absolute revenue.
ETH price sensitivity: Revenue is also linearly proportional to ETH price. At $2,000 ETH (10 gwei gas), a transfer generates $14.60; at $5,000 ETH, $36.50. Relayers are long ETH.
Transaction volume sensitivity: This is the real question for the business model. At 1 gwei and $2,000 ETH, the relayer needs 1.4 transactions per day to cover a $50/month VPS. That's achievable in a healthy privacy pool with modest usage. But at 0.5 transactions per day, the relayer loses money. The business model requires sufficient demand for private transactions to sustain relayer infrastructure.
This creates a bootstrapping challenge: users need relayers to process transactions, and relayers need users to generate revenue. This is the same chicken-and-egg problem that every privacy pool faces, and it's one reason why most privacy pools have small anonymity sets. We don't have a magic solution. We're honest about that.
One possible path: subsidize relayer operations during the bootstrap phase. A protocol treasury could cover relayer infrastructure costs (the $50/month VPS) for the first year, removing the break-even requirement and allowing relayers to operate at any transaction volume. As the user base grows and transaction volume increases, the subsidy can be phased out. Tornado Cash never had this problem because relayers were already profitable at launch (ETH gas was high in 2019-2021), but current low-gas conditions make the bootstrap harder.
Another path: integrate with existing DeFi infrastructure. If a DEX aggregator routes private swaps through the Xythum protocol, the transaction volume comes from existing DeFi activity rather than from privacy-motivated users specifically. The privacy becomes a feature of the existing trade flow, not a separate product that needs its own user acquisition. This is the strategy we're pursuing.
A third path, more speculative: cross-chain relaying. If NOX relayers can submit transactions on multiple L2s (Arbitrum, Base, Optimism, zkSync), a single relayer captures volume across all chains. Instead of needing 1.4 transactions per day on one chain, a multi-chain relayer might see 0.3 transactions per day per chain across 5 chains -- same total volume, but with a much larger addressable market. The gas payment mechanism is chain-agnostic (it's a ZK proof about UTXO spending, not about a specific chain's state), so the same client-relayer protocol works on any EVM chain. The engineering challenge is contract deployment and bridging, not mixnet changes.
The Numbers Behind the Bootstrap
Let's quantify the bootstrap challenge more precisely using our economics data.
At the lowest gas conditions in our model ($2,000 ETH, 1 gwei gas):
- A deposit costs 5.03M gas = 0.00503 ETH = $10.06 in gas
- The relayer's 12% margin on $10.06 is $1.21
- A VPS costs $50/month = $1.67/day
- Break-even: $1.67 / $1.21 = 1.38 transactions per day
At moderate conditions ($3,000 ETH, 10 gwei):
- A deposit costs 5.03M gas = 0.0503 ETH = $150.90 in gas
- The relayer's 12% margin is $18.11
- Break-even: $1.67 / $18.11 = 0.09 transactions per day (once every 11 days)
At high conditions ($5,000 ETH, 50 gwei):
- A deposit costs 5.03M gas = 0.2515 ETH = $1,257.50 in gas
- The relayer's 12% margin is $150.90
- Break-even: $1.67 / $150.90 = 0.01 transactions per day (once every 100 days)
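The three scenarios reduce to a single formula. A sketch using the deposit's 5.03M gas, the 12% margin, and a $50/month VPS:

```python
VPS_USD_PER_DAY = 50 / 30  # $50/month VPS
MARGIN = 0.12              # relayer premium on gas cost

def break_even_tx_per_day(gas: int, gas_price_gwei: float, eth_usd: float) -> float:
    """Transactions per day needed for margin revenue to cover the VPS."""
    profit_per_tx = gas * gas_price_gwei * 1e-9 * eth_usd * MARGIN
    return VPS_USD_PER_DAY / profit_per_tx

for eth, gwei in ((2_000, 1), (3_000, 10), (5_000, 50)):
    print(f"${eth} ETH, {gwei} gwei: "
          f"{break_even_tx_per_day(5_030_000, gwei, eth):.2f} tx/day")
```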
The extreme sensitivity to gas conditions means the bootstrap problem is much worse in bear markets (low gas, low ETH price) and essentially nonexistent in bull markets (high gas, high ETH price). A relayer that launches during a bull market will be profitable from day one. A relayer that launches today (low gas conditions) needs real demand.
This is why the L2 strategy is compelling. On Arbitrum at 0.01 gwei effective gas price, a deposit costs ~$0.15 in gas, and the relayer's 12% margin is ~$0.018 per transaction. To break even at $1.67/day, a relayer needs roughly 90 transactions per day -- a higher volume bar, but at prices users will actually pay: $0.15 for a private operation vs $0.005 for a public transfer. The 30x premium (vs L1's 93x premium) is more palatable.
The economics will ultimately determine whether this system gets used. The benchmark data informs the economics by establishing the gas costs, proof generation times, and infrastructure requirements. Without this data, the economic analysis would be guesswork. With it, potential relayers can make informed decisions about when and where to deploy.
Tier 6: Reliability Engineering
FEC Recovery: Making SURBs Actually Work
This is maybe the thing we're most proud of from an engineering perspective.
SURB (Single-Use Reply Block) responses in any mixnet are fragile. The response packet must traverse the entire return path -- 3 hops from exit node back to the client -- without any node dropping it. Each hop has some probability of packet loss (queue overflow, node restart, network hiccup). If any single hop drops the packet, the response is gone. The client never learns whether their transaction succeeded.
In our measurements, the raw SURB loss rate is 12.7%. That's... not great. One in eight responses vanishes.
For email, maybe you shrug it off and resend. For DeFi transaction receipts -- "did my $50,000 swap execute?" -- a 1-in-8 chance of never finding out is unacceptable.
Why Retransmission Doesn't Work for Anonymous Replies
The obvious fix is retransmission: if you don't get a response, ask again. This is what Katzenpost does with their ARQ (Automatic Repeat reQuest) protocol. But retransmission has a fundamental problem for anonymous communication: each retransmission request reveals the requester.
When a client retransmits a request, the exit node sees a second request with the same content from the same SURB return path. If the adversary controls the exit node, they now know: (1) the first response was lost, and (2) the same client is making a second request. Over multiple retransmissions, the adversary accumulates timing information about the client's retry behavior -- how long they wait before retrying, how many times they retry, whether their retry pattern correlates with specific transaction types.
More concretely: if the adversary controls both the exit node and one hop on the return path, they can intentionally drop the response to force a retransmission. Each forced retransmission gives the adversary another timing observation. After enough observations, they can correlate the client's retry pattern with their identity.
FEC eliminates this attack vector. With FEC, the exit node sends one batch of shards, and the client either reconstructs the response or it doesn't. No retry. No back-and-forth. No additional timing observations.
Reed-Solomon Forward Error Correction: The Implementation
Our FEC implementation uses Reed-Solomon erasure coding. Here's how it works:
- Encoding: The exit node takes the response (e.g., a 300KB transaction receipt) and splits it into K=11 data shards of equal size. It then generates M=4 parity shards using polynomial interpolation over GF(2^8) (the Galois field of order 256). Total: 15 shards, each about 27KB.
- Transmission: Each shard is sent as a separate SURB response packet through the mixnet. The 15 shards take 15 independent return paths (each with its own 3-hop SURB route).
- Reception: The client collects arriving shards. As soon as it has any 11 of the 15 shards (any K of N), it can reconstruct the full response using Reed-Solomon decoding. It doesn't matter which 11 shards arrive -- any combination works.
- Tolerance: The system tolerates up to M=4 lost shards. Since each shard traverses an independent path, shard losses are approximately independent. The probability that 5 or more shards are lost (out of 15, with 12.7% per-shard loss probability) is very low.
FEC recovery rates (11 data shards + 4 parity shards, 1,000 trials per loss rate):
| Packet Loss Rate | Without FEC | With FEC | Improvement |
|---|---|---|---|
| 0% | 100.0% | 100.0% | -- |
| 5% | 54.9% | 99.9% | 1.8x |
| 10% | 30.9% | 98.8% | 3.2x |
| 15% | 17.1% | 92.7% | 5.4x |
| 20% | 8.7% | 85.4% | 9.8x |
| 25% | 2.9% | 66.6% | 23.0x |
| 30% | 1.5% | 51.6% | 34.4x |
At 10% packet loss -- close to our measured 12.7% SURB loss rate -- FEC takes delivery from 30.9% to 98.8%. Without FEC, you need all 11 data shards to arrive (probability = 0.9^11 = 0.314, matching the measured 30.9%). With FEC, you need any 11 of 15 to arrive. The binomial probability of at least 11 successes out of 15 at p=0.9 is 0.987 -- matching our measured 98.8%.
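The binomial arithmetic above is easy to check. Here's a minimal sketch (Python stdlib only; `delivery_prob` is an illustrative helper, not part of the NOX codebase):

```python
from math import comb

def delivery_prob(n, k, loss):
    """P(at least k of n shards arrive), assuming independent per-shard loss."""
    p = 1.0 - loss
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Without FEC: all 11 data shards must survive a 10% loss rate.
print(round(delivery_prob(11, 11, 0.10), 3))  # 0.314
# With FEC (11 data + 4 parity): any 11 of 15 suffice.
print(round(delivery_prob(15, 11, 0.10), 3))  # 0.987
```

Both values match the measured 30.9% and 98.8% rows in the table to within simulation noise.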
The "without FEC" column corrects a common misconception. People assume "10% loss means I get 90% of my data." No. With fragmented data, 10% loss means all 11 fragments arrive only 31% of the time. This is why mixnet responses are so unreliable without FEC.
FEC Parameter Selection: The Redundancy Sweep
Choosing the right number of parity shards is a tradeoff. More parity shards = more reliable delivery but more bandwidth overhead. We ran a parameter sweep across 10 FEC ratios and 11 loss rates (5,000 trials each -- 550,000 total simulations):
| Loss Rate | 10% FEC (1 parity) | 30% FEC (3 parity) | 50% FEC (5 parity) | 70% FEC (7 parity) |
|---|---|---|---|---|
| 5% | 91.2% | 99.8% | 100.0% | 100.0% |
| 10% | 74.0% | 97.4% | 99.9% | 100.0% |
| 15% | 53.7% | 91.2% | 98.7% | 99.8% |
| 20% | 37.0% | 79.0% | 95.2% | 99.3% |
| 25% | 23.5% | 63.5% | 88.6% | 97.0% |
| 30% | 14.0% | 46.8% | 77.2% | 91.2% |
Our production configuration uses 30% FEC (11 data + 4 parity = 15 total shards, 1.36x bandwidth overhead). This achieves 97.4% delivery at 10% loss and 91.2% at 15% loss. We chose this over 50% FEC (which would give 99.9% at 10% loss) because the additional parity shards each consume a SURB, and SURBs are expensive to create and manage.
The optimal ratio sweep data is in fec_ratio_sweep.json with 550,000 data points. For our measured 12.7% loss rate, 30% FEC (4 parity shards) is the sweet spot: the marginal improvement from adding a 5th parity shard is small (~1.5% better delivery) while the bandwidth cost is meaningful (1.45x vs 1.36x overhead).
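The sweep methodology is simple enough to sketch. The toy simulation below is our own illustrative code (reduced trial count, independent per-shard losses assumed), not the actual sweep harness behind fec_ratio_sweep.json:

```python
import random

def delivery_rate(data_shards, parity_shards, loss_rate, trials=5000, seed=42):
    """Fraction of simulated responses where at least `data_shards` of the
    data+parity shards survive independent per-shard loss."""
    rng = random.Random(seed)
    n = data_shards + parity_shards
    delivered = 0
    for _ in range(trials):
        arrived = sum(1 for _ in range(n) if rng.random() >= loss_rate)
        if arrived >= data_shards:
            delivered += 1
    return delivered / trials

# Our production point: 11 data shards at the measured 12.7% loss rate,
# sweeping the parity count around the chosen 4.
for parity in (3, 4, 5):
    print(parity, delivery_rate(11, parity, 0.127))
```

Running this shows the same diminishing-returns shape as the sweep: the jump from 3 to 4 parity shards is much larger than the jump from 4 to 5.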
FEC vs ARQ: The Latency-Reliability Tradeoff
We compared FEC against ARQ (Automatic Repeat reQuest -- the "just resend it" approach that Katzenpost uses). With 11 data + 6 parity shards for FEC, and 5 max retries for ARQ, across 10,000 trials:
| Loss Rate | FEC Delivery | ARQ Delivery | ARQ Mean Round Trips | ARQ Bandwidth (shards) |
|---|---|---|---|---|
| 0% | 100.0% | 100.0% | 1.0 | 11.0 |
| 5% | 100.0% | 100.0% | 1.5 | 11.6 |
| 10% | 99.9% | 100.0% | 1.8 | 12.2 |
| 20% | 96.6% | 99.9% | 2.4 | 13.8 |
| 30% | 77.7% | 99.3% | 3.0 | 15.6 |
| 40% | 45.2% | 95.4% | 3.7 | 18.3 |
| 50% | 16.0% | 84.6% | 4.5 | 21.5 |
ARQ achieves higher delivery rates at extreme loss levels (95.4% at 40% loss vs FEC's 45.2%), but at a significant latency cost. At 20% loss, ARQ averages 2.4 round trips -- 1.4 more than FEC's single trip. In a mixnet where each round trip is 200-400ms (SURB RTT), that's 280-560ms of additional latency. FEC delivers in one round trip, no acknowledgment needed, no back-and-forth.
The crossover point is around 25% loss. Below that, FEC wins on latency with comparable reliability. Above that, ARQ wins on reliability but with multi-round-trip latency. For DeFi applications with block deadlines and MEV windows, FEC's single-round-trip property is the clear winner at typical loss rates (5-15%).
The bandwidth comparison is also revealing. FEC sends a fixed 17 shards regardless of loss rate. ARQ starts at 11 shards (baseline, no retries needed) and increases as retries are needed. At 35% loss, ARQ's mean bandwidth reaches 17 shards -- the break-even point. Above 35% loss, ARQ uses more bandwidth than FEC while also having higher latency.
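The ~35% break-even figure falls straight out of the table: linearly interpolate ARQ's mean bandwidth between the 30% and 40% rows and find where it crosses FEC's fixed 17 shards. This is a rough estimate; the underlying curve is not exactly linear:

```python
# ARQ mean bandwidth (shards) from the 30% and 40% rows of the table above
loss_lo, bw_lo = 0.30, 15.6
loss_hi, bw_hi = 0.40, 18.3
fec_shards = 17  # FEC sends a fixed 11 data + 6 parity shards

crossover = loss_lo + (fec_shards - bw_lo) * (loss_hi - loss_lo) / (bw_hi - bw_lo)
print(f"break-even loss rate ~ {crossover:.1%}")  # ~ 35.2%
```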
As far as we know, no other mixnet uses FEC for SURB responses. Katzenpost uses ARQ. Nym doesn't appear to use either -- their SURB responses degrade silently under loss. If someone's beaten us to FEC in a mixnet, we'd genuinely love to know, but we looked hard and came up empty.
Why Reed-Solomon and Not Fountain Codes
A reasonable alternative to Reed-Solomon FEC is rateless erasure coding -- fountain codes like RaptorQ or LT codes. These codes have a useful property: the sender can generate an unlimited number of encoding symbols, and the receiver can reconstruct from any K symbols (slightly more than K, actually -- the overhead is about 2-5% for RaptorQ). This means you don't need to pre-commit to a fixed redundancy ratio.
We also considered a hybrid approach: send K data shards initially, then generate additional repair symbols on-demand if the receiver requests them (via a new SURB). This combines FEC's efficiency for low loss rates with ARQ's ability to recover from high loss. We didn't implement it because the additional complexity (managing multiple rounds of SURB exchanges) outweighed the benefit for our typical loss rates. But it remains a potential future optimization.
We chose Reed-Solomon over fountain codes for three reasons:
- Fixed shard count. In our system, the exit node pre-generates SURB return packets for each shard. SURBs are expensive to create (each one includes routing headers for 3 hops). With Reed-Solomon, we know exactly how many SURBs we need (N = K + M). With fountain codes, the sender could theoretically generate unlimited shards, but the SURB budget is finite -- you can't embed unlimited return routes in the original request.
- Simplicity. Reed-Solomon over GF(2^8) is conceptually simple and has battle-tested implementations. The reed-solomon-erasure Rust crate we use is widely deployed. Fountain codes add complexity (intermediate symbols, degree distributions, belief propagation decoding) that we didn't need for our shard counts (11-17).
- Small N. Reed-Solomon is optimal for small numbers of shards. At N=15 shards, the encoding and decoding cost is negligible (microseconds). At N=10,000, Reed-Solomon's O(N^2) decoding becomes expensive and fountain codes' O(N*log(N)) decoding wins. We're nowhere near that scale.
If we ever need to send responses that fragment into hundreds of shards (which would happen with very large responses or very high loss rates), fountain codes would be the right choice. For our current 300KB response / 15 shard configuration, Reed-Solomon is simpler and optimal.
FEC Implementation Details
The FEC codec is implemented in nox-core as the FecCodec struct. The key parameters:
- Response size: 300KB (approximately the size of a DeFi transaction receipt with all Merkle proofs, event logs, and confirmation data)
- Shard size: response_size / data_shards = 300KB / 11 ≈ 27KB per shard
- Shard packet size: Each shard fits within one Sphinx packet (32KB max payload). This is important -- if shards were larger than the Sphinx payload, we'd need packet fragmentation within FEC, adding another layer of complexity.
- SURB budget: Each shard requires one SURB for the return path. 15 SURBs per response. The client must include 15 SURBs in the original request, which means the request packet must be large enough to carry all 15 SURB routing headers. This is why 32KB packets matter -- with 2KB packets (Nym/Katzenpost default), you'd struggle to fit 15 SURBs.
The shard-to-Sphinx-packet packing is not trivial. Each shard gets a 4-byte header (shard index, total_data_shards, total_parity_shards, sequence_number) followed by the shard data. The receiver uses the headers to identify which shards have arrived and which are missing, then feeds the present shards to the Reed-Solomon decoder.
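As a sketch of that header layout: with four fields in a 4-byte header, each field gets one byte (capping shard counts and sequence numbers at 255, ample for 15 shards). The field names and exact layout below are our reading of the description, not the literal nox-core wire format:

```python
import struct

# shard_index, total_data_shards, total_parity_shards, sequence_number
HEADER_FMT = "!BBBB"  # 4 bytes, one byte per field

def pack_shard(index, data_shards, parity_shards, seq, payload: bytes) -> bytes:
    """Prepend the 4-byte FEC header to a shard's bytes."""
    return struct.pack(HEADER_FMT, index, data_shards, parity_shards, seq) + payload

def unpack_shard(packet: bytes):
    """Split a received packet back into (header fields, shard bytes)."""
    fields = struct.unpack_from(HEADER_FMT, packet)
    return fields, packet[struct.calcsize(HEADER_FMT):]

header, body = unpack_shard(pack_shard(7, 11, 4, 0, b"shard-bytes"))
print(header, body)  # (7, 11, 4, 0) b'shard-bytes'
```

The receiver reads `index` from each arriving packet to know which shard slots are filled, then hands the populated slots to the Reed-Solomon decoder.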
Real-World HTTP Proxying: What Privacy Actually Costs
All the numbers above are synthetic benchmarks -- controlled experiments measuring specific components. But what does the mixnet feel like for real HTTP requests? We proxied actual HTTPS requests through the mixnet (5 nodes, 3 hops, 5 runs per target) and compared against direct requests:
| Target | Category | Direct Median (ms) | Mixnet Median (ms) | Overhead Factor | Response Size |
|---|---|---|---|---|---|
| httpbin.org/get | API (JSON) | 378 | 1,097 | 2.9x | 221B |
| httpbin.org/bytes/1k | Binary (1KB) | 299 | 1,197 | 2.5x | 1,024B |
| httpbin.org/bytes/10k | Binary (10KB) | 248 | 1,300 | 2.5x | 10,240B |
| httpbin.org/bytes/100k | Binary (100KB) | 565 | 2,341 | 3.5x | 102,400B |
| httpbin.org/bytes/1m | Binary (1MB) | 602 | 2,057 | 2.7x | 102,400B* |
| Cloudflare DNS-over-HTTPS | DNS | 6 | 374 | 58x | 251B |
| httpbin.org/ip | API (minimal) | 388 | 1,223 | 2.8x | 32B |
*The 1MB request caps at 100KB received due to Sphinx packet payload limits.
The overhead factor tells the story. For typical API requests (1-10KB), the mixnet adds 2.5-2.9x latency. A request that takes 300ms directly takes about 1.1 seconds through the mixnet. For a DeFi application checking a price oracle or submitting a transaction, this is the actual cost of metadata privacy.
Two patterns stand out:
The constant floor. Small requests (Cloudflare DNS at 6ms direct) get hit hardest in percentage terms (58x overhead) because the mixnet adds a minimum ~250ms of latency regardless of payload size. This minimum comes from the 3-hop forward path, the exit node processing, the external HTTP request, and the 3-hop SURB return path. For small requests the overhead is roughly constant in absolute terms -- a few hundred milliseconds to under a second -- which means it matters less as the base request gets slower.
The large payload ceiling. At 100KB, the overhead factor rises to 3.5x because the mixnet must fragment the response into multiple Sphinx packets. Each shard traverses the return path independently, and the client must wait for enough shards to arrive for Reed-Solomon reconstruction. The 1MB request demonstrates the current limit: only 100KB of the response is delivered because the SURB budget (allocated at request time) limits how many return shards can be sent.
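The constant-floor pattern is easiest to see by comparing absolute rather than relative overhead. The medians below are copied from the proxy table (note the published overhead factors are computed across all runs, so they don't equal simple median ratios):

```python
# (direct_median_ms, mixnet_median_ms) for the small-payload targets above
medians = {
    "Cloudflare DNS-over-HTTPS": (6, 374),
    "httpbin.org/get": (378, 1097),
    "httpbin.org/ip": (388, 1223),
}
for target, (direct, mixnet) in medians.items():
    print(f"{target}: +{mixnet - direct} ms added by the mixnet")
```

The relative factor swings from ~3x to 58x across these targets, but the absolute cost stays in the same few-hundred-millisecond band.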
For DeFi specifically, most on-chain operations involve small payloads (transaction submissions are <1KB, price queries are <500B). The 2.5-3x overhead puts private DeFi interactions at 1-1.5 seconds for typical API calls -- well within the tolerance of users who are already waiting 9-11 seconds for ZK proof generation. The mixnet latency is noise compared to the proof.
Competitive Comparison: The Full Picture
Feature Matrix
| Feature | NOX | Nym | Katzenpost | Tor | Loopix |
|---|---|---|---|---|---|
| Packet format | Sphinx | Sphinx + Outfox | Sphinx (NIKE + KEM) | Onion cells | Sphinx |
| Cover traffic | Server-side | Full | Full | None | Full (design) |
| SURB support | Yes + FEC | Yes | Yes + ARQ | No | Yes (design) |
| Replay protection | Bloom filter | Bloom + deferral | Tag-based | Circuit nonce | Tag-based |
| Forward secrecy | No | Yes (epoch) | Yes (epoch) | Yes (circuit) | No |
| Post-quantum | No | No | Yes (Xwing, ML-KEM) | No | No |
| DeFi integration | Yes | No | No | No | No |
| FEC for SURBs | Yes | No | No | No | No |
| Published benchmarks | Yes (33 JSON) | No | Partial (CI) | Yes (Metrics) | Paper only |
| Production network | No | Yes (~700 nodes) | No | Yes (~7K relays) | No |
| Packet size | 32 KB | ~2 KB | Variable | 512 B | N/A |
The Damning Comparison: Published Benchmark Data
| Project | Per-Hop Benchmarks | E2E Throughput | Latency Distributions | Privacy Metrics | Attack Simulations | Raw Data Files |
|---|---|---|---|---|---|---|
| NOX | Yes (Criterion + integration) | Yes (multi-process, zero loss) | Yes (CDFs, 6 delay configs) | Yes (entropy, correlation, unlinkability) | Yes (5 attack types) | 33 JSON files |
| Nym | No | No | No | No | No | No |
| Katzenpost | Yes (18 cipher suites, CI) | No | No | No | No | Partial (CI artifacts) |
| Tor | N/A (different architecture) | Bandwidth only | Network-level (OnionPerf) | External researchers only | External researchers only | Yes (metrics portal) |
| Loopix | Paper only (2017) | Paper claim only (>300 msg/s) | "Seconds" (no percentiles) | Theoretical (Section 5) | Theoretical | No |
Nym has raised $94.5M and runs 550+ production nodes. They have zero published benchmark results. Not "limited." Not "partial." Zero. Their Sphinx crate has two Criterion benchmark functions that have never had results published. Their repository contains scaffolding for CPU cycle measurement that was never completed. Their documentation references performance in qualitative terms only.
For a network that people are supposed to trust with their metadata, this is a remarkable gap.
We want to preempt a defense we've heard before: "Nym doesn't need to publish benchmarks because they're a production network and the network speaks for itself." The problem with this argument is that you can't evaluate a privacy network by using it. If you send a message through Nym and it arrives, you know the network delivers messages. You don't know: (1) what per-hop latency the message experienced, (2) what anonymity set size protected it, (3) what timing correlation existed between your send and the delivery, (4) whether a passive observer could link your message to its recipient, or (5) whether the mixing was adequate for your threat model. These properties are invisible to end users. They can only be assessed through systematic measurement -- the kind of measurement Nym has never published.
The closest analog in traditional software engineering is security audits. We don't accept "the software works, therefore it's secure" as an argument. We require audits, penetration testing, and published vulnerability reports. Privacy infrastructure should be held to the same standard: "the mixnet delivers messages, therefore it provides privacy" is not sufficient. Show the entropy measurements. Show the timing correlation data. Show the attack resistance. Show something.
Competitor Deep-Dive
Nym ($94.5M raised, ~700 nodes, production since 2021): The closest architectural comparison to NOX -- both are Loopix-derived Sphinx mixnets with stratified topologies and Poisson mixing. What they have that we don't: a production network, Outfox (an AEAD-based alternative packet format for stratified topologies that avoids blinding factors), Noise protocol handshakes for all connections, odd/even key rotation for forward secrecy, and ecash-based bandwidth credentials. What they lack: any published performance data, DeFi integration, and FEC for SURB responses.
Outfox (Rial et al., 2025) deserves special attention. It's an alternative to Sphinx designed specifically for stratified topologies where the adversary doesn't control the infrastructure. By using AEAD (ChaCha20-Poly1305) instead of stream cipher + MAC, Outfox eliminates blinding factors and simplifies the packet format.
Here's a detailed performance comparison from the Outfox paper alongside our data:
| Packet Format | Per-Hop (us) | Key Exchange | Symmetric | Post-Quantum | Blinding Required |
|---|---|---|---|---|---|
| NOX Sphinx | 30.17 | X25519 ECDH | AES-CTR + HMAC | No | Yes |
| NOX Sphinx (integ) | 61.6 | X25519 ECDH | AES-CTR + HMAC | No | Yes |
| Outfox X25519 | ~31 | X25519 | ChaCha20-Poly1305 | No | No |
| Outfox ML-KEM-768 | ~243 | ML-KEM-768 | ChaCha20-Poly1305 | Yes | No |
| Katzenpost KEM | 55.7 | X25519 KEM | Stream + MAC + SPRP | No | No |
| Katzenpost Xwing | 172.6 | ML-KEM + X25519 | Stream + MAC + SPRP | Yes | No |
The Outfox X25519 number (~31 us) is nearly identical to our Criterion measurement (30.17 us). This makes sense -- both perform the same underlying X25519 scalar multiplication. The difference is that Outfox doesn't need blinding (saving the 22.9 us blinding step we measured in our per-hop breakdown), but it uses ChaCha20-Poly1305 AEAD instead of separate AES-CTR + HMAC (which adds a small amount back). The costs roughly cancel, landing at similar per-hop times.
The architectural difference is what matters: Outfox's no-blinding design means each hop can use a fresh key encapsulation rather than deriving shared secrets from blinded keys. This eliminates the group exponentiation that accounts for 37.2% of our per-hop cost. But Outfox achieves this by assuming a stratified topology with honest mix operators -- if an adversary controls a mix node, they can break the security properties that Sphinx's blinding factors provide. Our threat model (which assumes some fraction of nodes may be malicious) requires the stronger Sphinx guarantees.
If we were to implement Outfox alongside Sphinx as an alternative packet format (configurable by threat model), we'd expect per-hop processing around 35-40 microseconds in integration -- roughly a 35% latency improvement for users who trust the network infrastructure. This is a genuine option for deployment scenarios where the mix nodes are run by a known, trusted operator set (e.g., a permissioned network for institutional use). We haven't implemented it because our current focus is on the permissionless, adversarial-assumption design. But it's architecturally straightforward since both formats produce the same output: a decrypted payload at the exit node.
The post-quantum numbers are more consequential. Outfox ML-KEM-768 at 243 us/hop means a 3-hop path adds ~729 microseconds of Sphinx processing -- still under 1ms, still negligible in the DeFi latency budget (proof generation takes 9-11 seconds). Post-quantum mixnet communication is computationally feasible today. The reason we haven't prioritized it isn't performance -- it's that the post-quantum Sphinx security proofs are still being worked out, and we'd rather ship a well-understood classical design than an immature post-quantum one. Katzenpost, to their credit, is pushing ahead on this front. They'll have real-world post-quantum deployment data before we do.
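Two back-of-envelope checks on the numbers above, using our own measurements; the Outfox-style estimate is simply our integrated per-hop time minus the blinding step, which is an approximation rather than a measured figure:

```python
integrated_sphinx_us = 61.6  # NOX Sphinx per-hop, integration measurement
blinding_us = 22.9           # blinding step from our per-hop breakdown
outfox_style_estimate = integrated_sphinx_us - blinding_us
print(round(outfox_style_estimate, 1))  # 38.7 -- within the 35-40 us range quoted

mlkem_hop_us = 243           # Outfox ML-KEM-768 per-hop cost
print(mlkem_hop_us * 3)      # 729 -- 3-hop post-quantum path, still under 1 ms
```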
Recent work by Cao & Green (2026, IACR ePrint 2026/101) exposes vulnerabilities in Nym's reputation scoring system. Low-stake nodes can manipulate the reputation system to degrade honest nodes' scores, eventually forcing honest nodes out of the topology and enabling de-anonymization. This is a fundamental design flaw in Nym's economic model, not a fixable bug. NOX doesn't use reputation scoring -- our NoxRegistry controls admission directly, sidestepping this attack vector (at the cost of centralized admission control, which has its own tradeoffs).
Katzenpost (Research project, Go, ~140 GitHub stars): The most academically rigorous implementation. Post-quantum cipher suites (Xwing, ML-KEM-768, CTIDH-512, FrodoKEM, sntrup4591761) via their pluggable HPQC library. Per-hop SPRP (wide-block cipher) for decryption oracle resistance -- a security property NOX lacks. Nightly CI benchmarks with regression detection. Their architecture document is a model of clarity.
What SPRP buys Katzenpost: if an adversary can submit chosen plaintexts for Sphinx processing (a "decryption oracle" attack), the SPRP makes the outputs computationally indistinguishable from random, even if the adversary observes the decrypted packets. Without SPRP (like NOX), an adversary with a decryption oracle can potentially distinguish packets by their content patterns. This is a real security property, not theater. We don't have it, and we should be honest about that.
What Katzenpost lacks: production deployment, DeFi integration, end-to-end performance data, FEC for SURB responses, and throughput benchmarks. Their benchmarks cover the Sphinx layer but nothing above it.
Tor (~7,000 relays, 17+ years, most deployed anonymity system): Fundamentally different architecture -- circuit-based onion routing, no mixing, no cover traffic. Included as a latency baseline, not a security comparison. Tor's metrics infrastructure (OnionPerf, since 2009) is the gold standard for real-world anonymity network monitoring. Their Tor Metrics portal publishes continuous measurements from vantage points on four continents, with historical data going back 15+ years.
Tor explicitly does not defend against a global passive adversary -- this is a deliberate design choice, not a weakness. Tor provides low-latency anonymous web browsing. Mix networks provide metadata privacy against global adversaries. They're different tools for different threat models.
xx Network (David Chaum, cMix protocol, ~350 nodes): Uses precomputation to eliminate real-time asymmetric cryptography. The expensive operations happen in a precomputation phase (37.94 seconds for a batch of 1,000 messages), and the real-time phase uses only fast modular multiplications (3.58 seconds for the same batch). Fixed cascades (not free route selection). Reports 2-3 second latency. The precomputation approach is promising for high throughput but the fixed cascade architecture limits route diversity, and the precomputation phase creates a synchronization requirement that complicates dynamic network membership.
Let's work through the cMix numbers more carefully. The precomputation phase processes 1,000 messages in 37.94 seconds. That's 26.4 messages per second during precomputation. The real-time phase processes the same batch in 3.58 seconds -- 279 messages per second. But the real-time phase can't start until precomputation completes. If precomputation runs continuously, generating batches as fast as possible, the effective throughput is limited by the precomputation rate: 26.4 messages per second. If precomputation is done in advance (during idle periods), the burst throughput is 279 messages per second -- but only until the pre-computed batch is exhausted.
For comparison, our sustained throughput is 369 PPS with no precomputation requirement. We can process each packet independently as it arrives, with no batching requirement. cMix must collect 1,000 messages before processing them as a batch (the precomputation is batch-specific). This creates a latency-throughput tradeoff: smaller batches mean less precomputation time but more frequent precomputation phases. Larger batches amortize precomputation cost but add queuing latency (waiting for the batch to fill).
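Spelled out, with the batch figures from the cMix paper as quoted above (the pipelined-throughput model is our simplification):

```python
batch = 1000           # messages per cMix batch
precompute_s = 37.94   # precomputation phase duration
realtime_s = 3.58      # real-time phase duration

# Phases run back-to-back per batch:
serial_pps = batch / (precompute_s + realtime_s)
# Precomputation runs continuously and is the bottleneck:
pipelined_pps = batch / precompute_s
# Burst rate while pre-computed batches last:
burst_pps = batch / realtime_s

print(round(serial_pps, 1), round(pipelined_pps, 1), round(burst_pps, 1))
# 24.1 26.4 279.3
```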
The fixed cascade is the more fundamental architectural difference. In cMix, messages traverse a fixed sequence of nodes (the "cascade"). All messages in a batch go through the same nodes in the same order. This means every message in the same batch has the same path through the network, which provides mixing within the batch but no route diversity. An adversary who can observe the cascade's inputs and outputs sees all messages enter and exit through the same nodes -- the anonymity comes entirely from the temporal mixing within the batch, not from path diversity.
Our route diversity finding (Section: "The Novel Route Diversity Finding") suggests this is a meaningful limitation. In stratified topologies with random route selection, path diversity provides most of the anonymity (96-98% of maximum entropy at 0ms delay). cMix's fixed cascade doesn't benefit from this effect.
Vuvuzela / Stadium (MIT/Stanford research, 2015/2017): A different approach entirely -- not Sphinx-based, not Loopix-based. Uses differential privacy noise and oblivious RAM to achieve provable metadata privacy. Stadium reports 68,000 messages per second for 1 million users with 37-second latency. The throughput is orders of magnitude higher than any Sphinx-based system, but the 37-second latency makes it unsuitable for interactive applications. The design also requires a fixed set of servers (not a decentralized open network), which limits deployability.
The 68,000 messages per second number sounds incredible until you unpack it. Stadium achieves this by parallelizing across multiple server groups, each handling an independent set of users. The per-group throughput is lower. And the 37-second latency is the full end-to-end time for a message to be delivered, including multiple rounds of noise addition, shuffling, and oblivious routing. For a DeFi swap that needs to confirm before the next block (12 seconds on Ethereum), 37 seconds is too slow.
Stadium's design also requires all participating servers to be online simultaneously and to run the protocol in lockstep. If one server fails, the entire round is invalidated and must be restarted. This is a fundamentally different failure model from our system, where individual node failures cause only localized packet loss (mitigated by FEC) rather than global protocol failure.
The Anonymity Trilemma Through Benchmark Data
Das et al. (IEEE S&P 2018) formalized the "Anonymity Trilemma": you can have at most two of (1) strong anonymity, (2) low bandwidth overhead, (3) low latency. Let's see where our benchmark data places NOX and its competitors in this trilemma space:
| System | Anonymity | Bandwidth Overhead | Latency | Trilemma Region |
|---|---|---|---|---|
| NOX (0ms) | H=3.25 bits | 1.0x (no cover) | p50=89ms | High anon + low BW + low lat |
| NOX (50ms) | H=3.21 bits | 1.0x (no cover) | p50=181ms | High anon + low BW + med lat |
| NOX + cover | H=3.25 bits | 2.59x (5 PPS) | p50=89ms | High anon + med BW + low lat |
| Tor | No mixing entropy | 1.0x | p50=85ms (EU) | Med anon + low BW + low lat |
| Loopix | Theoretical high | ~3-4x | "seconds" | High anon + med BW + high lat |
| Stadium | Provable | ~5x | 37 seconds | Max anon + high BW + high lat |
Our data suggests NOX at 0ms mixing delay sits in a region the trilemma says shouldn't exist: high anonymity (97.7% entropy), low bandwidth overhead (no cover traffic), and low latency (p50=89ms). This is because the route diversity finding partially circumvents one arm of the trilemma. The trilemma assumes that anonymity comes from mixing delay and cover traffic. If route diversity provides most of the anonymity, you can reduce mixing delay (improving latency) without proportionally reducing anonymity.
But -- and this is a big but -- this only works if the adversary is limited to passive observation of network inputs and outputs. The trilemma is proven for a specific adversary model (global passive adversary with unbounded observation time). Our entropy measurements test against a weaker adversary. The unlinkability results (KS test) show that a timing-correlation adversary can distinguish traffic at low delays. So the trilemma holds if you strengthen the adversary model. Our route diversity finding is real, but it doesn't break the trilemma -- it exploits a gap between the trilemma's assumed adversary and the adversary our entropy metric tests against.
This is why honest benchmarking requires publishing multiple privacy metrics. Entropy alone would make us look like we've broken the trilemma. Unlinkability alone would make 0ms delay look terrible. Together, they paint the accurate picture: route diversity provides strong anonymity set quality, but timing-based attacks require mixing delay to defend against.
Benchmark Artifacts
For reproducibility, here is the complete list of benchmark data files in scripts/bench/data/:
Cryptographic Primitives (Tier 1)
- per_hop_breakdown.json -- Per-hop Sphinx processing breakdown (1,449 samples, 6 phases)
Network Performance (Tier 2)
- throughput_sweep.json -- In-process throughput at 5 injection rates
- latency_cdf.json -- 4,916 individual latency measurements, 1ms delay
- latency_cdf_nodelay.json -- Latency CDF at 0ms delay
- latency_vs_delay.json -- Latency percentiles across 6 delay configurations
- surb_rtt.json -- SURB round-trip times (500 packets)
- surb_rtt_1ms.json -- SURB RTT at 1ms delay
- surb_rtt_fec.json -- SURB RTT with FEC enabled
- http_proxy.json -- Real-world HTTP proxy latency measurements
Multi-Process Scale (Tier 3)
- mp_throughput_sweep.json -- Multi-process throughput at 5 injection rates
- mp_latency.json -- Multi-process latency distribution
- scaling.json -- Scaling behavior from 5 to 50 nodes
- concurrency_sweep.json -- Concurrency parameter tuning
Anonymity Metrics (Tier 4)
- entropy.json -- Shannon entropy across 9 delay configurations
- entropy_vs_users.json -- Entropy scaling with user count
- entropy_vs_cover.json -- Entropy vs cover traffic rate
- unlinkability.json -- KS test p-values for unlinkability
- timing_correlation.json -- Timing correlation analysis
- cover_traffic.json -- Cover traffic overhead measurements
- cover_analysis.json -- Active/idle distinguishability
- combined_anonymity.json -- Combined anonymity metrics
- traffic_levels.json -- Traffic level analysis
Attack Resistance (Tier 5)
- attack_sim.json -- n-1, intersection, compromised node results
- pow_dos.json -- PoW anti-spam measurements across 6 difficulty levels
- replay_detection.json -- Bloom filter and Sled performance
DeFi Pipeline (Tier 6)
- defi_pipeline.json -- End-to-end DeFi operation timings
- gas_profile.json -- Per-circuit gas usage and proof generation times
- economics.json -- Relayer economics across 72 scenarios (8 circuits x 3 ETH prices x 3 gas prices)
- fec_recovery.json -- FEC delivery rates across 9 loss rates
- fec_vs_arq.json -- FEC vs ARQ comparison (10,000 trials per loss rate)
- fec_ratio_sweep.json -- FEC parameter sweep (550,000 simulations)
- operational.json -- Startup time, memory usage, disk I/O
- competitors.json -- Competitive reference data with citations
Total: 33 JSON files. Every file includes hardware spec, git commit hash, timestamp, and parameters. Every file can be regenerated with run_all.sh.
Charts are rendered from these JSON files using Python (matplotlib). Each data file produces a paired PNG + SVG chart. The chart generation scripts are in scripts/bench/charts/.
CI Integration
Currently, benchmarks are run manually. We're building CI integration (Sprint 1 of our release campaign) that will:
- Run the full benchmark suite on every merge to main
- Compare results against the previous run
- Flag regressions above 10% with a failing check
- Archive results as CI artifacts for historical tracking
Katzenpost's CI benchmark pipeline (nightly cron, benchmark-action/github-action-benchmark, 110% regression threshold) is the model we're following. They're ahead of us on CI integration. We give them credit for that.
How to Reproduce Everything
If you want to verify our claims, here's the exact process:
# Prerequisites: Rust 1.93+, Python 3.10+, matplotlib, Anvil (from Foundry)
# Clone and checkout the benchmark commit
git checkout db6ea3c
# Tier 1: Cryptographic micro-benchmarks (takes ~2 minutes)
cargo bench --bench crypto_bench --release
# Tier 2-3: Network benchmarks (takes ~10 minutes)
# Start with in-process benchmarks
cargo run --bin nox_bench --release -- throughput
cargo run --bin nox_bench --release -- latency --packets 5000
cargo run --bin nox_bench --release -- latency-vs-delay
cargo run --bin nox_bench --release -- surb-rtt
cargo run --bin nox_bench --release -- per-hop
cargo run --bin nox_bench --release -- operational
# Multi-process benchmarks (takes ~15 minutes)
cargo run --bin nox_multiprocess_bench --release -- throughput
cargo run --bin nox_multiprocess_bench --release -- scaling
# Tier 4: Privacy analytics (takes ~5 minutes)
cargo run --bin nox_privacy_analytics --release -- entropy
cargo run --bin nox_privacy_analytics --release -- unlinkability
cargo run --bin nox_privacy_analytics --release -- cover-traffic
cargo run --bin nox_privacy_analytics --release -- cover-analysis
cargo run --bin nox_privacy_analytics --release -- attack-sim
# Tier 5: DeFi pipeline (requires Anvil running, takes ~30 minutes)
# In one terminal: anvil --port 8545
cargo run --bin micro_mainnet_sim --release --features dev-node
# Tier 6: FEC and economics (takes ~5 minutes)
cargo run --bin nox_economics --release -- fec
cargo run --bin nox_economics --release -- economics
# Generate all charts
python3 scripts/bench/charts/generate_all.py
# Or just run everything:
./scripts/bench/run_all.sh
Total reproduction time: approximately 1 hour on a modern 8-core machine. The longest step is micro_mainnet_sim (30 minutes), which runs real ZK proof generation for 8 circuit types.
Every JSON file includes a git_commit field. If your results differ significantly from ours (>15% for timing benchmarks), check that: (1) you're on the same commit, (2) you're compiling with --release, (3) your CPU supports AES-NI (affects symmetric crypto by 10-50x), and (4) nothing else is competing for CPU during the benchmark.
The State of Mixnet Benchmarking in 2026
Before we get to what our benchmarks don't show, let's step back and assess the field.
It's March 2026. Mix networks have been studied for 45 years since Chaum's 1981 paper. The Loopix paper is 9 years old. Nym has been in production for 5 years. And the state of empirical benchmarking across the field is... sparse.
Here's a complete inventory of publicly available mixnet performance data, across all implementations, as of today:
Tor (most data): OnionPerf continuous measurement since 2009. Circuit latencies from 6+ vantage points, updated daily. Relay bandwidth measurements. User count estimates. Directory authority statistics. All publicly accessible at metrics.torproject.org with downloadable CSVs. However: all measurements are at the network level (circuit latency, bandwidth). No per-operation crypto benchmarks. No privacy analytics. All anonymity research comes from external academics, not the Tor Project.
Katzenpost (partial): Nightly CI runs benchmarking Sphinx creation and unwrap for 18 cipher suites. Results visible in CI artifacts. Regression detection at 110% threshold. This is the best per-operation benchmark pipeline in the field. However: no throughput, no latency distributions, no privacy metrics, no attack simulations. The CI benchmarks cover one component (Sphinx processing) at one level (micro-benchmark).
Loopix (paper only): The 2017 USENIX Security paper reports ">300 msg/s" throughput and "seconds" latency. These numbers are 9 years old, from a Python implementation on 2017 AWS instances. No raw data. Not reproducible.
Nym (nothing): Two Criterion benchmark functions exist in nymtech/sphinx. Results never published. CPU cycle measurement scaffolding never completed. No throughput data. No latency data. No privacy data. Five years of production operation, $94.5M in funding, zero published benchmarks.
cMix / xx Network (paper only): The 2016 ePrint paper reports precomputation (37.94s for 1000 messages) and real-time (3.58s for 1000 messages) performance. These numbers are from a 2016 implementation. The production xx Network claims "2-3 second" latency in marketing materials but provides no systematic measurement.
NOX (this post): 33 JSON data files covering 6 tiers. All reproducible. All with hardware specs, git commits, and methodology. Published here for the first time.
That's it. That's the entire publicly available empirical performance data for mix networks in 2026.
If this seems thin for a field that's received hundreds of millions of dollars in funding and aims to protect billions of users' metadata, it is.
What Good Mixnet Benchmarking Would Look Like
The field needs an "MLPerf for mixnets": a standard methodology that every project follows. We propose six tiers: (1) micro-benchmarks with warm-up, outlier detection, and confidence intervals; (2) system throughput at multiple loss rates with latency distributions; (3) privacy metrics (Shannon entropy, KS test, timing correlation); (4) attack resistance including n-1 and intersection attacks; (5) raw data publication with hardware specs and git commits; (6) reproducibility via a single script. This maps directly to our 6-tier benchmark suite described above. The barrier is effort and willingness, not technology: every component exists today in open-source frameworks. A community-maintained comparison site (like TechEmpower for web frameworks) would be transformative. We'd contribute our data on day one.
What the Benchmarks Don't Show
We've been explicit throughout this post about the limitations of individual measurements. But it's worth collecting all the caveats in one place, because the things we haven't measured are as important as the things we have.
What We Haven't Benchmarked
Long-term behavioral analysis. Our benchmarks are snapshots: 1,000-10,000 packets over seconds to minutes. A real adversary observes over weeks and months. We haven't measured how privacy degrades under sustained observation -- the LLMix-style cumulative leakage that entropy metrics miss. This is the hardest thing to benchmark (you need realistic user behavior models) and nobody does it well, but it's where the real privacy failures live.
Adversary with ML capabilities. Our attack simulations use simple statistical tests (KS test, entropy computation). A sophisticated adversary would use neural networks trained on traffic patterns. The MixMatch paper (Oldenburg et al., 2024, PoPETs Best Student Paper) demonstrated flow matching for mixnet traffic analysis. We haven't tested our system against ML-based attacks. We should.
Multi-week stability. Our longest test runs for about 30 minutes (the full micro_mainnet_sim). We haven't measured memory leaks over days of operation, Bloom filter accuracy degradation over millions of packets, or gossipsub mesh stability over weeks. Production systems need endurance testing, and we haven't done it.
Geographic diversity effects. All our benchmarks run on localhost. In a real deployment, nodes would be distributed across data centers with 1-100ms inter-node latency. The impact on throughput is obvious (network RTT becomes a bottleneck instead of CPU), but the impact on anonymity is less clear. Geographic timing correlations could reduce effective route diversity. We haven't quantified this.
Client-level privacy under realistic usage. Our entropy measurements assume uniform traffic from all senders. Real DeFi users have bursty, non-uniform behavior: a whale making one large trade per week looks very different from a trader making hundreds of small swaps per day. The MOCHA simulator suggests that client-level privacy can be significantly worse than message-level privacy under realistic usage patterns.
Forward secrecy and key compromise. We don't have forward secrecy (Nym and Katzenpost do). We haven't benchmarked or analyzed the impact of key compromise on historical traffic -- if a node's long-term key is compromised, all past traffic through that node is potentially decryptable. This is a known gap in our security model.
The forward secrecy gap is particularly concerning for a DeFi application. Consider: an adversary records encrypted Sphinx packets for months. Later, they compromise a node's long-term X25519 key (through a hack, a subpoena, or a disgruntled employee). With that key, they can retroactively decrypt every Sphinx layer that node processed. If the node was on the path for a $10M transfer, the adversary now knows the sender and recipient of that transfer -- months after it happened.
Nym mitigates this with odd/even key rotation: nodes cycle their keys at epoch boundaries, and old keys are deleted. Even if the current key is compromised, past traffic encrypted with deleted keys is unrecoverable. Katzenpost does the same with their PKI system. We don't do this yet because our epoch transition mechanism isn't implemented. It's straightforward engineering (generate new key, announce via gossipsub, old key expires after 2 epochs), but it requires coordination with the registry contract and topology manager. It's on the roadmap for post-launch.
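The rotation logic itself is small; the hard part is the coordination with the registry and topology manager. A minimal sketch of the keystore side, assuming the two-epoch validity window described above (the class and method names are ours, for illustration, not NOX code):

```python
class RotatingKeystore:
    """Sketch of epoch-based key rotation: a key stays valid for the current
    and previous epoch, then is deleted. Deleting old keys is what provides
    forward secrecy -- traffic encrypted to a deleted key is unrecoverable
    even if every currently held key later leaks."""

    def __init__(self):
        self.keys = {}  # epoch -> key material

    def advance(self, epoch, new_key):
        # Install the new epoch's key (in the real system this is generated
        # locally and announced via gossipsub / the registry contract).
        self.keys[epoch] = new_key
        # Drop everything older than the previous epoch.
        for old in [e for e in self.keys if e < epoch - 1]:
            del self.keys[old]

    def valid_epochs(self):
        return sorted(self.keys)
```

The invariant worth testing is simply that after advancing to epoch N, only keys for N and N-1 remain.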
Mobile and light client performance. All our benchmarks assume server-class hardware. A mobile client needs to construct Sphinx packets (222 microseconds on our Ryzen -- but what about an iPhone 15's A17 Pro?), manage SURB bundles, maintain cover traffic, and run the FEC decoder. We haven't profiled any of this on mobile hardware. The WASM compilation path for our crypto primitives exists (Poseidon2 and AES-128-CBC are pure Rust with no platform dependencies), but the performance characteristics could be dramatically different. Battery drain from cover traffic is a particular concern: even at the minimal 0.5 PPS rate that our cover analysis suggests, a mobile client would be sending a packet every 2 seconds continuously. On cellular data, this is both expensive and power-hungry.
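To put the mobile cover-traffic cost in concrete terms, here is a back-of-envelope calculation. The ~2 KB packet size is an assumption chosen for illustration, not a measured NOX value:

```python
PACKET_BYTES = 2048  # ASSUMED Sphinx packet size, for illustration only
COVER_PPS = 0.5      # minimal rate suggested by the cover analysis above

packets_per_day = COVER_PPS * 86_400          # one packet every 2 seconds
mb_per_day = packets_per_day * PACKET_BYTES / 1e6
mb_per_month = mb_per_day * 30

print(f"{packets_per_day:.0f} packets/day, "
      f"~{mb_per_day:.0f} MB/day, ~{mb_per_month:.0f} MB/month")
```

Even under this modest assumption, a mobile client sends tens of thousands of packets and on the order of a hundred megabytes per day, which is why cellular cost and battery drain are first-order concerns.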
Post-quantum resistance. We use X25519, which is breakable by a sufficiently large quantum computer. Katzenpost has post-quantum cipher suites; we don't. We haven't benchmarked the performance impact of switching to post-quantum cryptography, though Katzenpost's Xwing numbers (172.6 microseconds per hop) give a rough upper bound: about 5.7x our current cost.
The "harvest now, decrypt later" threat is real for mix networks. An adversary recording Sphinx packets today could decrypt them with a future quantum computer. For most web traffic, this isn't a big deal -- the content of a Google search from 2026 is unlikely to be valuable in 2036. But for financial transactions, it's different. Knowing that address 0xABC... sent $1M to address 0xDEF... in 2026 could be valuable for decades (for tax enforcement, sanctions compliance, or competitive intelligence). A privacy system protecting financial transactions arguably needs post-quantum guarantees sooner than a general-purpose messaging system.
We're watching the post-quantum KEM standardization process (ML-KEM is finalized in FIPS 203, but the hybrid designs for mixnets are still evolving). When the community converges on a standard approach for post-quantum Sphinx-like formats, we'll implement it. Katzenpost's Xwing (X25519 + ML-KEM-768) is the leading candidate -- it's a hedge that provides post-quantum security if ML-KEM is sound while maintaining classical security via X25519 if ML-KEM is broken. The 5.7x per-hop cost is acceptable given our DeFi latency budget analysis (Sphinx processing is <1% of end-to-end latency).
What the Numbers Can and Can't Tell You
Our benchmarks can tell you:
- How fast NOX processes Sphinx packets (30-62 microseconds per hop)
- How much throughput a single-machine deployment achieves (234-369 PPS)
- How latency distributes across mixing delay configurations
- What the anonymity set looks like under various configurations
- How much gas a private DeFi operation costs
- How reliably FEC delivers SURB responses
Our benchmarks cannot tell you:
- Whether NOX would survive a sophisticated, patient, well-resourced adversary
- Whether the system is stable over months of continuous operation
- Whether the privacy properties hold under realistic (non-uniform) user behavior
- Whether the system would perform well at 1,000-node scale with geographic distribution
- Whether the economic model would attract enough relayers to sustain a healthy network
We've presented the numbers with their caveats because honest benchmarking requires it. The alternative -- not publishing because the numbers aren't perfect -- is worse.
The Honest Assessment
Let's be unflinching about where we stand:
What we're confident about:
- Our Sphinx implementation is fast. 30 microseconds per hop puts us at or near the fastest Sphinx implementation measured to date (pending Nym publishing their numbers so we can actually compare). The 2x gap to integration cost is well-understood and well-instrumented.
- Our FEC system works. 98.8% delivery at 10% loss, up from 30.9% without FEC. This is a real engineering contribution that makes mixnet-based DeFi feasible.
- Our route diversity finding is real. The entropy data clearly shows that mixing delay contributes minimally to anonymity set quality in stratified topologies. Whether this finding is "novel" depends on whether it's been stated explicitly elsewhere (we haven't found it).
- Our economic model is viable. Relayers can profitably operate at current gas prices with modest transaction volume.
What we're uncertain about:
- Whether our privacy properties hold under ML-based adversaries (LLMix-class attacks). The MixMatch flow-matching approach achieved significant deanonymization against Loopix-style mixes in simulation. Our stratified topology and 3-hop constraint may provide resistance (fewer timing features to exploit), but we haven't tested this. Until someone runs an ML classifier against real NOX traffic captures, this is an open question.
- Whether our throughput scales to production-relevant node counts (500+ nodes). Our scaling data covers 5 to 50 nodes. The 50-node throughput spike (279 PPS vs 200-217 at smaller counts) suggests that scale helps by reducing per-node queue depth, but we don't know if this trend continues, plateaus, or reverses at 500 nodes where gossipsub overhead and topology coordination become significant factors.
- Whether the route diversity finding holds with geographic diversity and realistic network conditions. Our entropy measurements assume uniform inter-node latency (all on localhost). In a geographically distributed deployment, nodes in the same data center would have ~1ms RTT while cross-continent links would have 50-150ms RTT. This latency variance could create observable timing signatures that reduce effective route diversity even if the stratified topology provides theoretical uniformity.
- Whether the economic model attracts enough relayers to bootstrap a healthy network. Our profitability analysis shows positive margins at current gas prices, but it assumes transaction volume. The cold-start problem -- how do you get enough relayers when there aren't enough users, and enough users when there aren't enough relayers -- is an economics problem that no amount of benchmarking can answer.
What we know is weak:
- No forward secrecy. If a node's key is compromised, all historical traffic through that node is vulnerable. This is the gap that concerns us most. The "harvest now, decrypt later" threat means every packet we process today is a liability. Nym's odd/even key rotation is the right approach; we need to implement it.
- No post-quantum cryptography. Katzenpost is ahead here with Xwing (X25519 + ML-KEM-768). Their per-hop cost is 172.6 microseconds -- about 5.7x ours -- but this is well within our latency budget. We're waiting for community convergence on hybrid KEM designs for mixnet packet formats, but "waiting" is starting to feel like "procrastinating."
- No client-side cover traffic. Our intersection attack simulation converges to full deanonymization in 20 epochs. This is a known, well-studied weakness. The fix (clients send cover traffic during idle periods) is conceptually simple but operationally complex: it requires bandwidth from users who aren't actively transacting, which creates a tension with mobile/light clients where bandwidth and battery are constrained.
- No SPRP/wide-block cipher. A decryption oracle (adversary who can submit packets and observe whether they decrypt successfully) can distinguish real packets from crafted ones. This is a theoretical concern that becomes practical if any mix node is compromised. Lioness (used by Tor) or AEZ would close this gap, at a performance cost we haven't quantified.
- All benchmarks on localhost. No real-world distributed deployment data. This is the limitation we're most embarrassed about. Every throughput number, every latency percentile, every privacy metric in this post was measured with 0ms network RTT between nodes. Real deployments will be slower. How much slower? We don't know, and that's the honest answer.
These weaknesses aren't bugs -- they're conscious design decisions (except the last one, which is just a maturity limitation). We chose to prioritize DeFi integration, FEC reliability, and comprehensive benchmarking over forward secrecy and post-quantum resistance. Other projects made different choices. Users should evaluate both the strengths and weaknesses when choosing their tools.
The Performance Characteristics We'd Change
With the benefit of hindsight and a complete benchmark suite, here's what we'd design differently if starting from scratch:
Larger SURB bundles by default. Our current 30% FEC ratio (4 parity shards) was chosen conservatively. The SURB RTT data shows that more SURBs not only improve reliability but also latency (via the order-statistic effect). A 50% ratio (6 parity shards) would give us 99.9% delivery at 10% loss while further reducing p50 RTT. The cost is 6 more SURBs per response -- about 15% more bandwidth. Worth it.
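The reliability side of that trade-off follows from simple binomial arithmetic. The sketch below models MDS-style FEC with independent shard losses; the shard counts are illustrative assumptions, not our exact SURB parameters:

```python
from math import comb

def delivery_prob(k, p, loss):
    """P(at least k of k+p shards arrive), with shards lost independently
    with probability `loss`. A simplified model of MDS-style FEC
    (assumption: i.i.d. losses, which real networks violate)."""
    n = k + p
    arrive = 1.0 - loss
    return sum(comb(n, i) * arrive**i * loss**(n - i) for i in range(k, n + 1))

# Illustrative only: 12 data shards is an assumed count, not NOX's parameter.
for parity in (0, 4, 6):
    print(f"parity={parity}: delivery at 10% loss = "
          f"{delivery_prob(12, parity, 0.10):.4f}")
```

With these assumed counts, delivery jumps from under 30% with no parity to roughly 98% with 4 parity shards and over 99.9% with 6, which is the qualitative shape behind the recommendation above.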
Parameterized mixing delay per hop, not per network. Currently, all hops use the same Poisson delay parameter. But our entropy data shows the first hop contributes more to route diversity than the last hop (because it feeds into the most possible downstream paths). A design where the first hop has minimal delay (for latency) and the last hop has maximum delay (for timing unlinkability) could optimize both properties simultaneously. This is a well-known idea in the literature (Möller, Mladenov, 2003 -- "variable-rate stop-and-go mixes") but rarely implemented.
Pre-computed route caching. Our Sphinx packet construction takes 222 microseconds for 3 hops. This is negligible for single operations, but if a client is submitting a batch of 10 transactions (e.g., rebalancing a portfolio), it could pre-compute all 10 route selections and Sphinx headers in parallel. A BatchSphinxBuilder that amortizes the initialization cost would save about 30% per-packet for batches, based on our per-hop marginal cost analysis (76 us for hop 1, 74 us for hop 2 -- the first-hop cost includes initialization).
Native prover instead of bb.js subprocess. The biggest latency improvement would come from replacing the Node.js subprocess prover with a native Rust prover. Our proof generation takes 9-11 seconds through prove_cli.mjs (a Node.js process that loads bb.js, initializes the WASM backend, and runs the UltraHonk prover). A native implementation using the Barretenberg C++ library directly (via FFI) could realistically achieve 3-5 seconds -- the WASM overhead in bb.js is substantial. This single change would cut end-to-end DeFi latency by 30-50%.
We're pursuing the last point most aggressively because the leverage is highest. Proof generation dominates the latency budget, and a native prover would make the entire paid mixnet pipeline complete in under 5 seconds on server hardware.
Methodology
All benchmarks follow the 6-point methodology described above: hardware-specified, release-mode, warmed-up, statistically significant, raw-data-published, and script-reproducible. Privacy analytics use 1,000-10,000 trials per data point; FEC sweeps total 550,000 trials across 110 parameter combinations. Raw latency arrays are stored at full precision (4,916 individual measurements in latency_cdf.json alone) for independent analysis.
The Benchmark Numbers That Surprised Us
Before we list the caveats, it's worth reflecting on which results defied our expectations. In benchmarking, surprises are the most valuable data points -- they reveal gaps between mental models and reality.
Surprise 1: Route diversity dominating entropy (97.7% at 0ms delay). We expected entropy to be strongly correlated with mixing delay. It wasn't. This challenged our mental model of how Loopix-style systems work and led to the route diversity finding. If we'd only measured entropy at one delay setting (say, 50ms), we'd have missed this entirely. Sweeping 9 configurations was essential.
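For readers who want to check the entropy arithmetic: normalized entropy is Shannon entropy divided by log2(n), the maximum achieved by a uniform distribution. A quick sketch, using an illustrative mildly non-uniform distribution over 10 candidate senders (the sender count and probabilities are assumptions, not our benchmark data):

```python
from math import log2

def normalized_entropy(probs):
    """Shannon entropy of a sender-probability distribution, returned both
    in bits and as a fraction of the log2(n) maximum."""
    h = -sum(p * log2(p) for p in probs if p > 0)
    return h, h / log2(len(probs))

# Hypothetical adversary posterior: two slightly favored senders, eight others.
probs = [0.19, 0.13] + [0.085] * 8
h, frac = normalized_entropy(probs)
print(f"{h:.2f} bits, {frac:.1%} of maximum")
```

Note how forgiving the metric is: even this visibly skewed distribution stays near 98% of maximum, which is part of why entropy alone can mask meaningful leakage.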
Surprise 2: Multi-process being faster than in-process (369 vs 234 PPS). We expected the IPC overhead of multi-process mode to make it slower. Instead, the resource isolation benefit dominated. Each node process gets fair CPU scheduling from the OS, which prevents the mutual starvation we saw when all nodes competed within one Tokio runtime. This was a concrete validation of the "process per node" deployment architecture.
Surprise 3: FEC improving latency by 25%, not just reliability. We implemented FEC purely for reliability. The latency improvement from the order-statistic effect (taking the K-th fastest of N paths) was a bonus we didn't predict. The p50 SURB RTT dropped from 329ms to 264ms -- a 65ms improvement that came for free. This is the kind of finding that only emerges from measuring the right thing.
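The order-statistic effect is easy to demonstrate in simulation. The RTT model and shard counts below are toy assumptions, not our measured distributions; the point is only that completing on any K of N paths beats waiting for all K of K:

```python
import random, statistics

random.seed(7)

def path_rtt():
    # Toy RTT model (assumption): 150 ms base plus an exponential
    # mixing-delay component with a 100 ms mean.
    return 150 + random.expovariate(1 / 100)

def kth_of_n(k, n):
    # RTT at which the k-th fastest of n independent paths completes.
    return sorted(path_rtt() for _ in range(n))[k - 1]

trials = 20_000
no_fec = [kth_of_n(9, 9) for _ in range(trials)]   # need all 9 fragments: worst of 9
fec    = [kth_of_n(9, 13) for _ in range(trials)]  # any 9 of 13 shards suffice

print(f"p50 needing all 9 of 9:  {statistics.median(no_fec):.0f} ms")
print(f"p50 needing 9 of 13:     {statistics.median(fec):.0f} ms")
```

In this toy model the parity shards cut the median completion time substantially even though every individual path has the same RTT distribution -- latency improvement with zero change to the underlying network.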
Surprise 4: Cover traffic hurting entropy at high rates. We assumed more cover traffic would always help. It doesn't. At 10 PPS cover, entropy drops to 94.2% (from 100% at zero cover). The explanation -- that high cover rates make the cover pattern itself predictable, allowing an adversary to subtract it and find the residual real traffic -- is logical in retrospect but wasn't in any of the papers we'd read about cover traffic design.
Surprise 5: The timing correlation at 1ms delay (Pearson r = 0.9999). We knew 1ms delay wasn't much, but 0.9999 correlation was worse than expected. It means the Poisson mixing at 1ms delay is doing almost nothing for timing privacy. The system essentially passes packets through in arrival order with a tiny (millisecond-scale) jitter. Route diversity saves us from this being catastrophic (entropy is still 96.9% at 1ms), but it's a stark reminder that low delays provide route-diversity-based anonymity only, not timing-based anonymity.
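You can reproduce the shape of this result with a toy model: packets arriving in a short burst, each delayed by three independent exponential hops. The burst width and delay model are assumptions, not our benchmark's traffic, but the qualitative contrast between 1ms and 50ms delay falls out immediately:

```python
import random, statistics

random.seed(1)

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def correlation(mean_delay_ms, n=5000, burst_ms=100):
    # Toy model (assumption): a 100 ms burst of packets, each delayed by
    # three independent exponential mixing delays of the given mean.
    entry = [random.uniform(0, burst_ms) for _ in range(n)]
    exit_ = [t + sum(random.expovariate(1 / mean_delay_ms) for _ in range(3))
             for t in entry]
    return pearson(entry, exit_)

print(f"r at 1 ms mean delay:  {correlation(1):.4f}")
print(f"r at 50 ms mean delay: {correlation(50):.4f}")
```

At 1ms the delay variance is negligible next to the burst width, so entry and exit orderings are almost identical; at 50ms the delay variance dominates and the correlation collapses. The exact r values depend on the assumed burst width, which is why correlation numbers only make sense alongside the traffic pattern they were measured under.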
Each of these surprises shaped a design decision. The route diversity finding informed our recommendation for low mixing delay in DeFi applications. The multi-process result validated our deployment architecture. The FEC latency improvement justified higher FEC ratios than reliability alone would suggest. The cover traffic result set our recommended rate at 2 PPS (the KS-test sweet spot) rather than "as much as you can afford." The timing correlation result made us honest about the 50ms minimum delay requirement for timing-based privacy.
What These Numbers Are Not
We've been explicit throughout this series, but it bears repeating:
- These are local benchmarks, not geo-distributed measurements. All nodes run on a single machine. Real distributed deployment introduces network latency, packet loss, clock skew, and geographic diversity. Our numbers are best-case for cryptographic throughput and worst-case for the contention artifacts of co-located nodes.
- Cross-system comparisons are inherently imperfect. Different hardware, different languages, different measurement methodology. The ratios indicate order-of-magnitude relationships, not precise competitive advantages. Our 4.8x advantage over Katzenpost NIKE could be 2x on the same hardware, or it could be 6x. The point is it's not 1x (same performance) or 0.5x (we're slower).
- Localhost networking eliminates real network effects. TCP connections between processes on the same machine have ~100us RTT. Real internet links have 1-100ms RTT. Throughput and latency would both change in a distributed deployment -- throughput would decrease (network becomes the bottleneck instead of CPU), but latency might not change much (mixing delay already dominates at 50ms+).
- Small-scale topology. We tested 5-50 nodes. Nym runs 700. Tor runs 7,000. Performance characteristics change with scale in non-obvious ways -- larger anonymity sets are better for privacy, but more nodes mean more hops, more mixing queues, and more opportunities for packet loss. We can't extrapolate our 50-node results to 700 nodes without testing at that scale.
- Single-run gas measurements. Our gas profile data comes from a single run of each circuit type on Anvil. Gas usage is deterministic for a given contract and input, but the proof generation time can vary by 5-10% between runs depending on prover cache state and system load. We report single measurements, not averaged across multiple runs. The gas numbers are exact; the proof generation times are approximate.
- Specific traffic pattern. Our benchmarks use uniform traffic from all senders at a constant rate. This is the best case for anonymity metrics. Real traffic is bursty and non-uniform, which would reduce effective anonymity set quality. We haven't modeled realistic DeFi traffic patterns (which would involve correlated bursts around market events, power-law transaction size distributions, and time-of-day effects).
Measurement Precision
A note on precision, because it matters for reproducibility:
- Timing resolution: std::time::Instant on Linux maps to clock_gettime(CLOCK_MONOTONIC), which has nanosecond resolution. Our fastest measurement (MAC verify at 330ns) is well above the timing floor.
- JSON precision: All timing values in our JSON files are stored as 64-bit floats (f64). For nanosecond-scale measurements, this gives 15-16 significant digits -- more than enough precision. Floating-point representation error is negligible.
- Statistical confidence: Criterion.rs reports 95% confidence intervals for the mean. For our Sphinx benchmark, the 95% CI is approximately +/- 0.3 microseconds around the 30.17 microsecond mean. This means we're confident the true mean lies between 29.87 and 30.47 microseconds.
- Outlier handling: Criterion classifies outliers as "mild" (1.5-3x IQR from quartiles) or "severe" (>3x). In our Sphinx benchmark, approximately 2% of iterations are classified as mild outliers, typically caused by OS scheduling interrupts or cache evictions. Criterion includes outliers in its statistics but reports the outlier percentage so you can assess their impact.
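The classification Criterion uses is the standard Tukey fence rule, which is easy to restate in a few lines (a sketch of the rule, not Criterion's actual code):

```python
import statistics

def classify_outliers(samples):
    """Tukey-style fences: 'mild' if a sample lies 1.5-3x IQR outside the
    quartiles, 'severe' if more than 3x IQR outside. Returns (mild, severe)
    counts. Quartile interpolation details vary between implementations."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    mild = severe = 0
    for x in samples:
        dist = max(q1 - x, x - q3, 0)  # distance outside the quartile box
        if dist > 3 * iqr:
            severe += 1
        elif dist > 1.5 * iqr:
            mild += 1
    return mild, severe

# Synthetic timings (microseconds): a tight cluster, one scheduling blip,
# and one large spike.
samples = [30.1, 30.2, 30.0, 30.3, 29.9, 30.1, 30.2, 30.0, 31.6, 45.0]
print(classify_outliers(samples))
```

One caveat: exact counts depend on the quartile interpolation method, so a Python reimplementation won't match Criterion sample-for-sample -- which is another reason we publish the raw arrays.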
A Note on Comparing Across Papers
One challenge in the competitive comparison sections is that different projects measure under different conditions and publish with different levels of detail. Here's how we handle that, and the uncertainty it introduces:
Katzenpost numbers come from their nightly CI benchmarks. We know the benchmark code (sphinx_benchmark_test.go, BenchmarkSphinxUnwrap), and we know it uses Go's built-in benchmarking framework (similar rigor to Criterion). We don't know their CI hardware. We assume it's a standard GitHub Actions runner (2-core x86-64 VM, typically AMD EPYC, similar-ish to cloud VMs). If their CI uses beefier hardware, our advantage ratio shrinks. If it's wimpy VMs, our advantage grows. We estimate +/- 30% uncertainty on the ratios.
Loopix numbers come from a 2017 paper. The hardware is described as "AWS EC2" without instance type. We assume a contemporary instance (c4.xlarge or similar -- 4 vCPUs, 7.5 GiB). Python's interpretive overhead dominates (20-50x slower than compiled Rust for pure computation), so hardware differences contribute perhaps 2-3x of the 50x gap. The remaining 20-25x is genuinely from language runtime differences. Uncertainty on the 50x ratio: it's probably between 30x and 100x on equivalent hardware.
Nym numbers don't exist, so there's nothing to compare. If they published, we'd expect their per-hop time to be 30-60 microseconds (same crypto primitives, same language, likely similar implementation quality). Our advantage, if any, would be small. The interesting comparison would be in system-level metrics (throughput, latency, privacy) where architectural differences matter more than raw crypto speed.
cMix numbers are from a 2016 paper. The hardware is described as "Amazon EC2 c4.4xlarge" (16 vCPUs, 30 GiB). The batch-oriented architecture makes direct per-packet comparison misleading -- cMix amortizes costs across batches in a way that single-packet systems don't. The "3.58 seconds for 1,000 messages" in real-time phase is ~3.58ms per message, which is much slower than our 61.6 microseconds. But cMix's real-time phase uses no asymmetric cryptography (it was all done in precomputation), so the comparison is apples-to-oranges.
We present all competitor comparisons with explicit source citations and these uncertainty caveats. If any project disagrees with our characterization of their numbers, we'll update the comparison. The goal is accurate representation, not favorable optics.
NOX By the Numbers: The Complete Summary
For readers who scrolled to the end (we don't judge), here's every key measurement from this post in one place. All numbers were measured on AMD Ryzen 7 9800X3D, 30 GB RAM, Ubuntu 24.04/WSL2, Rust 1.86.0-nightly.
Cryptographic Primitives:
| Operation | Time | Method |
|---|---|---|
| Poseidon2 (1 field) | 1.73 µs | Criterion |
| AES-128-CBC encrypt (208B) | 330 ns | Criterion |
| AES-128-CBC decrypt (208B) | 365 ns | Criterion |
| BabyJubJub scalar mul | 18.2 µs | Criterion |
| X25519 ECDH | 5.87 µs | Criterion |
| Sphinx per-hop (micro) | 30.17 µs | Criterion |
| Sphinx per-hop (integration) | 61.6 µs | E2E test |
| Sphinx 3-hop creation | 222 µs | Criterion |
Network Performance:
| Metric | Value | Configuration |
|---|---|---|
| Single-process PPS | 234 | 15 nodes, 1ms delay |
| Multi-process PPS | 369-466 | 15 nodes, 5 workers |
| Peak concurrency | 50 workers | Plateaus at ~128 PPS |
| 50-node throughput | 279 PPS | Lower per-node queue depth |
| MP delivery rate | 100% | 0% loss at all tested rates |
| MP p50 latency | 202 ms | Mean 176ms, p99 305ms |
Privacy Metrics:
| Metric | Value | Configuration |
|---|---|---|
| Shannon entropy (0ms delay) | 3.25 bits | 97.7% of maximum |
| Shannon entropy (100ms delay) | 3.32 bits | 99.8% of maximum |
| Normalized entropy range | 95-99.8% | Across 2-15 users |
| Timing correlation (1ms) | r = 0.9999 | Near-perfect correlation |
| Timing correlation (50ms) | r = 0.39 | Weak correlation |
| KS p-value (50ms delay) | 0.706 | Indistinguishable from random |
| Cover traffic sweet spot | 2.0 PPS | KS p = 0.282 |
Reliability & FEC:
| Metric | Value | Configuration |
|---|---|---|
| FEC delivery (10% loss) | 98.8% | 30% ratio (4 parity) |
| No-FEC delivery (10% loss) | 30.9% | Baseline |
| FEC delivery (25% loss) | 53.2% | 30% ratio |
| SURB RTT with FEC | 264 ms (p50) | 25-28% faster than no-FEC |
| Replay detection (Bloom) | 5.6M inserts/sec | 14.4 MB per filter |
| PoW asymmetry (difficulty 16) | 41K:1 | Verify: 1.27 µs |
DeFi Pipeline:
| Metric | Value | Notes |
|---|---|---|
| Gas per circuit | 4.16-6.09M | Varies by circuit type |
| Proof generation | 9-11 sec | bb.js via Node.js subprocess |
| Paid mixnet E2E | 10.64 sec | Proof gen offloaded to relayer |
| Direct submission E2E | 12.89 sec | Client generates proof |
| HTTP proxy overhead | 2.5-3.5x | Real external endpoints |
| Min relayer margin | 10% | Configurable in profitability engine |
Attack Resistance:
| Attack | Result | Implication |
|---|---|---|
| n-1 attack | 100% success | Expected; requires global adversary |
| Intersection (20 epochs) | Full deanonymization | Cover traffic needed |
| 3/15 compromised nodes | 6.7% deanonymized | Scales with compromise fraction |
| Combined anonymity (independent) | 11M+ effective set | ZK-UTXO × mixnet multiplier |
| Combined anonymity (correlated) | ~11 effective set | Correlation destroys both layers |
These are our numbers. They come from 33 data files containing over 600,000 individual measurements. Every one is reproducible by running ./run_all.sh on comparable hardware.
The Transparency Argument
Let us make the meta-point explicitly.
Privacy infrastructure has a transparency problem. Projects ask users to trust them with sensitive metadata, but refuse to publish the performance and security data that would let users evaluate that trust.
"Security through obscurity" is a discredited concept in cryptography. The same principle should apply to performance claims. If your system is fast, prove it with numbers. If your system resists traffic analysis, prove it with entropy measurements. If your system has weaknesses -- and every system does -- say so, loudly, rather than hoping nobody checks.
We're a smaller project with no production deployment. Our benchmarks are from localhost simulations. We've been very clear about that. But we've published everything: the raw data files (33 JSON), the charts (36 pairs), the methodology, the attack results, the caveats. You can verify every claim we make by running ./run_all.sh and regenerating the data yourself.
We're publishing our attack results not because they make us look good -- a 100% n-1 attack success rate is not a selling point. We're publishing them because the alternative is pretending the attacks don't exist. Every mixnet is vulnerable to some subset of these attacks. The question is whether the implementors are honest about it.
The bar for privacy infrastructure should include proving, with data, that the system does what it claims. If your mixnet has high entropy, show the entropy measurements. If your Sphinx processing is fast, show the Criterion output. If your attack resistance has holes -- and every system's does -- quantify the holes and explain the mitigations.
We've shown ours. We're waiting for the others.
The Uncomfortable Conversation About $94.5 Million
Nym has raised $94.5 million in funding. They operate a production network with 550+ mix nodes. They have a team of experienced cryptographers and systems engineers. They have the resources to build the most comprehensive benchmark suite in the history of anonymity networks.
They have published zero benchmark results.
Not "insufficient" results. Not "preliminary" results. Zero. In five years of operation. With nearly a hundred million dollars in funding.
We're not implying malice. There are legitimate operational reasons a project might not publish benchmarks: competitive concerns, resource allocation tradeoffs, the difficulty of measuring a live production network without affecting it. But "it's hard" isn't an excuse when you've raised $94.5M. "It's competitive" isn't an excuse when you're asking people to trust you with their metadata.
The DeFi ecosystem has learned (painfully, repeatedly) that "trust us" is not an acceptable security model. We went through the era of "just trust the smart contract" -- and then exploits happened. Now every serious DeFi protocol publishes audit reports, has bug bounties, and open-sources its code. The privacy infrastructure space hasn't had its reckoning yet. Projects operate on vibes and whitepapers rather than data and audits.
We're not waiting for that reckoning. We're publishing now, warts and all.
An Invitation
Every data file referenced in this post is in our repository. Every chart can be regenerated. Every claim can be verified.
If you're building a competing mixnet, we genuinely invite you to publish your data the same way. Use the same metrics. Use different metrics. Use whatever methodology you think is appropriate. Just publish something. Let the community compare.
If you find errors in our data, we want to know. Open an issue. We'll investigate, and if we're wrong, we'll correct the record publicly. That's what transparency means.
If you've done the measurements but haven't published them -- maybe the numbers are embarrassing, maybe you haven't gotten around to writing them up, maybe you're worried about competitive implications -- consider this: the community benefits more from honest data than from silence. The numbers you're embarrassed about are probably similar to ours. We published our 100% n-1 attack success rate and the world didn't end.
The Cost of Silence
There's an argument that publishing benchmarks is "premature" for early-stage projects. "We'll publish when we're ready." "The numbers would be misleading without context." "We're focused on building, not marketing."
We've heard all of these. They're not wrong, exactly. Early-stage benchmarks are misleading if taken as production performance. Context is important. Building is the priority.
But consider the alternative. Tornado Cash operated for three years before being sanctioned. During that time, its relayer economics, privacy set sizes, and anonymity properties were studied extensively by external researchers -- because the contract and its usage data were public. The community could assess whether Tornado Cash actually provided the privacy it claimed. The answer, for small-value transactions, was often "not much" (small anonymity sets, timing correlation between deposits and withdrawals). This public scrutiny was uncomfortable for the project but invaluable for users.
Now imagine the same scenario for a mixnet. A privacy infrastructure project operates for three years, handles millions of messages, and is then found to have a critical timing vulnerability that an adversary has been exploiting since month one. If the project had published timing correlation data (like our Pearson r = 0.9999 at 1ms delay), the vulnerability would have been obvious. Without that data, the vulnerability is invisible until someone discovers it the hard way.
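For readers who want to see what such a measurement looks like, here is a toy simulation (ours; the actual harness measures real socket timestamps): packets sent over a short window, held for an exponentially distributed mixing delay, then correlated between ingress and egress.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def timing_correlation(mean_delay_s, n=5000, window_s=0.1, seed=7):
    """Simulate n packets sent uniformly over a window, each held for an
    exponentially distributed mixing delay; return the Pearson r between
    ingress and egress timestamps."""
    rng = random.Random(seed)
    sends = [rng.uniform(0, window_s) for _ in range(n)]
    recvs = [t + rng.expovariate(1 / mean_delay_s) for t in sends]
    return pearson_r(sends, recvs)

print(timing_correlation(0.001))  # near-perfect: r > 0.99
print(timing_correlation(0.050))  # much weaker once delay rivals the window
```

The qualitative shape matches the table: correlation stays near 1 until the mixing delay becomes comparable to the spread of send times, which is exactly why a 1ms delay buys essentially no unlinkability.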
Publishing benchmarks -- even embarrassing ones, especially embarrassing ones -- is a form of security review. Every number we publish is a claim that can be tested, challenged, and disproven. That makes us stronger, not weaker. The projects that publish nothing are protecting themselves from embarrassment, not protecting their users from harm.
What We're Asking For
We're not asking competing projects to match our 33 data files. We're asking for something more basic:
- One per-hop timing number. With hardware spec. Measured with a real benchmark framework. Published somewhere findable.
- One throughput number. PPS at zero loss. With node count and configuration. Measured, not estimated.
- One latency distribution. At least p50 and p99. At a stated mixing delay. With packet count.
- One privacy metric. Entropy, or unlinkability, or timing correlation. Any one. Measured on actual code, not derived from theory.
That's four numbers. Each one can be measured in an afternoon. Published in a blog post or a GitHub issue. Linked from the project's README.
If every active mixnet project published these four numbers, the field would have a baseline for comparison that doesn't currently exist. Users could make informed decisions. Researchers could calibrate their models against real systems. And the projects themselves would benefit from the external scrutiny.
We've shown ours. The four numbers: 61.6 microseconds per hop. 369 PPS at zero loss. p50 = 69.7ms at 1ms delay. H = 3.25 bits (97.7% of maximum) at 0ms delay.
Your turn.
Speaking of holes: the next post is about ours.
This is Part 5 of a 6-part series on metadata privacy for DeFi. Part 6: "The Things We Haven't Built Yet" is an honest audit of our own protocol's gaps.
References:
- Piotrowska, A., Hayes, J., Elahi, T., Meiser, S., & Danezis, G. (2017). "The Loopix Anonymity System." USENIX Security Symposium.
- Danezis, G. & Goldberg, I. (2009). "Sphinx: A Compact and Provably Secure Mix Format." IEEE Symposium on Security and Privacy.
- Das, D., Diaz, C., Kiayias, A., & Zacharias, T. (2024). "Are Continuous Stop-and-Go Mixnets Provably Secure?" PoPETs 2024.
- Oldenburg, L. et al. (2024). "MixMatch: Flow Matching for Mixnet Traffic." PoPETs 2024 (Best Student Paper).
- Mavroudis, V. & Elahi, T. (2025). "Quantifying Mix Network Privacy Erosion with Generative Models (LLMix)." arXiv:2506.08918.
- Rahimi, M. (2025). "MOCHA: Mixnet Optimization Considering Honest Client Anonymity." IACR ePrint 2025/861.
- Das, D. et al. (2018). "Anonymity Trilemma: Strong Anonymity, Low Bandwidth Overhead, Low Latency -- Choose Two." IEEE S&P 2018.
- Cao, Y. & Green, M. (2026). "Analysis and Attacks on the Reputation System of Nym." IACR ePrint 2026/101.
- Rial, A., Piotrowska, A., Halpin, H. et al. (2025). "Outfox: A Postquantum Packet Format for Layered Mixnets." WPES / arXiv:2412.19937.
- Chaum, D. et al. (2016). "cMix: Mixing with Minimal Real-Time Asymmetric Cryptographic Operations." IACR ePrint 2016/008.
- Diaz, C., Halpin, H., & Kiayias, A. (2021). "The Nym Network: The Next Generation of Privacy Infrastructure." Nym Technologies SA.
- van den Hooff, J. et al. (2015). "Vuvuzela: Scalable Private Messaging Resistant to Traffic Analysis." SOSP 2015.
- Tyagi, N. et al. (2017). "Stadium: A Distributed Metadata-Private Messaging System." SOSP 2017.
- Bernstein, D. J. (2006). "Curve25519: New Diffie-Hellman Speed Records." PKC 2006.
- Criterion.rs documentation. https://bheisler.github.io/criterion.rs/book/
- Tor Metrics Project. https://metrics.torproject.org
- Chaum, D. (1981). "Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms." Communications of the ACM, 24(2), 84-90.
- Möller, U. & Cottrell, L. (2003). "Mixmaster Protocol Version 3." IETF Draft. (Variable-rate stop-and-go mixing strategies.)
- Grassi, L. et al. (2023). "Poseidon2: A Faster Version of the Poseidon Hash Function." IACR ePrint 2023/323.
Data Sources:
- NOX benchmarks: scripts/bench/data/ (33 JSON files), commit db6ea3c, 2026-03-02
- Katzenpost: github.com/katzenpost/katzenpost, sphinx_benchmark_test.go, nightly CI
- Nym: nymtech/sphinx, benches/benchmarks.rs (benchmark code exists, zero published results)
- Tor: metrics.torproject.org/onionperf-latencies.html, February 2026
- Loopix: USENIX Security 2017, Section 7
- Outfox: arXiv:2412.19937, Table 1 (per-hop processing times)
- cMix: IACR ePrint 2016/008, Section 7 (precomputation and real-time phases)
- Vuvuzela/Stadium: SOSP 2015/2017 proceedings