Benchmarks
These are the benchmark figures shown on the agentOS marketing page. All numbers are computed from the same data source used by the marketing page. For independent sandbox comparison data, see the ComputeSDK benchmarks.
Cold start
Section titled “Cold start”Time from requesting an execution to first code running. Measured using the sleep workload (a minimal VM running an idle Node.js process). Sandbox baseline: E2B, the fastest mainstream sandbox provider as of March 30, 2026. See ComputeSDK benchmarks for independent sandbox comparison data.
| Metric | agentOS | Fastest sandbox (E2B) |
|---|---|---|
| Cold start p50 | 4.8 ms | 440 ms |
| Cold start p95 | 5.6 ms | 950 ms |
| Cold start p99 | 6.1 ms | 3,150 ms |
Memory per instance
Section titled “Memory per instance”Measured via staircase benchmarking:
- Warmup. A throwaway VM is created, started, and destroyed before measurement begins. This pays one-time costs (module cache, JIT compilation) that are amortized away in any real deployment where the host process is long-lived.
- Baseline. GC is forced twice (
--expose-gc), then RSS is sampled across the entire process tree by reading/proc/[pid]/statmfor the host process and all descendants. This captures child processes (e.g. V8 isolates running as separate processes) thatprocess.memoryUsage().rsswould miss. - Staircase. VMs are added one at a time. After each VM starts and settles, GC is forced and RSS is sampled again. The delta from the previous sample is the incremental cost of that VM.
- Average. The per-VM cost is the mean of all step deltas.
- Teardown. All VMs are disposed and the reclaimed RSS is recorded.
RSS is a process-wide metric that includes thread stacks and OS-mapped pages beyond the VM itself, so the reported figure is an upper bound on the true per-VM cost.
Sandbox baseline: Daytona, the cheapest mainstream sandbox provider as of March 30, 2026. Default sandbox: 1 vCPU + 1 GiB RAM.
Full coding agent
Section titled “Full coding agent”Pi coding agent session with MCP servers and mounted file systems.
| Metric | agentOS | Cheapest sandbox (Daytona) |
|---|---|---|
| Memory per instance | ~131 MB | ~1024 MB |
Simple shell command
Section titled “Simple shell command”Minimal shell workload running simple commands.
| Metric | agentOS | Cheapest sandbox (Daytona) |
|---|---|---|
| Memory per instance | ~22 MB | ~1024 MB |
Cost per execution-second
Section titled “Cost per execution-second”Assumes one agent per sandbox (needed for isolation) and 70% host utilization for self-hosted hardware (the industry-standard HPA scaling threshold). Cost formula: server cost per second / concurrent executions per server, where concurrent executions = floor(server RAM / agent memory) × 0.7.
Sandbox baseline: Daytona at $0.0504/vCPU-h + $0.0162/GiB-h with a 1 vCPU + 1 GiB minimum. Source: daytona.io/pricing.
Full coding agent
Section titled “Full coding agent”| Host tier | agentOS | Cheapest sandbox | Difference |
|---|---|---|---|
| AWS ARM | $0.00000058/s | $0.000018/s | 32x cheaper |
| AWS x86 | $0.00000072/s | $0.000018/s | 26x cheaper |
| Hetzner ARM | $0.000000066/s | $0.000018/s | 281x cheaper |
| Hetzner x86 | $0.00000011/s | $0.000018/s | 171x cheaper |
Simple shell command
Section titled “Simple shell command”| Host tier | agentOS | Cheapest sandbox | Difference |
|---|---|---|---|
| AWS ARM | $0.000000073/s | $0.000018/s | 254x cheaper |
| AWS x86 | $0.000000090/s | $0.000018/s | 205x cheaper |
| Hetzner ARM | $0.000000011/s | $0.000018/s | 1738x cheaper |
| Hetzner x86 | $0.000000017/s | $0.000018/s | 1061x cheaper |
Test environment
Section titled “Test environment”| Component | Details |
|---|---|
| CPU | 12th Gen Intel i7-12700KF, 12 cores / 20 threads @ 3.7 GHz, 25 MB cache |
| RAM | 2× 32 GB DDR4 @ 2400 MT/s |
| Node.js | v24.13.0 |
| OS | Linux 6.1.0 (Debian), x86_64 |
Sandbox baselines
Section titled “Sandbox baselines”| Comparison | Provider | Why this provider |
|---|---|---|
| Cold start | E2B | Fastest mainstream sandbox provider on ComputeSDK as of March 30, 2026 |
| Memory and cost | Daytona | Cheapest mainstream sandbox provider as of March 30, 2026 ($0.0504/vCPU-h + $0.0162/GiB-h) |
Self-hosted hardware tiers: AWS t4g.micro (ARM, $0.0084/h, 1 GiB), AWS t3.micro (x86, $0.0104/h, 1 GiB), Hetzner CAX11 (ARM, €3.29/mo, 4 GiB), Hetzner CX22 (x86, €5.39/mo, 4 GiB). All on-demand pricing.
Reproducing
Section titled “Reproducing”agentOS benchmarks live in the agent-os repository under scripts/benchmarks/.
Prerequisites
Section titled “Prerequisites”- Node.js (see
.nvmrc) andpnpm, with dependencies installed:pnpm install - A Rust toolchain (
cargo) — the benchmarks build and run the native release sidecar - A reasonably idle machine: cold-start latency tails are sensitive to background CPU and GC jitter
Run everything
Section titled “Run everything”From the repository root:
./scripts/benchmarks/run-benchmarks.shThe script builds the TypeScript packages and an optimized (release) sidecar, points the SDK at it via AGENT_OS_SIDECAR_BIN, and writes one JSON result per benchmark to scripts/benchmarks/results/:
| Result file | Feeds marketing input |
|---|---|
coldstart-sleep.json | COLDSTART_P50/P95/P99_MS |
memory-sleep.json | MEMORY_SHELL_MB (result.avgPerVmRssBytes / 1024²) |
memory-pi-session.json | MEMORY_AGENT_MB (result.avgPerVmRssBytes / 1024²) |
Copy those numbers into website/src/data/bench.ts; every figure and multiplier on this page recomputes from them.
Run a single benchmark
Section titled “Run a single benchmark”Each benchmark is a standalone tsx entrypoint. Build first (pnpm build and cargo build --release -p agent-os-sidecar), then:
export AGENT_OS_SIDECAR_BIN="$PWD/target/release/agent-os-sidecar"
# Cold start (sleep workload)pnpm exec tsx scripts/benchmarks/coldstart.bench.ts --workload=sleep --iterations=2000
# Memory — simple shell commandpnpm exec tsx --expose-gc scripts/benchmarks/memory.bench.ts --workload=sleep --count=20
# Memory — full coding agentpnpm exec tsx --expose-gc scripts/benchmarks/memory.bench.ts --workload=pi-session --count=10JSON goes to stdout; a human-readable table and progress go to stderr.
Sample sizes
Section titled “Sample sizes”Percentiles are nearest-rank: sorted[ceil(p/100 · n) − 1]. With too few samples the tail is meaningless — at n = 30, p99 is literally the single slowest run and p95 is the second slowest. Use enough iterations that the reported percentile is averaged over real tail samples, not one outlier:
| Statistic | Minimum iterations |
|---|---|
| p50 (median) | ~30 |
| p95 | ~200 |
| p99 | ~1,000 |
run-benchmarks.sh uses --iterations=2000 for cold start so p95/p99 are trustworthy. Memory per VM is a mean of per-VM step deltas with low variance, so --count=20 (shell) / --count=10 (agent) is sufficient.
The
pi-sessionmemory workload needs a working in-VM agent runtime (the Pi adapter process must launch inside the VM). On builds where the agent runtime is unavailable it will fail its process check rather than report a number.
Methodology
Section titled “Methodology”Every benchmark creates the sidecar once up front (AgentOs.createSidecar()) and leases all VMs from it. VMs are incremental tenants of one shared sidecar process — not one process each — so the figures measure the marginal cost of a VM, not a fresh process. (AgentOs.create() with no sidecar option already uses the shared default-pool sidecar, so this is the default everywhere, including RivetKit actors.)
Before any measured iteration, each benchmark does a cold run — a throwaway VM that is created, started, and (for cold start) snapshotted. This pays the one-time process spawn + bootstrap so the recorded numbers reflect the warm, steady-state incremental per-VM cost, never the first VM. The release sidecar is required: a debug build is several times slower and inflates the numbers.