ovm.sh / synthetic vs real validation
research note · 2026-06-29

Synthetic vs real session-store validation

OVM can benchmark Codex and Claude against disposable synthetic histories. This note checks whether those synthetic stores are close enough to the real session stores to count as research evidence.

verdict

Good store-pressure analogue. Not yet a transcript-shape clone.

The corrected fixture now preserves the scaled JSONL file count and the heavy-tailed file-size distribution of the local stores, which makes it useful for startup and backfill pressure tests. It does not yet synthesize Codex function-call events, function-call outputs, or Claude tool sidecar files, so conclusions about parser shape need a separate fixture upgrade.

method

Real stores were measured in the local macOS session directories. Synthetic stores were generated into a scratch home with this command: ovm-benchmark fixture --match-real --scale 0.1 --count 20. No private transcript text is copied into the fixture.
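The store measurement described above can be approximated with a short script. This is a minimal sketch, not the tool's own code: the function names `measure_store` and `nearest_rank` are hypothetical, and the nearest-rank percentile is an assumption, so the p50 and p90 values may differ slightly from those reported by ovm-benchmark.

```python
import math
from pathlib import Path


def nearest_rank(sorted_sizes, q):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(q * len(sorted_sizes)))
    return sorted_sizes[rank - 1]


def measure_store(root):
    """Summarize every *.jsonl file found recursively under root."""
    sizes = sorted(
        p.stat().st_size for p in Path(root).rglob("*.jsonl") if p.is_file()
    )
    if not sizes:
        return {"files": 0, "bytes": 0, "avg": 0.0, "p50": 0, "p90": 0}
    total = sum(sizes)
    return {
        "files": len(sizes),
        "bytes": total,
        "avg": total / len(sizes),
        "p50": nearest_rank(sizes, 0.50),  # assumed percentile definition
        "p90": nearest_rank(sizes, 0.90),
    }
```

Calling `measure_store(Path("~/.codex/sessions").expanduser())` returns the file count, total bytes, and size percentiles for that store. Only file sizes are read, so no transcript content is touched.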

01 Real inputs

~/.codex/sessions: 1,938 JSONL files, 6.8 GB.

~/.claude/projects: 8,838 JSONL files, 5.7 GB.

02 Synthetic target

Preserve count, total bytes, median size, and p90 size using sorted real file-size targets.
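One way to derive scaled targets from sorted real file sizes is to sample evenly spaced ranks. The sketch below illustrates that idea only; it is not confirmed to be how ovm-benchmark works, and `scaled_size_targets` is a hypothetical name.

```python
def scaled_size_targets(real_sizes, scale):
    """Pick about len(real_sizes) * scale sizes at evenly spaced ranks.

    Assumption: sampling the midpoint of each rank bucket keeps the median
    and the upper percentiles close to the real distribution.
    """
    ordered = sorted(real_sizes)
    if not ordered:
        return []
    n = max(1, round(len(ordered) * scale))
    # Bucket i covers ranks [i * N / n, (i + 1) * N / n); take its midpoint.
    return [ordered[int((i + 0.5) * len(ordered) / n)] for i in range(n)]
```

With a scale of 0.1, rounding gives 194 targets for 1,938 files and 884 for 8,838, which happens to match the synthetic file counts reported here, although the tool's actual rounding rule is not documented in this note. Midpoint sampling never selects the very largest files, so with a heavy-tailed store the total bytes can fall below scale times the real total unless the targets are adjusted afterwards.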

03 Known gap

Synthetic JSONL currently contains message events only. Tool-call event shape is not represented yet.

| store | files | total size | avg file size | p50 file size | p90 file size |
|---|---|---|---|---|---|
| Codex real | 1,938 | 6,813 MB | 3,516 KB | 378 KB | 5,051 KB |
| Codex synthetic 0.1 | 194 | 296 MB | 1,526 KB | 408 KB | 4,849 KB |
| Claude real | 8,838 | 5,716 MB | 647 KB | 102 KB | 751 KB |
| Claude synthetic 0.1 | 884 | 520 MB | 588 KB | 172 KB | 819 KB |

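The store-size figures can be compared directly by dividing each synthetic value by its real counterpart. The snippet below is a small worked check using only the numbers reported in this note; `synthetic_to_real` is a hypothetical helper name.

```python
# (files, total MB, p50 KB, p90 KB), copied from the store-size figures.
STORES = {
    "Codex": {"real": (1938, 6813, 378, 5051), "synthetic": (194, 296, 408, 4849)},
    "Claude": {"real": (8838, 5716, 102, 751), "synthetic": (884, 520, 172, 819)},
}
FIELDS = ("files", "bytes", "p50", "p90")


def synthetic_to_real(product):
    """Return synthetic / real for each field, rounded to three decimals."""
    real = STORES[product]["real"]
    synthetic = STORES[product]["synthetic"]
    return {name: round(s / r, 3) for name, s, r in zip(FIELDS, synthetic, real)}


for product in STORES:
    print(product, synthetic_to_real(product))
```

With these inputs the file-count ratios land on the 0.1 scale for both products. Total bytes come out at roughly 0.04 of real for Codex and 0.09 for Claude. The p90 ratios stay within about 10% of 1, and the p50 ratios are about 1.08 for Codex and 1.69 for Claude.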
| product | environment | cold | warm samples | interpretation |
|---|---|---|---|---|
| Codex synthetic 0.1 | scratch home | 7,012 ms | 5,332 / 3,078 / 6,506 ms | Captures cold/backfill direction; warm is still noisier than real. |
| Codex real | home, warm | not run | 2,558 / 2,103 / 2,249 ms | Real warm launch is faster than the synthetic scratch run. |
| Claude synthetic 0.1 | scratch home | 1,219 ms | 1,105 / 1,245 / 1,620 ms | Close enough for startup-size pressure checks. |
| Claude real | home, warm | not run | 1,275 / 1,837 / 2,458 ms | Same order of magnitude as the synthetic run. |

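The warm samples are easier to compare once reduced to a median and a spread. The snippet below is a rough sketch over the reported timings only; with three samples per row the summary is indicative at best, and the helper names are hypothetical.

```python
from statistics import median

# Warm launch samples in milliseconds, copied from the timing figures.
WARM_MS = {
    ("Codex", "synthetic"): [5332, 3078, 6506],
    ("Codex", "real"): [2558, 2103, 2249],
    ("Claude", "synthetic"): [1105, 1245, 1620],
    ("Claude", "real"): [1275, 1837, 2458],
}


def summarize(samples):
    """Median and max-minus-min spread of a list of timings."""
    return {"median": median(samples), "spread": max(samples) - min(samples)}


def synthetic_over_real(product):
    """Ratio of the synthetic warm median to the real warm median."""
    synthetic = summarize(WARM_MS[(product, "synthetic")])["median"]
    real = summarize(WARM_MS[(product, "real")])["median"]
    return round(synthetic / real, 2)
```

On these samples the Codex synthetic warm median is 5,332 ms against 2,249 ms real, a ratio of about 2.37. The Claude synthetic median is 1,245 ms against 1,837 ms real, a ratio of about 0.68.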
| sample | files sampled | function_call | function_call_output | tool_use | tool_result |
|---|---|---|---|---|---|
| Codex real | 200 | 67,653 | 33,814 | 255 | 18 |
| Codex synthetic | 200 | 0 | 0 | 0 | 0 |
| Claude real | 200 | 0 | 0 | 136 | 65 |

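Event-shape counts like these can be gathered by scanning sampled JSONL files for tagged events. The sketch below assumes each line is a JSON object and that the event kind is stored under a "type" key at some nesting depth. That layout is an assumption made for illustration, not something this note specifies, and the function names are hypothetical.

```python
import json
from collections import Counter

TRACKED = ("function_call", "function_call_output", "tool_use", "tool_result")


def walk_types(node):
    """Yield every string stored under a "type" key, at any nesting depth."""
    if isinstance(node, dict):
        kind = node.get("type")
        if isinstance(kind, str):
            yield kind
        for value in node.values():
            yield from walk_types(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk_types(item)


def count_tool_events(paths):
    """Count tracked event kinds across the given JSONL files."""
    counts = Counter({name: 0 for name in TRACKED})
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    event = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip truncated or non-JSON lines
                counts.update(k for k in walk_types(event) if k in TRACKED)
    return dict(counts)
```

Counting by tag alone keeps transcript text out of the result, which fits the rule that no private content leaves the real store.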
next

Use this for store-size research, then add tool-call synthesis.

The current report should be read as validation for synthetic storage pressure and startup measurement. The next research upgrade is to add structured Codex function-call events and Claude tool-use sidecars to the fixture, then rerun this comparison.