research note · 2026-06-29

Synthetic vs real session-store validation

OVM can benchmark Codex and Claude against disposable synthetic histories. This note checks whether that synthetic store is close enough to the real stores to use as research evidence.

verdict

Good store-pressure analogue. Not yet a transcript-shape clone.

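The corrected fixture now preserves the scaled JSONL file counts and the heavy-tailed file-size distribution at p50 and p90 (see shape check). That makes it useful for startup and backfill pressure tests. The mean is a looser match for Codex: 1,526 KB synthetic against 3,516 KB real, which suggests the largest files are under-represented. The fixture still does not synthesize Codex function-call events, function-call outputs, or Claude tool sidecar files, so parser-shape conclusions need a separate fixture upgrade.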

method

Real stores were measured from the local macOS session directories. Synthetic stores were generated into a scratch home with "ovm-benchmark fixture --match-real --scale 0.1 --count 20". No private transcript text is copied into the fixture.

Real inputs

~/.codex/sessions: 1,938 JSONL files, 6.8 GB.

~/.claude/projects: 8,838 JSONL files, 5.7 GB.

Synthetic target

Preserve count, total bytes, median size, and p90 size using sorted real file-size targets.
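
One way to read "sorted real file-size targets" is rank-based downsampling: sort the real sizes and pick evenly spaced ranks. The sketch below is an assumption about the approach, not the ovm-benchmark implementation; the function name pick_size_targets and the midpoint-rank rule are hypothetical.

```python
def pick_size_targets(real_sizes, scale):
    """Return about len(real_sizes) * scale sizes taken at evenly spaced ranks.

    Hypothetical helper: illustrates quantile-preserving downsampling only.
    """
    ordered = sorted(real_sizes)
    if not ordered:
        return []
    n = max(1, round(len(ordered) * scale))
    last = len(ordered) - 1
    # Target i is taken from the centre of the i-th of n equal rank slices.
    return [ordered[min(last, int((i + 0.5) * len(ordered) / n))] for i in range(n)]


if __name__ == "__main__":
    sizes = list(range(1, 101))  # stand-in for real file sizes in bytes
    print(pick_size_targets(sizes, 0.1))
```

Under this rounding, 1,938 and 8,838 real files map to 194 and 884 targets. Evenly spaced ranks keep the median and p90 close to the real values but can skip the very largest files, so the mean and the total bytes may land below scale times the real total.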

Known gap

Synthetic JSONL currently contains message events only. Tool-call event shape is not represented yet.

shape check

store	files	total size	mean file	p50 file	p90 file
Codex real	1,938	6,813 MB	3,516 KB	378 KB	5,051 KB
Codex synthetic 0.1	194	296 MB	1,526 KB	408 KB	4,849 KB
Claude real	8,838	5,716 MB	647 KB	102 KB	751 KB
Claude synthetic 0.1	884	520 MB	588 KB	172 KB	819 KB
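
The columns above can be recomputed from a directory of JSONL files. This is a minimal sketch, assuming nearest-rank percentiles and a recursive *.jsonl scan; the note does not say which percentile method was used, and the names jsonl_sizes and summarize_sizes are hypothetical.

```python
from pathlib import Path


def jsonl_sizes(root):
    """Collect the byte size of every *.jsonl file under root (recursive)."""
    return [p.stat().st_size for p in Path(root).expanduser().rglob("*.jsonl") if p.is_file()]


def nearest_rank(ordered, pct):
    """Nearest-rank percentile of an ascending list; pct is an integer 1..100."""
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def summarize_sizes(sizes):
    """Return the shape-check columns for a list of file sizes in bytes."""
    ordered = sorted(sizes)
    if not ordered:
        return {"files": 0, "total": 0, "mean": 0.0, "p50": 0, "p90": 0}
    total = sum(ordered)
    return {
        "files": len(ordered),
        "total": total,
        "mean": total / len(ordered),
        "p50": nearest_rank(ordered, 50),
        "p90": nearest_rank(ordered, 90),
    }


if __name__ == "__main__":
    print(summarize_sizes(jsonl_sizes("~/.codex/sessions")))
```

Running the same function over the real and scratch homes gives directly comparable rows, whatever percentile convention is chosen.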

startup check

product	environment	cold	warm samples	interpretation
Codex	synthetic 0.1 scratch home	7,012 ms	5,332 / 3,078 / 6,506 ms	Captures cold/backfill direction; warm is still noisier than real.
Codex	real home, warm	not run	2,558 / 2,103 / 2,249 ms	Real warm launch is faster than the synthetic scratch run.
Claude	synthetic 0.1 scratch home	1,219 ms	1,105 / 1,245 / 1,620 ms	Close enough for startup-size pressure checks.
Claude	real home, warm	not run	1,275 / 1,837 / 2,458 ms	Same order of magnitude as the synthetic run.
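
The warm samples are easier to compare as a median and a spread. The sketch below only restates the numbers from the table; the dictionary layout and the summarize_warm helper are illustrative assumptions.

```python
from statistics import median

# Warm launch samples in milliseconds, copied from the startup check table.
WARM_MS = {
    ("Codex", "synthetic"): [5332, 3078, 6506],
    ("Codex", "real"): [2558, 2103, 2249],
    ("Claude", "synthetic"): [1105, 1245, 1620],
    ("Claude", "real"): [1275, 1837, 2458],
}


def summarize_warm(samples):
    """Median and max-minus-min spread of a list of launch times."""
    return {"median_ms": median(samples), "spread_ms": max(samples) - min(samples)}


if __name__ == "__main__":
    for (product, env), samples in WARM_MS.items():
        print(product, env, summarize_warm(samples))
```

The medians are 5,332 ms (Codex synthetic) against 2,249 ms (Codex real), and 1,245 ms (Claude synthetic) against 1,837 ms (Claude real). Three samples per row support a direction check, not a precise ratio.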

transcript-shape gap

sample	files sampled	function_call	function_call_output	tool_use	tool_result
Codex real	200	67,653	33,814	255	18
Codex synthetic	200	0	0	0	0
Claude real	200	0	0	136	65
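
Counts like these can be gathered by scanning each JSONL line for the four event kinds. This is a rough sketch, assuming every line is a JSON object and that the kind appears as the value of a "type" key at any nesting depth; the real Codex and Claude schemas may differ, and the counting rule used for the table is not stated in this note.

```python
import json
from collections import Counter

TRACKED = ("function_call", "function_call_output", "tool_use", "tool_result")


def _walk(node, counts):
    """Recursively count tracked "type" values in nested dicts and lists."""
    if isinstance(node, dict):
        kind = node.get("type")
        if isinstance(kind, str) and kind in TRACKED:
            counts[kind] += 1
        for value in node.values():
            _walk(value, counts)
    elif isinstance(node, list):
        for item in node:
            _walk(item, counts)


def count_tool_events(lines):
    """Count tracked event kinds across an iterable of JSONL lines."""
    counts = Counter({kind: 0 for kind in TRACKED})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        _walk(record, counts)
    return counts


if __name__ == "__main__":
    sample = [
        '{"type": "message", "content": [{"type": "tool_use", "name": "x"}]}',
        '{"payload": {"type": "function_call"}}',
    ]
    print(dict(count_tool_events(sample)))
```

A message-only synthetic store yields zero for all four keys under this rule, which is the gap the table shows.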

Use this for store-size research, then add tool-call synthesis.

Read the current report as validation of the synthetic store for storage-pressure and startup measurement only. The next research upgrade is to add structured Codex function-call events and Claude tool-use sidecars to the fixture, then rerun this comparison.