This document explains what benchmarks/bench.py measures and why the suites are structured the way they are, based on the Slack thread:
- “LLM outputs sometimes break JSON or wrap it (‘json입니다~ …’, markdown fences, etc.)”
- “Huge JSON (GB-scale root arrays / big data) might want parallel parsing”
- “A probabilistic parser that can return multiple candidates (Top‑K) is valuable”
TL;DR:
`agentjson` is not trying to beat `orjson` on valid JSON microbenchmarks. The primary win is success rate and drop-in adoption in LLM pipelines.
Because `agentjson` provides a top-level `orjson` drop-in module, you must use two separate environments if you want to compare against the real `orjson` package:
```bash
# Env A: real orjson
python -m venv .venv-orjson
source .venv-orjson/bin/activate
python -m pip install orjson ujson
python benchmarks/bench.py

# Env B: agentjson (includes the shim)
python -m venv .venv-agentjson
source .venv-agentjson/bin/activate
python -m pip install agentjson ujson
python benchmarks/bench.py
```

Tune run sizes with env vars:
```bash
BENCH_MICRO_NUMBER=20000 BENCH_MICRO_REPEAT=5 \
BENCH_MESSY_NUMBER=2000 BENCH_MESSY_REPEAT=5 \
BENCH_TOPK_NUMBER=500 BENCH_TOPK_REPEAT=5 \
BENCH_LARGE_MB=5,20 BENCH_LARGE_NUMBER=3 BENCH_LARGE_REPEAT=3 \
BENCH_NESTED_MB=5,20 BENCH_NESTED_NUMBER=1 BENCH_NESTED_REPEAT=3 \
BENCH_NESTED_FORCE_PARALLEL=0 \
BENCH_CLI_MMAP_MB=512 \
python benchmarks/bench.py
```

- PR‑006 (CLI mmap): run `cli_mmap_suite` by setting `BENCH_CLI_MMAP_MB`.
  - This runs the Rust CLI twice on the same large file: default mmap vs `--no-mmap`.
  - It uses `/usr/bin/time` when available to capture max RSS (sketched after this list).
- PR‑101 (parallel delimiter indexer): use `large_root_array_suite` and increase `BENCH_LARGE_MB` (e.g. `200,1000`) to find the crossover where parallel indexing starts paying off.
- PR‑102 (nested huge value / corpus): use `nested_corpus_suite` to benchmark `scale_target_keys=["corpus"]` with `allow_parallel` on/off. On macOS this suite records wall time (the `/usr/bin/time -v` max-RSS path is Linux-friendly).
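For reference, the max-RSS wrapping works along these lines in plain Python (a sketch: `agentjson-cli` is a placeholder binary name, since only the `--no-mmap` flag is documented here):

```python
import shutil
import subprocess

def run_with_max_rss(cmd: list[str]) -> subprocess.CompletedProcess:
    # GNU time's -v report includes "Maximum resident set size" (Linux).
    # On macOS /usr/bin/time has no -v, which is why the suite falls back
    # to wall time there.
    if shutil.which("/usr/bin/time"):
        cmd = ["/usr/bin/time", "-v", *cmd]
    return subprocess.run(cmd, capture_output=True, text=True)

# "agentjson-cli" is a hypothetical name for the Rust CLI binary.
default_run = run_with_max_rss(["agentjson-cli", "big.json"])
no_mmap_run = run_with_max_rss(["agentjson-cli", "--no-mmap", "big.json"])
```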
Example run (Env B, `BENCH_CLI_MMAP_MB=256`, 2025-12-14):

| Mode | Elapsed |
|---|---|
| `mmap` (default) | 1.27 s |
| `read` (`--no-mmap`) | 1.38 s |
Interpretation: mmap’s main win is avoiding upfront heap allocation / extra copies on huge files; it may or may not be faster depending on OS page cache and IO patterns.
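That tradeoff is easy to see in plain Python, independent of the Rust CLI internals (a sketch; point it at any large local file):

```python
import mmap
import time

def read_all(path: str) -> int:
    # Plain read: allocates a file-sized heap buffer up front.
    with open(path, "rb") as f:
        return len(f.read())

def mmap_scan(path: str) -> int:
    # mmap: no upfront buffer; the OS pages bytes in lazily as they are touched.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for i in range(0, len(m), 4096):  # touch one byte per 4 KiB page
            _ = m[i]
        return len(m)

for fn in (read_all, mmap_scan):
    t0 = time.perf_counter()
    n = fn("big.json")  # "big.json": any large local file
    print(f"{fn.__name__}: {n} bytes in {time.perf_counter() - t0:.3f}s")
```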
Real LLM outputs often include things that break strict JSON parsers:
- Markdown fences: `` ```json ... ``` ``
- Prefix/suffix junk: `"json 입니다~ {...} 감사합니다"`
- “Almost JSON”: single quotes, unquoted keys, trailing commas
- Python literals: `True`/`False`/`None`
- Missing commas
- Smart quotes: `“ ”`
- Missing closing brackets/quotes
The benchmark uses a fixed set of 10 cases (see `LLM_MESSY_CASES` in `benchmarks/bench.py`).

- `success`: no exception was raised
- `correct`: parsed value equals the expected Python object
- `best time / case`: the best observed per-case attempt time (lower is better)
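For context, the timing numbers come from a best-of loop roughly like the one below (a sketch; `benchmarks/bench.py` is the source of truth, and the `BENCH_*_NUMBER` / `BENCH_*_REPEAT` env vars above map to `number` / `repeat`):

```python
import timeit

def best_time_per_case(fn, number: int, repeat: int) -> float:
    # "best time / case": the fastest average per-call time, i.e. the min over
    # `repeat` trials of (trial wall time / `number` calls per trial).
    return min(timeit.repeat(fn, number=number, repeat=repeat)) / number
```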
Input (not strict JSON):

````text
preface```json
{"a":1}
```suffix
````
Strict parsers:

```python
import json
import orjson  # real orjson

json.loads('preface```json\n{"a":1}\n```suffix')    # -> JSONDecodeError
orjson.loads('preface```json\n{"a":1}\n```suffix')  # -> JSONDecodeError
```

With agentjson as an orjson drop-in (same call site):

```python
import os
import orjson  # provided by agentjson (drop-in)

os.environ["JSONPROB_ORJSON_MODE"] = "auto"
orjson.loads('preface```json\n{"a":1}\n```suffix')  # -> {"a": 1}
```

This is the core PR pitch: don’t change code, just switch the package and flip a mode when needed.
| Library / mode | Success | Correct | Best time / case |
|---|---|---|---|
| `json` (strict) | 0/10 | 0/10 | n/a |
| `ujson` (strict) | 0/10 | 0/10 | n/a |
| `orjson` (strict, real) | 0/10 | 0/10 | n/a |
| `agentjson` (drop-in `orjson.loads`, mode=auto) | 10/10 | 10/10 | 23.5 µs |
| `agentjson.parse(mode=auto)` | 10/10 | 10/10 | 19.5 µs |
| `agentjson.parse(mode=probabilistic)` | 10/10 | 10/10 | 19.5 µs |
When input is ambiguous, a “repair” might have multiple plausible interpretations.
`agentjson` can return multiple candidates (`top_k`) with confidence scores, so downstream code can:
- validate with schema/business rules,
- pick the best candidate,
- or decide to re-ask an LLM only when confidence is low.
Input (ambiguous suffix):

```text
{"a":1,"b":2,"c":3, nonsense nonsense
```

A plausible “best effort” interpretation is to truncate the garbage suffix and return:

```json
{"a": 1, "b": 2, "c": 3}
```

But another plausible interpretation is to treat `nonsense` as a missing key/value pair and produce something else.
In the benchmark run, this case shows up exactly as:
- a Top‑1 miss (the top candidate is not the expected value),
- but a Top‑K hit (K=5): the expected value is present in the candidate list.
| Metric | Value |
|---|---|
| Top‑1 hit rate | 7/8 |
| Top‑K hit rate (K=5) | 8/8 |
| Avg candidates returned | 1.25 |
| Avg best confidence | 0.57 |
| Best time / case | 38.2 µs |
This suite generates a single large root array like:

```json
[{"id":0,"value":"test"},{"id":0,"value":"test"}, ...]
```

and measures how long `loads(...)` takes for sizes like 5 MB and 20 MB.
For comparing `json`/`ujson`/`orjson`, use Env A (real orjson); in Env B, `import orjson` resolves to the shim.
| Library | 5 MB | 20 MB |
|---|---|---|
| `json.loads(str)` | 53.8 ms | 217.2 ms |
| `ujson.loads(str)` | 45.9 ms | 173.7 ms |
| `orjson.loads(bytes)` (real) | 27.0 ms | 116.2 ms |
`benchmarks/bench.py` also measures `agentjson.scale` (serial vs parallel) in Env B. On 5–20 MB inputs the crossover depends on your machine; the parallel path is intended for much larger payloads (GB‑scale root arrays).
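To probe that crossover locally, something like this works (assuming `agentjson.scale` takes the document plus an `allow_parallel` flag, mirroring the pipeline options described below; the real signature may differ):

```python
import timeit
import agentjson

payload = make_root_array(200)  # ~200 MB, reusing the generator sketch above

for allow_parallel in (False, True):
    # Assumed signature: scale(document, allow_parallel=...).
    t = min(timeit.repeat(
        lambda: agentjson.scale(payload, allow_parallel=allow_parallel),
        number=1,
        repeat=3,
    ))
    print(f"allow_parallel={allow_parallel}: {t:.2f}s")
```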
This is the “realistic Slack payload” shape:

```json
{ "corpus": [ ... huge ... ], "x": 0 }
```

Why this matters:
- Parallelizing a root array is useful, but many real payloads wrap the big array under a key (`corpus`, `rows`, `events`, …).
- The `scale_target_keys=["corpus"]` option exists to target that nested value.
`benchmarks/bench.py` includes `nested_corpus_suite`, which compares (a call sketch follows the list):

- `agentjson.scale_pipeline(no_target)`: baseline (no targeting)
- `agentjson.scale_pipeline(corpus, serial)`: targeting without forcing parallel
- `agentjson.scale_pipeline(corpus, parallel)`: optional, targeting with forced parallel (`allow_parallel=True`; set `BENCH_NESTED_FORCE_PARALLEL=1`)
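Putting the documented options together, a targeted call looks roughly like this (the function name comes from the suite labels above and the call shape is inferred, so treat it as a sketch):

```python
import agentjson

# A "corpus"-wrapped payload in the Slack shape above (kept small here).
doc = '{"corpus": [' + ",".join(['{"id":0,"value":"test"}'] * 1000) + '], "x": 0}'

# Inferred call shape; the keyword names are the documented options.
result = agentjson.scale_pipeline(
    doc,
    scale_target_keys=["corpus"],  # target the big array nested under "corpus"
    allow_parallel=True,           # optionally force the parallel path
    scale_output="dom",            # DOM mode, so split_mode is observable
)
```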
Important nuance:

- This suite uses DOM mode (`scale_output="dom"`) so `split_mode` shows whether nested targeting triggered (see `rust/src/scale.rs::try_nested_target_split`).
- Wiring nested targeting into tape mode (`scale_output="tape"`) is the next-step work for true “huge nested value without DOM” workloads.