gwt-context

v0.3 Benchmark Report

Status as of 2026-04-29: bounded Qwen/OpenAI-compatible sanity for v0.3.0-rc1.

This report summarizes the latest ignored benchmark artifacts under .benchmarks/qwen-sanity/. Raw JSON results are intentionally not committed.

Matrix

Command:

npm run qwen:sanity -- --run --max-tasks 5

The wrapper executes bounded bus-on and bus-off slices for RULER multi-hop and LongBench Pro synthetic tasks using deterministic local GWT embeddings.

Benchmark Mode Tasks GWT acc Baseline acc Evidence recall Evidence precision Bus accepted / inhibited
RULER advisor attend / bus on 2 100.0% 100.0% 100.0% 31.0% 2 / 4
RULER advisor attend / bus off 2 100.0% 100.0% 100.0% 31.0% 0 / 0
LongBench count/filter/aggregate/top_k/synthesis attend / bus on 5 100.0% 100.0% 100.0% 32.1% 5 / 5
LongBench count/filter/aggregate/top_k/synthesis attend / bus off 5 100.0% 80.0% 100.0% 35.7% 0 / 0

Release Gates

Gate Status Actual Threshold
RULER pass 100.0% 90.0%
count pass 100.0% 90.0%
filter pass 100.0% 90.0%
aggregate pass 100.0% 90.0%
synthesis pass 100.0% 80.0%
top_k pass 100.0% 80.0%

The local test baseline for this RC polish is 206 passed.

Bus Deltas

Benchmark Tasks Accuracy delta Tool-call delta Accepted delta
RULER advisor 2 +0.0% +0.00 +2
LongBench Pro 5 +0.0% +0.00 +5

Bus inhibition reasons were present in the summary:

Subscriber timeout/error counts were 0 / 0 in every slice.

Notable Delta

In the bus-off LongBench slice, baseline missed one synthesis task:

This is a useful sanity signal, not a broad benchmark claim. The sample is still small and should be expanded before final v0.3.0.

Artifact References

Latest ignored local artifacts used for this report: