Status as of 2026-04-29: release candidate for local MCP use and bounded OpenAI-compatible benchmark evaluation.
- python -m gwt_context and gwt-context.
- python -m gwt_context.smoke or gwt-context-smoke.
- gwt_attend applies accepted proposal kinds through public ports or evidence-plan metadata.
- GWT_MIN_ACTIVATION; weak candidates remain preconscious instead of filling empty slots automatically.
- ok, timeout, and error statuses, plus a dedicated gwt_bus_inspect MCP tool.
- python -m tests.benchmarks.bus_matrix --run --max-tasks N.

Required local gates before release:
python scripts/release_gate.py
Release thresholds are tracked in docs/release-thresholds.md.
Boundary checks:
rg -n "cycle\.(workspace|buffer|goal_manager)|_cycle\.(workspace|buffer|goal_manager)|from gwt_context\.infrastructure|tests\.benchmarks" src/gwt_context/mcp src/gwt_context/application || true
git ls-files | rg -n "(tests/benchmarks/(results|reports)|tests/benchmarks/.*\.log|\.env$|supervisor|\.benchmarks|\.worktrees)" || true
git grep -n "proxy\.runpod\.net" HEAD -- . ':!research' || true
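The tracked-file check above can also be tried against a list of paths in Python. This is a minimal sketch that reuses the pattern from the `git ls-files | rg` command; the helper name `forbidden_paths` and the sample paths are hypothetical and are not part of the repository.

```python
import re

# Same alternation as the `git ls-files | rg -n "(...)"` boundary check.
FORBIDDEN_TRACKED = re.compile(
    r"(tests/benchmarks/(results|reports)"
    r"|tests/benchmarks/.*\.log"
    r"|\.env$"
    r"|supervisor"
    r"|\.benchmarks"
    r"|\.worktrees)"
)


def forbidden_paths(tracked_paths):
    """Return the paths that the boundary check would flag (hypothetical helper)."""
    return [p for p in tracked_paths if FORBIDDEN_TRACKED.search(p)]


# Illustrative paths only; real input would come from `git ls-files`.
sample = [
    "src/gwt_context/mcp/server.py",
    "tests/benchmarks/results/run.json",
    "tests/benchmarks/bus_matrix.py",
    ".env",
    ".env.example",
]
flagged = forbidden_paths(sample)
print(flagged)
```

An empty result corresponds to the `rg` command printing nothing, which is the passing state for this gate.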
Bus matrix summary:
python -m tests.benchmarks.bus_matrix --summarize tests/benchmarks/results/*.json
python -m tests.benchmarks.analyze_results tests/benchmarks/results/*.json
For a bounded model-backed sanity run against the configured .env endpoint:
python scripts/qwen_sanity.py --run --max-tasks 2
Latest max_tasks=5 Qwen sanity on 2026-04-29:
| Slice | Bus | Tasks | GWT | Baseline | Avg GWT Tool Calls | Bus accepted/inhibited |
|---|---|---|---|---|---|---|
| RULER advisor | on | 2 | 100% | 100% | 3.0 | 2 / 4 |
| RULER advisor | off | 2 | 100% | 100% | 3.0 | 0 / 0 |
| LongBench count/filter/aggregate/top_k/synthesis | on | 5 | 100% | 100% | 3.0 | 5 / 5 |
| LongBench count/filter/aggregate/top_k/synthesis | off | 5 | 100% | 80% | 3.0 | 0 / 0 |
Dogfood evidence is recorded in docs/dogfood-report.md. The strongest
defensible GWT claim and remaining caveats are recorded in
docs/honest-gwt-report.md. The current benchmark summary is recorded in
docs/benchmark-report-v0.3.md.
Bounded model-backed smoke uses production attend mode and deterministic local
GWT embeddings:
GWT_EMBEDDING_PROVIDER=hash GWT_EMBEDDING_MODEL=hash GWT_EMBEDDING_DIM=64 \
BENCHMARK_CONCURRENCY=1 \
BENCHMARK_ATTEND_BROADCAST_BUS=1 \
python -m tests.benchmarks.ruler_multi_hop \
--chain-type advisor --hops 2 --distractors 3 10 \
--tasks-per-config 1 --max-tasks 2 --gwt-mode attend
Repeat with --chain-type workplace and --chain-type discovery, then run:
GWT_EMBEDDING_PROVIDER=hash GWT_EMBEDDING_MODEL=hash GWT_EMBEDDING_DIM=64 \
BENCHMARK_CONCURRENCY=1 \
BENCHMARK_ATTEND_BROADCAST_BUS=1 \
python -m tests.benchmarks.longbench_pro \
--task-types count filter aggregate top_k synthesis \
--records 12 --tasks-per-config 1 --max-tasks 5 --gwt-mode attend
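The smoke sequence above (three RULER chain types, then LongBench) can be scripted. The following is a sketch only: the environment variables, module names, and flags are copied from the commands above, while `SMOKE_ENV`, `smoke_commands`, and `run_smoke` are hypothetical names introduced for illustration.

```python
import os
import subprocess
import sys

# Environment taken verbatim from the smoke commands above.
SMOKE_ENV = {
    "GWT_EMBEDDING_PROVIDER": "hash",
    "GWT_EMBEDDING_MODEL": "hash",
    "GWT_EMBEDDING_DIM": "64",
    "BENCHMARK_CONCURRENCY": "1",
    "BENCHMARK_ATTEND_BROADCAST_BUS": "1",
}


def smoke_commands(python=sys.executable):
    """Build the argv lists for the bounded smoke run (hypothetical helper)."""
    commands = []
    for chain in ("advisor", "workplace", "discovery"):
        commands.append([
            python, "-m", "tests.benchmarks.ruler_multi_hop",
            "--chain-type", chain, "--hops", "2", "--distractors", "3", "10",
            "--tasks-per-config", "1", "--max-tasks", "2", "--gwt-mode", "attend",
        ])
    commands.append([
        python, "-m", "tests.benchmarks.longbench_pro",
        "--task-types", "count", "filter", "aggregate", "top_k", "synthesis",
        "--records", "12", "--tasks-per-config", "1", "--max-tasks", "5",
        "--gwt-mode", "attend",
    ])
    return commands


def run_smoke():
    """Run each command in order and stop at the first failure."""
    env = {**os.environ, **SMOKE_ENV}
    for argv in smoke_commands():
        subprocess.run(argv, env=env, check=True)
```

Calling `run_smoke()` from the repository root would issue real model-backed requests against the configured endpoint, so the sketch leaves it uninvoked.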
Latest Qwen release matrix on 2026-04-28:
| Slice | Bus | Tasks | GWT | Baseline | Avg GWT Tool Calls | Bus accepted/inhibited |
|---|---|---|---|---|---|---|
| RULER advisor/workplace/discovery | on | 12 | 100% | 100% | 3.0 | 12 / 24 |
| RULER advisor/workplace/discovery | off | 12 | 100% | 100% | 3.0 | 0 / 0 |
| LongBench count/filter/aggregate/top_k/synthesis | on | 10 | 100% | 100% | 3.0 | 10 / 10 |
| LongBench count/filter/aggregate/top_k/synthesis | off | 10 | 100% | 100% | 3.0 | 0 / 0 |
The release matrix was run with RULER chain types
advisor/workplace/discovery, hops 2 and 3, distractors 3 and 10, and LongBench
records 12 and 30 across all five task types. The bus subscriber timeout/error
count was 0 / 0 in all bus-on reports.
GWT_EMBEDDING_PROVIDER=hash GWT_EMBEDDING_MODEL=hash GWT_EMBEDDING_DIM=64 \
BENCHMARK_CONCURRENCY=4 BENCHMARK_ATTEND_BROADCAST_BUS=1 \
python -m tests.benchmarks.ruler_multi_hop \
--chain-type advisor --hops 2 3 --distractors 3 10 \
--tasks-per-config 1 --gwt-mode attend
Repeat with --chain-type workplace and --chain-type discovery, then repeat all
commands with BENCHMARK_ATTEND_BROADCAST_BUS=0. Run LongBench with:
python -m tests.benchmarks.longbench_pro \
--task-types count filter aggregate top_k synthesis \
--records 12 30 --tasks-per-config 1 --gwt-mode attend
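The task counts in the release matrix table follow from the configuration grid described above. This worked example assumes that each combination of flag values forms one config and that `--tasks-per-config 1` yields one task per config; the variable names are illustrative.

```python
from itertools import product

tasks_per_config = 1

# RULER: chain types x hops x distractors, as listed for the release matrix.
ruler_grid = list(product(
    ["advisor", "workplace", "discovery"],
    [2, 3],
    [3, 10],
))

# LongBench: task types x record counts.
longbench_grid = list(product(
    ["count", "filter", "aggregate", "top_k", "synthesis"],
    [12, 30],
))

ruler_tasks = len(ruler_grid) * tasks_per_config          # 3 * 2 * 2 = 12
longbench_tasks = len(longbench_grid) * tasks_per_config  # 5 * 2 = 10
print(ruler_tasks, longbench_tasks)
```

Under these assumptions the totals are 12 and 10, which agree with the Tasks column of the release matrix.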
In the current release matrix, bus-on runs show bus resolution activity
(nonzero accepted and inhibited counts) with no regression in accuracy or tool
calls, now that deterministic resolve_answer suppresses lower-priority recall
queries.
Benchmark JSON outputs are generated under the git-ignored
tests/benchmarks/results/ directory and must not be committed.
To measure the bus itself, repeat the same commands with
BENCHMARK_ATTEND_BROADCAST_BUS=0 and compare accuracy, tool calls, and trace
proposal counts against the default bus-on run.
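The bus-on versus bus-off comparison can be reduced to a few deltas. This is a sketch under stated assumptions: `RunSummary` and `compare` are hypothetical names, the field layout does not reflect the actual benchmark JSON schema, and the numbers are the RULER rows of the 2026-04-28 release matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical per-slice summary; not the real result schema."""
    accuracy: float        # GWT accuracy as a fraction
    avg_tool_calls: float  # average GWT tool calls per task
    accepted: int          # bus proposals accepted
    inhibited: int         # bus proposals inhibited


def compare(bus_on, bus_off):
    """Return bus-on minus bus-off for each tracked metric."""
    return {
        "accuracy_delta": bus_on.accuracy - bus_off.accuracy,
        "tool_call_delta": bus_on.avg_tool_calls - bus_off.avg_tool_calls,
        "accepted_delta": bus_on.accepted - bus_off.accepted,
        "inhibited_delta": bus_on.inhibited - bus_off.inhibited,
    }


# Values from the RULER advisor/workplace/discovery rows of the release matrix.
ruler_on = RunSummary(accuracy=1.0, avg_tool_calls=3.0, accepted=12, inhibited=24)
ruler_off = RunSummary(accuracy=1.0, avg_tool_calls=3.0, accepted=0, inhibited=0)

delta = compare(ruler_on, ruler_off)
print(delta)
```

Zero accuracy and tool-call deltas alongside nonzero proposal deltas is the pattern the release matrix reports for this slice.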