gwt-context

Release Readiness

Status as of 2026-04-29: release candidate for local MCP use and bounded OpenAI-compatible benchmark evaluation.

What Is Ready

Verification Gates

Required local gates before release:

python scripts/release_gate.py

Release thresholds are tracked in docs/release-thresholds.md.

Boundary checks:

rg -n "cycle\.(workspace|buffer|goal_manager)|_cycle\.(workspace|buffer|goal_manager)|from gwt_context\.infrastructure|tests\.benchmarks" src/gwt_context/mcp src/gwt_context/application || true
git ls-files | rg -n "(tests/benchmarks/(results|reports)|tests/benchmarks/.*\.log|\.env$|supervisor|\.benchmarks|\.worktrees)" || true
git grep -n "proxy\.runpod\.net" HEAD -- . ':!research' || true
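
Because each boundary check ends in `|| true`, the command exits 0 even when it finds matches, so a violation only shows up as printed output. A minimal sketch of a wrapper that treats any output line as a violation is shown here. The helper names are hypothetical, and this is not a description of what scripts/release_gate.py does.

```python
import subprocess


def violation_lines(output: str) -> list[str]:
    """Return the non-empty lines of a check's output; each one is a match."""
    return [line for line in output.splitlines() if line.strip()]


def run_boundary_check(command: str) -> list[str]:
    """Run one boundary-check command in a shell and return its matches.

    Hypothetical helper: the command is expected to end in `|| true`,
    so the exit status is ignored and only stdout is inspected.
    """
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return violation_lines(result.stdout)
```

An empty list means the check is clean. Any returned line is a path or match to inspect before release.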

Bus matrix summary:

python -m tests.benchmarks.bus_matrix --summarize tests/benchmarks/results/*.json
python -m tests.benchmarks.analyze_results tests/benchmarks/results/*.json

For a bounded model-backed sanity run against the endpoint configured in .env:

python scripts/qwen_sanity.py --run --max-tasks 2

Latest max_tasks=5 Qwen sanity on 2026-04-29:

| Slice | Bus | Tasks | GWT | Baseline | Avg GWT Tool Calls | Bus accepted/inhibited |
| --- | --- | --- | --- | --- | --- | --- |
| RULER advisor | on | 2 | 100% | 100% | 3.0 | 2 / 4 |
| RULER advisor | off | 2 | 100% | 100% | 3.0 | 0 / 0 |
| LongBench count/filter/aggregate/top_k/synthesis | on | 5 | 100% | 100% | 3.0 | 5 / 5 |
| LongBench count/filter/aggregate/top_k/synthesis | off | 5 | 100% | 80% | 3.0 | 0 / 0 |

Dogfood evidence is recorded in docs/dogfood-report.md. The strongest defensible GWT claim and remaining caveats are recorded in docs/honest-gwt-report.md. The current benchmark summary is recorded in docs/benchmark-report-v0.3.md.

Qwen Smoke

Bounded model-backed smoke uses production attend mode and deterministic local GWT embeddings:

GWT_EMBEDDING_PROVIDER=hash GWT_EMBEDDING_MODEL=hash GWT_EMBEDDING_DIM=64 \
BENCHMARK_CONCURRENCY=1 \
BENCHMARK_ATTEND_BROADCAST_BUS=1 \
python -m tests.benchmarks.ruler_multi_hop \
  --chain-type advisor --hops 2 --distractors 3 10 \
  --tasks-per-config 1 --max-tasks 2 --gwt-mode attend

Repeat with --chain-type workplace and --chain-type discovery, then run:

GWT_EMBEDDING_PROVIDER=hash GWT_EMBEDDING_MODEL=hash GWT_EMBEDDING_DIM=64 \
BENCHMARK_CONCURRENCY=1 \
BENCHMARK_ATTEND_BROADCAST_BUS=1 \
python -m tests.benchmarks.longbench_pro \
  --task-types count filter aggregate top_k synthesis \
  --records 12 --tasks-per-config 1 --max-tasks 5 --gwt-mode attend
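
When the smoke has to be repeated across chain types, the same environment and arguments can be assembled programmatically. The sketch below mirrors the RULER command above. The helper names `smoke_env` and `ruler_smoke_argv` are hypothetical and are not part of the repository.

```python
import os
import sys


def smoke_env(bus_on: bool = True) -> dict[str, str]:
    """Copy the current environment and apply the smoke settings shown above."""
    env = dict(os.environ)
    env.update(
        {
            "GWT_EMBEDDING_PROVIDER": "hash",
            "GWT_EMBEDDING_MODEL": "hash",
            "GWT_EMBEDDING_DIM": "64",
            "BENCHMARK_CONCURRENCY": "1",
            "BENCHMARK_ATTEND_BROADCAST_BUS": "1" if bus_on else "0",
        }
    )
    return env


def ruler_smoke_argv(chain_type: str) -> list[str]:
    """Build the argument list for one RULER smoke run."""
    return [
        sys.executable, "-m", "tests.benchmarks.ruler_multi_hop",
        "--chain-type", chain_type,
        "--hops", "2",
        "--distractors", "3", "10",
        "--tasks-per-config", "1",
        "--max-tasks", "2",
        "--gwt-mode", "attend",
    ]


# Example (not executed here):
#   import subprocess
#   for chain in ("advisor", "workplace", "discovery"):
#       subprocess.run(ruler_smoke_argv(chain), env=smoke_env(), check=True)
```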

Latest Qwen release matrix on 2026-04-28:

| Slice | Bus | Tasks | GWT | Baseline | Avg GWT Tool Calls | Bus accepted/inhibited |
| --- | --- | --- | --- | --- | --- | --- |
| RULER advisor/workplace/discovery | on | 12 | 100% | 100% | 3.0 | 12 / 24 |
| RULER advisor/workplace/discovery | off | 12 | 100% | 100% | 3.0 | 0 / 0 |
| LongBench count/filter/aggregate/top_k/synthesis | on | 10 | 100% | 100% | 3.0 | 10 / 10 |
| LongBench count/filter/aggregate/top_k/synthesis | off | 10 | 100% | 100% | 3.0 | 0 / 0 |

The release matrix was run with RULER chain types advisor, workplace, and discovery; hops 2 and 3; distractors 3 and 10; and LongBench records 12 and 30 across all five task types. The bus subscriber timeout and error counts were both 0 in all bus-on reports.
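
The task counts in the release matrix can be checked by enumerating the configurations. The sketch below assumes each benchmark runs the full cross product of its parameters with one task per configuration. That is an assumption about the harness, but it reproduces the 12 and 10 in the Tasks column.

```python
from itertools import product

# Parameter values taken from the release-matrix description.
ruler_configs = list(
    product(["advisor", "workplace", "discovery"], [2, 3], [3, 10])
)
longbench_configs = list(
    product(["count", "filter", "aggregate", "top_k", "synthesis"], [12, 30])
)

tasks_per_config = 1  # matches --tasks-per-config 1

# 3 chain types x 2 hop counts x 2 distractor counts
print(len(ruler_configs) * tasks_per_config)      # 12
# 5 task types x 2 record counts
print(len(longbench_configs) * tasks_per_config)  # 10
```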

GWT_EMBEDDING_PROVIDER=hash GWT_EMBEDDING_MODEL=hash GWT_EMBEDDING_DIM=64 \
BENCHMARK_CONCURRENCY=4 BENCHMARK_ATTEND_BROADCAST_BUS=1 \
python -m tests.benchmarks.ruler_multi_hop \
  --chain-type advisor --hops 2 3 --distractors 3 10 \
  --tasks-per-config 1 --gwt-mode attend

Repeat for --chain-type workplace and --chain-type discovery, then repeat all commands with BENCHMARK_ATTEND_BROADCAST_BUS=0. Run LongBench with:

python -m tests.benchmarks.longbench_pro \
  --task-types count filter aggregate top_k synthesis \
  --records 12 30 --tasks-per-config 1 --gwt-mode attend

The current release matrix shows bus resolution activity (nonzero accepted and inhibited counts in the bus-on runs) with no accuracy or tool-call regression relative to the bus-off runs, after deterministic resolve_answer suppresses lower-priority recall queries.

Benchmark JSON outputs are written to tests/benchmarks/results/, which is ignored by version control, and must not be committed.

To measure the bus itself, repeat the same commands with BENCHMARK_ATTEND_BROADCAST_BUS=0 and compare accuracy, tool calls, and trace proposal counts against the default bus-on run.
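
The bus-on versus bus-off comparison can be sketched as a small check over two summary rows. The field names below are illustrative and are not the benchmark JSON schema. Reading the "GWT" column as accuracy is also an assumption. The values are copied from the RULER rows of the 2026-04-28 release matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceResult:
    """One summary row; field names are hypothetical, not the JSON schema."""

    gwt_accuracy: float  # percent, assumed meaning of the "GWT" column
    avg_tool_calls: float
    bus_accepted: int
    bus_inhibited: int


def bus_regressions(bus_on: SliceResult, bus_off: SliceResult) -> list[str]:
    """List the metrics on which the bus-on run is worse than bus-off."""
    problems = []
    if bus_on.gwt_accuracy < bus_off.gwt_accuracy:
        problems.append("accuracy")
    if bus_on.avg_tool_calls > bus_off.avg_tool_calls:
        problems.append("tool_calls")
    return problems


# RULER advisor/workplace/discovery rows from the release matrix.
ruler_on = SliceResult(100.0, 3.0, 12, 24)
ruler_off = SliceResult(100.0, 3.0, 0, 0)

print(bus_regressions(ruler_on, ruler_off))  # []
```

An empty list together with nonzero accepted and inhibited counts in the bus-on row indicates that the bus was active without costing accuracy or extra tool calls.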

Remaining Non-Blockers