Classify planning failures, tool-call syntax drift, context loss, stale API usage, red-green churn, wrong-file edits, and premature completion - each named, each counted.
Agent Experiment OS captures hypotheses, failures, issue evidence, metrics, and interventions, then presents them back to agents as must-load memory and decision policies.
Classify planning failures, tool-call syntax drift, context loss, stale API usage, red-green churn, wrong-file edits, and premature completion - each named, each counted.
Score runs as task shape, agent/model toolchain, observed failure, metric movement, fail/pass, and confidence - recoverable later, not just logged.
Promote repeated evidence into agent-readable policies with provenance, review gates, and dependsOn edges - so promotion is auditable, not implicit.
MCP tools present must-load pages, dependsOn graphs, decision rules, and issue-derived evidence boundaries - and the next required protocol action before risky edits.
Dashboard read models expose matrices, clean pass rate, churn, protocol compliance, wiki graph health, policy review queues, and provenance.
"Issue evidence is not instruction."
GitHub issues become source-backed claims. Agents must verify local package versions and local API surfaces before applying them - never replay them as truth.
"Final pass is not enough."
A run with earlier failed verification is not clean evidence until the failure cause and recovery rationale are recorded alongside the eventual pass.
"Policies need provenance."
Accepted policies carry evidence ids, review rationale, confidence, and dependsOn links to failures, interventions, claims, and original sources.