Turn agent runs into experimental knowledge.

Agent Experiment OS captures hypotheses, failures, issue evidence, metrics, and interventions, then presents them back to agents as must-load memory and decision policies.

View repository Explore the system Read the thesis

§ 01Why it exists

Evals tell you what passed. Experiment OS preserves why it happened.

A · taxonomy

Failure taxonomy

Classify planning failures, tool-call syntax drift, context loss, stale API usage, red-green churn, wrong-file edits, and premature completion - each named, each counted.

7 classesbrowse ↗

B · memory

Experiment memory

Score runs as task shape, agent/model toolchain, observed failure, metric movement, fail/pass, and confidence - recoverable later, not just logged.

structured runsschema ↗

C · policies

Decision policies

Promote repeated evidence into agent-readable policies with provenance, review gates, and dependsOn edges - so promotion is auditable, not implicit.

37 activeread policies ↗

§ 02System shape

Hypothesis → run → failure → interpretation → intervention → policy.

01Hypothesis

02Test design

03Run

04Observed failures

05Metric movement

06Interpretation

07Intervention

08Policy

For agents

Read at task start

MCP tools present must-load pages, dependsOn graphs, decision rules, and issue-derived evidence boundaries - and the next required protocol action before risky edits.

For humans

Read at review time

Dashboard read models expose matrices, clean pass rate, churn, protocol compliance, wiki graph health, policy review queues, and provenance.

§ 03Research stance

Not memory for memory's sake. Evidence becomes useful only after protocol and review.

Issue evidence

From evidence to claim.

"Issue evidence is not instruction."

GitHub issues become source-backed claims. Agents must verify local package versions and local API surfaces before applying them - never replay them as truth.

Final pass

A green run lies.

"Final pass is not enough."

A run with earlier failed verification is not clean evidence until the failure cause and recovery rationale are recorded alongside the eventual pass.

iii

Provenance

Promotion is auditable.

"Policies need provenance."

Accepted policies carry evidence ids, review rationale, confidence, and dependsOn links to failures, interventions, claims, and original sources.