A Agent Experiment OS / v0.4
research preview - open evidence model
Runs 1,284 Policies 37 Updated 2026 · 04 · 28
MCP-native research infrastructure / coding agents

Turn agent runs into experimental knowledge.

Agent Experiment OS captures hypotheses, failures, issue evidence, metrics, and interventions, then presents them back to agents as must-load memory and decision policies.

§ 01Why it exists

Evals tell you what passed. Experiment OS preserves why it happened.

A · taxonomy
Failure taxonomy

Classify planning failures, tool-call syntax drift, context loss, stale API usage, red-green churn, wrong-file edits, and premature completion - each named, each counted.

7 classesbrowse ↗
B · memory
Experiment memory

Score runs as task shape, agent/model toolchain, observed failure, metric movement, fail/pass, and confidence - recoverable later, not just logged.

structured runsschema ↗
C · policies
Decision policies

Promote repeated evidence into agent-readable policies with provenance, review gates, and dependsOn edges - so promotion is auditable, not implicit.

§ 02System shape

Hypothesis run failure interpretation intervention policy.

01Hypothesis
02Test design
03Run
04Observed failures
05Metric movement
06Interpretation
07Intervention
08Policy

For agents

Read at task start

MCP tools present must-load pages, dependsOn graphs, decision rules, and issue-derived evidence boundaries - and the next required protocol action before risky edits.

For humans

Read at review time

Dashboard read models expose matrices, clean pass rate, churn, protocol compliance, wiki graph health, policy review queues, and provenance.

§ 03Research stance

Not memory for memory's sake. Evidence becomes useful only after protocol and review.

i

Issue evidence

From evidence to claim.

"Issue evidence is not instruction."

GitHub issues become source-backed claims. Agents must verify local package versions and local API surfaces before applying them - never replay them as truth.

ii

Final pass

A green run lies.

"Final pass is not enough."

A run with earlier failed verification is not clean evidence until the failure cause and recovery rationale are recorded alongside the eventual pass.

iii

Provenance

Promotion is auditable.

"Policies need provenance."

Accepted policies carry evidence ids, review rationale, confidence, and dependsOn links to failures, interventions, claims, and original sources.