Autonomous optimization loop for any text asset with a scalar metric.
Inspired by Karpathy's autoresearch. Packaged for OpenClaw.
⭐ If this saves you time, a star helps others find it.
Give AutoResearch a file and a metric command:
```
autoresearch run \
  --file ./my-prompt.md \
  --metric "python eval-prompt.py" \
  --budget 30
```

It runs a loop:
- Read your current file
- Agent proposes ONE targeted change
- Measure the metric after applying it
- Keep if improved, revert if worse
- Repeat until budget exhausted
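A minimal Python sketch of this keep-or-revert loop, for intuition only (the real CLI commits and reverts through git, and `propose_change` here stands in for the agent call):

```python
import subprocess

def run_metric(cmd: str, path: str) -> float:
    # Assumed contract: the metric command receives the file path as its
    # last argument and prints a single float to stdout.
    out = subprocess.run(f"{cmd} {path}", shell=True, check=True,
                         capture_output=True, text=True)
    return float(out.stdout.strip())

def optimize(path: str, metric_cmd: str, budget: int, propose_change) -> float:
    best = run_metric(metric_cmd, path)        # baseline
    for _ in range(budget):                    # never exceeds the budget
        with open(path) as f:
            original = f.read()
        with open(path, "w") as f:             # apply the agent's ONE change
            f.write(propose_change(original))
        score = run_metric(metric_cmd, path)
        if score > best:
            best = score                       # keep (the CLI git-commits here)
        else:
            with open(path, "w") as f:         # revert (the CLI uses git checkout)
                f.write(original)
    return best
```

With `--goal minimize` the comparison flips to `score < best`.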
Wake up to a report showing which experiments worked, what changed, and the best version found.
```
npm install -g autoresearch-openclaw
# or
npx autoresearch-openclaw init
```

```
autoresearch init
```

Creates `~/.autoresearch/config.json` and prints your dashboard.

```
autoresearch demo --budget 5
```

Optimizes a sample cold outreach template (no setup required).
```
# First, set up git in your project directory
cd my-project && git init && git add . && git commit -m "initial"

# Then run autoresearch
autoresearch run \
  --file ./outreach-template.md \
  --metric "bash measure-reply-rate.sh" \
  --budget 20
```

```
┌──────────────────────────────────────────────────────────────┐
│                      AutoResearch Loop                       │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Baseline metric  → Read file, measure starting point     │
│                                                              │
│  2. Iteration (1..N) → For each iteration:                   │
│       - Read current file                                    │
│       - Agent proposes ONE change                            │
│       - Apply change, measure metric                         │
│       - If improved → git commit (keep)                      │
│       - If worse    → git checkout (revert)                  │
│       - Log to results.tsv                                   │
│                                                              │
│  3. Final report     → Summary of all experiments            │
│                      → Best version snapshot                 │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
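The keep/revert branch above maps onto two ordinary git operations; a sketch, assuming the optimized file is tracked in the repository:

```python
import subprocess

def keep(path: str, description: str) -> None:
    # Improvement: commit the change atomically, one experiment per commit
    subprocess.run(["git", "add", path], check=True)
    subprocess.run(["git", "commit", "-m", description], check=True)

def revert(path: str) -> None:
    # Regression: restore the last committed version of the file
    subprocess.run(["git", "checkout", "--", path], check=True)
```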
The AI agent receives:
- The current file content
- Optimization instructions (domain-specific)
- History of previous iterations (to avoid retry loops)
The agent must propose exactly ONE change with a clear hypothesis.
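Conceptually, each iteration hands the agent a payload like this (field names are illustrative, not the tool's actual schema):

```python
iteration_payload = {
    "file_content": "...",   # current version of the asset
    "instructions": "...",   # domain-specific optimization guidance
    "history": [             # earlier experiments, so the agent
        {                    # doesn't retry ideas that already failed
            "hypothesis": "Shorter subject lines lift reply rate",
            "metric_before": 0.12,
            "metric_after": 0.14,
            "kept": True,
        },
    ],
}
```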
Commands:

```
autoresearch init                        Initialize workspace
autoresearch run --file <path> \
    --metric "command" [--budget 20]     Run optimization
autoresearch status                      Show dashboard
autoresearch sessions list               List all past runs
autoresearch sessions show <id>          Show session details
autoresearch demo [--budget 5]           Try the demo
autoresearch skills                      List bundled skills
```
Optimize system prompts against evaluation scores.

```
autoresearch run --skill prompt-optimizer \
  --file system-prompt.md \
  --metric "python eval.py"
```

Optimize cold email templates for reply rates.

```
autoresearch run --skill outreach-optimizer \
  --file outreach.md \
  --metric "bash measure-reply-rate.sh"
```

Optimize blog posts and landing copy.

```
autoresearch run --skill content-optimizer \
  --file blog-post.md \
  --metric "python score-engagement.py"
```

Tune YAML/JSON config files against benchmarks.

```
autoresearch run --skill config-tuner \
  --file config.yaml \
  --metric "bash benchmark.sh" \
  --goal minimize
```

Optimize prediction market strategies.

```
autoresearch run --skill prediction-optimizer \
  --file strategy.md \
  --metric "python evaluate-brier.py" \
  --goal minimize
```
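The last two skills pass `--goal minimize` because their metrics are error scores where lower is better. As an illustration, a Brier-score metric could look like this (a hypothetical stand-in for `evaluate-brier.py`; the `forecasts.json` input format is assumed):

```python
import json
import sys

strategy_path = sys.argv[1]  # the file under optimization (unused by this mock)

# Assumed input: a JSON list of {"forecast": p, "outcome": 0 or 1} records,
# e.g. produced by backtesting the strategy elsewhere.
with open("forecasts.json") as f:
    records = json.load(f)

# Brier score: mean squared error of probabilistic forecasts (0 is perfect),
# so lower is better and the loop should minimize it.
brier = sum((r["forecast"] - r["outcome"]) ** 2 for r in records) / len(records)
print(f"{brier:.4f}")
```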
Located at `~/.autoresearch/config.json`:

```json
{
  "version": "1.0.0",
  "model": "claude",
  "defaultBudget": 20,
  "resultsDir": "/home/user/.autoresearch/results"
}
```

Each session creates:
- `results.tsv` — Experiment log: iteration, before/after metrics, kept/reverted, change description
- `report.md` — Human-readable summary
- `best.md` — Snapshot of the best-performing file version
- `hypothesis_log.jsonl` — Agent hypotheses (for learning across runs)
Example `results.tsv`:

```
iteration  metric_before  metric_after  kept      change_desc
1          0.12           0.14          kept      Shortened subject line from 12 to 8 words
2          0.14           0.13          reverted  Increased emotional appeal score
3          0.14           0.17          kept      Added personalization tokens {{first_name}}
```
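Because the log is plain tab-separated text, a finished run is easy to post-process; a quick sketch, assuming the header shown above:

```python
import csv

# Summarize which experiments survived a session
with open("results.tsv") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

kept = [r for r in rows if r["kept"] == "kept"]
print(f"{len(kept)}/{len(rows)} changes kept")
for r in kept:
    print(f'{r["iteration"]}: {r["metric_before"]} -> {r["metric_after"]}  {r["change_desc"]}')
```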
```
═══════════════════════════════════════════════════════════════
                    AutoResearch Dashboard
───────────────────────────────────────────────────────────────
Today:     3 sessions, best gain +42%
All time: 18 sessions, avg gain +28%

Active Runs
───────────────────────────────────────────────────────────────
outreach-mar27   iter 12/30   0.23 → 0.41   (+78%)   ⏳ running
prompt-v2        iter  5/20   0.60 → 0.67   (+12%)   ⏳ running

Recent Sessions
───────────────────────────────────────────────────────────────
prediction-mar25   0.35 → 0.67   (+91%)    (30 iters)
outreach-mar24     0.12 → 0.28   (+133%)   (30 iters)
═══════════════════════════════════════════════════════════════
```
AutoResearch works with OpenClaw agents. Bundled skills are auto-discovered:
- Install autoresearch-openclaw
- Skills appear in OpenClaw agent menus
- Agents trigger with: "optimize my prompt", "improve this template", etc.
✅ Budget cap enforced — loop never exceeds specified iterations
✅ Git-based revert — every bad change is instantly reversed
✅ No file deletion — only edits to specified file
✅ Atomic commits — each improvement committed separately
✅ Reversible — full experiment history in git log
```
autoresearch-openclaw (npm package)
├── bin/autoresearch.js      CLI entry
├── src/cli/
│   ├── index.js             Command router
│   └── commands/
│       ├── init.js
│       ├── run.js
│       ├── status.js
│       ├── demo.js
│       ├── skills.js
│       └── sessions.js
└── src/utils/
    ├── config.js
    └── version.js
```
```bash
# 1. Create template
cat > outreach.md << 'EOF'
Subject: Quick idea for {{company}}

Hi {{first_name}},

I help SaaS teams save 10+ hours per week with automation.
Would you be open to a 15-min call?

Best,
Rayo
EOF

# 2. Create metric script
cat > measure.sh << 'EOF'
#!/bin/bash
# Return mock reply rate (0-1)
python3 -c "
import hashlib
with open('$1') as f:
    h = hashlib.md5(f.read().encode()).hexdigest()
base = 0.10 + (int(h[:8], 16) % 1000) / 4000
# Reward conciseness
if len(open('$1').read().split()) < 100:
    base += 0.05
print(f'{min(0.4, base):.4f}')
"
EOF
chmod +x measure.sh

# 3. Run autoresearch
git init && git add . && git commit -m "initial"
autoresearch run \
  --file outreach.md \
  --metric "bash measure.sh" \
  --budget 20 \
  --session "outreach-jan"

# 4. Check results
autoresearch sessions show outreach-jan
cat ~/.autoresearch/results/outreach-jan/best.md
```

```bash
# Create eval script that scores your prompt
cat > eval.py << 'EOF'
import sys
from anthropic import Anthropic

def score_prompt(prompt_file):
    with open(prompt_file) as f:
        prompt = f.read()
    # Test the prompt against a few cases
    client = Anthropic()
    score = 0
    for test in ["simple math", "logic puzzle", "creative writing"]:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=100,
            system=prompt,
            messages=[{"role": "user", "content": f"Try this: {test}"}],
        )
        # Score based on response quality (simplified)
        score += 0.3 if len(response.content[0].text) > 50 else 0.1
    return score / 3

if __name__ == "__main__":
    print(f"{score_prompt(sys.argv[1]):.4f}")
EOF
# Run autoresearch
autoresearch run \
  --file system-prompt.md \
  --metric "python eval.py" \
  --budget 25 \
  --skill prompt-optimizer
```

Limitations:
- Requires a scalar metric command (returns one float)
- Works best with small files (<10KB)
- AI agent calls are sequential (not parallel)
- Metric evaluation can be slow (total runtime ≈ budget × metric time; a 30-iteration run with a 2-minute eval takes about an hour)
Requirements:
- Node.js >=18
- Git (for safe revert)
- Claude CLI or OpenAI API (for the agent)
Issues and PRs welcome at github.com/rmarji/autoresearch-openclaw.
MIT — see LICENSE