docs: add SWE-bench section to CLAUDE.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-08-03 11:10:23 +02:00 · 2026-02-15 18:32:04 +08:00
parent 10c57c0f7a
commit 45acb965ba
1 changed file with 20 additions and 0 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -190,6 +190,26 @@ Logged events: `run_start`, `run_end`, `llm_call`, `llm_result`, `tool_start`, `

 Each line is a JSON object with `ts` (timestamp) and `event` (type), suitable for AI-assisted log analysis. Full event reference: `packages/core/src/agent/run-log.ts`.

+## SWE-bench (Agent Benchmark)
+
+Run the Multica agent against [SWE-bench](https://www.swebench.com/), a widely used benchmark that evaluates AI coding agents on real GitHub issues.
+
+```bash
+# Download dataset
+python scripts/swe-bench/download-dataset.py --dataset lite --limit 5
+
+# Run agent against tasks
+npx tsx scripts/swe-bench/run.ts --limit 5 --provider kimi-coding
+
+# Analyze results
+npx tsx scripts/swe-bench/analyze.ts
+
+# Official evaluation (requires Docker)
+bash scripts/swe-bench/evaluate.sh
+```
+
+Scripts are in `scripts/swe-bench/`. Full guide: `docs/swe-bench.md`.
+
 ## E2E Testing (Agent-Driven)

 E2E tests are executed and analyzed by the Coding Agent (Claude Code), not by vitest. The Coding Agent runs the Multica agent via CLI, reads the structured run-log, and intelligently analyzes intermediate behavior and results.
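
The four SWE-bench steps added in this commit run in a fixed order, so they could be chained in one wrapper. The following is a minimal sketch and not part of the commit: the function names `run` and `swe_bench_pipeline` and the `LIMIT`, `PROVIDER`, and `DRY_RUN` variables are hypothetical, and only the four commands themselves come from the added section. It defaults to a dry run that just prints each command, and it skips the Docker-based evaluation when `docker` is not on the `PATH`.

```bash
#!/usr/bin/env bash
# Sketch: chain the SWE-bench steps from the CLAUDE.md section.
# LIMIT, PROVIDER and DRY_RUN are hypothetical knobs, not flags of the real scripts.
set -euo pipefail

LIMIT="${LIMIT:-5}"                  # number of tasks to download and run
PROVIDER="${PROVIDER:-kimi-coding}"  # provider name passed to run.ts
DRY_RUN="${DRY_RUN:-1}"              # 1 = only print commands, 0 = execute them

# Print a command, then execute it unless this is a dry run.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

swe_bench_pipeline() {
  run python scripts/swe-bench/download-dataset.py --dataset lite --limit "$LIMIT"
  run npx tsx scripts/swe-bench/run.ts --limit "$LIMIT" --provider "$PROVIDER"
  run npx tsx scripts/swe-bench/analyze.ts

  # The official evaluation requires Docker, so skip it when Docker is missing.
  if command -v docker >/dev/null 2>&1; then
    run bash scripts/swe-bench/evaluate.sh
  else
    echo "docker not found; skipping official evaluation" >&2
  fi
}

swe_bench_pipeline
```

Run it from the repository root. With the defaults it only echoes the commands; setting `DRY_RUN=0` executes them, and `set -e` stops the chain at the first failing step so later steps never run on incomplete output.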