feat(scheduler): DB-backed execution-record scheduler [MUL-2957]

2026-06-16 19:29:26 +02:00 · 2026-06-05 13:46:26 +08:00
parent 63b847ee48
commit 3caba86b09
25 changed files with 3300 additions and 259 deletions
--- a/SELF_HOSTING_ADVANCED.md
+++ b/SELF_HOSTING_ADVANCED.md
@@ -184,74 +184,35 @@ cd server && go run ./cmd/migrate up

 ## Usage Dashboard Rollup

-The Usage and Runtime dashboards read from `task_usage_hourly`, a derived table populated by `rollup_task_usage_hourly()`. The function is **not** scheduled out of the box on the default self-host stack: the bundled `pgvector/pgvector:pg17` image ships without `pg_cron`, and the backend does not run the rollup in-process either. Until something calls it on a schedule, raw `task_usage` rows will keep arriving while the dashboard stays at zero.
+The Usage and Runtime dashboards read from `task_usage_hourly`, a derived table populated by `rollup_task_usage_hourly()`. As of MUL-2957 the backend runs this rollup **in-process** on every replica via a DB-backed scheduler (`sys_cron_executions`); a fresh self-host install needs no operator action — the bundled `pgvector/pgvector:pg17` image works without changes.

-Pick one of the supported paths:
+### How the in-process scheduler works

-### Option A — External cron / systemd-timer
-
-The simplest path. Schedule `SELECT rollup_task_usage_hourly()` every five minutes from any out-of-band timer (host crontab, systemd timer, sidecar container, Kubernetes CronJob). It is idempotent and watermark-driven — overlapping runs are no-ops on an internal advisory lock, and a missed tick catches up on the next run.
-
-Docker Compose:
-
-```bash
-# /etc/cron.d/multica-rollup
-*/5 * * * * root docker compose -f /path/to/multica/docker-compose.selfhost.yml \
-  exec -T postgres psql -U multica -d multica \
-  -c "SELECT rollup_task_usage_hourly();" >/dev/null
-```
-
-Kubernetes (one-off `CronJob`):
-
-```yaml
-apiVersion: batch/v1
-kind: CronJob
-metadata:
-  name: multica-usage-rollup
-spec:
-  schedule: "*/5 * * * *"
-  concurrencyPolicy: Forbid
-  jobTemplate:
-    spec:
-      template:
-        spec:
-          restartPolicy: OnFailure
-          containers:
-            - name: psql
-              image: postgres:17-alpine
-              command:
-                - psql
-                - "$(DATABASE_URL)"
-                - -c
-                - "SELECT rollup_task_usage_hourly();"
-              env:
-                - name: DATABASE_URL
-                  valueFrom:
-                    secretKeyRef:
-                      name: multica-secrets
-                      key: DATABASE_URL
-```
-
-### Option B — Postgres with `pg_cron`
-
-If you'd rather have Postgres schedule itself, swap the bundled image for one that ships both `pgvector` and `pg_cron` (e.g. `supabase/postgres`, or a custom build of `pgvector/pgvector` with `pg_cron` added). `pg_cron` requires `shared_preload_libraries=pg_cron` in `postgresql.conf`, which only takes effect on Postgres restart — set it before bringing the container up.
-
-Then register the job once:
+Every backend replica ticks every 30 seconds and tries to claim the current 5-minute UTC plan in `sys_cron_executions`. The unique key `(job_name, scope_kind, scope_id, plan_time)` makes the claim a single-winner contest across all replicas, so multi-instance deployments do not double-write. The handler then calls `SELECT rollup_task_usage_hourly()`; the SQL function holds advisory lock `4246` internally, so a stray `pg_cron` job or manual call can run alongside the scheduler without ever colliding on the rollup itself. Inspect the audit table for steady-state operation:

 ```sql
-CREATE EXTENSION IF NOT EXISTS pg_cron;
-SELECT cron.schedule(
-  'rollup_task_usage_hourly',
-  '*/5 * * * *',
-  $$SELECT rollup_task_usage_hourly()$$
-);
+SELECT plan_time, status, attempt, runner_id,
+       error_code, error_msg, started_at, finished_at
+  FROM sys_cron_executions
+ WHERE job_name = 'rollup_task_usage_hourly'
+ ORDER BY plan_time DESC
+ LIMIT 20;
 ```

-`pg_cron.database_name` defaults to `postgres`; if your Multica database has a different name, point `pg_cron` at it via that GUC or run `cron.schedule_in_database(...)` instead.
+### Compatibility — existing `pg_cron` registrations

-### Option C — Backfill historical data first
+If you previously registered the rollup as a `pg_cron` job (`SELECT cron.schedule('rollup_task_usage_hourly', '*/5 * * * *', …)`), it is safe to leave it in place: advisory lock 4246 prevents double-writes, and the loser path no-ops cleanly. To drop the redundant entry once the in-process scheduler is up:

-`rollup_task_usage_hourly()` only processes new buckets after it starts running. If you already have `task_usage` rows from before the rollup was scheduled — most commonly when upgrading from `v0.3.4` to `v0.3.5+`, or on a fresh install that has been collecting usage for a while — run `backfill_task_usage_hourly` once to seed historical buckets, then set up Option A or Option B for ongoing rollups.
+```sql
+SELECT cron.unschedule('rollup_task_usage_hourly')
+  FROM cron.job WHERE jobname = 'rollup_task_usage_hourly';
+```
+
+External cron / systemd / Kubernetes `CronJob` setups that call `SELECT rollup_task_usage_hourly()` directly are also still valid — they were the only option before MUL-2957 and remain a supported compatibility path. They are no longer the recommended setup; new deployments should rely on the in-process scheduler.
+
+### Standalone backfill command
+
+`rollup_task_usage_hourly()` only processes new buckets after it starts running. If you already have `task_usage` rows from before the rollup was claimed for the first time — most commonly when upgrading from `v0.3.4` to `v0.3.5+` on a database that already has months of usage — you can run `backfill_task_usage_hourly` to seed historical buckets:

 ```bash
 # Docker Compose
@@ -263,7 +224,7 @@ kubectl -n multica exec deploy/multica-backend -- \
  ./backfill_task_usage_hourly --sleep-between-slices=2s
 ```

-The command walks `task_usage`'s full time range in monthly slices and calls the same idempotent primitive the cron path uses, so it's safe to re-run, to interrupt with Ctrl-C, and to run concurrently with an already-scheduled rollup. Flags:
+The command walks `task_usage`'s full time range in monthly slices and calls the same idempotent primitive the in-process scheduler uses, so it's safe to re-run, to interrupt with Ctrl-C, and to run concurrently with the scheduler (advisory lock 4246 serialises them). Flags:

 | Flag | Description |
 |---|---|
@@ -275,17 +236,9 @@ After backfill completes, the rollup-state watermark is stamped to `now() - 5 mi

 ### `v0.3.4 → v0.3.5+` upgrade order

-Migration `103` adds a fail-closed guard that refuses to drop the legacy daily rollups until `task_usage_hourly` has caught up. If you run `migrate up` straight through on a database with existing `task_usage` rows, it aborts with:
+Migration `103` adds a fail-closed guard that refuses to drop the legacy daily rollups until `task_usage_hourly` has caught up. As of MUL-2957 the migrate command runs an idempotent monthly-slice backfill (under advisory lock 4246) **automatically** immediately before applying migration `103`, so v0.3.4 → v0.3.5+ upgrades complete in a single `migrate up` invocation — no operator step is required.

-```text
-ERROR: refusing to drop legacy daily rollups:
-  task_usage_hourly_rollup_state.watermark_at (1970-01-01 ...) trails
-  task_usage latest event (...) by more than 01:00:00 — backfill is
-  incomplete or pg_cron is not running. Run cmd/backfill_task_usage_hourly
-  (and let pg_cron catch up) before re-running migrate
-```
-
-Recovery is straightforward: run `backfill_task_usage_hourly` (Option C above), then re-run `migrate up` (or restart the backend container — migrations run automatically on startup). **Fresh installs are exempt** — the guard short-circuits when `task_usage` is empty, and migrations succeed, but the dashboard will still stay at zero until you set up Option A or Option B.
+If you are upgrading from a binary that pre-dates MUL-2957 (or the auto-hook fails for an environmental reason), recovery is the manual path: run `backfill_task_usage_hourly` against the database, then re-run `migrate up` (or restart the backend container — migrations run automatically on startup). **Fresh installs are exempt** — the guard short-circuits when `task_usage` is empty, and the in-process scheduler picks up new buckets from the first tick.

 ## Manual Setup (Without Docker Compose)

--- a/apps/docs/content/docs/self-host-quickstart.ja.mdx
+++ b/apps/docs/content/docs/self-host-quickstart.ja.mdx
@@ -192,44 +192,24 @@ multica.example.com {

 Cloud と同じ流れです — [Cloud クイックスタート → ステップ5-6](/cloud-quickstart#5-create-an-agent)を参照してください。

-## 7. 使用量ロールアップのスケジューリング（使用量ダッシュボードに必須）
+## 7. 使用量ロールアップ（オペレーターの操作は不要）

-<Callout type="warning">
-使用量 / ランタイムのダッシュボードは、`rollup_task_usage_hourly()` が埋める派生テーブル `task_usage_hourly` からデータを読み取ります。バンドルされた `pgvector/pgvector:pg17` の Postgres イメージには **`pg_cron` が含まれておらず**、バックエンドもロールアップをインプロセスで実行しません。`rollup_task_usage_hourly()` をスケジューリングするものが何もないと、生の `task_usage` 行は届き続けるのに、ダッシュボードは永遠にゼロのままになります。
+<Callout type="info">
+使用量 / ランタイムのダッシュボードは、`rollup_task_usage_hourly()` が埋める派生テーブル `task_usage_hourly` からデータを読み取ります。MUL-2957 以降、バックエンドは DB バックドのスケジューラー経由でこのロールアップをインプロセスで実行するようになり、`pg_cron` は不要になりました。外部 cron / systemd タイマーも推奨セットアップではなくなっています。バンドルされた `pgvector/pgvector:pg17` イメージは変更なしで動作します。
 </Callout>

-サポートされているオプションのいずれか1つを選んでください — 1つあれば十分です。
+インプロセススケジューラーは 30 秒おきにティックし、`sys_cron_executions` テーブルを介して 5 分ごとの UTC プランをクレームします。複数のバックエンドレプリカでも安全です — 一意キー `(job_name, scope_kind, scope_id, plan_time)` により、各プランで勝者は 1 つだけです。新規デプロイでは何の設定も不要です。

-**オプション A — 外部 cron / systemd-timer（最もシンプル）。** 任意の帯域外スケジューラから5分ごとにロールアップを実行します。冪等でウォーターマーク駆動なので、取りこぼしたティックは追いつきます。
-
-```bash
-# /etc/cron.d/multica-rollup — every 5 minutes
-*/5 * * * * root docker compose -f /path/to/multica/docker-compose.selfhost.yml \
-  exec -T postgres psql -U multica -d multica \
-  -c "SELECT rollup_task_usage_hourly();" >/dev/null
-```
-
-**オプション B — Postgres を `pg_cron` を同梱したイメージに置き換える。** `docker-compose.selfhost.yml` の `pgvector/pgvector:pg17` を、`pgvector` と `pg_cron` の両方を備えたイメージ（`supabase/postgres`、またはカスタムビルド）に置き換え、`shared_preload_libraries=pg_cron` を設定して再起動してから、ジョブを一度登録します。
+**互換性 — 既存の `pg_cron` 登録。** 以前 rollup を `pg_cron` ジョブとして登録していた（`SELECT cron.schedule('rollup_task_usage_hourly', '*/5 * * * *', …)`）場合でも、削除する必要はありません — SQL 関数が内部で advisory lock 4246 を保持するため、アプリのスケジューラーと `pg_cron` が二重書き込みすることはありません。冗長なエントリを削除するには:

 ```sql
-CREATE EXTENSION IF NOT EXISTS pg_cron;
-SELECT cron.schedule(
-  'rollup_task_usage_hourly',
-  '*/5 * * * *',
-  $$SELECT rollup_task_usage_hourly()$$
-);
+SELECT cron.unschedule('rollup_task_usage_hourly')
+  FROM cron.job WHERE jobname = 'rollup_task_usage_hourly';
 ```

-**オプション C — まず履歴をバックフィルする（アップグレード経路）。** `v0.3.4 → v0.3.5+` へアップグレード中で、既存の `task_usage` 行がある場合、migration `103` は hourly テーブルがシードされるまで `refusing to drop legacy daily rollups: ...` とともに `migrate up` を中断します。バンドルされたバックフィルを一度実行してから、オプション A または B を設定してください。
+**`v0.3.4 → v0.3.5+` からのアップグレード。** 以前のリリースでは、migration 103 を適用する前にオペレーターが手動で `cmd/backfill_task_usage_hourly` を実行する必要があり、そうしないと migration の fail-closed ガードが `migrate up` を中断していました。MUL-2957 以降、これは自動化されました — migrate コマンドが migration 103 を適用する直前に冪等な月単位スライスのバックフィル（advisory lock 4246 の下）を実行してから処理を続行します。忙しい DB では `--sleep-between-slices=2s` で読み取り負荷を絞るためにスタンドアロンの backfill を実行することもできますが、もはや必須ではありません。

-```bash
-docker compose -f docker-compose.selfhost.yml exec backend \
-  ./backfill_task_usage_hourly --sleep-between-slices=2s
-```
-
-`--sleep-between-slices=2s` は、忙しい DB での読み取り負荷を調整します。完了後、バックエンドのコンテナを再起動すると（起動時に migration が実行されます）アップグレードが完了します。
-
-完全なリファレンス — Kubernetes の `CronJob` テンプレートとアップグレード順序を含む — は、リポジトリの [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup) にあります。
+完全なリファレンス（運用ノートと Kubernetes デプロイ形態を含む）は、リポジトリの [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup) にあります。

 ## Kubernetes デプロイ（代替手段）

@@ -283,8 +263,8 @@ multica setup self-host \
 - **バックエンドが起動しない**: `docker compose -f docker-compose.selfhost.yml logs backend` でコンテナのログを確認してください。たいていは `.env` の不正な `DATABASE_URL` または `JWT_SECRET` が原因です
 - **認証コードが届かない**: メールバックエンドが構成されていない場合（Resend も SMTP もない）→ `docker compose logs backend` で `[DEV] Verification code` を探してください
 - **WebSocket が接続できない**: 公開デプロイでは、`FRONTEND_ORIGIN` を実際のフロントエンドのドメインに必ず設定する必要があります。[トラブルシューティング → WebSocket が接続できない](/troubleshooting#websocket-wont-connect)を参照してください
- **使用量 / ランタイムのダッシュボードがゼロのまま**: `rollup_task_usage_hourly()` がスケジューリングされていません — 上記の [ステップ7](#7-schedule-the-usage-rollup-required-for-the-usage-dashboard)と[トラブルシューティング → 使用量ダッシュボードがゼロと表示される](/troubleshooting#usage-dashboard-stays-at-zero)を参照してください
- **`migrate up` が `refusing to drop legacy daily rollups` で失敗する**: `v0.3.4 → v0.3.5+` のアップグレード経路ガードです。まず `backfill_task_usage_hourly` を実行してください — [ステップ7 → オプション C](#7-schedule-the-usage-rollup-required-for-the-usage-dashboard)を参照してください
+- **使用量 / ランタイムのダッシュボードがゼロのまま**: `rollup_task_usage_hourly()` がスケジューリングされていません — 上記の [ステップ7](#7-usage-rollup-no-operator-action-required)と[トラブルシューティング → 使用量ダッシュボードがゼロと表示される](/troubleshooting#usage-dashboard-stays-at-zero)を参照してください
+- **`migrate up` が `refusing to drop legacy daily rollups` で失敗する**: `v0.3.4 → v0.3.5+` のアップグレード経路ガードです。MUL-2957 以降、migrate コマンドは migration 103 を適用する前に自動でバックフィルを実行します — [ステップ7](#7-usage-rollup-no-operator-action-required)を参照してください

 ## 次のステップ

--- a/apps/docs/content/docs/self-host-quickstart.ko.mdx
+++ b/apps/docs/content/docs/self-host-quickstart.ko.mdx
@@ -192,44 +192,24 @@ multica.example.com {

 Cloud와 동일한 흐름입니다 — [Cloud 빠른 시작 → 5-6단계](/cloud-quickstart#5-create-an-agent)를 참고하세요.

-## 7. 사용량 롤업 스케줄링(사용량 대시보드에 필수)
+## 7. 사용량 롤업(운영자 작업 불필요)

-<Callout type="warning">
-사용량 / 런타임 대시보드는 `rollup_task_usage_hourly()`가 채우는 파생 테이블 `task_usage_hourly`에서 데이터를 읽습니다. 번들된 `pgvector/pgvector:pg17` Postgres 이미지에는 **`pg_cron`이 포함되어 있지 않으며**, 백엔드도 롤업을 인프로세스로 실행하지 않습니다. `rollup_task_usage_hourly()`를 스케줄링하는 것이 없으면, 원시 `task_usage` 행은 계속 들어오는데 대시보드는 영원히 0에 머무릅니다.
+<Callout type="info">
+사용량 / 런타임 대시보드는 `rollup_task_usage_hourly()`가 채우는 파생 테이블 `task_usage_hourly`에서 데이터를 읽습니다. MUL-2957부터 백엔드는 DB 기반 스케줄러를 통해 인프로세스로 이 롤업을 실행하므로 더 이상 `pg_cron`이 필요하지 않으며, 외부 cron / systemd 타이머도 권장 설정이 아닙니다. 번들된 `pgvector/pgvector:pg17` 이미지가 변경 없이 동작합니다.
 </Callout>

-지원되는 옵션 중 하나를 고르세요 — 하나만 있으면 됩니다.
+인프로세스 스케줄러는 30초마다 틱하면서 `sys_cron_executions` 테이블을 통해 5분 단위 UTC 플랜을 클레임합니다. 백엔드 레플리카가 여러 개여도 안전합니다 — 고유 키 `(job_name, scope_kind, scope_id, plan_time)` 덕분에 각 플랜에서 단 하나만이 승자가 됩니다. 신규 배포에는 어떤 설정도 필요 없습니다.

-**옵션 A — 외부 cron / systemd-timer(가장 간단함).** 임의의 외부 스케줄러에서 5분마다 롤업을 실행합니다. 멱등하고 워터마크 기반이므로, 놓친 틱은 따라잡습니다.
-
-```bash
-# /etc/cron.d/multica-rollup — every 5 minutes
-*/5 * * * * root docker compose -f /path/to/multica/docker-compose.selfhost.yml \
-  exec -T postgres psql -U multica -d multica \
-  -c "SELECT rollup_task_usage_hourly();" >/dev/null
-```
-
-**옵션 B — Postgres를 `pg_cron`이 포함된 이미지로 교체.** `docker-compose.selfhost.yml`의 `pgvector/pgvector:pg17`을 `pgvector`와 `pg_cron`을 모두 갖춘 이미지(`supabase/postgres` 또는 커스텀 빌드)로 교체하고, `shared_preload_libraries=pg_cron`을 설정한 뒤 재시작하고, 작업을 한 번 등록합니다.
+**호환성 — 기존 `pg_cron` 등록.** 이전에 rollup을 `pg_cron` 잡으로 등록했었다면(`SELECT cron.schedule('rollup_task_usage_hourly', '*/5 * * * *', …)`) 굳이 제거할 필요는 없습니다 — SQL 함수가 내부적으로 advisory lock 4246을 잡기 때문에 앱 스케줄러와 `pg_cron`이 이중 쓰기를 할 수 없습니다. 중복 항목을 제거하려면:

 ```sql
-CREATE EXTENSION IF NOT EXISTS pg_cron;
-SELECT cron.schedule(
-  'rollup_task_usage_hourly',
-  '*/5 * * * *',
-  $$SELECT rollup_task_usage_hourly()$$
-);
+SELECT cron.unschedule('rollup_task_usage_hourly')
+  FROM cron.job WHERE jobname = 'rollup_task_usage_hourly';
 ```

-**옵션 C — 먼저 히스토리 백필(업그레이드 경로).** `v0.3.4 → v0.3.5+`로 업그레이드하는 중이고 기존 `task_usage` 행이 있다면, migration `103`이 hourly 테이블이 시드될 때까지 `refusing to drop legacy daily rollups: ...`와 함께 `migrate up`을 중단합니다. 번들된 백필을 한 번 실행한 다음, 옵션 A 또는 B를 설정하세요.
+**`v0.3.4 → v0.3.5+` 업그레이드.** 이전 릴리스에서는 migration 103을 적용하기 전에 운영자가 직접 `cmd/backfill_task_usage_hourly`를 실행해야 했고, 그러지 않으면 fail-closed 가드가 `migrate up`을 중단했습니다. MUL-2957부터 이 작업은 자동입니다 — migrate 명령이 migration 103을 적용하기 직전에(advisory lock 4246 보호 아래에서) 멱등한 월별 슬라이스 백필을 실행한 뒤 계속 진행합니다. 바쁜 DB에서는 여전히 `--sleep-between-slices=2s`로 읽기 부하를 조절하기 위해 스탠드얼론 backfill을 실행할 수 있지만 더 이상 필수는 아닙니다.

-```bash
-docker compose -f docker-compose.selfhost.yml exec backend \
-  ./backfill_task_usage_hourly --sleep-between-slices=2s
-```
-
-`--sleep-between-slices=2s`는 바쁜 DB에서 읽기 부하를 조절합니다. 완료된 후 백엔드 컨테이너를 재시작하면(시작 시 migration이 실행됨) 업그레이드가 완료됩니다.
-
-전체 레퍼런스 — Kubernetes `CronJob` 템플릿과 업그레이드 순서 포함 — 는 저장소의 [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup)에 있습니다.
+전체 레퍼런스(운영 노트와 Kubernetes 배포 형태 포함)는 저장소의 [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup)에 있습니다.

 ## Kubernetes 배포(대체 방안)

@@ -283,8 +263,8 @@ multica setup self-host \
 - **백엔드가 시작되지 않음**: `docker compose -f docker-compose.selfhost.yml logs backend`로 컨테이너 로그를 확인하세요. 보통 `.env`의 잘못된 `DATABASE_URL` 또는 `JWT_SECRET`이 원인입니다
 - **인증 코드를 받지 못함**: 이메일 백엔드가 구성되지 않은 경우(Resend도 SMTP도 없음) → `docker compose logs backend`에서 `[DEV] Verification code`를 찾으세요
 - **WebSocket이 연결되지 않음**: 공개 배포에서는 반드시 `FRONTEND_ORIGIN`을 실제 프런트엔드 도메인으로 설정해야 합니다. [문제 해결 → WebSocket이 연결되지 않음](/troubleshooting#websocket-wont-connect)을 참고하세요
- **사용량 / 런타임 대시보드가 0에 머무름**: `rollup_task_usage_hourly()`가 스케줄링되지 않고 있습니다 — 위의 [7단계](#7-schedule-the-usage-rollup-required-for-the-usage-dashboard)와 [문제 해결 → 사용량 대시보드가 0으로 표시됨](/troubleshooting#usage-dashboard-stays-at-zero)을 참고하세요
- **`migrate up`이 `refusing to drop legacy daily rollups`로 실패함**: `v0.3.4 → v0.3.5+` 업그레이드 경로 가드입니다. 먼저 `backfill_task_usage_hourly`를 실행하세요 — [7단계 → 옵션 C](#7-schedule-the-usage-rollup-required-for-the-usage-dashboard)를 참고하세요
+- **사용량 / 런타임 대시보드가 0에 머무름**: `rollup_task_usage_hourly()`가 스케줄링되지 않고 있습니다 — 위의 [7단계](#7-usage-rollup-no-operator-action-required)와 [문제 해결 → 사용량 대시보드가 0으로 표시됨](/troubleshooting#usage-dashboard-stays-at-zero)을 참고하세요
+- **`migrate up`이 `refusing to drop legacy daily rollups`로 실패함**: `v0.3.4 → v0.3.5+` 업그레이드 경로 가드입니다. MUL-2957부터 migrate 명령이 migration 103을 적용하기 전에 백필을 자동으로 실행합니다 — [7단계](#7-usage-rollup-no-operator-action-required)를 참고하세요

 ## 다음 단계

--- a/apps/docs/content/docs/self-host-quickstart.mdx
+++ b/apps/docs/content/docs/self-host-quickstart.mdx
@@ -193,44 +193,24 @@ After bringing the proxy up, set `FRONTEND_ORIGIN=https://multica.example.com` i

 Same flow as Cloud — see [Cloud quickstart → Steps 5-6](/cloud-quickstart#5-create-an-agent).

-## 7. Schedule the usage rollup (required for the Usage dashboard)
+## 7. Usage rollup (no operator action required)

-<Callout type="warning">
-The Usage / Runtime dashboards read from a derived `task_usage_hourly` table populated by `rollup_task_usage_hourly()`. The bundled `pgvector/pgvector:pg17` Postgres image **does not include `pg_cron`**, and the backend does not run the rollup in-process either. If nothing schedules `rollup_task_usage_hourly()`, raw `task_usage` rows keep arriving while the dashboard stays at zero forever.
+<Callout type="info">
+The Usage / Runtime dashboards read from a derived `task_usage_hourly` table populated by `rollup_task_usage_hourly()`. As of MUL-2957 the backend runs this rollup in-process via the DB-backed scheduler — `pg_cron` is no longer required, and external cron / systemd timers are no longer the recommended setup. The bundled `pgvector/pgvector:pg17` image works without changes.
 </Callout>

-Pick one of the supported options — only one is needed.
+The in-process scheduler ticks every 30 seconds and claims a 5-minute UTC plan via the `sys_cron_executions` table. Multiple backend replicas are safe — the unique key `(job_name, scope_kind, scope_id, plan_time)` means only one wins each plan. No setup is needed for new deployments.

-**Option A — External cron / systemd-timer (simplest).** Run the rollup every 5 minutes from any out-of-band scheduler. It's idempotent and watermark-driven, so missed ticks catch up:
-
-```bash
-# /etc/cron.d/multica-rollup — every 5 minutes
-*/5 * * * * root docker compose -f /path/to/multica/docker-compose.selfhost.yml \
-  exec -T postgres psql -U multica -d multica \
-  -c "SELECT rollup_task_usage_hourly();" >/dev/null
-```
-
-**Option B — Swap Postgres for an image that ships `pg_cron`.** Replace `pgvector/pgvector:pg17` in `docker-compose.selfhost.yml` with an image that has both `pgvector` and `pg_cron` (`supabase/postgres`, or a custom build), set `shared_preload_libraries=pg_cron`, restart, then register the job once:
+**Compatibility — existing `pg_cron` registrations.** If you previously registered the rollup as a `pg_cron` job (`SELECT cron.schedule('rollup_task_usage_hourly', '*/5 * * * *', …)`), you do not need to remove it — the SQL function holds advisory lock 4246 internally, so the app scheduler and `pg_cron` cannot double-write. To drop the redundant entry:

 ```sql
-CREATE EXTENSION IF NOT EXISTS pg_cron;
-SELECT cron.schedule(
-  'rollup_task_usage_hourly',
-  '*/5 * * * *',
-  $$SELECT rollup_task_usage_hourly()$$
-);
+SELECT cron.unschedule('rollup_task_usage_hourly')
+  FROM cron.job WHERE jobname = 'rollup_task_usage_hourly';
 ```

-**Option C — Backfill history first (upgrade path).** If you're upgrading from `v0.3.4 → v0.3.5+` and have existing `task_usage` rows, migration `103` will abort `migrate up` with `refusing to drop legacy daily rollups: ...` until the hourly table is seeded. Run the bundled backfill once, then set up Option A or B:
+**Upgrade from `v0.3.4 → v0.3.5+`.** The previous release asked operators to run `cmd/backfill_task_usage_hourly` manually before applying migration 103, otherwise the migration's fail-closed guard would abort `migrate up`. As of MUL-2957 this is automatic: the migrate command runs an idempotent monthly-slice backfill (under advisory lock 4246) immediately before applying migration 103, then continues. You may still run the standalone backfill on a busy DB to throttle read pressure with `--sleep-between-slices=2s`, but it is no longer required.

-```bash
-docker compose -f docker-compose.selfhost.yml exec backend \
-  ./backfill_task_usage_hourly --sleep-between-slices=2s
-```
-
-`--sleep-between-slices=2s` throttles read pressure on a busy DB. After it finishes, restart the backend container (migrations run on startup) and the upgrade completes.
-
-Full reference — including the Kubernetes `CronJob` template and the upgrade order — lives in the repo's [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup).
+Full reference — including operations notes and the Kubernetes deployment shape — lives in the repo's [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup).

 ## Kubernetes deployment (alternative)

@@ -284,8 +264,8 @@ The full reference — three login modes, the `backend` ExternalName workaround
 - **Backend won't start**: check container logs with `docker compose -f docker-compose.selfhost.yml logs backend`; usually it's a bad `DATABASE_URL` or `JWT_SECRET` in `.env`
 - **Verification code not received**: no email backend is configured (neither Resend nor SMTP) → look for `[DEV] Verification code` in `docker compose logs backend`
 - **WebSocket won't connect**: for public deployments you must set `FRONTEND_ORIGIN` to your real frontend domain; see [Troubleshooting → WebSocket won't connect](/troubleshooting#websocket-wont-connect)
- **Usage / Runtime dashboard stays at zero**: `rollup_task_usage_hourly()` isn't being scheduled — see [Step 7](#7-schedule-the-usage-rollup-required-for-the-usage-dashboard) above and [Troubleshooting → Usage dashboard shows zero](/troubleshooting#usage-dashboard-stays-at-zero)
- **`migrate up` fails with `refusing to drop legacy daily rollups`**: upgrade-path guard from `v0.3.4 → v0.3.5+`. Run `backfill_task_usage_hourly` first — see [Step 7 → Option C](#7-schedule-the-usage-rollup-required-for-the-usage-dashboard)
+- **Usage / Runtime dashboard stays at zero**: `rollup_task_usage_hourly()` isn't being scheduled — see [Step 7](#7-usage-rollup-no-operator-action-required) above and [Troubleshooting → Usage dashboard shows zero](/troubleshooting#usage-dashboard-stays-at-zero)
+- **`migrate up` fails with `refusing to drop legacy daily rollups`**: upgrade-path guard from `v0.3.4 → v0.3.5+`. As of MUL-2957 the migrate command runs the backfill automatically before applying migration 103 — see [Step 7](#7-usage-rollup-no-operator-action-required)

 ## Next steps

--- a/apps/docs/content/docs/self-host-quickstart.zh.mdx
+++ b/apps/docs/content/docs/self-host-quickstart.zh.mdx
@@ -192,44 +192,24 @@ multica.example.com {

 流程和 Cloud 一样——见 [Cloud 快速上手 → 5-6 步](/cloud-quickstart#5-创建智能体)。

-## 7. 调度用量汇总任务（Usage Dashboard 必需）
+## 7. 用量汇总（无需运维操作）

-<Callout type="warning">
-Usage / Runtime 看板读的是派生表 `task_usage_hourly`，需要 `rollup_task_usage_hourly()` 周期性运行才能填充。**默认的 `pgvector/pgvector:pg17` 镜像不带 `pg_cron`**，后端进程内部也不会跑这个 rollup——什么都没调度的话，原始 `task_usage` 行会继续写入，但 dashboard 会一直停在 0，不会报错。
+<Callout type="info">
+Usage / Runtime 看板读的是派生表 `task_usage_hourly`，由 `rollup_task_usage_hourly()` 周期性填充。从 MUL-2957 起，后端通过 DB 后端的调度器在进程内运行该 rollup —— 不再需要 `pg_cron`，外部 cron / systemd timer 也不再是推荐方案。默认的 `pgvector/pgvector:pg17` 镜像无需改动即可工作。
 </Callout>

-三种支持路径，三选一即可。
+进程内调度器每 30 秒 tick 一次，通过 `sys_cron_executions` 表认领 5 分钟一档的 UTC plan。多 backend 副本同时跑也安全 —— 唯一键 `(job_name, scope_kind, scope_id, plan_time)` 保证每个 plan 只有一个赢家。新部署不需要任何额外配置。

-**Option A —— 外部 cron / systemd-timer（最简单）。** 在任意外部调度器上每 5 分钟跑一次 rollup。函数是幂等的、按 watermark 推进，丢一两个 tick 下次能补上：
-
-```bash
-# /etc/cron.d/multica-rollup —— 每 5 分钟跑一次
-*/5 * * * * root docker compose -f /path/to/multica/docker-compose.selfhost.yml \
-  exec -T postgres psql -U multica -d multica \
-  -c "SELECT rollup_task_usage_hourly();" >/dev/null
-```
-
-**Option B —— 换成自带 `pg_cron` 的 Postgres 镜像。** 把 `docker-compose.selfhost.yml` 里的 `pgvector/pgvector:pg17` 换成同时带 `pgvector` 和 `pg_cron` 的镜像（比如 `supabase/postgres`，或自己 build 一份），把 `shared_preload_libraries=pg_cron` 配上、重启 Postgres，然后注册一次任务：
+**兼容性 —— 已注册的 `pg_cron` 任务。** 如果你之前用 `pg_cron` 注册过 rollup（`SELECT cron.schedule('rollup_task_usage_hourly', '*/5 * * * *', …)`），不删也行 —— SQL 函数内部持有 advisory lock 4246，应用调度器和 `pg_cron` 不会并发双写。要清掉冗余项可以执行：

 ```sql
-CREATE EXTENSION IF NOT EXISTS pg_cron;
-SELECT cron.schedule(
-  'rollup_task_usage_hourly',
-  '*/5 * * * *',
-  $$SELECT rollup_task_usage_hourly()$$
-);
+SELECT cron.unschedule('rollup_task_usage_hourly')
+  FROM cron.job WHERE jobname = 'rollup_task_usage_hourly';
 ```

-**Option C —— 先回填历史（升级路径）。** 如果你是从 `v0.3.4` 升级到 `v0.3.5+` 且数据库里已经有 `task_usage` 行，migration `103` 会以 `refusing to drop legacy daily rollups: ...` 报错并中止 `migrate up`，直到 hourly 表被 seed 过。先跑一次内置的 backfill 命令，然后再配 Option A 或 Option B 让新数据持续流进来：
+**从 `v0.3.4 → v0.3.5+` 升级。** 上一版要求运维在应用 migration 103 之前手动跑 `cmd/backfill_task_usage_hourly`，否则 fail-closed 守卫会中止 `migrate up`。从 MUL-2957 起这一步是自动的：migrate 命令会在应用 migration 103 之前（advisory lock 4246 保护下）运行幂等的按月切片 backfill，然后继续。在繁忙的数据库上你仍可以用 `--sleep-between-slices=2s` 跑独立 backfill 来限制读压力，但已不是必需。

-```bash
-docker compose -f docker-compose.selfhost.yml exec backend \
-  ./backfill_task_usage_hourly --sleep-between-slices=2s
-```
-
-`--sleep-between-slices=2s` 用来在繁忙的数据库上限制读压力。回填跑完后重启后端容器（migration 在启动时自动跑），升级就能继续。
-
-完整参考（含 Kubernetes `CronJob` 模板和升级顺序）见仓库的 [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup)。
+完整参考（含运维注意事项和 Kubernetes 部署形态）见仓库的 [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup)。

 ## Kubernetes 部署（替代方案）

@@ -283,8 +263,8 @@ multica setup self-host \
 - **后端起不来**：看容器日志 `docker compose -f docker-compose.selfhost.yml logs backend`；常见是 `.env` 里 `DATABASE_URL` 或 `JWT_SECRET` 有问题
 - **验证码收不到**：没配任何邮件后端（Resend 和 SMTP 都没设） → 从 `docker compose logs backend` 里找 `[DEV] Verification code`
 - **WebSocket 连不上**：公网部署必须设 `FRONTEND_ORIGIN` 成你真实的前端域名；见 [故障排查 → WebSocket 连不上](/troubleshooting#websocket-连不上)
- **Usage / Runtime 看板一直是 0**：没人调度 `rollup_task_usage_hourly()` —— 见上面的 [第 7 步](#7-调度用量汇总任务usage-dashboard-必需) 和 [故障排查 → Usage 看板一直是 0](/troubleshooting#usage-看板一直是-0)
- **`migrate up` 报 `refusing to drop legacy daily rollups`**：`v0.3.4 → v0.3.5+` 升级路径的 fail-closed guard。先跑 `backfill_task_usage_hourly` —— 见 [第 7 步 → Option C](#7-调度用量汇总任务usage-dashboard-必需)
+- **Usage / Runtime 看板一直是 0**：没人调度 `rollup_task_usage_hourly()` —— 见上面的 [第 7 步](#7-用量汇总无需运维操作) 和 [故障排查 → Usage 看板一直是 0](/troubleshooting#usage-看板一直是-0)
+- **`migrate up` 报 `refusing to drop legacy daily rollups`**：`v0.3.4 → v0.3.5+` 升级路径的 fail-closed guard。从 MUL-2957 起 migrate 命令在应用 migration 103 之前会自动跑 backfill —— 见 [第 7 步](#7-用量汇总无需运维操作)

 ## 下一步

--- a/apps/docs/content/docs/troubleshooting.ja.mdx
+++ b/apps/docs/content/docs/troubleshooting.ja.mdx
@@ -180,9 +180,9 @@ docker exec <container> env | grep -E 'APP_ENV|MULTICA_DEV_VERIFICATION_CODE'

 **考えられる原因**:

-1. **`rollup_task_usage_hourly()` が一切スケジュールされていない** — 使用量 / ランタイムのダッシュボードは派生テーブル `task_usage_hourly` から読み取り、このテーブルはその関数によって埋められます。同梱の `pgvector/pgvector:pg17` イメージには `pg_cron` が含まれておらず、バックエンドもプロセス内で rollup を実行しません。外部スケジューラのない新規セルフホストインストールでは、これがデフォルトの状態です。
-2. **`pg_cron` はインストールされているが誤ったデータベースを指している** — `pg_cron.database_name` のデフォルト値は `postgres` です。Multica のデータベース名が異なる場合、スケジュールされたジョブは `rollup_task_usage_hourly()` を一切見つけられません。
-3. **スケジューラは動作しているが rollup が静かにエラーを出している** — 例えば cron エントリ内部の DB ロール / search_path が誤っている。
+1. **`rollup_task_usage_hourly()` がクレームされていない** — 使用量 / ランタイムのダッシュボードは派生テーブル `task_usage_hourly` から読み取り、このテーブルはその関数によって埋められます。MUL-2957 以降、バックエンドは DB バックドのスケジューラー（`sys_cron_executions`）を介してこの rollup をインプロセスで実行します。古いビルド、未適用の migration `113`、またはレプリカが残っていない長時間のバックエンド停止があると、最近の SUCCESS 行のないテーブルが残ることがあります。
+2. **`pg_cron` は互換性のために構成されているが誤ったデータベースを指している** — `pg_cron.database_name` のデフォルトは `postgres` です。Multica データベース名が異なる場合、スケジュールされたジョブは `rollup_task_usage_hourly()` を一切見つけられません。インプロセススケジューラーはこれに依存しませんが、もしインプロセススケジューラーを除去して `pg_cron` に依存している場合、DB 名は一致しなければなりません。
+3. **ハンドラーがクレームされているが静かにエラーを出している** — マイグレーションが部分的にしか適用されていないために SQL 関数が欠落している、あるいは DB ロール / search_path が誤って構成されている、など。`sys_cron_executions` の FAILED 監査行を確認してください。

 **診断方法**:

@@ -191,24 +191,30 @@ docker exec <container> env | grep -E 'APP_ENV|MULTICA_DEV_VERIFICATION_CODE'
 SELECT count(*) AS raw_rows FROM task_usage;
 SELECT count(*) AS hourly_rows FROM task_usage_hourly;

-- Confirm pg_cron is (or isn't) available.
-SELECT * FROM pg_available_extensions WHERE name = 'pg_cron';
-SHOW shared_preload_libraries;
-
-- If pg_cron is installed, check the schedule + last run.
-SELECT jobname, schedule, database, active FROM cron.job;
-SELECT jobname, status, return_message, start_time, end_time
-  FROM cron.job_run_details ORDER BY start_time DESC LIMIT 10;
+-- Inspect the in-process scheduler's audit log.
+SELECT plan_time, status, attempt, runner_id,
+       error_code, error_msg, started_at, finished_at
+  FROM sys_cron_executions
+ WHERE job_name = 'rollup_task_usage_hourly'
+ ORDER BY plan_time DESC
+ LIMIT 20;

 -- Watermark — if this is 1970-01-01, the rollup has never run.
 SELECT watermark_at FROM task_usage_hourly_rollup_state;
+
+-- Compatibility path: if you previously registered pg_cron, confirm
+-- it is (or isn't) available and pointing at the right database.
+SELECT * FROM pg_available_extensions WHERE name = 'pg_cron';
+SHOW shared_preload_libraries;
+SELECT jobname, schedule, database, active FROM cron.job;
 ```

 **解決方法**:

- rollup を手動で一度呼び出して動作するか確認してください: `SELECT rollup_task_usage_hourly();` — ダッシュボードを再読み込みしてください。数値が表示されれば、欠けているのはスケジューラだけです。
- [セルフホストクイックスタート → 使用量 rollup のスケジューリング](/self-host-quickstart#7-schedule-the-usage-rollup-required-for-the-usage-dashboard)からサポートされる方式のいずれかを選んでください: 外部 cron / systemd-timer / Kubernetes CronJob、または Postgres を `pg_cron` を含むイメージに置き換える。
- スケジュール設定より前の履歴がすでにある場合は、バックエンドコンテナ内部で `backfill_task_usage_hourly` を実行し、ウォーターマーク以前のバケットを埋めてください。
+- 少なくとも 1 つのバックエンドレプリカでスケジューラーが実際に動作していることを確認してください — 30 秒ごとに `sys_cron_executions` の `rollup_task_usage_hourly` に SUCCESS 行が追加されているはずです。
+- SQL パスを検証するため、rollup を手動で一度呼び出してください: `SELECT rollup_task_usage_hourly();` — ダッシュボードを再読み込みしてください。数値が表示されれば SQL 関数は問題なく、スケジューラーのクレーム経路に問題があります。
+- migration `113_sys_cron_executions` がまだ適用されていない場合は、バックエンドを再起動してマイグレーションを実行するか、手動で `migrate up` を呼び出してください。
+- インプロセススケジューラー以前のレガシー `pg_cron` 履歴がある場合でも、SQL 関数は内部で advisory lock 4246 を保持するため二重書き込みは発生しません — オプションの `cron.unschedule` クリーンアップについては [セルフホストクイックスタート → 使用量ロールアップ](/self-host-quickstart#7-usage-rollup-no-operator-action-required) を参照してください。

 ## マイグレーション `103` が `refusing to drop legacy daily rollups` で失敗する

@@ -224,9 +230,11 @@ ERROR: refusing to drop legacy daily rollups:

 **考えられる原因**: これはマイグレーション `103` の fail-closed ガードです。`task_usage_hourly` が生の `task_usage` に追いつくまで、レガシーの daily rollup の削除を拒否します。既存の行が存在し、rollup のウォーターマークがまだ epoch に留まっているとき — つまり、まだどの履歴も hourly テーブルに rollup されていないとき — にこのガードが発動します。

+MUL-2957 以降、migrate コマンドは migration `103` を適用する直前に冪等な月別スライス backfill（advisory lock 4246 の下）を自動で実行するため、v0.3.4 → v0.3.5+ への直接アップグレードは単一の `migrate up` 呼び出しで完了します。それでもこのエラーが表示される場合は、MUL-2957 以前のバイナリを使用しているか、フック自体が失敗しています — 直前の `task_usage hourly rollup hook` 行で migrate ログを確認してください。
+
 **解決方法**:

-1. 同じデータベースに対して backfill を実行してください（冪等であり、中断しても安全で、再実行しても安全です）:
+1. MUL-2957 以前のバイナリを使用しており、まずバイナリをアップグレードできない場合は、同じデータベースに対してスタンドアロンの backfill を実行してください（冪等であり、中断しても安全で、再実行しても安全です）:

   ```bash
   # Docker Compose
@@ -239,7 +247,7 @@ ERROR: refusing to drop legacy daily rollups:
   ```

 2. アップグレードを再実行してください — バックエンドコンテナを再起動するだけで十分で、マイグレーションは起動時に実行されます。これでガードが最新のウォーターマークを確認し、`103` の適用を許可します。
-3. ウォーターマークが進み続けるように、継続的な rollup スケジュール（cron / `pg_cron`）を設定してください — [セルフホストクイックスタート → 使用量 rollup のスケジューリング](/self-host-quickstart#7-schedule-the-usage-rollup-required-for-the-usage-dashboard)を参照してください。
+3. インプロセススケジューラーがウォーターマークを進め続けます — [セルフホストクイックスタート → 使用量ロールアップ](/self-host-quickstart#7-usage-rollup-no-operator-action-required) を参照してください。

 `--sleep-between-slices=2s` は、数年分の履歴を持つプロダクションデータベースにとって控えめなデフォルト値です。直近 N か月のみを保持し、それより古いバケットを永久に放棄してもかまわない場合は `--months-back N --force-partial` を使用してください。

--- a/apps/docs/content/docs/troubleshooting.ko.mdx
+++ b/apps/docs/content/docs/troubleshooting.ko.mdx
@@ -180,9 +180,9 @@ docker exec <container> env | grep -E 'APP_ENV|MULTICA_DEV_VERIFICATION_CODE'

 **가능한 원인**:

-1. **`rollup_task_usage_hourly()`가 전혀 스케줄링되지 않음** — 사용량 / 런타임 대시보드는 파생 테이블 `task_usage_hourly`에서 읽으며, 이 테이블은 해당 함수로 채워집니다. 번들된 `pgvector/pgvector:pg17` 이미지에는 `pg_cron`이 포함되어 있지 않으며, 백엔드도 프로세스 내에서 rollup을 실행하지 않습니다. 외부 스케줄러 없이 새로 설치한 자체 호스팅에서는 이것이 기본 상태입니다.
-2. **`pg_cron`이 설치되었지만 잘못된 데이터베이스를 가리킴** — `pg_cron.database_name`의 기본값은 `postgres`입니다. Multica 데이터베이스 이름이 다르면 스케줄된 작업이 `rollup_task_usage_hourly()`를 전혀 보지 못합니다.
-3. **스케줄러는 실행되지만 rollup이 조용히 오류를 냄** — 예를 들어 cron 항목 내부의 DB 역할 / search_path가 잘못됨.
+1. **`rollup_task_usage_hourly()`가 클레임되지 않음** — 사용량 / 런타임 대시보드는 파생 테이블 `task_usage_hourly`에서 읽으며, 이 테이블은 해당 함수로 채워집니다. MUL-2957부터 백엔드는 DB 기반 스케줄러(`sys_cron_executions`)를 통해 인프로세스로 rollup을 실행합니다. 오래된 빌드, 적용되지 않은 migration `113`, 또는 레플리카가 남아있지 않은 장기간의 백엔드 중단이 있으면 최근 SUCCESS 행이 없는 테이블이 남을 수 있습니다.
+2. **`pg_cron`이 호환성 용도로 구성되었지만 잘못된 데이터베이스를 가리킴** — `pg_cron.database_name`의 기본값은 `postgres`입니다. Multica 데이터베이스 이름이 다르면 스케줄된 작업이 `rollup_task_usage_hourly()`를 전혀 보지 못합니다. 인프로세스 스케줄러는 이에 의존하지 않지만, 인프로세스 스케줄러를 제거하고 `pg_cron`에 의존한다면 DB 이름이 일치해야 합니다.
+3. **핸들러가 클레임되지만 조용히 오류를 냄** — 예: 마이그레이션이 일부만 적용되어 SQL 함수가 누락되었거나, DB 역할 / search_path가 잘못 구성됨. `sys_cron_executions`의 FAILED 감사 행을 확인하세요.

 **진단 방법**:

@@ -191,24 +191,30 @@ docker exec <container> env | grep -E 'APP_ENV|MULTICA_DEV_VERIFICATION_CODE'
 SELECT count(*) AS raw_rows FROM task_usage;
 SELECT count(*) AS hourly_rows FROM task_usage_hourly;

-- Confirm pg_cron is (or isn't) available.
-SELECT * FROM pg_available_extensions WHERE name = 'pg_cron';
-SHOW shared_preload_libraries;
-
-- If pg_cron is installed, check the schedule + last run.
-SELECT jobname, schedule, database, active FROM cron.job;
-SELECT jobname, status, return_message, start_time, end_time
-  FROM cron.job_run_details ORDER BY start_time DESC LIMIT 10;
+-- Inspect the in-process scheduler's audit log.
+SELECT plan_time, status, attempt, runner_id,
+       error_code, error_msg, started_at, finished_at
+  FROM sys_cron_executions
+ WHERE job_name = 'rollup_task_usage_hourly'
+ ORDER BY plan_time DESC
+ LIMIT 20;

 -- Watermark — if this is 1970-01-01, the rollup has never run.
 SELECT watermark_at FROM task_usage_hourly_rollup_state;
+
+-- Compatibility path: if you previously registered pg_cron, confirm
+-- it is (or isn't) available and pointing at the right database.
+SELECT * FROM pg_available_extensions WHERE name = 'pg_cron';
+SHOW shared_preload_libraries;
+SELECT jobname, schedule, database, active FROM cron.job;
 ```

 **해결 방법**:

- rollup을 수동으로 한 번 호출하여 동작하는지 확인하세요: `SELECT rollup_task_usage_hourly();` — 대시보드를 새로고침하세요. 숫자가 나타나면 빠진 것은 스케줄러뿐입니다.
- [자체 호스팅 빠른 시작 → 사용량 rollup 스케줄링](/self-host-quickstart#7-schedule-the-usage-rollup-required-for-the-usage-dashboard)에서 지원되는 방식 중 하나를 선택하세요: 외부 cron / systemd-timer / Kubernetes CronJob, 또는 Postgres를 `pg_cron`이 포함된 이미지로 교체.
- 스케줄 설정 이전의 이력이 이미 있다면, 백엔드 컨테이너 내부에서 `backfill_task_usage_hourly`를 실행하여 워터마크 이전의 버킷을 채우세요.
+- 적어도 하나의 백엔드 레플리카에서 스케줄러가 실제로 실행 중인지 확인하세요 — 30초마다 `sys_cron_executions`의 `rollup_task_usage_hourly`에 SUCCESS 행이 추가되어야 합니다.
+- SQL 경로를 검증하기 위해 rollup을 수동으로 한 번 호출하세요: `SELECT rollup_task_usage_hourly();` — 대시보드를 새로고침하세요. 숫자가 나타나면 SQL 함수는 정상이며, 문제는 스케줄러 클레임 경로에 있습니다.
+- migration `113_sys_cron_executions`가 아직 적용되지 않았다면 백엔드를 재시작해 마이그레이션을 실행하거나 수동으로 `migrate up`을 호출하세요.
+- 인프로세스 스케줄러 이전의 레거시 `pg_cron` 이력이 있어도 SQL 함수가 내부적으로 advisory lock 4246을 잡고 있어 두 경로가 이중 쓰기할 수 없습니다 — 선택적 `cron.unschedule` 정리는 [자체 호스팅 빠른 시작 → 사용량 롤업](/self-host-quickstart#7-usage-rollup-no-operator-action-required)을 참고하세요.

 ## 마이그레이션 `103`이 `refusing to drop legacy daily rollups`로 실패함

@@ -224,9 +230,11 @@ ERROR: refusing to drop legacy daily rollups:

 **가능한 원인**: 이것은 마이그레이션 `103`의 fail-closed 가드입니다. `task_usage_hourly`가 원시 `task_usage`를 따라잡을 때까지 레거시 daily rollup 삭제를 거부합니다. 기존 행이 존재하고 rollup 워터마크가 여전히 epoch에 머물러 있을 때 — 즉 아직 어떤 이력도 hourly 테이블로 rollup되지 않았을 때 — 이 가드가 발동합니다.

+MUL-2957부터 migrate 명령은 migration `103`을 적용하기 직전에 멱등한 월별 슬라이스 backfill(advisory lock 4246 보호)을 자동으로 실행하므로, v0.3.4 → v0.3.5+ 직접 업그레이드는 단일 `migrate up` 호출로 완료됩니다. 그래도 이 오류가 보인다면, MUL-2957 이전 바이너리를 사용 중이거나 훅 자체가 실패한 것입니다 — migrate 로그에서 `task_usage hourly rollup hook` 로그를 확인하세요.
+
 **해결 방법**:

-1. 같은 데이터베이스에 대해 backfill을 실행하세요(멱등하며, 중단해도 안전하고, 다시 실행해도 안전합니다):
+1. MUL-2957 이전 바이너리를 사용 중이고 바이너리를 먼저 업그레이드할 수 없다면, 같은 데이터베이스에 대해 스탠드얼론 backfill을 실행하세요(멱등하며, 중단해도 안전하고, 다시 실행해도 안전합니다):

   ```bash
   # Docker Compose
@@ -239,7 +247,7 @@ ERROR: refusing to drop legacy daily rollups:
   ```

 2. 업그레이드를 다시 실행하세요 — 백엔드 컨테이너를 재시작하는 것으로 충분하며, 마이그레이션은 시작 시 실행됩니다. 이제 가드가 최신 워터마크를 확인하고 `103`을 적용하도록 허용합니다.
-3. 워터마크가 계속 진행되도록 지속적인 rollup 스케줄(cron / `pg_cron`)을 설정하세요 — [자체 호스팅 빠른 시작 → 사용량 rollup 스케줄링](/self-host-quickstart#7-schedule-the-usage-rollup-required-for-the-usage-dashboard)을 참고하세요.
+3. 인프로세스 스케줄러가 이후 워터마크를 계속 진행시킵니다 — [자체 호스팅 빠른 시작 → 사용량 롤업](/self-host-quickstart#7-usage-rollup-no-operator-action-required)을 참고하세요.

 `--sleep-between-slices=2s`는 수년 치 이력이 있는 프로덕션 데이터베이스에서 적절한 기본값입니다. 최근 N개월만 보관하고 더 오래된 버킷을 영구히 포기해도 괜찮다면 `--months-back N --force-partial`을 사용하세요.

--- a/apps/docs/content/docs/troubleshooting.mdx
+++ b/apps/docs/content/docs/troubleshooting.mdx
@@ -180,9 +180,9 @@ Check your inbox (including spam) for the real verification code.

 **Likely causes**:

-1. **`rollup_task_usage_hourly()` is never scheduled** — the Usage / Runtime dashboards read from the derived `task_usage_hourly` table, which is populated by that function. The bundled `pgvector/pgvector:pg17` image does not include `pg_cron`, and the backend does not run the rollup in-process either. On a fresh self-host install with no external scheduler, this is the default state.
-2. **`pg_cron` is installed but pointing at the wrong database** — `pg_cron.database_name` defaults to `postgres`; if your Multica database has a different name, the scheduled job never sees `rollup_task_usage_hourly()`.
-3. **The scheduler is running but the rollup is silently erroring** — e.g. wrong DB role / search_path inside the cron entry.
+1. **`rollup_task_usage_hourly()` is never being claimed** — the Usage / Runtime dashboards read from the derived `task_usage_hourly` table, populated by that function. Since MUL-2957 the backend runs the rollup in-process via the DB-backed scheduler (`sys_cron_executions`); a stale build, a missing migration `113`, or a sustained backend outage with no replicas left running can leave the table without a recent SUCCESS row.
+2. **`pg_cron` is configured for compatibility but pointing at the wrong database** — `pg_cron.database_name` defaults to `postgres`; if your Multica database has a different name, the scheduled job never sees `rollup_task_usage_hourly()`. The in-process scheduler does not depend on this, but if you removed the in-process scheduler and rely on `pg_cron`, the DB name must match.
+3. **The handler is being claimed but silently erroring** — e.g. the SQL function is missing because migrations were partially applied, or DB role / search_path is misconfigured. Check the FAILED audit rows in `sys_cron_executions`.

 **How to diagnose**:

@@ -191,24 +191,30 @@ Check your inbox (including spam) for the real verification code.
 SELECT count(*) AS raw_rows FROM task_usage;
 SELECT count(*) AS hourly_rows FROM task_usage_hourly;

-- Confirm pg_cron is (or isn't) available.
-SELECT * FROM pg_available_extensions WHERE name = 'pg_cron';
-SHOW shared_preload_libraries;
-
-- If pg_cron is installed, check the schedule + last run.
-SELECT jobname, schedule, database, active FROM cron.job;
-SELECT jobname, status, return_message, start_time, end_time
-  FROM cron.job_run_details ORDER BY start_time DESC LIMIT 10;
+-- Inspect the in-process scheduler's audit log.
+SELECT plan_time, status, attempt, runner_id,
+       error_code, error_msg, started_at, finished_at
+  FROM sys_cron_executions
+ WHERE job_name = 'rollup_task_usage_hourly'
+ ORDER BY plan_time DESC
+ LIMIT 20;

 -- Watermark — if this is 1970-01-01, the rollup has never run.
 SELECT watermark_at FROM task_usage_hourly_rollup_state;
+
+-- Compatibility path: if you previously registered pg_cron, confirm
+-- it is (or isn't) available and pointing at the right database.
+SELECT * FROM pg_available_extensions WHERE name = 'pg_cron';
+SHOW shared_preload_libraries;
+SELECT jobname, schedule, database, active FROM cron.job;
 ```

 **How to fix**:

- Call the rollup once by hand to confirm it works: `SELECT rollup_task_usage_hourly();` — refresh the dashboard; if numbers appear, the only missing piece is a scheduler.
- Pick one of the supported paths from [Self-host quickstart → Schedule the usage rollup](/self-host-quickstart#7-schedule-the-usage-rollup-required-for-the-usage-dashboard): external cron / systemd-timer / Kubernetes CronJob, or swap Postgres for an image with `pg_cron`.
- If you already have history that pre-dates the schedule, run `backfill_task_usage_hourly` inside the backend container to seed buckets before the watermark.
+- Confirm the scheduler is actually running on at least one backend replica — every 30 seconds it should add a SUCCESS row to `sys_cron_executions` for `rollup_task_usage_hourly`.
+- Call the rollup once by hand to verify the SQL path: `SELECT rollup_task_usage_hourly();` — refresh the dashboard; if numbers appear, the SQL function is fine and the issue is on the scheduler claim path.
+- If migration `113_sys_cron_executions` has not applied yet, restart the backend so migrations run, or invoke `migrate up` manually.
+- If you have legacy `pg_cron` history that pre-dates the in-process scheduler, the SQL function still holds advisory lock 4246 internally and the two paths cannot double-write — see [Self-host quickstart → Usage rollup](/self-host-quickstart#7-usage-rollup-no-operator-action-required) for the optional `cron.unschedule` cleanup.

 ## Migration `103` fails with `refusing to drop legacy daily rollups`

@@ -224,9 +230,11 @@ ERROR: refusing to drop legacy daily rollups:

 **Likely cause**: this is migration `103`'s fail-closed guard. It refuses to drop the legacy daily rollups until `task_usage_hourly` has caught up with raw `task_usage`. The guard fires whenever existing rows are present and the rollup watermark still sits at the epoch — i.e. nothing has rolled history into the hourly table yet.

+Since MUL-2957 the migrate command runs an idempotent monthly-slice backfill (under advisory lock 4246) automatically immediately before applying migration `103`, so v0.3.4 → v0.3.5+ direct upgrades complete in a single `migrate up` invocation. If you are still seeing this error you are either on a pre-MUL-2957 binary or the hook itself failed — check the migrate logs for an earlier `task_usage hourly rollup hook` line.
+
 **How to fix**:

-1. Run the backfill against the same database (idempotent, safe to interrupt, safe to re-run):
+1. If you are on a pre-MUL-2957 binary and cannot upgrade the binary first, run the standalone backfill against the same database (idempotent, safe to interrupt, safe to re-run):

   ```bash
   # Docker Compose
@@ -239,7 +247,7 @@ ERROR: refusing to drop legacy daily rollups:
   ```

 2. Re-run the upgrade — restarting the backend container is enough, migrations run on startup. The guard now sees a current watermark and lets `103` apply.
-3. Set up an ongoing rollup schedule (cron / `pg_cron`) so the watermark keeps advancing — see [Self-host quickstart → Schedule the usage rollup](/self-host-quickstart#7-schedule-the-usage-rollup-required-for-the-usage-dashboard).
+3. The in-process scheduler then keeps the watermark advancing — see [Self-host quickstart → Usage rollup](/self-host-quickstart#7-usage-rollup-no-operator-action-required).

 `--sleep-between-slices=2s` is a polite default on production databases with years of history. Use `--months-back N --force-partial` if you only want to keep the last N months and are willing to permanently abandon older buckets.

--- a/apps/docs/content/docs/troubleshooting.zh.mdx
+++ b/apps/docs/content/docs/troubleshooting.zh.mdx
@@ -180,9 +180,9 @@ docker exec <container> env | grep -E 'APP_ENV|MULTICA_DEV_VERIFICATION_CODE'

 **可能原因**：

-1. **`rollup_task_usage_hourly()` 没人调度** —— Usage / Runtime 看板读的是派生表 `task_usage_hourly`，这张表必须靠 `rollup_task_usage_hourly()` 周期性填充。默认的 `pgvector/pgvector:pg17` 镜像不带 `pg_cron`，后端进程内部也不会跑 rollup。如果你是新装的自部署、没配过外部调度器，默认就是这种状态。
-2. **`pg_cron` 装了但指向了错的库** —— `pg_cron.database_name` 默认是 `postgres`；如果你的 Multica 数据库名不是 `postgres`，调度任务根本看不到 `rollup_task_usage_hourly()`。
-3. **调度跑了，但 rollup 静默报错** —— 比如 cron entry 里 DB role / search_path 不对。
+1. **`rollup_task_usage_hourly()` 没被认领** —— Usage / Runtime 看板读的是派生表 `task_usage_hourly`，这张表必须靠 `rollup_task_usage_hourly()` 周期性填充。从 MUL-2957 起后端通过 DB 后端调度器（`sys_cron_executions`）在进程内跑 rollup；旧版本 binary、未应用 migration `113`、或者所有副本长时间下线，都可能让这张表里没有最近的 SUCCESS 行。
+2. **`pg_cron` 作为兼容路径配着、但指向了错的库** —— `pg_cron.database_name` 默认是 `postgres`；如果你的 Multica 数据库名不是 `postgres`，调度任务根本看不到 `rollup_task_usage_hourly()`。进程内调度器不依赖这一项，但如果你刻意拿掉了进程内调度而靠 `pg_cron`，DB 名就必须对得上。
+3. **handler 被认领了、但静默报错** —— 比如 migration 没全部应用导致 SQL 函数缺失、或 DB role / search_path 配错了。看 `sys_cron_executions` 里的 FAILED 审计行。

 **怎么查**：

@@ -191,24 +191,29 @@ docker exec <container> env | grep -E 'APP_ENV|MULTICA_DEV_VERIFICATION_CODE'
 SELECT count(*) AS raw_rows FROM task_usage;
 SELECT count(*) AS hourly_rows FROM task_usage_hourly;

-- 看 pg_cron 装没装、有没有加载
-SELECT * FROM pg_available_extensions WHERE name = 'pg_cron';
-SHOW shared_preload_libraries;
-
-- 如果 pg_cron 装了，看调度和最近一次运行
-SELECT jobname, schedule, database, active FROM cron.job;
-SELECT jobname, status, return_message, start_time, end_time
-  FROM cron.job_run_details ORDER BY start_time DESC LIMIT 10;
+-- 看进程内调度器的审计日志
+SELECT plan_time, status, attempt, runner_id,
+       error_code, error_msg, started_at, finished_at
+  FROM sys_cron_executions
+ WHERE job_name = 'rollup_task_usage_hourly'
+ ORDER BY plan_time DESC
+ LIMIT 20;

 -- watermark —— 如果还是 1970-01-01，说明 rollup 从来没跑过
 SELECT watermark_at FROM task_usage_hourly_rollup_state;
+
+-- 兼容路径：以前注册过 pg_cron，确认装没装、指对了库没
+SELECT * FROM pg_available_extensions WHERE name = 'pg_cron';
+SHOW shared_preload_libraries;
+SELECT jobname, schedule, database, active FROM cron.job;
 ```

 **怎么修**：

- 手动跑一次确认函数本身没问题：`SELECT rollup_task_usage_hourly();` —— 刷新看板；如果数字出来了，缺的就只是调度器。
- 从 [Self-host 快速上手 → 调度用量汇总任务](/self-host-quickstart#7-调度用量汇总任务usage-dashboard-必需) 里挑一种调度方式：外部 cron / systemd-timer / Kubernetes CronJob，或者换成带 `pg_cron` 的 Postgres 镜像。
- 如果调度配好之前数据库已经有一段历史，先在后端容器里跑 `backfill_task_usage_hourly` 把 watermark 之前的桶补出来。
+- 确认至少一个后端副本里调度器真的在跑 —— 每 30 秒应该往 `sys_cron_executions` 的 `rollup_task_usage_hourly` 加一条 SUCCESS 行。
+- 手动跑一次 SQL 验证函数本身没问题：`SELECT rollup_task_usage_hourly();` —— 刷新看板；如果数字出来了，SQL 这层 OK，问题在调度器认领路径上。
+- 如果 migration `113_sys_cron_executions` 还没应用，重启后端让 migration 跑一遍，或手动 `migrate up`。
+- 历史里有遗留的 `pg_cron` 入口也没事 —— SQL 函数里还持有 advisory lock 4246，应用调度器和 `pg_cron` 不会双写；要清掉冗余项见 [Self-host 快速上手 → 用量汇总](/self-host-quickstart#7-用量汇总无需运维操作) 里的 `cron.unschedule`。

 ## migration `103` 报 `refusing to drop legacy daily rollups`

@@ -224,9 +229,11 @@ ERROR: refusing to drop legacy daily rollups:

 **可能原因**：这是 migration `103` 的 fail-closed guard。它要求 `task_usage_hourly` 已经追平了原始的 `task_usage` 之后，才允许丢掉旧的 daily rollup。只要数据库里有历史数据、且 rollup watermark 还停在 epoch（说明还没把历史回填进 hourly 表），这条 guard 就会拦住。

+从 MUL-2957 起，migrate 命令在应用 migration `103` 之前会自动跑一次幂等的按月切片 backfill（advisory lock 4246 保护），所以 v0.3.4 → v0.3.5+ 直升一次 `migrate up` 就能搞定。如果你还看到这个错，要么用的是 MUL-2957 之前的二进制，要么 hook 自己也失败了 —— 看 migrate 日志里更早一行的 `task_usage hourly rollup hook` 看具体原因。
+
 **怎么修**：

-1. 对同一个数据库跑一次 backfill（幂等，可以打断，可以重试）：
+1. 如果你跑的是 MUL-2957 之前的 binary，又没办法先升级 binary，就对同一个数据库手动跑一次独立 backfill（幂等，可以打断，可以重试）：

   ```bash
   # Docker Compose
@@ -239,7 +246,7 @@ ERROR: refusing to drop legacy daily rollups:
   ```

 2. 重新跑升级 —— 重启 backend 容器即可，启动时会自动跑 migration。Guard 看到新的 watermark，`103` 就会通过。
-3. 同时配上持续的 rollup 调度，保证 watermark 持续推进 —— 见 [Self-host 快速上手 → 调度用量汇总任务](/self-host-quickstart#7-调度用量汇总任务usage-dashboard-必需)。
+3. 之后由进程内调度器持续推 watermark —— 见 [Self-host 快速上手 → 用量汇总](/self-host-quickstart#7-用量汇总无需运维操作)。

 `--sleep-between-slices=2s` 在有多年历史的生产库上是个比较克制的默认值。如果你只想保留最近 N 个月、可以接受永久丢掉更老的桶，用 `--months-back N --force-partial`。

--- a/server/cmd/migrate/main.go
+++ b/server/cmd/migrate/main.go
@@ -9,8 +9,55 @@ import (
 	"github.com/jackc/pgx/v5/pgxpool"
 	"github.com/multica-ai/multica/server/internal/logger"
 	"github.com/multica-ai/multica/server/internal/migrations"
+	"github.com/multica-ai/multica/server/internal/taskusagebackfill"
 )

+// preMigrationHook runs work that must happen before a specific
+// migration is applied during `migrate up`. Hooks are idempotent and
+// must not depend on the migration loop's session-pinned advisory lock
+// — they run on the pool, not on the loop's pinned conn, so they can
+// safely acquire other session-level locks (e.g. advisory lock 4246
+// for the task_usage hourly rollup).
+//
+// Returning an error aborts the migration run. The corresponding
+// migration is NOT recorded in schema_migrations, so the next run will
+// retry the hook + migration.
+type preMigrationHook func(ctx context.Context, pool *pgxpool.Pool) error
+
+// preMigrationHooks wires migration version → hook. The version key is
+// the file basename without the `.up.sql` suffix, matching what
+// `migrations.ExtractVersion` returns.
+//
+// MUL-2957: the v0.3.4 → current direct-upgrade path needs the hourly
+// rollup seeded BEFORE migration 103 evaluates its fail-closed lag
+// guard, because at `cmd/migrate up` time the server has not yet
+// started so neither the legacy pg_cron job nor the new app scheduler
+// can advance the watermark. The hook runs the same idempotent
+// monthly-slice backfill that
+// `cmd/backfill_task_usage_hourly` exposes to operators.
+var preMigrationHooks = map[string]preMigrationHook{
+	"103_drop_legacy_daily_rollups": runTaskUsageHourlyHook,
+}
+
+func runTaskUsageHourlyHook(ctx context.Context, pool *pgxpool.Pool) error {
+	res, err := taskusagebackfill.Hook(ctx, pool, taskusagebackfill.HookOptions{})
+	if err != nil {
+		return fmt.Errorf("task_usage_hourly pre-103 hook: %w", err)
+	}
+	if res.Skipped != "" {
+		slog.Info("task_usage hourly rollup hook: skipped",
+			"reason", res.Skipped,
+			"watermark_stamped", res.WatermarkStamped)
+		return nil
+	}
+	slog.Info("task_usage hourly rollup hook: backfill complete",
+		"slices", res.SlicesProcessed,
+		"rows_touched", res.RowsTouched,
+		"from", res.From.Format("2006-01-02T15:04:05Z07:00"),
+		"to", res.To.Format("2006-01-02T15:04:05Z07:00"))
+	return nil
+}
+
 // migrationAdvisoryLockKey is the int64 identifier used with Postgres
 // pg_advisory_lock to serialize the migration loop across concurrent
 // runners (multi-replica backend Deployment, scale-up, or a manual
@@ -138,6 +185,22 @@ func main() {
 			os.Exit(1)
 		}

+		// Run any pre-migration hook before the SQL file. The hook
+		// uses the pool, not the conn pinned for migrationAdvisoryLockKey,
+		// so it can acquire other session-level locks without
+		// colliding with the migrate loop's lock. Hook failures abort
+		// the run before schema_migrations is updated, so the same
+		// version can be retried cleanly on the next invocation.
+		if direction == "up" {
+			if hook, ok := preMigrationHooks[version]; ok {
+				slog.Info("running pre-migration hook", "version", version)
+				if err := hook(ctx, pool); err != nil {
+					slog.Error("pre-migration hook failed", "version", version, "error", err)
+					os.Exit(1)
+				}
+			}
+		}
+
 		_, err = conn.Exec(ctx, string(sql))
 		if err != nil {
 			slog.Error("failed to run migration", "file", file, "error", err)
--- a/server/cmd/server/main.go
+++ b/server/cmd/server/main.go
@@ -19,6 +19,7 @@ import (
 	"github.com/multica-ai/multica/server/internal/logger"
 	obsmetrics "github.com/multica-ai/multica/server/internal/metrics"
 	"github.com/multica-ai/multica/server/internal/realtime"
+	"github.com/multica-ai/multica/server/internal/scheduler"
 	"github.com/multica-ai/multica/server/internal/service"
 	db "github.com/multica-ai/multica/server/pkg/db/generated"
 	"github.com/redis/go-redis/v9"
@@ -360,6 +361,29 @@ func main() {
 		go h.LarkHub.Run(sweepCtx)
 	}

+	// MUL-2957: DB-backed execution scheduler. The scheduler turns the
+	// `sys_cron_executions` table into the distributed lease + audit
+	// log for internal periodic jobs. The first job is
+	// `rollup_task_usage_hourly`, which replaces the previously
+	// operator-registered `pg_cron` entry (still safe to run
+	// concurrently — the SQL function holds advisory lock 4246).
+	//
+	// A failure to register the job is treated as fatal here only at
+	// the registration step (a duplicate name is the only realistic
+	// cause and indicates a code bug). Once running, the manager
+	// surfaces transient errors — DB unreachable, sys_cron_executions
+	// missing because of an unusual partial-migration state — by
+	// logging them on the tick that fails and retrying on the next
+	// cycle, so a temporary outage does not crash the server.
+	schedulerMgr := scheduler.NewManager(pool, scheduler.Options{})
+	if err := schedulerMgr.Register(scheduler.TaskUsageHourlyJob(pool)); err != nil {
+		slog.Warn("scheduler: failed to register task_usage_hourly rollup job", "error", err)
+	} else {
+		go func() {
+			_ = schedulerMgr.Run(sweepCtx)
+		}()
+	}
+
 	if metricsServer != nil {
 		go func() {
 			slog.Info("metrics server starting", "addr", metricsConfig.Addr)
--- a/server/internal/scheduler/concurrent_claim_test.go
+++ b/server/internal/scheduler/concurrent_claim_test.go
@@ -0,0 +1,120 @@
+package scheduler
+
+import (
+	"context"
+	"sync"
+	"testing"
+)
+
+// TestConcurrentClaimsSingleWinner covers RFC §14:
+//
+//	"pg_cron 并跑 | app scheduler 与 pg_cron 同时调用函数,
+//	 无重复窗口写入"
+//
+// The legacy `pg_cron` tick and the in-process scheduler both call
+// `rollup_task_usage_hourly()`, so the SQL function's advisory lock
+// 4246 prevents double-writes of the rollup itself. The scheduler adds
+// a second layer via `sys_cron_executions`: even if multiple ticks (a
+// scheduler in a second replica, a manual SQL call, a leftover
+// `pg_cron` job) arrive at the same plan_time, only one row exists per
+// (job, scope, plan_time) and only one runner gets Won=true. The rest
+// fall through the conflict path and no-op.
+//
+// We simulate this by firing N concurrent claims at the same plan_time
+// from distinct runner ids and asserting the table contract:
+//
+//   - Exactly ONE caller observes Won=true.
+//   - Every other caller observes Conflicted=true (no Won, no Stole).
+//   - sys_cron_executions has exactly one row for the plan.
+//   - The row's runner_id matches the winner.
+//
+// This is the same single-winner property the SQL advisory lock 4246
+// gives at the function-execution layer; the `sys_cron_executions`
+// uniqueness key gives it at the scheduler layer, so a `pg_cron` tick
+// running alongside the in-process scheduler cannot produce a duplicate
+// SUCCESS audit row.
+func TestConcurrentClaimsSingleWinner(t *testing.T) {
+	pool := integrationPool(t)
+	job := newTestJobSpec(uniqueJobName(t, "concurrent_claim"))
+	t.Cleanup(func() { cleanupExecutions(t, pool, job.Name) })
+
+	ctx := context.Background()
+	now, err := dbNow(ctx, pool)
+	if err != nil {
+		t.Fatalf("dbNow: %v", err)
+	}
+	planTime := FloorPlan(now, job.Cadence)
+
+	const contenders = 8
+	type result struct {
+		runnerID string
+		c        claim
+		err      error
+	}
+	results := make([]result, contenders)
+	var wg sync.WaitGroup
+	start := make(chan struct{})
+	for i := range contenders {
+		i := i
+		runnerID := "runner-" + string(rune('A'+i))
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			<-start
+			c, err := tryClaim(ctx, pool, job, ScopeGlobal, planTime, now, runnerID)
+			results[i] = result{runnerID: runnerID, c: c, err: err}
+		}()
+	}
+	close(start)
+	wg.Wait()
+
+	wins := 0
+	conflicts := 0
+	steals := 0
+	var winner string
+	for _, r := range results {
+		if r.err != nil {
+			t.Fatalf("contender %s: %v", r.runnerID, r.err)
+		}
+		switch {
+		case r.c.Won:
+			wins++
+			winner = r.runnerID
+		case r.c.Stole:
+			steals++
+		case r.c.Conflicted:
+			conflicts++
+		}
+	}
+	if wins != 1 {
+		t.Fatalf("expected exactly 1 fresh winner, got %d (conflicts=%d steals=%d)",
+			wins, conflicts, steals)
+	}
+	if steals != 0 {
+		t.Fatalf("a fresh insert race must not produce a stale steal, got %d steals", steals)
+	}
+	if conflicts != contenders-1 {
+		t.Fatalf("expected %d conflicts, got %d", contenders-1, conflicts)
+	}
+
+	// Database-side proof: exactly one row, runner_id matches.
+	var rowCount int
+	if err := pool.QueryRow(ctx, `
+		SELECT COUNT(*) FROM sys_cron_executions WHERE job_name = $1 AND plan_time = $2
+	`, job.Name, planTime).Scan(&rowCount); err != nil {
+		t.Fatalf("count rows: %v", err)
+	}
+	if rowCount != 1 {
+		t.Fatalf("expected exactly 1 row in sys_cron_executions, got %d", rowCount)
+	}
+
+	var dbRunner string
+	if err := pool.QueryRow(ctx, `
+		SELECT runner_id FROM sys_cron_executions WHERE job_name = $1 AND plan_time = $2
+	`, job.Name, planTime).Scan(&dbRunner); err != nil {
+		t.Fatalf("scan winner: %v", err)
+	}
+	if dbRunner != winner {
+		t.Fatalf("DB winner %q != local winner %q", dbRunner, winner)
+	}
+}
--- a/server/internal/scheduler/db_ops.go
+++ b/server/internal/scheduler/db_ops.go
@@ -0,0 +1,390 @@
+package scheduler
+
+import (
+	"context"
+	"encoding/json"
+	"errors"
+	"fmt"
+	"time"
+
+	"github.com/google/uuid"
+	"github.com/jackc/pgx/v5"
+	"github.com/jackc/pgx/v5/pgtype"
+	"github.com/jackc/pgx/v5/pgxpool"
+)
+
+// ErrLeaseLost is returned by heartbeat / terminal-update primitives
+// when the row is no longer owned by the calling runner. The handler
+// should stop, and the scheduler will not write a terminal status for
+// the lost lease.
+var ErrLeaseLost = errors.New("scheduler: lease lost")
+
+// errTerminalIgnored is the sentinel finishSuccess / finishFailure use
+// to convey a successful UPDATE that affected zero rows. Internal to
+// this package.
+var errTerminalIgnored = ErrLeaseLost
+
+// dbNow returns Postgres's notion of "now" as the canonical clock. The
+// scheduler uses DB time for every plan calculation and lease window so
+// instances with skewed clocks still agree on the same plan_time.
+func dbNow(ctx context.Context, pool *pgxpool.Pool) (time.Time, error) {
+	var t time.Time
+	if err := pool.QueryRow(ctx, "SELECT now()").Scan(&t); err != nil {
+		return time.Time{}, fmt.Errorf("scheduler: read db now: %w", err)
+	}
+	return t.UTC(), nil
+}
+
+// claim is the result of trying to acquire a plan. Only one of the
+// boolean outcomes is true at a time.
+type claim struct {
+	ID         uuid.UUID
+	LeaseToken uuid.UUID
+	Attempt    int
+	Won        bool // fresh insert
+	Stole      bool // stale-steal or FAILED retry
+	Conflicted bool // another runner already owns this plan, or attempts exhausted
+}
+
+// tryClaim attempts to acquire the lease for (job, scope, plan_time).
+// Returns Won=true on a fresh insert, Stole=true on a stale-steal or
+// retry-after-FAILED, and Conflicted=true if another runner owns the
+// plan, the row is already SUCCESS, the FAILED row is not yet
+// retry-eligible (next_retry_at in the future), or attempts are
+// exhausted. The caller treats every Conflicted outcome the same way
+// (no-op); the scheduler distinguishes them through audit-row metrics
+// rather than per-call return fields.
+func tryClaim(
+	ctx context.Context,
+	pool *pgxpool.Pool,
+	job *JobSpec,
+	scope Scope,
+	planTime time.Time,
+	dbTime time.Time,
+	runnerID string,
+) (claim, error) {
+	// Fresh-insert path. ON CONFLICT DO NOTHING means losers do not
+	// touch the existing row — we follow up with the steal/retry path
+	// only if this insert was a conflict.
+	insertSQL := `
+		INSERT INTO sys_cron_executions (
+			job_name, scope_kind, scope_id, plan_time,
+			status, attempt, max_attempts,
+			runner_id, lease_token,
+			heartbeat_at, stale_after,
+			started_at, updated_at
+		) VALUES (
+			$1, $2, $3, $4,
+			'RUNNING', 1, $5,
+			$6, gen_random_uuid(),
+			$7::timestamptz, $7::timestamptz + make_interval(secs => $8),
+			$7::timestamptz, $7::timestamptz
+		)
+		ON CONFLICT ON CONSTRAINT uq_sys_cron_execution DO NOTHING
+		RETURNING id, lease_token, attempt
+	`
+	staleSecs := int64(job.StaleTimeout / time.Second)
+	if staleSecs <= 0 {
+		staleSecs = 1
+	}
+
+	var c claim
+	err := pool.QueryRow(ctx, insertSQL,
+		job.Name, scope.Kind, scope.ID, planTime,
+		job.MaxAttempts,
+		runnerID,
+		dbTime, staleSecs,
+	).Scan(&c.ID, &c.LeaseToken, &c.Attempt)
+	if err == nil {
+		c.Won = true
+		return c, nil
+	}
+	if !errors.Is(err, pgx.ErrNoRows) {
+		return claim{}, fmt.Errorf("scheduler: claim insert: %w", err)
+	}
+
+	// Conflict path. Try retry-after-FAILED or stale-steal in a single
+	// statement. The WHERE clause encodes both branches; the choice is
+	// made server-side from the existing row.
+	stealSQL := `
+		UPDATE sys_cron_executions
+		   SET status        = 'RUNNING',
+		       attempt       = attempt + 1,
+		       runner_id     = $1,
+		       lease_token   = gen_random_uuid(),
+		       heartbeat_at  = $2::timestamptz,
+		       stale_after   = $2::timestamptz + make_interval(secs => $3),
+		       started_at    = $2::timestamptz,
+		       finished_at   = NULL,
+		       duration_ms   = NULL,
+		       next_retry_at = NULL,
+		       error_code    = NULL,
+		       error_msg     = NULL,
+		       updated_at    = $2::timestamptz
+		 WHERE job_name   = $4
+		   AND scope_kind = $5
+		   AND scope_id   = $6
+		   AND plan_time  = $7
+		   AND attempt < max_attempts
+		   AND (
+		        (status = 'FAILED' AND COALESCE(next_retry_at, $2::timestamptz) <= $2::timestamptz)
+		        OR
+		        (status = 'RUNNING' AND stale_after < $2::timestamptz AND $8)
+		   )
+		RETURNING id, lease_token, attempt
+	`
+	err = pool.QueryRow(ctx, stealSQL,
+		runnerID,
+		dbTime, staleSecs,
+		job.Name, scope.Kind, scope.ID, planTime,
+		job.AllowStaleReentry,
+	).Scan(&c.ID, &c.LeaseToken, &c.Attempt)
+	if err == nil {
+		c.Stole = true
+		return c, nil
+	}
+	if !errors.Is(err, pgx.ErrNoRows) {
+		return claim{}, fmt.Errorf("scheduler: claim steal: %w", err)
+	}
+
+	// No-row update — somebody else owns a fresh RUNNING lease, the
+	// row is already SUCCESS, the row is FAILED but not yet
+	// retry-eligible, or attempts are exhausted. The caller treats
+	// these no-op cases identically.
+	c.Conflicted = true
+	return c, nil
+}
+
+// markStaleAsFailed transitions a single stale RUNNING row to FAILED
+// when its job has AllowStaleReentry=false. Returns RowsAffected so the
+// sweeper can log how many rows were closed.
+func markStaleAsFailed(
+	ctx context.Context,
+	pool *pgxpool.Pool,
+	jobName string,
+	dbTime time.Time,
+) (int64, error) {
+	tag, err := pool.Exec(ctx, `
+		UPDATE sys_cron_executions
+		   SET status      = 'FAILED',
+		       finished_at = $2,
+		       error_code  = 'stale_timeout',
+		       error_msg   = 'lease expired without heartbeat',
+		       updated_at  = $2
+		 WHERE job_name    = $1
+		   AND status      = 'RUNNING'
+		   AND stale_after < $2
+	`, jobName, dbTime)
+	if err != nil {
+		return 0, fmt.Errorf("scheduler: mark stale failed: %w", err)
+	}
+	return tag.RowsAffected(), nil
+}
+
+// heartbeat extends stale_after for the runner that holds the lease.
+// Returns ErrLeaseLost if the row is no longer ours (stolen or already
+// finalised).
+func heartbeat(
+	ctx context.Context,
+	pool *pgxpool.Pool,
+	id, leaseToken uuid.UUID,
+	staleTimeout time.Duration,
+) error {
+	staleSecs := int64(staleTimeout / time.Second)
+	if staleSecs <= 0 {
+		staleSecs = 1
+	}
+	tag, err := pool.Exec(ctx, `
+		UPDATE sys_cron_executions
+		   SET heartbeat_at = now(),
+		       stale_after  = now() + make_interval(secs => $3),
+		       updated_at   = now()
+		 WHERE id          = $1
+		   AND lease_token = $2
+		   AND status      = 'RUNNING'
+	`, id, leaseToken, staleSecs)
+	if err != nil {
+		return fmt.Errorf("scheduler: heartbeat: %w", err)
+	}
+	if tag.RowsAffected() == 0 {
+		return ErrLeaseLost
+	}
+	return nil
+}
+
+// finishSuccess writes a terminal SUCCESS row. The lease_token guard
+// prevents an ex-lease-holder from overwriting a newer attempt's state.
+func finishSuccess(
+	ctx context.Context,
+	pool *pgxpool.Pool,
+	id, leaseToken uuid.UUID,
+	dbTime time.Time,
+	durationMs int64,
+	res HandlerResult,
+) error {
+	resultJSON, err := encodeResult(res.Result)
+	if err != nil {
+		return err
+	}
+
+	tag, err := pool.Exec(ctx, `
+		UPDATE sys_cron_executions
+		   SET status        = 'SUCCESS',
+		       finished_at   = $3,
+		       duration_ms   = $4,
+		       rows_affected = $5,
+		       result        = $6::jsonb,
+		       error_code    = NULL,
+		       error_msg     = NULL,
+		       updated_at    = $3
+		 WHERE id            = $1
+		   AND lease_token   = $2
+		   AND status        = 'RUNNING'
+	`, id, leaseToken, dbTime, durationMs, res.RowsAffected, resultJSON)
+	if err != nil {
+		return fmt.Errorf("scheduler: finish success: %w", err)
+	}
+	if tag.RowsAffected() == 0 {
+		return errTerminalIgnored
+	}
+	return nil
+}
+
+// finishFailure writes a terminal FAILED row. nextRetryAt may be the
+// zero value if no retry is due (max_attempts reached).
+func finishFailure(
+	ctx context.Context,
+	pool *pgxpool.Pool,
+	id, leaseToken uuid.UUID,
+	dbTime time.Time,
+	durationMs int64,
+	errorCode, errorMsg string,
+	nextRetryAt time.Time,
+) error {
+	var nextRetry pgtype.Timestamptz
+	if !nextRetryAt.IsZero() {
+		nextRetry = pgtype.Timestamptz{Time: nextRetryAt, Valid: true}
+	}
+
+	if errorCode == "" {
+		errorCode = "handler_error"
+	}
+	if len(errorMsg) > 4000 {
+		errorMsg = errorMsg[:4000]
+	}
+
+	tag, err := pool.Exec(ctx, `
+		UPDATE sys_cron_executions
+		   SET status        = 'FAILED',
+		       finished_at   = $3,
+		       duration_ms   = $4,
+		       next_retry_at = $5,
+		       error_code    = $6,
+		       error_msg     = $7,
+		       updated_at    = $3
+		 WHERE id            = $1
+		   AND lease_token   = $2
+		   AND status        = 'RUNNING'
+	`, id, leaseToken, dbTime, durationMs, nextRetry, errorCode, errorMsg)
+	if err != nil {
+		return fmt.Errorf("scheduler: finish failure: %w", err)
+	}
+	if tag.RowsAffected() == 0 {
+		return errTerminalIgnored
+	}
+	return nil
+}
+
+// encodeResult serialises a handler's small structured result to a
+// JSONB-ready string. Empty/nil maps encode as `{}` so the column
+// default is preserved if no result was set.
+func encodeResult(in map[string]any) (string, error) {
+	if len(in) == 0 {
+		return "{}", nil
+	}
+	b, err := json.Marshal(in)
+	if err != nil {
+		return "", fmt.Errorf("scheduler: marshal result: %w", err)
+	}
+	// Defensive cap — keep result-row size predictable. Bigger
+	// payloads belong in structured logs.
+	if len(b) > 16*1024 {
+		return "", fmt.Errorf("scheduler: result payload too large (%d bytes); keep it small or use logs", len(b))
+	}
+	return string(b), nil
+}
+
+// latestPlanInfo returns the latest known plan_time for (job, scope)
+// plus the fields the catch-up planner needs to decide whether the row
+// is still claimable at the same plan_time (FAILED-with-retry) or
+// finished and the next plan_time should advance past it.
+type latestPlanInfo struct {
+	Found       bool
+	PlanTime    time.Time
+	Status      string
+	Attempt     int
+	MaxAttempts int
+	// NextRetryAt is zero (NULL in DB) when the row is not in a
+	// retry-eligible state, or when the next retry is due immediately
+	// (FAILED with no backoff configured).
+	NextRetryAt time.Time
+}
+
+// RetryEligible reports whether the latest stored row should still be
+// considered for the same plan_time on the next tick. True for FAILED
+// rows that have remaining attempts and whose next_retry_at has
+// passed; the every_plan planner uses this to keep the cursor on the
+// retry-eligible bucket so tryClaim's retry-from-FAILED branch can
+// fire.
+func (i latestPlanInfo) RetryEligible(now time.Time) bool {
+	if !i.Found {
+		return false
+	}
+	if i.Status != "FAILED" {
+		return false
+	}
+	if i.Attempt >= i.MaxAttempts {
+		return false
+	}
+	if i.NextRetryAt.IsZero() {
+		// COALESCE-style: NULL next_retry_at means "as soon as
+		// possible", which is right now.
+		return true
+	}
+	return !i.NextRetryAt.After(now)
+}
+
+func latestPlan(
+	ctx context.Context,
+	pool *pgxpool.Pool,
+	jobName string,
+	scope Scope,
+) (latestPlanInfo, error) {
+	var info latestPlanInfo
+	var nextRetry pgtype.Timestamptz
+	err := pool.QueryRow(ctx, `
+		SELECT plan_time, status, attempt, max_attempts, next_retry_at
+		  FROM sys_cron_executions
+		 WHERE job_name   = $1
+		   AND scope_kind = $2
+		   AND scope_id   = $3
+		 ORDER BY plan_time DESC
+		 LIMIT 1
+	`, jobName, scope.Kind, scope.ID).Scan(
+		&info.PlanTime, &info.Status,
+		&info.Attempt, &info.MaxAttempts,
+		&nextRetry,
+	)
+	if err != nil {
+		if errors.Is(err, pgx.ErrNoRows) {
+			return info, nil
+		}
+		return info, fmt.Errorf("scheduler: read latest plan: %w", err)
+	}
+	info.Found = true
+	info.PlanTime = info.PlanTime.UTC()
+	if nextRetry.Valid {
+		info.NextRetryAt = nextRetry.Time.UTC()
+	}
+	return info, nil
+}
--- a/server/internal/scheduler/every_plan_retry_test.go
+++ b/server/internal/scheduler/every_plan_retry_test.go
@@ -0,0 +1,183 @@
+package scheduler
+
+import (
+	"context"
+	"errors"
+	"sync/atomic"
+	"testing"
+	"time"
+
+	"github.com/jackc/pgx/v5/pgxpool"
+)
+
+// TestManagerEveryPlanRetriesFailedSamePlanTime exercises the
+// `every_plan` retry path that张大彪 flagged on PR #3707:
+//
+//	"every_plan 的 FAILED retry 路径断了。CatchUpEveryPlan 规划必须把
+//	 还在 retry 窗口、attempts < max_attempts 的 FAILED row 先递回去
+//	 给 tryClaim 的 retry 分支，不能直接 latestPlan + cadence 跳过"
+//
+// The previous planner unconditionally advanced the cursor to
+// `latestStored + cadence`, so after a FAILED row was written the next
+// tick would skip past that plan_time and never re-attempt it — even
+// though tryClaim's `(status='FAILED' AND COALESCE(next_retry_at, ...)
+// <= now)` branch is the explicit retry path.
+//
+// This test:
+//
+//  1. Registers an every_plan job whose handler ALWAYS returns an
+//     error.
+//  2. Runs a tick → FAILED row at plan_time T, attempt=1, next_retry_at
+//     stamped from RetryBackoff[0].
+//  3. Forces next_retry_at into the past (test fast-forwards so we
+//     don't have to wait for the real backoff).
+//  4. Runs a second tick → asserts the SAME plan_time T was retried
+//     and is now attempt=2 (not skipped to T+cadence).
+//
+// We pin a cadence well above the wall-clock difference between the
+// two ticks so the rounded "current latest plan" remains the same
+// bucket on both ticks, ensuring the cursor's behaviour is what we are
+// actually measuring.
+func TestManagerEveryPlanRetriesFailedSamePlanTime(t *testing.T) {
+	pool := integrationPool(t)
+	job := newTestJobSpec(uniqueJobName(t, "every_plan_retry"))
+	t.Cleanup(func() { cleanupExecutions(t, pool, job.Name) })
+
+	job.CatchUpMode = CatchUpEveryPlan
+	job.MaxPlansPerTick = 4
+	job.CatchUpWindow = 24 * time.Hour
+	// Long cadence so two consecutive runOnce calls land in the same
+	// plan_time bucket — the test is about the cursor, not the bucket
+	// math.
+	job.Cadence = time.Hour
+	job.MaxAttempts = 3
+	job.RetryBackoff = []time.Duration{
+		1 * time.Second, // attempt 1 → 2: sleep 1s
+		1 * time.Second, // attempt 2 → 3
+	}
+	job.AllowStaleReentry = false
+
+	var calls atomic.Int32
+	job.Handler = func(ctx context.Context, in HandlerInput) (HandlerResult, error) {
+		calls.Add(1)
+		return HandlerResult{}, errors.New("simulated handler failure")
+	}
+
+	mgr := NewManager(pool, Options{RunnerID: "retry-runner"})
+	if err := mgr.Register(*job); err != nil {
+		t.Fatalf("register: %v", err)
+	}
+
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+
+	// Tick 1: handler fails at plan_time T attempt 1.
+	if err := mgr.runOnce(ctx); err != nil {
+		t.Fatalf("first runOnce: %v", err)
+	}
+
+	rowsAfterTick1 := dumpJobRows(t, pool, job.Name)
+	if len(rowsAfterTick1) != 1 {
+		t.Fatalf("expected 1 row after tick 1, got %d: %+v", len(rowsAfterTick1), rowsAfterTick1)
+	}
+	r1 := rowsAfterTick1[0]
+	if r1.Status != "FAILED" {
+		t.Fatalf("expected FAILED after tick 1, got %q", r1.Status)
+	}
+	if r1.Attempt != 1 {
+		t.Fatalf("expected attempt=1 after tick 1, got %d", r1.Attempt)
+	}
+	if r1.NextRetryAt.IsZero() {
+		t.Fatalf("expected next_retry_at to be set after a retry-eligible failure")
+	}
+
+	planT := r1.PlanTime
+
+	// Force next_retry_at into the past so the second tick sees the
+	// retry as due. We deliberately use the DB's clock so this stays
+	// independent of the app process clock (consistent with the rest
+	// of the scheduler's time handling).
+	if _, err := pool.Exec(ctx, `
+		UPDATE sys_cron_executions
+		   SET next_retry_at = now() - INTERVAL '1 minute'
+		 WHERE id = $1
+	`, r1.ID); err != nil {
+		t.Fatalf("force next_retry_at into the past: %v", err)
+	}
+
+	// Tick 2: planner must keep cursor on plan_time T so tryClaim's
+	// FAILED-with-retry branch fires.
+	if err := mgr.runOnce(ctx); err != nil {
+		t.Fatalf("second runOnce: %v", err)
+	}
+
+	rowsAfterTick2 := dumpJobRows(t, pool, job.Name)
+	// Still exactly one row at plan_time T — the retry reuses the
+	// same row, it does not create a new one.
+	if len(rowsAfterTick2) != 1 {
+		t.Fatalf("expected 1 row after tick 2 (retry reuses row), got %d: %+v",
+			len(rowsAfterTick2), rowsAfterTick2)
+	}
+	r2 := rowsAfterTick2[0]
+	if !r2.PlanTime.Equal(planT) {
+		t.Fatalf("planner skipped past failed plan_time: tick1=%s tick2=%s",
+			planT.Format(time.RFC3339), r2.PlanTime.Format(time.RFC3339))
+	}
+	if r2.Attempt != 2 {
+		t.Fatalf("expected attempt=2 after retry, got %d", r2.Attempt)
+	}
+	if r2.Status != "FAILED" {
+		// Handler still fails, so attempt 2 also lands as FAILED.
+		// We still want to confirm the retry actually ran.
+		t.Fatalf("expected attempt 2 to land FAILED again (handler still errors), got %q", r2.Status)
+	}
+	if calls.Load() != 2 {
+		t.Fatalf("expected handler called twice across two ticks, got %d calls", calls.Load())
+	}
+}
+
+// rowSnapshot is the subset of sys_cron_executions fields the test
+// inspects.
+type rowSnapshot struct {
+	ID          string
+	PlanTime    time.Time
+	Status      string
+	Attempt     int
+	MaxAttempts int
+	NextRetryAt time.Time
+}
+
+func dumpJobRows(t *testing.T, pool *pgxpool.Pool, jobName string) []rowSnapshot {
+	t.Helper()
+	rows, err := pool.Query(context.Background(), `
+		SELECT id, plan_time, status, attempt, max_attempts, COALESCE(next_retry_at, 'epoch'::timestamptz)
+		  FROM sys_cron_executions
+		 WHERE job_name = $1
+		 ORDER BY plan_time ASC
+	`, jobName)
+	if err != nil {
+		t.Fatalf("query rows: %v", err)
+	}
+	defer rows.Close()
+	var out []rowSnapshot
+	for rows.Next() {
+		var r rowSnapshot
+		if err := rows.Scan(&r.ID, &r.PlanTime, &r.Status, &r.Attempt, &r.MaxAttempts, &r.NextRetryAt); err != nil {
+			t.Fatalf("scan: %v", err)
+		}
+		// Treat the 'epoch' COALESCE sentinel as zero so callers can
+		// distinguish "no retry scheduled" from "retry scheduled at
+		// some real timestamp".
+		if r.NextRetryAt.Year() == 1970 {
+			r.NextRetryAt = time.Time{}
+		} else {
+			r.NextRetryAt = r.NextRetryAt.UTC()
+		}
+		r.PlanTime = r.PlanTime.UTC()
+		out = append(out, r)
+	}
+	if err := rows.Err(); err != nil {
+		t.Fatalf("rows iter: %v", err)
+	}
+	return out
+}
--- a/server/internal/scheduler/jobs_task_usage.go
+++ b/server/internal/scheduler/jobs_task_usage.go
@@ -0,0 +1,120 @@
+package scheduler
+
+import (
+	"context"
+	"errors"
+	"fmt"
+	"time"
+
+	"github.com/jackc/pgx/v5"
+	"github.com/jackc/pgx/v5/pgxpool"
+
+	"github.com/multica-ai/multica/server/internal/taskusagebackfill"
+)
+
+// JobNameRollupTaskUsageHourly is the canonical name used in audit
+// rows. Stable across releases — do not rename without a migration.
+const JobNameRollupTaskUsageHourly = "rollup_task_usage_hourly"
+
+// TaskUsageHourlyJob returns the JobSpec that drives the
+// task_usage_hourly rollup. The handler calls the existing
+// `rollup_task_usage_hourly()` SQL function, which already holds
+// advisory lock 4246 internally so a concurrent legacy pg_cron tick or
+// manual call is safe (RFC §11.3).
+//
+// The spec is the canonical settings from the RFC §11.1:
+//
+//	cadence:               5m
+//	schedule_delay:        5m
+//	catch_up_mode:         latest_only (handler watermark drives history)
+//	run_timeout:           25m
+//	stale_timeout:         30m
+//	heartbeat_interval:    30s
+//	max_attempts:          3
+//	retry_backoff:         1m, 5m, 15m
+//	allow_stale_reentry:   true
+func TaskUsageHourlyJob(pool *pgxpool.Pool) JobSpec {
+	return JobSpec{
+		Name:              JobNameRollupTaskUsageHourly,
+		Cadence:           5 * time.Minute,
+		ScheduleDelay:     5 * time.Minute,
+		CatchUpMode:       CatchUpLatestOnly,
+		CatchUpWindow:     24 * time.Hour,
+		RunTimeout:        25 * time.Minute,
+		StaleTimeout:      30 * time.Minute,
+		HeartbeatInterval: 30 * time.Second,
+		AllowStaleReentry: true,
+		MaxAttempts:       3,
+		RetryBackoff: []time.Duration{
+			1 * time.Minute,
+			5 * time.Minute,
+			15 * time.Minute,
+		},
+		Scopes:  StaticScopes(ScopeGlobal),
+		Handler: makeTaskUsageHourlyHandler(pool),
+	}
+}
+
+// makeTaskUsageHourlyHandler reads the watermark before/after the SQL
+// function so the audit row records whether business state advanced.
+// The function call itself acquires advisory lock 4246 inside SQL — if
+// another writer holds the lock the function returns 0 immediately and
+// this attempt completes as SUCCESS with rows_affected=0, which is the
+// correct "lost the inner race, no work to do" outcome.
+func makeTaskUsageHourlyHandler(pool *pgxpool.Pool) Handler {
+	return func(ctx context.Context, in HandlerInput) (HandlerResult, error) {
+		watermarkBefore, err := readTaskUsageWatermark(ctx, pool)
+		if err != nil {
+			return HandlerResult{}, fmt.Errorf("read watermark before: %w", err)
+		}
+
+		var rows int64
+		if err := pool.QueryRow(ctx, `SELECT rollup_task_usage_hourly()`).Scan(&rows); err != nil {
+			return HandlerResult{}, fmt.Errorf("rollup_task_usage_hourly: %w", err)
+		}
+
+		watermarkAfter, err := readTaskUsageWatermark(ctx, pool)
+		if err != nil {
+			return HandlerResult{}, fmt.Errorf("read watermark after: %w", err)
+		}
+
+		// Light heartbeat at the end keeps stale_after fresh for jobs
+		// that ran much shorter than HeartbeatInterval.
+		if in.Heartbeat != nil {
+			_ = in.Heartbeat(ctx)
+		}
+
+		result := map[string]any{
+			"advisory_lock_id": taskusagebackfill.AdvisoryLockKey,
+		}
+		if !watermarkBefore.IsZero() {
+			result["watermark_before"] = watermarkBefore.UTC().Format(time.RFC3339)
+		}
+		if !watermarkAfter.IsZero() {
+			result["watermark_after"] = watermarkAfter.UTC().Format(time.RFC3339)
+		}
+		return HandlerResult{
+			RowsAffected: rows,
+			Result:       result,
+		}, nil
+	}
+}
+
+// readTaskUsageWatermark reads the current watermark; returns zero
+// time.Time if the row is missing (shouldn't happen post-101, but the
+// handler should not panic in that case).
+func readTaskUsageWatermark(ctx context.Context, pool *pgxpool.Pool) (time.Time, error) {
+	var t time.Time
+	err := pool.QueryRow(ctx, `
+		SELECT watermark_at
+		  FROM task_usage_hourly_rollup_state
+		 WHERE id = 1
+	`).Scan(&t)
+	if err != nil {
+		if errors.Is(err, pgx.ErrNoRows) {
+			return time.Time{}, nil
+		}
+		return time.Time{}, err
+	}
+	return t, nil
+}
--- a/server/internal/scheduler/manager.go
+++ b/server/internal/scheduler/manager.go
@@ -0,0 +1,466 @@
+package scheduler
+
+import (
+	"context"
+	"errors"
+	"fmt"
+	"log/slog"
+	"sync"
+	"time"
+
+	"github.com/google/uuid"
+	"github.com/jackc/pgx/v5/pgxpool"
+)
+
+// Options configure a Manager. Defaults are set in NewManager so all
+// fields are optional for callers.
+type Options struct {
+	// RunnerID identifies this process in audit rows. Empty defaults
+	// to a fresh UUID — readable enough for short-lived debugging,
+	// still unique across replicas.
+	RunnerID string
+
+	// TickInterval is how often the manager wakes up to evaluate due
+	// plans across all registered jobs. Should be smaller than the
+	// shortest job cadence; defaults to 30 * time.Second.
+	TickInterval time.Duration
+
+	// Logger is used for structured logs. nil defaults to
+	// slog.Default().
+	Logger *slog.Logger
+}
+
+// Manager is the per-process scheduler. Register one or more jobs and
+// call Run with a cancellable context.
+type Manager struct {
+	pool   *pgxpool.Pool
+	opts   Options
+	jobs   map[string]*JobSpec
+	mu     sync.RWMutex
+	logger *slog.Logger
+}
+
+// NewManager constructs a Manager. The pool MUST point at the database
+// containing the sys_cron_executions table. The manager does not start
+// any goroutine until Run is called.
+func NewManager(pool *pgxpool.Pool, opts Options) *Manager {
+	if opts.RunnerID == "" {
+		opts.RunnerID = uuid.NewString()
+	}
+	if opts.TickInterval <= 0 {
+		opts.TickInterval = 30 * time.Second
+	}
+	if opts.Logger == nil {
+		opts.Logger = slog.Default()
+	}
+	return &Manager{
+		pool:   pool,
+		opts:   opts,
+		jobs:   make(map[string]*JobSpec),
+		logger: opts.Logger.With("component", "scheduler", "runner_id", opts.RunnerID),
+	}
+}
+
+// Register adds a job to the manager. Must be called before Run; later
+// registrations are also accepted but the new job will not tick until
+// the next loop iteration.
+func (m *Manager) Register(job JobSpec) error {
+	if err := job.validate(); err != nil {
+		return err
+	}
+	spec := job
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if _, exists := m.jobs[spec.Name]; exists {
+		return fmt.Errorf("scheduler: duplicate job name %q", spec.Name)
+	}
+	m.jobs[spec.Name] = &spec
+	return nil
+}
+
+// snapshot returns a copy of registered specs so the loop iterates
+// without holding the lock.
+func (m *Manager) snapshot() []*JobSpec {
+	m.mu.RLock()
+	defer m.mu.RUnlock()
+	out := make([]*JobSpec, 0, len(m.jobs))
+	for _, j := range m.jobs {
+		out = append(out, j)
+	}
+	return out
+}
+
+// Run blocks until ctx is cancelled, ticking every Options.TickInterval.
+// Returns the cause of ctx termination (typically context.Canceled).
+func (m *Manager) Run(ctx context.Context) error {
+	m.logger.Info("scheduler starting",
+		"tick_interval", m.opts.TickInterval.String(),
+		"jobs", len(m.snapshot()))
+
+	// First tick immediately so a fresh start does not wait a full
+	// interval; backoff to TickInterval thereafter.
+	if err := m.runOnce(ctx); err != nil && !errors.Is(err, context.Canceled) {
+		m.logger.Warn("scheduler tick error", "error", err)
+	}
+
+	t := time.NewTicker(m.opts.TickInterval)
+	defer t.Stop()
+	for {
+		select {
+		case <-ctx.Done():
+			m.logger.Info("scheduler stopped", "reason", ctx.Err())
+			return ctx.Err()
+		case <-t.C:
+			if err := m.runOnce(ctx); err != nil && !errors.Is(err, context.Canceled) {
+				m.logger.Warn("scheduler tick error", "error", err)
+			}
+		}
+	}
+}
+
+// runOnce executes a single tick across every registered job.
+func (m *Manager) runOnce(ctx context.Context) error {
+	now, err := dbNow(ctx, m.pool)
+	if err != nil {
+		return err
+	}
+	for _, job := range m.snapshot() {
+		if err := m.runJob(ctx, job, now); err != nil {
+			m.logger.Warn("job tick error",
+				"job", job.Name, "error", err)
+		}
+	}
+	return nil
+}
+
+// runJob iterates the scopes for a single job and processes each due
+// plan according to the catch-up mode.
+func (m *Manager) runJob(ctx context.Context, job *JobSpec, now time.Time) error {
+	scopes, err := job.Scopes(ctx, now)
+	if err != nil {
+		return fmt.Errorf("scheduler: scope provider for %q: %w", job.Name, err)
+	}
+
+	// Close out abandoned RUNNING leases before planning. We run this
+	// for EVERY job, regardless of AllowStaleReentry, because:
+	//
+	//   * Non-reentrant jobs (AllowStaleReentry=false) need the FAILED
+	//     audit row + alert; this was the original motivation.
+	//
+	//   * Reentrant jobs (AllowStaleReentry=true) running in
+	//     `latest_only` mode never re-claim historical plan_times, so
+	//     a stuck RUNNING row from a crashed pod would otherwise sit
+	//     in the table forever and pin
+	//     `scheduler_running_stale_total > 0`. Marking it FAILED keeps
+	//     the gauge truthful and (because tryClaim's
+	//     retry-from-FAILED branch is still eligible at the same
+	//     plan_time when attempts remain) preserves the retry path.
+	//
+	//   * Reentrant `every_plan` jobs would otherwise rely on the
+	//     stale-steal branch in tryClaim — but that only fires when
+	//     the same plan_time is being attempted again, which races
+	//     this sweep harmlessly: whichever wins, the row leaves
+	//     RUNNING.
+	//
+	// MUL-2957 review: see张大彪's blocker #1.
+	if affected, err := markStaleAsFailed(ctx, m.pool, job.Name, now); err != nil {
+		m.logger.Warn("scheduler: mark stale failed",
+			"job", job.Name, "error", err)
+	} else if affected > 0 {
+		m.logger.Warn("scheduler: closed out abandoned RUNNING leases",
+			"job", job.Name,
+			"rows", affected,
+			"reentrant", job.AllowStaleReentry)
+	}
+
+	for _, scope := range scopes {
+		plans, err := m.plansForTick(ctx, job, scope, now)
+		if err != nil {
+			m.logger.Warn("scheduler: plan computation",
+				"job", job.Name, "scope", scope.String(), "error", err)
+			continue
+		}
+		for _, planTime := range plans {
+			m.processPlan(ctx, job, scope, planTime, now)
+		}
+	}
+	return nil
+}
+
+// plansForTick computes the list of plan_times to attempt this tick,
+// respecting the catch-up mode and bounds.
+func (m *Manager) plansForTick(
+	ctx context.Context,
+	job *JobSpec,
+	scope Scope,
+	now time.Time,
+) ([]time.Time, error) {
+	eligible := now.Add(-job.ScheduleDelay)
+	latest := FloorPlan(eligible, job.Cadence)
+	if latest.After(eligible) {
+		// Truncate landed in the future — only happens at very small
+		// cadences with rounding; nothing is due yet.
+		return nil, nil
+	}
+
+	switch job.CatchUpMode {
+	case CatchUpLatestOnly:
+		return []time.Time{latest}, nil
+
+	case CatchUpEveryPlan:
+		info, err := latestPlan(ctx, m.pool, job.Name, scope)
+		if err != nil {
+			return nil, err
+		}
+		// Determine the oldest plan we still consider in-window.
+		oldestAllowed := now.Add(-job.CatchUpWindow)
+		if job.CatchUpWindow <= 0 {
+			oldestAllowed = latest
+		}
+		var start time.Time
+		switch {
+		case info.Found && info.RetryEligible(now):
+			// FAILED at info.PlanTime with attempts remaining and
+			// next_retry_at <= now. Keep the cursor on the same
+			// plan_time so tryClaim's retry-from-FAILED branch picks
+			// it up; then advance forward through any newer plans
+			// that may also be due. Without this case the cursor
+			// would unconditionally jump to PlanTime+cadence and
+			// strand the FAILED row until max_attempts is reached
+			// elsewhere — which never happens in steady state.
+			//
+			// (MUL-2957 review: see张大彪's retry blocker.)
+			start = info.PlanTime
+		case info.Found:
+			// Latest stored plan is SUCCESS, RUNNING, or FAILED with
+			// no remaining retry budget — advance past it so we do
+			// not redundantly attempt to insert the same plan_time.
+			start = info.PlanTime.Add(job.Cadence)
+		default:
+			// No history yet for this (job, scope). Fall through to
+			// the latest plan to bootstrap.
+			start = latest
+		}
+		if start.Before(oldestAllowed) {
+			start = FloorPlan(oldestAllowed, job.Cadence)
+			if start.Before(oldestAllowed) {
+				start = start.Add(job.Cadence)
+			}
+		}
+		var plans []time.Time
+		for t := start; !t.After(latest) && len(plans) < job.MaxPlansPerTick; t = t.Add(job.Cadence) {
+			plans = append(plans, t)
+		}
+		return plans, nil
+
+	default:
+		return nil, fmt.Errorf("scheduler: job %q: unknown catch_up_mode %v", job.Name, job.CatchUpMode)
+	}
+}
+
+// (legacy wrapper removed; the previous latestPlan_ shim is no longer
+// needed because plansForTick renames its bucket variable to `latest`.)
+
+// processPlan owns one (job, scope, plan_time) attempt: claim → run
+// handler with heartbeat → terminal update.
+func (m *Manager) processPlan(
+	ctx context.Context,
+	job *JobSpec,
+	scope Scope,
+	planTime time.Time,
+	now time.Time,
+) {
+	c, err := tryClaim(ctx, m.pool, job, scope, planTime, now, m.opts.RunnerID)
+	if err != nil {
+		m.logger.Warn("scheduler claim error",
+			"job", job.Name, "scope", scope.String(),
+			"plan_time", planTime.Format(time.RFC3339), "error", err)
+		return
+	}
+	if c.Conflicted && !c.Won && !c.Stole {
+		// Another runner owns this plan, or it is already terminal.
+		// Silent no-op is the expected case.
+		return
+	}
+	if !c.Won && !c.Stole {
+		// Defensive — should not be reachable but covers future SQL
+		// changes that add a fourth path.
+		return
+	}
+
+	m.runClaimed(ctx, job, scope, planTime, c)
+}
+
+// runClaimed runs the handler for an already-claimed lease and writes
+// the terminal state.
+func (m *Manager) runClaimed(
+	ctx context.Context,
+	job *JobSpec,
+	scope Scope,
+	planTime time.Time,
+	c claim,
+) {
+	log := m.logger.With(
+		"job", job.Name,
+		"scope", scope.String(),
+		"plan_time", planTime.Format(time.RFC3339),
+		"attempt", c.Attempt,
+		"execution_id", c.ID.String())
+
+	if c.Stole {
+		log.Info("scheduler stole stale lease")
+	} else {
+		log.Info("scheduler claimed plan")
+	}
+
+	// Per-handler context bounded by RunTimeout. Heartbeats use a
+	// detached background context so a slow ctx cancellation cannot
+	// drop the renewal.
+	runCtx, cancel := context.WithTimeout(ctx, job.RunTimeout)
+	defer cancel()
+
+	hbCtx, hbCancel := context.WithCancel(context.Background())
+	defer hbCancel()
+	hbDone := make(chan struct{})
+	go m.runHeartbeats(hbCtx, hbDone, job, c, log)
+
+	start := time.Now()
+	res, handlerErr := func() (out HandlerResult, retErr error) {
+		// recover() inside the deferred closure assigns to the named
+		// return retErr so that a panicking handler is treated exactly
+		// like a returned error: classifyError records it as
+		// "handler_panic" and finishFailure writes the FAILED audit
+		// row. Without the named return the panic was being silently
+		// swallowed and the outer code wrote a SUCCESS row with
+		// rows_affected=0.
+		defer func() {
+			if r := recover(); r != nil {
+				log.Error("scheduler handler panic", "panic", r)
+				retErr = fmt.Errorf("%w: %v", ErrHandlerPanic, r)
+			}
+		}()
+		return job.Handler(runCtx, HandlerInput{
+			Job:      job,
+			Scope:    scope,
+			PlanTime: planTime,
+			Attempt:  c.Attempt,
+			RunnerID: m.opts.RunnerID,
+			Heartbeat: func(ctx context.Context) error {
+				return heartbeat(ctx, m.pool, c.ID, c.LeaseToken, job.StaleTimeout)
+			},
+		})
+	}()
+	duration := time.Since(start)
+
+	hbCancel()
+	<-hbDone
+
+	dur := duration.Milliseconds()
+	dbTime, dberr := dbNow(context.Background(), m.pool)
+	if dberr != nil {
+		// Falling back to local time keeps the audit row honest enough
+		// for a triage; the watermark indices use plan_time, not
+		// finished_at, for steady-state queries.
+		dbTime = time.Now().UTC()
+	}
+
+	if handlerErr != nil {
+		nextRetry := time.Time{}
+		if c.Attempt < job.MaxAttempts {
+			delay := job.retryDelay(c.Attempt)
+			nextRetry = dbTime.Add(delay)
+		}
+		errCode := classifyError(handlerErr)
+		if err := finishFailure(context.Background(), m.pool, c.ID, c.LeaseToken,
+			dbTime, dur, errCode, handlerErr.Error(), nextRetry); err != nil {
+			if errors.Is(err, ErrLeaseLost) {
+				log.Warn("scheduler: terminal FAILED ignored, lease was stolen",
+					"duration_ms", dur, "error", handlerErr.Error())
+				return
+			}
+			log.Error("scheduler: write terminal FAILED",
+				"duration_ms", dur, "handler_error", handlerErr.Error(), "error", err)
+			return
+		}
+		log.Warn("scheduler: handler failed",
+			"duration_ms", dur,
+			"error_code", errCode,
+			"error", handlerErr.Error(),
+			"will_retry", c.Attempt < job.MaxAttempts)
+		return
+	}
+
+	if err := finishSuccess(context.Background(), m.pool, c.ID, c.LeaseToken,
+		dbTime, dur, res); err != nil {
+		if errors.Is(err, ErrLeaseLost) {
+			log.Warn("scheduler: terminal SUCCESS ignored, lease was stolen",
+				"duration_ms", dur)
+			return
+		}
+		log.Error("scheduler: write terminal SUCCESS",
+			"duration_ms", dur, "error", err)
+		return
+	}
+	log.Info("scheduler: handler succeeded",
+		"duration_ms", dur,
+		"rows_affected", res.RowsAffected)
+}
+
+// runHeartbeats keeps the lease alive for the duration of the handler.
+// Exits when ctx is cancelled. Closes done when finished so the caller
+// can be sure no further heartbeat updates fire after run completion.
+func (m *Manager) runHeartbeats(
+	ctx context.Context,
+	done chan<- struct{},
+	job *JobSpec,
+	c claim,
+	log *slog.Logger,
+) {
+	defer close(done)
+	t := time.NewTicker(job.HeartbeatInterval)
+	defer t.Stop()
+	for {
+		select {
+		case <-ctx.Done():
+			return
+		case <-t.C:
+			hbCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+			err := heartbeat(hbCtx, m.pool, c.ID, c.LeaseToken, job.StaleTimeout)
+			cancel()
+			if errors.Is(err, ErrLeaseLost) {
+				log.Warn("scheduler: lease lost during heartbeat, runner should stop")
+				return
+			}
+			if err != nil {
+				log.Warn("scheduler: heartbeat error", "error", err)
+			}
+		}
+	}
+}
+
+// classifyError maps handler errors to short error_code strings stored
+// on the audit row. Unknown errors get a generic code; specific codes
+// are reserved for sentinels we recognise (context timeout, lease lost
+// before a terminal write, panic recovered by the scheduler, etc.).
+func classifyError(err error) string {
+	switch {
+	case errors.Is(err, context.DeadlineExceeded):
+		return "run_timeout"
+	case errors.Is(err, context.Canceled):
+		return "canceled"
+	case errors.Is(err, ErrLeaseLost):
+		return "lease_lost"
+	case errors.Is(err, ErrHandlerPanic):
+		return "handler_panic"
+	default:
+		return "handler_error"
+	}
+}
+
+// ErrHandlerPanic wraps a panic value recovered from a job handler so
+// the scheduler can record it on the audit row and (if max_attempts
+// allows) retry. Production handlers should prefer returning errors,
+// but we treat panics as failures rather than letting them either
+// crash the process or — worse — be silently dropped into a SUCCESS
+// audit row.
+var ErrHandlerPanic = errors.New("scheduler: handler panic")
--- a/server/internal/scheduler/manager_tick_test.go
+++ b/server/internal/scheduler/manager_tick_test.go
@@ -0,0 +1,187 @@
+package scheduler
+
+import (
+	"context"
+	"testing"
+	"time"
+
+	"github.com/google/uuid"
+)
+
+// TestManagerTickClosesAbandonedRunning exercises张大彪's blocker #1
+// directly through the Manager.runOnce tick path:
+//
+//	"Manager 在 plan 推进后必须能扫到旧的 stale RUNNING row 重入或
+//	 转 FAILED，不能只盯当前 latest——得加用例直接打 Manager 的 tick
+//	 路径，别再只测 tryClaim"
+//
+// We seed a stuck RUNNING row at an old plan_time (representing a
+// crashed pod's lease) and run Manager.runOnce. The tick must:
+//
+//   - close out the abandoned lease (status='FAILED',
+//     error_code='stale_timeout'), and
+//   - separately claim + complete the current latest plan_time.
+//
+// We verify both branches by registering a no-op handler that always
+// succeeds, so the current plan ends in SUCCESS while the old plan
+// ends in FAILED.
+//
+// The test runs the assertion for both AllowStaleReentry=true and
+// AllowStaleReentry=false, because before this fix the reentrant path
+// only swept stale rows during a same-plan_time replay, which never
+// happens in latest_only mode.
+func TestManagerTickClosesAbandonedRunning(t *testing.T) {
+	for _, allowStaleReentry := range []bool{true, false} {
+		t.Run(boolName("reentrant", allowStaleReentry), func(t *testing.T) {
+			pool := integrationPool(t)
+			job := newTestJobSpec(uniqueJobName(t, "tick_stale"))
+			job.AllowStaleReentry = allowStaleReentry
+			t.Cleanup(func() { cleanupExecutions(t, pool, job.Name) })
+
+			ctx := context.Background()
+			now, err := dbNow(ctx, pool)
+			if err != nil {
+				t.Fatalf("dbNow: %v", err)
+			}
+
+			// Seed a stale RUNNING row at the previous bucket. Lease
+			// is stamped to a long-dead runner so the Manager has no
+			// chance to take credit for finishing it.
+			oldPlan := FloorPlan(now, job.Cadence).Add(-job.Cadence)
+			oldRunner := "ghost-runner"
+			oldLease := uuid.New()
+			if _, err := pool.Exec(ctx, `
+				INSERT INTO sys_cron_executions (
+					job_name, scope_kind, scope_id, plan_time,
+					status, attempt, max_attempts,
+					runner_id, lease_token,
+					heartbeat_at, stale_after,
+					started_at, updated_at
+				) VALUES (
+					$1, 'global', 'global', $2,
+					'RUNNING', 1, $3,
+					$4, $5,
+					now() - INTERVAL '10 minutes',
+					now() - INTERVAL '5 minutes',
+					now() - INTERVAL '10 minutes',
+					now() - INTERVAL '10 minutes'
+				)
+			`, job.Name, oldPlan, job.MaxAttempts, oldRunner, oldLease); err != nil {
+				t.Fatalf("seed stale RUNNING row: %v", err)
+			}
+
+			mgr := NewManager(pool, Options{RunnerID: "manager-under-test"})
+			if err := mgr.Register(*job); err != nil {
+				t.Fatalf("register: %v", err)
+			}
+			if err := mgr.runOnce(ctx); err != nil {
+				t.Fatalf("runOnce: %v", err)
+			}
+
+			// Assert the old row is now FAILED with stale_timeout.
+			var oldStatus, oldErr string
+			if err := pool.QueryRow(ctx, `
+				SELECT status, COALESCE(error_code, '')
+				  FROM sys_cron_executions
+				 WHERE job_name = $1 AND plan_time = $2
+			`, job.Name, oldPlan).Scan(&oldStatus, &oldErr); err != nil {
+				t.Fatalf("scan old row: %v", err)
+			}
+			if oldStatus != "FAILED" {
+				t.Fatalf("old stale RUNNING row should be FAILED, got %q", oldStatus)
+			}
+			if oldErr != "stale_timeout" {
+				t.Fatalf("old row should carry error_code=stale_timeout, got %q", oldErr)
+			}
+
+			// Assert the current plan got picked up + finished SUCCESS
+			// by this Manager.
+			currentPlan := FloorPlan(now, job.Cadence)
+			var curStatus, curRunner string
+			if err := pool.QueryRow(ctx, `
+				SELECT status, runner_id
+				  FROM sys_cron_executions
+				 WHERE job_name = $1 AND plan_time = $2
+			`, job.Name, currentPlan).Scan(&curStatus, &curRunner); err != nil {
+				t.Fatalf("scan current row: %v", err)
+			}
+			if curStatus != "SUCCESS" {
+				t.Fatalf("current plan should be SUCCESS after tick, got %q", curStatus)
+			}
+			if curRunner != "manager-under-test" {
+				t.Fatalf("current plan runner should be the test Manager, got %q", curRunner)
+			}
+		})
+	}
+}
+
+// TestManagerHandlerPanicWritesFailed exercises张大彪's blocker #2:
+//
+//	"handler 的 panic recover 必须把 panic 写进 handlerErr 走
+//	 finishFailure，绝对不能记成 SUCCESS，加测试钉死。"
+//
+// A panicking handler must NOT result in a SUCCESS audit row. This
+// test installs a handler that always panics, runs one tick, and
+// asserts the resulting row is FAILED with error_code=handler_panic
+// and a non-empty error_msg containing the panic value.
+func TestManagerHandlerPanicWritesFailed(t *testing.T) {
+	pool := integrationPool(t)
+	job := newTestJobSpec(uniqueJobName(t, "panic"))
+	t.Cleanup(func() { cleanupExecutions(t, pool, job.Name) })
+
+	job.Handler = func(ctx context.Context, in HandlerInput) (HandlerResult, error) {
+		panic("simulated handler boom")
+	}
+
+	mgr := NewManager(pool, Options{RunnerID: "panic-runner"})
+	if err := mgr.Register(*job); err != nil {
+		t.Fatalf("register: %v", err)
+	}
+
+	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+	defer cancel()
+	if err := mgr.runOnce(ctx); err != nil {
+		t.Fatalf("runOnce: %v", err)
+	}
+
+	var status, errCode, errMsg string
+	if err := pool.QueryRow(ctx, `
+		SELECT status, COALESCE(error_code, ''), COALESCE(error_msg, '')
+		  FROM sys_cron_executions
+		 WHERE job_name = $1
+		 ORDER BY plan_time DESC
+		 LIMIT 1
+	`, job.Name).Scan(&status, &errCode, &errMsg); err != nil {
+		t.Fatalf("scan row: %v", err)
+	}
+	if status != "FAILED" {
+		t.Fatalf("panicking handler must NOT write SUCCESS; got status=%q", status)
+	}
+	if errCode != "handler_panic" {
+		t.Fatalf("expected error_code=handler_panic, got %q", errCode)
+	}
+	if errMsg == "" {
+		t.Fatalf("expected non-empty error_msg containing panic detail")
+	}
+	// Sanity: error_msg should mention the panic value so on-call can
+	// triage from the audit row alone.
+	if !contains(errMsg, "simulated handler boom") {
+		t.Fatalf("expected error_msg to include panic value, got %q", errMsg)
+	}
+}
+
+func contains(s, sub string) bool {
+	for i := 0; i+len(sub) <= len(s); i++ {
+		if s[i:i+len(sub)] == sub {
+			return true
+		}
+	}
+	return false
+}
+
+func boolName(label string, v bool) string {
+	if v {
+		return label + "=true"
+	}
+	return label + "=false"
+}
--- a/server/internal/scheduler/pgcron_concurrent_test.go
+++ b/server/internal/scheduler/pgcron_concurrent_test.go
@@ -0,0 +1,242 @@
+package scheduler
+
+import (
+	"context"
+	"sync"
+	"testing"
+	"time"
+
+	"github.com/jackc/pgx/v5/pgxpool"
+)
+
+// TestPgCronConcurrentNoDoubleWrite covers张大彪's blocker #4:
+//
+//	"`pg_cron` 并跑那条覆盖要真打——scheduler handler 跟直接
+//	 `SELECT rollup_task_usage_hourly()` / 旧 cron 入口并发，验证
+//	 advisory lock 4246 下不双写。"
+//
+// The test seeds historical `task_usage` rows under a freshly created
+// agent / runtime / agent_task_queue fixture, advances the rollup
+// watermark backwards so a single tick has real work to do, then
+// invokes `rollup_task_usage_hourly()` directly from N concurrent
+// goroutines. This is the same SQL entrypoint the in-process
+// scheduler handler calls AND the same one any leftover `pg_cron`
+// job or operator would call by hand. Advisory lock 4246 inside the
+// SQL function must serialise them: exactly one caller advances the
+// watermark and recomputes the buckets, every other caller returns 0
+// rows immediately.
+//
+// The pass criteria are the operational invariants:
+//
+//   - Across all callers, exactly one returned a non-zero rows count
+//     (the one that won the advisory lock).
+//   - The watermark advanced exactly once — specifically, the resulting
+//     watermark equals what the winning caller computed, and not any
+//     multiple of it.
+//   - The post-rollup `task_usage_hourly` rows match what we expect
+//     from the seeded `task_usage` data (token sums + bucket count).
+func TestPgCronConcurrentNoDoubleWrite(t *testing.T) {
+	pool := integrationPool(t)
+	ctx := context.Background()
+
+	// Seed an isolated workspace/runtime/agent/task and a handful of
+	// task_usage rows landing in the same UTC hour bucket. The bucket
+	// math is the SQL helper task_usage_hour_bucket(...).
+	ws, _, _, task := seedRollupFixture(t, pool)
+	t.Cleanup(func() { cleanupRollupFixture(t, pool, ws) })
+
+	// Bucket 30 minutes ago so the rollup window (now - 5min) covers
+	// the row but not so old that monthly slicing matters.
+	bucketTS := time.Now().UTC().Add(-30 * time.Minute)
+
+	const rowsToSeed = 4
+	for i := 0; i < rowsToSeed; i++ {
+		_, err := pool.Exec(ctx, `
+			INSERT INTO task_usage (
+				task_id, provider, model,
+				input_tokens, output_tokens, cache_read_tokens, cache_write_tokens,
+				created_at, updated_at
+			)
+			VALUES ($1, 'openai', $2, $3, $4, 0, 0, $5, $5)
+		`, task, "model-"+string(rune('A'+i)), int64(100+i), int64(200+i), bucketTS)
+		if err != nil {
+			t.Fatalf("seed task_usage row %d: %v", i, err)
+		}
+	}
+
+	// Force the watermark back so one tick has real work to do. The
+	// rollup function caps at 1 day per call, which is plenty for our
+	// 30-minute-old fixture.
+	if _, err := pool.Exec(ctx, `
+		UPDATE task_usage_hourly_rollup_state
+		   SET watermark_at = $1
+		 WHERE id = 1
+	`, bucketTS.Add(-1*time.Hour)); err != nil {
+		t.Fatalf("force watermark backwards: %v", err)
+	}
+
+	// Wipe any pre-existing hourly rows for this fixture so we can
+	// assert exactly what the rollup wrote.
+	if _, err := pool.Exec(ctx, `
+		DELETE FROM task_usage_hourly WHERE workspace_id = $1
+	`, ws); err != nil {
+		t.Fatalf("clear hourly rows: %v", err)
+	}
+
+	// Capture the winning watermark by snapshotting before/after.
+	var watermarkBefore time.Time
+	if err := pool.QueryRow(ctx, `
+		SELECT watermark_at FROM task_usage_hourly_rollup_state WHERE id = 1
+	`).Scan(&watermarkBefore); err != nil {
+		t.Fatalf("read watermark before: %v", err)
+	}
+
+	const callers = 6
+	results := make([]int64, callers)
+	errs := make([]error, callers)
+	var wg sync.WaitGroup
+	gate := make(chan struct{})
+	for i := range callers {
+		i := i
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			<-gate
+			err := pool.QueryRow(ctx, `SELECT rollup_task_usage_hourly()`).Scan(&results[i])
+			errs[i] = err
+		}()
+	}
+	close(gate) // start everyone simultaneously
+	wg.Wait()
+
+	winners := 0
+	losers := 0
+	var winningRowCount int64
+	for i, err := range errs {
+		if err != nil {
+			t.Fatalf("caller %d: %v", i, err)
+		}
+		if results[i] > 0 {
+			winners++
+			winningRowCount = results[i]
+		} else {
+			losers++
+		}
+	}
+	if winners != 1 {
+		t.Fatalf("advisory lock 4246 must serialise rollup; got winners=%d losers=%d", winners, losers)
+	}
+	if losers != callers-1 {
+		t.Fatalf("expected %d losers, got %d", callers-1, losers)
+	}
+	if winningRowCount == 0 {
+		t.Fatalf("winning caller returned 0 rows; fixture not wired correctly")
+	}
+
+	// The resulting hourly rows must match exactly the expected
+	// per-(provider, model) aggregation. Running the rollup again
+	// (under no contention) must not change the row count or sums —
+	// that's the SQL function's idempotency contract, and it is what
+	// makes pg_cron + scheduler concurrent execution safe.
+	expectedHourlyRows := rowsToSeed // one per distinct model
+	hourlyRows := countHourlyRowsForWorkspace(t, pool, ws)
+	if hourlyRows != expectedHourlyRows {
+		t.Fatalf("expected %d hourly rows, got %d", expectedHourlyRows, hourlyRows)
+	}
+
+	// Run rollup again and assert no double-write.
+	var followupRows int64
+	if err := pool.QueryRow(ctx, `SELECT rollup_task_usage_hourly()`).Scan(&followupRows); err != nil {
+		t.Fatalf("followup rollup: %v", err)
+	}
+	if followupRows != 0 {
+		t.Fatalf("idempotent re-run should return 0 rows, got %d", followupRows)
+	}
+	if got := countHourlyRowsForWorkspace(t, pool, ws); got != expectedHourlyRows {
+		t.Fatalf("idempotent re-run changed row count from %d to %d", expectedHourlyRows, got)
+	}
+
+	// Watermark advanced past our forced point exactly once — the
+	// cap is `LEAST(now()-5min, watermark + 1 day)`, so the new
+	// watermark must be > watermarkBefore but not duplicated.
+	var watermarkAfter time.Time
+	if err := pool.QueryRow(ctx, `
+		SELECT watermark_at FROM task_usage_hourly_rollup_state WHERE id = 1
+	`).Scan(&watermarkAfter); err != nil {
+		t.Fatalf("read watermark after: %v", err)
+	}
+	if !watermarkAfter.After(watermarkBefore) {
+		t.Fatalf("watermark did not advance: before=%s after=%s", watermarkBefore, watermarkAfter)
+	}
+}
+
+// seedRollupFixture creates the smallest viable
+// (workspace, runtime, agent, task) graph required for task_usage rows
+// to participate in the hourly rollup — the rollup window joins on
+// agent + runtime + (optional) issue, so all four parents must exist.
+// Returns the four IDs.
+func seedRollupFixture(t *testing.T, pool *pgxpool.Pool) (string, string, string, string) {
+	t.Helper()
+	ctx := context.Background()
+	suffix := "rollup-" + uniqueSuffix()
+
+	var wsID, runtimeID, agentID, taskID string
+	if err := pool.QueryRow(ctx, `
+		INSERT INTO workspace (name, slug)
+		VALUES ($1, $1)
+		RETURNING id
+	`, suffix).Scan(&wsID); err != nil {
+		t.Fatalf("seed workspace: %v", err)
+	}
+	if err := pool.QueryRow(ctx, `
+		INSERT INTO agent_runtime (
+			workspace_id, daemon_id, name, runtime_mode, provider, status,
+			device_info, metadata, last_seen_at
+		)
+		VALUES ($1, NULL, $2, 'cloud', 'p', 'online', '{}'::jsonb, '{}'::jsonb, now())
+		RETURNING id
+	`, wsID, "rt-"+suffix).Scan(&runtimeID); err != nil {
+		t.Fatalf("seed agent_runtime: %v", err)
+	}
+	if err := pool.QueryRow(ctx, `
+		INSERT INTO agent (
+			workspace_id, name, description, runtime_mode, runtime_config,
+			runtime_id, visibility, max_concurrent_tasks
+		)
+		VALUES ($1, $2, '', 'cloud', '{}'::jsonb, $3, 'workspace', 1)
+		RETURNING id
+	`, wsID, "ag-"+suffix, runtimeID).Scan(&agentID); err != nil {
+		t.Fatalf("seed agent: %v", err)
+	}
+	if err := pool.QueryRow(ctx, `
+		INSERT INTO agent_task_queue (agent_id, runtime_id)
+		VALUES ($1, $2)
+		RETURNING id
+	`, agentID, runtimeID).Scan(&taskID); err != nil {
+		t.Fatalf("seed agent_task_queue: %v", err)
+	}
+	return wsID, runtimeID, agentID, taskID
+}
+
+func cleanupRollupFixture(t *testing.T, pool *pgxpool.Pool, wsID string) {
+	t.Helper()
+	if _, err := pool.Exec(context.Background(),
+		`DELETE FROM workspace WHERE id = $1`, wsID); err != nil {
+		t.Logf("cleanup rollup fixture: %v", err)
+	}
+}
+
+func countHourlyRowsForWorkspace(t *testing.T, pool *pgxpool.Pool, wsID string) int {
+	t.Helper()
+	var n int
+	if err := pool.QueryRow(context.Background(), `
+		SELECT COUNT(*) FROM task_usage_hourly WHERE workspace_id = $1
+	`, wsID).Scan(&n); err != nil {
+		t.Fatalf("count hourly: %v", err)
+	}
+	return n
+}
+
+func uniqueSuffix() string {
+	return time.Now().UTC().Format("150405.000000000")
+}
--- a/server/internal/scheduler/spec.go
+++ b/server/internal/scheduler/spec.go
@@ -0,0 +1,234 @@
+// Package scheduler is the DB-backed execution-record scheduler
+// described in docs/db-backed-execution-scheduler-rfc.md (MUL-2957).
+//
+// The scheduler turns the `sys_cron_executions` table into the
+// distributed lock + audit log for every internal periodic job. Each
+// app instance ticks the same registered jobs, but the table's unique
+// key on (job_name, scope_kind, scope_id, plan_time) ensures only one
+// instance wins the lease for a given plan; losers no-op silently.
+//
+// Failure handling, stale-lease theft, retry policy, and catch-up
+// behaviour are all driven by the registered JobSpec — the scheduler
+// itself is intentionally a thin shell around the SQL primitives in
+// db_ops.go.
+package scheduler
+
+import (
+	"context"
+	"fmt"
+	"strings"
+	"time"
+)
+
+// CatchUpMode controls how the scheduler decides which plan_times to
+// claim when its tick runs late or after a long pause.
+type CatchUpMode int
+
+const (
+	// CatchUpLatestOnly only claims the most recently due plan. Use this
+	// for jobs whose handler has its own watermark and recovers missed
+	// data without per-tick replay (e.g. the task_usage hourly rollup,
+	// where rollup_task_usage_hourly_window catches up via
+	// task_usage_hourly_rollup_state.watermark_at).
+	CatchUpLatestOnly CatchUpMode = iota
+
+	// CatchUpEveryPlan claims every missed plan_time, oldest first, up
+	// to MaxPlansPerTick per tick and bounded by CatchUpWindow. Use for
+	// jobs where each plan bucket has independent business meaning.
+	CatchUpEveryPlan
+)
+
+func (m CatchUpMode) String() string {
+	switch m {
+	case CatchUpLatestOnly:
+		return "latest_only"
+	case CatchUpEveryPlan:
+		return "every_plan"
+	default:
+		return fmt.Sprintf("unknown(%d)", int(m))
+	}
+}
+
+// Scope identifies the locking dimension for a planned execution. For
+// global jobs the canonical value is ScopeGlobal — the literal string
+// "global" is used for both kind and id so the unique key has no NULL
+// columns.
+type Scope struct {
+	Kind string
+	ID   string
+}
+
+// ScopeGlobal is the singleton scope used by jobs that lock the whole
+// database (e.g. rollup_task_usage_hourly).
+var ScopeGlobal = Scope{Kind: "global", ID: "global"}
+
+func (s Scope) String() string { return s.Kind + "/" + s.ID }
+
+// ScopeProvider produces the list of scopes the scheduler should tick
+// for a given job at a given time. For global jobs the function returns
+// {ScopeGlobal}; sharded jobs may return one entry per shard.
+type ScopeProvider func(ctx context.Context, now time.Time) ([]Scope, error)
+
+// HandlerInput is what the scheduler passes to a job handler.
+type HandlerInput struct {
+	Job       *JobSpec
+	Scope     Scope
+	PlanTime  time.Time
+	Attempt   int
+	RunnerID  string
+	Heartbeat func(ctx context.Context) error
+}
+
+// HandlerResult is what a handler returns. RowsAffected and Result feed
+// the audit row; Result must be small (the table caps it implicitly via
+// JSONB plus a runtime guard in finishSuccess).
+type HandlerResult struct {
+	RowsAffected int64
+	Result       map[string]any
+}
+
+// Handler is the business logic for a job. The scheduler owns the
+// lease, calls Handler exactly once per claimed (job, scope, plan_time)
+// row, and writes the terminal status back guarded by lease_token.
+//
+// Long-running handlers MUST call HandlerInput.Heartbeat periodically
+// (e.g. every 30s) so the scheduler can extend stale_after; if the
+// returned error is ErrLeaseLost, the handler should stop and return.
+type Handler func(ctx context.Context, in HandlerInput) (HandlerResult, error)
+
+// JobSpec describes one registered job. The scheduler stores specs in
+// its registry keyed by Name; Name MUST be stable across releases
+// because it is the audit/index key.
+type JobSpec struct {
+	// Name is the canonical job identifier. Use snake_case ASCII.
+	Name string
+
+	// Cadence is the plan bucket size (e.g. 5 * time.Minute). The
+	// scheduler floors `db_now - ScheduleDelay` to a multiple of
+	// Cadence to derive the canonical UTC plan_time.
+	Cadence time.Duration
+
+	// ScheduleDelay shifts the eligibility horizon back from "now". A
+	// 5-minute delay means the 12:00 plan only becomes eligible at
+	// 12:05 (db_now). This keeps just-arrived data from being missed
+	// by handlers that compare against `now() - 5 min` upper bounds
+	// (e.g. rollup_task_usage_hourly_window).
+	ScheduleDelay time.Duration
+
+	// CatchUpMode selects between latest-only and every-plan replay.
+	CatchUpMode CatchUpMode
+
+	// CatchUpWindow bounds how far back the scheduler will go when
+	// replaying missed plans (CatchUpEveryPlan) or counting skipped
+	// plans (CatchUpLatestOnly). Plans older than now - CatchUpWindow
+	// are ignored.
+	CatchUpWindow time.Duration
+
+	// MaxPlansPerTick caps the number of plans claimed in a single
+	// tick under CatchUpEveryPlan. Latest-only jobs ignore this value.
+	MaxPlansPerTick int
+
+	// RunTimeout bounds the per-handler context. Must be smaller than
+	// StaleTimeout.
+	RunTimeout time.Duration
+
+	// StaleTimeout is how long after the last heartbeat a RUNNING
+	// lease is considered stale. If the lease is stale and
+	// AllowStaleReentry is true, another runner may steal it.
+	StaleTimeout time.Duration
+
+	// HeartbeatInterval is how often the scheduler renews stale_after
+	// while the handler is running. Must be smaller than StaleTimeout.
+	HeartbeatInterval time.Duration
+
+	// AllowStaleReentry permits another runner to steal a stale
+	// RUNNING lease. Set false for non-idempotent jobs; stale leases
+	// then transition to FAILED with error_code='stale_timeout' and
+	// require manual repair.
+	AllowStaleReentry bool
+
+	// MaxAttempts caps the number of times the same plan_time may be
+	// attempted before staying in FAILED. Includes the first attempt.
+	MaxAttempts int
+
+	// RetryBackoff[i] is the delay before attempt i+2 (the second
+	// attempt). Index past len-1 reuses the last entry. Empty slice
+	// disables retry.
+	RetryBackoff []time.Duration
+
+	// Scopes returns the scopes to tick on each loop. For global jobs
+	// use the helper StaticScopes(ScopeGlobal).
+	Scopes ScopeProvider
+
+	// Handler is the per-execution business logic.
+	Handler Handler
+}
+
+// StaticScopes returns a ScopeProvider that always emits the supplied
+// scopes. Use for jobs whose scope set never changes (e.g.
+// global/global, or a fixed shard count).
+func StaticScopes(scopes ...Scope) ScopeProvider {
+	frozen := append([]Scope(nil), scopes...)
+	return func(_ context.Context, _ time.Time) ([]Scope, error) {
+		return frozen, nil
+	}
+}
+
+// validate enforces invariants the SQL primitives rely on.
+func (j *JobSpec) validate() error {
+	if strings.TrimSpace(j.Name) == "" {
+		return fmt.Errorf("scheduler: job name is required")
+	}
+	if j.Cadence <= 0 {
+		return fmt.Errorf("scheduler: job %q: cadence must be > 0", j.Name)
+	}
+	if j.RunTimeout <= 0 {
+		return fmt.Errorf("scheduler: job %q: run_timeout must be > 0", j.Name)
+	}
+	if j.StaleTimeout <= j.RunTimeout {
+		return fmt.Errorf("scheduler: job %q: stale_timeout (%s) must be greater than run_timeout (%s)",
+			j.Name, j.StaleTimeout, j.RunTimeout)
+	}
+	if j.HeartbeatInterval <= 0 || j.HeartbeatInterval >= j.StaleTimeout {
+		return fmt.Errorf("scheduler: job %q: heartbeat_interval must be > 0 and < stale_timeout", j.Name)
+	}
+	if j.MaxAttempts < 1 {
+		return fmt.Errorf("scheduler: job %q: max_attempts must be >= 1", j.Name)
+	}
+	if j.Scopes == nil {
+		return fmt.Errorf("scheduler: job %q: scopes provider is required", j.Name)
+	}
+	if j.Handler == nil {
+		return fmt.Errorf("scheduler: job %q: handler is required", j.Name)
+	}
+	if j.CatchUpMode == CatchUpEveryPlan && j.MaxPlansPerTick <= 0 {
+		return fmt.Errorf("scheduler: job %q: max_plans_per_tick must be > 0 for every_plan catch-up", j.Name)
+	}
+	return nil
+}
+
+// retryDelay returns the wait between attempt N (1-indexed) failing and
+// the next attempt being eligible.
+func (j *JobSpec) retryDelay(attempt int) time.Duration {
+	if len(j.RetryBackoff) == 0 {
+		return 0
+	}
+	idx := attempt - 1
+	if idx < 0 {
+		idx = 0
+	}
+	if idx >= len(j.RetryBackoff) {
+		idx = len(j.RetryBackoff) - 1
+	}
+	return j.RetryBackoff[idx]
+}
+
+// FloorPlan returns the canonical UTC plan_time bucket that contains
+// `eligible` for cadence c. Exposed for tests.
+func FloorPlan(eligible time.Time, c time.Duration) time.Time {
+	if c <= 0 {
+		return eligible.UTC()
+	}
+	t := eligible.UTC()
+	return t.Truncate(c)
+}
--- a/server/internal/scheduler/spec_test.go
+++ b/server/internal/scheduler/spec_test.go
@@ -0,0 +1,44 @@
+package scheduler
+
+import (
+	"testing"
+	"time"
+)
+
+func TestFloorPlanUTC(t *testing.T) {
+	cadence := 5 * time.Minute
+
+	cases := []struct {
+		name  string
+		input time.Time
+		want  time.Time
+	}{
+		{
+			name:  "exact bucket boundary",
+			input: time.Date(2026, 6, 3, 8, 15, 0, 0, time.UTC),
+			want:  time.Date(2026, 6, 3, 8, 15, 0, 0, time.UTC),
+		},
+		{
+			name:  "mid-bucket truncates down",
+			input: time.Date(2026, 6, 3, 8, 17, 42, 0, time.UTC),
+			want:  time.Date(2026, 6, 3, 8, 15, 0, 0, time.UTC),
+		},
+		{
+			name:  "non-utc input normalises to utc bucket",
+			input: time.Date(2026, 6, 3, 17, 17, 42, 0, time.FixedZone("CST", 9*3600)),
+			want:  time.Date(2026, 6, 3, 8, 15, 0, 0, time.UTC),
+		},
+	}
+	for _, tc := range cases {
+		t.Run(tc.name, func(t *testing.T) {
+			got := FloorPlan(tc.input, cadence)
+			if !got.Equal(tc.want) {
+				t.Fatalf("FloorPlan(%s, %s) = %s; want %s",
+					tc.input, cadence, got, tc.want)
+			}
+			if got.Location() != time.UTC {
+				t.Fatalf("FloorPlan must return UTC, got %s", got.Location())
+			}
+		})
+	}
+}
--- a/server/internal/scheduler/stale_steal_test.go
+++ b/server/internal/scheduler/stale_steal_test.go
@@ -0,0 +1,187 @@
+package scheduler
+
+import (
+	"context"
+	"errors"
+	"fmt"
+	"os"
+	"testing"
+	"time"
+
+	"github.com/google/uuid"
+	"github.com/jackc/pgx/v5/pgxpool"
+)
+
+// integrationPool returns a pool against the configured DATABASE_URL,
+// or skips the test if the database is not reachable. Mirrors the
+// pattern used by server/cmd/server/integration_test.go and
+// internal/handler/handler_test.go (see those files' TestMain).
+func integrationPool(t *testing.T) *pgxpool.Pool {
+	t.Helper()
+	dbURL := os.Getenv("DATABASE_URL")
+	if dbURL == "" {
+		dbURL = "postgres://multica:multica@localhost:5432/multica?sslmode=disable"
+	}
+	ctx := context.Background()
+	pool, err := pgxpool.New(ctx, dbURL)
+	if err != nil {
+		t.Skipf("scheduler integration tests require Postgres: %v", err)
+	}
+	if err := pool.Ping(ctx); err != nil {
+		pool.Close()
+		t.Skipf("scheduler integration tests require Postgres: %v", err)
+	}
+	t.Cleanup(pool.Close)
+	return pool
+}
+
+func cleanupExecutions(t *testing.T, pool *pgxpool.Pool, jobName string) {
+	t.Helper()
+	if _, err := pool.Exec(context.Background(),
+		`DELETE FROM sys_cron_executions WHERE job_name = $1`, jobName); err != nil {
+		t.Fatalf("clean executions: %v", err)
+	}
+}
+
+// uniqueJobName isolates concurrent CI runs so each test has its own
+// job-name partition in sys_cron_executions.
+func uniqueJobName(t *testing.T, prefix string) string {
+	t.Helper()
+	return fmt.Sprintf("%s_%s", prefix, uuid.NewString()[:8])
+}
+
+func newTestJobSpec(name string) *JobSpec {
+	return &JobSpec{
+		Name:              name,
+		Cadence:           5 * time.Minute,
+		ScheduleDelay:     0,
+		CatchUpMode:       CatchUpLatestOnly,
+		CatchUpWindow:     time.Hour,
+		RunTimeout:        time.Minute,
+		StaleTimeout:      2 * time.Minute,
+		HeartbeatInterval: 30 * time.Second,
+		AllowStaleReentry: true,
+		MaxAttempts:       3,
+		Scopes:            StaticScopes(ScopeGlobal),
+		Handler: func(ctx context.Context, in HandlerInput) (HandlerResult, error) {
+			return HandlerResult{}, nil
+		},
+	}
+}
+
+// TestStaleStealTerminalUpdateIgnored covers RFC §14:
+//
+//	"stale steal | winner heartbeat stops, stale_after expires, another
+//	 runner steal lease; old lease terminal update doesn't take effect."
+//
+// We claim a plan, simulate a stuck handler by setting stale_after into
+// the past, steal it as a second runner, then verify that the FIRST
+// runner's terminal SUCCESS write is rejected by the lease_token guard
+// — the row stays in the second runner's RUNNING state.
+func TestStaleStealTerminalUpdateIgnored(t *testing.T) {
+	pool := integrationPool(t)
+	job := newTestJobSpec(uniqueJobName(t, "stale_steal"))
+	t.Cleanup(func() { cleanupExecutions(t, pool, job.Name) })
+
+	ctx := context.Background()
+	now, err := dbNow(ctx, pool)
+	if err != nil {
+		t.Fatalf("dbNow: %v", err)
+	}
+	planTime := FloorPlan(now, job.Cadence)
+
+	// Step 1: original runner claims the plan.
+	original, err := tryClaim(ctx, pool, job, ScopeGlobal, planTime, now, "runner-A")
+	if err != nil {
+		t.Fatalf("initial claim: %v", err)
+	}
+	if !original.Won {
+		t.Fatalf("expected fresh win, got %+v", original)
+	}
+
+	// Step 2: simulate a stuck handler — stale_after moved to the past.
+	if _, err := pool.Exec(ctx, `
+		UPDATE sys_cron_executions
+		   SET stale_after  = now() - INTERVAL '1 minute',
+		       heartbeat_at = now() - INTERVAL '5 minute'
+		 WHERE id = $1
+	`, original.ID); err != nil {
+		t.Fatalf("force stale: %v", err)
+	}
+
+	// Step 3: a different runner steals the lease via the same primitive.
+	stealNow, err := dbNow(ctx, pool)
+	if err != nil {
+		t.Fatalf("dbNow: %v", err)
+	}
+	stolen, err := tryClaim(ctx, pool, job, ScopeGlobal, planTime, stealNow, "runner-B")
+	if err != nil {
+		t.Fatalf("steal claim: %v", err)
+	}
+	if !stolen.Stole {
+		t.Fatalf("expected stale steal, got %+v", stolen)
+	}
+	if stolen.ID != original.ID {
+		t.Fatalf("steal must reuse the same row id, got original=%s stolen=%s",
+			original.ID, stolen.ID)
+	}
+	if stolen.LeaseToken == original.LeaseToken {
+		t.Fatalf("steal must rotate lease_token; got %s == %s",
+			stolen.LeaseToken, original.LeaseToken)
+	}
+	if stolen.Attempt != 2 {
+		t.Fatalf("steal must increment attempt; got %d", stolen.Attempt)
+	}
+
+	// Step 4: the original runner returns and tries to write SUCCESS
+	// with its OLD lease_token. The guard must reject the update so
+	// the row stays in runner-B's RUNNING state.
+	doneTime, _ := dbNow(ctx, pool)
+	err = finishSuccess(ctx, pool, original.ID, original.LeaseToken,
+		doneTime, 1234, HandlerResult{RowsAffected: 7})
+	if !errors.Is(err, ErrLeaseLost) {
+		t.Fatalf("expected ErrLeaseLost from old lease finish, got %v", err)
+	}
+
+	// Step 5: verify the row is still RUNNING under runner-B.
+	var status, runner string
+	var attempt int
+	if err := pool.QueryRow(ctx, `
+		SELECT status, runner_id, attempt
+		  FROM sys_cron_executions
+		 WHERE id = $1
+	`, original.ID).Scan(&status, &runner, &attempt); err != nil {
+		t.Fatalf("scan post-steal row: %v", err)
+	}
+	if status != "RUNNING" {
+		t.Fatalf("expected status RUNNING after old-runner overwrite was rejected, got %q", status)
+	}
+	if runner != "runner-B" {
+		t.Fatalf("expected runner_id runner-B, got %q", runner)
+	}
+	if attempt != 2 {
+		t.Fatalf("expected attempt=2 (B's fresh attempt), got %d", attempt)
+	}
+
+	// Step 6: runner B finishes successfully with its own token — that
+	// must be accepted.
+	if err := finishSuccess(ctx, pool, stolen.ID, stolen.LeaseToken,
+		doneTime, 4321, HandlerResult{RowsAffected: 11}); err != nil {
+		t.Fatalf("legitimate finishSuccess from runner-B: %v", err)
+	}
+	var finalStatus string
+	var finalRows int64
+	if err := pool.QueryRow(ctx, `
+		SELECT status, rows_affected
+		  FROM sys_cron_executions
+		 WHERE id = $1
+	`, original.ID).Scan(&finalStatus, &finalRows); err != nil {
+		t.Fatalf("scan final row: %v", err)
+	}
+	if finalStatus != "SUCCESS" {
+		t.Fatalf("expected final status SUCCESS, got %q", finalStatus)
+	}
+	if finalRows != 11 {
+		t.Fatalf("expected rows_affected=11 (B's value), got %d", finalRows)
+	}
+}
--- a/server/internal/taskusagebackfill/backfill.go
+++ b/server/internal/taskusagebackfill/backfill.go
@@ -0,0 +1,371 @@
+// Package taskusagebackfill seeds task_usage_hourly from historical
+// task_usage rows.
+//
+// It exists in two callers:
+//
+//   - server/cmd/backfill_task_usage_hourly: explicit operator command,
+//     used by the SELF-HOST UPGRADE ORDER documented in that file.
+//   - server/cmd/migrate: invoked as a hook BEFORE migration 103 runs,
+//     so that operators upgrading directly from v0.3.4 (or any version
+//     prior to the hourly pipeline) do not trip migration 103's
+//     fail-closed watermark guard while the server is still down. The
+//     hook can run a full idempotent backfill in the same `migrate up`
+//     invocation and then continue applying 103/104.
+//
+// The implementation uses the same SQL window primitive
+// (`rollup_task_usage_hourly_window`) that the rollup worker uses, so
+// re-running is safe — partial progress is recovered on the next call.
+//
+// All callers MUST hold advisory lock 4246 (AdvisoryLockKey) for the
+// duration of the backfill walk. That lock is what makes this safe to
+// run alongside the SQL `rollup_task_usage_hourly()` cron entry, the
+// in-process scheduler, and any other concurrent backfill — winners
+// take the lock, losers no-op until it is released.
+package taskusagebackfill
+
+import (
+	"context"
+	"errors"
+	"fmt"
+	"log/slog"
+	"time"
+
+	"github.com/jackc/pgx/v5"
+	"github.com/jackc/pgx/v5/pgtype"
+	"github.com/jackc/pgx/v5/pgxpool"
+)
+
+// AdvisoryLockKey is the int64 identifier shared by every path that
+// drives task_usage_hourly's rollup writes:
+//
+//   - rollup_task_usage_hourly() in migration 102.
+//   - cmd/backfill_task_usage_hourly.
+//   - cmd/migrate's pre-103 hook (MUL-2957).
+//   - the in-process scheduler's rollup_task_usage_hourly handler
+//     (defense in depth — the scheduler's lease already prevents
+//     double-runs across instances; the advisory lock additionally
+//     prevents racing legacy pg_cron / manual entrypoints).
+//
+// It is the same id used by every prior version of the rollup pipeline,
+// so a mixed-version cluster (rolling deploy) cannot double-write.
+const AdvisoryLockKey int64 = 4246
+
+// MaxLagThreshold mirrors the v_lag interval inside migration 103's
+// fail-closed guard. The pre-103 hook only triggers a backfill when the
+// watermark trails the latest task_usage event by more than this; below
+// the threshold the migration would have passed anyway, so we save the
+// scan.
+const MaxLagThreshold = time.Hour
+
+// The watermark upper bound of `now() - 5 minutes` is encoded directly
+// in the SQL UPDATE in stampWatermark / stampWatermarkOnConn so the
+// math runs in the same session that does the write — no app-side
+// time.Now() participates. (MUL-2957 review: blocker #3.)
+
+// Result describes what a single backfill run did. Exposed so callers
+// (the migrate command and tests) can log or assert on it.
+type Result struct {
+	// Skipped is a short reason string when the run did no slice work.
+	// Empty string means the run actually walked at least one slice.
+	Skipped string
+
+	// SlicesProcessed counts monthly slices that were rolled up.
+	SlicesProcessed int
+
+	// RowsTouched is the sum of rollup_task_usage_hourly_window's
+	// returned counts.
+	RowsTouched int64
+
+	// From / To bracket the walk's UTC range.
+	From time.Time
+	To   time.Time
+
+	// WatermarkStamped reports whether the watermark UPDATE was issued.
+	WatermarkStamped bool
+}
+
+// HookOptions controls Hook behaviour. The defaults are correct for
+// production; tests override fields as needed.
+type HookOptions struct {
+	// Logger receives slog records about the backfill walk. nil =
+	// slog.Default().
+	Logger *slog.Logger
+
+	// LagThreshold overrides MaxLagThreshold. Zero = MaxLagThreshold.
+	// Tests pass a small value to force a backfill on a deterministic
+	// fixture.
+	LagThreshold time.Duration
+
+	// SleepBetweenSlices throttles the walk on a busy DB. Zero = no
+	// pause. Mirrors the operator-facing flag on the standalone
+	// backfill command.
+	SleepBetweenSlices time.Duration
+}
+
+// Hook is the migration-time entrypoint. It checks whether the
+// task_usage_hourly_rollup_state.watermark_at trails the latest
+// task_usage event by more than the lag threshold and, if so, runs an
+// idempotent monthly-slice backfill of task_usage_hourly. After the
+// walk completes (or if no backfill was needed) the watermark is
+// stamped to `now() - 5 minutes`, mirroring the standalone backfill
+// command's stampWatermark step.
+//
+// The hook does NOT fail when:
+//
+//   - task_usage is empty (a fresh database has no history to
+//     backfill — the watermark is stamped so migration 103's guard
+//     accepts the empty state).
+//
+//   - the rollup state tables are missing (the hook ran before
+//     migration 101 in some unusual ordering — treated as nothing to
+//     do, the migration loop will install them next).
+//
+// It DOES return an error when the rollup walk itself fails: that case
+// is identical to the standalone backfill failing, and the migration
+// run must abort so the operator can investigate before migration 103
+// drops the legacy daily rollups.
+//
+// The hook acquires advisory lock 4246 on its own session connection
+// so it does not collide with the migrate loop's session-level
+// migrationAdvisoryLockKey on a different conn.
+func Hook(ctx context.Context, pool *pgxpool.Pool, opts HookOptions) (Result, error) {
+	log := opts.Logger
+	if log == nil {
+		log = slog.Default()
+	}
+	threshold := opts.LagThreshold
+	if threshold <= 0 {
+		threshold = MaxLagThreshold
+	}
+
+	// Step 1: cheap precondition check — if the rollup state tables
+	// have not been created yet, this hook simply has nothing to do
+	// (migrations 101/102 install them, and the migration loop will
+	// reach 103 only after those have already applied).
+	stateExists, err := rollupStateExists(ctx, pool)
+	if err != nil {
+		return Result{}, fmt.Errorf("check rollup state existence: %w", err)
+	}
+	if !stateExists {
+		log.Info("task_usage hourly rollup hook: rollup state tables not present, skipping",
+			"reason", "migrations 101/102 not yet applied")
+		return Result{Skipped: "rollup_state_missing"}, nil
+	}
+
+	// Step 2: read task_usage range and current watermark on the pool.
+	// These do not need to be locked — they are only used to decide
+	// whether the lock-protected walk should run at all, and the walk
+	// itself is idempotent if the watermark advances under us.
+	usageRange, err := loadUsageRange(ctx, pool)
+	if err != nil {
+		return Result{}, fmt.Errorf("load task_usage range: %w", err)
+	}
+	watermark, err := loadWatermark(ctx, pool)
+	if err != nil {
+		return Result{}, fmt.Errorf("load rollup watermark: %w", err)
+	}
+
+	if !usageRange.HasRows {
+		// Empty database. Stamp the watermark so migration 103's
+		// guard accepts the no-history path on a fresh upgrade and
+		// the rollup worker starts forward from the stamp. The DB's
+		// own clock is used so a clock-skewed app process cannot
+		// stamp the watermark into the DB's future.
+		if err := stampWatermark(ctx, pool); err != nil {
+			return Result{}, err
+		}
+		log.Info("task_usage hourly rollup hook: task_usage empty, watermark stamped from db now()")
+		return Result{Skipped: "task_usage_empty", WatermarkStamped: true}, nil
+	}
+
+	if !watermark.Valid {
+		// Defensive — schema guarantees the row exists with a default
+		// of 1970-01-01, so an invalid value here means somebody has
+		// fiddled with the row directly.
+		return Result{}, errors.New("task_usage_hourly_rollup_state row is missing or watermark is NULL; manual intervention required before migration 103")
+	}
+
+	// Lag is computed against the DB's max_event (already DB-time);
+	// comparing to the DB-time watermark avoids any app/DB skew.
+	maxEvent := usageRange.MaxEvent
+	lag := maxEvent.Sub(watermark.Time)
+	if lag <= threshold {
+		log.Info("task_usage hourly rollup hook: watermark already current, skipping backfill",
+			"watermark_at", watermark.Time.UTC().Format(time.RFC3339),
+			"max_event", maxEvent.UTC().Format(time.RFC3339),
+			"lag", lag.String(),
+			"threshold", threshold.String())
+		// Re-stamp from DB now() to bring the value flush with the
+		// cron upper bound; the lag-based guard in migration 103 will
+		// pass either way, but stamping keeps the post-hook state
+		// consistent with the standalone backfill command.
+		if err := stampWatermark(ctx, pool); err != nil {
+			return Result{}, err
+		}
+		return Result{Skipped: "watermark_within_threshold", WatermarkStamped: true}, nil
+	}
+
+	log.Info("task_usage hourly rollup hook: backfilling under advisory lock",
+		"watermark_at", watermark.Time.UTC().Format(time.RFC3339),
+		"max_event", maxEvent.UTC().Format(time.RFC3339),
+		"lag", lag.String(),
+		"threshold", threshold.String())
+
+	// Step 3: serialise against the SQL cron entry / standalone backfill
+	// / scheduler handler via advisory lock 4246. We use a dedicated
+	// session-pinned conn because pg advisory locks are per-session.
+	lockConn, err := pool.Acquire(ctx)
+	if err != nil {
+		return Result{}, fmt.Errorf("acquire advisory-lock connection: %w", err)
+	}
+	defer lockConn.Release()
+
+	if _, err := lockConn.Exec(ctx, `SELECT pg_advisory_lock($1)`, AdvisoryLockKey); err != nil {
+		return Result{}, fmt.Errorf("acquire advisory lock %d: %w", AdvisoryLockKey, err)
+	}
+	// Use a fresh context for the unlock so a cancelled ctx does not
+	// skip the release. Releasing the connection afterwards would end
+	// the session anyway, but an explicit unlock frees it immediately.
+	defer func() {
+		_, _ = lockConn.Exec(context.Background(), `SELECT pg_advisory_unlock($1)`, AdvisoryLockKey)
+	}()
+
+	from := monthFloor(usageRange.MinEvent)
+	end := monthFloor(maxEvent).AddDate(0, 1, 0)
+	res := Result{From: from, To: end}
+
+	cursor := from
+	for cursor.Before(end) {
+		select {
+		case <-ctx.Done():
+			return res, ctx.Err()
+		default:
+		}
+
+		next := cursor.AddDate(0, 1, 0)
+		var rows int64
+		err := lockConn.QueryRow(
+			ctx,
+			`SELECT rollup_task_usage_hourly_window($1::timestamptz, $2::timestamptz)`,
+			cursor, next,
+		).Scan(&rows)
+		if err != nil {
+			return res, fmt.Errorf("rollup slice %s..%s: %w",
+				cursor.Format(time.RFC3339), next.Format(time.RFC3339), err)
+		}
+		res.SlicesProcessed++
+		res.RowsTouched += rows
+		log.Info("task_usage hourly rollup hook: slice complete",
+			"from", cursor.Format(time.RFC3339),
+			"to", next.Format(time.RFC3339),
+			"rows_touched", rows)
+		cursor = next
+		if opts.SleepBetweenSlices > 0 && cursor.Before(end) {
+			select {
+			case <-time.After(opts.SleepBetweenSlices):
+			case <-ctx.Done():
+				return res, ctx.Err()
+			}
+		}
+	}
+
+	if err := stampWatermarkOnConn(ctx, lockConn.Conn()); err != nil {
+		return res, err
+	}
+	res.WatermarkStamped = true
+
+	log.Info("task_usage hourly rollup hook: complete",
+		"slices", res.SlicesProcessed,
+		"total_rows_touched", res.RowsTouched,
+		"watermark_source", "db_now")
+	return res, nil
+}
+
+type usageRange struct {
+	HasRows  bool
+	MinEvent time.Time
+	MaxEvent time.Time
+}
+
+func loadUsageRange(ctx context.Context, pool *pgxpool.Pool) (usageRange, error) {
+	var minTS, maxTS pgtype.Timestamptz
+	// COALESCE(updated_at, created_at) tracks the same expression
+	// migration 103's guard uses, so the lag comparison stays
+	// consistent with the value the guard will check next.
+	err := pool.QueryRow(ctx, `
+		SELECT MIN(created_at), MAX(COALESCE(updated_at, created_at))
+		  FROM task_usage
+	`).Scan(&minTS, &maxTS)
+	if err != nil {
+		return usageRange{}, err
+	}
+	if !minTS.Valid || !maxTS.Valid {
+		return usageRange{HasRows: false}, nil
+	}
+	return usageRange{
+		HasRows:  true,
+		MinEvent: minTS.Time.UTC(),
+		MaxEvent: maxTS.Time.UTC(),
+	}, nil
+}
+
+func loadWatermark(ctx context.Context, pool *pgxpool.Pool) (pgtype.Timestamptz, error) {
+	var watermark pgtype.Timestamptz
+	err := pool.QueryRow(ctx, `
+		SELECT watermark_at FROM task_usage_hourly_rollup_state WHERE id = 1
+	`).Scan(&watermark)
+	if err != nil {
+		if errors.Is(err, pgx.ErrNoRows) {
+			return pgtype.Timestamptz{}, nil
+		}
+		return pgtype.Timestamptz{}, err
+	}
+	return watermark, nil
+}
+
+func rollupStateExists(ctx context.Context, pool *pgxpool.Pool) (bool, error) {
+	var exists bool
+	err := pool.QueryRow(ctx, `
+		SELECT EXISTS (
+			SELECT 1 FROM information_schema.tables
+			 WHERE table_schema = 'public'
+			   AND table_name = 'task_usage_hourly_rollup_state'
+		)
+	`).Scan(&exists)
+	return exists, err
+}
+
+// stampWatermark moves the hourly rollup state watermark to
+// `now() - 5 min` using PostgreSQL's clock, NOT the app process clock.
+// This matches the cron entry's upper bound and — critically —
+// guarantees the watermark cannot be stamped into the DB's future
+// because of container clock drift. (MUL-2957 review: see张大彪's
+// blocker #3.)
+func stampWatermark(ctx context.Context, pool *pgxpool.Pool) error {
+	_, err := pool.Exec(ctx, `
+		UPDATE task_usage_hourly_rollup_state
+		   SET watermark_at = now() - INTERVAL '5 minutes'
+		 WHERE id = 1
+	`)
+	if err != nil {
+		return fmt.Errorf("stamp watermark: %w", err)
+	}
+	return nil
+}
+
+func stampWatermarkOnConn(ctx context.Context, conn *pgx.Conn) error {
+	_, err := conn.Exec(ctx, `
+		UPDATE task_usage_hourly_rollup_state
+		   SET watermark_at = now() - INTERVAL '5 minutes'
+		 WHERE id = 1
+	`)
+	if err != nil {
+		return fmt.Errorf("stamp watermark: %w", err)
+	}
+	return nil
+}
+
+func monthFloor(t time.Time) time.Time {
+	t = t.UTC()
+	return time.Date(t.Year(), t.Month(), 1, 0, 0, 0, 0, time.UTC)
+}
--- a/server/internal/taskusagebackfill/hook_test.go
+++ b/server/internal/taskusagebackfill/hook_test.go
@@ -0,0 +1,399 @@
+package taskusagebackfill_test
+
+import (
+	"context"
+	"errors"
+	"fmt"
+	"os"
+	"path/filepath"
+	"runtime"
+	"sort"
+	"strings"
+	"testing"
+	"time"
+
+	"github.com/jackc/pgx/v5"
+	"github.com/jackc/pgx/v5/pgxpool"
+
+	"github.com/multica-ai/multica/server/internal/migrations"
+	"github.com/multica-ai/multica/server/internal/taskusagebackfill"
+)
+
+// TestHook_DirectV034Upgrade simulates the path described in
+// docs/db-backed-execution-scheduler-rfc.md §12.3 ("从 v0.3.4 直接升级
+// 到带 scheduler 的版本") and RFC §14 row "direct v0.3.4 upgrade":
+//
+//   - apply migrations through 102 (legacy daily rollups + new hourly
+//     schema/pipeline are both live);
+//   - seed a `task_usage` row dated > 1 hour ago, watermark left at
+//     1970-01-01 (the post-101 default);
+//   - migration 103's fail-closed guard would normally abort `migrate
+//     up` here because watermark < max_event - 1h;
+//   - run taskusagebackfill.Hook (the migrator hook from MUL-2957);
+//   - the hook performs an idempotent monthly-slice backfill, stamps
+//     the watermark, and migration 103 then passes.
+//
+// The test creates and drops a temporary database so it does not
+// disturb the shared development DB. If the postgres server is not
+// reachable the test skips (mirrors the rest of the integration suite).
+func TestHook_DirectV034Upgrade(t *testing.T) {
+	adminURL := os.Getenv("DATABASE_URL")
+	if adminURL == "" {
+		adminURL = "postgres://multica:multica@localhost:5432/multica?sslmode=disable"
+	}
+
+	ctx := context.Background()
+	if !databaseReachable(ctx, adminURL) {
+		t.Skip("integration test requires Postgres at DATABASE_URL")
+	}
+
+	tmpDB := fmt.Sprintf("multica_v034_upgrade_%d", time.Now().UnixNano())
+	if err := createDatabase(ctx, adminURL, tmpDB); err != nil {
+		t.Fatalf("create temp database %s: %v", tmpDB, err)
+	}
+	t.Cleanup(func() {
+		if err := dropDatabase(context.Background(), adminURL, tmpDB); err != nil {
+			t.Logf("drop temp database %s: %v", tmpDB, err)
+		}
+	})
+
+	tmpURL := replaceDatabase(adminURL, tmpDB)
+	pool, err := pgxpool.New(ctx, tmpURL)
+	if err != nil {
+		t.Fatalf("connect to temp database: %v", err)
+	}
+	t.Cleanup(pool.Close)
+
+	// Apply migrations through 102 (the simulated v0.3.4-equivalent
+	// state where the new hourly schema and pipeline are present, but
+	// 103 has not yet dropped the legacy daily rollups).
+	if err := applyMigrationsUpTo(ctx, pool, "102_task_usage_hourly_pipeline"); err != nil {
+		t.Fatalf("apply migrations to 102: %v", err)
+	}
+
+	// Seed: one task_usage row dated 2 hours ago. Triggers fire on
+	// INSERT/UPDATE post-102 only via the watermark window in
+	// rollup_task_usage_hourly_window, so we leave the watermark at
+	// the post-101 default of 1970-01-01.
+	wsID, runtimeID, agentID, taskID := seedTaskUsageFixture(t, ctx, pool)
+
+	twoHoursAgo := time.Now().UTC().Add(-2 * time.Hour)
+	if _, err := pool.Exec(ctx, `
+		INSERT INTO task_usage (
+			task_id, provider, model,
+			input_tokens, output_tokens, cache_read_tokens, cache_write_tokens,
+			created_at
+		) VALUES ($1, 'openai', 'gpt-test', 10, 20, 0, 0, $2)
+	`, taskID, twoHoursAgo); err != nil {
+		t.Fatalf("seed task_usage: %v", err)
+	}
+	_ = wsID
+	_ = runtimeID
+	_ = agentID
+
+	// Confirm migration 103's guard would abort if applied now.
+	pendingErr := tryApplyMigration(ctx, pool, "103_drop_legacy_daily_rollups")
+	if pendingErr == nil {
+		t.Fatalf("migration 103 guard did not trip with stale watermark; preconditions invalid")
+	}
+	if !strings.Contains(pendingErr.Error(), "refusing to drop legacy daily rollups") {
+		t.Fatalf("expected fail-closed guard error, got %v", pendingErr)
+	}
+	// Roll back the failed transaction state (Postgres connections are
+	// fine; we just did not commit anything destructive).
+
+	// Run the MUL-2957 migrator hook. The hook should:
+	//   * Read MIN/MAX from task_usage.
+	//   * Acquire advisory lock 4246 on its own conn.
+	//   * Walk monthly slices via rollup_task_usage_hourly_window.
+	//   * Stamp the watermark to now() - 5 minutes.
+	res, err := taskusagebackfill.Hook(ctx, pool, taskusagebackfill.HookOptions{})
+	if err != nil {
+		t.Fatalf("Hook returned error: %v", err)
+	}
+	if res.Skipped != "" {
+		t.Fatalf("Hook should have run a backfill; got skipped=%q", res.Skipped)
+	}
+	if !res.WatermarkStamped {
+		t.Fatalf("Hook should have stamped the watermark")
+	}
+	if res.SlicesProcessed < 1 {
+		t.Fatalf("expected at least 1 slice processed, got %d", res.SlicesProcessed)
+	}
+
+	// Hourly bucket exists now.
+	var hourlyRows int
+	if err := pool.QueryRow(ctx, `
+		SELECT COUNT(*) FROM task_usage_hourly
+	`).Scan(&hourlyRows); err != nil {
+		t.Fatalf("count task_usage_hourly: %v", err)
+	}
+	if hourlyRows == 0 {
+		t.Fatalf("backfill did not produce any hourly rows; check seed fixture")
+	}
+
+	// Watermark stamped close to "now - 5 min" (loose bound to absorb
+	// CI clock skew).
+	var watermark time.Time
+	if err := pool.QueryRow(ctx, `
+		SELECT watermark_at FROM task_usage_hourly_rollup_state WHERE id = 1
+	`).Scan(&watermark); err != nil {
+		t.Fatalf("read watermark: %v", err)
+	}
+	expected := time.Now().UTC().Add(-5 * time.Minute)
+	if delta := watermark.UTC().Sub(expected); delta > time.Minute || delta < -2*time.Minute {
+		t.Fatalf("watermark %s far from expected %s (delta=%s)",
+			watermark.Format(time.RFC3339), expected.Format(time.RFC3339), delta)
+	}
+
+	// And now migration 103 must apply cleanly.
+	if err := tryApplyMigration(ctx, pool, "103_drop_legacy_daily_rollups"); err != nil {
+		t.Fatalf("migration 103 still fails after hook: %v", err)
+	}
+}
+
+// TestHook_FreshDatabaseStampsWatermarkOnly covers the empty-history
+// branch: a brand new self-host install runs the hook before migration
+// 103, but task_usage is empty. The hook should stamp the watermark
+// (so the guard's "fresh DB" branch passes) and not touch any rows.
+func TestHook_FreshDatabaseStampsWatermarkOnly(t *testing.T) {
+	adminURL := os.Getenv("DATABASE_URL")
+	if adminURL == "" {
+		adminURL = "postgres://multica:multica@localhost:5432/multica?sslmode=disable"
+	}
+	ctx := context.Background()
+	if !databaseReachable(ctx, adminURL) {
+		t.Skip("integration test requires Postgres at DATABASE_URL")
+	}
+
+	tmpDB := fmt.Sprintf("multica_v034_fresh_%d", time.Now().UnixNano())
+	if err := createDatabase(ctx, adminURL, tmpDB); err != nil {
+		t.Fatalf("create temp database: %v", err)
+	}
+	t.Cleanup(func() {
+		_ = dropDatabase(context.Background(), adminURL, tmpDB)
+	})
+
+	pool, err := pgxpool.New(ctx, replaceDatabase(adminURL, tmpDB))
+	if err != nil {
+		t.Fatalf("pool: %v", err)
+	}
+	t.Cleanup(pool.Close)
+
+	if err := applyMigrationsUpTo(ctx, pool, "102_task_usage_hourly_pipeline"); err != nil {
+		t.Fatalf("apply migrations to 102: %v", err)
+	}
+
+	res, err := taskusagebackfill.Hook(ctx, pool, taskusagebackfill.HookOptions{})
+	if err != nil {
+		t.Fatalf("Hook on empty DB: %v", err)
+	}
+	if res.Skipped != "task_usage_empty" {
+		t.Fatalf("expected skipped=task_usage_empty, got %q", res.Skipped)
+	}
+	if !res.WatermarkStamped {
+		t.Fatalf("watermark must still be stamped on empty DB so 103 can pass")
+	}
+
+	// Migration 103 applies cleanly on the fresh DB — its early-out
+	// branch (no task_usage rows) plus the stamped watermark let it
+	// through.
+	if err := tryApplyMigration(ctx, pool, "103_drop_legacy_daily_rollups"); err != nil {
+		t.Fatalf("migration 103 fresh-DB path failed: %v", err)
+	}
+}
+
+// ---------------------------------------------------------------------
+// Test helpers — minimal harness so we can apply migrations to a temp
+// database without coupling to cmd/migrate's exit-on-error model.
+// ---------------------------------------------------------------------
+
+// resolveMigrationsDir locates server/migrations relative to this test
+// file's source path. The shared migrations.ResolveDir() walks parent
+// directories looking for any folder literally named "migrations",
+// which collides with the Go package server/internal/migrations and
+// returns the wrong directory. Anchoring on runtime.Caller's file path
+// avoids that ambiguity entirely.
+func resolveMigrationsDir() (string, error) {
+	_, here, _, ok := runtime.Caller(0)
+	if !ok {
+		return "", fmt.Errorf("runtime.Caller(0) failed")
+	}
+	// here = .../server/internal/taskusagebackfill/hook_test.go
+	dir := filepath.Join(filepath.Dir(here), "..", "..", "migrations")
+	dir = filepath.Clean(dir)
+	if info, err := os.Stat(dir); err != nil || !info.IsDir() {
+		return "", fmt.Errorf("migrations dir not at %s", dir)
+	}
+	return dir, nil
+}
+
+func databaseReachable(ctx context.Context, url string) bool {
+	pool, err := pgxpool.New(ctx, url)
+	if err != nil {
+		return false
+	}
+	defer pool.Close()
+	return pool.Ping(ctx) == nil
+}
+
+func createDatabase(ctx context.Context, adminURL, name string) error {
+	conn, err := pgx.Connect(ctx, adminURL)
+	if err != nil {
+		return err
+	}
+	defer conn.Close(ctx)
+	if _, err := conn.Exec(ctx, fmt.Sprintf(`CREATE DATABASE %q`, name)); err != nil {
+		return err
+	}
+	return nil
+}
+
+func dropDatabase(ctx context.Context, adminURL, name string) error {
+	conn, err := pgx.Connect(ctx, adminURL)
+	if err != nil {
+		return err
+	}
+	defer conn.Close(ctx)
+	if _, err := conn.Exec(ctx, fmt.Sprintf(`DROP DATABASE IF EXISTS %q WITH (FORCE)`, name)); err != nil {
+		return err
+	}
+	return nil
+}
+
+// replaceDatabase swaps the database segment of a postgres URL.
+// "postgres://u:p@h:5432/old?x=1" with name="new" -> ".../new?x=1".
+func replaceDatabase(url, name string) string {
+	idx := strings.LastIndex(url, "/")
+	if idx < 0 {
+		return url
+	}
+	rest := url[idx+1:]
+	q := strings.Index(rest, "?")
+	if q < 0 {
+		return url[:idx+1] + name
+	}
+	return url[:idx+1] + name + rest[q:]
+}
+
+func applyMigrationsUpTo(ctx context.Context, pool *pgxpool.Pool, lastVersion string) error {
+	dir, err := resolveMigrationsDir()
+	if err != nil {
+		return err
+	}
+	files, err := filepath.Glob(filepath.Join(dir, "*.up.sql"))
+	if err != nil {
+		return err
+	}
+	sort.Strings(files)
+
+	if _, err := pool.Exec(ctx, `
+		CREATE TABLE IF NOT EXISTS schema_migrations (
+			version TEXT PRIMARY KEY,
+			applied_at TIMESTAMPTZ NOT NULL DEFAULT now()
+		)
+	`); err != nil {
+		return err
+	}
+
+	for _, f := range files {
+		v := migrations.ExtractVersion(f)
+		sql, err := os.ReadFile(f)
+		if err != nil {
+			return err
+		}
+		if _, err := pool.Exec(ctx, string(sql)); err != nil {
+			return fmt.Errorf("apply %s: %w", v, err)
+		}
+		if _, err := pool.Exec(ctx,
+			`INSERT INTO schema_migrations (version) VALUES ($1) ON CONFLICT DO NOTHING`,
+			v); err != nil {
+			return err
+		}
+		if v == lastVersion {
+			return nil
+		}
+	}
+	return fmt.Errorf("migration %q not found", lastVersion)
+}
+
+func tryApplyMigration(ctx context.Context, pool *pgxpool.Pool, version string) error {
+	dir, err := resolveMigrationsDir()
+	if err != nil {
+		return err
+	}
+	path := filepath.Join(dir, version+".up.sql")
+	sql, err := os.ReadFile(path)
+	if err != nil {
+		return err
+	}
+	if _, err := pool.Exec(ctx, string(sql)); err != nil {
+		return err
+	}
+	_, err = pool.Exec(ctx,
+		`INSERT INTO schema_migrations (version) VALUES ($1) ON CONFLICT DO NOTHING`,
+		version)
+	return err
+}
+
+// seedTaskUsageFixture inserts the minimal joined rows
+// (workspace, runtime, agent, agent_task_queue) needed for a task_usage
+// row to participate in the hourly rollup. Returns the IDs in
+// (workspace, runtime, agent, task) order.
+func seedTaskUsageFixture(t *testing.T, ctx context.Context, pool *pgxpool.Pool) (string, string, string, string) {
+	t.Helper()
+
+	var wsID, runtimeID, agentID, taskID string
+	if err := pool.QueryRow(ctx, `
+		INSERT INTO workspace (name, slug)
+		VALUES ('upgrade-test', 'upgrade-test')
+		RETURNING id
+	`).Scan(&wsID); err != nil {
+		t.Fatalf("seed workspace: %v", err)
+	}
+	if err := pool.QueryRow(ctx, `
+		INSERT INTO agent_runtime (
+			workspace_id, daemon_id, name, runtime_mode, provider, status,
+			device_info, metadata, last_seen_at
+		)
+		VALUES ($1, NULL, 'upgrade-runtime', 'cloud', 'p', 'online',
+		        '{}'::jsonb, '{}'::jsonb, now())
+		RETURNING id
+	`, wsID).Scan(&runtimeID); err != nil {
+		t.Fatalf("seed runtime: %v", err)
+	}
+	if err := pool.QueryRow(ctx, `
+		INSERT INTO agent (
+			workspace_id, name, description, runtime_mode, runtime_config,
+			runtime_id, visibility, max_concurrent_tasks
+		)
+		VALUES ($1, 'upgrade-agent', '', 'cloud', '{}'::jsonb, $2, 'workspace', 1)
+		RETURNING id
+	`, wsID, runtimeID).Scan(&agentID); err != nil {
+		t.Fatalf("seed agent: %v", err)
+	}
+	if err := pool.QueryRow(ctx, `
+		INSERT INTO agent_task_queue (
+			agent_id, runtime_id, status, payload
+		)
+		VALUES ($1, $2, 'queued', '{}'::jsonb)
+		RETURNING id
+	`, agentID, runtimeID).Scan(&taskID); err != nil {
+		// agent_task_queue schema may differ; fall back to inferring
+		// the smallest column set.
+		var altErr error
+		altErr = pool.QueryRow(ctx, `
+			INSERT INTO agent_task_queue (agent_id, runtime_id)
+			VALUES ($1, $2)
+			RETURNING id
+		`, agentID, runtimeID).Scan(&taskID)
+		if altErr != nil {
+			t.Fatalf("seed agent_task_queue: %v / %v", err, altErr)
+		}
+	}
+	return wsID, runtimeID, agentID, taskID
+}
+
+// ensure errors-helpers import isn't dropped if the compiler decides
+// to optimise the empty references away.
+var _ = errors.New
--- a/server/migrations/113_sys_cron_executions.down.sql
+++ b/server/migrations/113_sys_cron_executions.down.sql
@@ -0,0 +1 @@
+DROP TABLE IF EXISTS sys_cron_executions;
--- a/server/migrations/113_sys_cron_executions.up.sql
+++ b/server/migrations/113_sys_cron_executions.up.sql
@@ -0,0 +1,106 @@
+-- Schema for the DB-backed execution-record scheduler (MUL-2957). One row
+-- represents a single planned execution of a registered system job for a
+-- single scope at a single canonical UTC `plan_time` bucket — the row is
+-- both the distributed lease and the audit log entry.
+--
+-- Design source: docs/db-backed-execution-scheduler-rfc.md.
+--
+-- Key invariants:
+--
+--   * The unique key (job_name, scope_kind, scope_id, plan_time) is the
+--     only thing standing between two app instances and a double-run.
+--     `scope_kind` / `scope_id` default to 'global' so global jobs land
+--     on the same uniqueness slot without using NULL semantics.
+--
+--   * Status uses `TEXT + CHECK` instead of a real enum so adding a state
+--     later is a CHECK swap, not an enum migration. There is no `STALE`
+--     state — stale is `status='RUNNING' AND stale_after < now()`, which
+--     keeps lease theft a single UPDATE.
+--
+--   * Every mutation that targets a live lease (heartbeat, terminal
+--     write) MUST match `id = $id AND lease_token = $token AND
+--     status = 'RUNNING'`. A runner that lost its lease (stolen by a
+--     stale-steal, or already finalised by a sibling) sees
+--     RowsAffected = 0 and stops, so it cannot rewrite a newer attempt's
+--     state to SUCCESS / FAILED.
+--
+--   * `attempt` and `max_attempts` are bounded by CHECK so a buggy
+--     handler cannot "escape" the retry envelope by writing
+--     attempt > max_attempts.
+--
+-- This migration creates the table and the few indexes the steady-state
+-- scheduler queries need (claim-by-key, find-stale-RUNNING, recent
+-- failures, retention prune). It does NOT register any jobs — those are
+-- declared in Go and inserted on demand by the scheduler.
+
+CREATE TABLE sys_cron_executions (
+    id             UUID        PRIMARY KEY DEFAULT gen_random_uuid(),
+    job_name       TEXT        NOT NULL,
+    scope_kind     TEXT        NOT NULL DEFAULT 'global',
+    scope_id       TEXT        NOT NULL DEFAULT 'global',
+    plan_time      TIMESTAMPTZ NOT NULL,
+
+    -- Lifecycle. status ∈ {RUNNING, SUCCESS, FAILED}; stale is computed
+    -- from RUNNING + stale_after, never materialised as its own state.
+    status         TEXT        NOT NULL,
+    attempt        INTEGER     NOT NULL DEFAULT 1,
+    max_attempts   INTEGER     NOT NULL DEFAULT 3,
+    next_retry_at  TIMESTAMPTZ,
+
+    -- Lease ownership. lease_token is rotated on every claim/steal, so a
+    -- runner whose lease was stolen cannot write terminal status — its
+    -- UPDATE matches zero rows.
+    runner_id      TEXT,
+    lease_token    UUID        NOT NULL DEFAULT gen_random_uuid(),
+    heartbeat_at   TIMESTAMPTZ,
+    stale_after    TIMESTAMPTZ,
+
+    -- Audit fields. Small structured `result` only — bulky output goes to
+    -- structured logs, not into this table.
+    started_at     TIMESTAMPTZ,
+    finished_at    TIMESTAMPTZ,
+    duration_ms    INTEGER,
+    rows_affected  BIGINT,
+    result         JSONB       NOT NULL DEFAULT '{}'::jsonb,
+
+    error_code     TEXT,
+    error_msg      TEXT,
+
+    created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
+    updated_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
+
+    CONSTRAINT chk_sys_cron_status
+        CHECK (status IN ('RUNNING', 'SUCCESS', 'FAILED')),
+    CONSTRAINT chk_sys_cron_attempt
+        CHECK (attempt >= 1 AND max_attempts >= attempt),
+    CONSTRAINT chk_sys_cron_duration
+        CHECK (duration_ms IS NULL OR duration_ms >= 0),
+    CONSTRAINT uq_sys_cron_execution
+        UNIQUE (job_name, scope_kind, scope_id, plan_time)
+);
+
+-- Latest-plan lookup per (job, scope). The scheduler reads this to
+-- decide whether the latest plan is already done or in flight; the
+-- prune job and `latest-success` lag computation use the same key.
+CREATE INDEX idx_sys_cron_exec_job_plan
+    ON sys_cron_executions (job_name, scope_kind, scope_id, plan_time DESC);
+
+-- Find leases that have gone stale without scanning the whole RUNNING
+-- set. Partial because non-RUNNING rows never need to be considered.
+CREATE INDEX idx_sys_cron_exec_running_stale
+    ON sys_cron_executions (stale_after)
+    WHERE status = 'RUNNING';
+
+-- "Recent FAILED for this job" — used by alerting and by the retry
+-- decision, which needs the most recent FAILED row to compare against
+-- max_attempts and next_retry_at.
+CREATE INDEX idx_sys_cron_exec_failed_recent
+    ON sys_cron_executions (job_name, plan_time DESC)
+    WHERE status = 'FAILED';
+
+-- Retention prune scans by terminal `finished_at`; index is partial
+-- because RUNNING rows are kept by the stale policy, not the retention
+-- policy.
+CREATE INDEX idx_sys_cron_exec_finished
+    ON sys_cron_executions (finished_at)
+    WHERE status IN ('SUCCESS', 'FAILED');