docs(self-host): default rollup to in-process scheduler, demote pg_cron/cron/systemd/CronJob to compat paths (#3808)

Rewrite the 'Usage Dashboard Rollup (Required)' section in SELF_HOSTING.md and apps/docs/content/docs/getting-started/self-hosting.zh.mdx so that: - The DB-backed in-process scheduler (sys_cron_executions, MUL-2957) is documented as the default for fresh self-host installs. The bundled pgvector/pgvector:pg17 image works as-is and no operator step is required. - pg_cron, external cron, systemd timer, and Kubernetes CronJob are demoted to a 'Compatibility paths (existing deployments only)' subsection. They remain supported (advisory lock 4246 prevents double-writes) but are no longer the recommended setup. - A retirement sequence is added for production environments that already have a pg_cron job: confirm in-process SUCCESS rows in sys_cron_executions, then cron.unschedule the redundant entry; leave the pg_cron extension installed unless other workloads stop depending on it. - The two upgrade callouts that pointed to the removed 'Usage Dashboard Rollup -> Option C' anchor are repointed to SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup, which already documents the auto-hook backfill and the recovery flow. Refs MUL-3077. Co-authored-by: multica-agent <github@multica.ai>
2026-06-16 19:29:26 +02:00 · 2026-06-05 14:31:14 +08:00
parent 62925b97f1
commit 3913bf9152
2 changed files with 81 additions and 48 deletions
--- a/SELF_HOSTING.md
+++ b/SELF_HOSTING.md
@@ -326,7 +326,7 @@ To roll back if an upgrade goes sideways:
 helm -n multica rollback multica
 ```

-> **Upgrading from `v0.3.4` to `v0.3.5+` fails with `refusing to drop legacy daily rollups: ...`?** Same migration guard as the Docker path — see [Usage Dashboard Rollup → Option C](#option-c--backfill-history-first-then-schedule). Run the backfill against the same database the chart is using (`kubectl -n multica exec deploy/multica-backend -- ./backfill_task_usage_hourly --sleep-between-slices=2s`), then restart the backend deployment to re-apply migrations.
+> **Upgrading from `v0.3.4` to `v0.3.5+` fails with `refusing to drop legacy daily rollups: ...`?** As of MUL-2957 the `migrate up` command runs an idempotent monthly-slice backfill automatically before applying migration `103`, so a clean upgrade is a single `helm upgrade` + backend rollout. If you are still on a pre-MUL-2957 binary or the auto-hook fails, run the standalone backfill against the same database the chart is using (`kubectl -n multica exec deploy/multica-backend -- ./backfill_task_usage_hourly --sleep-between-slices=2s`), then restart the backend deployment to re-apply migrations. See [Advanced Configuration → Usage Dashboard Rollup](SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup) for the full recovery flow.

 ### Tearing down

@@ -340,56 +340,52 @@ kubectl delete namespace multica

 ---

-## Usage Dashboard Rollup (Required)
+## Usage Dashboard Rollup

-Starting with `v0.3.5`, the Usage / Runtime dashboards read from a derived `task_usage_hourly` table rather than directly from `task_usage`. Raw `task_usage` rows are written by the backend on every task, but the dashboard only sees data after `rollup_task_usage_hourly()` runs and aggregates them into `task_usage_hourly`.
+The Usage / Runtime dashboards read from a derived `task_usage_hourly` table populated by `rollup_task_usage_hourly()`. As of MUL-2957 the backend runs this rollup **in-process** on every replica via a DB-backed scheduler (`sys_cron_executions`); a fresh self-host install needs no operator action and the bundled `pgvector/pgvector:pg17` image works without changes — you do **not** need to swap it for an image that ships `pg_cron`, register an external cron job, set up a systemd timer, or run a Kubernetes `CronJob`.

-**The bundled `pgvector/pgvector:pg17` image does NOT include `pg_cron`.** If nothing schedules the rollup, the dashboard will stay at zero forever even though `task_usage` is populated. You have three supported options — pick one before relying on the dashboard.
-
-> **Upgrading from `v0.3.4` to `v0.3.5+`** with existing `task_usage` history: migration `103` is fail-closed and will abort `migrate up` with `refusing to drop legacy daily rollups: …`. Run `backfill_task_usage_hourly` first (Option C below), then re-run the upgrade. **Fresh installs** are exempted by that guard and migrate cleanly — but the dashboard will still stay at zero until you pick Option A or Option B.
-
-### Option A — External cron / systemd-timer (simplest)
-
-Schedule a 5-minute job that calls `rollup_task_usage_hourly()`. It is idempotent and watermark-driven, so a missed tick catches up on the next run.
-
-```bash
-# /etc/cron.d/multica-rollup — every 5 minutes
-*/5 * * * * root docker compose -f /path/to/multica/docker-compose.selfhost.yml \
-  exec -T postgres psql -U multica -d multica \
-  -c "SELECT rollup_task_usage_hourly();" >/dev/null
-```
-
-Or as a systemd timer + service if you prefer that surface. The function returns the number of (upserted + deleted-empty) rows; it's safe to call concurrently with itself (an advisory lock makes overlapping runs no-op) and safe to call alongside `backfill_task_usage_hourly`.
-
-### Option B — Swap Postgres for an image that ships `pg_cron`
-
-If you'd rather have Postgres schedule itself, replace `pgvector/pgvector:pg17` in `docker-compose.selfhost.yml` with an image that bundles both `pgvector` and `pg_cron` (e.g. `supabase/postgres`, or your own build of `pgvector/pgvector` with `pg_cron` added and `shared_preload_libraries=pg_cron` set on the server). Then, once:
+Multiple backend replicas are safe: each replica ticks every 30 seconds and tries to claim the current 5-minute UTC plan, but the unique key `(job_name, scope_kind, scope_id, plan_time)` means only one wins each plan. Inspect steady-state operation:

 ```sql
-CREATE EXTENSION IF NOT EXISTS pg_cron;
-SELECT cron.schedule(
-  'rollup_task_usage_hourly',
-  '*/5 * * * *',
-  $$SELECT rollup_task_usage_hourly()$$
-);
+SELECT plan_time, status, attempt, runner_id,
+       error_code, error_msg, started_at, finished_at
+  FROM sys_cron_executions
+ WHERE job_name = 'rollup_task_usage_hourly'
+ ORDER BY plan_time DESC
+ LIMIT 20;
 ```

-`shared_preload_libraries` requires a Postgres restart to take effect — set it in `postgresql.conf` (or via the image's documented mechanism) before bringing the container up.
+Full reference (audit table semantics, advisory lock 4246, the standalone backfill command, flag descriptions, the `v0.3.4 → v0.3.5+` migration auto-hook) lives in [Advanced Configuration → Usage Dashboard Rollup](SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup).

-### Option C — Backfill history first, then schedule
+> **Upgrading from `v0.3.4` to `v0.3.5+`?** As of MUL-2957 the `migrate up` command runs an idempotent monthly-slice backfill automatically right before applying migration `103`, so the upgrade completes in a single invocation — no operator step required. If you are still on a pre-MUL-2957 binary or the auto-hook fails for an environmental reason, run `backfill_task_usage_hourly` against the same database and re-run the upgrade. See [Advanced Configuration → Usage Dashboard Rollup](SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup) for the recovery flow.

-If you're upgrading from `v0.3.4 → v0.3.5+` and already have `task_usage` rows (or you just want the dashboard to show historical data on a fresh install that you've been running for a while), run the bundled backfill command once before scheduling the rollup:
+### Compatibility paths (existing deployments only)

-```bash
-# Backfills task_usage_hourly from all historical task_usage rows and stamps
-# the rollup watermark. Idempotent — safe to re-run.
-docker compose -f docker-compose.selfhost.yml exec backend \
-  ./backfill_task_usage_hourly --sleep-between-slices=2s
-```
+External schedulers — **`pg_cron` registered on the database, an external cron job, a systemd timer, or a Kubernetes `CronJob`** — that call `SELECT rollup_task_usage_hourly()` directly were the only option before MUL-2957 and remain a supported compatibility path. They are no longer the recommended setup; new deployments should rely on the in-process scheduler instead. The SQL function holds advisory lock 4246 internally, so the in-process scheduler and any pre-existing external schedule can coexist without ever double-writing the rollup.

-On a database with years of data this can scan tens of millions of rows; `--sleep-between-slices=2s` throttles the read pressure. Use `--months-back N` (plus `--force-partial`) if you only want the last N months. Once it finishes, set up Option A or Option B so new buckets keep flowing.
+If you already have a `pg_cron` job in production, the safe sequence to retire it is:

-After upgrading, re-run `migrate up` (or restart the backend container — migrations run automatically on startup) to apply migration `103` cleanly.
+1. Confirm the in-process scheduler is healthy on at least one backend replica — recent SUCCESS rows should be landing in `sys_cron_executions` for `rollup_task_usage_hourly`:
+
+   ```sql
+   SELECT plan_time, status, runner_id, finished_at
+     FROM sys_cron_executions
+    WHERE job_name = 'rollup_task_usage_hourly'
+      AND status = 'SUCCESS'
+    ORDER BY plan_time DESC
+    LIMIT 5;
+   ```
+
+2. Once SUCCESS rows are arriving on schedule, unschedule the redundant `pg_cron` entry:
+
+   ```sql
+   SELECT cron.unschedule('rollup_task_usage_hourly')
+     FROM cron.job WHERE jobname = 'rollup_task_usage_hourly';
+   ```
+
+3. Leave the `pg_cron` extension itself installed unless you are sure no other workload depends on it. The bundled `pgvector/pgvector:pg17` image does **not** ship `pg_cron`, so nothing in Multica's default install needs it; uninstalling `pg_cron` from a custom image that other workloads still use is a separate decision.
+
+External cron / systemd timer / Kubernetes `CronJob` setups that call `SELECT rollup_task_usage_hourly()` directly can be retired the same way — once `sys_cron_executions` shows steady SUCCESS rows from the in-process scheduler, the external job is redundant and can be removed.

 ## Stopping Services

@@ -431,7 +427,7 @@ docker compose -f docker-compose.selfhost.yml up -d
 Pin `MULTICA_IMAGE_TAG` in `.env` to an exact version like `v0.2.4` if you want to stay on a specific release. Migrations run automatically on backend startup.
 If the selected GHCR tag has not been published yet, fall back to `make selfhost-build` or `docker compose -f docker-compose.selfhost.yml -f docker-compose.selfhost.build.yml up -d --build`.

-> **Upgrading from `v0.3.4` to `v0.3.5+` fails with `refusing to drop legacy daily rollups: ...`?** That's migration `103`'s fail-closed guard: it requires `task_usage_hourly` to be seeded before the legacy daily rollups are dropped. Run `backfill_task_usage_hourly` first, then re-run the upgrade. Full instructions in [Usage Dashboard Rollup → Option C](#option-c--backfill-history-first-then-schedule).
+> **Upgrading from `v0.3.4` to `v0.3.5+` fails with `refusing to drop legacy daily rollups: ...`?** That's migration `103`'s fail-closed guard: it requires `task_usage_hourly` to be seeded before the legacy daily rollups are dropped. As of MUL-2957 `migrate up` runs that backfill automatically right before applying `103`, so the upgrade completes in a single invocation. If you are still on a pre-MUL-2957 binary or the auto-hook fails, run `backfill_task_usage_hourly` manually first, then re-run the upgrade. Full instructions in [Advanced Configuration → Usage Dashboard Rollup](SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup).

 ---

--- a/apps/docs/content/docs/getting-started/self-hosting.zh.mdx
+++ b/apps/docs/content/docs/getting-started/self-hosting.zh.mdx
@@ -133,17 +133,54 @@ Alternatively, configure step by step: `multica config set server_url http://loc
 3. Go to **Settings → Agents** and create a new agent
 4. Create an issue and assign it to your agent

-## Usage Dashboard Rollup (Required)
+## Usage Dashboard Rollup

-Starting with `v0.3.5`, the Usage / Runtime dashboards read from a derived `task_usage_hourly` table populated by `rollup_task_usage_hourly()`. The bundled `pgvector/pgvector:pg17` image does **not** include `pg_cron`, and the backend doesn't run the rollup in-process either — until you schedule it yourself, the dashboard will stay at zero even though `task_usage` is populated.
+The Usage / Runtime dashboards read from a derived `task_usage_hourly` table populated by `rollup_task_usage_hourly()`. As of MUL-2957 the backend runs this rollup **in-process** on every replica via a DB-backed scheduler (`sys_cron_executions`). A fresh self-host install needs no operator action — the bundled `pgvector/pgvector:pg17` image works as-is, and you do **not** need to swap it for an image that ships `pg_cron`, register an external cron job, run a systemd timer, or schedule a Kubernetes `CronJob`.

-Pick one supported path before relying on the Usage / Runtime dashboard:
+Multiple backend replicas are safe: every replica ticks every 30 seconds and tries to claim the current 5-minute UTC plan, but the unique key `(job_name, scope_kind, scope_id, plan_time)` means only one wins each plan. Inspect the audit table to confirm steady-state operation:

- **External cron / systemd-timer / Kubernetes `CronJob`**: schedule `SELECT rollup_task_usage_hourly()` every 5 minutes. Idempotent, watermark-driven — overlapping or skipped ticks are safe.
- **Postgres with `pg_cron`**: swap the bundled Postgres image for one that ships `pg_cron`, set `shared_preload_libraries=pg_cron`, then `SELECT cron.schedule('rollup_task_usage_hourly', '*/5 * * * *', 'SELECT rollup_task_usage_hourly()')` once.
- **Backfill historical data**: required on the `v0.3.4 → v0.3.5+` upgrade path when the database already has `task_usage` rows — migration `103` is fail-closed and will abort `migrate up` with `refusing to drop legacy daily rollups: ...` until the hourly table is seeded. Run `./backfill_task_usage_hourly --sleep-between-slices=2s` inside the backend container, then re-run the upgrade and configure one of the schedules above.
+```sql
+SELECT plan_time, status, attempt, runner_id,
+       error_code, error_msg, started_at, finished_at
+  FROM sys_cron_executions
+ WHERE job_name = 'rollup_task_usage_hourly'
+ ORDER BY plan_time DESC
+ LIMIT 20;
+```

-Full reference (Compose + Kubernetes templates, flag descriptions, upgrade order) lives in [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup).
+<Callout>
+**Upgrading from `v0.3.4` to `v0.3.5+`?** As of MUL-2957 the `migrate up` command runs an idempotent monthly-slice backfill automatically right before applying migration `103`, so the upgrade completes in a single invocation — no operator step required. If you are on a pre-MUL-2957 binary or the auto-hook fails for an environmental reason, run `backfill_task_usage_hourly` against the same database and re-run the upgrade. Full recovery flow lives in [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup).
+</Callout>
+
+### Compatibility paths (existing deployments only)
+
+External schedulers — **`pg_cron` registered on the database, an external cron job, a systemd timer, or a Kubernetes `CronJob`** — that call `SELECT rollup_task_usage_hourly()` directly were the only option before MUL-2957 and remain a supported compatibility path. They are no longer the recommended setup; new deployments should rely on the in-process scheduler. The SQL function holds advisory lock 4246 internally, so the in-process scheduler and any pre-existing external schedule can coexist without ever double-writing the rollup.
+
+If you already have a `pg_cron` job in production and want to retire it, the safe sequence is:
+
+1. Confirm the in-process scheduler is healthy on at least one backend replica — recent SUCCESS rows should be landing in `sys_cron_executions` for `rollup_task_usage_hourly`:
+
+   ```sql
+   SELECT plan_time, status, runner_id, finished_at
+     FROM sys_cron_executions
+    WHERE job_name = 'rollup_task_usage_hourly'
+      AND status = 'SUCCESS'
+    ORDER BY plan_time DESC
+    LIMIT 5;
+   ```
+
+2. Once SUCCESS rows are arriving on schedule, unschedule the redundant `pg_cron` entry:
+
+   ```sql
+   SELECT cron.unschedule('rollup_task_usage_hourly')
+     FROM cron.job WHERE jobname = 'rollup_task_usage_hourly';
+   ```
+
+3. Leave the `pg_cron` extension itself installed unless you are sure no other workload depends on it. The bundled `pgvector/pgvector:pg17` image does **not** ship `pg_cron`, so nothing in Multica's default install needs it; uninstalling `pg_cron` from a custom image that other workloads still use is a separate decision.
+
+External cron / systemd timer / Kubernetes `CronJob` setups that call `SELECT rollup_task_usage_hourly()` directly can be retired the same way — once `sys_cron_executions` shows steady SUCCESS rows from the in-process scheduler, the external job is redundant and can be removed.
+
+Full reference (audit table semantics, advisory lock 4246, the standalone backfill command, flag descriptions, the migration auto-hook) lives in [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup).

 ## Stopping Services