diff --git a/SELF_HOSTING.md b/SELF_HOSTING.md index 8aa2826b8..8e46b8c9a 100644 --- a/SELF_HOSTING.md +++ b/SELF_HOSTING.md @@ -326,7 +326,7 @@ To roll back if an upgrade goes sideways: helm -n multica rollback multica ``` -> **Upgrading from `v0.3.4` to `v0.3.5+` fails with `refusing to drop legacy daily rollups: ...`?** Same migration guard as the Docker path — see [Usage Dashboard Rollup → Option C](#option-c--backfill-history-first-then-schedule). Run the backfill against the same database the chart is using (`kubectl -n multica exec deploy/multica-backend -- ./backfill_task_usage_hourly --sleep-between-slices=2s`), then restart the backend deployment to re-apply migrations. +> **Upgrading from `v0.3.4` to `v0.3.5+` fails with `refusing to drop legacy daily rollups: ...`?** As of MUL-2957 the `migrate up` command runs an idempotent monthly-slice backfill automatically before applying migration `103`, so a clean upgrade is a single `helm upgrade` + backend rollout. If you are still on a pre-MUL-2957 binary or the auto-hook fails, run the standalone backfill against the same database the chart is using (`kubectl -n multica exec deploy/multica-backend -- ./backfill_task_usage_hourly --sleep-between-slices=2s`), then restart the backend deployment to re-apply migrations. See [Advanced Configuration → Usage Dashboard Rollup](SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup) for the full recovery flow. ### Tearing down @@ -340,56 +340,52 @@ kubectl delete namespace multica --- -## Usage Dashboard Rollup (Required) +## Usage Dashboard Rollup -Starting with `v0.3.5`, the Usage / Runtime dashboards read from a derived `task_usage_hourly` table rather than directly from `task_usage`. Raw `task_usage` rows are written by the backend on every task, but the dashboard only sees data after `rollup_task_usage_hourly()` runs and aggregates them into `task_usage_hourly`. +The Usage / Runtime dashboards read from a derived `task_usage_hourly` table populated by `rollup_task_usage_hourly()`. As of MUL-2957 the backend runs this rollup **in-process** on every replica via a DB-backed scheduler (`sys_cron_executions`); a fresh self-host install needs no operator action and the bundled `pgvector/pgvector:pg17` image works without changes — you do **not** need to swap it for an image that ships `pg_cron`, register an external cron job, set up a systemd timer, or run a Kubernetes `CronJob`. -**The bundled `pgvector/pgvector:pg17` image does NOT include `pg_cron`.** If nothing schedules the rollup, the dashboard will stay at zero forever even though `task_usage` is populated. You have three supported options — pick one before relying on the dashboard. - -> **Upgrading from `v0.3.4` to `v0.3.5+`** with existing `task_usage` history: migration `103` is fail-closed and will abort `migrate up` with `refusing to drop legacy daily rollups: …`. Run `backfill_task_usage_hourly` first (Option C below), then re-run the upgrade. **Fresh installs** are exempted by that guard and migrate cleanly — but the dashboard will still stay at zero until you pick Option A or Option B. - -### Option A — External cron / systemd-timer (simplest) - -Schedule a 5-minute job that calls `rollup_task_usage_hourly()`. It is idempotent and watermark-driven, so a missed tick catches up on the next run. - -```bash -# /etc/cron.d/multica-rollup — every 5 minutes -*/5 * * * * root docker compose -f /path/to/multica/docker-compose.selfhost.yml \ - exec -T postgres psql -U multica -d multica \ - -c "SELECT rollup_task_usage_hourly();" >/dev/null -``` - -Or as a systemd timer + service if you prefer that surface. The function returns the number of (upserted + deleted-empty) rows; it's safe to call concurrently with itself (an advisory lock makes overlapping runs no-op) and safe to call alongside `backfill_task_usage_hourly`. - -### Option B — Swap Postgres for an image that ships `pg_cron` - -If you'd rather have Postgres schedule itself, replace `pgvector/pgvector:pg17` in `docker-compose.selfhost.yml` with an image that bundles both `pgvector` and `pg_cron` (e.g. `supabase/postgres`, or your own build of `pgvector/pgvector` with `pg_cron` added and `shared_preload_libraries=pg_cron` set on the server). Then, once: +Multiple backend replicas are safe: each replica ticks every 30 seconds and tries to claim the current 5-minute UTC plan, but the unique key `(job_name, scope_kind, scope_id, plan_time)` means only one wins each plan. Inspect steady-state operation: ```sql -CREATE EXTENSION IF NOT EXISTS pg_cron; -SELECT cron.schedule( - 'rollup_task_usage_hourly', - '*/5 * * * *', - $$SELECT rollup_task_usage_hourly()$$ -); +SELECT plan_time, status, attempt, runner_id, + error_code, error_msg, started_at, finished_at + FROM sys_cron_executions + WHERE job_name = 'rollup_task_usage_hourly' + ORDER BY plan_time DESC + LIMIT 20; ``` -`shared_preload_libraries` requires a Postgres restart to take effect — set it in `postgresql.conf` (or via the image's documented mechanism) before bringing the container up. +Full reference (audit table semantics, advisory lock 4246, the standalone backfill command, flag descriptions, the `v0.3.4 → v0.3.5+` migration auto-hook) lives in [Advanced Configuration → Usage Dashboard Rollup](SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup). -### Option C — Backfill history first, then schedule +> **Upgrading from `v0.3.4` to `v0.3.5+`?** As of MUL-2957 the `migrate up` command runs an idempotent monthly-slice backfill automatically right before applying migration `103`, so the upgrade completes in a single invocation — no operator step required. If you are still on a pre-MUL-2957 binary or the auto-hook fails for an environmental reason, run `backfill_task_usage_hourly` against the same database and re-run the upgrade. See [Advanced Configuration → Usage Dashboard Rollup](SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup) for the recovery flow. -If you're upgrading from `v0.3.4 → v0.3.5+` and already have `task_usage` rows (or you just want the dashboard to show historical data on a fresh install that you've been running for a while), run the bundled backfill command once before scheduling the rollup: +### Compatibility paths (existing deployments only) -```bash -# Backfills task_usage_hourly from all historical task_usage rows and stamps -# the rollup watermark. Idempotent — safe to re-run. -docker compose -f docker-compose.selfhost.yml exec backend \ - ./backfill_task_usage_hourly --sleep-between-slices=2s -``` +External schedulers — **`pg_cron` registered on the database, an external cron job, a systemd timer, or a Kubernetes `CronJob`** — that call `SELECT rollup_task_usage_hourly()` directly were the only option before MUL-2957 and remain a supported compatibility path. They are no longer the recommended setup; new deployments should rely on the in-process scheduler instead. The SQL function holds advisory lock 4246 internally, so the in-process scheduler and any pre-existing external schedule can coexist without ever double-writing the rollup. -On a database with years of data this can scan tens of millions of rows; `--sleep-between-slices=2s` throttles the read pressure. Use `--months-back N` (plus `--force-partial`) if you only want the last N months. Once it finishes, set up Option A or Option B so new buckets keep flowing. +If you already have a `pg_cron` job in production, the safe sequence to retire it is: -After upgrading, re-run `migrate up` (or restart the backend container — migrations run automatically on startup) to apply migration `103` cleanly. +1. Confirm the in-process scheduler is healthy on at least one backend replica — recent SUCCESS rows should be landing in `sys_cron_executions` for `rollup_task_usage_hourly`: + + ```sql + SELECT plan_time, status, runner_id, finished_at + FROM sys_cron_executions + WHERE job_name = 'rollup_task_usage_hourly' + AND status = 'SUCCESS' + ORDER BY plan_time DESC + LIMIT 5; + ``` + +2. Once SUCCESS rows are arriving on schedule, unschedule the redundant `pg_cron` entry: + + ```sql + SELECT cron.unschedule('rollup_task_usage_hourly') + FROM cron.job WHERE jobname = 'rollup_task_usage_hourly'; + ``` + +3. Leave the `pg_cron` extension itself installed unless you are sure no other workload depends on it. The bundled `pgvector/pgvector:pg17` image does **not** ship `pg_cron`, so nothing in Multica's default install needs it; uninstalling `pg_cron` from a custom image that other workloads still use is a separate decision. + +External cron / systemd timer / Kubernetes `CronJob` setups that call `SELECT rollup_task_usage_hourly()` directly can be retired the same way — once `sys_cron_executions` shows steady SUCCESS rows from the in-process scheduler, the external job is redundant and can be removed. ## Stopping Services @@ -431,7 +427,7 @@ docker compose -f docker-compose.selfhost.yml up -d Pin `MULTICA_IMAGE_TAG` in `.env` to an exact version like `v0.2.4` if you want to stay on a specific release. Migrations run automatically on backend startup. If the selected GHCR tag has not been published yet, fall back to `make selfhost-build` or `docker compose -f docker-compose.selfhost.yml -f docker-compose.selfhost.build.yml up -d --build`. -> **Upgrading from `v0.3.4` to `v0.3.5+` fails with `refusing to drop legacy daily rollups: ...`?** That's migration `103`'s fail-closed guard: it requires `task_usage_hourly` to be seeded before the legacy daily rollups are dropped. Run `backfill_task_usage_hourly` first, then re-run the upgrade. Full instructions in [Usage Dashboard Rollup → Option C](#option-c--backfill-history-first-then-schedule). +> **Upgrading from `v0.3.4` to `v0.3.5+` fails with `refusing to drop legacy daily rollups: ...`?** That's migration `103`'s fail-closed guard: it requires `task_usage_hourly` to be seeded before the legacy daily rollups are dropped. As of MUL-2957 `migrate up` runs that backfill automatically right before applying `103`, so the upgrade completes in a single invocation. If you are still on a pre-MUL-2957 binary or the auto-hook fails, run `backfill_task_usage_hourly` manually first, then re-run the upgrade. Full instructions in [Advanced Configuration → Usage Dashboard Rollup](SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup). --- diff --git a/apps/docs/content/docs/getting-started/self-hosting.zh.mdx b/apps/docs/content/docs/getting-started/self-hosting.zh.mdx index d75118dc5..14ef967a7 100644 --- a/apps/docs/content/docs/getting-started/self-hosting.zh.mdx +++ b/apps/docs/content/docs/getting-started/self-hosting.zh.mdx @@ -133,17 +133,54 @@ Alternatively, configure step by step: `multica config set server_url http://loc 3. Go to **Settings → Agents** and create a new agent 4. Create an issue and assign it to your agent -## Usage Dashboard Rollup (Required) +## Usage Dashboard Rollup -Starting with `v0.3.5`, the Usage / Runtime dashboards read from a derived `task_usage_hourly` table populated by `rollup_task_usage_hourly()`. The bundled `pgvector/pgvector:pg17` image does **not** include `pg_cron`, and the backend doesn't run the rollup in-process either — until you schedule it yourself, the dashboard will stay at zero even though `task_usage` is populated. +The Usage / Runtime dashboards read from a derived `task_usage_hourly` table populated by `rollup_task_usage_hourly()`. As of MUL-2957 the backend runs this rollup **in-process** on every replica via a DB-backed scheduler (`sys_cron_executions`). A fresh self-host install needs no operator action — the bundled `pgvector/pgvector:pg17` image works as-is, and you do **not** need to swap it for an image that ships `pg_cron`, register an external cron job, run a systemd timer, or schedule a Kubernetes `CronJob`. -Pick one supported path before relying on the Usage / Runtime dashboard: +Multiple backend replicas are safe: every replica ticks every 30 seconds and tries to claim the current 5-minute UTC plan, but the unique key `(job_name, scope_kind, scope_id, plan_time)` means only one wins each plan. Inspect the audit table to confirm steady-state operation: -- **External cron / systemd-timer / Kubernetes `CronJob`**: schedule `SELECT rollup_task_usage_hourly()` every 5 minutes. Idempotent, watermark-driven — overlapping or skipped ticks are safe. -- **Postgres with `pg_cron`**: swap the bundled Postgres image for one that ships `pg_cron`, set `shared_preload_libraries=pg_cron`, then `SELECT cron.schedule('rollup_task_usage_hourly', '*/5 * * * *', 'SELECT rollup_task_usage_hourly()')` once. -- **Backfill historical data**: required on the `v0.3.4 → v0.3.5+` upgrade path when the database already has `task_usage` rows — migration `103` is fail-closed and will abort `migrate up` with `refusing to drop legacy daily rollups: ...` until the hourly table is seeded. Run `./backfill_task_usage_hourly --sleep-between-slices=2s` inside the backend container, then re-run the upgrade and configure one of the schedules above. +```sql +SELECT plan_time, status, attempt, runner_id, + error_code, error_msg, started_at, finished_at + FROM sys_cron_executions + WHERE job_name = 'rollup_task_usage_hourly' + ORDER BY plan_time DESC + LIMIT 20; +``` -Full reference (Compose + Kubernetes templates, flag descriptions, upgrade order) lives in [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup). + +**Upgrading from `v0.3.4` to `v0.3.5+`?** As of MUL-2957 the `migrate up` command runs an idempotent monthly-slice backfill automatically right before applying migration `103`, so the upgrade completes in a single invocation — no operator step required. If you are on a pre-MUL-2957 binary or the auto-hook fails for an environmental reason, run `backfill_task_usage_hourly` against the same database and re-run the upgrade. Full recovery flow lives in [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup). + + +### Compatibility paths (existing deployments only) + +External schedulers — **`pg_cron` registered on the database, an external cron job, a systemd timer, or a Kubernetes `CronJob`** — that call `SELECT rollup_task_usage_hourly()` directly were the only option before MUL-2957 and remain a supported compatibility path. They are no longer the recommended setup; new deployments should rely on the in-process scheduler. The SQL function holds advisory lock 4246 internally, so the in-process scheduler and any pre-existing external schedule can coexist without ever double-writing the rollup. + +If you already have a `pg_cron` job in production and want to retire it, the safe sequence is: + +1. Confirm the in-process scheduler is healthy on at least one backend replica — recent SUCCESS rows should be landing in `sys_cron_executions` for `rollup_task_usage_hourly`: + + ```sql + SELECT plan_time, status, runner_id, finished_at + FROM sys_cron_executions + WHERE job_name = 'rollup_task_usage_hourly' + AND status = 'SUCCESS' + ORDER BY plan_time DESC + LIMIT 5; + ``` + +2. Once SUCCESS rows are arriving on schedule, unschedule the redundant `pg_cron` entry: + + ```sql + SELECT cron.unschedule('rollup_task_usage_hourly') + FROM cron.job WHERE jobname = 'rollup_task_usage_hourly'; + ``` + +3. Leave the `pg_cron` extension itself installed unless you are sure no other workload depends on it. The bundled `pgvector/pgvector:pg17` image does **not** ship `pg_cron`, so nothing in Multica's default install needs it; uninstalling `pg_cron` from a custom image that other workloads still use is a separate decision. + +External cron / systemd timer / Kubernetes `CronJob` setups that call `SELECT rollup_task_usage_hourly()` directly can be retired the same way — once `sys_cron_executions` shows steady SUCCESS rows from the in-process scheduler, the external job is redundant and can be removed. + +Full reference (audit table semantics, advisory lock 4246, the standalone backfill command, flag descriptions, the migration auto-hook) lives in [`SELF_HOSTING_ADVANCED.md → Usage Dashboard Rollup`](https://github.com/multica-ai/multica/blob/main/SELF_HOSTING_ADVANCED.md#usage-dashboard-rollup). ## Stopping Services