mirror of
https://github.com/multica-ai/multica.git
synced 2026-06-17 11:48:42 +02:00
* feat(autopilots): webhook delivery layer + idempotency / signature / replay (MUL-2334)
Splits "inbound webhook receipt" from "autopilot run creation" so we can
record duplicate attempts, signature outcomes, and ignored/skipped
deliveries — and replay a delivery on demand. v1 ingress wrote straight
into autopilot_run.trigger_payload, which collapsed the two concerns and
left run_only autopilots vulnerable to provider retry storms.
Backend only (PR1). UI Deliveries tab follows in PR2.
Schema (migration 093):
- autopilot_trigger.provider: 'generic' | 'github' (default 'generic').
- autopilot_trigger.signing_secret: nullable plaintext (HMAC needs it
cleartext; mirrors how webhook_token is stored).
- webhook_delivery: one row per inbound POST. Carries raw_body,
selected_headers, dedupe_key/source, signature_status,
autopilot_run_id, replayed_from_delivery_id, response_status / body.
- Partial unique index on (trigger_id, dedupe_key) excludes NULL and
'rejected' rows, so a wrong-secret 401 does NOT permanently block a
future retry with the same X-GitHub-Delivery once the operator fixes
the secret.
Ingress flow (autopilot_webhook.go), persist-first + sync dispatch:
1. IP rate limit -> 2. token lookup -> 3. token rate limit ->
4. read raw body -> 5. autopilot/workspace cross-check ->
6. normalize JSON (400 without persistence on parse failure) ->
7. compute dedupe key + signature status ->
8. INSERT delivery (status=queued). On (trigger_id, dedupe_key)
unique-violation: bump attempt_count on existing row and return
the original delivery_id + autopilot_run_id with 200 ->
9. invalid/missing signature: UPDATE -> rejected, return 401 with
delivery_id (no dispatch, not replayable) ->
10. trigger disabled / autopilot paused/archived: UPDATE -> ignored,
return 200 ->
11. DispatchAutopilot synchronously, UPDATE -> dispatched/skipped/failed
with autopilot_run_id and the response body we returned ->
12. TouchAutopilotTriggerFiredAt and return 200.
No new long-running worker. A stale 'queued' row only happens if the
process dies between INSERT and UPDATE; that's a follow-up sweeper, not
this PR.
Authenticated API:
- GET /api/autopilots/{id}/deliveries (slim list)
- GET /api/autopilots/{id}/deliveries/{deliveryId} (with raw_body)
- POST /api/autopilots/{id}/deliveries/{deliveryId}/replay -> creates
a new delivery row (replayed_from_delivery_id set), dispatches a
new run, never collapses onto the original via dedupe.
- PUT /api/autopilots/{id}/triggers/{triggerId}/signing-secret
Write-only; trigger response surfaces has_signing_secret +
signing_secret_hint (last 4 chars), never the secret itself.
Signature verification reuses the GitHub-compatible
X-Hub-Signature-256: sha256=<hex(hmac(body, secret))> scheme; the
HMAC helper is constant-time. Invalid/missing signatures still count
against per-IP and per-token rate limits.
autopilot_run.trigger_payload is intentionally preserved — delivery
records the HTTP receipt; run records the normalized envelope handed
to the agent. They are two different views.
Tests (Postgres-backed):
- delivery persistence on accept
- dedupe via Idempotency-Key and X-GitHub-Delivery; run_only retry
storm pin (3 retries -> 1 run)
- invalid signature: 401 + rejected row + no run linkage
- missing signature when secret configured: 401 + 'missing' state
- valid signature dispatches
- signing secret never echoed in trigger responses; hint shows last 4
- min-length and clear-by-empty for signing secret PUT
- replay creates a NEW delivery + new run; rejected deliveries cannot
be replayed
- list omits raw_body; detail includes it; cross-autopilot ID returns
404 (workspace isolation defense in depth)
- provider validation: unknown -> 400, github -> 201 round-trips
- bad-signature stream still counts against per-token rate limit
Co-authored-by: multica-agent <github@multica.ai>
* fix(autopilots): address PR review on webhook delivery layer (MUL-2334)
- Exclude `failed` from the (trigger_id, dedupe_key) partial unique index
alongside `rejected`, so a transient ingress failure does not strand the
provider's stable X-GitHub-Delivery / Idempotency-Key retry. Update the
dedupe lookup to prefer non-terminal rows under the same predicate.
- Tighten delivery status enum: drop `skipped` from the CHECK constraint
and from the handler. A run that was admission-skipped (e.g. runtime
offline) is now recorded as delivery=`dispatched` linked to the
skipped run, with the response payload carrying status=`skipped`.
Source of truth for skipped-ness is autopilot_run.status, not the
delivery row — keeps the Deliveries UI enum unambiguous.
- On dispatch error, link the (possibly non-nil) autopilot_run returned
by DispatchAutopilot to the failed delivery so Deliveries UI can
navigate to the run row for debugging.
- Slim list projection: ListWebhookDeliveriesByAutopilot no longer pulls
raw_body / selected_headers / response_body — a 100-row page × 256 KiB
would otherwise round-trip ~25 MiB from Postgres per Deliveries reload.
Detail endpoint continues to return the full row.
- Fix backend CI: TestGetDelivery_ReturnsFullPayload now decodes the
response and asserts on the parsed raw_body instead of substring-
matching against an escaped JSON string; raise the test-suite default
webhook rate limits in TestMain so the shared 192.0.2.1 IP bucket
doesn't fill across the suite and leak 429s into unrelated tests.
- Add regression coverage for the dedupe-after-failure path.
cd server && go test ./... is green locally.
Co-authored-by: multica-agent <github@multica.ai>
---------
Co-authored-by: multica-agent <github@multica.ai>
98 lines
5.4 KiB
SQL
98 lines
5.4 KiB
SQL
-- Webhook delivery layer: separates "we received an inbound HTTP webhook" from
|
|
-- "we created an autopilot run". Splitting the two lets us record duplicate
|
|
-- attempts, signature outcomes, ignored/skipped deliveries, and offers a
|
|
-- replay path — all of which the previous shape (writing straight into
|
|
-- autopilot_run.trigger_payload) could not express.
|
|
--
|
|
-- Scope of this migration:
|
|
-- 1. Add `provider` + `signing_secret` to autopilot_trigger so a webhook
|
|
-- trigger can optionally carry per-provider configuration for signature
|
|
-- verification. signing_secret is plaintext at rest, mirroring how
|
|
-- webhook_token already lives — HMAC verification needs the cleartext
|
|
-- and there is no general secrets-at-rest infrastructure to layer on
|
|
-- yet (see issue MUL-2334 for the design rationale).
|
|
-- 2. Create webhook_delivery, one row per inbound HTTP request the public
|
|
-- ingress endpoint accepted (including rejected / ignored outcomes).
|
|
-- Duplicate requests don't get their own row — they bump attempt_count
|
|
-- on the existing dedupe target. The autopilot_run table is kept intact
|
|
-- and continues to represent the downstream execution — both views are
|
|
-- needed.
|
|
|
|
ALTER TABLE autopilot_trigger
|
|
ADD COLUMN provider TEXT NOT NULL DEFAULT 'generic'
|
|
CHECK (provider IN ('generic', 'github')),
|
|
ADD COLUMN signing_secret TEXT;
|
|
|
|
CREATE TABLE IF NOT EXISTS webhook_delivery (
|
|
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
|
|
workspace_id UUID NOT NULL REFERENCES workspace(id) ON DELETE CASCADE,
|
|
autopilot_id UUID NOT NULL REFERENCES autopilot(id) ON DELETE CASCADE,
|
|
trigger_id UUID NOT NULL REFERENCES autopilot_trigger(id) ON DELETE CASCADE,
|
|
provider TEXT NOT NULL
|
|
CHECK (provider IN ('generic', 'github')),
|
|
event TEXT NOT NULL DEFAULT 'webhook.received',
|
|
-- dedupe_key is extracted from request headers per-provider:
|
|
-- github -> X-GitHub-Delivery
|
|
-- generic -> Idempotency-Key
|
|
-- NULL means "no provider-supplied dedupe identifier"; the partial unique
|
|
-- index below skips these rows so each NULL-keyed delivery is treated as
|
|
-- distinct (correct behavior for clients that don't implement idempotency
|
|
-- headers — the alternative would collapse all of them onto one row).
|
|
dedupe_key TEXT,
|
|
dedupe_source TEXT,
|
|
signature_status TEXT NOT NULL DEFAULT 'not_required'
|
|
CHECK (signature_status IN ('not_required', 'valid', 'invalid', 'missing')),
|
|
-- Delivery status tracks the *ingress*, not the autopilot run:
|
|
-- queued — INSERTed, dispatch not yet attempted
|
|
-- dispatched — handed off to AutopilotService; autopilot_run_id is set.
|
|
-- A run that was admission-skipped (e.g. runtime offline)
|
|
-- still lives here — the skipped-ness is recorded on
|
|
-- autopilot_run.status, not on the delivery, so the
|
|
-- Deliveries enum stays unambiguous.
|
|
-- rejected — signature verification failed (invalid or missing)
|
|
-- ignored — trigger disabled / autopilot paused / archived
|
|
-- failed — dispatch attempted and errored
|
|
status TEXT NOT NULL DEFAULT 'queued'
|
|
CHECK (status IN ('queued', 'dispatched', 'rejected', 'ignored', 'failed')),
|
|
attempt_count INTEGER NOT NULL DEFAULT 1,
|
|
-- Selected headers we want to keep for debugging (user-agent, event,
|
|
-- delivery id, idempotency key, signature presence). NOT the raw header
|
|
-- map — never store auth tokens or signature values plaintext.
|
|
selected_headers JSONB NOT NULL DEFAULT '{}'::jsonb,
|
|
content_type TEXT,
|
|
-- raw_body holds the exact bytes we received, capped at the ingress body
|
|
-- limit (256 KiB). Required for replay and to debug normalization issues.
|
|
-- No TTL/retention enforced here; if delivery volume becomes a storage
|
|
-- concern, add a follow-up migration with a retention job rather than
|
|
-- baking the policy into the schema.
|
|
raw_body BYTEA,
|
|
response_status INTEGER,
|
|
response_body TEXT,
|
|
autopilot_run_id UUID REFERENCES autopilot_run(id) ON DELETE SET NULL,
|
|
replayed_from_delivery_id UUID REFERENCES webhook_delivery(id) ON DELETE SET NULL,
|
|
error TEXT,
|
|
received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
|
|
last_attempt_at TIMESTAMPTZ NOT NULL DEFAULT now(),
|
|
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
|
|
);
|
|
|
|
-- Listing deliveries per autopilot, newest first.
|
|
CREATE INDEX IF NOT EXISTS idx_webhook_delivery_autopilot
|
|
ON webhook_delivery(autopilot_id, created_at DESC);
|
|
|
|
-- Provider-supplied dedupe identifiers must be unique per trigger. Partial so
|
|
-- that NULL keys (no Idempotency-Key / X-GitHub-Delivery header) never
|
|
-- collide. Terminal-but-not-successful outcomes (`rejected`, `failed`) are
|
|
-- excluded so a transient ingress failure or a misconfigured secret does
|
|
-- not strand the operator: providers like GitHub keep X-GitHub-Delivery
|
|
-- stable across retries, and a permanently-blocking row would prevent the
|
|
-- next retry from ever being dispatched.
|
|
CREATE UNIQUE INDEX IF NOT EXISTS idx_webhook_delivery_dedupe
|
|
ON webhook_delivery(trigger_id, dedupe_key)
|
|
WHERE dedupe_key IS NOT NULL AND status NOT IN ('rejected', 'failed');
|
|
|
|
-- Lookup by linked run (sync flows, gc check, run-detail "what triggered me").
|
|
CREATE INDEX IF NOT EXISTS idx_webhook_delivery_run
|
|
ON webhook_delivery(autopilot_run_id)
|
|
WHERE autopilot_run_id IS NOT NULL;
|