Files
multica/.github/workflows/ci.yml
Bohan Jiang ad803b86ec fix(skills): shared-state runtime local-skill stores (MUL-1288) (#1557)
* fix(skills): shared-state runtime local-skill stores (MUL-1288)

Fixes the bug Bohan surfaced on MUL-1288: behind prod's multi-node API the
runtime-local-skill list/import flow would intermittently time out or 404.
Root cause: LocalSkillListStore and LocalSkillImportStore were per-process
sync.Mutex+map, so when the frontend POST, the daemon heartbeat and the
frontend GET landed on different API instances, each saw a different
pending set. Confirmed against production daemon logs — the failed
request_id never showed up in the daemon's "runtime local skills
requested" log, even though other requests around the same window worked.

Per Yushen's guidance (server must stay stateless; state lives in
storage), migrate both stores to Redis so every node agrees on the same
pending set.

What changed
- LocalSkillListStore / LocalSkillImportStore are now interfaces. Methods
  take context.Context and return error.
- InMemoryLocalSkill{List,Import}Store — renamed from the existing types,
  kept as the default for single-node dev and the in-process test suite.
- RedisLocalSkill{List,Import}Store — new. Keyed on
  mul:local_skill:{list,import}:<id> (JSON record, TTL = retention), with
  a per-runtime ZSET mul:local_skill:{list,import}:pending:<runtime_id>
  (score = created_at UnixNano) providing cross-node ordering. PopPending
  wins the claim via ZREM == 1, so concurrent pops from different nodes
  never return the same request twice.
- NewRouter gets an optional *redis.Client; when non-nil it swaps in the
  Redis-backed stores. main.go hoists the existing Redis client (already
  used by the realtime relay) so both subsystems share one client.
- Handler fields flip to interface types; handler.New still constructs
  in-memory stores by default.
- Daemon heartbeat's PopPending call sites thread r.Context() through so
  Redis operations inherit request cancellation. Errors warn instead of
  poisoning the heartbeat response.

Tests
- Existing in-memory tests updated for the new signatures (ctx + error).
- New runtime_local_skills_redis_store_test.go covers:
  - Create/Get/Complete round trip preserves skills payload
  - PopPending across two *store instances sharing one rdb (the exact
    regression: node A creates, node B pops)
  - N concurrent PopPending on one record => exactly one winner
  - Pending-timeout threshold transitions the record and removes the zset
    member so a later PopPending doesn't return a timed-out request
  - Import store round-trips CreatorID (which is json:"-" on the public
    struct — needs a Redis envelope so ReportLocalSkillImportResult can
    still attribute the created Skill)
  - Per-runtime isolation — a PopPending for runtime B does not disturb
    A's pending zset
- Tests skip gracefully if REDIS_TEST_URL is unset; CI now spins up a
  redis:7-alpine service and exports the URL so the suite actually runs
  there.

Out of scope
PingStore / UpdateStore / ModelListStore have the same shape and the
same latent bug (they just fire rarely enough to have gone unnoticed).
Migrating them to Redis is a follow-up — MUL-1288 is specifically the
local-skills break Bohan is blocked on.

* fix(skills): atomic Redis claim + surface store write failures (PR #1557 review)

Two real gaps GPT-Boy flagged:

1. RedisLocalSkill{List,Import}Store.PopPending was doing ZREM then SET as
   two separate round-trips. If the SET failed for any reason — transient
   Redis error, context cancellation, pod getting SIGKILL'd mid-call — the
   request was already gone from the pending zset but the stored record
   still said "pending", and no subsequent PopPending would re-dispatch
   it. Exactly the "request disappears" class of bug this PR is supposed
   to kill.

   Fix: push the claim into a Lua script so Redis runs ZREM + SET as one
   atomic unit. If ZREM returns 0 (another node won the race), SET is
   skipped and the caller retries.

2. ReportLocalSkill{List,Import}Result handlers were logging Complete/Fail
   store failures at Warn and still returning 200 OK. That made the
   daemon think the report landed when it hadn't, leaving the request
   stuck in "running" until the server-side timeout and — worse for the
   import flow — leaving the just-created Skill row orphaned in Postgres
   so every retry collided with the unique-name constraint.

   Fix: escalate to Error + return 500 so the daemon (and monitoring) can
   see the write failed. For the import flow, Complete failure after the
   Skill row is already committed also triggers a best-effort DeleteSkill
   so a daemon retry lands on a clean slate instead of hitting
   "a skill with this name already exists" forever.

Tests
- New TestRedisLocalSkillListStore_PopPendingAtomicClaim asserts the
  happy-path invariant: after one PopPending the record is "running"
  AND a second PopPending returns nothing. Deliberately does NOT poke
  Redis internals directly so the test survives any future key-layout
  refactor.
- Existing cross-instance / concurrent / timeout / per-runtime tests
  continue to pass against the Lua-based claim path (verified locally
  against a scratch redis-server; 8/8 Redis tests green).
2026-04-23 17:07:34 +08:00

85 lines
2.1 KiB
YAML

name: CI
on:
push:
branches: [main]
pull_request:
branches: [main]
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
jobs:
frontend:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v6
- name: Setup pnpm
uses: pnpm/action-setup@v4
- name: Setup Node.js
uses: actions/setup-node@v6
with:
node-version: 22
cache: pnpm
- name: Install dependencies
run: pnpm install
- name: Build, type check, and test
run: pnpm exec turbo build typecheck test --filter='!@multica/docs'
backend:
runs-on: ubuntu-latest
services:
postgres:
image: pgvector/pgvector:pg17
env:
POSTGRES_DB: multica
POSTGRES_USER: multica
POSTGRES_PASSWORD: multica
ports:
- 5432:5432
options: >-
--health-cmd "pg_isready -U multica -d multica"
--health-interval 5s
--health-timeout 5s
--health-retries 20
redis:
image: redis:7-alpine
ports:
- 6379:6379
options: >-
--health-cmd "redis-cli ping"
--health-interval 5s
--health-timeout 5s
--health-retries 10
env:
DATABASE_URL: postgres://multica:multica@localhost:5432/multica?sslmode=disable
# Wires up the RedisLocalSkill*_test.go suite. Distinct from REDIS_URL
# (which would flip the server binary itself onto the Redis-backed
# realtime relay + request stores); the tests talk to this Redis
# directly so they run alongside the Postgres-backed suite.
REDIS_TEST_URL: redis://localhost:6379/1
steps:
- name: Checkout
uses: actions/checkout@v6
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: "1.26.1"
cache-dependency-path: server/go.sum
- name: Build
run: cd server && go build ./...
- name: Run migrations
run: cd server && go run ./cmd/migrate up
- name: Test
run: cd server && go test ./...