Files
multica/.github/workflows
Bohan Jiang ad803b86ec fix(skills): shared-state runtime local-skill stores (MUL-1288) (#1557)
* fix(skills): shared-state runtime local-skill stores (MUL-1288)

Fixes the bug Bohan surfaced on MUL-1288: behind prod's multi-node API the
runtime-local-skill list/import flow would intermittently time out or 404.
Root cause: LocalSkillListStore and LocalSkillImportStore were per-process
sync.Mutex+map, so when the frontend POST, the daemon heartbeat and the
frontend GET landed on different API instances, each saw a different
pending set. Confirmed against production daemon logs — the failed
request_id never showed up in the daemon's "runtime local skills
requested" log, even though other requests around the same window worked.

Per Yushen's guidance (server must stay stateless; state lives in
storage), migrate both stores to Redis so every node agrees on the same
pending set.

What changed
- LocalSkillListStore / LocalSkillImportStore are now interfaces. Methods
  take context.Context and return error.
- InMemoryLocalSkill{List,Import}Store — renamed from the existing types,
  kept as the default for single-node dev and the in-process test suite.
- RedisLocalSkill{List,Import}Store — new. Keyed on
  mul:local_skill:{list,import}:<id> (JSON record, TTL = retention), with
  a per-runtime ZSET mul:local_skill:{list,import}:pending:<runtime_id>
  (score = created_at UnixNano) providing cross-node ordering. PopPending
  wins the claim via ZREM == 1, so concurrent pops from different nodes
  never return the same request twice.
- NewRouter gets an optional *redis.Client; when non-nil it swaps in the
  Redis-backed stores. main.go hoists the existing Redis client (already
  used by the realtime relay) so both subsystems share one client.
- Handler fields flip to interface types; handler.New still constructs
  in-memory stores by default.
- Daemon heartbeat's PopPending call sites thread r.Context() through so
  Redis operations inherit request cancellation. Errors warn instead of
  poisoning the heartbeat response.

Tests
- Existing in-memory tests updated for the new signatures (ctx + error).
- New runtime_local_skills_redis_store_test.go covers:
  - Create/Get/Complete round trip preserves skills payload
  - PopPending across two *store instances sharing one rdb (the exact
    regression: node A creates, node B pops)
  - N concurrent PopPending on one record => exactly one winner
  - Pending-timeout threshold transitions the record and removes the zset
    member so a later PopPending doesn't return a timed-out request
  - Import store round-trips CreatorID (which is json:"-" on the public
    struct — needs a Redis envelope so ReportLocalSkillImportResult can
    still attribute the created Skill)
  - Per-runtime isolation — a PopPending for runtime B does not disturb
    A's pending zset
- Tests skip gracefully if REDIS_TEST_URL is unset; CI now spins up a
  redis:7-alpine service and exports the URL so the suite actually runs
  there.

Out of scope
PingStore / UpdateStore / ModelListStore have the same shape and the
same latent bug (they just fire rarely enough to have gone unnoticed).
Migrating them to Redis is a follow-up — MUL-1288 is specifically the
local-skills break Bohan is blocked on.

* fix(skills): atomic Redis claim + surface store write failures (PR #1557 review)

Two real gaps GPT-Boy flagged:

1. RedisLocalSkill{List,Import}Store.PopPending was doing ZREM then SET as
   two separate round-trips. If the SET failed for any reason — transient
   Redis error, context cancellation, pod getting SIGKILL'd mid-call — the
   request was already gone from the pending zset but the stored record
   still said "pending", and no subsequent PopPending would re-dispatch
   it. Exactly the "request disappears" class of bug this PR is supposed
   to kill.

   Fix: push the claim into a Lua script so Redis runs ZREM + SET as one
   atomic unit. If ZREM returns 0 (another node won the race), SET is
   skipped and the caller retries.

2. ReportLocalSkill{List,Import}Result handlers were logging Complete/Fail
   store failures at Warn and still returning 200 OK. That made the
   daemon think the report landed when it hadn't, leaving the request
   stuck in "running" until the server-side timeout and — worse for the
   import flow — leaving the just-created Skill row orphaned in Postgres
   so every retry collided with the unique-name constraint.

   Fix: escalate to Error + return 500 so the daemon (and monitoring) can
   see the write failed. For the import flow, Complete failure after the
   Skill row is already committed also triggers a best-effort DeleteSkill
   so a daemon retry lands on a clean slate instead of hitting
   "a skill with this name already exists" forever.

Tests
- New TestRedisLocalSkillListStore_PopPendingAtomicClaim asserts the
  happy-path invariant: after one PopPending the record is "running"
  AND a second PopPending returns nothing. Deliberately does NOT poke
  Redis internals directly so the test survives any future key-layout
  refactor.
- Existing cross-instance / concurrent / timeout / per-runtime tests
  continue to pass against the Lua-based claim path (verified locally
  against a scratch redis-server; 8/8 Redis tests green).
2026-04-23 17:07:34 +08:00
..