multica/server/internal/auth/daemon_token_cache.go
Bohan Jiang 86e7de3e41 feat(server/auth): cache auth token lookups in Redis with 10m TTL
* feat(server/auth): cache PAT lookups in Redis with 60s TTL

Personal access tokens used to hit Postgres on every request: a SELECT
to resolve token_hash → user_id, plus a fire-and-forget UPDATE of
last_used_at. For a CLI / daemon making many requests per second this
is wasted DB load — the token is the same and the answer hasn't changed.

Add a Redis-backed cache (auth.PATCache) keyed by token hash, TTL 60s:

- On cache hit, the auth middleware skips both the SELECT and the
  last_used_at UPDATE. last_used_at is now refreshed at most once per
  TTL window per token, not per request.
- On cache miss the middleware falls back to today's behavior: query
  Postgres, populate the cache, async-update last_used_at.
- On revoke, the handler invalidates the cache entry so revocation
  takes effect immediately rather than waiting for the TTL to expire.
  This required changing RevokePersonalAccessToken from :exec to :one
  RETURNING token_hash.
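
The hit / miss / revoke flow above can be sketched roughly as follows. This is a minimal illustration, not the actual middleware: memCache and fakeDB are hypothetical in-memory stand-ins for the Redis-backed cache and for Postgres, and the last_used_at update is synchronous here, whereas the description above says the real one is fire-and-forget.

```go
package main

import "fmt"

// memCache is a hypothetical in-memory stand-in for the Redis-backed PAT cache.
type memCache struct{ entries map[string]string }

func (c *memCache) Get(hash string) (string, bool) {
	userID, ok := c.entries[hash]
	return userID, ok
}

func (c *memCache) Set(hash, userID string) { c.entries[hash] = userID }

func (c *memCache) Invalidate(hash string) { delete(c.entries, hash) }

// fakeDB is a hypothetical stand-in for Postgres that counts queries so the
// effect of the cache is visible.
type fakeDB struct {
	tokens  map[string]string // token_hash -> user_id
	selects int
	updates int
}

func (db *fakeDB) lookup(hash string) (string, bool) {
	db.selects++
	userID, ok := db.tokens[hash]
	return userID, ok
}

func (db *fakeDB) touchLastUsed(hash string) { db.updates++ }

// authenticate resolves a token hash to a user ID, consulting the cache first.
func authenticate(c *memCache, db *fakeDB, hash string) (string, bool) {
	if userID, ok := c.Get(hash); ok {
		return userID, true // hit: no SELECT and no last_used_at UPDATE
	}
	userID, ok := db.lookup(hash)
	if !ok {
		return "", false
	}
	c.Set(hash, userID)
	db.touchLastUsed(hash) // synchronous in this sketch only
	return userID, true
}

// revoke removes the token and drops its cache entry so the next request misses.
func revoke(c *memCache, db *fakeDB, hash string) {
	delete(db.tokens, hash)
	c.Invalidate(hash)
}

func main() {
	db := &fakeDB{tokens: map[string]string{"h1": "user-1"}}
	c := &memCache{entries: map[string]string{}}
	for i := 0; i < 5; i++ {
		authenticate(c, db, "h1")
	}
	fmt.Println(db.selects, db.updates) // 1 1

	revoke(c, db, "h1")
	_, ok := authenticate(c, db, "h1")
	fmt.Println(ok) // false
}
```

Five requests with the same token cost one SELECT and one UPDATE in this sketch, and the request after revoke falls through to the database and fails.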

The cache is nil-safe: when REDIS_URL isn't configured, NewPATCache
returns nil and the middleware degrades to today's always-hit-DB
behavior. JWT validation is untouched (already DB-free).
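
A minimal sketch of the nil-safe pattern, assuming a constructor that returns nil when no client is configured and methods that check for a nil receiver. The client type here is a hypothetical stand-in for *redis.Client, and the names newCache and cache are illustrative only.

```go
package main

import "fmt"

// client is a hypothetical stand-in for *redis.Client.
type client struct{ data map[string]string }

type cache struct{ rdb *client }

// newCache returns nil when no client is configured, which disables caching.
func newCache(rdb *client) *cache {
	if rdb == nil {
		return nil
	}
	return &cache{rdb: rdb}
}

// Get reports a miss on a nil receiver instead of dereferencing it.
func (c *cache) Get(key string) (string, bool) {
	if c == nil {
		return "", false
	}
	v, ok := c.rdb.data[key]
	return v, ok
}

// Set is a no-op on a nil receiver.
func (c *cache) Set(key, value string) {
	if c == nil {
		return
	}
	c.rdb.data[key] = value
}

func main() {
	disabled := newCache(nil)
	disabled.Set("k", "v") // no panic: method on a nil pointer receiver
	_, ok := disabled.Get("k")
	fmt.Println(ok) // false

	enabled := newCache(&client{data: map[string]string{}})
	enabled.Set("k", "v")
	v, ok := enabled.Get("k")
	fmt.Println(v, ok) // v true
}
```

Callers never need to branch on whether caching is enabled; a disabled cache simply always misses.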

Tested with REDIS_TEST_URL — same gating pattern the rest of the
suite uses for Redis-backed tests. New tests cover nil-safety, set/
get/invalidate, TTL, and the middleware short-circuit on cache hit.

* fix(server/auth): clamp PAT cache TTL to token's remaining lifetime

GPT-Boy review caught: a PAT expiring in <60s would still be cached
for the full PATCacheTTL window, so the token could continue passing
auth on cache hit for up to ~60s after its expires_at. The DB query
filters expired tokens (revoked = FALSE AND expires_at > now()), but
that filter never ran on a cache hit.

Make Set take an explicit ttl, and add TTLForExpiry to compute it:
  - no expires_at      → full PATCacheTTL
  - expires_at far     → full PATCacheTTL
  - expires_at <60s    → time until expiry
  - already expired    → 0, Set skips caching (TOCTOU defense between
                         the SELECT and the Set, since the SELECT
                         already filters expired rows)

Regression test pins the clamp behavior end-to-end against Redis.

* feat(server/auth): cache daemon-token + PAT lookups in DaemonAuth, bump TTL to 10m

Daemon /api/daemon/* requests (heartbeat, claim task) hit DaemonAuth
which previously did its own GetDaemonTokenByHash on every request and
*also* duplicated the PAT lookup on the mul_ fallback — bypassing the
cache added in 1cdd674c. Today's daemons authenticate via mul_ PATs
(mdt_ minting isn't wired up yet), so the duplicate PAT path is the one
that actually matters for hot-path DB load.

Three changes:

1. New auth.DaemonTokenCache mirrors PATCache for the mdt_ path
   (key = mul:auth:daemon:<sha256>, JSON value = {workspace_id, daemon_id}).
   Forward-looking infrastructure for when daemon tokens get minted; the
   middleware short-circuits the DB SELECT on cache hit. TTL clamped to
   the token's expires_at via the shared TTLForExpiry helper.

2. DaemonAuth now also consults PATCache on its mul_ fallback, sharing
   the same cache as the regular Auth middleware. A daemon sending 4
   heartbeats/min collapses from 4 GetPersonalAccessTokenByHash + 4
   last_used_at writes per minute (about 40 of each per 10 minutes) to
   ~1 of each per AuthCacheTTL window (~10 minutes).

3. Rename PATCacheTTL → AuthCacheTTL and bump from 60s to 10 minutes.
   The constant is now shared between PAT and daemon caches; 10m matches
   the user-requested longer TTL for further DB write reduction. Revoke
   latency on the happy path is still instant via active invalidation;
   the worst-case (Redis Del miss / direct-DB revoke) grows from ~60s to
   ~10m.

Tests cover nil-safety, set/get/invalidate, TTL, clamped TTL on near-
expiry tokens, and the middleware short-circuit for both cache paths
(mdt_ via DaemonTokenCache, mul_ fallback via PATCache).

* feat(server/auth): cache PAT lookups on the WebSocket auth path

The third place a PAT is resolved — patResolver.ResolveToken used by
realtime.HandleWebSocket — was still hitting Postgres on every /ws
auth and firing an unconditional last_used_at UPDATE, bypassing the
cache added in 1cdd674c. Wire it through the same shared PATCache so
that a single Invalidate on revoke takes effect on all three lookup
paths (Auth middleware, DaemonAuth PAT fallback, and WS auth).

Also leaves a comment on DeleteDaemonTokensByWorkspaceAndDaemon —
the query has no caller today, but a future deregister/rotate flow
must remember to call DaemonTokenCache.Invalidate(hash) for each
deleted row, otherwise deleted daemon tokens stay valid until TTL.
2026-04-29 17:07:54 +08:00

package auth

import (
	"context"
	"encoding/json"
	"errors"
	"log/slog"
	"time"

	"github.com/redis/go-redis/v9"
)

// daemonTokenCachePrefix namespaces daemon-token cache keys separately
// from PAT (mul:auth:pat:*) so the two key spaces can't collide and an
// invalidation on one kind of token doesn't accidentally hit the other.
const daemonTokenCachePrefix = "mul:auth:daemon:"

// DaemonTokenIdentity is what DaemonAuth needs from the cached lookup —
// the workspace_id and daemon_id that the middleware injects into the
// request context. We deliberately omit token_hash, expires_at, and the
// row id; cache entries should leak the minimum.
type DaemonTokenIdentity struct {
	WorkspaceID string `json:"w"`
	DaemonID    string `json:"d"`
}

// DaemonTokenCache caches resolved daemon-token (mdt_) lookups in Redis.
// A nil *DaemonTokenCache is safe to use — every method becomes a no-op
// or reports a cache miss, so single-node dev / tests with no REDIS_URL
// degrade cleanly to direct DB lookups.
type DaemonTokenCache struct {
	rdb *redis.Client
}

// NewDaemonTokenCache returns a cache backed by rdb. Pass nil to disable
// caching; the returned *DaemonTokenCache is safe to call but never hits
// Redis.
func NewDaemonTokenCache(rdb *redis.Client) *DaemonTokenCache {
	if rdb == nil {
		return nil
	}
	return &DaemonTokenCache{rdb: rdb}
}

func daemonTokenCacheKey(hash string) string { return daemonTokenCachePrefix + hash }

// Get returns the cached identity for a token hash. ok=false on cache
// miss or any Redis / decode error — a dead Redis must not take down
// auth.
func (c *DaemonTokenCache) Get(ctx context.Context, hash string) (DaemonTokenIdentity, bool) {
	if c == nil {
		return DaemonTokenIdentity{}, false
	}
	raw, err := c.rdb.Get(ctx, daemonTokenCacheKey(hash)).Bytes()
	if err != nil {
		if !errors.Is(err, redis.Nil) {
			slog.Warn("daemon_token_cache: get failed; falling back to DB", "error", err)
		}
		return DaemonTokenIdentity{}, false
	}
	var id DaemonTokenIdentity
	if err := json.Unmarshal(raw, &id); err != nil {
		slog.Warn("daemon_token_cache: malformed entry; falling back to DB", "error", err)
		return DaemonTokenIdentity{}, false
	}
	return id, true
}

// Set populates the cache with the given TTL. Use TTLForExpiry to clamp
// the TTL to the token's remaining lifetime so a daemon token expiring
// in <AuthCacheTTL can't outlive its expires_at on a cache hit.
//
// Errors are logged and swallowed — a cache write failure is not a
// request failure.
func (c *DaemonTokenCache) Set(ctx context.Context, hash string, id DaemonTokenIdentity, ttl time.Duration) {
	if c == nil || ttl <= 0 {
		return
	}
	raw, err := json.Marshal(id)
	if err != nil {
		slog.Warn("daemon_token_cache: marshal failed", "error", err)
		return
	}
	if err := c.rdb.Set(ctx, daemonTokenCacheKey(hash), raw, ttl).Err(); err != nil {
		slog.Warn("daemon_token_cache: set failed", "error", err)
	}
}

// Invalidate removes the entry for hash. Called when a daemon token is
// deleted so the deletion takes effect immediately rather than waiting
// for the TTL.
func (c *DaemonTokenCache) Invalidate(ctx context.Context, hash string) {
	if c == nil {
		return
	}
	if err := c.rdb.Del(ctx, daemonTokenCacheKey(hash)).Err(); err != nil {
		slog.Warn("daemon_token_cache: invalidate failed; entry will expire on TTL", "error", err)
	}
}