multica/server/internal/auth/pat_cache.go
Bohan Jiang 86e7de3e41 feat(server/auth): cache auth token lookups in Redis with 10m TTL
* feat(server/auth): cache PAT lookups in Redis with 60s TTL

Personal access tokens used to hit Postgres on every request: a SELECT
to resolve token_hash → user_id, plus a fire-and-forget UPDATE of
last_used_at. For a CLI / daemon making many requests per second this
is wasted DB load — the token is the same and the answer hasn't changed.

Add a Redis-backed cache (auth.PATCache) keyed by token hash, TTL 60s:

- On cache hit, the auth middleware skips both the SELECT and the
  last_used_at UPDATE. last_used_at is now refreshed at most once per
  TTL window per token, not per request.
- On cache miss the middleware falls back to today's behavior: query
  Postgres, populate the cache, async-update last_used_at.
- On revoke, the handler invalidates the cache entry so revocation
  takes effect immediately rather than waiting for the TTL to expire.
  This required changing RevokePersonalAccessToken from :exec to :one
  RETURNING token_hash.
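The hit/miss/revoke behavior in the list above can be sketched as a cache-aside lookup. This is a minimal, stdlib-only sketch, not the actual middleware. An in-memory map (`memCache`) stands in for Redis. `tokenStore` stands in for Postgres and only counts round-trips. `resolveUser` and `revoke` are hypothetical names. TTLs are left out so the control flow stays visible.

```go
package main

import (
	"fmt"
	"sync"
)

// memCache is a hypothetical in-memory stand-in for the Redis-backed PATCache.
type memCache struct {
	mu sync.Mutex
	m  map[string]string // token hash -> user_id
}

func (c *memCache) Get(hash string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.m[hash]
	return v, ok
}

func (c *memCache) Set(hash, userID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[hash] = userID
}

func (c *memCache) Invalidate(hash string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.m, hash)
}

// tokenStore is a hypothetical stand-in for Postgres that counts round-trips.
type tokenStore struct {
	users   map[string]string // token hash -> user_id (non-revoked rows only)
	selects int               // SELECT token_hash -> user_id
	touches int               // UPDATE last_used_at
}

// resolveUser follows the described flow. A cache hit skips both the SELECT
// and the last_used_at UPDATE. A miss queries the store, populates the
// cache, and touches last_used_at (asynchronously in the real middleware).
func resolveUser(c *memCache, db *tokenStore, hash string) (string, bool) {
	if uid, ok := c.Get(hash); ok {
		return uid, true
	}
	db.selects++
	uid, ok := db.users[hash]
	if !ok {
		return "", false
	}
	c.Set(hash, uid)
	db.touches++
	return uid, true
}

// revoke removes the row and invalidates the cache entry, so the revoke
// takes effect immediately instead of after the TTL.
func revoke(c *memCache, db *tokenStore, hash string) {
	delete(db.users, hash)
	c.Invalidate(hash)
}

func main() {
	c := &memCache{m: map[string]string{}}
	db := &tokenStore{users: map[string]string{"h1": "user-1"}}
	for i := 0; i < 5; i++ {
		resolveUser(c, db, "h1")
	}
	fmt.Println(db.selects, db.touches) // 1 1
	revoke(c, db, "h1")
	_, ok := resolveUser(c, db, "h1")
	fmt.Println(ok) // false
}
```

Five lookups of the same token cost one SELECT and one last_used_at UPDATE. After `revoke`, the next lookup misses the cache, and the fallback SELECT finds no row.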

The cache is nil-safe: when REDIS_URL isn't configured, NewPATCache
returns nil and the middleware degrades to today's always-hit-DB
behavior. JWT validation is untouched (already DB-free).
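The nil-safety described above depends on Go allowing methods to be called on a nil pointer receiver. Here is a minimal sketch of that pattern. The `cache` type and the `newCache` constructor are hypothetical stand-ins for PATCache and NewPATCache, and a boolean replaces the *redis.Client argument.

```go
package main

import "fmt"

// cache is a hypothetical stand-in showing the nil-receiver pattern:
// every method checks c == nil before touching the backend.
type cache struct {
	data map[string]string
}

// newCache returns nil when no backend is configured, much as NewPATCache
// returns nil for a nil *redis.Client.
func newCache(configured bool) *cache {
	if !configured {
		return nil
	}
	return &cache{data: map[string]string{}}
}

// Get always reports a miss on a nil cache, so callers fall back to the DB.
func (c *cache) Get(k string) (string, bool) {
	if c == nil {
		return "", false
	}
	v, ok := c.data[k]
	return v, ok
}

// Set is a no-op on a nil cache.
func (c *cache) Set(k, v string) {
	if c == nil {
		return
	}
	c.data[k] = v
}

func main() {
	c := newCache(false) // e.g. REDIS_URL unset
	c.Set("h", "u")      // safe: no nil-pointer panic
	_, ok := c.Get("h")
	fmt.Println(ok) // false
}
```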

Tested with REDIS_TEST_URL — same gating pattern the rest of the
suite uses for Redis-backed tests. New tests cover nil-safety, set/
get/invalidate, TTL, and the middleware short-circuit on cache hit.

* fix(server/auth): clamp PAT cache TTL to token's remaining lifetime

GPT-Boy review caught a gap: a PAT expiring in <60s would still be cached
for the full PATCacheTTL window, so the token could keep passing auth on
cache hits for up to ~60s after its expires_at. The DB query filters
expired tokens (revoked = FALSE AND expires_at > now()), but that filter
never runs on a cache hit.

Make Set take an explicit ttl, and add TTLForExpiry to compute it:
  - no expires_at      → full PATCacheTTL
  - expires_at far     → full PATCacheTTL
  - expires_at <60s    → time until expiry
  - already expired    → 0, Set skips caching (TOCTOU defense between
                         the SELECT and the Set, since the SELECT
                         already filters expired rows)

Regression test pins the clamp behavior end-to-end against Redis.

* feat(server/auth): cache daemon-token + PAT lookups in DaemonAuth, bump TTL to 10m

Daemon /api/daemon/* requests (heartbeat, claim task) hit DaemonAuth
which previously did its own GetDaemonTokenByHash on every request and
*also* duplicated the PAT lookup on the mul_ fallback — bypassing the
cache added in 1cdd674c. Today's daemons authenticate via mul_ PATs
(mdt_ minting isn't wired up yet), so the duplicate PAT path is the one
that actually matters for hot-path DB load.

Three changes:

1. New auth.DaemonTokenCache mirrors PATCache for the mdt_ path
   (key = mul:auth:daemon:<sha256>, JSON value = {workspace_id, daemon_id}).
   Forward-looking infrastructure for when daemon tokens get minted; the
   middleware short-circuits the DB SELECT on cache hit. TTL clamped to
   the token's expires_at via the shared TTLForExpiry helper.

2. DaemonAuth now also consults PATCache on its mul_ fallback, sharing
   the same cache as the regular Auth middleware. A daemon sending 4
   heartbeats/min collapses from 4 GetPersonalAccessTokenByHash calls + 4
   last_used_at writes per minute to ~1 of each per AuthCacheTTL window
   (~10 minutes).

3. Rename PATCacheTTL → AuthCacheTTL and bump from 60s to 10 minutes.
   The constant is now shared between PAT and daemon caches; 10m matches
   the user-requested longer TTL for further DB write reduction. Revoke
   latency on the happy path is still instant via active invalidation;
   the worst-case (Redis Del miss / direct-DB revoke) grows from ~60s to
   ~10m.
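The sketch below shows the key and value layout described for DaemonTokenCache in item 1, using only the standard library. The key prefix and the JSON field names come from the description above. The `daemonEntry` type and the helper names are hypothetical. It assumes the token hash is the hex-encoded SHA-256 of the raw token, since the description only says `<sha256>`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// daemonCachePrefix follows the key layout given in the description.
const daemonCachePrefix = "mul:auth:daemon:"

// daemonEntry is a hypothetical struct for the cached JSON value.
type daemonEntry struct {
	WorkspaceID string `json:"workspace_id"`
	DaemonID    string `json:"daemon_id"`
}

// daemonCacheKey assumes the hash is hex(SHA-256(raw token)).
func daemonCacheKey(rawToken string) string {
	sum := sha256.Sum256([]byte(rawToken))
	return daemonCachePrefix + hex.EncodeToString(sum[:])
}

// encodeEntry serializes the value stored under the key.
func encodeEntry(e daemonEntry) (string, error) {
	b, err := json.Marshal(e)
	return string(b), err
}

// decodeEntry parses a cached value back into both IDs.
func decodeEntry(s string) (daemonEntry, error) {
	var e daemonEntry
	err := json.Unmarshal([]byte(s), &e)
	return e, err
}

func main() {
	key := daemonCacheKey("mdt_example")
	v, _ := encodeEntry(daemonEntry{WorkspaceID: "ws-1", DaemonID: "d-1"})
	fmt.Println(len(key)) // 80: 16-byte prefix + 64 hex chars
	fmt.Println(v)        // {"workspace_id":"ws-1","daemon_id":"d-1"}
}
```

With a hex-encoded SHA-256, every key is 16 + 64 = 80 bytes. A cache hit decodes back into both IDs without touching the database.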

Tests cover nil-safety, set/get/invalidate, TTL, clamped TTL on near-
expiry tokens, and the middleware short-circuit for both cache paths
(mdt_ via DaemonTokenCache, mul_ fallback via PATCache).

* feat(server/auth): cache PAT lookups on the WebSocket auth path

The third place a PAT is resolved, patResolver.ResolveToken (used by
realtime.HandleWebSocket), was still hitting Postgres on every /ws
auth and firing an unconditional last_used_at UPDATE, bypassing the
cache added in 1cdd674c. Wire it through the same shared PATCache, so a
single Invalidate on revoke clears the cached entry for all three lookup
paths (Auth middleware, DaemonAuth PAT fallback, and WS auth).
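All three lookup paths hold the same PATCache instance rather than separate caches, so one Invalidate on revoke is enough. Below is a minimal sketch of that sharing. The map-backed `sharedCache` is a hypothetical stand-in for PATCache. The `lookupPath` type is a hypothetical stand-in for the Auth middleware, the DaemonAuth mul_ fallback, and the WS patResolver.

```go
package main

import "fmt"

// sharedCache is a hypothetical map-backed stand-in for the one PATCache.
type sharedCache struct {
	m map[string]string // token hash -> user_id
}

func (c *sharedCache) Get(hash string) (string, bool) {
	v, ok := c.m[hash]
	return v, ok
}

func (c *sharedCache) Invalidate(hash string) {
	delete(c.m, hash)
}

// lookupPath is a hypothetical stand-in for one place a PAT is resolved.
type lookupPath struct {
	name  string
	cache *sharedCache // every path points at the same instance
}

// cachedHits counts how many paths would still serve hash from cache.
func cachedHits(paths []lookupPath, hash string) int {
	n := 0
	for _, p := range paths {
		if _, ok := p.cache.Get(hash); ok {
			n++
		}
	}
	return n
}

func main() {
	c := &sharedCache{m: map[string]string{"h1": "user-1"}}
	paths := []lookupPath{{"Auth", c}, {"DaemonAuth", c}, {"WS", c}}
	fmt.Println(cachedHits(paths, "h1")) // 3
	c.Invalidate("h1")                   // a single call on revoke
	fmt.Println(cachedHits(paths, "h1")) // 0
}
```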

Also leaves a comment on DeleteDaemonTokensByWorkspaceAndDaemon —
the query has no caller today, but a future deregister/rotate flow
must remember to call DaemonTokenCache.Invalidate(hash) for each
deleted row, otherwise deleted daemon tokens stay valid until TTL.
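The contract for a future deregister/rotate flow can be sketched as three steps: delete the rows, collect their hashes, and invalidate each one. Everything in this sketch is hypothetical. `fakeDB` stands in for Postgres, and `mapCache` stands in for DaemonTokenCache. `deleteDaemonTokens` assumes the delete returns the deleted rows' token hashes, similar to the RETURNING token_hash change made for revoke. Whether DeleteDaemonTokensByWorkspaceAndDaemon returns them today is not stated here.

```go
package main

import "fmt"

// daemonRow is a hypothetical stand-in for a daemon token row.
type daemonRow struct {
	hash, workspaceID, daemonID string
}

// fakeDB is a hypothetical stand-in for Postgres.
type fakeDB struct {
	rows []daemonRow
}

// deleteDaemonTokens removes matching rows and returns their hashes,
// assuming a RETURNING token_hash style query.
func (db *fakeDB) deleteDaemonTokens(ws, daemon string) []string {
	var deleted []string
	var kept []daemonRow
	for _, r := range db.rows {
		if r.workspaceID == ws && r.daemonID == daemon {
			deleted = append(deleted, r.hash)
		} else {
			kept = append(kept, r)
		}
	}
	db.rows = kept
	return deleted
}

// mapCache is a hypothetical stand-in for DaemonTokenCache.
type mapCache map[string]string

func (c mapCache) Invalidate(hash string) { delete(c, hash) }

// deregisterDaemon deletes the rows, then invalidates every deleted hash so
// the tokens stop passing auth now instead of after the TTL.
func deregisterDaemon(db *fakeDB, cache mapCache, ws, daemon string) int {
	hashes := db.deleteDaemonTokens(ws, daemon)
	for _, h := range hashes {
		cache.Invalidate(h)
	}
	return len(hashes)
}

func main() {
	db := &fakeDB{rows: []daemonRow{{"h1", "ws-1", "d-1"}, {"h2", "ws-1", "d-1"}, {"h3", "ws-1", "d-2"}}}
	cache := mapCache{"h1": "d-1", "h2": "d-1", "h3": "d-2"}
	n := deregisterDaemon(db, cache, "ws-1", "d-1")
	fmt.Println(n, len(cache)) // 2 1
}
```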
2026-04-29 17:07:54 +08:00


package auth

import (
	"context"
	"errors"
	"log/slog"
	"time"

	"github.com/redis/go-redis/v9"
)

// AuthCacheTTL bounds how long a token-hash lookup stays cached before
// the auth middleware goes back to Postgres. Shared by PATCache and
// DaemonTokenCache so both kinds of token follow the same revocation
// latency contract. Short enough that revocation lag from a missed
// invalidation is bounded; long enough that a high-frequency client
// (CLI, daemon) collapses from one DB round-trip per request to one
// per TTL window.
const AuthCacheTTL = 10 * time.Minute

// patCachePrefix namespaces auth-cache keys away from the realtime relay
// (ws:*) and local-skill (mul:local_skill:*) keys.
const patCachePrefix = "mul:auth:pat:"

// PATCache caches resolved PAT lookups in Redis. A nil *PATCache is safe
// to use — every method becomes a no-op or reports a cache miss, and the
// auth middleware degrades to direct DB lookups.
type PATCache struct {
	rdb *redis.Client
}

// NewPATCache returns a cache backed by rdb. Pass nil to disable caching;
// the returned *PATCache is safe to call but never hits Redis.
func NewPATCache(rdb *redis.Client) *PATCache {
	if rdb == nil {
		return nil
	}
	return &PATCache{rdb: rdb}
}

func patCacheKey(hash string) string { return patCachePrefix + hash }

// Get returns the cached user_id for a token hash. ok=false on cache miss
// or any Redis error — a dead Redis must not take down auth.
func (c *PATCache) Get(ctx context.Context, hash string) (userID string, ok bool) {
	if c == nil {
		return "", false
	}
	v, err := c.rdb.Get(ctx, patCacheKey(hash)).Result()
	if err != nil {
		if !errors.Is(err, redis.Nil) {
			slog.Warn("pat_cache: get failed; falling back to DB", "error", err)
		}
		return "", false
	}
	return v, true
}

// Set populates the cache with the given TTL. Callers MUST pass a TTL no
// longer than the token's remaining lifetime — otherwise an entry could
// outlive the PAT's expires_at and let an expired token pass auth on
// cache hit. Use TTLForExpiry to compute it from a token's expires_at.
//
// Errors are logged and swallowed — a cache write failure is not a
// request failure.
func (c *PATCache) Set(ctx context.Context, hash, userID string, ttl time.Duration) {
	if c == nil || ttl <= 0 {
		return
	}
	if err := c.rdb.Set(ctx, patCacheKey(hash), userID, ttl).Err(); err != nil {
		slog.Warn("pat_cache: set failed", "error", err)
	}
}

// TTLForExpiry returns the cache TTL for a token given its expires_at.
//   - Zero expiresAt (token never expires) → full AuthCacheTTL.
//   - expiresAt in the future → min(AuthCacheTTL, time until expiry).
//   - expiresAt at or before now → 0 (caller should skip caching; the
//     middleware shouldn't reach here because the SELECT already
//     filters expired tokens, but a TOCTOU between SELECT and Set is
//     possible).
//
// Pass time.Time{} when the token has no expiry (pgtype.Timestamptz with
// Valid=false maps to a zero Time).
func TTLForExpiry(now, expiresAt time.Time) time.Duration {
	if expiresAt.IsZero() {
		return AuthCacheTTL
	}
	remaining := expiresAt.Sub(now)
	if remaining <= 0 {
		return 0
	}
	if remaining < AuthCacheTTL {
		return remaining
	}
	return AuthCacheTTL
}

// Invalidate removes the entry for hash. Called on PAT revocation so the
// revoke takes effect immediately rather than waiting for the TTL.
func (c *PATCache) Invalidate(ctx context.Context, hash string) {
	if c == nil {
		return
	}
	if err := c.rdb.Del(ctx, patCacheKey(hash)).Err(); err != nil {
		slog.Warn("pat_cache: invalidate failed; entry will expire on TTL", "error", err)
	}
}