Files
multica/server/internal/handler/cloud_runtime.go
LinYushen c968c13c87 feat(auth): support mcn_ Cloud Node PATs verified via Fleet (#3349)
* feat(auth): support mcn_ Cloud Node PATs verified via Fleet

Adds a new token kind, mcn_ (multica cloud node), recognized in both
the regular Auth and DaemonAuth middlewares. mcn_ tokens are minted
and owned by Multica Cloud (not the local personal_access_tokens
table); the server validates them by POSTing to the Fleet's
/api/v1/pat/verify endpoint and uses the returned owner_id as
X-User-ID for downstream handlers.

Cloud is the authoritative owner of token status, so this is a
verifier-only path with no DB fallback:

  * Fleet says valid:false -> 401 (token genuinely bad)
  * Fleet unreachable / 5xx -> 503 (transient, retry)
  * No MULTICA_CLOUD_FLEET_URL configured -> 401 (fail closed)

Verification results are cached in Redis for 60s under
mul:auth:mcn:<sha256> to bound the per-request load on Fleet without
extending the revocation window beyond what the Cloud doc allows.
Negative results are NOT cached, so a freshly minted token doesn't
get locked out by a stale 'token_not_found'.

Reuses MULTICA_CLOUD_FLEET_URL (the same env the cloud-runtime proxy
already uses) so deployments don't need a second config knob.

Tests cover the happy path, every documented invalid reason, 4xx/5xx
mapping, network error, decode error, ctx cancellation, the
fail-closed valid:true-without-owner_id case, trailing-slash URL
normalization, and the Redis cache short-circuit + negative
no-cache contract. Middleware tests pin the four 401/503/200 outcomes
in both Auth and DaemonAuth.

* auth(mcn): require owner_id to map to a real local user; drop X-User-PAT plumbing

Two related changes:

1. Cloud-verified owner_id is now checked against our local users table.
   The Cloud owner_id and our users.id share the same UUID space by
   contract; a missing local user means either the row was deleted
   under an active node or something is forging owner_ids — either
   way, fail closed.

   CloudPATVerifier.Verify takes a new OwnerLookupFunc:
     - returns (true, nil)   -> success, cache + return
     - returns (false, nil)  -> ErrCloudPATInvalid (reason='owner_unknown'),
                                NOT cached (so a freshly-created user
                                doesn't get locked out for a TTL window)
     - returns (_, error)    -> ErrCloudPATUnavailable (transient,
                                middleware emits 503)

   Both Auth and DaemonAuth wire ownerLookupFor(queries), a new shared
   helper that wraps queries.GetUser, mapping pgx.ErrNoRows / unparseable
   UUIDs to (false, nil) and other errors to a real Go error.

2. Removed all X-User-PAT plumbing. Cloud now mints node-scoped mcn_
   PATs itself during /api/v1/nodes (see multica-cloud
   docs/api/node-pat.md) and ships them into the EC2 instance via SSM,
   so multica-api no longer needs to forward the caller's mul_ PAT.
   Propagating a long-lived user PAT into a remote machine widened
   the blast radius of any node compromise; that's gone now.

   Removed:
     - cloud_runtime.go: withUserPAT option, cloudRuntimeUserPAT,
       generateCloudRuntimePAT, revokeGeneratedPAT
     - cloudruntime/Request.UserPAT field + X-User-PAT header
     - X-User-PAT from CORS allowed headers
     - obsolete handler tests:
         TestCreateCloudRuntimeNodeForwardsValidatedPAT
         TestCreateCloudRuntimeNodeRejectsUnownedPAT
         TestCreateCloudRuntimeNodeRejectsExpiredPAT
         TestCreateCloudRuntimeNodeAutoGeneratesPAT
       replaced with TestCreateCloudRuntimeNodeForwardsBody
     - X-User-PAT references in packages/core/api/client.test.ts

Tests:
  * 3 new verifier-level tests (owner_unknown not cached, lookup error
    -> Unavailable, success path is cached for both fleet AND lookup)
  * 5 new owner_lookup_test.go tests (nil queries, existing user,
    missing user, malformed UUID, DB error)
  * 1 new end-to-end DaemonAuth test (cloud says valid, no local user
    -> 401)
  * Existing X-User-PAT TS assertions removed; full vitest run passes.
  * go test ./... and go vet ./... clean on the server module.
2026-05-27 14:52:03 +08:00

209 lines
6.0 KiB
Go

package handler
import (
"bytes"
"context"
"encoding/json"
"errors"
"io"
"log/slog"
"net/http"
"net/url"
chimw "github.com/go-chi/chi/v5/middleware"
"github.com/multica-ai/multica/server/internal/cloudruntime"
"github.com/multica-ai/multica/server/internal/logger"
)
const maxCloudRuntimeRequestBodySize = 1 << 20
type cloudRuntimeProxyOptions struct {
withUserID bool
withQuery bool
withBody bool
}
func (h *Handler) GetCloudRuntimeService(w http.ResponseWriter, r *http.Request) {
h.proxyCloudRuntime(w, r, http.MethodGet, "/api/v1/", cloudRuntimeProxyOptions{
withUserID: true,
})
}
func (h *Handler) GetCloudRuntimeHealth(w http.ResponseWriter, r *http.Request) {
h.proxyCloudRuntime(w, r, http.MethodGet, "/healthz", cloudRuntimeProxyOptions{})
}
func (h *Handler) GetCloudRuntimeReady(w http.ResponseWriter, r *http.Request) {
h.proxyCloudRuntime(w, r, http.MethodGet, "/readyz", cloudRuntimeProxyOptions{})
}
func (h *Handler) ListCloudRuntimeNodes(w http.ResponseWriter, r *http.Request) {
h.proxyCloudRuntime(w, r, http.MethodGet, "/api/v1/nodes", cloudRuntimeProxyOptions{
withUserID: true,
withQuery: true,
})
}
func (h *Handler) CreateCloudRuntimeNode(w http.ResponseWriter, r *http.Request) {
// Cloud now mints a node-scoped mcn_ PAT itself during /api/v1/nodes
// and injects it into the EC2 instance via SSM bootstrap (see
// multica-cloud docs/api/node-pat.md). We no longer forward the
// caller's mul_ PAT — Fleet doesn't need it, and propagating a
// long-lived user PAT into a remote machine widened the blast
// radius of any node compromise. Hence the handler now mirrors
// the other write endpoints: just the body, no PAT plumbing.
h.proxyCloudRuntime(w, r, http.MethodPost, "/api/v1/nodes", cloudRuntimeProxyOptions{
withUserID: true,
withBody: true,
})
}
func (h *Handler) DeleteCloudRuntimeNode(w http.ResponseWriter, r *http.Request) {
h.proxyCloudRuntime(w, r, http.MethodDelete, "/api/v1/nodes", cloudRuntimeProxyOptions{
withUserID: true,
withBody: true,
})
}
func (h *Handler) StartCloudRuntimeNode(w http.ResponseWriter, r *http.Request) {
h.proxyCloudRuntime(w, r, http.MethodPost, "/api/v1/nodes/start", cloudRuntimeProxyOptions{
withUserID: true,
withBody: true,
})
}
func (h *Handler) StopCloudRuntimeNode(w http.ResponseWriter, r *http.Request) {
h.proxyCloudRuntime(w, r, http.MethodPost, "/api/v1/nodes/stop", cloudRuntimeProxyOptions{
withUserID: true,
withBody: true,
})
}
func (h *Handler) RebootCloudRuntimeNode(w http.ResponseWriter, r *http.Request) {
h.proxyCloudRuntime(w, r, http.MethodPost, "/api/v1/nodes/reboot", cloudRuntimeProxyOptions{
withUserID: true,
withBody: true,
})
}
func (h *Handler) GetCloudRuntimeNodeStatus(w http.ResponseWriter, r *http.Request) {
h.proxyCloudRuntime(w, r, http.MethodPost, "/api/v1/nodes/status", cloudRuntimeProxyOptions{
withUserID: true,
withBody: true,
})
}
func (h *Handler) ExecCloudRuntimeNode(w http.ResponseWriter, r *http.Request) {
h.proxyCloudRuntime(w, r, http.MethodPost, "/api/v1/nodes/exec", cloudRuntimeProxyOptions{
withUserID: true,
withBody: true,
})
}
func (h *Handler) proxyCloudRuntime(w http.ResponseWriter, r *http.Request, method, path string, opts cloudRuntimeProxyOptions) {
if h.CloudRuntime == nil || !h.CloudRuntime.Enabled() {
writeError(w, http.StatusServiceUnavailable, "cloud runtime is not configured")
return
}
var userID string
if opts.withUserID {
var ok bool
userID, ok = requireUserID(w, r)
if !ok {
return
}
}
var body []byte
if opts.withBody {
var ok bool
body, ok = readCloudRuntimeJSONBody(w, r)
if !ok {
return
}
}
var query url.Values
if opts.withQuery {
query = r.URL.Query()
}
resp, err := h.CloudRuntime.Do(r.Context(), cloudruntime.Request{
Method: method,
Path: path,
Query: query,
Body: body,
UserID: userID,
RequestID: cloudRuntimeRequestID(r),
})
if err != nil {
writeCloudRuntimeError(w, r, err)
return
}
writeCloudRuntimeResponse(w, resp)
}
func readCloudRuntimeJSONBody(w http.ResponseWriter, r *http.Request) ([]byte, bool) {
r.Body = http.MaxBytesReader(w, r.Body, maxCloudRuntimeRequestBodySize)
data, err := io.ReadAll(r.Body)
if err != nil {
var maxErr *http.MaxBytesError
if errors.As(err, &maxErr) {
writeError(w, http.StatusRequestEntityTooLarge, "request body is too large")
return nil, false
}
writeError(w, http.StatusBadRequest, "invalid request body")
return nil, false
}
if len(bytes.TrimSpace(data)) == 0 {
writeError(w, http.StatusBadRequest, "request body is required")
return nil, false
}
var raw json.RawMessage
if err := json.Unmarshal(data, &raw); err != nil {
writeError(w, http.StatusBadRequest, "invalid request body")
return nil, false
}
return data, true
}
func cloudRuntimeRequestID(r *http.Request) string {
if id := r.Header.Get("X-Request-ID"); id != "" {
return id
}
return chimw.GetReqID(r.Context())
}
func writeCloudRuntimeResponse(w http.ResponseWriter, resp *cloudruntime.Response) {
if requestID := resp.Header.Get("X-Request-ID"); requestID != "" {
w.Header().Set("X-Request-ID", requestID)
}
body := bytes.TrimSpace(resp.Body)
if len(body) == 0 {
w.WriteHeader(resp.StatusCode)
return
}
if json.Valid(body) {
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(resp.StatusCode)
w.Write(body)
return
}
writeJSON(w, resp.StatusCode, map[string]string{"error": string(body)})
}
func writeCloudRuntimeError(w http.ResponseWriter, r *http.Request, err error) {
switch {
case errors.Is(err, cloudruntime.ErrDisabled):
writeError(w, http.StatusServiceUnavailable, "cloud runtime is not configured")
case errors.Is(err, cloudruntime.ErrInvalidBaseURL):
writeError(w, http.StatusServiceUnavailable, "cloud runtime is misconfigured")
case errors.Is(err, context.DeadlineExceeded):
writeError(w, http.StatusGatewayTimeout, "cloud runtime request timed out")
default:
slog.Warn("cloud runtime request failed", append(logger.RequestAttrs(r), "error", err)...)
writeError(w, http.StatusBadGateway, "cloud runtime request failed")
}
}