ollama

mirror of https://github.com/ollama/ollama.git synced 2025-07-04 13:01:11 +02:00

Files

Jesse Gross ea79003180 kvcache: Skip computing causal mask for worst case graph reservation

Computing an attention mask for a large context and max batch is
expensive - over 100ms. Models like Gemma3 that have multiple types
of caches and custom attention masks need to do this 4 times, so this
adds approximately 500ms to startup time when using 128k context.

When we are reserving the worst case graph, we don't need the mask's
contents, only its shape, so we can skip the computation.

2025-05-27 14:25:15 -07:00

cache.go

…

causal_test.go

…

causal.go

kvcache: Skip computing causal mask for worst case graph reservation

2025-05-27 14:25:15 -07:00

encoder.go

…

wrapper.go

…