The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled. Flash attention can be used in the same situations as with the llama engine and is enabled by the user in the same way.
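As a rough illustration of the padding side of this, the sketch below rounds the number of cached tokens up to an alignment that a flash attention kernel could require before the attention mask is built. The `padCacheLength` helper and the `kvPadding` constant are hypothetical stand-ins, not the actual KV cache API.

```go
package main

import "fmt"

// kvPadding is a hypothetical alignment a flash attention kernel might
// require for the KV sequence length; GGML pads the cache more
// aggressively when flash attention is enabled than when it is not.
const kvPadding = 256

// padCacheLength rounds the number of cached tokens up to the next
// multiple of kvPadding so the kernel always sees an aligned length.
func padCacheLength(numTokens int) int {
	return ((numTokens + kvPadding - 1) / kvPadding) * kvPadding
}

func main() {
	for _, n := range []int{1, 255, 256, 300} {
		fmt.Printf("cached tokens: %4d -> padded length: %4d\n", n, padCacheLength(n))
	}
}
```

The unused tail of the padded region would be masked out, so padding changes only the shape the kernel operates on, not the attention result.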