highperfocused/ollama

mirror of https://github.com/ollama/ollama.git synced 2025-04-08 11:58:07 +02:00

Commit Graph

Author	SHA1	Message	Date
Jesse Gross	f53f4198c3	ml: Abstract attention out of model definitions	2025-02-21 13:16:21 -08:00

ml: Abstract attention out of model definitions

There are two benefits to doing this:
 - Provide a library function that models can use, reducing code for
   each model implementation
 - Enables a single place to drop in optimized implementations of
   attention based on the backend or other factors. One is provided for
   GGML.

On CUDA this improves token generation rate by about 3%. It does not
have a significant effect on Metal.

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>

1 commit