mirror of https://github.com/ollama/ollama.git synced 2025-03-20 23:02:48 +01:00

Jesse Gross 21aa666a1e ml: Enable support for flash attention

The GGML flash attention kernel has specific requirements for
padding and permutation. This adds support to the KV cache
for conforming to these requirements so that flash attention
can be enabled.

Flash attention can be used in the same situations as the llama
engine and is enabled by the user in the same way.

2025-03-01 20:53:23 -08:00

| Name | Last commit | Date |
| --- | --- | --- |
| common | Runner for Ollama engine | 2025-02-13 17:09:26 -08:00 |
| llamarunner | runner: defer context cancel | 2025-02-28 22:27:28 +00:00 |
| ollamarunner | ml: Enable support for flash attention | 2025-03-01 20:53:23 -08:00 |
| README.md | Runner for Ollama engine | 2025-02-13 17:09:26 -08:00 |
| runner.go | Runner for Ollama engine | 2025-02-13 17:09:26 -08:00 |

README.md

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding