mirror of https://github.com/ollama/ollama.git synced 2025-04-14 14:49:25 +02:00


jmorganca b42970063d kvcache: Add check for values that fall out of sliding window cache

The sliding window cache trims entries that are outside the window for
the latest token. This works when we are extending the cache, such as
when the conversation continues. However, if we have a partial overlap
in conversation (including the BOS tokens), then we resume from a past
point in the conversation and the needed tokens are no longer stored
in memory. This verifies that the new window overlaps with the old one
before reusing the cache.

Co-authored-by: Jesse Gross <jesse@ollama.com>

2025-04-02 11:55:48 -07:00
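The overlap check described in this commit message can be illustrated with a small, self-contained sketch. This is not the code from the commit: `oldestNeeded`, `canResume`, and the position-based model of the cache are hypothetical names and simplifications chosen for illustration. It assumes the cache keeps only the positions inside the window of the most recent token, and that a window of size `window` includes the token itself.

```go
package main

import "fmt"

// oldestNeeded returns the oldest position that a token at pos attends to
// under a sliding window of the given size (the window includes pos itself).
// Hypothetical helper, for illustration only.
func oldestNeeded(pos, window int) int {
	if start := pos - window + 1; start > 0 {
		return start
	}
	return 0
}

// canResume reports whether decoding can restart at resumePos while reusing
// a cache that was last trimmed for the token at lastPos.
// Hypothetical helper, for illustration only.
func canResume(lastPos, resumePos, window int) bool {
	if resumePos > lastPos+1 {
		// There would be a gap of positions that were never cached.
		return false
	}
	// Everything older than this position has already been trimmed.
	oldestStored := oldestNeeded(lastPos, window)
	// Reuse is safe only if the resumed token's window is still in the cache.
	return oldestNeeded(resumePos, window) >= oldestStored
}

func main() {
	const window = 4
	fmt.Println(canResume(9, 10, window)) // extending the conversation
	fmt.Println(canResume(9, 3, window))  // resuming from a trimmed region
	fmt.Println(canResume(2, 1, window))  // nothing has been trimmed yet
}
```

In this simplified model, extending the conversation always passes the check, while resuming from an earlier point passes only if nothing that the resumed token needs has been trimmed yet.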

common

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

llamarunner

runner: clear cache when shift is not possible (#9433)

2025-03-31 12:54:45 -07:00

ollamarunner

kvcache: Add check for values that fall out of sliding window cache

2025-04-02 11:55:48 -07:00

README.md

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

runner.go

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

README.md

`runner`

Note: this is a work in progress

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding
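The same two requests can be sent from Go. The following is a rough sketch, assuming the runner is listening on `localhost:8080` as in the curl examples. The `/completion` and `/embedding` paths and the `prompt` field come from the commands above; `promptRequest`, `buildRequest`, and `post` are hypothetical names used only here. The README does not describe the response format, so the sketch prints the raw body without decoding it.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// promptRequest mirrors the JSON body used in the curl examples.
// Hypothetical type; only the "prompt" field is taken from the README.
type promptRequest struct {
	Prompt string `json:"prompt"`
}

// buildRequest encodes a prompt as the JSON body expected by the runner.
func buildRequest(prompt string) ([]byte, error) {
	return json.Marshal(promptRequest{Prompt: prompt})
}

// post sends body as application/json and returns the raw response text.
func post(url string, body []byte) (string, error) {
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	raw, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	// Assumed base URL, matching the curl examples.
	const base = "http://localhost:8080"

	requests := []struct{ path, prompt string }{
		{"/completion", "hi"},
		{"/embedding", "turn me into an embedding"},
	}
	for _, r := range requests {
		body, err := buildRequest(r.prompt)
		if err != nil {
			fmt.Println("encode error:", err)
			continue
		}
		out, err := post(base+r.path, body)
		if err != nil {
			fmt.Println("request error:", err)
			continue
		}
		fmt.Println(r.path, "->", out)
	}
}
```

If the runner is not running, each request reports a connection error and the program exits normally.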