mirror of https://github.com/ollama/ollama.git synced 2025-04-15 23:21:28 +02:00

Jesse Gross 3ed7ad3ab3 kvcache: Pass granular cache size into implementations

Currently the runner computes the kv size needed and creates a
cache of that size. This is the context size times number of
parallel sequences.

Cache implementations can make better decisions about their memory
usage, so instead pass in the required capacity, number of sequences
and maximum batch size. For now, the causal cache just uses this to
compute the size in the same way as before.

2025-03-21 11:20:19 -07:00

common

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

llamarunner

llm: remove internal subprocess req and resp types (#9324)

2025-03-14 15:21:53 -07:00

ollamarunner

kvcache: Pass granular cache size into implementations

2025-03-21 11:20:19 -07:00

README.md

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

runner.go

Runner for Ollama engine

2025-02-13 17:09:26 -08:00

README.md

`runner`

Note: this is a work in progress.

A minimal runner for loading a model and running inference via an HTTP web server.

./runner -model <model binary>

Completion

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "hi"}' http://localhost:8080/completion

Embeddings

curl -X POST -H "Content-Type: application/json" -d '{"prompt": "turn me into an embedding"}' http://localhost:8080/embedding