package kvcache

import (
	"errors"

	"github.com/ollama/ollama/ml"
	"github.com/ollama/ollama/model/input"
)

var (
	ErrKvCacheFull  = errors.New("could not find a kv cache slot")
	ErrNotSupported = errors.New("model does not support operation")
)

type Cache interface {
	// ** used by model implementations **

	// SetLayer sets the active layer of the cache
	SetLayer(layer int)

	// Get returns the history of key and value tensors plus a mask
	//
	// The shape of the tensors is documented in the specific
	// cache implementation used.
	Get(ctx ml.Context) (ml.Tensor, ml.Tensor, ml.Tensor)

	// Put stores a batch of key and value in the cache
	//
	// The shape of the tensors is documented in the specific
	// cache implementation used.
	Put(ctx ml.Context, key, value ml.Tensor)
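
	// A minimal sketch of how a model's attention step typically uses Put and
	// Get together (the tensor and variable names here are illustrative, not
	// part of this interface):
	//
	//	cache.Put(ctx, key, value)           // append this batch's keys/values
	//	keys, values, mask := cache.Get(ctx) // full history plus attention mask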

	// SetConfig controls optimizations (mostly backend-specific) that may transform
	// the output of the cache to work better with specific kernels. If not called,
	// the backend settings will be used. This works well when calling Attention.
	//
	// The config can be overridden by models, especially if they require vanilla
	// output when implementing their own version of attention. To do this, pass
	// an empty ml.CacheConfig.
	//
	// Most models will not need to use this.
	SetConfig(ml.CacheConfig)
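
	// For example, a hypothetical model that implements its own attention and
	// needs untransformed cache output might call:
	//
	//	m.cache.SetConfig(ml.CacheConfig{})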

	// ** cache management **

	// Init sets up runtime parameters.
	// backend: Used to allocate cache data storage and execute management operations (such as defrag)
	// dtype: The data type for storing cache entries
	// maxSequences: The maximum number of sequences stored in the cache, across all batches
	// capacity: The number of cache entries to store per sequence
	// maxBatch: The maximum number of tokens that can occur in a single batch
	Init(backend ml.Backend, dtype ml.DType, maxSequences, capacity, maxBatch int)
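
	// For instance, a runner configured for 2 parallel sequences, a 4096-entry
	// context per sequence, and batches of at most 512 tokens might call
	// (illustrative values only; backend and dtype come from the runtime setup):
	//
	//	cache.Init(backend, dtype, 2, 4096, 512)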

	// Close closes the cache and frees resources associated with it
	Close()

	// StartForward is called before the start of the model's forward pass.
	// For each token in the coming batch, there must be a corresponding
	// entry in positions and seqs.
	StartForward(ctx ml.Context, batch input.Batch) error

	// CopyPrefix copies tokens in the range [0, len) from srcSeq to dstSeq
	CopyPrefix(srcSeq, dstSeq int, len int32)
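
	// For example, forking sequence 1 from sequence 0 so that it reuses the
	// first promptLen cached tokens (promptLen is a hypothetical value):
	//
	//	cache.CopyPrefix(0, 1, promptLen)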

	// Remove deletes tokens in the range [beginIndex, endIndex) from seq. Set
	// endIndex to math.MaxInt32 to remove everything starting at beginIndex.
	//
	// If an error occurs, the entire context for the sequence should be
	// removed by calling Remove(seq, 0, math.MaxInt32)
	Remove(seq int, beginIndex, endIndex int32) error
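
	// A sketch of the fallback described above when trimming a sequence back
	// to a reused prefix (prefixLen is hypothetical):
	//
	//	if err := cache.Remove(seq, prefixLen, math.MaxInt32); err != nil {
	//		// Partial removal failed; discard the whole sequence and
	//		// reprocess it from scratch.
	//		cache.Remove(seq, 0, math.MaxInt32)
	//	}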
}
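
// forwardSketch is an illustrative sketch, not part of this package's API. It
// shows the call sequence a model implementation typically uses to drive a
// Cache during one forward pass: StartForward once per batch, then SetLayer,
// Put, and Get inside each layer's attention. The layer count and the nil
// key/value tensors are placeholders for values a real model would produce.
func forwardSketch(ctx ml.Context, cache Cache, batch input.Batch, numLayers int) error {
	if err := cache.StartForward(ctx, batch); err != nil {
		return err
	}

	for i := 0; i < numLayers; i++ {
		cache.SetLayer(i)

		// In a real model, key and value come from this layer's attention projections.
		var key, value ml.Tensor
		cache.Put(ctx, key, value)

		// Get returns the cached history of keys and values plus an attention
		// mask, which the attention computation would consume.
		keys, values, mask := cache.Get(ctx)
		_, _, _ = keys, values, mask
	}

	return nil
}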