19 Commits

Author SHA1 Message Date
jmorganca
c6b6938b3a kvcache: fix tests by adding AvgPool2D stub 2025-03-11 14:49:20 -07:00
Jesse Gross
a8e83a7654 Disable causal attention based on batch index
Currently we are using positions, which are relative to a
sequence and may not be unique.
2025-03-11 14:49:20 -07:00
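
A minimal sketch of the distinction behind the change above, using
hypothetical names: positions restart at zero for every sequence, so two
entries in a batch can share a position, while the batch index is unique
and can safely identify the range whose causality should be disabled.

    package main

    import "fmt"

    // entry stands in for one token in a batch.
    type entry struct {
        seq      int // sequence the token belongs to
        pos      int // position within that sequence (restarts at 0 per sequence)
        batchIdx int // index within the current batch (always unique)
    }

    // nonCausal reports whether an entry falls in the batch-index range whose
    // causality is disabled (e.g. image tokens).
    func nonCausal(e entry, start, end int) bool {
        return e.batchIdx >= start && e.batchIdx < end
    }

    func main() {
        batch := []entry{
            {seq: 0, pos: 0, batchIdx: 0},
            {seq: 1, pos: 0, batchIdx: 1}, // same position as entry 0, different sequence
            {seq: 1, pos: 1, batchIdx: 2},
        }
        for _, e := range batch {
            fmt.Printf("seq %d pos %d batchIdx %d: nonCausal=%v\n",
                e.seq, e.pos, e.batchIdx, nonCausal(e, 1, 3))
        }
    }
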
Michael Yang
e95278932b use non-causal mask only for image positions 2025-03-11 14:49:19 -07:00
Jesse Gross
0e886595bf Fix tests and drift from main 2025-03-11 14:49:18 -07:00
Jesse Gross
4346c2409d fix drift from main 2025-03-11 14:49:18 -07:00
Patrick Devine
5f74d1fd47 gemma2 impl 2025-03-11 14:35:08 -07:00
Jesse Gross
a1cda80bcb model: Update encoder cache to use multimodal input processing handler
The encoder cache needs to know the position of images in the input
stream so that it knows when to delete them. Previously images didn't
have a position, so we implied one by breaking batches before an
image and then assuming the image was in the first position. However,
multimodal objects are now given explicit positions in the input
stream, so we can use that instead.

Breaking batches was also a way to simulate a cross attention mask
for mllama. However, given that it only supports a single sequence
and a single image, this mask doesn't serve any real purpose.
Removing the batch break does not appear to affect the quality of
the output.

Most of this is simply moving the input data structures to a new
package to avoid import cycles.
2025-03-09 17:05:26 -07:00
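
A hedged sketch of an input stream in which a multimodal object carries an
explicit position; the Input type and its fields here are illustrative
stand-ins, not the actual structures from the new package.

    package main

    import "fmt"

    // Input is one element of the input stream: either a text token or a
    // multimodal object (e.g. an image), never both.
    type Input struct {
        Token      int32
        Multimodal any // nil for plain text tokens
    }

    func main() {
        inputs := []Input{
            {Token: 101},
            {Multimodal: "image-0"}, // the image's position is simply its index: 1
            {Token: 2031},
            {Token: 2200},
        }

        // The encoder cache can find image positions directly rather than
        // assuming an image sits at the start of a broken-off batch.
        for pos, in := range inputs {
            if in.Multimodal != nil {
                fmt.Printf("multimodal input at position %d\n", pos)
            }
        }
    }
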
Jesse Gross
f52b2615ef kvcache: Set context for shift offsets 2025-03-07 18:43:39 -08:00
Jesse Gross
6da8b6a879 kvcache: Support non-causal attention
Models can disable causality for all or part of their processing
while continuing to store data in the KV cache.
2025-03-07 18:39:27 -08:00
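
A hedged sketch of the behavior described above, with hypothetical method
names rather than the cache's real API: masking can be switched between
causal and non-causal while entries keep being stored in the same cache.

    package main

    import "fmt"

    // Cache is a stand-in for the KV cache; SetCausal and mask are hypothetical.
    type Cache struct {
        causal bool
    }

    // SetCausal toggles whether subsequently built masks are causal. Stored
    // keys and values are unaffected either way.
    func (c *Cache) SetCausal(causal bool) { c.causal = causal }

    // mask reports whether query index q may attend to cache index k.
    func (c *Cache) mask(q, k int) bool {
        if !c.causal {
            return true // e.g. image tokens attend to each other freely
        }
        return k <= q
    }

    func main() {
        c := &Cache{causal: true}
        fmt.Println("causal, future entry visible:", c.mask(1, 2)) // false

        c.SetCausal(false) // model disables causality for part of its processing
        fmt.Println("non-causal, future entry visible:", c.mask(1, 2)) // true

        c.SetCausal(true) // and restores it afterwards
    }
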
Michael Yang
58b9ec1f6b kvcache: update tests 2025-03-07 14:08:21 -08:00
Michael Yang
7bae7fa5ce ml/backend/ggml: create tensor on specific backend
some tensors should be created on specific backends to reduce the number
of copies and improve performance
2025-03-07 14:08:21 -08:00
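
A hedged illustration of the intent, with made-up types: choosing the
backend at creation time means a tensor is not allocated on one device and
then copied to the device that actually uses it.

    package main

    import "fmt"

    // Backend and Tensor are stand-ins for the real backend types.
    type Backend string

    const (
        CPU Backend = "CPU"
        GPU Backend = "GPU"
    )

    type Tensor struct {
        name    string
        backend Backend
    }

    // newTensorOn creates a tensor directly on the backend that will consume
    // it, avoiding a later copy from a default location.
    func newTensorOn(b Backend, name string) *Tensor {
        return &Tensor{name: name, backend: b}
    }

    func main() {
        tokens := newTensorOn(CPU, "tokens") // produced on the host every batch
        hidden := newTensorOn(GPU, "hidden") // consumed where the matmuls run
        fmt.Printf("%s on %s, %s on %s\n", tokens.name, tokens.backend, hidden.name, hidden.backend)
    }
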
Michael Yang
764e199d67 kvcache: create cache ctx per layer
each cache layer creates and maintains its own context instead of using
a large context for all layers
2025-03-07 14:08:21 -08:00
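
A hedged sketch of the per-layer arrangement using stand-in types: each
cache layer owns a small context of its own instead of every layer sharing
one large context.

    package main

    import "fmt"

    // Context stands in for a backend allocation context.
    type Context struct{ layer int }

    // cacheLayer owns its own context, so it can be allocated, grown, or
    // freed without touching any other layer.
    type cacheLayer struct {
        ctx *Context
    }

    func newCache(nLayers int) []cacheLayer {
        layers := make([]cacheLayer, nLayers)
        for i := range layers {
            layers[i] = cacheLayer{ctx: &Context{layer: i}}
        }
        return layers
    }

    func main() {
        for _, l := range newCache(4) {
            fmt.Printf("cache layer %d has its own context\n", l.ctx.layer)
        }
    }
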
Jesse Gross
21aa666a1e ml: Enable support for flash attention
The GGML flash attention kernel has specific requirements for
padding and permutation. This adds support to the KV cache
for conforming to these requirements so that flash attention
can be enabled.

Flash attention can be used in the same situations as the llama
engine and is enabled by the user in the same way.
2025-03-01 20:53:23 -08:00
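
As with the llama engine, the feature is opt-in via the
OLLAMA_FLASH_ATTENTION environment variable when starting the server. The
sketch below illustrates the padding side of the change; the granularity
used (256) is an assumption for illustration, not necessarily the kernel's
actual requirement.

    package main

    import "fmt"

    const kvPad = 256 // assumed padding granularity for the flash attention kernel

    // paddedLen rounds the number of cached entries up so the mask and KV
    // views handed to the kernel have the shape it expects.
    func paddedLen(n int) int {
        return (n + kvPad - 1) / kvPad * kvPad
    }

    func main() {
        for _, n := range []int{1, 255, 256, 300} {
            fmt.Printf("%d cached entries -> padded to %d\n", n, paddedLen(n))
        }
    }
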
Jesse Gross
ee141cc821 ml: Empty tensor constructor for tensors
In cases where we allocate a tensor and then fully overwrite it with
copied data, it is wasteful to first zero out the memory.
2025-03-01 20:53:23 -08:00
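
A hedged sketch of the idea with hypothetical constructor names: Zeros pays
for a clearing pass, Empty does not, and the copy that follows overwrites
the buffer anyway.

    package main

    import "fmt"

    // Tensor is a stand-in for a backend tensor. (Go's make already zeroes
    // slices; the explicit loop below stands in for the backend-level
    // clearing pass that Empty skips.)
    type Tensor struct{ data []float32 }

    // Zeros allocates and clears the buffer.
    func Zeros(n int) *Tensor {
        t := &Tensor{data: make([]float32, n)}
        for i := range t.data {
            t.data[i] = 0 // the clearing pass that Empty avoids
        }
        return t
    }

    // Empty allocates without clearing; the caller promises to overwrite
    // every element before reading it.
    func Empty(n int) *Tensor { return &Tensor{data: make([]float32, n)} }

    func main() {
        src := []float32{1, 2, 3, 4}
        dst := Empty(len(src)) // no zeroing needed...
        copy(dst.data, src)    // ...because the copy overwrites everything
        fmt.Println(dst.data, len(Zeros(2).data))
    }
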
Jesse Gross
854a9195f3 attention: Remove unnecessary contiguous operations
Prior to performing attention, we need to permute query, key
and value. Currently we call Contiguous after each of these
permutations, which is correct but expensive. Avoiding the
3 calls to Contiguous increases performance by over 20%.

The permutations of query and key do not violate the continuity
rules for mulmat, so those Contiguous calls can simply be removed.

Value requires a different permutation and does require Contiguous.
However, we can use the copy into the cache as a way to perform this
without further overhead.

To support this and avoid unexpected tensor shapes that are seen by
models, we need tighter integration between attention, cache
and backend. Future optimization will also likely need this structure;
for example, flash attention has special padding requirements in the
cache, and other backends may have their own needs.

This further contains the operations that go into attention so that
these and other optimizations can be handled transparently. Models
that have special requirements for attention can still implement
their own version of it.
2025-03-01 20:53:23 -08:00
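
A hedged sketch of the resulting shape of attention, written against a
made-up tensor interface rather than the real ml package: query and key are
permuted without Contiguous, and value's layout change rides along with the
copy into the cache.

    package attention

    // Tensor is a stand-in for the backend tensor type. Contiguous is
    // declared only to show that it is no longer called on the permuted views.
    type Tensor interface {
        Permute(order ...int) Tensor
        Contiguous() Tensor
        MulMat(other Tensor) Tensor
    }

    // Cache stands in for the KV cache. Put copies key and value into cache
    // storage; that copy is where value picks up its required layout for free.
    type Cache interface {
        Put(key, value Tensor)
        Get() (key, value Tensor)
    }

    func Attention(query, key, value Tensor, cache Cache) Tensor {
        // Storing value in the cache already performs a copy, so its layout
        // change needs no separate Contiguous call.
        cache.Put(key, value)
        k, v := cache.Get()

        // The permuted views of query and key still satisfy mulmat's
        // continuity rules, so no Contiguous call is needed here either.
        q := query.Permute(0, 2, 1, 3)
        kq := k.Permute(0, 2, 1, 3).MulMat(q)

        // scaling and softmax are omitted in this sketch
        return v.MulMat(kq)
    }
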
Michael Yang
8b194b7520 kvcache: update tests 2025-02-27 22:27:16 +00:00
Michael Yang
3e8b8a1933 ml: update Context.Forward interface
update Context.Forward to accept multiple tensors to match
Context.Compute signature

update Context.Forward to return Context such that it can be chained
with Context.Compute
2025-02-27 22:27:16 +00:00
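
A hedged sketch of the interface shape described above, with simplified
types standing in for the real ml package.

    package ml

    // Tensor stands in for the backend tensor type.
    type Tensor interface{}

    type Context struct{ graph []Tensor }

    // Forward records the tensors in the compute graph and returns the
    // Context so the call can be chained with Compute.
    func (c *Context) Forward(tensors ...Tensor) *Context {
        c.graph = append(c.graph, tensors...)
        return c
    }

    // Compute evaluates the graph for the requested tensors; the signature
    // now mirrors Forward.
    func (c *Context) Compute(tensors ...Tensor) {
        // backend evaluation elided in this sketch
    }

    // Usage: ctx.Forward(hidden, logits).Compute(hidden, logits)
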
Daniel Hiltgen
df2680b4b9 Wire up system info log for new engine (#9123) 2025-02-14 15:55:33 -08:00
Jesse Gross
ed443a0393 Runner for Ollama engine
This provides integration with the new Ollama engine
(5824541 next ollama runner (#7913)) and the rest of the Ollama
infrastructure such as the runner and Ollama server.

In addition, it also builds out the KV cache infrastructure to
support the requirements of how Ollama runs models, such as:
 - Parallel processing
 - Memory management for defragmentation and shifting
 - Multi-modal models

Both old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:

Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve

Run a model that is supported by the Ollama engine, for example Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1
2025-02-13 17:09:26 -08:00