ollama

mirror of https://github.com/ollama/ollama.git synced 2025-11-11 20:17:33 +01:00

Author	SHA1	Message	Date
Michael Yang	6f7117145f	batch: use tensors for outputs (#12185) this cleans up the model interface slightly without too much impact in other areas	2025-09-15 14:33:06 -07:00
Daniel Hiltgen	517807cdf2	perf: build graph for next batch async to keep GPU busy (#11863) * perf: build graph for next batch in parallel to keep GPU busy This refactors the main run loop of the ollama runner to perform the main GPU-intensive tasks (Compute+Floats) in a goroutine so we can prepare the next batch in parallel to reduce the amount of time the GPU stalls waiting for the next batch of work. * tests: tune integration tests for ollama engine This tunes the integration tests to focus more on models supported by the new engine.	2025-08-29 14:20:28 -07:00
Michael Yang	30fb7e19f8	remove extra field attr (#11205)	2025-08-25 09:58:16 -07:00
Jesse Gross	1f371ea92f	ml: Panic rather than return error on tensor allocation failure FromFloatSlice and FromIntSlice return an error if the shape doesn't match the passed data or if memory can't be allocated. Since these are inputs, the memory being allocated is system memory rather than VRAM. In many cases, the caller can't really handle the error and panics. Empty and Zeros directly panic if they can't allocate memory. This makes things consistent by panicking for the first two cases, removing a fair amount of error handling code. This is also consistent with how Go typically handles these situations.	2025-05-22 14:38:09 -07:00
Michael Yang	333e360422	model: handle multiple eos tokens (#10577) * get eos_token_id from generation_config.json * refactor * include both ids and strings in trace * comments * remove special case for gemma3 special vocab (#10743)	2025-05-16 13:40:23 -07:00
Jesse Gross	3c14461d5d	ollamarunner: Separate text and multimodal graphs For some multimodal models (such as gemma3), we create a single graph that generates the image embedding and then use this in the text model. The embedding tensor is completely opaque to the runner. However, this doesn't work if we need to use the embedding in multiple batches. This can arise if the embedding is larger than the batch size. In these cases (as with llama4), we would like to create views that are more appropriately sized. However, if we do this then the original source tensor is used in multiple graphs, which isn't allowed. To avoid that problem, models with this pattern compute the embedding tensor on first use and recreate the individual views. There is no longer a single vision and text graph. This codifies the pattern of separating vision and text graphs. The logic of computing tensors on demand is moved to the runner, so models no longer have to worry about this. It also gives the runner visibility into the multimodal tensors, which is important for memory management.	2025-05-15 13:46:20 -07:00
Michael Yang	526b2ed102	fix vocabulary (#10679)	2025-05-12 17:29:46 -07:00
Michael Yang	5cfc1c39f3	model: fix build (#10416)	2025-04-25 19:24:48 -07:00
Michael Yang	7ba9fa9c7d	fixes for maverick	2025-04-25 16:59:20 -07:00
Michael Yang	8bf11b84c1	chunked attention	2025-04-25 16:59:20 -07:00
Michael Yang	470af8ab89	connect vision to text	2025-04-25 16:59:20 -07:00
Michael Yang	178761aef3	image processing Co-authored-by: Patrick Devine <patrick@infrahq.com>	2025-04-25 16:59:20 -07:00
Michael Yang	f0c66e6dea	llama4	2025-04-25 16:59:20 -07:00

13 Commits
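
Commit 517807cdf2 describes running the GPU-intensive step in a goroutine so that the next batch can be prepared in parallel. The following is a minimal sketch of that general pattern only, not the runner's actual code: the `batch` type and the `prepare`, `compute`, and `runPipelined` functions are hypothetical names invented for illustration, and the "GPU" work is simulated by a simple sum.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for a prepared compute graph and its inputs.
type batch struct {
	id     int
	inputs []float32
}

// prepare simulates the CPU-side work of building the next batch.
func prepare(id int) batch {
	return batch{id: id, inputs: []float32{float32(id), float32(id) + 0.5}}
}

// compute simulates the GPU-bound step; here it only sums the inputs.
func compute(b batch) float32 {
	var sum float32
	for _, v := range b.inputs {
		sum += v
	}
	return sum
}

// runPipelined processes n batches. While batch i is being computed in a
// goroutine, the calling goroutine prepares batch i+1, so the two steps overlap.
func runPipelined(n int) []float32 {
	results := make([]float32, 0, n)
	if n <= 0 {
		return results
	}
	next := prepare(0)
	for i := 0; i < n; i++ {
		cur := next // the goroutine only reads this per-iteration copy
		done := make(chan float32, 1)
		go func() { done <- compute(cur) }()
		if i+1 < n {
			next = prepare(i + 1) // overlaps with compute(cur)
		}
		results = append(results, <-done) // wait before starting the next compute
	}
	return results
}

func main() {
	fmt.Println(runPipelined(3)) // prints [0.5 2.5 4.5]
}
```

In this sketch only one compute step is in flight at a time, which keeps results in order; the overlap comes solely from preparing the following batch while waiting on the channel.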