4139 Commits

Author SHA1 Message Date
Michael Yang
7946618dc3 metal: op_neg 2025-04-02 16:55:06 -07:00
Michael Yang
a9220da3b6 s/gelu/silu/ 2025-04-02 16:13:19 -07:00
Michael Yang
394b69dece mistral3 quantization 2025-04-02 13:17:06 -07:00
Michael Yang
bfdd02472c remove unused rope 2025-04-02 12:24:02 -07:00
Michael Yang
7465c0118e mistral3 memory
mistral3 graph is very similar to gemma3 so use that for now
2025-04-02 12:11:15 -07:00
Michael Yang
87cf2fa1b8 compute image embeddings once 2025-04-02 11:44:06 -07:00
Michael Yang
2ab14468a8 ml: add repeat op
repeat is a convenience operation for repeating a tensor n times along a
dimension. this can replace instances where the same tensors are stacked
together
2025-04-01 15:25:20 -07:00
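A minimal sketch of what such a convenience op can look like, assuming simplified `Context`/`Tensor` interfaces with a `Concat` method; this is illustrative, not the actual `ml` package API.

```go
package ml

// Assumed, simplified interfaces for illustration only; the real ml
// package defines richer Context and Tensor types.
type Context interface{}

type Tensor interface {
	Concat(ctx Context, t Tensor, dim int) Tensor
}

// Repeat is a hypothetical sketch: repeat t n times along dim by
// concatenating it with itself, replacing call sites that stacked the
// same tensor together manually.
func Repeat(ctx Context, t Tensor, n, dim int) Tensor {
	out := t
	for i := 1; i < n; i++ {
		out = out.Concat(ctx, t, dim)
	}
	return out
}
```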
Michael Yang
90d8a1e8a2 mistral-small: use ollama engine 2025-04-01 13:58:23 -07:00
Michael Yang
1ca8eb5c05 use fast attention 2025-04-01 13:58:23 -07:00
Michael Yang
8aec3e1374 fix image embeddings 2025-04-01 13:02:14 -07:00
Michael Yang
e6b561005e fix patch batch 2025-03-31 17:21:47 -07:00
Michael Yang
6184028fc0 fix convert 2025-03-31 13:12:34 -07:00
Michael Yang
557c641697 2d rope 2025-03-31 10:34:17 -07:00
Michael Yang
863ba57477 fixes 2025-03-25 13:57:24 -07:00
Bruce MacDonald
dce7cf2a1a remove debugging code 2025-03-24 09:48:20 -07:00
jmorganca
62108621d5 update comment 2025-03-23 23:11:47 -07:00
jmorganca
a1c8b0fdb0 cleanup 2025-03-23 22:49:56 -07:00
jmorganca
3daa26e8e8 remove unneeded conversion replacement 2025-03-23 21:57:35 -07:00
jmorganca
1663ef289c remove large files 2025-03-23 21:43:17 -07:00
jmorganca
4586e137fe wip 2025-03-23 21:41:18 -07:00
jmorganca
cfeca27133 wip 2025-03-23 01:01:23 -07:00
jmorganca
4530661799 wip 2025-03-22 23:20:39 -07:00
jmorganca
8dd2a81f8c wip 2025-03-22 22:33:39 -07:00
jmorganca
caddb1e4cf rebased 2025-03-22 10:15:52 -07:00
Bruce MacDonald
4d8dac8ffc wip 2025-03-22 10:03:23 -07:00
Bruce MacDonald
63e6509ec0 vision conversion 2025-03-22 10:03:22 -07:00
Bruce MacDonald
6f34126dcc image processing 2025-03-22 10:03:22 -07:00
Bruce MacDonald
ecc0ef468f split text model to its own file 2025-03-22 10:03:22 -07:00
Bruce MacDonald
9b57238834 ... 2025-03-22 10:03:22 -07:00
Bruce MacDonald
3b4ad00a4b mistral3 arch 2025-03-22 10:03:22 -07:00
Bruce MacDonald
9a12fd1067 wip: test fixes 2025-03-22 10:03:22 -07:00
Bruce MacDonald
edac05387f convert: mistral-3.1-2503 text component 2025-03-22 10:03:22 -07:00
Bruce MacDonald
e65cf9dc94 minimal convert 2025-03-22 10:03:22 -07:00
jmorganca
7e3c62f388 wip 2025-03-22 10:03:22 -07:00
jmorganca
a75703b2cc wip 2025-03-22 10:03:22 -07:00
Bruce MacDonald
c24e8860c1 model: support for mistral-small in the ollama runner
Mistral is a popular research lab making open-source models. This updates
the forward pass of llama-architecture models to support both llama and
mistral models by accounting for additional metadata present in mistral
models and by finding the correct dimensions for the output projection.
2025-03-22 10:03:22 -07:00
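A hedged sketch of the "correct dimensions for the output projection" idea: read the size from the weight tensor's own shape rather than assuming it matches the hidden size. The `Tensor`/`Dim` names are assumptions, not the runner's actual types.

```go
package model

// Tensor is an assumed, minimal interface for illustration.
type Tensor interface {
	Dim(i int) int
}

// outputDim is a hypothetical helper: trust the output projection weight's
// actual shape, which lets the same llama forward pass serve mistral
// checkpoints whose projection size differs from the embedding size.
func outputDim(outputWeight Tensor, hiddenSize int) int {
	if d := outputWeight.Dim(1); d > 0 && d != hiddenSize {
		return d
	}
	return hiddenSize
}
```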
Blake Mizerany
ce929984a3
server/internal/client/ollama: fix file descriptor management in Pull (#9931)
Close chunked writers as soon as downloads complete, rather than
deferring closure until Pull exits. This prevents exhausting file
descriptors when pulling many layers.

Instead of unbounded defers, use a WaitGroup and background goroutine
to close each chunked writer as soon as its downloads finish.

Also rename 'total' to 'received' for clarity.
2025-03-21 16:16:38 -07:00
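A sketch of the pattern the message describes, with hypothetical `chunkedWriter` and download plumbing: a WaitGroup tracks one layer's chunk downloads and a background goroutine closes the writer as soon as they finish, instead of deferring every Close until Pull returns.

```go
package client

import "sync"

// chunkedWriter stands in for the real chunked download writer.
type chunkedWriter struct{}

func (w *chunkedWriter) Close() error { return nil }

// downloadLayer sketches the pattern: close each writer as soon as its own
// chunk downloads finish, so file descriptors are released incrementally.
func downloadLayer(w *chunkedWriter, chunks []string) {
	var wg sync.WaitGroup
	for range chunks {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// ... download one chunk into w ...
		}()
	}
	go func() {
		wg.Wait()
		_ = w.Close() // released here, not when Pull exits
	}()
}
```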
Michael Yang
4b34930a31
Merge pull request #9897 from ollama/mxyng/chunk-load
ml/backend/ggml: load tensors in 128KiB chunks
v0.6.3-rc0
2025-03-21 14:47:13 -07:00
Michael Yang
74bd09652d ml/backend/ggml: load tensors in 128KiB chunks 2025-03-21 14:43:52 -07:00
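A minimal sketch of chunked tensor loading, assuming generic `io.Reader`/`io.Writer` endpoints; the constant mirrors the 128KiB chunk size in the commit title.

```go
package ggml

import "io"

const chunkSize = 128 * 1024 // 128KiB per read, per the commit title

// loadTensor is a sketch: copy a tensor's bytes in fixed-size chunks rather
// than one large read, keeping the staging buffer small and giving natural
// points to report progress or abort.
func loadTensor(dst io.Writer, src io.Reader, size int64) error {
	buf := make([]byte, chunkSize)
	for read := int64(0); read < size; {
		n := int64(chunkSize)
		if remaining := size - read; remaining < n {
			n = remaining
		}
		if _, err := io.ReadFull(src, buf[:n]); err != nil {
			return err
		}
		if _, err := dst.Write(buf[:n]); err != nil {
			return err
		}
		read += n
	}
	return nil
}
```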
Bruce MacDonald
fb6252d786
benchmark: performance of running ollama server (#8643) 2025-03-21 13:08:20 -07:00
Blake Mizerany
c794fef2f2
server/internal/client/ollama: persist through chunk download errors (#9923) 2025-03-21 13:03:43 -07:00
Parth Sareen
00ebda8cc4
Revert "parser: remove role validation from Modelfile parser" (#9917)
This reverts commit ffbfe833da387f9b6806fe887b85992c11d26eaa.
2025-03-21 12:38:09 -07:00
Parth Sareen
d14ce75b95
docs: update final response for /api/chat stream (#9919) 2025-03-21 12:35:47 -07:00
Jesse Gross
2d6eac9084 kvcache: Optimize sliding window attention
Currently sliding window attention allocates and uses the full
context size and just masks out any tokens that are outside of the
window. However, we really only need (roughly) the sliding window
size.

At large context sizes this improves two things:
 - Memory allocated - since the full context size no longer needs to be
   allocated up front, memory requirements drop substantially. On
   Gemma3:4b with a 32k
   context window, total memory usage (including weights and non-sliding
   layers) drops from ~20GB to ~8GB.
 - Computation - ranges that are completely outside of the sliding
   window are now removed from the tensors that are returned from the
   cache rather than simply being masked out. This results in more
   efficient processing, scaling with the size of the context that
   has actually been used.

Notably, this does not update the scheduler for any model to be aware of
the smaller memory requirements. This is difficult for Gemma3 because
the layers are heterogeneous between sliding and non-sliding attention.
As a result, while actual memory consumption will be reduced, the
scheduler will over-estimate the requirements of the model. This means
that splitting between GPUs or GPUs and CPUs will still be suboptimal.

Bug #9730
2025-03-21 11:20:19 -07:00
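A back-of-the-envelope sketch of the sizing change, with assumed names: a sliding-window layer only ever attends to the last windowSize positions, so its cache needs roughly window-plus-batch slots rather than the full context.

```go
package kvcache

// cacheSlots is a sketch of sizing a layer's KV cache. Non-sliding layers
// still need the full context; sliding-window layers only need (roughly)
// the window plus the batch currently in flight. Padding details are
// assumptions, not the real causal cache's logic.
func cacheSlots(contextLen, windowSize, maxBatch int) int {
	if windowSize <= 0 || windowSize >= contextLen {
		return contextLen
	}
	return windowSize + maxBatch
}
```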
Jesse Gross
3ed7ad3ab3 kvcache: Pass granular cache size into implementations
Currently the runner computes the kv size needed and creates a
cache of that size. This is the context size times the number of
parallel sequences.

Cache implementations can make better decisions about their memory
usage, so instead pass in the required capacity, number of sequences
and maximum batch size. For now, the causal cache just uses this to
compute the size in the same way as before.
2025-03-21 11:20:19 -07:00
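A hedged sketch of the interface shape described here: the cache receives the raw parameters and decides its own size, with the causal cache reproducing the old context-times-sequences calculation for now. The names and signature are illustrative.

```go
package kvcache

// Causal is an illustrative stand-in for the causal cache implementation.
type Causal struct {
	capacity int
}

// Init receives the per-sequence context length, the number of parallel
// sequences, and the maximum batch size instead of one precomputed total,
// so implementations can make their own memory decisions.
func (c *Causal) Init(contextLen, numSeqs, maxBatch int) {
	// For now, size the cache the same way as before.
	c.capacity = contextLen * numSeqs
	_ = maxBatch // available for smarter sizing, e.g. sliding-window layers
}
```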
Patrick Devine
6d1103048e
fix: show correct bool value for kv in verbose show information (#9928) 2025-03-21 11:13:54 -07:00
Jesse Gross
0ff28758b3 ollamarunner: Provide mechanism for backends to report loading progress
This enables the runner to report progress back to the Ollama server,
both for showing status to the user and to prevent the server
from killing the runner if it thinks things have stalled.

Most of the infrastructure was already there; this extends it to
be available to the backends.
2025-03-21 10:44:26 -07:00
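A sketch of what the reporting hook could look like, with an assumed callback type and loader: the backend invokes the callback as tensors load, and the runner forwards that status to the server (which also keeps its stall detection satisfied).

```go
package backend

// LoadProgressFn is an assumed callback type for illustration; the runner
// would forward these updates to the Ollama server.
type LoadProgressFn func(loaded, total int)

// loadTensors sketches a backend reporting progress as it loads tensors.
func loadTensors(names []string, progress LoadProgressFn) {
	total := len(names)
	for i := range names {
		// ... read tensor names[i] into device memory ...
		if progress != nil {
			progress(i+1, total)
		}
	}
}
```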
Jesse Gross
d3e9ca3eda kvcache: Account for source tensors in defrag operation count
Defragging the KV cache can generate a lot of operations, so we
need to be careful that we don't overflow the number that the graph
can support. We currently account for all of the nodes that we add
to the graph for each move, but we need to include the original
cache tensors as well.

Fixes #9904
2025-03-21 10:42:19 -07:00
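The budgeting concern can be illustrated with a small, assumed calculation: each planned move must count both the nodes it adds and the original cache tensors it references against the graph's node limit.

```go
package kvcache

// maxGraphNodes is an illustrative limit, not the real backend's value.
const maxGraphNodes = 8192

// movesPerGraph sketches the accounting: each defrag move contributes the
// ops it adds plus the source cache tensors it touches, and the total must
// stay under the graph's node budget.
func movesPerGraph(opsPerMove, srcTensorsPerMove int) int {
	return maxGraphNodes / (opsPerMove + srcTensorsPerMove)
}
```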
Jesse Gross
0fbfcf3c9c model: Pass input tensor instead of raw data to models
Rather than directly giving the input data to models, we can
pass a tensor instead. In the short term, this saves some duplicated
code.

Longer term, we will want to overlap setting up the next batch with
processing of the current one. In this case, we will only have the
shape of the tensor, but it will not be loaded with data at the time of
graph generation. By passing only a tensor to models now, we set up
this possibility and prevent them from relying on data that they won't
have in the future.

Although the same could be done for Positions and Outputs, in some
cases we either need the raw input data or don't use them at all.
Therefore, for now we leave them as they are and allow models to
convert them to tensors as needed.
2025-03-20 13:28:13 -07:00
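A sketch of the resulting model-facing shape, with assumed interfaces: the batch carries an input Tensor (whose data may not be populated yet at graph-build time) while Positions and Outputs remain raw slices.

```go
package model

// Tensor is an assumed placeholder for the backend tensor type.
type Tensor interface{}

// Batch is illustrative: Inputs is already a tensor so a graph can be built
// from its shape alone, while Positions and Outputs stay as raw data that
// models convert to tensors only if they need them.
type Batch struct {
	Inputs    Tensor
	Positions []int32
	Outputs   []int32
}

// Model is an assumed interface showing the tensor-based forward pass.
type Model interface {
	Forward(batch Batch) (Tensor, error)
}
```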
Jesse Gross
0c220935bd input: Rename Options to Batch
Options is no longer very descriptive of this struct.
2025-03-20 13:28:13 -07:00