ollama

mirror of https://github.com/ollama/ollama.git synced 2025-09-13 21:41:52 +02:00

Author	SHA1	Message	Date
Devon Rifkin	b2b270ad5d	Merge branch 'main' into drifkin/array-head-count-simple	2025-06-23 10:37:31 -07:00
Michael Yang	0a066cfd91	Reapply "feat: incremental gguf parser (#10822 )" (#11114 ) (#11119 ) * Reapply "feat: incremental gguf parser (#10822)" (#11114) This reverts commit `a6e64fbdf2`. * fix older ggufs	2025-06-20 11:11:40 -07:00
Jeffrey Morgan	a6e64fbdf2	Revert "feat: incremental gguf parser (#10822 )" (#11114 ) This reverts commit `6b04cad7e8`.	2025-06-18 05:42:44 -07:00
曹家巧	60cfa2a203	cache: fix comment function name in cache.go (#11110 )	2025-06-18 05:21:45 -07:00
Jeffrey Morgan	9f8a18ec05	tools: loosen tool parsing to allow for more formats (#11030 )	2025-06-12 14:18:54 -07:00
Michael Yang	6b04cad7e8	feat: incremental gguf parser (#10822 ) * incremental gguf parser * gguf: update test to not rely on gguf on disc * re-use existing create gguf * read capabilities from gguf kv * kv exists * update tests * s/doneFunc/successFunc/g * new buffered reader --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2025-06-12 11:04:11 -07:00
Jeffrey Morgan	09d308d6b6	Revert "server: add model capabilities to the list endpoint (#10174 )" (#11004 ) This reverts commit `0943001193`.	2025-06-06 23:29:14 -04:00
Devon Rifkin	a3b6886b7d	move thinking logic into its own package (#10990 ) move thinking logic into its own package	2025-06-06 12:02:20 -07:00
Devon Rifkin	0683efa637	export ThinkingParser	2025-06-05 10:22:32 -07:00
JasonHonKL	0943001193	server: add model capabilities to the list endpoint (#10174 )	2025-06-04 11:39:48 -07:00
Devon Rifkin	5f57b0ef42	add thinking support to the api and cli (#10584 ) - Both `/api/generate` and `/api/chat` now accept a `"think"` option that allows specifying whether thinking mode should be on or not - Templates get passed this new option so, e.g., qwen3's template can put `/think` or `/no_think` in the system prompt depending on the value of the setting - Models' thinking support is inferred by inspecting model templates. The prefix and suffix the parser uses to identify thinking support is also automatically inferred from templates - Thinking control & parsing is opt-in via the API to prevent breaking existing API consumers. If the `"think"` option is not specified, the behavior is unchanged from previous versions of ollama - Add parsing for thinking blocks in both streaming/non-streaming mode in both `/generate` and `/chat` - Update the CLI to make use of these changes. Users can pass `--think` or `--think=false` to control thinking, or during an interactive session they can use the commands `/set think` or `/set nothink` - A `--hidethinking` option has also been added to the CLI. This makes it easy to use thinking in scripting scenarios like `ollama run qwen3 --think --hidethinking "my question here"` where you just want to see the answer but still want the benefits of thinking models	2025-05-28 19:38:52 -07:00
Kyle Steere	9239a254e0	server: abort download on empty digest Signed-off-by: Kyle Steere <kyle.steere@chainguard.dev>	2025-05-27 11:28:48 -07:00
frob	eda472df1b	server: add hint to the error message when model path access fails (#10843 )	2025-05-24 13:17:04 -07:00
Parth Sareen	e8b981fa5d	tools: refactor tool call parsing and enable streaming (#10415 )	2025-05-23 14:19:31 -07:00
Daniel Hiltgen	d950ff12c0	sched: fix runner leak during reloading unload (#10819 ) When the same model is being reloaded rapidly with client connections being canceled before the model finishes loading, the queued unload event could cause a leak of runners by deleting a different runner from the loaded list.	2025-05-22 14:31:36 -07:00
Bruce MacDonald	fbe6ae285a	server: improve tensor quantization fallback logic (#10806 ) Fall back to alternative quantization types when a tensor's dimensions aren't divisible by the block size required for the original desired quantization type. If retried quantization types fail, the system ultimately falls back to F16 (half-precision floating point) which has a block size of 1 and can handle any tensor dimension.	2025-05-22 10:48:08 -07:00
Michael Yang	61aeaf7e81	remove support for multiple ggufs in a single file (#10722 ) * remove support for multiple ggufs in a single file this was an attempt to make it easier to import multimodal models into ollama. this was rarely used and error prone so remove it * fix: create fused model from blob	2025-05-21 13:55:31 -07:00
Daniel Hiltgen	1a0cfd080a	avoid kv truncation during create (#10761 )	2025-05-19 13:54:54 -07:00
Jesse Gross	94ab428e3f	ggml: Separate tensor load from backend creation Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This separates them to be two steps: - Create backend, including enumerating tensors and memory allocation - Loading tensor data This allows more flexibility in managing model loading.	2025-05-19 09:54:22 -07:00
Daniel Hiltgen	ff80718e9c	fix crash in old clients with quantization progress (#10710 ) Older clients assumed the digest was at least 19 characters long so increase the size of the dummy digest to avoid array out of bounds crashes.	2025-05-14 14:54:18 -07:00
Michael Yang	23125648b8	chore: update mllama to use ollama engine (#10637 )	2025-05-13 17:36:02 -07:00
Jeffrey Morgan	c7f4ae7b9c	server: add webp image input support (#10653 )	2025-05-12 20:41:42 -07:00
Daniel Hiltgen	9d6df90805	Follow up to #10363 (#10647 ) The quantization PR didn't block all unsupported file types, which this PR fixes. It also updates the API docs to reflect the now reduced set of supported types.	2025-05-12 15:23:31 -07:00
Bruce MacDonald	ad035ad595	convert: quantize from safetensors needs kv (#10675 ) When creating a quantized model from safetensors we need the array KV values to be loaded. Changing this value to -1 loads the KV values on the returned layer to be used and saved during quantization.	2025-05-12 12:04:20 -07:00
Michael Yang	f95a1f2bef	feat: add trace log level (#10650 ) reduce prompt log to trace level	2025-05-12 11:43:00 -07:00
Michael Yang	0d6e35d3c6	fix: stream accumulator exits early (#10593 ) the stream accumulator exits as soon as it sees `api.ProgressResponse(status="success")` which isn't strictly correct since some requests may have multiple successes, e.g. `/api/create` when the source model needs to be pulled.	2025-05-08 13:17:30 -07:00
Devon Rifkin	20c5fd39c8	Merge branch 'main' into drifkin/array-head-count-simple	2025-05-08 11:46:52 -07:00
Michael Yang	6e9a7a2568	lint: enable usetesting, disable tenv (#10594 )	2025-05-08 11:42:14 -07:00
Daniel Hiltgen	5e380c3b42	sched: fix race leading to orphaned runners (#10599 ) If a model is loading, and the request context is canceled during the load by a client closing the connection, and another request is inbound for the same model with a different configuration (context size, etc.) thus requiring a reload, two unload events can be in flight. The first shuts down the original model load, but the second one caused the loss of the new reloading runner reference, thus triggering the leak. The primary fix is detecting the duplicate unload and ignoring the second instance. The load routine is also hardened to ensure we detect clobbering an already present runner and unload it with a warning.	2025-05-07 09:38:17 -07:00
Jeffrey Morgan	392de84031	api: remove unused RetrieveModelResponse type (#10603 )	2025-05-06 23:08:03 -07:00
Devon Rifkin	4090aca97b	server: send 405 instead of 404 for unallowed methods (#10275 ) Fixes: #5483	2025-05-06 14:45:37 -07:00
Michael Yang	92ce438de0	server: remove internal cmd (#10595 )	2025-05-06 13:05:01 -07:00
Daniel Hiltgen	424810450f	Move quantization to new backend (#10363 ) * Move quantization logic to GGML via new backend This moves the model aware logic to Go code and calls GGML's quantization code for model creation. * Remove "add model quantizations" This is no longer needed now that quantization is implemented in Go+GGML code directly.	2025-05-06 11:20:48 -07:00
Jeffrey Morgan	1703d1472e	server: fix panic when runner.Options is nil (#10566 )	2025-05-05 09:01:33 -07:00
Daniel Hiltgen	76ea735aaf	sched: logging improvements (#10550 ) This enhances our logging in the scheduler. The initial "waiting for server" log no longer claims an initial error state (now "not responding" which better reflects the actual state). Runners now have slog wiring to report more details about the runner, including PID.	2025-05-03 12:01:56 -07:00
frob	e6d2d04121	image: add vision capability for projector-based models (#10509 ) Co-authored-by: Richard Lyons <frob@cloudstaff.com>	2025-05-01 16:50:20 -07:00
Devon Rifkin	ad3c7c9bda	strip out thinking tags in message history for qwen3 & r1 (#10490 ) * strip out thinking tags in message history for qwen3 & r1 This is in advance of "proper" support where we'll make reasoning configurable and we'll parse out thinking/reasoning tags and provide them to the caller. These models expect there to be no thinking tags in the message history, so this should improve quality * parse model names instead of hacky prefix check	2025-04-30 13:57:45 -07:00
Daniel Hiltgen	415c8fcc3d	Fix "Stopping..." scheduler hang (#10487 ) * Adjust initial scheduler refCount Ensure we only set the refCount on success * sched: fix lock order inversion deadlock Under certain race conditions, there was a scenario where the scheduler would get into a deadlock while trying to update free space information while a model was trying to unload.	2025-04-30 11:26:52 -07:00
Devon Rifkin	fe5b9bb21b	lower default num parallel to 2 this is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral though, because even though the old 4x2k limit and the new 2x4k limit are memory equivalent, the 1x fallback is larger with 4k	2025-04-29 02:04:14 -07:00
Devon Rifkin	dd93e1af85	Revert "increase default context length to 4096 (#10364 )" This reverts commit `424f648632`.	2025-04-28 16:54:11 -07:00
Devon Rifkin	d2ee599dcf	load arrays with up to 1024 elements when estimating This mirrors the old behavior before #10382	2025-04-27 13:45:13 -07:00
Michael Yang	340448d2d1	explicitly decode maxarraysize 1024	2025-04-25 16:59:01 -07:00
Michael Yang	214a7678ea	fix superfluous call to WriteHeader the first call to http.ResponseWriter.Write implicitly calls WriteHeader with http.StatusOK if it hasn't already been called. once WriteHeader has been called, subsequent calls have no effect. Write is called when JSON encoding progressUpdateJSON{}. calls to http.ResponseWriter.WriteHeader after the first encode are useless and produce a warning: http: superfluous response.WriteHeader call from github.com/ollama/ollama/server/internal/registry.(*statusCodeRecorder).WriteHeader (server.go:77)	2025-04-25 16:58:49 -07:00
Devon Rifkin	424f648632	increase default context length to 4096 (#10364 ) * increase default context length to 4096 We lower the default numParallel from 4 to 2 and use these "savings" to double the default context length from 2048 to 4096. We're memory neutral in cases when we previously would've used numParallel == 4, but we add the following mitigation to handle some cases where we would have previously fallen back to 1x2048 due to low VRAM: we decide between 2048 and 4096 using a runtime check, choosing 2048 if we're on a one GPU system with total VRAM of <= 4 GB. We purposefully don't check the available VRAM because we don't want the context window size to change unexpectedly based on the available VRAM. We plan on making the default even larger, but this is a relatively low-risk change we can make to quickly double it. * fix tests add an explicit context length so they don't get truncated. The code that converts -1 from being a signal for doing a runtime check isn't running as part of these tests. * tweak small gpu message * clarify context length default also make it actually show up in `ollama serve --help`	2025-04-22 16:33:24 -07:00
Michael Yang	88738b357b	create tempdir in models directory the models directory should have plenty of storage and also ensure there's no cross-device copy	2025-04-18 18:13:05 -07:00
Blake Mizerany	4e535e6188	server/internal/registry: make pull send errors with Error field (#10326 ) Previously, the pull handler would send an error message in the Status field, this prevented the client from using the message as a signal to stop. In the case of the "run" command, it would follow the pull with a "show" which would print a nearly identical "not found" message for unresolved models. Fixes #10307	2025-04-18 18:12:28 -07:00
Blake Mizerany	1d99451ad7	server/internal/client/ollama: handle some network errors gracefully (#10317 )	2025-04-17 12:43:09 -07:00
Blake Mizerany	369de832cd	server/internal/registry: remove superfluous progress bar flush (#10303 ) This removes the extra flushProgress() at the end of handlePull. It is unnecessary because final progress updates are flushed in all cases of the main select loop.	2025-04-16 14:43:07 -07:00
Blake Mizerany	3457a315b2	server/internal/client/ollama: cleanup use of multiple counters (#10304 ) The completed and received counters must work in tandem and the code should better reflect that. Previously, the act of updating them was 2-3 lines of code duplicated in multiple places. This consolidates them into a single update closure for easy reading and maintenance. This also simplifies error handling in places where we can use a return parameter and defer to handle the error case for updates. Also, remove the old Layer field from the trackingReader struct.	2025-04-16 14:33:40 -07:00
Daniel Hiltgen	56dc316a57	Give tests more time to run (#10306 ) Fix flake failures on windows	2025-04-16 13:37:00 -07:00
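
Commit `fbe6ae285a` describes falling back to alternative quantization types when a tensor's dimensions aren't divisible by the block size of the desired type, ending at F16 with a block size of 1. The Go sketch below illustrates that selection idea only. It is not the code from the commit: the `quantType` struct, the `pickQuant` function, the order of the candidates, and the choice to test only the row length are assumptions made for illustration.

```go
package main

import "fmt"

// quantType is a hypothetical descriptor for a quantization format.
type quantType struct {
	Name      string
	BlockSize uint64
}

// pickQuant returns the first candidate whose block size evenly divides
// the row length. If none fits, it returns F16, whose block size of 1
// divides any length. (Assumption: only the row length is checked.)
func pickQuant(rowLen uint64, candidates []quantType) quantType {
	for _, c := range candidates {
		if rowLen%c.BlockSize == 0 {
			return c
		}
	}
	return quantType{Name: "F16", BlockSize: 1}
}

func main() {
	// Illustrative fallback chain: desired type first, then an alternative.
	chain := []quantType{
		{Name: "Q4_K", BlockSize: 256},
		{Name: "Q5_0", BlockSize: 32},
	}
	for _, n := range []uint64{4096, 160, 100} {
		fmt.Printf("row length %d -> %s\n", n, pickQuant(n, chain).Name)
	}
}
```

With this chain, a row length of 4096 keeps the desired type, 160 lands on the alternative, and 100 ends at F16.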

862 Commits