862 Commits

Author SHA1 Message Date
Devon Rifkin
b2b270ad5d Merge branch 'main' into drifkin/array-head-count-simple 2025-06-23 10:37:31 -07:00
Michael Yang
0a066cfd91 Reapply "feat: incremental gguf parser (#10822)" (#11114) (#11119)
* Reapply "feat: incremental gguf parser (#10822)" (#11114)

This reverts commit a6e64fbdf2.

* fix older ggufs
2025-06-20 11:11:40 -07:00
Jeffrey Morgan
a6e64fbdf2 Revert "feat: incremental gguf parser (#10822)" (#11114)
This reverts commit 6b04cad7e8.
2025-06-18 05:42:44 -07:00
曹家巧
60cfa2a203 cache: fix comment function name in cache.go (#11110) 2025-06-18 05:21:45 -07:00
Jeffrey Morgan
9f8a18ec05 tools: loosen tool parsing to allow for more formats (#11030) 2025-06-12 14:18:54 -07:00
Michael Yang
6b04cad7e8 feat: incremental gguf parser (#10822)
* incremental gguf parser
* gguf: update test to not rely on gguf on disc
* re-use existing create gguf
* read capabilities from gguf kv
* kv exists
* update tests
* s/doneFunc/successFunc/g
* new buffered reader

---------

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-06-12 11:04:11 -07:00
Jeffrey Morgan
09d308d6b6 Revert "server: add model capabilities to the list endpoint (#10174)" (#11004)
This reverts commit 0943001193.
2025-06-06 23:29:14 -04:00
Devon Rifkin
a3b6886b7d move thinking logic into its own package (#10990)
move thinking logic into its own package
2025-06-06 12:02:20 -07:00
Devon Rifkin
0683efa637 export ThinkingParser 2025-06-05 10:22:32 -07:00
JasonHonKL
0943001193 server: add model capabilities to the list endpoint (#10174) 2025-06-04 11:39:48 -07:00
Devon Rifkin
5f57b0ef42 add thinking support to the api and cli (#10584)
- Both `/api/generate` and `/api/chat` now accept a `"think"`
  option that allows specifying whether thinking mode should be on or
  not
- Templates get passed this new option so, e.g., qwen3's template can
  put `/think` or `/no_think` in the system prompt depending on the
  value of the setting
- Models' thinking support is inferred by inspecting model templates.
  The prefix and suffix the parser uses to identify thinking support is
  also automatically inferred from templates
- Thinking control & parsing is opt-in via the API to prevent breaking
  existing API consumers. If the `"think"` option is not specified, the
  behavior is unchanged from previous versions of ollama
- Add parsing for thinking blocks in both streaming/non-streaming mode
  in both `/generate` and `/chat`
- Update the CLI to make use of these changes. Users can pass `--think`
  or `--think=false` to control thinking, or during an interactive
  session they can use the commands `/set think` or `/set nothink`
- A `--hidethinking` option has also been added to the CLI. This makes
  it easy to use thinking in scripting scenarios like
  `ollama run qwen3 --think --hidethinking "my question here"` where you
  just want to see the answer but still want the benefits of thinking
  models
2025-05-28 19:38:52 -07:00
Kyle Steere
9239a254e0 server: abort download on empty digest
Signed-off-by: Kyle Steere <kyle.steere@chainguard.dev>
2025-05-27 11:28:48 -07:00
frob
eda472df1b server: add hint to the error message when model path access fails (#10843) 2025-05-24 13:17:04 -07:00
Parth Sareen
e8b981fa5d tools: refactor tool call parsing and enable streaming (#10415) 2025-05-23 14:19:31 -07:00
Daniel Hiltgen
d950ff12c0 sched: fix runner leak during reloading unload (#10819)
When the same model is being reloaded rapidly with client connections
being canceled before the model finishes loading, the queued unload
event could cause a leak of runners by deleting a different runner from
the loaded list.
2025-05-22 14:31:36 -07:00
Bruce MacDonald
fbe6ae285a server: improve tensor quantization fallback logic (#10806)
Fall back to alternative quantization types when a tensor's dimensions aren't divisible by the block size required for the original desired quantization type. If retried quantization types fail, the system ultimately falls back to F16 (half-precision floating point) which has a block size of 1 and can handle any tensor dimension.
2025-05-22 10:48:08 -07:00
Michael Yang
61aeaf7e81 remove support for multiple ggufs in a single file (#10722)
* remove support for multiple ggufs in a single file

this was an attempt to make it easier to import multimodal models into
ollama. this was rarely used and error prone so remove it

* fix: create fused model from blob
2025-05-21 13:55:31 -07:00
Daniel Hiltgen
1a0cfd080a avoid kv truncation during create (#10761) 2025-05-19 13:54:54 -07:00
Jesse Gross
94ab428e3f ggml: Seperate tensor load from backend creation
Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them to be two
steps:
 - Create backend, including enumerating tensors and memory allocation
 - Loading tensor data

This allows more flexibility in managing model loading.
2025-05-19 09:54:22 -07:00
Daniel Hiltgen
ff80718e9c fix crash in old clients with quantization progress (#10710)
Older clients assumed the digest was at least 19 characters long so increase the size
of the dummy digest to avoid array out of bounds crashes.
2025-05-14 14:54:18 -07:00
Michael Yang
23125648b8 chore: update mllama to use ollama engine (#10637) 2025-05-13 17:36:02 -07:00
Jeffrey Morgan
c7f4ae7b9c server: add webp image input support (#10653) 2025-05-12 20:41:42 -07:00
Daniel Hiltgen
9d6df90805 Follow up to #10363 (#10647)
The quantization PR didn't block all unsupported file types,
which this PR fixes.  It also updates the API docs to reflect
the now reduced set of supported types.
2025-05-12 15:23:31 -07:00
Bruce MacDonald
ad035ad595 convert: quantize from safetensors needs kv (#10675)
When creating a quantized model from safetensors we
need the array KV values to be loaded.Changing this
value to -1 loads the KV values on the returned
layer to be used and saved during quantization.
2025-05-12 12:04:20 -07:00
Michael Yang
f95a1f2bef feat: add trace log level (#10650)
reduce prompt log to trace level
2025-05-12 11:43:00 -07:00
Michael Yang
0d6e35d3c6 fix: stream accumulator exits early (#10593)
the stream accumulator exits as soon as it sees `api.ProgressResponse(status="success")` which isn't strictly correctly
since some requests may have multiple successes, e.g. `/api/create` when the source model needs to be pulled.
2025-05-08 13:17:30 -07:00
Devon Rifkin
20c5fd39c8 Merge branch 'main' into drifkin/array-head-count-simple 2025-05-08 11:46:52 -07:00
Michael Yang
6e9a7a2568 lint: enable usetesting, disable tenv (#10594) 2025-05-08 11:42:14 -07:00
Daniel Hiltgen
5e380c3b42 sched: fix race leading to orphaned runners (#10599)
If a model is loading, and the request context is canceled during the load
by a client closing the connection, and another request is inbound for the
same model with a different configuration (context size, etc.) thus requiring
a reload, two unload events can be in flight.  The first shuts down the
original model load, but the second one caused the loss of the new
reloading runner reference, thus triggering the leak.

The primary fix is detecting the duplicate unload and ignoring the second
instance.  The load routine is also hardened to ensure we detect
clobbering an already present runner and unload it with a warning.
2025-05-07 09:38:17 -07:00
Jeffrey Morgan
392de84031 api: remove unused RetrieveModelResponse type (#10603) 2025-05-06 23:08:03 -07:00
Devon Rifkin
4090aca97b server: send 405 instead of 404 for unallowed methods (#10275)
Fixes: #5483
2025-05-06 14:45:37 -07:00
Michael Yang
92ce438de0 server: remove internal cmd (#10595) 2025-05-06 13:05:01 -07:00
Daniel Hiltgen
424810450f Move quantization to new backend (#10363)
* Move quantization logic to GGML via new backend

This moves the model aware logic to Go code and calls GGMLs quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.
2025-05-06 11:20:48 -07:00
Jeffrey Morgan
1703d1472e server: fix panic when runner.Options is nil (#10566) 2025-05-05 09:01:33 -07:00
Daniel Hiltgen
76ea735aaf sched: logging improvements (#10550)
This enhances our logging in the scheduler.  The initial "waiting for server" log
no longer claims an initial error state (now "not responding" which better reflects
the actual state).  Runners now have slog wiring to report more details about the
runner, including PID.
2025-05-03 12:01:56 -07:00
frob
e6d2d04121 image: add vision capability for projector-based models (#10509)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-05-01 16:50:20 -07:00
Devon Rifkin
ad3c7c9bda strip out thinking tags in message history for qwen3 & r1 (#10490)
* strip out thinking tags in message history for qwen3 & r1

This is in advance of "proper" support where we'll make reasoning
configurable and we'll parse out thinking/reasoning tags and provide
them to the caller. These models expect there to be no thinking tags in
the message history, so this should improve quality

* parse model names instead of hacky prefix check
2025-04-30 13:57:45 -07:00
Daniel Hiltgen
415c8fcc3d Fix "Stopping..." scheduler hang (#10487)
* Adjust initial scheduler refCount

Ensure we only set the refCount on success

* sched: fix lock order inversion deadlock

Under certain race conditions, there was a scenario where the scheduler would
get into a deadlock while trying to update free space information while a model
was trying to unload.
2025-04-30 11:26:52 -07:00
Devon Rifkin
fe5b9bb21b lower default num parallel to 2
this is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral though, because even though the old 4x2k limit and the new 2x4k limit are memory equivalent, the 1x fallback is larger with 4k
2025-04-29 02:04:14 -07:00
Devon Rifkin
dd93e1af85 Revert "increase default context length to 4096 (#10364)"
This reverts commit 424f648632.
2025-04-28 16:54:11 -07:00
Devon Rifkin
d2ee599dcf load arrays with up to 1024 elements when estimating
This mirrors the old behavior before #10382
2025-04-27 13:45:13 -07:00
Michael Yang
340448d2d1 explicitly decode maxarraysize 1024 2025-04-25 16:59:01 -07:00
Michael Yang
214a7678ea fix superfluous call to WriteHeader
the first call to http.ResponseWriter.Write implicitly calls WriteHeader
with http.StatusOK if it hasn't already been called. once WriteHeader
has been called, subsequent calls has no effect. Write is called when
JSON encoding progressUpdateJSON{}. calls to
http.ResponseWriter.WriteHeader after the first encode is useless and
produces a warning:

http: superfluous response.WriteHeader call from github.com/ollama/ollama/server/internal/registry.(*statusCodeRecorder).WriteHeader (server.go:77)
2025-04-25 16:58:49 -07:00
Devon Rifkin
424f648632 increase default context length to 4096 (#10364)
* increase default context length to 4096

We lower the default numParallel from 4 to 2 and use these "savings" to
double the default context length from 2048 to 4096.

We're memory neutral in cases when we previously would've used
numParallel == 4, but we add the following mitigation to handle some
cases where we would have previously fallen back to 1x2048 due to low
VRAM: we decide between 2048 and 4096 using a runtime check, choosing
2048 if we're on a one GPU system with total VRAM of <= 4 GB. We
purposefully don't check the available VRAM because we don't want the
context window size to change unexpectedly based on the available VRAM.

We plan on making the default even larger, but this is a relatively
low-risk change we can make to quickly double it.

* fix tests

add an explicit context length so they don't get truncated. The code
that converts -1 from being a signal for doing a runtime check isn't
running as part of these tests.

* tweak small gpu message

* clarify context length default

also make it actually show up in `ollama serve --help`
2025-04-22 16:33:24 -07:00
Michael Yang
88738b357b create tempdir in models directory
the models directory should have plenty of storage and also ensure
there's no cross-device copy
2025-04-18 18:13:05 -07:00
Blake Mizerany
4e535e6188 server/internal/registry: make pull send errors with Error field (#10326)
Previously, the pull handler would send an error message in the Status
field, this prevented the client from using the message as a signal to
stop. In the case of the "run" command, it would follow the pull with a
"show" which would print a nearly identical "not found" message for
unresolved models.

Fixes #10307
2025-04-18 18:12:28 -07:00
Blake Mizerany
1d99451ad7 server/internal/client/ollama: handle some network errors gracefully (#10317) 2025-04-17 12:43:09 -07:00
Blake Mizerany
369de832cd server/internal/registry: remove superfluous progress bar flush (#10303)
This removes the extra flushProgress() at the end of handlePull. It is
unnecessary because final progress updates are flushed in all cases of
the main select loop.
2025-04-16 14:43:07 -07:00
Blake Mizerany
3457a315b2 server/internal/client/ollama: cleanup use of multiple counters (#10304)
The completed and received counters must work in tandem and the code
should better reflect that. Previously, the act of updating them was 2-3
lines of code duplicated in multiple places. This consolidates them into
a single update closure for easy reading and maintenance.

This also simplifies error handling in places where we can use a return
parameter and defer to handle the error case for updates.

Also, remove the old Layer field from the trackingReader struct.
2025-04-16 14:33:40 -07:00
Daniel Hiltgen
56dc316a57 Give tests more time to run (#10306)
Fix flake failures on windows
2025-04-16 13:37:00 -07:00