Commit Graph

4234 Commits

Author SHA1 Message Date
Michael Yang
d931ee8f22 create blobs in parallel (#10135)
* default max term height
* error on out of tree files
2025-05-05 11:59:26 -07:00
Jesse Gross
7073600797 ggml: Reduce log level of "key not found"
Most of the time this is not an error.
2025-05-05 11:17:32 -07:00
Daniel Hiltgen
b1c40138da win: lint fix (#10571) 2025-05-05 11:08:12 -07:00
Ashok Gelal
17466217e5 Hide empty terminal window (#8668)
This hides the LlamaServer blank window when chatting outside of the terminal (say like with an app like Msty). This has no other side effects when invoking it the regular way.
2025-05-05 09:06:46 -07:00
Jeffrey Morgan
1703d1472e server: fix panic when runner.Options is nil (#10566) 2025-05-05 09:01:33 -07:00
Jeffrey Morgan
913905028b all: fix cgo compiler warnings on windows (#10563) 2025-05-05 08:02:39 -07:00
湛露先生
7e5c8eee5c file close check and close. (#10554)
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
2025-05-04 15:37:59 -07:00
Daniel Hiltgen
6a74bba7e7 win: ensure ollama paths come first (#10549)
For all search path env vars make sure our dirs are first
to avoid potentially finding other incompatible libraries
on the users system.

Also fixes a minor build script glitch for windows rocm
2025-05-03 13:11:48 -07:00
Daniel Hiltgen
76ea735aaf sched: logging improvements (#10550)
This enhances our logging in the scheduler.  The initial "waiting for server" log
no longer claims an initial error state (now "not responding" which better reflects
the actual state).  Runners now have slog wiring to report more details about the
runner, including PID.
2025-05-03 12:01:56 -07:00
aritra saha
dd1d4e99e7 readme: add llama 4 models (#10530) 2025-05-02 19:45:02 -07:00
Jesse Gross
a6ef73f4f2 ggml: Fix race that resulted in "context canceled" when loading
Successfully completing processing with an errgroup cancels the
associated context. However, we also have a goroutine that is checking
for cancelation of the context. As a result, there is a race where
the goroutine can pick up the cancelation and report an error,
replacing the sucessful error message.

To avoid that, this replaces the goroutine with a cancelation check
when we are reading files. This also has the advantage of stopping
all reads relatively quickly on error and also ensuring that there are
no outstanding I/O operations when we return in this case.

The downside is that if a file read blocks forever (for example, over
the network) then cancelation of the context effectively won't be
honored. However, this is also true for other smaller files we read
and the tensors are read in small chunks (128K), so it's consistent
and better on balance overall.
2025-05-02 13:43:25 -07:00
Jesse Gross
c2f5d6662b ollamarunner: Re-enable worst case graph preallocation.
Worst case graph preallocation was disabled by a27462b
"ollamarunner: Temporarily disable worst case graph preallocation"
since it caused crashes with large batches when not using the GPU.

This backports upstream llama.cpp commit f057808
"ggml: Don't assert fail when tensor data changes (#13222)", which
fixes the underlying bug and allows reverting the previous workaround.
2025-05-02 12:22:47 -07:00
Harsh Nevse
57fb759f3c readme: update link to langchain in community integrations (#10465) 2025-05-01 23:08:51 -07:00
Jeffrey Morgan
8dd12c873d llama: update to commit e1e8e099 (#10513) 2025-05-01 18:24:09 -07:00
frob
e6d2d04121 image: add vision capability for projector-based models (#10509)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-05-01 16:50:20 -07:00
Jesse Gross
074bac8447 kvcache: Log batch size if we can't find a slot
In some cases, we can't find a cache slot when using sliding window
attention. It would be helpful in this (and other cases) to know what
the batch size is.

Bug #10127
2025-05-01 16:26:36 -07:00
Jesse Gross
8e8f2c6d67 ollamarunner: Fix memory leak when processing images
The context (and therefore associated input tensors) was not being
properly closed when images were being processed. We were trying to
close them but in reality we were closing over an empty list, preventing
anything from actually being freed.

Fixes #10434
2025-05-01 15:15:24 -07:00
AliAhmedNada
938e8447e8 readme: add Jirapt project to community integrations (#10522) 2025-05-01 14:49:47 -07:00
aritra saha
d5d5f0c445 readme: change granite3.2 to granite3.3 (#10525)
Update the list for readme
2025-05-01 14:46:09 -07:00
Michael Yang
a7835c6716 fix: write gguf padding (#10510)
* add gguf_test

* fix padding

padding was being added to offset but not to the running count
2025-04-30 17:59:31 -07:00
Devon Rifkin
ad3c7c9bda strip out thinking tags in message history for qwen3 & r1 (#10490)
* strip out thinking tags in message history for qwen3 & r1

This is in advance of "proper" support where we'll make reasoning
configurable and we'll parse out thinking/reasoning tags and provide
them to the caller. These models expect there to be no thinking tags in
the message history, so this should improve quality

* parse model names instead of hacky prefix check
2025-04-30 13:57:45 -07:00
Daniel Hiltgen
415c8fcc3d Fix "Stopping..." scheduler hang (#10487)
* Adjust initial scheduler refCount

Ensure we only set the refCount on success

* sched: fix lock order inversion deadlock

Under certain race conditions, there was a scenario where the scheduler would
get into a deadlock while trying to update free space information while a model
was trying to unload.
2025-04-30 11:26:52 -07:00
Daniel Hiltgen
718eda1b3e Narrow set of paths we load GGML from (#10485)
Users may have other incompatible GGML installs on their systems.
This will prevent us from trying to load them from the path.
2025-04-30 11:25:22 -07:00
Shahin R
421b7edeb4 readme: add link to lumina, a lightweight React frontend client (#10378) 2025-04-30 09:50:47 -07:00
batuhankadioglu
7b68e254c2 all: update several golang.org/x packages (#10436) 2025-04-29 16:51:09 -07:00
Daniel Hiltgen
7bec2724a5 integration: fix embedding tests error handling (#10478)
The cleanup routine from InitServerconnection should run in the defer of the test case to properly detect failures and report the server logs
2025-04-29 11:57:54 -07:00
Jesse Gross
a27462b708 ollamarunner: Temporarily disable worst case graph preallocation
When we later have a large batch running purely on a CPU, this
results the error:
GGML_ASSERT(talloc->buffer_id >= 0)

Disabling this means that we will incrementally reallocate memory
as the graph grows.

Fixes #10410
2025-04-29 11:04:58 -07:00
crStiv
6bf0b8193a readme: fix typos (#10399) 2025-04-29 10:30:44 -07:00
Devon Rifkin
db428adbb8 Merge pull request #10468 from ollama/drifkin/num-parallel-1 2025-04-29 10:21:36 -07:00
Devon Rifkin
fe5b9bb21b lower default num parallel to 2
this is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral though, because even though the old 4x2k limit and the new 2x4k limit are memory equivalent, the 1x fallback is larger with 4k
2025-04-29 02:04:14 -07:00
Devon Rifkin
6ec71d8fb6 Merge pull request #10452 from ollama/drifkin/4096-context-length
config: update default context length to 4096
2025-04-28 17:13:51 -07:00
Devon Rifkin
44b466eeb2 config: update default context length to 4096 2025-04-28 17:03:27 -07:00
Devon Rifkin
a25f3f8260 Merge pull request #10451 from ollama/revert-10364-drifkin/context-length
Revert "increase default context length to 4096"
2025-04-28 17:02:10 -07:00
Devon Rifkin
dd93e1af85 Revert "increase default context length to 4096 (#10364)"
This reverts commit 424f648632.
2025-04-28 16:54:11 -07:00
Michael Yang
5cfc1c39f3 model: fix build (#10416) v0.6.7-rc0 2025-04-25 19:24:48 -07:00
Michael Yang
f0ad49ea17 memory 2025-04-25 16:59:20 -07:00
Michael Yang
7ba9fa9c7d fixes for maverick 2025-04-25 16:59:20 -07:00
Michael Yang
8bf11b84c1 chunked attention 2025-04-25 16:59:20 -07:00
Michael Yang
470af8ab89 connect vision to text 2025-04-25 16:59:20 -07:00
Michael Yang
178761aef3 image processing
Co-authored-by: Patrick Devine <patrick@infrahq.com>
2025-04-25 16:59:20 -07:00
Michael Yang
f0c66e6dea llama4 2025-04-25 16:59:20 -07:00
Michael Yang
54055a6dae fix test 2025-04-25 16:59:01 -07:00
Michael Yang
340448d2d1 explicitly decode maxarraysize 1024 2025-04-25 16:59:01 -07:00
Michael Yang
ced7d0e53d fix parameter count 2025-04-25 16:59:01 -07:00
Michael Yang
a0dba0f8ae default slice values 2025-04-25 16:59:01 -07:00
Michael Yang
5e20b170a7 update comment 2025-04-25 16:59:01 -07:00
Michael Yang
d26c18e25c fix token type 2025-04-25 16:59:01 -07:00
Michael Yang
8d376acc9b zero means zero
use a default of 1024 when asking for zero is confusing since most calls
seem to assume 0 means do not ready any data
2025-04-25 16:59:01 -07:00
Michael Yang
dc1e81f027 convert: use -1 for read all 2025-04-25 16:59:01 -07:00
Michael Yang
5d0279164c generic ggml.array 2025-04-25 16:59:01 -07:00