Commit Graph

4740 Commits

Author SHA1 Message Date
Michael Yang
17008c5efe remove unused ops 2025-11-03 14:53:08 -08:00
Michael Yang
f0e0e2c289 deepseek2 2025-11-03 14:50:47 -08:00
Michael Yang
9228336ed0 qwen25vl 2025-11-03 14:36:42 -08:00
Michael Yang
742f3f149c qwen3vl 2025-11-03 14:36:42 -08:00
Michael Yang
dd8c226f17 mistral3 2025-11-03 14:36:42 -08:00
Michael Yang
5436bb5ff9 gptoss 2025-11-03 14:36:42 -08:00
Michael Yang
bef18cd700 gemma3n 2025-11-03 14:36:42 -08:00
Michael Yang
702d5c71c9 llama4 2025-11-03 14:36:42 -08:00
Michael Yang
eb3e1de5c5 bert 2025-11-03 12:59:55 -08:00
Michael Yang
4e3ad2c2ed use slice/chunks 2025-11-03 12:54:11 -08:00
Michael Yang
eadae522dc chunk, chunksections 2025-11-03 12:12:12 -08:00
Michael Yang
d3c208dd23 slice 2025-11-03 12:12:12 -08:00
Michael Yang
ce3eb0a315 chore(gptoss): cleanup dead code (#12932) 2025-11-03 11:27:15 -08:00
Ryan Coleman
60829f7ec6 readme: add Strands Agents to community integrations (#11740) 2025-11-02 16:01:28 -08:00
Attogram Project
9a50fd584c readme: add Ollama Bash Lib to community integrations (#12235) 2025-11-02 15:44:56 -08:00
Jesse Gross
392a270261 ggml: Avoid cudaMemsetAsync during memory fitting
We pass invalid pointers when we check the size of the required
compute graph before fitting. Some CUDA APIs validate these pointers
but we can just skip them during this phase. cudaMemsetAsync is one
of these that we weren't skipping but never took the code path that
used it. Now that we have enabled op_offload, we can hit it in
memory pressured situations.
2025-10-31 15:23:28 -07:00
Daniel Hiltgen
3bee3af6ed cpu: always ensure LibOllamaPath included (#12890)
In CPU only setups the LibOllamaPath was omitted causing
us not to load the ggml-cpu-XXX libraries during inference.
2025-10-31 14:37:29 -07:00
Daniel Hiltgen
83537993d7 logs: catch rocm errors (#12888)
This will help bubble up more crash errors
2025-10-31 09:54:25 -07:00
nicole pardal
7dd4862a89 embeddings: removed redundant TestAPIEmbeddings test (#12863)
This PR removes a redundant test from TestAPIEmbeddings
Contents of this test already exists in embed_test.go and model_arch_test.go
2025-10-30 17:12:33 -07:00
Daniel Hiltgen
db973c8fc2 win: avoid ID mixups on refresh (#12869)
On Windows AMD IDs are numeric, and can reorder based on the filter environment.
By passing in the filter env on a full discovery refresh, we'll only look at the actual devices
and ignore unsupported iGPUs.  Without this, on some systems iGPU VRAM was incorrectly
being used to populate the dGPU.
2025-10-30 15:12:14 -07:00
Jesse Gross
afaf7ce8c3 ggml: Enable op_offload to improve partial offload performance
When a model is partially offloaded to system RAM, we can either
do the calculations on the CPU or we can temporarily transfer the
data to the GPU to do the calculations there. Small batches tend
to be better on the CPU, large batches on the GPU.

The llamarunner used the GPU in most cases and the ollamarunner
used the CPU. Although the ollamarunner saw an improvement in
token generation performance, there was a large performance hit
in prompt processing (3-10x).

There is an existing heuristic to dynamically switch between these
two modes but in practice it doesn't have enough information to
accurately make that decision. This adds authoritative data to make
the check work to get the best of both worlds.

Fixes #12037
2025-10-30 13:53:10 -07:00
Jesse Gross
26465fb85f ollamarunner: Worst case batch for token generation
We currently allocate the worst case batch for max sized
batches, which corresponds to prompt processing. However,
there are some cases where the generated graph is different
for small and large batches. To ensure that we don't need
to allocate memory later after layout has taken place, we
should run the worst case batch both ways and take the larger
amount of memory.

This does not noticeably affect loading speed as the most expensive
part of this logic is from image processing and that does not
occur during token generation.
2025-10-30 13:53:10 -07:00
Daniel Hiltgen
88236bc05f win: use copy for subprocess logs (#12864)
windows gets confused when we try to hand the stderr file descriptor to the subprocess children.  This ensures the log output
always shows up.
2025-10-30 13:22:00 -07:00
Patrick Devine
76eb7d0fff testing: test more models with tool calling (#12867) 2025-10-30 13:19:21 -07:00
Michael Yang
f67a6df110 interleaved mrope (#12807)
* ml(ggml): mrope
* interleave mrope
2025-10-30 11:29:00 -07:00
Michael Yang
75e75d9afe qwen3vl: enable flash attention by default (#12862) 2025-10-30 10:51:37 -07:00
Michael Yang
ed78e127d0 fix(cmd): unload model before removal (#12832)
this change fixes two bugs with `ollama rm`:

1. before a model is removed, it will first be stopped. this only
   happens for the first argument and skipped for all other models
2. models are unloaded indiscriminately. this errors for cloud models
   and should be omitted
2025-10-30 10:41:49 -07:00
Michael Yang
d432ade714 fix: qwen2.5vl, qwen3vl composite image (#12841)
this change fixes images with an alpha channel by overlaying the image
onto a white background
2025-10-30 10:33:19 -07:00
Michael Yang
06b3422d5f tests: add tests and docs for commonly used ops (#12844)
* mulmat
* permute
2025-10-30 10:32:45 -07:00
Athiban Sharon
cbe1cf06c4 Update README.md (#12822)
Fixed broken docs links
2025-10-30 13:14:39 -04:00
Grace
0a2d92081b Removing whitespace between Thinking and Content in Qwen3VL (#12838)
Eats extra whitespace at the end/beginning of content
2025-10-29 15:14:28 -07:00
Daniel Hiltgen
c88647104d int: harden server lifecycle (#12835)
this should reduce zombies during integration runs
2025-10-29 11:50:56 -07:00
Patrick Devine
05aff4a4f1 tests: fix embeddinggemma integration test (#12830) 2025-10-29 11:07:28 -07:00
Michael Yang
0d140bd1af fix: conv2d bias (#12834) 2025-10-29 11:03:43 -07:00
Jeffrey Morgan
93e45f0f0d docs: temporarily restore api.md and cleanup docs paths (#12818) 2025-10-28 23:25:48 -07:00
Jeffrey Morgan
a342160803 docs: fix root api documentation page (#12813) 2025-10-28 19:17:54 -07:00
Jeffrey Morgan
f6c29409dc docs: add new cloud model + fix openai redirect (#12812) 2025-10-28 19:09:07 -07:00
Michael Yang
7d25b9e194 feat(model): add qwen3vl (#12665) 2025-10-28 17:39:47 -07:00
Patrick Devine
36d64fb531 embed: add distance correlation test for library embed models (#12796) 2025-10-28 16:57:27 -07:00
Parth Sareen
d828517e78 docs: update readme and links (#12809) 2025-10-28 16:20:02 -07:00
Daniel Hiltgen
14977a9350 Fix vulkan PCI ID and ID handling (#12775)
* Fix vulkan PCI ID and ID handling

Intel GPUs may not report PCI IDs which was leading to incorrect overlap
detection.  Switch to using the existing PCI IDs, however AMD GPUs claim not to
report PCI IDs, but actually do, so try anyway, as this is required for ADLX to
find the GPUs on Windows. Numeric IDs lead to scheduling problems, so this also
switches Vulkan to use UUID based IDs. The GPU discovery patches have been
squashed into a single patch to simplify future rebases.

* review comments
2025-10-28 15:15:35 -07:00
Patrick Devine
29f63f37c8 Revert "server: Consolidate embedding truncation in runner (#12730)" (#12810)
This reverts commit 5d347f6d6f.
2025-10-28 14:49:14 -07:00
Parth Sareen
3d99d9779a docs: add docs for docs.ollama.com (#12805) 2025-10-28 13:18:48 -07:00
Parth Sareen
6d02a43a75 docs: rename to mdx to setup docs site (#12804) 2025-10-28 13:04:31 -07:00
Parth Sareen
5483497d7a Revert "docs: add reference to docs.ollama.com (#12800)" (#12803)
This reverts commit 934dd9e196.
2025-10-28 12:52:49 -07:00
Parth Sareen
934dd9e196 docs: add reference to docs.ollama.com (#12800) 2025-10-28 12:44:02 -07:00
Michael Yang
1188f408dd s/From*Slice/From*s/ (#12255) 2025-10-28 12:08:49 -07:00
nicole pardal
15c7d30d9a embedding tests: added check against exact base64 string (#12790) 2025-10-28 10:37:20 -07:00
Devon Rifkin
9862317174 Merge pull request #12793 from ollama/drifkin/12792_renderer-parser-from
create: inherit FROM model's renderer/parser
2025-10-28 00:15:46 -07:00
Michael Yang
ec9eb28f4c gemma3: make embedding non-causal (#12297) 2025-10-27 19:54:08 -07:00