Daniel Hiltgen
ab39e08eb9
llm: auto detect models that require Ollama Engine (#1)
2025-03-11 14:49:20 -07:00
jmorganca
11bfa62796
add trailing \n\n after <end_of_image> to match reference implementation
2025-03-11 14:49:20 -07:00
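As a hedged illustration of what that formatting change means in practice (the helper and surrounding prompt text are hypothetical; only the "<end_of_image>\n\n" suffix comes from the commit):

```go
package main

import (
	"fmt"
	"strings"
)

// buildImagePrompt sketches the fix: the "<end_of_image>" marker is
// followed by "\n\n" so the text after an image starts on its own
// paragraph, matching the reference implementation.
func buildImagePrompt(before, after string) string {
	var sb strings.Builder
	sb.WriteString(before)
	sb.WriteString("<start_of_image>")
	// ... image embeddings are substituted here by the runner ...
	sb.WriteString("<end_of_image>\n\n")
	sb.WriteString(after)
	return sb.String()
}

func main() {
	fmt.Println(buildImagePrompt("Describe this image: ", "Answer:"))
}
```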
jmorganca
f63e62e546
reduce kernel size, add TODO for loading from config
2025-03-11 14:49:20 -07:00
jmorganca
65b0f329d1
Revert "Allow models to force a new batch"
...
This reverts commit c7eae586b899083acebcd9b3847b89ea78c2850c.
2025-03-11 14:49:20 -07:00
Jesse Gross
06007c0a18
Allow models to force a new batch
...
This is useful for a few things:
- Work around bugs, such as having 2 images in one batch
- Keep the image in a single batch for fully connected attention
- Improve performance by not evaluating embeddings multiple times
2025-03-11 14:49:20 -07:00
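A minimal sketch of what such an opt-in could look like, assuming a hypothetical `Input` type and `BatchBreak` flag rather than the runner's real types:

```go
package input

// Input is a simplified stand-in for a single token or multimodal
// element in the runner's input stream (hypothetical shape).
type Input struct {
	Token      int32
	Multimodal any  // non-nil for image embeddings
	BatchBreak bool // hypothetical: model asks for a new batch here
}

// splitBatches groups inputs into batches of at most maxBatch entries,
// starting a new batch early whenever an input sets BatchBreak. This is
// the mechanism the commit describes: it keeps an image's embeddings in
// a single batch and avoids evaluating them more than once.
func splitBatches(inputs []Input, maxBatch int) [][]Input {
	var batches [][]Input
	var cur []Input
	for _, in := range inputs {
		if len(cur) > 0 && (in.BatchBreak || len(cur) >= maxBatch) {
			batches = append(batches, cur)
			cur = nil
		}
		cur = append(cur, in)
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}
```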
Jesse Gross
a8e83a7654
Disable causal attention based on batch index
...
Currently we are using positions, which are relative to a
sequence and may not be unique.
2025-03-11 14:49:20 -07:00
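A sketch of the distinction, with hypothetical names: keying the causal mask on each token's index within the batch, which is always unique, rather than on its sequence-relative position, which can repeat when several sequences share a batch:

```go
package main

import "fmt"

// causalMaskByIndex builds an n×n attention mask over batch indices.
// Convention (illustrative): 0 means "may attend"; negative infinity is
// approximated by -1e9 for masked entries.
func causalMaskByIndex(n int, nonCausal bool) [][]float32 {
	mask := make([][]float32, n)
	for i := range mask {
		mask[i] = make([]float32, n)
		for j := range mask[i] {
			if !nonCausal && j > i { // j > i: a later batch entry
				mask[i][j] = -1e9
			}
		}
	}
	return mask
}

func main() {
	fmt.Println(causalMaskByIndex(4, false))
}
```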
Jesse Gross
475005504e
Restrict Gemma to a single image per request
2025-03-11 14:49:20 -07:00
Jesse Gross
2c40c4d35e
Fix follow up images and images split across batches
2025-03-11 14:49:19 -07:00
Michael Yang
e95278932b
use non-causal mask only for image positions
2025-03-11 14:49:19 -07:00
Michael Yang
9d2a20a763
use non-causal mask for inputs with images
2025-03-11 14:49:19 -07:00
Patrick Devine
2e54d72fc3
fix gemma3 1b conversion
2025-03-11 14:49:19 -07:00
Michael Yang
6b32a2d549
compat with upstream gguf
2025-03-11 14:49:19 -07:00
Michael Yang
c5cbe4fc2a
fallback to cpu
2025-03-11 14:49:19 -07:00
Michael Yang
f888912870
fix vision encoder
2025-03-11 14:49:19 -07:00
Michael Yang
9e4642e9b3
ollama debug tensor
2025-03-11 14:49:19 -07:00
Michael Yang
6b0486c216
duplicate token_embd to output
2025-03-11 14:49:19 -07:00
Michael Yang
d368c039f0
skip repacking vision tensors
2025-03-11 14:49:19 -07:00
Patrick Devine
9b54267e69
fix configs
2025-03-11 14:49:19 -07:00
Michael Yang
46bb0169c4
update model
2025-03-11 14:49:19 -07:00
Michael Yang
8934324b72
use fast attention
2025-03-11 14:49:18 -07:00
Jesse Gross
0e886595bf
Fix tests and drift from main
2025-03-11 14:49:18 -07:00
Patrick Devine
c62861f4fa
fix conversion
2025-03-11 14:49:18 -07:00
Michael Yang
0df1800436
set non-causal attention
2025-03-11 14:49:18 -07:00
Patrick Devine
631fecc6d9
temporary workaround for converting spm
2025-03-11 14:49:18 -07:00
Jesse Gross
4346c2409d
fix drift from main
2025-03-11 14:49:18 -07:00
Michael Yang
4b037a97dc
add gemma vision encoder
2025-03-11 14:49:17 -07:00
Patrick Devine
5f74d1fd47
gemma2 impl
2025-03-11 14:35:08 -07:00
Daniel Hiltgen
4dcf80167a
Build release for Windows with local script (#9636)
2025-03-11 08:34:20 -07:00
Michael Yang
26a26998fb
Merge pull request #9590 from ollama/mxyng/dump-pad
...
fix: pad tensor item if ge zero
2025-03-10 16:34:55 -07:00
Michael Yang
9926eae015
fix: pad tensor item if ge zero
...
this produces nicer output since positive and negative values
now print at the same width
2025-03-10 16:18:12 -07:00
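The effect is the same one Go's fmt space flag gives for scalars, reserving a sign column for non-negative numbers; a quick standalone illustration (not the repository's dump code):

```go
package main

import "fmt"

func main() {
	// The space flag reserves a column for the sign, so positive and
	// negative values print at the same width and columns line up.
	fmt.Printf("% .4f\n% .4f\n", 1.5, -1.5)
	// Output:
	//  1.5000
	// -1.5000
}
```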
Vincent Koc
8585b7b151
docs: add opik to observability integrations (#9626)
2025-03-10 16:15:10 -07:00
Parth Sareen
7e34f4fbfa
sample: add numerical stability to temperature/softmax transform (#9631)
2025-03-10 14:43:53 -07:00
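The standard trick behind that stability fix is subtracting the maximum logit before exponentiating; softmax is invariant to adding a constant, so the result is unchanged while exp() can no longer overflow. A sketch of the technique, not the repository's exact sampler code:

```go
package main

import (
	"fmt"
	"math"
)

// softmax applies temperature scaling followed by a numerically stable
// softmax: the running maximum of the scaled logits is subtracted
// before exp(), keeping the largest exponent at zero.
func softmax(logits []float64, temperature float64) []float64 {
	maxLogit := math.Inf(-1)
	for _, l := range logits {
		if l/temperature > maxLogit {
			maxLogit = l / temperature
		}
	}
	probs := make([]float64, len(logits))
	var sum float64
	for i, l := range logits {
		probs[i] = math.Exp(l/temperature - maxLogit)
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

func main() {
	fmt.Println(softmax([]float64{1000, 1001, 1002}, 1.0)) // no overflow
}
```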
Michael Yang
fe776293f7
Merge pull request #9569 from dwt/patch-1
...
Better WantedBy declaration
2025-03-10 14:09:37 -07:00
frob
d8a5d96b98
docs: Add OLLAMA_CONTEXT_LENGTH to FAQ (#9545)
2025-03-10 11:02:54 -07:00
Xiaowei Zhu
757668c42f
docs: add SwiftChat (#9540)
2025-03-10 11:01:09 -07:00
Sam
96ec8afd09
docs(tool): add mcp-llm (#9537)
2025-03-10 09:52:02 -07:00
Jeffrey Morgan
e093db92c4
sample: temporarily use grammars for constrained generation in new engine (#9586)
2025-03-10 16:17:39 +01:00
Jesse Gross
a1cda80bcb
model: Update encoder cache to use multimodal input processing handler
...
The encoder cache needs to know the position of images in the input
stream so that it knows when to delete them. Previously images didn't
have a position, so we implied one by breaking batches before an
image and then assuming the image was in the first position. However,
multimodal objects are now given explicit positions in the input
stream, so we can use that instead.
Breaking batches was also a way to simulate a cross attention mask
for mllama. However, given that it only supports a single sequence
and a single image, this mask doesn't serve any real purpose.
Removing the batch break does not appear to affect the quality of
the output.
Most of this is simply moving the input data structures to a new
package to avoid import cycles.
2025-03-09 17:05:26 -07:00
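A hedged sketch of the data shape this describes; the type and field names are illustrative stand-ins for the new input package's types:

```go
package input

// MultimodalIndex records where a multimodal object (e.g. an image)
// sits in the input stream. With an explicit Index, the encoder cache
// can evict an image exactly when its position falls out of the cache,
// instead of inferring placement by breaking batches before each image.
type MultimodalIndex struct {
	Index      int // position in the input stream
	Multimodal any // the encoded image data
}

// Batch then carries both the token stream and the positions of any
// multimodal objects within it.
type Batch struct {
	Inputs     []int32
	Multimodal []MultimodalIndex
}
```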
Jesse Gross
4614fafae0
ollamarunner: Don't panic for unimplemented features at runtime.
...
It's OK to fail on startup, but we shouldn't panic at runtime
based on user input. Downgrade the panic to a warning.
2025-03-08 18:58:18 -08:00
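The pattern, sketched with Go's standard log/slog and an illustrative feature name:

```go
package main

import "log/slog"

// handleUnsupported shows the change the commit describes: a request
// that reaches an unimplemented path at runtime is logged rather than
// taking the whole runner down.
func handleUnsupported(feature string) {
	// previously: panic("unimplemented: " + feature)
	slog.Warn("unimplemented feature requested, ignoring", "feature", feature)
}

func main() {
	handleUnsupported("kv cache defrag")
}
```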
Jesse Gross
4100ed7bdd
ml: Add support for quantized KV cache
...
Similar to the llama engine, quantizing the KV cache requires
flash attention to be enabled through the Ollama server.
2025-03-07 18:43:39 -08:00
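A sketch of that constraint as a validation step; OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE are the server environment variables that control these settings, while the checking logic itself is illustrative rather than the repository's exact code:

```go
package main

import (
	"fmt"
	"os"
)

// validateCacheConfig enforces the rule in the commit message: a
// quantized KV cache is only allowed when flash attention is enabled.
func validateCacheConfig() error {
	cacheType := os.Getenv("OLLAMA_KV_CACHE_TYPE") // e.g. "f16", "q8_0", "q4_0"
	flashAttn := os.Getenv("OLLAMA_FLASH_ATTENTION") == "1"
	if cacheType != "" && cacheType != "f16" && !flashAttn {
		return fmt.Errorf("quantized KV cache type %q requires flash attention", cacheType)
	}
	return nil
}

func main() {
	if err := validateCacheConfig(); err != nil {
		fmt.Println(err)
	}
}
```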
Jesse Gross
f52b2615ef
kvcache: Set context for shift offsets
2025-03-07 18:43:39 -08:00
Jesse Gross
25f9b152f9
ggml-backend: Ensure allocations meet backend requirements
...
Backends can impose additional alignment requirements on buffer sizes.
We should ensure that we meet these or allocations can fail.
2025-03-07 18:43:39 -08:00
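Rounding a requested size up to the backend's alignment is the usual fix; a minimal sketch of the general idea (power-of-two alignment assumed, names illustrative):

```go
package main

import "fmt"

// alignUp rounds size up to the next multiple of align, satisfying a
// backend's buffer alignment requirement. The bitwise form requires
// align to be a power of two.
func alignUp(size, align uint64) uint64 {
	return (size + align - 1) &^ (align - 1)
}

func main() {
	fmt.Println(alignUp(1000, 256)) // 1024
}
```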
Jesse Gross
6da8b6a879
kvcache: Support non-causal attention
...
Models can disable causality for all or part of their processing
while continuing to store data in the KV cache.
2025-03-07 18:39:27 -08:00
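A thin sketch of that capability; the type and method here are hypothetical stand-ins for the cache's real API:

```go
package kvcache

// Causal sketches a KV cache whose causality can be toggled: a model
// can turn causality off for part of its forward pass (for example,
// while processing image tokens) and back on again, while keys and
// values keep being stored as usual.
type Causal struct {
	causal bool
	// ... cached keys/values elided ...
}

// SetCausal only changes how the attention mask is built; data already
// stored in the cache is unaffected.
func (c *Causal) SetCausal(causal bool) {
	c.causal = causal
}
```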
Jesse Gross
0daaaef8c9
ollamarunner: Quiet debug logging and panic on unimplemented features
...
Debug logging of every token has previously caused test timeouts
on slower machines.
2025-03-07 18:38:02 -08:00
Jesse Gross
98272fbd58
additional review comments
2025-03-07 14:08:21 -08:00
Michael Yang
b27e8f3f10
ml/backend/ggml: use backend buffer type
...
this ensures the tensor is created on the right buffer type for backends
such as cpu
2025-03-07 14:08:21 -08:00
Michael Yang
45df786f09
comments
2025-03-07 14:08:21 -08:00
Michael Yang
daaf42e4a4
ml/backend/ggml: clean up
2025-03-07 14:08:21 -08:00
Michael Yang
2dc60d4620
ml/backend/ggml: offload vision to cpu
...
temporary until tensor loading can accurately account for vision models
2025-03-07 14:08:21 -08:00
Michael Yang
b5312f30e8
ml/backend/ggml: handle tensor split
2025-03-07 14:08:21 -08:00