Patrick Devine
2e54d72fc3
fix gemma3 1b conversion
2025-03-11 14:49:19 -07:00
Michael Yang
6b32a2d549
compat with upstream gguf
2025-03-11 14:49:19 -07:00
Michael Yang
c5cbe4fc2a
fallback to cpu
2025-03-11 14:49:19 -07:00
Michael Yang
f888912870
fix vision encoder
2025-03-11 14:49:19 -07:00
Michael Yang
9e4642e9b3
ollama debug tensor
2025-03-11 14:49:19 -07:00
Michael Yang
6b0486c216
duplicate token_embd to output
2025-03-11 14:49:19 -07:00
Michael Yang
d368c039f0
skip repacking vision tensors
2025-03-11 14:49:19 -07:00
Patrick Devine
9b54267e69
fix configs
2025-03-11 14:49:19 -07:00
Michael Yang
46bb0169c4
update model
2025-03-11 14:49:19 -07:00
Michael Yang
8934324b72
use fast attention
2025-03-11 14:49:18 -07:00
Jesse Gross
0e886595bf
Fix tests and drift from main
2025-03-11 14:49:18 -07:00
Patrick Devine
c62861f4fa
fix conversion
2025-03-11 14:49:18 -07:00
Michael Yang
0df1800436
set non-causal attention
2025-03-11 14:49:18 -07:00
Patrick Devine
631fecc6d9
temporary workaround for converting spm
2025-03-11 14:49:18 -07:00
Jesse Gross
4346c2409d
fix drift from main
2025-03-11 14:49:18 -07:00
Michael Yang
4b037a97dc
add gemma vision encoder
2025-03-11 14:49:17 -07:00
Patrick Devine
5f74d1fd47
gemma2 impl
2025-03-11 14:35:08 -07:00
Daniel Hiltgen
4dcf80167a
Build release for windows with local script ( #9636 )
2025-03-11 08:34:20 -07:00
Michael Yang
26a26998fb
Merge pull request #9590 from ollama/mxyng/dump-pad
...
fix: pad tensor item if ge zero
2025-03-10 16:34:55 -07:00
Michael Yang
9926eae015
fix: pad tensor item if ge zero
...
this produces a nicer output since both positive and negative values
produce the same width
2025-03-10 16:18:12 -07:00
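The padding fix above amounts to giving non-negative values the same printed width as negative ones when dumping tensor items. A minimal Go sketch of the idea (the helper name is hypothetical, not the actual dump code):

```go
package main

import "fmt"

// padItem formats a tensor element so that non-negative values get a leading
// space, matching the width taken by a negative value's minus sign.
// Hypothetical helper illustrating the idea behind the commit.
func padItem(v float32) string {
	if v >= 0 {
		return fmt.Sprintf(" %.4f", v)
	}
	return fmt.Sprintf("%.4f", v)
}

func main() {
	for _, v := range []float32{1.2345, -1.2345, 0} {
		fmt.Println(padItem(v)) // all three lines have the same width
	}
}
```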
Vincent Koc
8585b7b151
docs: add opik to observability integrations ( #9626 )
2025-03-10 16:15:10 -07:00
Parth Sareen
7e34f4fbfa
sample: add numerical stability to temperature/softmax transform ( #9631 )
2025-03-10 14:43:53 -07:00
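A numerically stable temperature/softmax transform typically divides logits by the temperature and subtracts the maximum before exponentiating, so large logits cannot overflow. A minimal sketch under that assumption (names are illustrative, not the sample package API):

```go
package main

import (
	"fmt"
	"math"
)

// softmaxWithTemperature scales logits by 1/temperature and subtracts the
// maximum before exponentiation so exp never overflows; the result is a
// normalized probability distribution.
func softmaxWithTemperature(logits []float64, temperature float64) []float64 {
	scaled := make([]float64, len(logits))
	maxVal := math.Inf(-1)
	for i, l := range logits {
		scaled[i] = l / temperature
		if scaled[i] > maxVal {
			maxVal = scaled[i]
		}
	}
	probs := make([]float64, len(logits))
	var sum float64
	for i, s := range scaled {
		probs[i] = math.Exp(s - maxVal) // subtracting the max keeps exp in range
		sum += probs[i]
	}
	for i := range probs {
		probs[i] /= sum
	}
	return probs
}

func main() {
	// Very large logits would overflow a naive exp; this stays finite.
	fmt.Println(softmaxWithTemperature([]float64{1000, 999, 998}, 0.7))
}
```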
Michael Yang
fe776293f7
Merge pull request #9569 from dwt/patch-1
...
Better WantedBy declaration
2025-03-10 14:09:37 -07:00
frob
d8a5d96b98
docs: Add OLLAMA_CONTEXT_LENGTH to FAQ. ( #9545 )
2025-03-10 11:02:54 -07:00
Xiaowei Zhu
757668c42f
docs: add SwiftChat ( #9540 )
2025-03-10 11:01:09 -07:00
Sam
96ec8afd09
docs(tool): add mcp-llm ( #9537 )
2025-03-10 09:52:02 -07:00
Jeffrey Morgan
e093db92c4
sample: temporarily use grammars for constrained generation in new engine ( #9586 )
2025-03-10 16:17:39 +01:00
Jesse Gross
a1cda80bcb
model: Update encoder cache to use multimodal input processing handler
...
The encoder cache needs to know the position of images in the input
stream so that it knows when to delete them. Previously images didn't
have a position, so we implied one by breaking batches before an
image and then assuming the image was in the first position. However,
multimodal objects are now given explicit positions in the input
stream, so we can use that instead.
Breaking batches was also a way to simulate a cross attention mask
for mllama. However, given that it only supports a single sequence
and a single image, this mask doesn't serve any real purpose.
Removing the batch break does not appear to affect the quality of
the output.
Most of this is simply moving the input data structures to a new
package to avoid import cycles.
2025-03-09 17:05:26 -07:00
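A rough sketch of carrying explicit positions for multimodal inputs so an encoder cache can evict an image once its position has been consumed, rather than inferring a position from batch breaks. All types and fields here are placeholders, not the actual input package:

```go
package main

import "fmt"

// Input is a single entry in the input stream: either a plain token or a
// multimodal object, each with an explicit position.
type Input struct {
	Token      int32
	Multimodal any   // e.g. an image embedding; nil for plain tokens
	Position   int32 // explicit position in the input stream
}

type encoderCache struct {
	entries map[int32]any // cached encoder outputs keyed by position
}

func (c *encoderCache) add(in Input) {
	if in.Multimodal != nil {
		c.entries[in.Position] = in.Multimodal
	}
}

// evictBefore removes cached encoder outputs whose positions have already
// been processed by the decoder.
func (c *encoderCache) evictBefore(pos int32) {
	for p := range c.entries {
		if p < pos {
			delete(c.entries, p)
		}
	}
}

func main() {
	c := &encoderCache{entries: map[int32]any{}}
	c.add(Input{Multimodal: "image-embedding", Position: 3})
	c.evictBefore(10)
	fmt.Println(len(c.entries)) // 0: the image is gone once its position has passed
}
```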
Jesse Gross
4614fafae0
ollamarunner: Don't panic for unimplemented features at runtime.
...
It's ok to fail on startup but we shouldn't panic during runtime
based on user input. Downgrade the panic to a warning.
2025-03-08 18:58:18 -08:00
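The pattern described above, sketched in Go: fail loudly at startup where possible, but when user input reaches an unimplemented feature at runtime, log a warning instead of panicking. Function and option names are illustrative:

```go
package main

import "log/slog"

// applySamplingOption handles a user-supplied option; unknown options are
// logged and ignored rather than crashing the runner.
func applySamplingOption(name string) {
	switch name {
	case "temperature", "top_k", "top_p":
		// supported options would be applied here
	default:
		// Previously: panic("unimplemented sampling option: " + name)
		slog.Warn("ignoring unimplemented sampling option", "option", name)
	}
}

func main() {
	applySamplingOption("mirostat") // warns instead of panicking
}
```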
Jesse Gross
4100ed7bdd
ml: Add support for quantized KV cache
...
Similar to the llama engine, quantizing the KV cache requires
flash attention to be enabled through the Ollama server.
2025-03-07 18:43:39 -08:00
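A minimal sketch of the constraint described above: a quantized KV cache type is only honored when flash attention is enabled, otherwise the cache falls back to f16. The environment-variable handling here is simplified and illustrative, not the server's actual configuration path:

```go
package main

import (
	"fmt"
	"os"
)

// kvCacheType picks the KV cache type, falling back to f16 when a quantized
// type is requested without flash attention enabled.
func kvCacheType() string {
	requested := os.Getenv("OLLAMA_KV_CACHE_TYPE")                // e.g. "q8_0"
	flashAttention := os.Getenv("OLLAMA_FLASH_ATTENTION") == "1"  // server-level switch

	if requested == "" {
		return "f16"
	}
	if requested != "f16" && !flashAttention {
		fmt.Println("quantized KV cache requires flash attention; using f16")
		return "f16"
	}
	return requested
}

func main() {
	fmt.Println("kv cache type:", kvCacheType())
}
```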
Jesse Gross
f52b2615ef
kvcache: Set context for shift offsets
2025-03-07 18:43:39 -08:00
Jesse Gross
25f9b152f9
ggml-backend: Ensure allocations meet backend requirements
...
Backends can impose additional alignment requirements on buffer sizes.
We should ensure that we meet these or allocations can fail.
2025-03-07 18:43:39 -08:00
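Meeting a backend's alignment requirement usually means rounding each requested buffer size up to a multiple of that alignment. A small sketch of the rounding (the 64-byte alignment is just an example, not a value taken from any particular backend):

```go
package main

import "fmt"

// alignUp rounds size up to the next multiple of alignment.
func alignUp(size, alignment uint64) uint64 {
	return (size + alignment - 1) / alignment * alignment
}

func main() {
	const backendAlignment = 64 // example requirement reported by a backend
	for _, size := range []uint64{1, 63, 64, 65, 1000} {
		fmt.Printf("requested %4d -> allocate %4d\n", size, alignUp(size, backendAlignment))
	}
}
```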
Jesse Gross
6da8b6a879
kvcache: Support non-causal attention
...
Models can disable causality for all or part of their processing
while continuing to store data in the KV cache.
2025-03-07 18:39:27 -08:00
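A minimal sketch of what non-causal attention implies for mask construction: in causal mode future positions are masked out, while in non-causal mode every position may attend to every other position, and KV entries are stored the same way in both cases. Types and names are illustrative, not the kvcache package API:

```go
package main

import "fmt"

// buildMask returns an additive attention mask for n positions; masked
// entries are set to a large negative value so softmax drives them to zero.
func buildMask(n int, causal bool) [][]float32 {
	const negInf = float32(-1e9)
	mask := make([][]float32, n)
	for q := 0; q < n; q++ {
		mask[q] = make([]float32, n)
		for k := 0; k < n; k++ {
			if causal && k > q {
				mask[q][k] = negInf // hide future positions in causal mode
			}
		}
	}
	return mask
}

func main() {
	fmt.Println(buildMask(3, true))  // lower-triangular visibility
	fmt.Println(buildMask(3, false)) // everything visible
}
```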
Jesse Gross
0daaaef8c9
ollamarunner: Quiet debug logging and panic on unimplemented features
...
Debug logging of every token previously caused test timeouts
on slower machines.
2025-03-07 18:38:02 -08:00
Jesse Gross
98272fbd58
additional review comments
2025-03-07 14:08:21 -08:00
Michael Yang
b27e8f3f10
ml/backend/ggml: use backend buffer type
...
this ensures the tensor is created on the right buffer type for backends
such as cpu
2025-03-07 14:08:21 -08:00
Michael Yang
45df786f09
comments
2025-03-07 14:08:21 -08:00
Michael Yang
daaf42e4a4
ml/backend/ggml: clean up
2025-03-07 14:08:21 -08:00
Michael Yang
2dc60d4620
ml/backend/ggml: offload vision to cpu
...
temporary until tensor loading can accurately account for vision models
2025-03-07 14:08:21 -08:00
Michael Yang
b5312f30e8
ml/backend/ggml: handle tensor split
2025-03-07 14:08:21 -08:00
Michael Yang
26c2e0bd35
ml/backend/ggml: handle user specified cpu offloading
2025-03-07 14:08:21 -08:00
Michael Yang
bf920883d5
ml/backend/ggml: set cpu n_threads
2025-03-07 14:08:21 -08:00
Michael Yang
58b9ec1f6b
kvcache: update tests
2025-03-07 14:08:21 -08:00
Michael Yang
7bae7fa5ce
ml/backend/ggml: create tensor on specific backend
...
some tensors should be created on specific backends to reduce the number of
copies and improve performance
2025-03-07 14:08:21 -08:00
Michael Yang
764e199d67
kvcache: create cache ctx per layer
...
each cache layer creates and maintains its own context instead of using
a large context for all layers
2025-03-07 14:08:21 -08:00
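A sketch of the per-layer context idea: instead of one large context shared by every layer, the cache lazily creates and keeps a small context per layer. Everything here is a placeholder, not the actual ml/kvcache types:

```go
package main

import "fmt"

// Context stands in for a per-layer allocation context.
type Context struct{ layer int }

// Cache maintains one context per cache layer instead of a single shared one.
type Cache struct {
	ctxs map[int]*Context
}

func (c *Cache) contextFor(layer int) *Context {
	if ctx, ok := c.ctxs[layer]; ok {
		return ctx
	}
	ctx := &Context{layer: layer}
	c.ctxs[layer] = ctx
	return ctx
}

func main() {
	c := &Cache{ctxs: map[int]*Context{}}
	for layer := 0; layer < 4; layer++ {
		fmt.Printf("layer %d uses context %p\n", layer, c.contextFor(layer))
	}
}
```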
Michael Yang
bfce55db3d
model: load non-repeated tensors into multiple backends
...
some tensors are expected to be used in repeating layers but are not
themselves repeated. this change copies these tensors into the same
backends as their repeating counterparts to minimize copying tensors
between backends
2025-03-07 14:08:21 -08:00
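A sketch of the placement strategy described above: a tensor used by repeating layers but not itself repeated is materialized once on every backend that hosts any of those layers, so no cross-backend copy is needed at inference time. Placeholder types and names only:

```go
package main

import "fmt"

type tensor struct {
	name    string
	backend string
}

// placeShared returns one copy of a shared tensor per distinct backend that
// hosts a repeating layer.
func placeShared(name string, layerBackends []string) []tensor {
	seen := map[string]bool{}
	var placed []tensor
	for _, b := range layerBackends {
		if !seen[b] {
			seen[b] = true
			placed = append(placed, tensor{name: name, backend: b})
		}
	}
	return placed
}

func main() {
	// layers 0-1 on cuda0, layer 2 on cuda1, layer 3 on cpu
	fmt.Println(placeShared("output_norm.weight", []string{"cuda0", "cuda0", "cuda1", "cpu"}))
}
```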
Michael Yang
bab6f34dc0
ml/backend/ggml: update model loading for hybrid/multi backends
...
use a strategy similar to llama.cpp's for deciding where tensors should be
allocated. this will be improved later to be aware of usable memory
before assigning the tensor
2025-03-07 14:08:21 -08:00
Parth Sareen
0682dae027
sample: improve ollama engine sampler performance ( #9374 )
...
This change brings in various interface cleanups along with greatly improving the performance of the sampler.
Tested with llama3.2 on a local machine.
Improves performance from ~70 tokens/s to ~135 tokens/s with topK(40) enabled.
Without topK, performance is ~110 tokens/s.
2025-03-07 12:37:48 -08:00
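A minimal sketch of the topK(40) step mentioned above: keeping only the k highest logits shrinks the candidate set that later transforms and sampling have to process. This is illustrative, not the sample package API:

```go
package main

import (
	"fmt"
	"sort"
)

type candidate struct {
	id    int
	logit float64
}

// topK returns the k candidates with the highest logits.
func topK(logits []float64, k int) []candidate {
	cands := make([]candidate, len(logits))
	for i, l := range logits {
		cands[i] = candidate{id: i, logit: l}
	}
	sort.Slice(cands, func(i, j int) bool { return cands[i].logit > cands[j].logit })
	if k < len(cands) {
		cands = cands[:k]
	}
	return cands
}

func main() {
	// Only the three strongest candidates survive for the remaining transforms.
	fmt.Println(topK([]float64{0.1, 2.3, -1.0, 0.7, 1.5}, 3))
}
```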
Breaker
1f6986e919
readme: add QwQ to the supported models list ( #9565 )
2025-03-07 09:30:07 -08:00
Jeffrey Morgan
4289c74359
llama: fix kv loading on snowflake-arctic-embed models ( #9536 )
2025-03-07 09:25:34 -08:00