* origin/main:
Revert "CI: switch back to x86 macos builder" (#11588)
mac: disable bf16 on unsupported OS versions (#11585)
CI: switch back to x86 macos builder (#11572)
Increase performance for Gemma3n models on NVGPUs by enabling CUDA Graph execution (#11525)
kvcache: Don't shift empty batches
docs: fix typos and remove trailing whitespaces (#11554)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* Enable CUDA Graphs for gemma3n.
Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though ollama has a slightly different model graph
than llama.cpp, which requires different workaround
checks.
* Remove residual check by reshaping differently in gemma3n model
This should make the heuristics more robust.
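As a rough illustration of what such a heuristic looks like, here is a minimal Go sketch of a pattern-based eligibility check over graph nodes. The node type, op strings, and name pattern are hypothetical; the real checks live in ggml's CUDA backend and are model-specific.

```go
package example

import "strings"

// graphNode is a hypothetical stand-in for one node of the compute
// graph; the real checks inspect ggml tensors in the CUDA backend.
type graphNode struct {
	op   string // operation, e.g. "MUL_MAT"
	name string // name assigned when the model graph is built
}

// cudaGraphSafe walks the graph and vetoes CUDA graph capture when it
// sees a node whose shape or topology can change between decode steps,
// since a captured graph can only be replayed if the topology is stable.
// Which patterns to check is model-dependent, which is why gemma3n in
// ollama needed different checks than the llama.cpp graph.
func cudaGraphSafe(nodes []graphNode) bool {
	for _, n := range nodes {
		// Hypothetical pattern: nodes tagged as prompt-processing
		// matmuls change shape with the batch, so capture is skipped.
		if n.op == "MUL_MAT" && strings.Contains(n.name, "prompt") {
			return false
		}
	}
	return true
}
```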
When we context shift, we delete half the context and apply RoPE
with an offset to the other half. We used to RoPE across the entire
context in a single pass with a zero offset for the deleted
section. With the change to shifting in batches, we can skip any
batches where all of the offsets would be zero. This typically
reduces the number of operations by half.
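To make the skip concrete, here is a minimal sketch, assuming a hypothetical helper that receives per-position offsets and a callback that builds the RoPE graph for one chunk (not the actual kvcache code):

```go
package example

// applyShift walks the per-position offsets in batch-size chunks and
// skips any chunk whose offsets are all zero (for example, the half of
// the context that was just deleted), so a RoPE graph is only built for
// chunks that actually need to rotate.
func applyShift(offsets []int32, batchSize int, rope func(start int, chunk []int32)) {
	for start := 0; start < len(offsets); start += batchSize {
		end := start + batchSize
		if end > len(offsets) {
			end = len(offsets)
		}
		chunk := offsets[start:end]

		allZero := true
		for _, off := range chunk {
			if off != 0 {
				allZero = false
				break
			}
		}
		if allZero {
			continue // nothing to rotate; typically about half the chunks
		}

		rope(start, chunk)
	}
}
```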
* origin/main:
readme: add Mayan EDMS to community integrations (#11543)
kvcache: Group shift operations into batches
CONTRIBUTING: fix typo in commit message example (#11528)
Currently, when we need to do a shift on the cache, it is one
RoPE operation on the entire size of the cache (per layer). In
some cases, this can create a compute graph that is larger than
the forward pass since the forward pass is working in batches.
Since we don't consider shifting in our memory estimates, it's
possible for this to cause a crash if we run out of memory.
By limiting the size of the RoPE calls to batch size chunks, we
ensure that the shift will never exceed the size of the forward
pass, since the forward pass will also contain a RoPE of the same
size. This does not have a significant impact on performance since
RoPE is a math operation that is mostly proportional to the size
of its inputs.
In theory, defrag could have the same issue since it also creates a
compute graph outside of the forward pass. However, since defrag
consists only of copies, it does not require any working space.
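For reference, a minimal sketch of the chunking itself (hypothetical names, not the actual cache code): the rotation is issued in batch-size pieces so the shift graph never grows past the size of a forward pass.

```go
package example

// shiftCache re-applies RoPE to cached positions after a shift. Issuing
// the rotation in batch-size chunks keeps the peak compute graph no
// larger than a normal forward pass, which already contains a RoPE of
// batch size.
func shiftCache(cacheLen, batchSize int, ropeRange func(start, n int)) {
	for start := 0; start < cacheLen; start += batchSize {
		n := batchSize
		if start+n > cacheLen {
			n = cacheLen - start
		}
		ropeRange(start, n)
	}
}
```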
* origin/main:
readme: add GMAI - Gradle Managed to community integrations (#11461)
tools: fix parsing issue when a tool name is a substring of another (#11456)
readme: update argo description to support deep research (#11455)
ci: switch mac builder to arm64 (#11379)
docs: add the no-Modelfile function of `ollama create` (#9077)
openai: allow openai endpoint to accept webp images (#11412)
readme: update the llama.cpp github link (#11427)
compile bf16 support into ggml-metal (#11430)
cmd: add default assistant role to message construction (#11431)
api: fix unreachable status err (#11423)
docs: fix typo in macos.md (#11425)
StatusError was unreachable because the client always checked for error messages in the response body first, and the server always includes error messages with HTTP error status codes.
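For illustration, a minimal Go sketch of that decode order, with hypothetical names rather than the actual client code, assuming the body carries a JSON `error` field:

```go
package example

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
)

// checkResponse mirrors the order described above: the response body is
// checked for an error message first, so a StatusError built from the
// status code alone is only a fallback and, because the server always
// sends a message with error statuses, was never reached.
func checkResponse(resp *http.Response) error {
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}

	var apiErr struct {
		Error string `json:"error"`
	}
	if json.Unmarshal(body, &apiErr) == nil && apiErr.Error != "" {
		return errors.New(apiErr.Error)
	}

	// Fallback on the bare status code; unreachable in practice because
	// the server always includes an error message with error statuses.
	if resp.StatusCode >= http.StatusBadRequest {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}

	return nil
}
```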
* origin/main:
docs: update modelfile.md to reflect current default num_ctx (#11189)
ggml: Use assigned layers when reporting loading stats
ggml: Disable unused pipeline parallelism
Only load supported models on new engine (#11362)
Reporting params.NumGPULayers can be misleading because it is the
requested number of layers, not the actual number that is loaded.
While they are often the same, the two can differ in some cases,
such as when the GPU backend is missing.
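A small sketch of the idea, counting by actual assignment instead of echoing the request (the types here are hypothetical, not the loader's):

```go
package example

// layerAssignment is a hypothetical record of where one layer ended up.
type layerAssignment struct {
	layer int
	onGPU bool
}

// countLoaded reports layers by where they were actually assigned,
// rather than echoing the requested params.NumGPULayers, which can
// overstate GPU placement when, for example, the GPU backend is missing.
func countLoaded(assignments []layerAssignment) (gpu, cpu int) {
	for _, a := range assignments {
		if a.onGPU {
			gpu++
		} else {
			cpu++
		}
	}
	return gpu, cpu
}
```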