Callers can set a backend buffer type to be no-alloc, meaning that
it does not allocate memory for tensors or operations. This can
be used for calculating memory requirements. Tensors and graphs
must be recreated with no-alloc set to false before loading data.
Defaults to false for newly created backend buffer types.
For many backend data structures, GGML defines a typedef of a pointer
type and returns these from functions. In most cases, CGo understands
that these are interchangable but some parts of Go (such as generics)
think they are two different types. We should prefer the form that
GGML uses.
gpt-oss works best with a context length of at least 8k. However,
for GPUs with limited amount of VRAM, there is a significant
performance hit to this increased context. In these cases, we
switch to the Ollama default of 4k
We were missing passing along thinking if content was nil (as opposed
to empty string)
Also added a test for content not being passed, which was the real cause
of <https://github.com/ollama/ollama/issues/11704>, since with the way
`Content` is typed, not passing it and empty string are distinct
Added support for converting both `name` and `tool_call_id` fields,
which different clients might provide. `name` is a legacy field from the
OpenAI completions API. For `tool_call_id` we inspect previous messages
and look for a matching tool call ID and grab its name
Issue: https://github.com/ollama/ollama/issues/11704
Previously our OpenAI chat completions compat layer assumed that tool
calls and content would never be provided together, but this is not a
correct assumption. Content is only optional when tool calls are
present, but tool calls and content can be provided together
Fixes: https://github.com/ollama/ollama/issues/11704
afaik gpt-oss is the first model that meaningfully transforms tool
function definitions in its template. We found that relatively common
definitions that include `anyOf` were not working because the template
was assuming that types were always defined via a `type` field.
anyOf allows for fully recursive types, so I exposed a
`toTypeScriptType()` function to handle this recursive logic in go and
keep the templates cleaner. The gpt-oss templates will need to be
updated to use this.
We should keep building out our function definition support to more
fully support the parts of json schema that make sense for this use
case, but in the meantime this will unblock some users (e.g., zed's
ollama integration w/ gpt-oss). Probably the most urgent is proper array
support
KV cache quantization has a dependency on the flash attention kernel.
We currently cannot use flash attention with gpt-oss as it requires
additional operations.
The model definition does not call flash attention, so it works
regardless of the setting but the cache will pick up the
quantization type. This updates the flash attention setting earlier
in the loading flow so that all downstream settings are also set correctly.
Fixes: #11671
* bf16
* tests
* gpt-oss
* enable gptoss for engine
* rough estimate
* convert to mxfp4
* handle safetensors U8
* clamp glu/linear
* update tokenizer
* MXFP4 support
This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.
* Unit tests for MXFP4 support
This exercises various operations and shapes on both CPU and GPU (if detected
on the system)
* cuda graph
* unit test adjustments
* cuda: optimize memory access
Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
* mac: fix crash on old macos versions
cblas_sgemm is only supported on v13.3 and up, however bf16 is
only supported on v14+ so we were falling back to ggml-blas and
crashing on bf16 tensors. Checking for the function being null
seems to be the simplest way to condittionally avoid registering the
backend.
* server: Minimum context length for gptoss
This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms
but lower values will be silently reset.
* ggml: Multiply by numParallel for gptoss sliding window
When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.
* gpt-oss integration
includes harmony parser and thinking levels, etc.
* fix sync
* fix tests
* fix lint
---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
There is a bug when using sliding window attention where we run
out of KV cache slots. This is likely due to not correctly removing
all of the entries as they slide out of range. This adds additional
logging when this occurs to track down the source.
Bug #10127
Models that use sliding window attention can only resume a sequence
from the cache if it falls within the saved windows. This works well
if the next message picks up where the old one left off. However, it
generally prevents a partial prefix match unless the entire conversation
falls within the sliding window.
This can be a problem with reasoning models where the traces are
supposed to be removed from future messages, forcing the entire
history to be re-evaluated.
This change allows models to specify that a larger amount of the
history be retained in memory, to allow more partial resumption.
It still respects the window that the model was trained on for
token generation.
* Enable CUDA Graphs for gemma3n.
Similar to
https://github.com/ggml-org/llama.cpp/pull/14741,
though ollama has a slightly different model graph
than llama.cpp which requires different workaround
checks.
* Remove residual check by reshaping differently in gemma3n model
This should make the heuristics more robust
When we context shift, we delete half the context and apply RoPE
with an offset to the other half. We used to RoPE across the entire
context in a single pass with a zero offset for the deleted
section. With the change to shifting in batches, we can skip any
batches where all of the offsets would be zero. This typically
reduces the number of operations by half.
Currently, when we need to do a shift on the cache, it is one
RoPE operation on the entire size of the cache (per layer). In
some cases, this can create a compute graph that is larger than
the forward pass since the forward pass is working in batches.
Since we don't consider shifting in our memory estimates, it's
possible for this to cause a crash if we run out of memory.
By limiting the size of the RoPE calls to batch size chunks, we
ensure that the shift will never exceed the size of the forward
pass, since the forward pass will also contain a RoPE of the same
size. This does not have a sigificant impact on performance since
RoPE is a math operation that is mostly proportional to the size
of its inputs.
In theory defrag could have the same issue since it also creates a
compute graph outside of the forward pass, however, since it is
only copies, it does not require any working space.
StatusError was unreachable, the client always checked for error messages in the response body first, and the server always includes error messages with HTTP error status codes.