We don't check the return status after computing the graph, which
can silently lead to bad outputs if we try to keep going and future
computation succeeds. This appears to happens in certain cases on
Apple M2 devices.
Fixes#11070
Fixes issue where tool calls that don't expect any parameters were
not being parsed. This also fixes two additional issues: one where
2+ tool calls would not be correctly parsed, and cases where tool calls
with invalid parameters would still get parsed
The current splitDim function only operates on tensors that are split evenly which isn't always the case, e.g. a QKV tensor. This change allows the function to be used for arbitrary splits
This enables matching up devices and information reported by the backend
with system management libraries such as nvml to get accurate free
memory reporting.
"POST predict" basically means that the runner has crashed, which
can have many reasons. However, many people think this is a specific
error and either report only this message or group together unrelated
bugs. This replaces it with a more friendly and helpful message.
- Both `/api/generate` and `/api/chat` now accept a `"think"`
option that allows specifying whether thinking mode should be on or
not
- Templates get passed this new option so, e.g., qwen3's template can
put `/think` or `/no_think` in the system prompt depending on the
value of the setting
- Models' thinking support is inferred by inspecting model templates.
The prefix and suffix the parser uses to identify thinking support is
also automatically inferred from templates
- Thinking control & parsing is opt-in via the API to prevent breaking
existing API consumers. If the `"think"` option is not specified, the
behavior is unchanged from previous versions of ollama
- Add parsing for thinking blocks in both streaming/non-streaming mode
in both `/generate` and `/chat`
- Update the CLI to make use of these changes. Users can pass `--think`
or `--think=false` to control thinking, or during an interactive
session they can use the commands `/set think` or `/set nothink`
- A `--hidethinking` option has also been added to the CLI. This makes
it easy to use thinking in scripting scenarios like
`ollama run qwen3 --think --hidethinking "my question here"` where you
just want to see the answer but still want the benefits of thinking
models
Computing an attention mask for a large context and max batch is
expensive - over 100ms. Models like Gemma3 that have multiple types
of caches and custom attention masks need to do this 4 times, so this
adds approximately 500ms to startup time when using 128k context
When we are reserving the worst case graph, we don't need the mask,
only its shape, so we can skip this.
This commit updates the README to include macLlama within the community integrations section.
macLlama is a native macOS application built for lightweight and efficient LLM interaction. Key features include:
* **Lightweight & Native:** Designed to be resource-friendly and perform optimally on macOS.
* **Chat-like Interface:** Provides a user-friendly, conversational interface.
* **Multiple Window Support:** Allows users to manage multiple conversations simultaneously.
The primary goal of macLlama is to offer a simple and easy-to-run LLM experience on macOS.