* PDH free memory skeleton
* Add PDH printing
* Add LUID support for Vulkan
* wire luid from ggml-vulkan to mem-dxgi-pdh file
* Fix to ggml-impl
* Continue skeleton
* Implemented ggml_dxgi_pdh_get_device_memory
* fix comments
* Fix - change value GB to bytes
* add ifdefs to only support windows and not linux
* modify error codes
* Finished ggml_dxgi_pdh_init() function
* completed ggml_dxgi_pdh_release()
* Formatting changes, add static to functions
* fix build errors
* fix go build error
* fix luid - now should match between dxgi and vulkan
* Fix the free memory reporting (was using copy by value, change to reference)
* keep only dxgi1_2.h
* Modifications based on PR feedback
* fix merge conflicts (2) and fix desc1.description printout
* move dxgi + pdh api calls to before the vendor specific library calls
* change from 3 samples to 1 sample for PDH
* modify when old_mode is set
* add fix for building MacOS
* fix release and returns for other vendors
* add patch file
* app: add code for macOS and Windows apps under 'app'
* app: add readme
* app: windows and linux only for now
* ci: fix ui CI validation
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
The initial implementation of qwen3-vl:235b exceeded the maximum graph
size based on the number of tensors. Although this was later fixed
through the use of the mrope operation, we are close to the limit in
some cases. This updates to track the current llama.cpp usage of GGML.
We pass invalid pointers when we check the size of the required
compute graph before fitting. Some CUDA APIs validate these pointers
but we can just skip them during this phase. cudaMemsetAsync is one
of these that we weren't skipping but never took the code path that
used it. Now that we have enabled op_offload, we can hit it in
memory pressured situations.
On Windows AMD IDs are numeric, and can reorder based on the filter environment.
By passing in the filter env on a full discovery refresh, we'll only look at the actual devices
and ignore unsupported iGPUs. Without this, on some systems iGPU VRAM was incorrectly
being used to populate the dGPU.
When a model is partially offloaded to system RAM, we can either
do the calculations on the CPU or we can temporarily transfer the
data to the GPU to do the calculations there. Small batches tend
to be better on the CPU, large batches on the GPU.
The llamarunner used the GPU in most cases and the ollamarunner
used the CPU. Although the ollamarunner saw an improvement in
token generation performance, there was a large performance hit
in prompt processing (3-10x).
There is an existing heuristic to dynamically switch between these
two modes but in practice it doesn't have enough information to
accurately make that decision. This adds authoritative data to make
the check work to get the best of both worlds.
Fixes#12037
We currently allocate the worst case batch for max sized
batches, which corresponds to prompt processing. However,
there are some cases where the generated graph is different
for small and large batches. To ensure that we don't need
to allocate memory later after layout has taken place, we
should run the worst case batch both ways and take the larger
amount of memory.
This does not noticeably affect loading speed as the most expensive
part of this logic is from image processing and that does not
occur during token generation.
this change fixes two bugs with `ollama rm`:
1. before a model is removed, it will first be stopped. this only
happens for the first argument and skipped for all other models
2. models are unloaded indiscriminately. this errors for cloud models
and should be omitted
* Fix vulkan PCI ID and ID handling
Intel GPUs may not report PCI IDs which was leading to incorrect overlap
detection. Switch to using the existing PCI IDs, however AMD GPUs claim not to
report PCI IDs, but actually do, so try anyway, as this is required for ADLX to
find the GPUs on Windows. Numeric IDs lead to scheduling problems, so this also
switches Vulkan to use UUID based IDs. The GPU discovery patches have been
squashed into a single patch to simplify future rebases.
* review comments
On main, the `RENDERER` and `PARSER` fields from the `Modelfile` don't
get propagated to a new model created with a `req.From` parameter. This
is easily triggered via `ollama run qwen3-coder`, then running some save
command like `/save qwen3-coder-custom`.
Added a regression test for this, and then open the config for the
"from" model in order to use its renderer/parser as a default for the
new model. This will fix the CLI and also API-based creates.
Fixes: https://github.com/ollama/ollama/issues/12792
Currently, checking the length of prompts for embeddings to ensure
they fit in the context window (and possible truncation) occurs in
two places - the Ollama server and runner. This can lead to
inconsistencies in both the checks and reported number of tokens
processed. Since we have to do this processing in the runner, this
consolidates all of the logic there.
If we create a memory layout that should fit based on report free VRAM
but allocation still fails, we start applying a backoff. This reduces
free VRAM by an exponential percentage (1%, 2%, 4%...). However, the
points chosen tend to be too dense at the beginning and too sparse at
the end. Therefore, this switches to an incremental backoff (10%, 20%,
30%...).
* DRY out the runner lifecycle code
Now that discovery uses the runners as well, this unifies the runner spawning code
into a single place. This also unifies GPU discovery types with the newer ml.DeviceInfo
* win: make incremental builds better
Place build artifacts in discrete directories so incremental builds don't have to start fresh
* Adjust sort order to consider iGPUs
* handle cpu inference oom scenarios
* review comments
We currently short circuit generation of the cache mask and just
generate an empty tensor of the correct size. However, in some
cases, this can also skip a cast operation. This can result in the
worst case graph being not fully worst case.
We don't actually need the fast path for mask generation, so it's
better to just use the normal code path.