We pass invalid pointers when we check the size of the required
compute graph before fitting. Some CUDA APIs validate these pointers,
but we can safely skip them during this phase. cudaMemsetAsync is one
of the calls we weren't skipping; previously we never took the code
path that uses it, but now that op_offload is enabled, we can hit it
under memory pressure.
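A minimal sketch of the pattern in Go (the actual fix guards cudaMemsetAsync in the C++ CUDA backend; every name here is illustrative, not real code from that backend): while we are only measuring graph size, calls that would validate or touch data pointers are skipped.

```go
package backend

// buffer is an illustrative stand-in for a backend buffer; measureOnly is true
// while we are only sizing the compute graph and no device memory is allocated.
type buffer struct {
	ptr         uintptr
	measureOnly bool
}

// memsetAsync shows the guard: skip the call during the size-check pass
// instead of handing an invalid pointer to the driver. (Analogue only; the
// real change guards cudaMemsetAsync in the CUDA backend.)
func memsetAsync(b *buffer, value byte, size int) error {
	if b.measureOnly {
		return nil
	}
	return deviceMemset(b.ptr, value, size)
}

// deviceMemset is a stub standing in for the actual driver call.
func deviceMemset(ptr uintptr, value byte, size int) error { return nil }
```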
On Windows, AMD device IDs are numeric and can reorder depending on the filter environment variables.
By passing the filter environment in on a full discovery refresh, we only look at the supported devices
and ignore unsupported iGPUs. Without this, on some systems the iGPU's VRAM was incorrectly
being reported for the dGPU.
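A rough Go sketch of the idea, with a hypothetical subprocess flag and wire format (not the actual discovery API): the refresh applies the same filter environment that will be used at load time, so the enumerated devices and their numeric IDs match.

```go
package discover

import (
	"encoding/json"
	"os"
	"os/exec"
)

// DeviceInfo is an illustrative stand-in for the real discovery result type.
type DeviceInfo struct {
	ID        string `json:"id"`
	TotalVRAM uint64 `json:"total_vram"`
}

// refreshDevices runs the discovery helper with the same filter environment
// (for example HIP_VISIBLE_DEVICES) that will be used when the model is
// loaded, so the numeric AMD IDs on Windows refer to the same devices in both
// passes and unsupported iGPUs are never enumerated. The flag and JSON format
// here are made up for the sketch.
func refreshDevices(filterEnv []string) ([]DeviceInfo, error) {
	cmd := exec.Command(os.Args[0], "runner", "--list-devices")
	cmd.Env = append(os.Environ(), filterEnv...)

	out, err := cmd.Output()
	if err != nil {
		return nil, err
	}

	var devices []DeviceInfo
	if err := json.Unmarshal(out, &devices); err != nil {
		return nil, err
	}
	return devices, nil
}
```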
When a model is partially offloaded to system RAM, we can either
do the calculations on the CPU or we can temporarily transfer the
data to the GPU to do the calculations there. Small batches tend
to be better on the CPU, large batches on the GPU.
The llamarunner used the GPU in most cases and the ollamarunner
used the CPU. Although the ollamarunner saw an improvement in
token generation performance, there was a large performance hit
in prompt processing (3-10x).
There is an existing heuristic to dynamically switch between these
two modes, but in practice it doesn't have enough information to
make that decision accurately. This change supplies authoritative
data so the check works, giving us the best of both worlds.
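A minimal sketch of the decision, with made-up names and an illustrative threshold (the real check lives in the backend's offload logic and is driven by the data described above): for weights left in system RAM, small batches stay on the CPU, while large batches are worth the cost of copying the data to the GPU.

```go
package runner

// offloadBatchThreshold is an illustrative cutoff, not the tuned value used by
// the real heuristic.
const offloadBatchThreshold = 32

// useGPUForHostWeights sketches the decision for weights kept in system RAM:
// token generation (small batches) runs on the CPU, prompt processing (large
// batches) temporarily transfers the data to the GPU.
func useGPUForHostWeights(batchSize int, gpuAvailable bool) bool {
	return gpuAvailable && batchSize >= offloadBatchThreshold
}
```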
Fixes #12037
We currently allocate the worst case batch using the maximum
batch size, which corresponds to prompt processing. However,
there are some cases where the generated graph is different
for small and large batches. To ensure that we don't need
to allocate memory later, after layout has taken place, we
should run the worst case batch both ways and take the larger
amount of memory.
This does not noticeably affect loading speed, as the most expensive
part of this logic is image processing, which does not occur in the
token generation case.
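A minimal sketch of the approach, assuming a hypothetical estimate function in place of actually building and reserving the graph through the backend:

```go
package runner

// worstCaseGraphSize runs the worst case estimate both ways and keeps the
// larger result, so no further allocation is needed after layout.
// estimate is a hypothetical stand-in for building and measuring the graph.
func worstCaseGraphSize(estimate func(batchSize int) uint64, maxBatch int) uint64 {
	prompt := estimate(maxBatch) // prompt processing: maximum batch size
	generate := estimate(1)      // token generation: single token, graph may differ
	return max(prompt, generate)
}
```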
this change fixes two bugs with `ollama rm`:
1. before a model is removed, it should first be stopped. this only
happened for the first argument and was skipped for all other models
2. models were unloaded indiscriminately. this errors for cloud models,
so the unload should be skipped for them
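a rough sketch of the intended flow (helper names are made up, not the actual cmd package API):

```go
package cmd

import "strings"

// The helpers below are illustrative stand-ins, not the real cmd package API.
func isCloudModel(name string) bool { return strings.HasSuffix(name, "-cloud") } // hypothetical check
func stopModel(name string) error   { return nil }                               // would ask the server to unload the model
func deleteModel(name string) error { return nil }                               // would call the delete API

// removeModels sketches the intended flow: stop every model named on the
// command line (not just the first), and skip the unload step for cloud
// models, which would otherwise return an error.
func removeModels(names []string) error {
	for _, name := range names {
		if !isCloudModel(name) {
			if err := stopModel(name); err != nil {
				return err
			}
		}
		if err := deleteModel(name); err != nil {
			return err
		}
	}
	return nil
}
```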
* Fix vulkan PCI ID and ID handling
Intel GPUs may not report PCI IDs, which was leading to incorrect overlap
detection. Switch to using the existing PCI IDs where they are reported. AMD GPUs
claim not to report PCI IDs but actually do, so try anyway; this is required for ADLX to
find the GPUs on Windows. Numeric IDs lead to scheduling problems, so this also
switches Vulkan to UUID based IDs (see the sketch after this list). The GPU discovery
patches have been squashed into a single patch to simplify future rebases.
* review comments
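A rough Go sketch of the ID preference described above, with illustrative field names (not the real ggml or ollama structures): prefer identifiers that survive reordering over the bare enumeration index.

```go
package discover

import "fmt"

// vulkanDevice is an illustrative stand-in for a discovered Vulkan device.
type vulkanDevice struct {
	Index int    // enumeration order; can change between runs and filters
	UUID  string // stable across reordering when the driver reports it
	PCIID string // e.g. "0000:03:00.0"; AMD reports this even though the capability query says otherwise
}

// stableID prefers identifiers that survive reordering: UUID first, then the
// PCI address, and the numeric index only as a last resort, since numeric IDs
// lead to scheduling problems.
func (d vulkanDevice) stableID() string {
	switch {
	case d.UUID != "":
		return d.UUID
	case d.PCIID != "":
		return d.PCIID
	default:
		return fmt.Sprintf("%d", d.Index)
	}
}
```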