The recent memory management changes caused all GPUs to be visible to the runner, regardless of whether they are ultimately used. This caused CUDA to allocate a primary context (~300 MB of VRAM) on each GPU, for each model. This is unnecessary, so we can both avoid touching GPUs that we exclude in the early stage of allocation and free the memory for any that we touch but don't use. The issue will continue to exist for the old engine, since it touches all devices during initialization.
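For illustration only (not the actual runner code), the sketch below shows the underlying CUDA behavior being described: merely issuing a runtime call on a device lazily creates its primary context, which by itself reserves VRAM, and resetting devices that were touched but won't be used returns that memory. The device-selection logic is a hypothetical stand-in.

```c
// Sketch, not ollama's implementation: demonstrates primary-context creation
// and teardown with the CUDA runtime API. Error handling omitted for brevity.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);

    // Assume the scheduler decided only device 0 will actually run the model.
    int used_device = 0;

    for (int dev = 0; dev < count; dev++) {
        cudaSetDevice(dev);
        size_t free_b = 0, total_b = 0;
        // Any runtime call like this lazily initializes the device's primary
        // context, which on its own reserves a few hundred MB of VRAM.
        cudaMemGetInfo(&free_b, &total_b);
        printf("device %d: %zu MiB free of %zu MiB\n",
               dev, free_b >> 20, total_b >> 20);

        if (dev != used_device) {
            // Tear down the primary context on devices we touched but won't
            // use, releasing the reserved VRAM.
            cudaDeviceReset();
        }
    }
    return 0;
}
```

Better still, as the change describes, is to avoid touching excluded GPUs in the first place so no context is ever created on them.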