When we check the size of the required compute graph before fitting, we
pass invalid pointers. Some CUDA APIs validate these pointers, but we can
safely skip those calls during this phase. cudaMemsetAsync is one that we
weren't skipping; previously we never took the code path that used it, but
now that op_offload is enabled, we can hit it under memory pressure.
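A minimal sketch of the idea, assuming a `reserving_graph` flag stands in for however the backend marks the reservation pass (not the actual ggml-cuda code):

```cpp
// Sketch only: during graph-size reservation the tensor data pointers are
// fake, so pointer-validating CUDA calls such as cudaMemsetAsync must be
// skipped rather than executed.
#include <cuda_runtime.h>
#include <cstddef>

static bool reserving_graph = false;  // assumed flag: set while sizing the graph

static cudaError_t memset_async_guarded(void * dst, int value, size_t bytes,
                                         cudaStream_t stream) {
    if (reserving_graph) {
        // The pointer is invalid during reservation; skip the call entirely,
        // just like the other pointer-validating CUDA APIs.
        return cudaSuccess;
    }
    return cudaMemsetAsync(dst, value, bytes, stream);
}
```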
When a model is partially offloaded to system RAM, we can either do the
calculations on the CPU or temporarily transfer the data to the GPU and do
the calculations there. Small batches tend to be better on the CPU, large
batches on the GPU.
The llamarunner used the GPU in most cases, while the ollamarunner used
the CPU. Although the ollamarunner saw an improvement in token generation
performance, it took a large performance hit (3-10x) in prompt processing.
There is an existing heuristic to dynamically switch between these two
modes, but in practice it doesn't have enough information to make that
decision accurately. This change feeds authoritative data into that check
so we get the best of both worlds.
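A rough sketch of the kind of decision this enables; the names and the threshold below are illustrative, not the actual heuristic:

```cpp
// Decide whether to run an op on the CPU or copy the weights to the GPU,
// based on batch size plus authoritative data about where the weights live.
#include <cstdint>

struct op_info {
    int64_t batch_size;      // number of tokens in the current batch
    bool weights_in_sysram;  // weights for this op are partially offloaded
};

// Assumed threshold; the real check is driven by data from the runner.
constexpr int64_t kGpuBatchThreshold = 32;

enum class exec_target { CPU, GPU };

exec_target pick_target(const op_info & op) {
    if (!op.weights_in_sysram) {
        return exec_target::GPU;  // fully resident on the GPU
    }
    // Small batches (token generation): the transfer cost dominates, stay on
    // the CPU. Large batches (prompt processing): the GPU wins despite the copy.
    return op.batch_size >= kGpuBatchThreshold ? exec_target::GPU
                                               : exec_target::CPU;
}
```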
Fixes #12037
* feat: Bump llama.cpp to df1b612
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(mtmd): Correctly encode text chunks during mtmd tokenization
For some models, text chunks containing template delimiter tokens can appear
interspersed with the image embeddings. These need to be correctly translated
to text tokens.
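A simplified sketch of the chunk walk; the types and helpers below are stand-ins for illustration, not the real mtmd API:

```cpp
// Walk the chunks produced by mtmd tokenization: text chunks interleaved
// with image chunks (which may carry template delimiter tokens) must be
// appended as ordinary text tokens rather than mishandled.
#include <cstdint>
#include <vector>

using token = int32_t;

enum class chunk_type { TEXT, IMAGE };

struct mtmd_chunk {                 // simplified stand-in
    chunk_type type;
    std::vector<token> text_tokens; // only valid for TEXT chunks
};

struct decoded_input {
    std::vector<token> tokens;      // text tokens in sequence order
    std::vector<int>   image_refs;  // positions where image embeddings go
};

decoded_input flatten_chunks(const std::vector<mtmd_chunk> & chunks) {
    decoded_input out;
    for (const auto & c : chunks) {
        if (c.type == chunk_type::TEXT) {
            // Encode interspersed text chunks as regular text tokens.
            out.tokens.insert(out.tokens.end(),
                              c.text_tokens.begin(), c.text_tokens.end());
        } else {
            out.image_refs.push_back((int) out.tokens.size());
        }
    }
    return out;
}
```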
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* tests: Use MtmdChunk in image_test
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: Fix unnecessary conversion linting
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(ggml): Revert changes to ggml_hip.cpp
These changes were done largely by our code assistant and are likely wrong
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Revert changes in mem_nvml.cpp
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Update sync point to 1deee0
This brings in several more optimization commits and model support for
EmbeddingGemma
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Update patches for 1deee0
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: sync for bump to 1deee0
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Bad patch updates with errant `+`
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Bump llama.cpp/ggml to 7049736
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: format-patches after latest bump
Branch: LlamaCPPBump-GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
The GGML CUDA backend allocates additional memory for intermediate
results during computation. This memory isn't currently allocated
during worst-case graph reservation and is therefore not included in
scheduling. Because these buffers can grow with context length, we
could crash.
This extends the memory allocation system down a layer, from the GGML
graph to the CUDA layer, preallocating the worst-case memory there
as well.
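A rough sketch of the approach, with hypothetical names rather than the actual ggml-cuda code:

```cpp
// While reserving the worst-case graph, track the high-water mark of the
// intermediate buffers each op would need, then preallocate that amount so
// the scheduler can account for it and the buffers never grow later.
#include <cstddef>
#include <vector>
#include <algorithm>

struct reserve_state {
    size_t scratch_high_water = 0;  // largest intermediate allocation seen
};

// Called for each op during the reservation pass instead of actually
// allocating device memory.
void reserve_scratch(reserve_state & st, size_t bytes_needed) {
    st.scratch_high_water = std::max(st.scratch_high_water, bytes_needed);
}

// After walking the worst-case graph, this is the amount the CUDA layer
// preallocates and reports to scheduling.
size_t worst_case_scratch_bytes(const std::vector<size_t> & per_op_needs) {
    reserve_state st;
    for (size_t n : per_op_needs) {
        reserve_scratch(st, n);
    }
    return st.scratch_high_water;
}
```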
Fixes #11753
This changes the memory allocation strategy from upfront estimation to
tracking the actual allocations made by the engine and reacting to them. The
goal is to avoid issues caused by both under-estimation (crashing) and
over-estimation (low performance due to under-utilized GPUs).
It is currently opt-in and can be enabled for models running on the
Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
cases is unchanged and will continue to use the existing estimates.
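A hedged sketch of the opt-in and the tracking idea; the helper names are illustrative and this is not the engine's actual code:

```cpp
// If OLLAMA_NEW_ESTIMATES=1 is set, consult the allocations the backend
// actually makes instead of relying only on the upfront estimate.
#include <cstdlib>
#include <cstring>
#include <atomic>
#include <cstddef>

static bool new_estimates_enabled() {
    const char * v = std::getenv("OLLAMA_NEW_ESTIMATES");
    return v != nullptr && std::strcmp(v, "1") == 0;
}

struct alloc_tracker {
    std::atomic<size_t> allocated{0};

    // Called from the backend's buffer allocation/free paths.
    void on_alloc(size_t bytes) { allocated += bytes; }
    void on_free (size_t bytes) { allocated -= bytes; }
};

// The scheduler reacts to actual usage when the new estimates are enabled,
// and falls back to the existing upfront estimate otherwise.
size_t memory_in_use(const alloc_tracker & t, size_t upfront_estimate) {
    return new_estimates_enabled() ? t.allocated.load() : upfront_estimate;
}
```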