perf: build graph for next batch in parallel to keep GPU busy

This refactors the main run loop of the ollama runner to perform the main
GPU-intensive tasks (Compute+Floats) in a goroutine, so the next batch can be
prepared in parallel. This reduces the time the GPU stalls waiting for the
next batch of work.
Daniel Hiltgen
2025-08-13 12:32:25 -07:00
parent f804e8a460
commit 31f64183dc
15 changed files with 531 additions and 150 deletions
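
The overlap described in the commit message can be illustrated with a minimal sketch. This is not the runner's actual code: `batch`, `buildBatch`, `compute`, and `runPipelined` are hypothetical names, and the GPU-intensive step is simulated with a plain sum on the CPU. It shows only the scheduling idea, assuming each batch can be prepared independently of the previous batch's result.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for a prepared unit of GPU work.
type batch struct {
	id     int
	inputs []float32
}

// buildBatch stands in for the CPU-side preparation of a batch
// (graph construction in the real runner). Assumed for illustration.
func buildBatch(id int) batch {
	inputs := make([]float32, 4)
	for i := range inputs {
		inputs[i] = float32(id*4 + i)
	}
	return batch{id: id, inputs: inputs}
}

// compute stands in for the GPU-intensive step (Compute+Floats in the
// commit message). Here it only sums the inputs.
func compute(b batch) float32 {
	var sum float32
	for _, v := range b.inputs {
		sum += v
	}
	return sum
}

// runPipelined processes n batches, overlapping the compute of batch i
// with the preparation of batch i+1.
func runPipelined(n int) []float32 {
	results := make([]float32, 0, n)
	if n <= 0 {
		return results
	}
	next := buildBatch(0) // the first batch has nothing to overlap with
	for i := 0; i < n; i++ {
		cur := next
		done := make(chan float32, 1)
		// Run the expensive step in a goroutine...
		go func() { done <- compute(cur) }()
		// ...while this goroutine prepares the following batch.
		if i+1 < n {
			next = buildBatch(i + 1)
		}
		// Wait for the in-flight compute before starting the next one.
		results = append(results, <-done)
	}
	return results
}

func main() {
	fmt.Println(runPipelined(3))
}
```

Only one compute is in flight at a time in this sketch, so results stay in batch order and the only synchronization needed is the per-batch channel.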

@@ -19,6 +19,8 @@ import (
)
func TestMaxQueue(t *testing.T) {
	t.Skip("this test needs to be re-evaluated to use a proper embedding model")
	if os.Getenv("OLLAMA_TEST_EXISTING") != "" {
		t.Skip("Max Queue test requires spawning a local server so we can adjust the queue size")
		return