perf: build graph for next batch in parallel to keep GPU busy

This refactors the main run loop of the ollama runner to run the GPU-intensive steps (Compute+Floats) in a goroutine, so the next batch can be prepared in parallel. This reduces the time the GPU stalls waiting for the next batch of work.
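
The overlap described above can be illustrated with a minimal sketch. It is not the runner's actual code: `batch`, `prepare`, `compute`, and `runPipelined` are hypothetical names, and `compute` is only a stand-in for the Compute+Floats step. While one batch is being computed in a goroutine, the loop prepares the following batch, then waits for the result before moving on.

```go
package main

import "fmt"

// batch is a hypothetical stand-in for a prepared unit of GPU work.
type batch struct {
	id     int
	inputs []float32
}

// prepare stands in for the CPU-side work of building the next batch.
func prepare(id int) batch {
	return batch{id: id, inputs: []float32{float32(id), float32(id) * 2}}
}

// compute stands in for the GPU-intensive step; here it only sums the inputs.
func compute(b batch) float32 {
	var sum float32
	for _, v := range b.inputs {
		sum += v
	}
	return sum
}

// runPipelined computes batch i in a goroutine while preparing batch i+1.
func runPipelined(n int) []float32 {
	results := make([]float32, 0, n)
	if n <= 0 {
		return results
	}
	next := prepare(0)
	for i := 0; i < n; i++ {
		cur := next
		done := make(chan float32, 1)
		// Launch the expensive step asynchronously.
		go func() { done <- compute(cur) }()
		// Prepare the following batch while compute is in flight.
		if i+1 < n {
			next = prepare(i + 1)
		}
		// Wait for the in-flight batch before starting the next one.
		results = append(results, <-done)
	}
	return results
}

func main() {
	fmt.Println(runPipelined(3))
}
```

Only one batch is in flight at a time in this sketch, so results stay in order and `cur` is never written while the goroutine reads it.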
2025-11-11 17:26:58 +01:00 · 2025-08-13 12:32:25 -07:00
parent f804e8a460
commit 31f64183dc
15 changed files with 531 additions and 150 deletions
--- a/integration/max_queue_test.go
+++ b/integration/max_queue_test.go
@@ -19,6 +19,8 @@ import (
 )

 func TestMaxQueue(t *testing.T) {
+	t.Skip("this test needs to be re-evaluated to use a proper embedding model")
+
 	if os.Getenv("OLLAMA_TEST_EXISTING") != "" {
 		t.Skip("Max Queue test requires spawning a local server so we can adjust the queue size")
 		return