From 20c3266e943f62ef7947f00b563de5f6c790ecb7 Mon Sep 17 00:00:00 2001
From: Daniel Hiltgen
Date: Tue, 8 Jul 2025 12:08:37 -0700
Subject: [PATCH] Reduce default parallelism to 1 (#11330)

The current scheduler algorithm picks parallelism based on available
VRAM, which complicates the upcoming dynamic layer memory allocation
algorithm. This changes the default to 1, with the intent going forward
that parallelism is explicit and will no longer be dynamically
determined. Removal of the dynamic logic will come in a follow-up.
---
 docs/faq.md         | 4 ++--
 envconfig/config.go | 2 +-
 server/sched.go     | 4 +---
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/docs/faq.md b/docs/faq.md
index 6fe6334146..8931b6aa83 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -292,7 +292,7 @@ If too many requests are sent to the server, it will respond with a 503 error in
 
 ## How does Ollama handle concurrent requests?
 
-Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing.
+Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it can be configured to allow parallel request processing.
 
 If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference new models must be able to completely fit in VRAM to allow concurrent model loads.
 
@@ -301,7 +301,7 @@ Parallel request processing for a given model results in increasing the context
 The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms:
 
 - `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 * the number of GPUs or 3 for CPU inference.
-- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory.
+- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default is 1, and will handle 1 request per model at a time.
 - `OLLAMA_MAX_QUEUE` - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512
 
 Note: Windows with Radeon GPUs currently default to 1 model maximum due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6.2 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPUs VRAM.
diff --git a/envconfig/config.go b/envconfig/config.go
index 763f046466..7fc0188703 100644
--- a/envconfig/config.go
+++ b/envconfig/config.go
@@ -219,7 +219,7 @@ func Uint(key string, defaultValue uint) func() uint {
 
 var (
 	// NumParallel sets the number of parallel model requests. NumParallel can be configured via the OLLAMA_NUM_PARALLEL environment variable.
-	NumParallel = Uint("OLLAMA_NUM_PARALLEL", 0)
+	NumParallel = Uint("OLLAMA_NUM_PARALLEL", 1)
 	// MaxRunners sets the maximum number of loaded models. MaxRunners can be configured via the OLLAMA_MAX_LOADED_MODELS environment variable.
 	MaxRunners = Uint("OLLAMA_MAX_LOADED_MODELS", 0)
 	// MaxQueue sets the maximum number of queued requests. MaxQueue can be configured via the OLLAMA_MAX_QUEUE environment variable.
diff --git a/server/sched.go b/server/sched.go
index e71cdd1bd5..2842bb3a0a 100644
--- a/server/sched.go
+++ b/server/sched.go
@@ -57,9 +57,7 @@ type Scheduler struct {
 var defaultModelsPerGPU = 3
 
 // Default automatic value for parallel setting
-// Model will still need to fit in VRAM. If this setting won't fit
-// we'll back off down to 1 to try to get it to fit
-var defaultParallel = 2
+var defaultParallel = 1
 
 var ErrMaxQueue = errors.New("server busy, please try again. maximum pending requests exceeded")
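
For readers of this patch, the sketch below is a minimal, self-contained Go illustration (not code from this repository) of how a Uint-style environment lookup behaves after the change. The helper name uintVar and the main function are purely illustrative; only the Uint(key, defaultValue) shape visible in the hunk header, the OLLAMA_NUM_PARALLEL variable, and the new default of 1 come from the patch itself.

package main

import (
	"fmt"
	"os"
	"strconv"
)

// uintVar mirrors the shape of a Uint(key, defaultValue) helper: it returns a
// closure that re-reads the environment on each call and falls back to
// defaultValue when the variable is unset or not a valid unsigned integer.
func uintVar(key string, defaultValue uint) func() uint {
	return func() uint {
		if s := os.Getenv(key); s != "" {
			if v, err := strconv.ParseUint(s, 10, 64); err == nil {
				return uint(v)
			}
		}
		return defaultValue
	}
}

func main() {
	// After this patch the fallback is a fixed 1 rather than a value chosen
	// dynamically from available memory.
	numParallel := uintVar("OLLAMA_NUM_PARALLEL", 1)
	fmt.Println(numParallel()) // prints 1 when OLLAMA_NUM_PARALLEL is unset

	// Parallel request processing is now explicitly opt-in, e.g. by setting
	// OLLAMA_NUM_PARALLEL in the environment before starting the server.
	os.Setenv("OLLAMA_NUM_PARALLEL", "4")
	fmt.Println(numParallel()) // prints 4
}

In other words, where the old default of 0 let the scheduler choose 4 or 1 at load time based on free memory, pinning the default to 1 makes any higher parallelism an explicit operator decision.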