diff --git a/docs/faq.md b/docs/faq.md
index 6fe6334146..8931b6aa83 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -292,7 +292,7 @@ If too many requests are sent to the server, it will respond with a 503 error in
 
 ## How does Ollama handle concurrent requests?
 
-Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it is configured to allow parallel request processing.
+Ollama supports two levels of concurrent processing. If your system has sufficient available memory (system memory when using CPU inference, or VRAM for GPU inference) then multiple models can be loaded at the same time. For a given model, if there is sufficient available memory when the model is loaded, it can be configured to allow parallel request processing.
 
 If there is insufficient available memory to load a new model request while one or more models are already loaded, all new requests will be queued until the new model can be loaded. As prior models become idle, one or more will be unloaded to make room for the new model. Queued requests will be processed in order. When using GPU inference new models must be able to completely fit in VRAM to allow concurrent model loads.
 
@@ -301,7 +301,7 @@ Parallel request processing for a given model results in increasing the context
 The following server settings may be used to adjust how Ollama handles concurrent requests on most platforms:
 
 - `OLLAMA_MAX_LOADED_MODELS` - The maximum number of models that can be loaded concurrently provided they fit in available memory. The default is 3 * the number of GPUs or 3 for CPU inference.
-- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory.
+- `OLLAMA_NUM_PARALLEL` - The maximum number of parallel requests each model will process at the same time. The default is 1, and each model will handle 1 request at a time.
 - `OLLAMA_MAX_QUEUE` - The maximum number of requests Ollama will queue when busy before rejecting additional requests. The default is 512
 
 Note: Windows with Radeon GPUs currently default to 1 model maximum due to limitations in ROCm v5.7 for available VRAM reporting. Once ROCm v6.2 is available, Windows Radeon will follow the defaults above. You may enable concurrent model loads on Radeon on Windows, but ensure you don't load more models than will fit into your GPUs VRAM.
diff --git a/envconfig/config.go b/envconfig/config.go
index 763f046466..7fc0188703 100644
--- a/envconfig/config.go
+++ b/envconfig/config.go
@@ -219,7 +219,7 @@ func Uint(key string, defaultValue uint) func() uint {
 
 var (
 	// NumParallel sets the number of parallel model requests. NumParallel can be configured via the OLLAMA_NUM_PARALLEL environment variable.
-	NumParallel = Uint("OLLAMA_NUM_PARALLEL", 0)
+	NumParallel = Uint("OLLAMA_NUM_PARALLEL", 1)
 	// MaxRunners sets the maximum number of loaded models. MaxRunners can be configured via the OLLAMA_MAX_LOADED_MODELS environment variable.
 	MaxRunners = Uint("OLLAMA_MAX_LOADED_MODELS", 0)
 	// MaxQueue sets the maximum number of queued requests. MaxQueue can be configured via the OLLAMA_MAX_QUEUE environment variable.
diff --git a/server/sched.go b/server/sched.go
index e71cdd1bd5..2842bb3a0a 100644
--- a/server/sched.go
+++ b/server/sched.go
@@ -57,9 +57,7 @@ type Scheduler struct {
 var defaultModelsPerGPU = 3
 
 // Default automatic value for parallel setting
-// Model will still need to fit in VRAM. If this setting won't fit
-// we'll back off down to 1 to try to get it to fit
-var defaultParallel = 2
+var defaultParallel = 1
 
 var ErrMaxQueue = errors.New("server busy, please try again. maximum pending requests exceeded")
 
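
To see the effect of the new default in isolation, here is a minimal, self-contained Go sketch of an env-backed setting with the same Uint(key string, defaultValue uint) func() uint shape that appears in the envconfig/config.go hunk above. Only that signature and the new default of 1 are taken from the diff; the parsing and fallback logic below is an assumption for illustration, not the actual envconfig implementation.

package main

import (
	"fmt"
	"os"
	"strconv"
)

// Uint mirrors the envconfig helper's shape: it returns a closure that
// reads the environment variable on each call and falls back to the
// default when the variable is unset or not a valid unsigned integer.
// (The parsing and fallback details here are assumed, not the real body.)
func Uint(key string, defaultValue uint) func() uint {
	return func() uint {
		if s := os.Getenv(key); s != "" {
			if v, err := strconv.ParseUint(s, 10, 64); err == nil {
				return uint(v)
			}
		}
		return defaultValue
	}
}

// NumParallel now defaults to 1, matching the change in envconfig/config.go.
var NumParallel = Uint("OLLAMA_NUM_PARALLEL", 1)

func main() {
	// With OLLAMA_NUM_PARALLEL unset this prints 1; exporting
	// OLLAMA_NUM_PARALLEL=4 before running would print 4 instead.
	fmt.Println("parallel requests per model:", NumParallel())
}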