The quantization PR didn't block all unsupported file types,
which this PR fixes. It also updates the API docs to reflect
the now reduced set of supported types.
When creating a quantized model from safetensors we
need the array KV values to be loaded.Changing this
value to -1 loads the KV values on the returned
layer to be used and saved during quantization.
the stream accumulator exits as soon as it sees `api.ProgressResponse(status="success")` which isn't strictly correctly
since some requests may have multiple successes, e.g. `/api/create` when the source model needs to be pulled.
The correct constant to remove all entries to the end of the sequence
for the Ollama engine is math.MaxInt32. -1 is used by the old engine.
The impact of this is currently minimal because it would only occur
in situations that are not supported by the implemented models or
rarely used options.
If a model is loading, and the request context is canceled during the load
by a client closing the connection, and another request is inbound for the
same model with a different configuration (context size, etc.) thus requiring
a reload, two unload events can be in flight. The first shuts down the
original model load, but the second one caused the loss of the new
reloading runner reference, thus triggering the leak.
The primary fix is detecting the duplicate unload and ignoring the second
instance. The load routine is also hardened to ensure we detect
clobbering an already present runner and unload it with a warning.
This reduces the size of our Windows installer payloads by ~256M by dropping
support for nvidia drivers older than Feb 2023. Hardware support is unchanged.
Linux default bundle sizes are reduced by ~600M to 1G.
* Move quantization logic to GGML via new backend
This moves the model aware logic to Go code and calls GGMLs quantization code for model creation.
* Remove "add model quantizations"
This is no longer needed now that quantization is implemented in Go+GGML code directly.
Some options listed in api/types.go are not supported in
newer models, or have been deprecated in the past. This is
the first of a series of PRs to clean up the API options
This hides the LlamaServer blank window when chatting outside of the terminal (say like with an app like Msty). This has no other side effects when invoking it the regular way.
For all search path env vars make sure our dirs are first
to avoid potentially finding other incompatible libraries
on the users system.
Also fixes a minor build script glitch for windows rocm