Also, our commit messages have been getting better, but we can do
better, and be more consistent. This adds more clarity on how to write
commit messages and provides examples of good and bad messages.
Also, our contributing guide was lacking helpful guidance on how to
start change proposals. This commit adds the start of that section.
Soon, we should add a proposal template to the issue tracker with a link
back to the proposal section, which should also be expanded upon.
As are adding support for weighted sampling we have seen some performance
regressions, bypassing the sampler logic for now and defaulting to greedy
until we can benchmark the new sampler logic.
update Context.Forward to accept multiple tensors to match
Context.Compute signature
update Context.Forward to return Context such that it can be chained
with Context.Compute
This fixes panics introduced in 2412adf42b8380748ac79476e273f5b337c3b977
when Gin ungracefully assumes that the http.ResponseWriter implements
http.CloseNotifier and http.Flusher, which our new statusCodeRecorder
does not. This is a temporary fix until we can pour the rest of the Gin
out.
This commit introduces a new API implementation for handling
interactions with the registry and the local model cache. The new API is
located in server/internal/registry. The package name is "registry" and
should be considered temporary; it is hidden and not bleeding outside of
the server package. As the commits roll in, we'll start consuming more
of the API and then let reverse osmosis take effect, at which point it
will surface closer to the root level packages as much as needed.
* Windows ARM build
Skip cmake, and note it's unused in the developer docs.
* Win: only check for ninja when we need it
On windows ARM, the cim lookup fails, but we don't need ninja anyway.
* Add cuda Blackwell architecture for v12
* Win: Split rocm out to separate zip file
* Reduce CC matrix
The 6.2 and 7.2 architectures only appear on Jetsons, so they were wasting space.
The 5.0 should be forward compatible with 5.2 and 5.3.
During work on our new registry client, I ran into frustrations with CI
where a misspelling in a comment caused the linter to fail, which caused
the tests to not run, which caused the build to not be cached, which
caused the next run to be slow, which caused me to be sad.
This commit address these issues, and pulls in some helpful changes
we've had in CI on ollama.com for some time now.
They are:
* Always run tests, even if the other checks fail.
Tests are the most important part of CI, and should always run. Failures
in tests can be correlated with failures in other checks, and can help
surface the root cause of the failure sooner. This is especially
important when the failure is platform specific, and the tests are not
platform independent.
* Check that `go generate` is clean.
This prevents 'go generate' abuse regressions. This codebase used to use
it to generate platform specific binary build artifacts. Let's make sure
that does not happen again and this powerful tool is used correctly, and
the generated code is checked in.
Also, while adding `go generate` the check, it was revealed that the
generated metal code was putting dates in the comments, resulting in
non-deterministic builds. This is a bad practice, and this commit fixes
that. Git tells us the most important date: the commit date along with
other associated changes.
* Check that `go mod tidy` is clean.
A new job to check that `go mod tidy` is clean was added, to prevent
easily preventable merge conflicts or go.mod changes being deferred to a
future PR that is unrelated to the change that caused the go.mod to
change.
* More robust caching.
We now cache the go build cache, and the go mod download cache
independently. This is because the download cache contains zips that can
be unpacked in parallel faster than they can be fetched and extracted by
tar. This speeds up the build significantly.
The linter is hostile enough. It does not need to also punish us with
longer build times due to small failures like misspellings.
* Bump cuda and rocm versions
Update ROCm to linux:6.3 win:6.2 and CUDA v12 to 12.8.
Yum has some silent failure modes, so largely switch to dnf.
* Fix windows build script
centos-7 images have been deprecated upstream and replaced with
almalinux-8 images instead, requiring some small extra work.
Signed-off-by: José Pekkarinen <jose.pekkarinen@foxhound.fi>
This commit copies (without history) the bmizerany/ollama-go repository
with the intention of integrating it into the ollama as a replacement
for the pushing, and pulling of models, and management of the cache they
are pushed and pulled from.
New homes for these packages will be determined as they are integrated
and we have a better understanding of proper package boundaries.
The route assembly in Handler lacked clear organization making it
difficult scan for routes and their relationships to each other. This
commit aims to fix that by reordering the assembly of routes to group
them by category and purpose.
Also, be more specific about what "config" refers to (it is about CORS
if you were wondering... I was.)
There are two benefits to doing this:
- Provide a library function that models can use, reducing code for
each model implementation
- Enables a single place to drop in optimized implementations of
attention based on the backend or other factors. One is provided for
GGML.
On CUDA this improves token generation rate by about 3%. It does not
have a significant effect on Metal.
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Currently Rows is called as the last step in a model computation
to get the values for the output tokens. However, if we move it
earlier in the process then we can trim out computations that
never get used. This is similar to how models are defined in
llama.cpp.
Changing the model definition in this way improves token generation
performance by approximately 8%.
We don't need to create and destroy the GGML scheduler for every
context. This introduces extra CPU overhead for every forward
pass and extra memory for contexts that don't actually get scheduled
(for example, KV caches). We can instead just have one scheduler
for the backend and reset it each time we call Compute.
This improves token generation performance by 1-2% and removes
scheduler create/destroy from profile traces.
Currently the following parameters are in the runner but not used:
- numGPULayers
- mainGPU
- threads
- tensorSplit
This passes them through to the backend, which is where they would
actually get used. However, the GGML backend does not yet do anything
with them.
Added unit tests to verify error handling behavior in the Client.stream and Client.do methods.
Tests cover various error scenarios including:
- Error responses with status codes >= 400
- Error messages with successful status codes
- Empty error messages
- Successful responses
clang outputs are faster. we were previously building with clang via gcc
wrapper in cgo but this was missed during the build updates so there was
a drop in performance