some tensors are expected to be used in repeating layers but are not
themselves repeated. this change copies these tensors into the same
backends as their repeating counterparts to minimize copying tensors
between backends
use a similar strategy as llama.cpp for deciding where tensors should be
allocated. this will be improved later to be aware of usable memory
before assigning the tensor
This change bring in various interface cleanups along with greatly improving the performance of the sampler.
Tested with llama3.2 on local machine.
Improves performance from ~ 70 tokens/s -> 135 tokens/s with topK(40) enabled.
Without topK performance is ~ 110 tokens/s
The problem with default.target is that it always points to the target that is currently started. So if you boot into single user mode or the rescue mode still Ollama tries to start.
I noticed this because either tried (and failed) to start all the time during a system update, where Ollama definitely is not wanted.
Various vision models have different requirements for how they
receive their inputs. For example:
- Mllama wants images together with text and the image embeddings
don't themselves have positions or get stored in the main KV cache
- Llava-style models feed in embeddings similar to tokens and
images correspond to a varying number of tokens in the cache.
In addition, the strategy for providing inputs must support batching
and multiple sequences, which are managed by the runner. At the same
time, we want to keep data handling fully in the model so that new
architectures are not bottlenecked by runner code which does not
understand their particular requirements.
This provides a method for models to edit the input stream so that
it meets their needs while still being in a format that the runner
understands. This allows the runner to avoid special processing
for different models.
In addition, this fixes a regression where non-vision models may
try to incorrectly interpret images.
We sometimes tokenize partial strings. For example, with
multimodal inputs, we split the input string around the images
and then tokenize each piece. In these cases, we should only add
the special tokens on the first piece.
This commit replaces the old pull implementation in the server package
with the new, faster, more robust pull implementation in the registry
package.
The new endpoint, and now the remove endpoint too, are behind the
feature gate "client2" enabled only by setting the OLLAMA_EXPERIMENT
environment variable include "client2".
Currently, the progress indication is wired to perform the same as the
previous implementation to avoid making changes to the CLI, and because
the status reports happen at the start of the download, and the end of
the write to disk, the progress indication is not as smooth as it could
be. This is a known issue and will be addressed in a future change.
This implementation may be ~0.5-1.0% slower in rare cases, depending on
network and disk speed, but is generally MUCH faster and more robust
than the its predecessor in all other cases.
- output backend system info when initializing the backend. this ensures
this information is always present without needing to be called
explicitly
- convert to structured logging
- enumerate devices rather than backends since devices are ordered
- track device indices grouped by device name
* Include unified vision layers in memory prediction
For newer vision models with a single gguf, include
the projection estimates.
* Adjust CLI to handle both styles of vision model metadata
* Wire up new tokenizers for new engine
If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp. This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.
This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.
* Lay foundation for auto selection of new engine
This reintroduces aggressive pruning on model deletion as a temporary
measure until a more controlled garbage collection (GC) mechanism is
implemented.
Issues with the current approach:
1. Users may accidentally delete a model (`ollama rm llama3.3` instead
of `ollama rm llama3.2`), requiring a full re-download unless another
model references the same blobs.
2. Users may assume a deleted model is still referenced elsewhere, but
due to prior updates or deletions, the references no longer exist,
leading to unnecessary re-downloads.
Soon, we should implement a structured GC mechanism to retain
unreferenced blobs for a configurable period before removal, which will
run on "ollama rm" and other commands we deem appropriate.
Users that want to immediately remove unreferenced blobs can use a new
prune command that will allow them to specify the age and class of blobs
to remove.
Example usage:
# Run basic blob GC
$ ollama prune
# Remove unreferenced blobs older than 7 days
$ ollama prune --age 7d
# Remove all blobs, referenced or not, older than 7 days (and their manifests?)
$ ollama prune --age 7d --all
# Remove all unreferenced blobs immediately
$ ollama prune --age 0 --all
# Remove all blobs
$ ollama prune --age 0 --all
This should provide a safer and more predictable cleanup process.
Previously, developers without the synctest experiment enabled would see
build failures when running tests in some server/internal/internal
packages using the synctest package. This change makes the transition to
use of the package less painful but guards the use of the synctest
package with build tags.
synctest is enabled in CI. If a new change will break a synctest
package, it will break in CI, even if it does not break locally.
The developer docs have been updated to help with any confusion about
why package tests pass locally but fail in CI.
Previously, using a Registry required a DiskCache to be passed in for
use in various methods. This was a bit cumbersome, as the DiskCache is
required for most operations, and the DefaultCache is used in most of
those cases. This change makes the DiskCache an optional field on the
Registry struct.
This also changes DefaultCache to initialize on first use. This is to
not burden clients with the cost of creating a new cache per use, or
having to hold onto a cache for the lifetime of the Registry.
Also, slip in some minor docs updates for Trace.
Reverts ccache installation to be done manually via curl instead of
using the dnf package manager as this has side effects of prepending
ccache's install directory to the front of the PATH
The extended name format is a superset of the name format that only the
client needs to know about, not the server or other dependents of the
name package, so move the split logic into the client package.
Also, take advantage of knowing about the extended name format to allow
the client to use the extended name format when unlinking to verify they
are unlinking the manifest with the content they intend.
The GGML flash attention kernel has specific requirements for
padding and permutation. This adds support to the KV cache
for conforming to these requirements so that flash attention
can be enabled.
Flash attention can be used in the same situations as the llama
engine and is enabled by the user in the same way.
Prior to performing attention, we need to permute query, key
and value. Currently we call Contiguous after each of these
permutations, which is correct but expensive. Avoiding the
3 calls to Contiguous increases performance by over 20%.
The permutations of query and key do not violate the continuity
rules for mulmat and the Contiguous call can be simply removed.
Value requires a different permutation and does require Contiguous.
However, we can use the copy into the cache as a way to perform this
without further overhead.
To support this and avoid unexpected tensor shapes that are seen by
models, we need tighter integration between attention, cache
and backend. Future optimization will also likely need this structure
- for example, flash attention has special padding requirements in
the cache and other backends may have their own needs.
This further contains the operations that go into attention so that
these and other optimizations can be handled transparently. Models
that have special requirements for attention can still implement
their own version of it.
This commit is a step towards a goal to make names less ceremonial
outside of the registry client. Clients of the registry package can
treat names as opaque strings, and the registry package will handle
parsing, validating, and normalizing names.
Ideally we end up with the names package tucked away in an internal
package for good. We'll see how things go.
Also, this package name is not permanent. This another step in the
on-going process of refactoring the server code, and at some point it
will most likely be renamed/moved.
More validation during the safetensor creation process.
Properly handle relative paths (like ./model.safetensors) while rejecting absolute paths
Add comprehensive test coverage for various paths
No functionality changes for valid inputs - existing workflows remain unaffected
Leverages Go 1.24's new os.Root functionality for secure containment
Focuses initial Blackwell support on compute capability 12.0
which includes the 50x series of GeForce cards. In the future
additional compute capabilities may be added
Also, require the -as flag to be set when importing a model. This
prevents the confusing error message "invalid name".
Also, allow short names to be used when importing a model and
auto-complete the name with the default mask.
Also, our commit messages have been getting better, but we can do
better, and be more consistent. This adds more clarity on how to write
commit messages and provides examples of good and bad messages.
Also, our contributing guide was lacking helpful guidance on how to
start change proposals. This commit adds the start of that section.
Soon, we should add a proposal template to the issue tracker with a link
back to the proposal section, which should also be expanded upon.
As are adding support for weighted sampling we have seen some performance
regressions, bypassing the sampler logic for now and defaulting to greedy
until we can benchmark the new sampler logic.