4064 Commits

Author SHA1 Message Date
Daniel Hiltgen
1fdb351c37
New engine: vision models and auto-fallback (#9113)
* Include unified vision layers in memory prediction

For newer vision models with a single gguf, include
the projection estimates.

* Adjust CLI to handle both styles of vision model metadata

* Wire up new tokenizers for new engine

If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp.  This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.

* Lay foundation for auto selection of new engine
2025-03-04 09:03:46 -08:00
Blake Mizerany
7a01ad7614
server/internal/registry: reintroduce pruning on model deletion (#9489)
This reintroduces aggressive pruning on model deletion as a temporary
measure until a more controlled garbage collection (GC) mechanism is
implemented.

Issues with the current approach:

1. Users may accidentally delete a model (`ollama rm llama3.3` instead
   of `ollama rm llama3.2`), requiring a full re-download unless another
   model references the same blobs.

2. Users may assume a deleted model is still referenced elsewhere, but
   due to prior updates or deletions, the references no longer exist,
   leading to unnecessary re-downloads.

Soon, we should implement a structured GC mechanism to retain
unreferenced blobs for a configurable period before removal, which will
run on "ollama rm" and other commands we deem appropriate.

Users that want to immediately remove unreferenced blobs can use a new
prune command that will allow them to specify the age and class of blobs
to remove.

Example usage:

    # Run basic blob GC
    $ ollama prune

    # Remove unreferenced blobs older than 7 days
    $ ollama prune --age 7d

    # Remove all blobs, referenced or not, older than 7 days (and their manifests?)
    $ ollama prune --age 7d --all

    # Remove all unreferenced blobs immediately
    $ ollama prune --age 0 --all

    # Remove all blobs
    $ ollama prune --age 0 --all

This should provide a safer and more predictable cleanup process.
v0.5.13
2025-03-03 19:11:16 -08:00
Blake Mizerany
55ab9f371a
server/.../backoff,syncs: don't break builds without synctest (#9484)
Previously, developers without the synctest experiment enabled would see
build failures when running tests in some server/internal/internal
packages using the synctest package. This change makes the transition to
use of the package less painful but guards the use of the synctest
package with build tags.

synctest is enabled in CI. If a new change will break a synctest
package, it will break in CI, even if it does not break locally.

The developer docs have been updated to help with any confusion about
why package tests pass locally but fail in CI.
2025-03-03 16:45:40 -08:00
KindBrave
fefbf8f74b
docs: add Ollama Android Chat community integration 2025-03-03 16:38:32 -08:00
Michael Yang
b428ddd796 docker: use go version from go.mod 2025-03-03 13:02:02 -08:00
Michael Yang
ba7d31240e fix: own lib/ollama directory
expand backend loading error handling to catch more problems and log
them instead of panicing
2025-03-03 13:01:18 -08:00
CYJiang
d25efe3954
cmd: add default err return for stop (#9458) 2025-03-03 12:13:41 -08:00
Mark
36dfb906bb
docs: don't use self-closing tag for anchor element (#9456) 2025-03-03 11:56:34 -08:00
aritra saha
a6f0f908b9
docs: update phi3-mini to phi4-mini (#9424)
* Update README.md

removed phi 3 mini and added phi4-mini

* Update README.md

---------

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-03-03 11:09:21 -08:00
İbrahim Çetin
3b1ddb2b3a
docs: add reins to community integrations (#9411) 2025-03-03 11:06:30 -08:00
Jeffrey Morgan
1579c4f06d
build: install binutils alongside gcc in Dockerfile (#9475) v0.5.13-rc6 2025-03-03 01:20:49 -08:00
Blake Mizerany
3519dd1c6e
server/internal/client/ollama: hold DiskCache on Registry (#9463)
Previously, using a Registry required a DiskCache to be passed in for
use in various methods. This was a bit cumbersome, as the DiskCache is
required for most operations, and the DefaultCache is used in most of
those cases. This change makes the DiskCache an optional field on the
Registry struct.

This also changes DefaultCache to initialize on first use. This is to
not burden clients with the cost of creating a new cache per use, or
having to hold onto a cache for the lifetime of the Registry.

Also, slip in some minor docs updates for Trace.
2025-03-02 20:55:44 -08:00
Jeffrey Morgan
e41c4cbea7
build: install ccache manually in Dockerfile (#9464)
Reverts ccache installation to be done manually via curl instead of
using the dnf package manager as this has side effects of prepending
ccache's install directory to the front of the PATH
v0.5.13-rc5
2025-03-02 16:48:31 -08:00
Blake Mizerany
ee048b76d4
server/internal/client/ollama: handle extended names in client/ollama (#9454)
The extended name format is a superset of the name format that only the
client needs to know about, not the server or other dependents of the
name package, so move the split logic into the client package.

Also, take advantage of knowing about the extended name format to allow
the client to use the extended name format when unlinking to verify they
are unlinking the manifest with the content they intend.
2025-03-02 13:30:41 -08:00
Soulter
af68d60a58
readme: add AstrBot to community integrations (#9442) 2025-03-01 21:58:34 -08:00
Jesse Gross
21aa666a1e ml: Enable support for flash attention
The GGML flash attention kernel has specific requirements for
padding and permutation. This adds support to the KV cache
for conforming to these requirements so that flash attention
can be enabled.

Flash attention can be used in the same situations as the llama
engine and is enabled by the user in the same way.
2025-03-01 20:53:23 -08:00
Jesse Gross
ee141cc821 ml: Empty tensor constructor for tensors
In cases where we allocate a tensor and then fully overwrite it with
copied data, it is wasteful to first zero out the memory.
2025-03-01 20:53:23 -08:00
Jesse Gross
55e5776c44 ggml-backend: Store parent backend as part of tensor
It can be important for a tensor to know what backend it came from -
for example, to know if flash attention is enabled.
2025-03-01 20:53:23 -08:00
Jesse Gross
854a9195f3 attention: Remove unnecessary contiguous operations
Prior to performing attention, we need to permute query, key
and value. Currently we call Contiguous after each of these
permutations, which is correct but expensive. Avoiding the
3 calls to Contiguous increases performance by over 20%.

The permutations of query and key do not violate the continuity
rules for mulmat and the Contiguous call can be simply removed.

Value requires a different permutation and does require Contiguous.
However, we can use the copy into the cache as a way to perform this
without further overhead.

To support this and avoid unexpected tensor shapes that are seen by
models, we need tighter integration between attention, cache
and backend. Future optimization will also likely need this structure
 - for example, flash attention has special padding requirements in
the cache and other backends may have their own needs.

This further contains the operations that go into attention so that
these and other optimizations can be handled transparently. Models
that have special requirements for attention can still implement
their own version of it.
2025-03-01 20:53:23 -08:00
Jeffrey Morgan
96a97adf9b
build: use correct GGML_HIP_NO_VMM compiler definition for ggml-hip (#9451) v0.5.13-rc4 2025-03-01 17:00:31 -08:00
Jeffrey Morgan
e75c6126e9
build: set GGML_CUDA_NO_VMM for ggml-hip target (#9449) v0.5.13-rc3 2025-03-01 14:02:19 -08:00
Blake Mizerany
cda6f5c66c
server/internal/internal/names: validate names (#9400)
This commit is a step towards a goal to make names less ceremonial
outside of the registry client. Clients of the registry package can
treat names as opaque strings, and the registry package will handle
parsing, validating, and normalizing names.

Ideally we end up with the names package tucked away in an internal
package for good. We'll see how things go.

Also, this package name is not permanent. This another step in the
on-going process of refactoring the server code, and at some point it
will most likely be renamed/moved.
2025-03-01 13:15:14 -08:00
Bruce MacDonald
bebb6823c0
server: validate local path on safetensor create (#9379)
More validation during the safetensor creation process.
Properly handle relative paths (like ./model.safetensors) while rejecting absolute paths
Add comprehensive test coverage for various paths
No functionality changes for valid inputs - existing workflows remain unaffected
Leverages Go 1.24's new os.Root functionality for secure containment
v0.5.13-rc2
2025-02-28 16:10:43 -08:00
Michael Yang
31e472baa4 runner: defer context cancel
defer the cancel to guarantee it runs
2025-02-28 22:27:28 +00:00
Michael Yang
657685e85d fix: replace deprecated functions 2025-02-28 21:29:34 +00:00
Jeffrey Morgan
a14912858e
build: add compute capability 12.0 to CUDA 12 preset (#9426)
Focuses initial Blackwell support on compute capability 12.0
which includes the 50x series of GeForce cards. In the future
additional compute capabilities may be added
2025-02-28 13:12:31 -08:00
Blake Mizerany
eed11ded30
server/.../safetensors: fix offsets and include all model parts (#9427)
Also, require the -as flag to be set when importing a model. This
prevents the confusing error message "invalid name".

Also, allow short names to be used when importing a model and
auto-complete the name with the default mask.
2025-02-28 13:08:10 -08:00
Michael Yang
b42aba40ed cuda: enable flash attention
ggml added an option to disable flash attention so explicitly enable it
2025-02-28 19:40:34 +00:00
王贺
25885e5335
docs: Add 1Panel to Community Integrations (#9312) 2025-02-28 09:53:03 -08:00
Jeffrey Morgan
98d44fa39d
llama: add phi4 mini support (#9403) v0.5.13-rc1 2025-02-27 19:30:32 -08:00
Blake Mizerany
2099e2d267
CONTRIBUTING: provide clarity on good commit messages, and bad (#9405)
Also, our commit messages have been getting better, but we can do
better, and be more consistent. This adds more clarity on how to write
commit messages and provides examples of good and bad messages.

Also, our contributing guide was lacking helpful guidance on how to
start change proposals. This commit adds the start of that section.

Soon, we should add a proposal template to the issue tracker with a link
back to the proposal section, which should also be expanded upon.
2025-02-27 19:22:26 -08:00
Bruce MacDonald
0c1041ad85
runner: default to greedy sampler for performance (#9407)
As are adding support for weighted sampling we have seen some performance
regressions, bypassing the sampler logic for now and defaulting to greedy
until we can benchmark the new sampler logic.
2025-02-27 16:41:20 -08:00
Parth Sareen
c245b0406f
sample: remove transforms from greedy sampling (#9377) 2025-02-27 15:44:53 -08:00
Michael Yang
8b194b7520 kvcache: update tests 2025-02-27 22:27:16 +00:00
Michael Yang
3e8b8a1933 ml: update Context.Forward interface
update Context.Forward to accept multiple tensors to match
Context.Compute signature

update Context.Forward to return Context such that it can be chained
with Context.Compute
2025-02-27 22:27:16 +00:00
Blake Mizerany
41dc280491
server/internal/registry: implement CloseNotify and Flush (for now) (#9402)
This fixes panics introduced in 2412adf42b8380748ac79476e273f5b337c3b977
when Gin ungracefully assumes that the http.ResponseWriter implements
http.CloseNotifier and http.Flusher, which our new statusCodeRecorder
does not. This is a temporary fix until we can pour the rest of the Gin
out.
2025-02-27 14:00:37 -08:00
Michael Yang
53d2990d9b model: add bos token if configured 2025-02-27 21:04:59 +00:00
Jesse Gross
e185c08ad9 go.mod: Use full version for go 1.24.0
Otherwise on Linux I get:
go: download go1.24 for linux/amd64: toolchain not available
2025-02-27 13:01:32 -08:00
Blake Mizerany
2412adf42b
server/internal: replace model delete API with new registry handler. (#9347)
This commit introduces a new API implementation for handling
interactions with the registry and the local model cache. The new API is
located in server/internal/registry. The package name is "registry" and
should be considered temporary; it is hidden and not bleeding outside of
the server package. As the commits roll in, we'll start consuming more
of the API and then let reverse osmosis take effect, at which point it
will surface closer to the root level packages as much as needed.
2025-02-27 12:04:53 -08:00
Steven Hartland
be2ac1ed93
docs: fix api examples link (#9360)
Fix the examples link in the go package documentation for the API.
2025-02-27 10:51:12 -08:00
Eries Trisnadi
dc13813a03
server: allow vscode-file origins (#9313) 2025-02-27 10:39:43 -08:00
Michael Yang
d6af13efed runner: simplify tensor split parsing 2025-02-27 18:36:46 +00:00
Michael Yang
a59f665235 ml/backend/ggml: fix debug logging 2025-02-27 18:30:57 +00:00
Daniel Hiltgen
688925aca9
Windows ARM build (#9120)
* Windows ARM build

Skip cmake, and note it's unused in the developer docs.

* Win: only check for ninja when we need it

On windows ARM, the cim lookup fails, but we don't need ninja anyway.
v0.5.13-rc0
2025-02-27 09:02:25 -08:00
Blake Mizerany
76e903cf9d
.github/workflows: swap order of go test and golangci-lint (#9389)
The linter is secondary to the tests, so it should run after the tests,
exposing test failures faster.
2025-02-26 23:03:48 -08:00
Jeffrey Morgan
a5272130c4
ml/backend/ggml: follow on fixes after updating vendored code (#9388)
Fixes sync filters and lowers CUDA version to 11.3 in test.yaml
2025-02-26 22:33:53 -08:00
Jeffrey Morgan
d7d7e99662
llama: update llama.cpp vendor code to commit d7cfe1ff (#9356) 2025-02-26 20:34:44 -08:00
Gordon Kamer
2db96c18e7
readme: add Nichey to community integrations (#9370) 2025-02-26 10:40:53 -08:00
Daniel Hiltgen
e12af460ed
Add cuda Blackwell architecture for v12 (#9350)
* Add cuda Blackwell architecture for v12

* Win: Split rocm out to separate zip file

* Reduce CC matrix

The 6.2 and 7.2 architectures only appear on Jetsons, so they were wasting space.
The 5.0 should be forward compatible with 5.2 and 5.3.
2025-02-26 09:20:52 -08:00
Jeffrey Morgan
3ad4bc8afe
llama: removed unused 'vendoring' file (#9351) 2025-02-25 14:33:03 -08:00