Add metadata and tensor information to the show command to be able to
see more information about a model. This outputs the same data as
shown on the model details page on ollama.com
This is useful for a few things:
- Work around bugs, such as having 2 images in one batch
- Keep the image in a single batch for fully connected attention
- Improve performance by not evaluating embeddings multiple times
This commit replaces the old pull implementation in the server package
with the new, faster, more robust pull implementation in the registry
package.
The new endpoint, and now the remove endpoint too, are behind the
feature gate "client2" enabled only by setting the OLLAMA_EXPERIMENT
environment variable include "client2".
Currently, the progress indication is wired to perform the same as the
previous implementation to avoid making changes to the CLI, and because
the status reports happen at the start of the download, and the end of
the write to disk, the progress indication is not as smooth as it could
be. This is a known issue and will be addressed in a future change.
This implementation may be ~0.5-1.0% slower in rare cases, depending on
network and disk speed, but is generally MUCH faster and more robust
than the its predecessor in all other cases.
* Include unified vision layers in memory prediction
For newer vision models with a single gguf, include
the projection estimates.
* Adjust CLI to handle both styles of vision model metadata
* Wire up new tokenizers for new engine
If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp. This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.
This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.
* Lay foundation for auto selection of new engine
This reintroduces aggressive pruning on model deletion as a temporary
measure until a more controlled garbage collection (GC) mechanism is
implemented.
Issues with the current approach:
1. Users may accidentally delete a model (`ollama rm llama3.3` instead
of `ollama rm llama3.2`), requiring a full re-download unless another
model references the same blobs.
2. Users may assume a deleted model is still referenced elsewhere, but
due to prior updates or deletions, the references no longer exist,
leading to unnecessary re-downloads.
Soon, we should implement a structured GC mechanism to retain
unreferenced blobs for a configurable period before removal, which will
run on "ollama rm" and other commands we deem appropriate.
Users that want to immediately remove unreferenced blobs can use a new
prune command that will allow them to specify the age and class of blobs
to remove.
Example usage:
# Run basic blob GC
$ ollama prune
# Remove unreferenced blobs older than 7 days
$ ollama prune --age 7d
# Remove all blobs, referenced or not, older than 7 days (and their manifests?)
$ ollama prune --age 7d --all
# Remove all unreferenced blobs immediately
$ ollama prune --age 0 --all
# Remove all blobs
$ ollama prune --age 0 --all
This should provide a safer and more predictable cleanup process.
Previously, developers without the synctest experiment enabled would see
build failures when running tests in some server/internal/internal
packages using the synctest package. This change makes the transition to
use of the package less painful but guards the use of the synctest
package with build tags.
synctest is enabled in CI. If a new change will break a synctest
package, it will break in CI, even if it does not break locally.
The developer docs have been updated to help with any confusion about
why package tests pass locally but fail in CI.
Previously, using a Registry required a DiskCache to be passed in for
use in various methods. This was a bit cumbersome, as the DiskCache is
required for most operations, and the DefaultCache is used in most of
those cases. This change makes the DiskCache an optional field on the
Registry struct.
This also changes DefaultCache to initialize on first use. This is to
not burden clients with the cost of creating a new cache per use, or
having to hold onto a cache for the lifetime of the Registry.
Also, slip in some minor docs updates for Trace.
The extended name format is a superset of the name format that only the
client needs to know about, not the server or other dependents of the
name package, so move the split logic into the client package.
Also, take advantage of knowing about the extended name format to allow
the client to use the extended name format when unlinking to verify they
are unlinking the manifest with the content they intend.
This commit is a step towards a goal to make names less ceremonial
outside of the registry client. Clients of the registry package can
treat names as opaque strings, and the registry package will handle
parsing, validating, and normalizing names.
Ideally we end up with the names package tucked away in an internal
package for good. We'll see how things go.
Also, this package name is not permanent. This another step in the
on-going process of refactoring the server code, and at some point it
will most likely be renamed/moved.
More validation during the safetensor creation process.
Properly handle relative paths (like ./model.safetensors) while rejecting absolute paths
Add comprehensive test coverage for various paths
No functionality changes for valid inputs - existing workflows remain unaffected
Leverages Go 1.24's new os.Root functionality for secure containment
Also, require the -as flag to be set when importing a model. This
prevents the confusing error message "invalid name".
Also, allow short names to be used when importing a model and
auto-complete the name with the default mask.
This fixes panics introduced in 2412adf42b8380748ac79476e273f5b337c3b977
when Gin ungracefully assumes that the http.ResponseWriter implements
http.CloseNotifier and http.Flusher, which our new statusCodeRecorder
does not. This is a temporary fix until we can pour the rest of the Gin
out.
This commit introduces a new API implementation for handling
interactions with the registry and the local model cache. The new API is
located in server/internal/registry. The package name is "registry" and
should be considered temporary; it is hidden and not bleeding outside of
the server package. As the commits roll in, we'll start consuming more
of the API and then let reverse osmosis take effect, at which point it
will surface closer to the root level packages as much as needed.
This commit copies (without history) the bmizerany/ollama-go repository
with the intention of integrating it into the ollama as a replacement
for the pushing, and pulling of models, and management of the cache they
are pushed and pulled from.
New homes for these packages will be determined as they are integrated
and we have a better understanding of proper package boundaries.
The route assembly in Handler lacked clear organization making it
difficult scan for routes and their relationships to each other. This
commit aims to fix that by reordering the assembly of routes to group
them by category and purpose.
Also, be more specific about what "config" refers to (it is about CORS
if you were wondering... I was.)
This provides integration with the new Ollama engine
(5824541 next ollama runner (#7913)) and the rest of the Ollama
infrastructure such as the runner and Ollama server.
In addition, it also builds out the KV cache infrastructure to
support requirements of how Ollama runs models such as:
- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal modals
Both old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:
Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve
Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1
feat: add new Ollama engine using ggml through cgo
This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.
- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations
This is the first implementation of the new engine. Follow up PRs will implement more features:
- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
In some cases, downloads slow due to disk i/o or other factors,
causing the download to restart a part. This causes the download
to "reverse" in percent completion. By increasing the timeout to 30s,
this should happen less frequently.
* add build to .dockerignore
* test: only build one arch
* add build to .gitignore
* fix ccache path
* filter amdgpu targets
* only filter if autodetecting
* Don't clobber gpu list for default runner
This ensures the GPU specific environment variables are set properly
* explicitly set CXX compiler for HIP
* Update build_windows.ps1
This isn't complete, but is close. Dependencies are missing, and it only builds the "default" preset.
* build: add ollama subdir
* add .git to .dockerignore
* docs: update development.md
* update build_darwin.sh
* remove unused scripts
* llm: add cwd and build/lib/ollama to library paths
* default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS
* add additional cmake output vars for msvc
* interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12
* remove unncessary filepath.Dir, cleanup
* add hardware-specific directory to path
* use absolute server path
* build: linux arm
* cmake install targets
* remove unused files
* ml: visit each library path once
* build: skip cpu variants on arm
* build: install cpu targets
* build: fix workflow
* shorter names
* fix rocblas install
* docs: clean up development.md
* consistent build dir removal in development.md
* silence -Wimplicit-function-declaration build warnings in ggml-cpu
* update readme
* update development readme
* llm: update library lookup logic now that there is one runner (#8587)
* tweak development.md
* update docs
* add windows cuda/rocm tests
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
* llama: wire up builtin runner
This adds a new entrypoint into the ollama CLI to run the cgo built runner.
On Mac arm64, this will have GPU support, but on all other platforms it will
be the lowest common denominator CPU build. After we fully transition
to the new Go runners more tech-debt can be removed and we can stop building
the "default" runner via make and rely on the builtin always.
* build: Make target improvements
Add a few new targets and help for building locally.
This also adjusts the runner lookup to favor local builds, then
runners relative to the executable, and finally payloads.
* Support customized CPU flags for runners
This implements a simplified custom CPU flags pattern for the runners.
When built without overrides, the runner name contains the vector flag
we check for (AVX) to ensure we don't try to run on unsupported systems
and crash. If the user builds a customized set, we omit the naming
scheme and don't check for compatibility. This avoids checking
requirements at runtime, so that logic has been removed as well. This
can be used to build GPU runners with no vector flags, or CPU/GPU
runners with additional flags (e.g. AVX512) enabled.
* Use relative paths
If the user checks out the repo in a path that contains spaces, make gets
really confused so use relative paths for everything in-repo to avoid breakage.
* Remove payloads from main binary
* install: clean up prior libraries
This removes support for v0.3.6 and older versions (before the tar bundle)
and ensures we clean up prior libraries before extracting the bundle(s).
Without this change, runners and dependent libraries could leak when we
update and lead to subtle runtime errors.
New lines can be an important part of a user's prompt and trimming
it can alter the results. We previously only trimmed prompts with
images but refactoring brought this behavior to all prompts, where
it became more noticable.
The /generate endpoint adds less whitespace and therefore doesn't
need to trim it out - this brings the same behavior to /chat.
Thanks to @gabe-l-hart for spotting the issue!
Fixes#7795
This changes makeRequest to update the http client Transport if and only
if testMakeRequestDialContext is set. This is to avoid overriding the
default Transport when testMakeRequestDialContext is nil, which broke
existing behavior, included proxies, timeouts, and other behaviors.
Fixes#7829Fixes#7788
In the past the ollama.com server would return a JWT that contained
information about the user being authenticated. This was used to return
different error messages to the user. This is no longer possible since the
token used to authenticate does not contain information about the user
anymore. Removing this code that no longer works.
Follow up changes will improve the error messages returned here, but good to
clean up first.
This change allows for mixed-case model names to be pushed, pulled,
copied, and created, which was previously disallowed because the Ollama
registry was backed by a Docker registry that enforced a naming
convention that disallowed mixed-case names, which is no longer the
case.
This does not break existing, intended, behaviors.
Also, make TestCase test a story of creating, updating, pulling, and
copying a model with case variations, ensuring the model's manifest is
updated correctly, and not duplicated across different files with
different case variations.
The Go runner does not have a problem with supporting parallel
requests for most multimodal models. Now that we won't be potentially
falling back to server.cpp, this restriction can be lifted.
However, the new mllama model can't support parallel requests, so we
will need to keep a restriction for that.