148 Commits

Author SHA1 Message Date
Michael Yang
9e4642e9b3 ollama debug tensor 2025-03-11 14:49:19 -07:00
Jeffrey Morgan
e093db92c4
sample: temporarily use grammars for constrained generation in new engine (#9586) 2025-03-10 16:17:39 +01:00
Jeffrey Morgan
4289c74359
llama: fix kv loading on snowflake-arctic-embed models (#9536) 2025-03-07 09:25:34 -08:00
Michael Yang
05a01fdecb ml/backend/ggml: consolidate system info logging
- output backend system info when initializing the backend. this ensures
  the information is always present without needing to be requested
  explicitly
- convert to structured logging
- enumerate devices rather than backends since devices are ordered
- track device indices grouped by device name
2025-03-04 15:14:31 -08:00
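A minimal Go sketch of the pattern this commit describes: emit system info as structured log fields while enumerating devices in order. The device type and its fields are illustrative assumptions, not the actual ggml backend API:

package main

import (
	"log/slog"
	"os"
)

// device is a stand-in for a ggml compute device.
type device struct {
	Name  string // e.g. "CUDA"
	Index int    // index grouped by device name, so CUDA0, CUDA1, ...
}

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stderr, nil))

	// Hypothetical device list; the backend would discover these during
	// initialization so the info is always logged without an explicit call.
	devices := []device{{"CUDA", 0}, {"CUDA", 1}, {"CPU", 0}}

	for _, d := range devices {
		logger.Info("system", "device", d.Name, "index", d.Index)
	}
}
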
Michael Yang
ba7d31240e fix: own lib/ollama directory
expand backend loading error handling to catch more problems and log
them instead of panicking
2025-03-03 13:01:18 -08:00
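A sketch of the hardening idea in Go: a panic during backend loading is recovered and logged so one bad library cannot crash the process. loadBackend is a hypothetical stand-in, not the actual loader:

package main

import (
	"errors"
	"log/slog"
)

// loadBackend stands in for a loader that may panic on a malformed
// or incompatible library.
func loadBackend(path string) error {
	if path == "" {
		panic("empty backend path")
	}
	return nil
}

// safeLoad converts a panic during loading into a logged error.
func safeLoad(path string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			slog.Error("skipping backend", "path", path, "reason", r)
			err = errors.New("backend load failed")
		}
	}()
	return loadBackend(path)
}

func main() {
	_ = safeLoad("") // logs the failure and continues instead of crashing
}
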
Michael Yang
657685e85d fix: replace deprecated functions 2025-02-28 21:29:34 +00:00
Jeffrey Morgan
98d44fa39d
llama: add phi4 mini support (#9403) 2025-02-27 19:30:32 -08:00
Michael Yang
a59f665235 ml/backend/ggml: fix debug logging 2025-02-27 18:30:57 +00:00
Jeffrey Morgan
d7d7e99662
llama: update llama.cpp vendor code to commit d7cfe1ff (#9356) 2025-02-26 20:34:44 -08:00
Jeffrey Morgan
3ad4bc8afe
llama: removed unused 'vendoring' file (#9351) 2025-02-25 14:33:03 -08:00
Jeffrey Morgan
8c13cfa4dd
ml/backend/ggml: fix crash on windows paths with wide characters (#9305) 2025-02-23 19:13:53 -08:00
Michael Yang
bda4ef6c56 reorder patches 2025-02-20 03:49:24 +00:00
Michael Yang
1e438b237c
Merge pull request #9203 from ollama/mxyng/sapphirerapids
build: remove backend build for sapphirerapids
2025-02-19 21:42:00 +00:00
Jeffrey Morgan
d2eb226c91
llama: add patch to fix ggml backend reg on Linux with utf-8 characters in the path (#9159) 2025-02-18 22:46:17 -05:00
Michael Yang
5f8c03189e build: remove backend build for sapphirerapids
sapphire rapids has amx support but it ends up having a negative
performance impact.

emerald rapids also has amx support with a positive performance impact;
however, there's no reasonable way in ggml to differentiate between the
two. the impact is small (~6%), so disable amx entirely for simplicity
2025-02-18 14:47:58 -08:00
Jeffrey Morgan
6600bd7d91
ml/backend/ggml: stable sort devices by score (#9081) 2025-02-13 18:42:36 -08:00
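A short Go illustration of why the sort must be stable: devices with equal scores keep their enumeration order, so results do not reshuffle between runs. The score field is assumed for illustration:

package main

import (
	"fmt"
	"sort"
)

type device struct {
	Name  string
	Score int // higher is preferred
}

func main() {
	devices := []device{{"CPU", 0}, {"CUDA0", 10}, {"CUDA1", 10}}

	// Stable sort: CUDA0 stays ahead of CUDA1 despite the tie.
	sort.SliceStable(devices, func(i, j int) bool {
		return devices[i].Score > devices[j].Score
	})

	fmt.Println(devices) // [{CUDA0 10} {CUDA1 10} {CP 0}] order preserved within ties
}
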
Jesse Gross
ed443a0393 Runner for Ollama engine
This provides integration with the new Ollama engine
(5824541 next ollama runner (#7913)) and the rest of the Ollama
infrastructure such as the runner and Ollama server.

In addition, it also builds out the KV cache infrastructure to
support requirements of how Ollama runs models such as:
 - Parallel processing
 - Memory management for defragmentation and shifting
 - Multi-modal models

Both old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:

Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve

Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1
2025-02-13 17:09:26 -08:00
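A condensed, hypothetical sketch of the kind of KV cache surface those requirements imply; the real cache package differs in names and detail:

package kvcache

// Cache is a hypothetical distillation of the operations listed above.
type Cache interface {
	// StartForward prepares cache state for a batch that may contain
	// several parallel sequences.
	StartForward(seqIDs []int, positions []int32) error

	// Remove drops a position range from a sequence; this is the
	// primitive behind context shifting.
	Remove(seqID int, beginPos, endPos int32) error

	// Defrag compacts the cache when free slots become scattered.
	Defrag() error
}
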
Michael Yang
49df03da9a
fix: harden backend loading (#9024)
* wrap ggml_backend_load_best in try/catch
* ignore non-ollama paths
2025-02-11 15:36:53 -08:00
Jeffrey Morgan
f4711da7bd
ml/backend/ggml: fix crash on dlopen for non-AVX systems (#8976) 2025-02-10 09:52:12 -08:00
Azis Alvriyanto
b901a712c6
docs: improve syntax highlighting in code blocks (#8854) 2025-02-07 09:55:07 -08:00
Diego Pereira
928911bc68
runner: avoid buffer overwrite when generating multiple embeddings (#8714)
Shield the code that processes an embedding result
from subsequent calls that may overwrite the same
buffer while processing a second input when retrieving
model embeddings.
2025-02-05 16:53:33 -08:00
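A small Go demonstration of the bug class being shielded against, with getEmbedding standing in (as an assumption) for a call that returns a slice backed by a reused internal buffer:

package main

import "fmt"

var shared = make([]float32, 3) // reused internal buffer

// getEmbedding fills and returns the shared buffer, standing in for
// a runner call whose result is overwritten by the next call.
func getEmbedding(fill float32) []float32 {
	for i := range shared {
		shared[i] = fill
	}
	return shared
}

func main() {
	first := getEmbedding(1)

	// Copy before requesting the next embedding, or the result is lost.
	safe := make([]float32, len(first))
	copy(safe, first)

	second := getEmbedding(2)
	fmt.Println(first[0], safe[0], second[0]) // 2 1 2: only the copy survives
}
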
Michael Yang
5b446cc815
chore: update gitattributes (#8860)
* chore: update gitattributes
* chore: add build info source
2025-02-05 16:37:18 -08:00
Jeffrey Morgan
cd3fbf1c49
llama: use dynamic backend loading for mllama and clip (#8835) 2025-02-05 09:46:56 -08:00
Michael Yang
548a9f56a6 Revert "cgo: use O3"
This reverts commit bea1f1fac6b6b51bb3b8a666789c518b7aaa8b94.
2025-01-31 10:25:39 -08:00
Michael Yang
bea1f1fac6 cgo: use O3 2025-01-30 12:21:50 -08:00
Michael Yang
dcfb7a105c
next build (#8539)
* add build to .dockerignore

* test: only build one arch

* add build to .gitignore

* fix ccache path

* filter amdgpu targets

* only filter if autodetecting

* Don't clobber gpu list for default runner

This ensures the GPU specific environment variables are set properly

* explicitly set CXX compiler for HIP

* Update build_windows.ps1

This isn't complete, but is close.  Dependencies are missing, and it only builds the "default" preset.

* build: add ollama subdir

* add .git to .dockerignore

* docs: update development.md

* update build_darwin.sh

* remove unused scripts

* llm: add cwd and build/lib/ollama to library paths

* default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS

* add additional cmake output vars for msvc

* interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12

* remove unnecessary filepath.Dir, cleanup

* add hardware-specific directory to path

* use absolute server path

* build: linux arm

* cmake install targets

* remove unused files

* ml: visit each library path once

* build: skip cpu variants on arm

* build: install cpu targets

* build: fix workflow

* shorter names

* fix rocblas install

* docs: clean up development.md

* consistent build dir removal in development.md

* silence -Wimplicit-function-declaration build warnings in ggml-cpu

* update readme

* update development readme

* llm: update library lookup logic now that there is one runner (#8587)

* tweak development.md

* update docs

* add windows cuda/rocm tests

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2025-01-29 15:03:38 -08:00
Jeffrey Morgan
61676fb506
llama: move grammar tests to llama_test.go (#8411) 2025-01-14 12:55:45 -08:00
Jeffrey Morgan
1deafd8254
llama: update vendored code to commit 46e3556 (#8308) 2025-01-08 11:22:01 -08:00
Ubaldo Porcheddu
3919f4ba3d
llama: fix runner api example url in README.md (#8307) 2025-01-04 15:45:16 -08:00
Parth Sareen
290cf2040a
llama: test key order preservation in schema_to_grammar (#8078)
This change adds a test to catch a regression in schema_to_grammar where
the order of keys in the JSON schema is not preserved in the generated
grammar, which is critical for step-by-step reasoning.
2024-12-18 19:44:50 -08:00
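Not the actual test from this change, but a self-contained Go sketch of how key order can be observed: the streaming decoder sees property names in their original order, whereas a plain map would lose it. It assumes property values are objects, as they are in JSON schemas:

package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// propertyOrder returns the "properties" keys in their original order.
func propertyOrder(schema []byte) ([]string, error) {
	var doc struct {
		Properties json.RawMessage `json:"properties"`
	}
	if err := json.Unmarshal(schema, &doc); err != nil {
		return nil, err
	}

	dec := json.NewDecoder(strings.NewReader(string(doc.Properties)))
	var keys []string
	depth := 0
	for {
		tok, err := dec.Token()
		if err != nil {
			break // io.EOF on well-formed input
		}
		switch t := tok.(type) {
		case json.Delim:
			if t == '{' || t == '[' {
				depth++
			} else {
				depth--
			}
		case string:
			if depth == 1 { // property names live at depth 1
				keys = append(keys, t)
			}
		}
	}
	return keys, nil
}

func main() {
	schema := []byte(`{"type":"object","properties":{"steps":{"type":"array"},"answer":{"type":"string"}}}`)
	keys, _ := propertyOrder(schema)
	fmt.Println(keys) // [steps answer], matching the schema text, not sorted order
}
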
Jesse Gross
08a832b482 llama: Ensure KV cache is fully defragmented.
Sometimes the KV cache requires defragmentation even without
triggering the threshold heuristic. In this case, decoding
will not be able to find a KV cache slot. This is particularly
difficult for the caller to handle if it happens in between
ubatches. To avoid this, we should immediately trigger a defrag.

In addition, a heavily fragmented cache can require more than
max_moves to defragment. Currently, we stop when we hit the limit
but this can leave a cache that still does not have adequate space
even after defragmentation is triggered. Instead, we should do
multiple batches of processing until everything is complete.

Fixes #7949
2024-12-17 14:01:19 -08:00
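In Go-flavored pseudocode, the control-flow change amounts to looping bounded defrag passes until the cache is compact rather than stopping at the move limit; maxMoves and the pass function are illustrative:

package main

import "fmt"

const maxMoves = 2 // per-pass move budget, as in the limit described above

// defragPass compacts up to maxMoves fragmented slots and returns how
// many remain.
func defragPass(fragmented int) int {
	moved := min(fragmented, maxMoves)
	return fragmented - moved
}

func main() {
	remaining := 5
	for pass := 1; remaining > 0; pass++ {
		remaining = defragPass(remaining)
		fmt.Printf("pass %d: %d fragmented slots left\n", pass, remaining)
	}
	// Stopping after a single pass (the old behavior) would have left
	// 3 slots fragmented and the pending decode still without space.
}
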
Jeffrey Morgan
7a81daf026
llama: update vendor code to commit ba1cb19c (#8101) 2024-12-14 14:55:51 -08:00
Daniel Hiltgen
60f75560a2
runner: switch logging back to stderr (#8091)
This puts the low-level runner logging back on stderr for consistency with prior releases.
2024-12-13 14:36:50 -08:00
Pascal Patry
c216850523
llama: parse JSON schema using nlohmann::ordered_json to maintain ordering (#8071) 2024-12-12 09:57:28 -08:00
Parth Sareen
18f6a98bd6
llama: enable JSON schema key ordering for generating grammars (#8055) 2024-12-11 17:17:36 -08:00
Blake Mizerany
9039c821a2
llama: preserve field order in user-defined JSON schemas (#8002)
Previously we decoded and re-encoded JSON schemas during validation,
which served no purpose since json.RawMessage already validates JSON
syntax. Worse, the re-encoding lost field ordering from the original
schema, which affects inference quality during step-by-step reasoning.

While fixing this ordering issue by using json.RawMessage directly,
testing revealed that schema_to_grammar (from llama.cpp) also fails to
preserve field order during grammar generation. This appears to be the
root cause of inference degradation.

This change prevents us from mangling the user's original schema order,
but we still need to address the ordering issue in schema_to_grammar.
That will be a separate change.

Updates #7978
2024-12-11 14:07:30 -08:00
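A runnable Go illustration of the difference: round-tripping through a map sorts the keys, while json.RawMessage carries the user's schema through byte-for-byte:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	raw := json.RawMessage(`{"steps":[],"answer":""}`)

	// Decode-and-re-encode mangles the order: maps sort their keys.
	var m map[string]any
	_ = json.Unmarshal(raw, &m)
	reencoded, _ := json.Marshal(m)
	fmt.Println(string(reencoded)) // {"answer":"","steps":[]}

	// RawMessage marshals its bytes verbatim, so the user's field
	// order survives.
	passthrough, _ := json.Marshal(raw)
	fmt.Println(string(passthrough)) // {"steps":[],"answer":""}
}
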
Jeffrey Morgan
527cc97899
llama: update vendored code to commit 40c6d79f (#7875) 2024-12-10 19:21:34 -08:00
Daniel Hiltgen
b9ccb3741e
Remove unused runner CpuFeatures (#8032)
The final implementation of #7499 removed dynamic vector requirements
in favor of a simpler filename-based model, and this was leftover logic that
is no longer needed.
2024-12-10 12:59:39 -08:00
Daniel Hiltgen
4879a234c4
build: Make target improvements (#7499)
* llama: wire up builtin runner

This adds a new entrypoint into the ollama CLI to run the cgo built runner.
On Mac arm64, this will have GPU support, but on all other platforms it will
be the lowest common denominator CPU build. After we fully transition
to the new Go runners, more tech-debt can be removed and we can stop building
the "default" runner via make and rely on the builtin always.

* build: Make target improvements

Add a few new targets and help for building locally.
This also adjusts the runner lookup to favor local builds, then
runners relative to the executable, and finally payloads.

* Support customized CPU flags for runners

This implements a simplified custom CPU flags pattern for the runners.
When built without overrides, the runner name contains the vector flag
we check for (AVX) to ensure we don't try to run on unsupported systems
and crash.  If the user builds a customized set, we omit the naming
scheme and don't check for compatibility.  This avoids checking
requirements at runtime, so that logic has been removed as well.  This
can be used to build GPU runners with no vector flags, or CPU/GPU
runners with additional flags (e.g. AVX512) enabled.

* Use relative paths

If the user checks out the repo in a path that contains spaces, make gets
really confused, so use relative paths for everything in-repo to avoid breakage.

* Remove payloads from main binary

* install: clean up prior libraries

This removes support for v0.3.6 and older versions (before the tar bundle)
and ensures we clean up prior libraries before extracting the bundle(s).
Without this change, runners and dependent libraries could leak when we
update and lead to subtle runtime errors.
2024-12-10 09:47:19 -08:00
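A sketch of the lookup precedence described above (local builds first, then runners relative to the executable, then payloads), with illustrative directory names; the payload location in particular is an assumption:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// candidateDirs returns runner directories in priority order.
func candidateDirs() []string {
	exe, err := os.Executable()
	if err != nil {
		exe = "."
	}
	return []string{
		filepath.Join("build", "lib", "ollama"),           // local build tree
		filepath.Join(filepath.Dir(exe), "lib", "ollama"), // next to the binary
		filepath.Join(os.TempDir(), "ollama"),             // extracted payloads (assumed)
	}
}

func main() {
	for _, dir := range candidateDirs() {
		if _, err := os.Stat(dir); err == nil {
			fmt.Println("using runners from", dir)
			return
		}
	}
	fmt.Println("no runner directory found")
}
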
Parth Sareen
630e7dc6ff
api: structured outputs - chat endpoint (#7900)
Adds structured outputs to chat endpoint
---------

Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Hieu Nguyen <hieunguyen1053@outlook.com>
2024-12-04 16:31:19 -08:00
Sam
1bdab9fdb1
llm: introduce k/v context quantization (vRAM improvements) (#6279) 2024-12-03 15:57:19 -08:00
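In the same style as the other server invocations in this log, and assuming the OLLAMA_KV_CACHE_TYPE variable this change introduces (f16 by default, with q8_0 and q4_0 for quantized caches), usage looks like:

OLLAMA_KV_CACHE_TYPE=q8_0 ./ollama serve
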
Jeffrey Morgan
39e29ae5dd
llama: fix typo and formatting in readme (#7876) 2024-11-28 17:27:11 -08:00
ItzCrazyKns
e3936d4fb3
Support Multiple LoRA Adapters (#7667)
Closes #7627
2024-11-27 11:00:04 -08:00
Jesse Gross
71e6a0d0d1 runner.go: Don't try to extract image tags for text models
When processing a prompt, we look for image tags of the form
[img-0], which are inserted by the Ollama server process.
However, this can cause errors if the original prompt has these
tags: typically an "image not found" error is returned.

This changes tag searching behavior to be similar to the 0.3.x
series, which will largely avoid these problems. However, they can
still happen when input text with these tags is used with image
models. The correct solution is to escape the tags, but this is a
larger issue with special sequences in general, so this is an
incremental fix that should avoid the problem for the majority
of cases.
2024-11-26 13:23:24 -08:00
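A minimal Go sketch of the behavioral change: tag extraction runs only for models that accept images, so text-only prompts containing [img-N] pass through untouched. The regular expression is an assumption for illustration:

package main

import (
	"fmt"
	"regexp"
)

// imgTag matches server-inserted markers like [img-0].
var imgTag = regexp.MustCompile(`\[img-(\d+)\]`)

// extractImageTags resolves tags only for multimodal models.
func extractImageTags(prompt string, multimodal bool) []string {
	if !multimodal {
		return nil
	}
	return imgTag.FindAllString(prompt, -1)
}

func main() {
	prompt := "compare [img-0] with [img-1]"
	fmt.Println(extractImageTags(prompt, true))  // [[img-0] [img-1]]
	fmt.Println(extractImageTags(prompt, false)) // []
}
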
Jesse Gross
2cd11ae365 runner.go: Add unit tests for context shifting
This also makes it easier to truncate long inputs the same way as
shifting but does not actually implement it. This type of
truncation has a trade-off between quality and time to first
token.
2024-11-26 11:21:35 -08:00
Jesse Gross
3478b2cf14 runner.go: Fix deadlock with many concurrent requests
If there are no available slots for new sequences, then a request
will not be added to the processing queue but will continue on
to wait for a response that never comes. Besides never giving a
response to the request, this prevents the model from being
unloaded due to the outstanding request.

To prevent this, there are semaphores that prevent more requests
from being processed than there are slots - one in the Ollama
server and one in the runner.
 - The Ollama server one works but it is not designed to protect
   the runner's internal data structures and the runner can return a
   final response before clearing its data structures.
 - The internal runner semaphore has similar behavior where it
   can release the semaphore when it issues a response. This is
   wrong: it should only release the semaphore after it has
   cleared the data structure.

In addition, we should return an error if a slot is not found
rather than deadlocking in the event we ever get to this spot.

Fixes #7779
2024-11-22 16:14:51 -08:00
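A small Go sketch of the corrected ordering using golang.org/x/sync/semaphore: the permit is released only after the slot's data structures are cleared, and acquisition failures surface as errors rather than silent waits. Slot counts and names are illustrative:

package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/semaphore"
)

var slots = semaphore.NewWeighted(2) // one permit per sequence slot

func handleRequest(ctx context.Context, id int) error {
	if err := slots.Acquire(ctx, 1); err != nil {
		return err // report instead of waiting for a response that never comes
	}
	// Released via defer, which runs only after the cleanup below completes.
	defer slots.Release(1)

	fmt.Println("processing sequence", id)
	// ... final response is issued here ...
	fmt.Println("clearing data structures for sequence", id)
	return nil
}

func main() {
	for i := 0; i < 3; i++ {
		_ = handleRequest(context.Background(), i)
	}
}
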
Daniel Hiltgen
b85520bfb9
logs: explain client aborts better (#7783)
Users get confused by "Failed to acquire semaphore" error="context canceled"
messages in the logs, which are actually clients giving up.  While there could be
a legitimate hang bug in the system, sometimes this is just short client timeouts
with an overloaded system, so this should help users understand what's going on
better.
2024-11-22 08:05:32 -08:00
boessu
1a742f54c9
readme: update AMD ROCm links (#7213) 2024-11-20 23:48:55 -08:00
Jesse Gross
c4b34f2a2a runner.go: Truncate inputs that exceed context rather than shifting
Previous versions of the runner would truncate inputs to the context
window before beginning processing. The main processing loop relied
on this behavior if the context needed to be shifted later (due to
token generation). If truncation did not occur, invariants
would be broken, causing crashes or infinite loops.

Later versions attempted to fix these bugs and make the logic less
subtle so that all inputs could be handled. Truncation was removed
to make things consistent.

However, truncation is much faster than processing and shifting, so
removing it caused performance problems when the input vastly exceeded
the context size. This restores the input truncation as a performance
optimization while keeping the more robust processing logic.

Fixes #7762
2024-11-20 12:49:24 -08:00
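The optimization reduces to trimming the token sequence before any decoding happens, as in this Go sketch; the numKeep protected prefix is an assumption borrowed from how shifting preserves the start of the prompt:

package main

import "fmt"

// truncate trims tokens to the context window before processing,
// keeping a protected prefix plus the most recent tokens.
func truncate(tokens []int, numCtx, numKeep int) []int {
	if len(tokens) <= numCtx {
		return tokens
	}
	out := make([]int, 0, numCtx)
	out = append(out, tokens[:numKeep]...)
	out = append(out, tokens[len(tokens)-(numCtx-numKeep):]...)
	return out
}

func main() {
	tokens := []int{1, 2, 3, 4, 5, 6, 7, 8}
	fmt.Println(truncate(tokens, 5, 2)) // [1 2 6 7 8]
}
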
Jesse Gross
c3ff916431 runner.go: Don't add inputs to cache view until actually processed
We need to track which tokens are in the cache ourselves. We currently
add tokens to the cache tracker when we add them to the batch but they are
not actually in the cache until we call Decode. This can cause
confusion when we are shifting the cache.

Avoids "could not find a KV slot for the batch" issues.

Bug #7545
2024-11-20 12:49:24 -08:00
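The bookkeeping fix, reduced to a Go sketch with assumed names: batched tokens sit in a pending list and join the cache view only once Decode has actually processed them:

package main

import "fmt"

type cacheView struct {
	inCache []int // tokens actually present in the KV cache
	pending []int // tokens batched but not yet decoded
}

func (c *cacheView) addToBatch(tok int) {
	c.pending = append(c.pending, tok)
}

func (c *cacheView) decode() {
	// ... llama decode runs here; commit to the view only on success ...
	c.inCache = append(c.inCache, c.pending...)
	c.pending = c.pending[:0]
}

func main() {
	var c cacheView
	c.addToBatch(101)
	c.addToBatch(102)
	fmt.Println(len(c.inCache), len(c.pending)) // 0 2: not yet in the cache view
	c.decode()
	fmt.Println(len(c.inCache), len(c.pending)) // 2 0
}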