4138 Commits

Author SHA1 Message Date
Blake Mizerany
2c8f95de19 cmd: default to client2 and simplify pull progress display
Switch the pull and remove operations to use the client2 registry by default,
removing the need to pass an experimental flag.

Also simplify the progress UI during model pulls. Previously, each layer
displayed its own progress bar, resulting in noisy and repetitive output:

	pulling manifest
	pulling aeda25e63ebd... 100% ▕██████████████████████████ 3.3 GB
	pulling e0a42594d802... 100% ▕██████████████████████████  358 B
	...
	writing manifest
	success

This change replaces that with a single progress bar for the entire pull,
followed by a single "Done." message:

	Downloading gemma3: 100% ▕█████████████████████████████▏ 3.3 GB

This provides a cleaner and more intuitive experience, and aligns better
with how users think about pulling models as a unit, rather than a collection
of layers.

To support older clients that still rely on a fixed-width Digest field,
we format the Digest to be at least 20 characters long. The value includes
padding and a truncated model name to prevent out-of-bounds access in
legacy clients. This is a temporary compatibility hack and can be removed
once all clients have adopted the new API.

Updates server behavior to handle all combinations of new and old
clients.
2025-04-08 16:50:55 -07:00
Alex Rozgo
2f723ac2d6
types: allow tool function parameters with a single type or an array of types (#9434) 2025-04-07 14:27:01 -07:00
Devon Rifkin
249fbbe52f
Merge pull request #10169 from ollama/drifkin/fix-contributing-formatting
CONTRIBUTING: fix code block formatting
2025-04-07 14:02:35 -07:00
Devon Rifkin
c38680b8a1
CONTRIBUTING: fix code block formatting
There were only 3 spaces instead of 4, so the example was being considered to include html elements
2025-04-07 13:53:33 -07:00
Michael Yang
16fca86c4a digest files in parallel 2025-04-07 09:46:31 -07:00
Daniel Hipke
0f3f9e353d
ml/backend/ggml: create a new file descriptor for tensor (#10133)
improves model loading times on network-based filesystems
such as GCS fuse by creating a dedicated file descriptor for each
section of the file being read, reducing seeking
v0.6.5-rc1 v0.6.5
2025-04-04 17:04:24 -07:00
Bruce MacDonald
6bd0a983cd model: support for mistral-small in the ollama runner
Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.
v0.6.5-rc0
2025-04-03 16:57:36 -07:00
Michael Yang
1861fbdeb5
Merge pull request #9873 from ollama/mxyng/fs-config
fs: move ml.Config to fs package
2025-04-03 14:05:21 -07:00
Michael Yang
3b96a93672 fs: move ml.Config to fs package 2025-04-03 13:12:24 -07:00
Bruce MacDonald
e53b3cbd0c
llm: set done reason at server level (#9830)
No functional change. Many different done reasons can be set at the runner
level, so rather than obsuring them we should return them to the server
process and let it choose what to do with the done reason. This separates
the API concerns from the runner.
2025-04-03 10:19:24 -07:00
Jeffrey Morgan
b51e0f397c
model: fix issues with spm tokenizer for Gemma 3 (#10081) v0.6.4-rc0 v0.6.4 2025-04-02 13:22:56 -07:00
jmorganca
b42970063d kvcache: Add check for values that fall out of sliding window cache
The sliding window cache trims entries that are outside the window for
the latest token. This works when we are extending the cache, such as
when the conversation continues. However, if we have a partial overlap
in conversation (including the BOS tokens), then we resume from a past
point in the conversation and the needed tokens are no longer stored
in memory. This verifies that the new window overlaps with the old one
before reusing the cache.

Co-authored-by: Jesse Gross <jesse@ollama.com>
2025-04-02 11:55:48 -07:00
Jesse Gross
493385eb3e ollamarunner: Don't truncate a SameBatch
When truncating inputs to the the context window at the beginning of
a sequence, we remove the minimum amount possible. However, this
may cause us to truncate to the middle of a set of inputs that
the model specified should not be split up. To avoid this, we
need to remove the rest of the partial batch.
2025-04-02 10:40:38 -07:00
Bruce MacDonald
9876c9faa4
chore(all): replace instances of interface with any (#10067)
Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.
2025-04-02 09:44:27 -07:00
IsAurora6
4e415029b3
readme: add Casibase to community integrations (#10057) 2025-04-02 01:27:16 -07:00
Bruce MacDonald
e172f095ba
api: return model capabilities from the show endpoint (#10066)
With support for multimodal models becoming more varied and common it is important for clients to be able to easily see what capabilities a model has. Retuning these from the show endpoint will allow clients to easily see what a model can do.
2025-04-01 15:21:46 -07:00
Ilian
c001b98087
docs: add TagSpaces to community integrations (#9983) 2025-03-31 17:28:59 -07:00
Abyss-c0re
23fc8e92eb
docs: add DeepShell to community projects (#9955)
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-03-31 17:23:04 -07:00
湛露先生
4059a297a6
discover: /proc/cpuinfo file open and close. (#9950)
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
2025-03-31 17:07:42 -07:00
Bruce MacDonald
66b2539238
runner: clear cache when shift is not possible (#9433)
Clear KV cache when shift operation is not supported by model.
Added KvCacheCanShift() check to handle models that can't perform cache shifts,
falling back to full cache clear while preserving logical token history to
maintain expected behavior when context window fills up.
2025-03-31 12:54:45 -07:00
Blake Mizerany
ef27d52e79
server/internal/client/ollama: cache completed chunks (#9933)
This change adds tracking of download chunks during the pull process so
that subsequent pulls can skip downloading already completed chunks.
This works across restarts of ollama.

Currently, download state will be lost if a prune is triggered during a
pull (e.g. restart or remove). This issue should be addressed in a
follow-up PR.
2025-03-30 23:54:54 -07:00
Jesse Gross
b2a465296d runner: Release semaphore and improve error messages on failures
If we have an error after creating a new sequence but before
finding a slot for it, we return without releasing the semaphore.
This reduces our parallel sequences and eventually leads to deadlock.

In practice this should never happen because once we have acquired
the semaphore, we should always be able to find a slot. However, the
code is clearly not correct.
2025-03-30 19:21:54 -07:00
Jesse Gross
5d097277ef ollamarunner: Ensure batch size limits are not exceeded
With the llama runner, we can generate up to NUM_PARALLEL batches
at once, which will then get broken up to into individual batches
to get executed by llama.cpp (i.e. we add up to 2048 tokens and
this gets split into 4 batches of 512 tokens at default settings).

This splitting can improve parallelism on multi-GPU systems because
the individual batches can move though the pipeline without blocking
on the first one to fully complete. However, we don't yet support
this in the Ollama runner, partially because it makes it hard to
enforce model-specified batch constraints, which didn't exist
previously.

The result is that we will try to execute the full, unsplit batch.
This could result in out of memory or insufficient KV cache space
errors.

This triggers batch breaking when the total inputs from all sequences
exceeds the batch size, rather than per-sequence. In order to ensure
fairness, it also reintroduces round-robinning around sequences so
that we don't let one busy sequence starve the others.
2025-03-30 19:21:01 -07:00
Leandro Borges Ferreira
071a9872cb
readme: add Writeopia to community integrations (#10042) 2025-03-30 17:28:06 -07:00
CYJiang
0bd0454ea7
server: organize error types (#9465)
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-03-28 11:50:22 -07:00
Jesse Gross
01aa788722 ml: Remove Output from Context interface
Model implementations should use Input for all of their tensors
supplied to the model. This includes tensors that relate to the
outputs, which is confusing since there is also an Output funciton.

Since Output is only used internally in GGML and not used by any
model implementations, we can remove it from the interface to
reduce confusion.
2025-03-27 12:19:43 -07:00
saman-amd
ead27aa9fe
Add gfx1200 & gfx1201 support on linux (#9878) 2025-03-27 07:35:19 -07:00
Parth Sareen
b816ff86c9
docs: make context length faq readable (#10006) 2025-03-26 17:34:18 -07:00
molbal
e5d84fb90b
docs: add molbal/orca-cli to community integrations (#9909) v0.6.3-rc1 v0.6.3 2025-03-26 13:39:01 -07:00
Hengky Steen
dd66712e31
docs: add ollamb to community projects 2025-03-26 13:38:05 -07:00
Jesse Gross
f66216e399 ggml: Support heterogeneous KV cache layer sizes in memory estimation
Gemma3 uses sliding windows for its context on 5/6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.

Llama3.2-vision is also inconsistent between self attention and cross
attention layers - at moment, we calculate the correct total size
and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.

This allows memory estimation to calculate per-layer KV cache size
and take this account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.

Fixes #9730
Fixes #9890
2025-03-26 13:16:03 -07:00
Jesse Gross
f4f0992b6e llm: Fix debug logging for memory estimates 2025-03-26 13:16:03 -07:00
Jesse Gross
1feff61977 kvcache: Sliding window cache only needs a single batch total
When computing the size of the cache for sliding window attention,
we don't need to multiple the batch size by the number of parallel
sequences - the batch size is constant.

This also simplifies the check for whether to allocate the cache
size based on capacity or window size as the batch size is already
incorporated into the capacity when handled by the runner.
2025-03-26 13:16:03 -07:00
copeland3300
5e0b904e88
docs: add flags to example linux log output command (#9852) 2025-03-25 09:52:23 -07:00
Matheus C. França
131f0355a5
readme: add ollama-d library (#9907) 2025-03-24 09:25:58 -07:00
Blake Mizerany
ce929984a3
server/internal/client/ollama: fix file descriptor management in Pull (#9931)
Close chunked writers as soon as downloads complete, rather than
deferring closure until Pull exits. This prevents exhausting file
descriptors when pulling many layers.

Instead of unbounded defers, use a WaitGroup and background goroutine
to close each chunked writer as soon as its downloads finish.

Also rename 'total' to 'received' for clarity.
2025-03-21 16:16:38 -07:00
Michael Yang
4b34930a31
Merge pull request #9897 from ollama/mxyng/chunk-load
ml/backend/ggml: load tensors in 128KiB chunks
v0.6.3-rc0
2025-03-21 14:47:13 -07:00
Michael Yang
74bd09652d ml/backend/ggml: load tensors in 32KiB chunks 2025-03-21 14:43:52 -07:00
Bruce MacDonald
fb6252d786
benchmark: performance of running ollama server (#8643) 2025-03-21 13:08:20 -07:00
Blake Mizerany
c794fef2f2
server/internal/client/ollama: persist through chunk download errors (#9923) 2025-03-21 13:03:43 -07:00
Parth Sareen
00ebda8cc4
Revert "parser: remove role validation from Modelfile parser" (#9917)
This reverts commit ffbfe833da387f9b6806fe887b85992c11d26eaa.
2025-03-21 12:38:09 -07:00
Parth Sareen
d14ce75b95
docs: update final response for /api/chat stream (#9919) 2025-03-21 12:35:47 -07:00
Jesse Gross
2d6eac9084 kvcache: Optimize sliding window attention
Currently sliding window attention allocates and uses the full
context size and just masks out any tokens that are outside of the
window. However, we really only need (roughly) the sliding window
size.

At large context sizes this improves two things:
 - Memory allocated - since the fully context size is allocated up front,
   memory requirements drop substantially. On Gemma3:4b with a 32k
   context window, total memory usage (including weights and non-sliding
   layers) drops from ~20GB to ~8GB.
 - Computation - ranges that are completely outside of the sliding
   window are now removed from the tensors that are returned from the
   cache rather than simply being masked out. This results in more
   efficient processing, scaling with the size of the context that
   has actually been used.

Notable, this does not update the scheduler for any model to be aware of
the smaller memory requirements. This is difficult for Gemma3 because
the layers are heterogeneous between sliding and non-sliding attention.
As a result, while actual memory consumption will be reduced, the
scheduler will over-estimate the requirements of the model. This means
that splitting between GPUs or GPUs and CPUs will still be suboptimal.

Bug #9730
2025-03-21 11:20:19 -07:00
Jesse Gross
3ed7ad3ab3 kvcache: Pass granular cache size into implementations
Currently the runner computes the kv size needed and creates a
cache of that size. This is the context size times number of
parallel sequences.

Cache implementations can make better decisions about their memory
usage, so instead pass in the required capacity, number of sequences
and maximum batch size. For now, the causal cache just uses this to
compute the size in the same way as before.
2025-03-21 11:20:19 -07:00
Patrick Devine
6d1103048e
fix: show correct bool value for kv in verbose show information (#9928) 2025-03-21 11:13:54 -07:00
Jesse Gross
0ff28758b3 ollamarunner: Provide mechanism for backends to report loading progress
This enables the runner to report progress back to the Ollama server,
both for showing status to the user and also to prevent the server
from killing the runner if it thinks things have stalled.

Most of the infrastructure was already there, this extends it to
be available to the backends.
2025-03-21 10:44:26 -07:00
Jesse Gross
d3e9ca3eda kvcache: Account for source tensors in defrag operation count
Defragging the KV cache can generate a lot of operations, so we
need to be careful that we don't overflow the number that the graph
can support. We currently account for all of the nodes that we add
to the graph for each move but we also need to include the original
cache tensors as well.

Fixes #9904
2025-03-21 10:42:19 -07:00
Jesse Gross
0fbfcf3c9c model: Pass input tensor instead of raw data to models
Rather than directly giving the input data to models, we can
pass a tensor instead. In the short term, this saves some duplicated
code.

Longer term, we will want to overlap setting up the next batch with
processing of the current one. In this case, we will only have the
shape of tensor but it will not be loaded with data at the time of
graph generation. By passing only a tensor to models now, we set up
this possibility and prevent them from relying on data that they won't
have in the future.

Although the same could be done for Positions and Outputs, in some
cases we either need the raw input data or don't use them at all.
Therefore, for now we leave them as they are and allow models to
convert them to tensors as needed.
2025-03-20 13:28:13 -07:00
Jesse Gross
0c220935bd input: Rename Options to Batch
Options is no longer very descriptive of this struct.
2025-03-20 13:28:13 -07:00
rylativity
ffbfe833da
parser: remove role validation from Modelfile parser (#9874)
* updates parser/parser.go to allow arbitrary roles in Modelfile MESSAGE blocks
2025-03-20 13:11:17 -07:00