4139 Commits

Author SHA1 Message Date
Michael Yang
7946618dc3 metal: op_neg 2025-04-02 16:55:06 -07:00
Michael Yang
a9220da3b6 s/gelu/silu/ 2025-04-02 16:13:19 -07:00
Michael Yang
394b69dece mistral3 quantization 2025-04-02 13:17:06 -07:00
Michael Yang
bfdd02472c remove unused rope 2025-04-02 12:24:02 -07:00
Michael Yang
7465c0118e mistral3 memory
mistral3 graph is very similar to gemma3 so use that for now
2025-04-02 12:11:15 -07:00
Michael Yang
87cf2fa1b8 compute image embeddings once 2025-04-02 11:44:06 -07:00
Michael Yang
2ab14468a8 ml: add repeat op
repeat is a convenience operation for repeating a tensor n times along a
dimension. this can replace instances where the same tensors are stacked
together
2025-04-01 15:25:20 -07:00
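A minimal sketch of what such a convenience op can look like, assuming simplified `Context`/`Tensor` interfaces with a `Concat` method; this is illustrative, not the actual `ml` package API.

```go
package ml

// Assumed, simplified interfaces for illustration only; the real ml
// package defines richer Context and Tensor types.
type Context interface{}

type Tensor interface {
	Concat(ctx Context, t Tensor, dim int) Tensor
}

// Repeat is a hypothetical sketch: repeat t n times along dim by
// concatenating it with itself, replacing call sites that stacked the
// same tensor together manually.
func Repeat(ctx Context, t Tensor, n, dim int) Tensor {
	out := t
	for i := 1; i < n; i++ {
		out = out.Concat(ctx, t, dim)
	}
	return out
}
```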
Michael Yang
90d8a1e8a2 mistral-small: use ollama engine 2025-04-01 13:58:23 -07:00
Michael Yang
1ca8eb5c05 use fast attention 2025-04-01 13:58:23 -07:00
Michael Yang
8aec3e1374 fix image embeddings 2025-04-01 13:02:14 -07:00
Michael Yang
e6b561005e fix patch batch 2025-03-31 17:21:47 -07:00
Michael Yang
6184028fc0 fix convert 2025-03-31 13:12:34 -07:00
Michael Yang
557c641697 2d rope 2025-03-31 10:34:17 -07:00
Michael Yang
863ba57477 fixes 2025-03-25 13:57:24 -07:00
Bruce MacDonald
dce7cf2a1a remove debugging code 2025-03-24 09:48:20 -07:00
jmorganca
62108621d5 update comment 2025-03-23 23:11:47 -07:00
jmorganca
a1c8b0fdb0 cleanup 2025-03-23 22:49:56 -07:00
jmorganca
3daa26e8e8 remove unneeded conversion replacement 2025-03-23 21:57:35 -07:00
jmorganca
1663ef289c remove large files 2025-03-23 21:43:17 -07:00
jmorganca
4586e137fe wip 2025-03-23 21:41:18 -07:00
jmorganca
cfeca27133 wip 2025-03-23 01:01:23 -07:00
jmorganca
4530661799 wip 2025-03-22 23:20:39 -07:00
jmorganca
8dd2a81f8c wip 2025-03-22 22:33:39 -07:00
jmorganca
caddb1e4cf rebased 2025-03-22 10:15:52 -07:00
Bruce MacDonald
4d8dac8ffc wip 2025-03-22 10:03:23 -07:00
Bruce MacDonald
63e6509ec0 vision conversion 2025-03-22 10:03:22 -07:00
Bruce MacDonald
6f34126dcc image processing 2025-03-22 10:03:22 -07:00
Bruce MacDonald
ecc0ef468f split text model to its own file 2025-03-22 10:03:22 -07:00
Bruce MacDonald
9b57238834 ... 2025-03-22 10:03:22 -07:00
Bruce MacDonald
3b4ad00a4b mistral3 arch 2025-03-22 10:03:22 -07:00
Bruce MacDonald
9a12fd1067 wip: test fixes 2025-03-22 10:03:22 -07:00
Bruce MacDonald
edac05387f convert: mistral-3.1-2503 text component 2025-03-22 10:03:22 -07:00
Bruce MacDonald
e65cf9dc94 minimal convert 2025-03-22 10:03:22 -07:00
jmorganca
7e3c62f388 wip 2025-03-22 10:03:22 -07:00
jmorganca
a75703b2cc wip 2025-03-22 10:03:22 -07:00
Bruce MacDonald
c24e8860c1 model: support for mistral-small in the ollama runner
Mistral is a popular research lab making open-source models. This updates
the forward pass of llama-architecture models to support both llama and
mistral models by accounting for additional metadata present in mistral
models and by finding the correct dimensions for the output projection.
2025-03-22 10:03:22 -07:00
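A hedged sketch of the "correct dimensions for the output projection" idea: read the size from the weight tensor's own shape rather than assuming it matches the hidden size. The `Tensor`/`Dim` names are assumptions, not the runner's actual types.

```go
package model

// Tensor is an assumed, minimal interface for illustration.
type Tensor interface {
	Dim(i int) int
}

// outputDim is a hypothetical helper: trust the output projection weight's
// actual shape, which lets the same llama forward pass serve mistral
// checkpoints whose projection size differs from the embedding size.
func outputDim(outputWeight Tensor, hiddenSize int) int {
	if d := outputWeight.Dim(1); d > 0 && d != hiddenSize {
		return d
	}
	return hiddenSize
}
```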
Blake Mizerany
ce929984a3
server/internal/client/ollama: fix file descriptor management in Pull (#9931)
Close chunked writers as soon as downloads complete, rather than
deferring closure until Pull exits. This prevents exhausting file
descriptors when pulling many layers.

Instead of unbounded defers, use a WaitGroup and background goroutine
to close each chunked writer as soon as its downloads finish.

Also rename 'total' to 'received' for clarity.
2025-03-21 16:16:38 -07:00
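A sketch of the pattern the message describes, with hypothetical `chunkedWriter` and download plumbing: a WaitGroup tracks one layer's chunk downloads and a background goroutine closes the writer as soon as they finish, instead of deferring every Close until Pull returns.

```go
package client

import "sync"

// chunkedWriter stands in for the real chunked download writer.
type chunkedWriter struct{}

func (w *chunkedWriter) Close() error { return nil }

// downloadLayer sketches the pattern: close each writer as soon as its own
// chunk downloads finish, so file descriptors are released incrementally.
func downloadLayer(w *chunkedWriter, chunks []string) {
	var wg sync.WaitGroup
	for range chunks {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// ... download one chunk into w ...
		}()
	}
	go func() {
		wg.Wait()
		_ = w.Close() // released here, not when Pull exits
	}()
}
```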
Michael Yang
4b34930a31
Merge pull request #9897 from ollama/mxyng/chunk-load
ml/backend/ggml: load tensors in 128KiB chunks
v0.6.3-rc0
2025-03-21 14:47:13 -07:00
Michael Yang
74bd09652d ml/backend/ggml: load tensors in 128KiB chunks 2025-03-21 14:43:52 -07:00
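A minimal sketch of chunked tensor loading, assuming generic `io.Reader`/`io.Writer` endpoints; the constant mirrors the 128KiB chunk size in the commit title.

```go
package ggml

import "io"

const chunkSize = 128 * 1024 // 128KiB per read, per the commit title

// loadTensor is a sketch: copy a tensor's bytes in fixed-size chunks rather
// than one large read, keeping the staging buffer small and giving natural
// points to report progress or abort.
func loadTensor(dst io.Writer, src io.Reader, size int64) error {
	buf := make([]byte, chunkSize)
	for read := int64(0); read < size; {
		n := int64(chunkSize)
		if remaining := size - read; remaining < n {
			n = remaining
		}
		if _, err := io.ReadFull(src, buf[:n]); err != nil {
			return err
		}
		if _, err := dst.Write(buf[:n]); err != nil {
			return err
		}
		read += n
	}
	return nil
}
```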
Bruce MacDonald
fb6252d786
benchmark: performance of running ollama server (#8643) 2025-03-21 13:08:20 -07:00
Blake Mizerany
c794fef2f2
server/internal/client/ollama: persist through chunk download errors (#9923) 2025-03-21 13:03:43 -07:00
Parth Sareen
00ebda8cc4
Revert "parser: remove role validation from Modelfile parser" (#9917)
This reverts commit ffbfe833da387f9b6806fe887b85992c11d26eaa.
2025-03-21 12:38:09 -07:00
Parth Sareen
d14ce75b95
docs: update final response for /api/chat stream (#9919) 2025-03-21 12:35:47 -07:00
Jesse Gross
2d6eac9084 kvcache: Optimize sliding window attention
Currently sliding window attention allocates and uses the full
context size and just masks out any tokens that are outside of the
window. However, we really only need (roughly) the sliding window
size.

At large context sizes this improves two things:
 - Memory allocated - since the full context size no longer needs to be
   allocated up front, memory requirements drop substantially. On
   Gemma3:4b with a 32k
   context window, total memory usage (including weights and non-sliding
   layers) drops from ~20GB to ~8GB.
 - Computation - ranges that are completely outside of the sliding
   window are now removed from the tensors that are returned from the
   cache rather than simply being masked out. This results in more
   efficient processing, scaling with the size of the context that
   has actually been used.

Notably, this does not update the scheduler for any model to be aware of
the smaller memory requirements. This is difficult for Gemma3 because
the layers are heterogeneous between sliding and non-sliding attention.
As a result, while actual memory consumption will be reduced, the
scheduler will over-estimate the requirements of the model. This means
that splitting between GPUs or GPUs and CPUs will still be suboptimal.

Bug #9730
2025-03-21 11:20:19 -07:00
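A back-of-the-envelope sketch of the sizing change, with assumed names: a sliding-window layer only ever attends to the last windowSize positions, so its cache needs roughly window-plus-batch slots rather than the full context.

```go
package kvcache

// cacheSlots is a sketch of sizing a layer's KV cache. Non-sliding layers
// still need the full context; sliding-window layers only need (roughly)
// the window plus the batch currently in flight. Padding details are
// assumptions, not the real causal cache's logic.
func cacheSlots(contextLen, windowSize, maxBatch int) int {
	if windowSize <= 0 || windowSize >= contextLen {
		return contextLen
	}
	return windowSize + maxBatch
}
```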
Jesse Gross
3ed7ad3ab3 kvcache: Pass granular cache size into implementations
Currently the runner computes the kv size needed and creates a
cache of that size. This is the context size times the number of
parallel sequences.

Cache implementations can make better decisions about their memory
usage, so instead pass in the required capacity, number of sequences
and maximum batch size. For now, the causal cache just uses this to
compute the size in the same way as before.
2025-03-21 11:20:19 -07:00
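A hedged sketch of the interface shape described here: the cache receives the raw parameters and decides its own size, with the causal cache reproducing the old context-times-sequences calculation for now. The names and signature are illustrative.

```go
package kvcache

// Causal is an illustrative stand-in for the causal cache implementation.
type Causal struct {
	capacity int
}

// Init receives the per-sequence context length, the number of parallel
// sequences, and the maximum batch size instead of one precomputed total,
// so implementations can make their own memory decisions.
func (c *Causal) Init(contextLen, numSeqs, maxBatch int) {
	// For now, size the cache the same way as before.
	c.capacity = contextLen * numSeqs
	_ = maxBatch // available for smarter sizing, e.g. sliding-window layers
}
```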
Patrick Devine
6d1103048e
fix: show correct bool value for kv in verbose show information (#9928) 2025-03-21 11:13:54 -07:00
Jesse Gross
0ff28758b3 ollamarunner: Provide mechanism for backends to report loading progress
This enables the runner to report progress back to the Ollama server,
both for showing status to the user and to prevent the server
from killing the runner if it thinks things have stalled.

Most of the infrastructure was already there; this extends it to
be available to the backends.
2025-03-21 10:44:26 -07:00
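A sketch of what the reporting hook could look like, with an assumed callback type and loader: the backend invokes the callback as tensors load, and the runner forwards that status to the server (which also keeps its stall detection satisfied).

```go
package backend

// LoadProgressFn is an assumed callback type for illustration; the runner
// would forward these updates to the Ollama server.
type LoadProgressFn func(loaded, total int)

// loadTensors sketches a backend reporting progress as it loads tensors.
func loadTensors(names []string, progress LoadProgressFn) {
	total := len(names)
	for i := range names {
		// ... read tensor names[i] into device memory ...
		if progress != nil {
			progress(i+1, total)
		}
	}
}
```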
Jesse Gross
d3e9ca3eda kvcache: Account for source tensors in defrag operation count
Defragging the KV cache can generate a lot of operations, so we
need to be careful that we don't overflow the number that the graph
can support. We currently account for all of the nodes that we add
to the graph for each move, but we need to include the original
cache tensors as well.

Fixes #9904
2025-03-21 10:42:19 -07:00
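The budgeting concern can be illustrated with a small, assumed calculation: each planned move must count both the nodes it adds and the original cache tensors it references against the graph's node limit.

```go
package kvcache

// maxGraphNodes is an illustrative limit, not the real backend's value.
const maxGraphNodes = 8192

// movesPerGraph sketches the accounting: each defrag move contributes the
// ops it adds plus the source cache tensors it touches, and the total must
// stay under the graph's node budget.
func movesPerGraph(opsPerMove, srcTensorsPerMove int) int {
	return maxGraphNodes / (opsPerMove + srcTensorsPerMove)
}
```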
Jesse Gross
0fbfcf3c9c model: Pass input tensor instead of raw data to models
Rather than directly giving the input data to models, we can
pass a tensor instead. In the short term, this saves some duplicated
code.

Longer term, we will want to overlap setting up the next batch with
processing of the current one. In this case, we will only have the
shape of the tensor, but it will not be loaded with data at the time of
graph generation. By passing only a tensor to models now, we set up
this possibility and prevent them from relying on data that they won't
have in the future.

Although the same could be done for Positions and Outputs, in some
cases we either need the raw input data or don't use them at all.
Therefore, for now we leave them as they are and allow models to
convert them to tensors as needed.
2025-03-20 13:28:13 -07:00
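A sketch of the resulting model-facing shape, with assumed interfaces: the batch carries an input Tensor (whose data may not be populated yet at graph-build time) while Positions and Outputs remain raw slices.

```go
package model

// Tensor is an assumed placeholder for the backend tensor type.
type Tensor interface{}

// Batch is illustrative: Inputs is already a tensor so a graph can be built
// from its shape alone, while Positions and Outputs stay as raw data that
// models convert to tensors only if they need them.
type Batch struct {
	Inputs    Tensor
	Positions []int32
	Outputs   []int32
}

// Model is an assumed interface showing the tensor-based forward pass.
type Model interface {
	Forward(batch Batch) (Tensor, error)
}
```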
Jesse Gross
0c220935bd input: Rename Options to Batch
Options is no longer very descriptive of this struct.
2025-03-20 13:28:13 -07:00