28 Commits

Bruce MacDonald
23267d783b ml: structured rope config to allow specifying context len
This commit refactors the Rotary Position Embedding (RoPE) implementation across the codebase to use a structured configuration approach instead of individual parameters.

Key changes:
- Add new RoPEConfig struct with fields for dimension, type, base frequency, and scaling
- Add RopeType enum to formalize different RoPE implementation variants
- Add YarnConfig struct and related configuration for YaRN (Yet Another RoPE extensioN) context extension
- Update RoPE method signature across all tensor interfaces and implementations
- Refactor all model implementations (llama, gemma2, gemma3, mllama) to use the new configuration structure

This change improves code organization, makes the RoPE configuration more explicit, and provides better support for different RoPE variants and context extension methods.
2025-04-01 15:57:26 -07:00
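
An illustrative sketch of the structured configuration described above. This is Go, matching the codebase, but every type, field, and variant name here is an assumption based on the commit message, not the actual ollama API:

    // Sketch only: names and fields are assumptions, not the real definitions.
    package ml

    // RopeType formalizes the different RoPE implementation variants.
    type RopeType int

    const (
        RopeTypeStandard RopeType = iota // hypothetical variant names
        RopeTypeNeoX
    )

    // YarnConfig holds YaRN (Yet Another RoPE extensioN) parameters for
    // extending the usable context beyond the trained length.
    type YarnConfig struct {
        OriginalContextLength uint32  // context length the model was trained with
        AttentionFactor       float32 // hypothetical scaling knobs
        BetaFast              float32
        BetaSlow              float32
    }

    // RoPEConfig bundles the parameters that were previously passed to
    // the RoPE method individually.
    type RoPEConfig struct {
        Dim   int         // rotary dimension
        Type  RopeType    // implementation variant
        Base  float32     // base frequency (theta)
        Scale float32     // frequency scaling factor
        Yarn  *YarnConfig // optional context-extension settings
    }

With a struct like this, each model passes one value through the tensor interface instead of threading four or five scalars through every RoPE call site.
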
Jesse Gross
0fbfcf3c9c model: Pass input tensor instead of raw data to models
Rather than directly giving the input data to models, we can
pass a tensor instead. In the short term, this saves some duplicated
code.

Longer term, we will want to overlap setting up the next batch with
processing of the current one. In that case, we will only have the
shape of the tensor, but it will not be loaded with data at the time
of graph generation. By passing only a tensor to models now, we set
up this possibility and prevent them from relying on data that they
won't have in the future.

Although the same could be done for Positions and Outputs, in some
cases we either need the raw input data or don't use them at all.
Therefore, for now we leave them as they are and allow models to
convert them to tensors as needed.
2025-03-20 13:28:13 -07:00
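
A hedged sketch of the signature change; the interface shape and package paths are assumptions based on the message above:

    // Sketch only: interface and package names are assumptions.
    package model

    import (
        "github.com/ollama/ollama/ml"
        "github.com/ollama/ollama/model/input"
    )

    type Model interface {
        // Forward receives the inputs as an ml.Tensor rather than raw
        // token data. At graph-generation time the tensor may carry only
        // a shape, not data, so implementations must wire it into the
        // graph without inspecting its contents.
        Forward(ctx ml.Context, inputs ml.Tensor, batch input.Batch) (ml.Tensor, error)
    }
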
Jesse Gross
0c220935bd input: Rename Options to Batch
Options is no longer very descriptive of this struct.
2025-03-20 13:28:13 -07:00
Jeffrey Morgan
da0e345200 ml: use input context for extracting outputs (#9875) 2025-03-18 18:08:19 -07:00
Jesse Gross
282bfaaa95 ollamarunner: Use a separate context per multimodal input
Currently there is a single context per sequence, shared by all
multimodal inputs. Since we build a vision encoder graph per
image, with a large number of inputs we can eventually hit the
maximum number of graph nodes per context.

This change uses a separate context for each image, ensuring
that resource limits are applied consistently.
2025-03-14 15:38:54 -07:00
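
A simplified sketch of the per-image context pattern; the interface and method names are assumptions, not the actual ollamarunner code:

    // Sketch only: a toy version of the per-image context loop.
    package ollamarunner

    import "github.com/ollama/ollama/ml"

    // visionEncoder stands in for a model's multimodal encoder; the real
    // interface name and signature are assumptions.
    type visionEncoder interface {
        EncodeMultimodal(ml.Context, []byte) (ml.Tensor, error)
    }

    func encodeImages(backend ml.Backend, enc visionEncoder, imgs [][]byte) ([]ml.Tensor, []ml.Context, error) {
        var embeds []ml.Tensor
        var ctxs []ml.Context // one per image; closed when the sequence finishes
        for _, img := range imgs {
            // A fresh context per image means the per-context limit on
            // graph nodes is consumed by a single vision-encoder graph,
            // no matter how many images the sequence contains.
            ctx := backend.NewContext()
            ctxs = append(ctxs, ctx)
            embed, err := enc.EncodeMultimodal(ctx, img)
            if err != nil {
                return nil, nil, err
            }
            embeds = append(embeds, embed)
        }
        return embeds, ctxs, nil
    }
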
Jesse Gross
9679f40146 ml: Allow models to constrain inputs to a single batch
Models may require that a set of inputs all be processed as part
of the same batch. For example, if an image has multiple patches
with fully connected attention between them, we should not split
the batch in the middle of an image.

Fixes #9697
2025-03-14 15:38:54 -07:00
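
One plausible shape for this constraint; the SameBatch field name and semantics are assumptions based on the commit message:

    // Sketch only: field names and semantics are assumptions.
    package input

    // Input is one element of the token stream fed to a model.
    type Input struct {
        Token      int32
        Multimodal any // e.g. an image embedding attached to this input
        // SameBatch asks the runner to keep the next N inputs in the same
        // batch as this one, so an image whose patches share fully
        // connected attention is never split across batches.
        SameBatch int
    }

A model would then set SameBatch to the number of image tokens when it emits the first input of an image.
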
Michael Yang
5e2e0b46b1 fix: error if image requested without vision model 2025-03-13 10:52:09 -07:00
Bruce MacDonald
a70820daa0 models/gemma3: remove final logit softcap (#9692)
Softcap isn't in the whitepaper/implementation for the language model, so we should remove it. There is no discernible difference in output with it removed.
2025-03-12 10:17:57 -07:00
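
For reference, final logit softcapping rescales logits as cap * tanh(logits / cap). A minimal sketch of the operation gemma3 no longer applies; the cap value 30.0 in the comment is the Gemma 2 setting and is an assumption here:

    package gemma3

    import "math"

    // softcap squashes each logit into (-capVal, capVal):
    //     logits[i] = capVal * tanh(logits[i] / capVal)
    // Gemma 2 used capVal = 30.0 for the final logits; gemma3 skips
    // this step entirely.
    func softcap(logits []float32, capVal float32) {
        for i, v := range logits {
            logits[i] = capVal * float32(math.Tanh(float64(v/capVal)))
        }
    }
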
jmorganca
83f0ec8269 all: address linter errors 2025-03-11 14:49:20 -07:00
Michael Yang
63a394068c use 2d pooling 2025-03-11 14:49:20 -07:00
jmorganca
11bfa62796 add trailing \n\n after <end_of_image> to match reference implementation 2025-03-11 14:49:20 -07:00
jmorganca
f63e62e546 reduce kernel size, add TODO for loading from config 2025-03-11 14:49:20 -07:00
jmorganca
65b0f329d1 Revert "Allow models to force a new batch"
This reverts commit c7eae586b899083acebcd9b3847b89ea78c2850c.
2025-03-11 14:49:20 -07:00
Jesse Gross
06007c0a18 Allow models to force a new batch
This is useful for a few things:
 - Work around bugs, such as having 2 images in one batch
 - Keep the image in a single batch for fully connected attention
 - Improve performance by not evaluating embeddings multiple times
2025-03-11 14:49:20 -07:00
Jesse Gross
a8e83a7654 Disable causal attention based on batch index
Currently we are using positions, which are relative to a
sequence and may not be unique.
2025-03-11 14:49:20 -07:00
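
A toy sketch of masking by batch index; it ignores sequence boundaries and per-image grouping for brevity, and all names are assumptions:

    // Sketch only: not the actual runner code.
    package ollamarunner

    // buildMask reports, for each pair of tokens, whether query token i
    // may attend to key token j. Here i and j are batch indices, which
    // are unique within a batch, unlike positions, which are per-sequence
    // and can collide.
    func buildMask(n int, causal []bool) [][]bool {
        enabled := make([][]bool, n)
        for i := range enabled {
            enabled[i] = make([]bool, n)
            for j := range enabled[i] {
                // Causal tokens attend backwards only; tokens flagged
                // non-causal (e.g. image patches) may also attend to each
                // other bidirectionally.
                enabled[i][j] = j <= i || (!causal[i] && !causal[j])
            }
        }
        return enabled
    }
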
Jesse Gross
2c40c4d35e Fix follow up images and images split across batches 2025-03-11 14:49:19 -07:00
Michael Yang
e95278932b use non-causal mask only for image positions 2025-03-11 14:49:19 -07:00
Michael Yang
9d2a20a763 use non-causal mask for inputs with images 2025-03-11 14:49:19 -07:00
Michael Yang
6b32a2d549 compat with upstream gguf 2025-03-11 14:49:19 -07:00
Michael Yang
f888912870 fix vision encoder 2025-03-11 14:49:19 -07:00
Patrick Devine
9b54267e69 fix configs 2025-03-11 14:49:19 -07:00
Michael Yang
46bb0169c4 update model 2025-03-11 14:49:19 -07:00
Michael Yang
8934324b72 use fast attention 2025-03-11 14:49:18 -07:00
Patrick Devine
c62861f4fa fix conversion 2025-03-11 14:49:18 -07:00
Michael Yang
0df1800436 set non-causal attention 2025-03-11 14:49:18 -07:00
Jesse Gross
4346c2409d fix drift from main 2025-03-11 14:49:18 -07:00
Michael Yang
4b037a97dc add gemma vision encoder 2025-03-11 14:49:17 -07:00
Patrick Devine
5f74d1fd47 gemma2 impl 2025-03-11 14:35:08 -07:00