Prior to performing attention, we need to permute query, key
and value. Currently we call Contiguous after each of these
permutations, which is correct but expensive. Avoiding the
3 calls to Contiguous increases performance by over 20%.
The permutations of query and key do not violate the continuity
rules for mulmat and the Contiguous call can be simply removed.
Value requires a different permutation and does require Contiguous.
However, we can use the copy into the cache as a way to perform this
without further overhead.
To support this and avoid unexpected tensor shapes that are seen by
models, we need tighter integration between attention, cache
and backend. Future optimization will also likely need this structure
- for example, flash attention has special padding requirements in
the cache and other backends may have their own needs.
This further contains the operations that go into attention so that
these and other optimizations can be handled transparently. Models
that have special requirements for attention can still implement
their own version of it.
There are two benefits to doing this:
- Provide a library function that models can use, reducing code for
each model implementation
- Enables a single place to drop in optimized implementations of
attention based on the backend or other factors. One is provided for
GGML.
On CUDA this improves token generation rate by about 3%. It does not
have a significant effect on Metal.
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
feat: add new Ollama engine using ggml through cgo
This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.
- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations
This is the first implementation of the new engine. Follow up PRs will implement more features:
- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>