ollama

mirror of https://github.com/ollama/ollama.git synced 2025-03-20 23:02:48 +01:00

Author	SHA1	Message	Date
Michael Yang	4ea4d2b189	Merge pull request #9703 from ollama/mxyng/gemma3-memory count gemma3 vision tensors	2025-03-13 16:56:34 -07:00
Michael Yang	8d76fa23ef	count non-repeating vision layers	2025-03-13 16:53:29 -07:00
Michael Yang	65b88c544f	fix divide by zero	2025-03-13 16:35:00 -07:00
Michael Yang	a422ba39c9	roughly count gemma3 graph: the largest operation is by far (q @ k), so just count that for simplicity	2025-03-13 16:35:00 -07:00
Michael Yang	d2ec22371e	count all vision tensors	2025-03-13 16:35:00 -07:00
Michael Yang	033cec232a	count gemma3 vision tensors	2025-03-13 16:34:42 -07:00
Patrick Devine	4bed739259	add verbose mode to the show command (#9640). Add metadata and tensor information to the show command to be able to see more information about a model. This outputs the same data as shown on the model details page on ollama.com.	2025-03-13 14:24:27 -07:00
Daniel Hiltgen	ab39e08eb9	llm: auto detect models that require Ollama Engine (#1)	2025-03-11 14:49:20 -07:00
Patrick Devine	5f74d1fd47	gemma2 impl	2025-03-11 14:35:08 -07:00
Daniel Hiltgen	1fdb351c37	New engine: vision models and auto-fallback (#9113). * Include unified vision layers in memory prediction: For newer vision models with a single gguf, include the projection estimates. * Adjust CLI to handle both styles of vision model metadata. * Wire up new tokenizers for new engine: If we're loading the new engine, utilize the new model text processor instead of calling into cgo wrappers for llama.cpp. This also cleans up some tech debt from the older tokenization flow for the C++ server which was no longer used. This also adjusts the grammar handling logic to pass through to the new engine instead of utilizing the cgo schema to grammar call. * Lay foundation for auto selection of new engine.	2025-03-04 09:03:46 -08:00
Michael Yang	53d2990d9b	model: add bos token if configured	2025-02-27 21:04:59 +00:00
Michael Yang	b16367b4b2	fix: add back bf16 support. This was accidentally removed when moving fs/ggml from its previous location.	2025-02-25 19:26:14 +00:00
Michael Yang	58245413f4	next ollama runner (#7913). feat: add new Ollama engine using ggml through cgo. This change introduces a new way to run pretrained models. It introduces 3 high-level interfaces and a bunch of smaller helper interfaces to facilitate this. - `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`. - `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc.) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`. - `ml.Tensor` defines the interface for a tensor and tensor operations. This is the first implementation of the new engine. Follow-up PRs will implement more features: - non-greedy sampling (#8410) - integration with Ollama and KV caching (#8301) - more model support (#9080), with more coming soon. Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2025-02-13 16:31:21 -08:00

13 Commits
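
The "next ollama runner" commit (58245413f4) describes three interfaces, `model.Model`, `ml.Backend`, and `ml.Tensor`, and how they divide the work. The Go sketch below illustrates only that division of responsibilities. It is not the code from `model/model.go` or `ml/backend.go`: every method signature and every concrete type here (`Get`, `Add`, `Values`, `sliceTensor`, `mapBackend`, `biasModel`) is a hypothetical stand-in chosen for illustration, and the real interfaces are likely to differ.

```go
package main

import "fmt"

// Tensor plays the role the commit assigns to ml.Tensor: a value plus
// operations on it. The two methods are assumptions for this sketch.
type Tensor interface {
	Add(other Tensor) Tensor
	Values() []float32
}

// Backend plays the role of ml.Backend: it owns the loaded weights and
// lets a model look them up. Get is a hypothetical method name.
type Backend interface {
	Get(name string) (Tensor, bool)
}

// Model plays the role of model.Model: an architecture whose Forward
// method computes the output. The parameter list is an assumption.
type Model interface {
	Forward(b Backend, input Tensor) (Tensor, error)
}

// sliceTensor is a toy in-memory tensor backed by a Go slice.
type sliceTensor []float32

// Add returns the element-wise sum; it assumes both operands have the
// same length and does not check it.
func (t sliceTensor) Add(other Tensor) Tensor {
	o := other.Values()
	out := make(sliceTensor, len(t))
	for i := range t {
		out[i] = t[i] + o[i]
	}
	return out
}

func (t sliceTensor) Values() []float32 { return t }

// mapBackend is a toy backend that "loads" tensors into a map.
type mapBackend map[string]Tensor

func (b mapBackend) Get(name string) (Tensor, bool) {
	t, ok := b[name]
	return t, ok
}

// biasModel is a toy architecture: its forward pass adds one stored
// tensor named "bias" to the input.
type biasModel struct{}

func (biasModel) Forward(b Backend, input Tensor) (Tensor, error) {
	bias, ok := b.Get("bias")
	if !ok {
		return nil, fmt.Errorf("tensor %q not loaded", "bias")
	}
	return input.Add(bias), nil
}

func main() {
	var m Model = biasModel{}
	backend := mapBackend{"bias": sliceTensor{0.5, 0.5}}
	out, err := m.Forward(backend, sliceTensor{1, 2})
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Values())
}
```

The point of the split, as the commit message frames it, is that a model architecture only talks to tensors through the backend, so the tensor library (`ggml` in the commit) can sit behind an interface rather than being called directly from model code.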