ollama

mirror of https://github.com/ollama/ollama.git synced 2025-03-20 06:42:41 +01:00

Author	SHA1	Message	Date
Michael Yang	26c2e0bd35	ml/backend/ggml: handle user specified cpu offloading	2025-03-07 14:08:21 -08:00
Michael Yang	bf920883d5	ml/backend/ggml: set cpu n_threads	2025-03-07 14:08:21 -08:00
Michael Yang	58b9ec1f6b	kvcache: update tests	2025-03-07 14:08:21 -08:00
Michael Yang	7bae7fa5ce	ml/backend/ggml: create tensor on specific backend some tensors should be created on specific backends to reduce number of copies and improve performance	2025-03-07 14:08:21 -08:00
Michael Yang	764e199d67	kvcache: create cache ctx per layer each cache layer creates and maintains its own context instead of using a large context for all layers	2025-03-07 14:08:21 -08:00
Michael Yang	bfce55db3d	model: load non-repeated tensors into multiple backends some tensors are expected to be used in repeating layers but are not themselves repeated. this change copies these tensors into the same backends as their repeating counterparts to minimize copying tensors between backends	2025-03-07 14:08:21 -08:00
Michael Yang	bab6f34dc0	ml/backend/ggml: update model loading for hybrid/multi backends use a similar strategy as llama.cpp for deciding where tensors should be allocated. this will be improved later to be aware of usable memory before assigning the tensor	2025-03-07 14:08:21 -08:00
Parth Sareen	0682dae027	sample: improve ollama engine sampler performance (#9374) This change brings in various interface cleanups along with greatly improving the performance of the sampler. Tested with llama3.2 on local machine. Improves performance from ~70 tokens/s to 135 tokens/s with topK(40) enabled. Without topK, performance is ~110 tokens/s.	2025-03-07 12:37:48 -08:00
Breaker	1f6986e919	readme: add QwQ to the supported models list (#9565)	2025-03-07 09:30:07 -08:00
Jeffrey Morgan	4289c74359	llama: fix kv loading on snowflake-arctic-embed models (#9536)	2025-03-07 09:25:34 -08:00
Martin Häcker	25248f4bd5	Better WantedBy declaration The problem with default.target is that it always points to the target that is currently started. So if you boot into single-user mode or rescue mode, Ollama still tries to start. I noticed this because it tried (and failed) to start all the time during a system update, where Ollama definitely is not wanted.	2025-03-07 10:26:31 +01:00
Jesse Gross	a7e63b82be	ollamarunner: Improve multimodal input handling Various vision models have different requirements for how they receive their inputs. For example: - Mllama wants images together with text and the image embeddings don't themselves have positions or get stored in the main KV cache - Llava-style models feed in embeddings similar to tokens and images correspond to a varying number of tokens in the cache. In addition, the strategy for providing inputs must support batching and multiple sequences, which are managed by the runner. At the same time, we want to keep data handling fully in the model so that new architectures are not bottlenecked by runner code which does not understand their particular requirements. This provides a method for models to edit the input stream so that it meets their needs while still being in a format that the runner understands. This allows the runner to avoid special processing for different models. In addition, this fixes a regression where non-vision models may try to incorrectly interpret images.	2025-03-06 16:54:16 -08:00
Jesse Gross	b70fc4d51e	model: Don't unconditionally add special tokens We sometimes tokenize partial strings. For example, with multimodal inputs, we split the input string around the images and then tokenize each piece. In these cases, we should only add the special tokens on the first piece.	2025-03-06 16:54:16 -08:00
Blake Mizerany	e2252d0fc6	server/internal/registry: take over pulls from server package (#9485) This commit replaces the old pull implementation in the server package with the new, faster, more robust pull implementation in the registry package. The new endpoint, and now the remove endpoint too, are behind the feature gate "client2", enabled only by setting the OLLAMA_EXPERIMENT environment variable to include "client2". Currently, the progress indication is wired to perform the same as the previous implementation to avoid making changes to the CLI, and because the status reports happen at the start of the download, and the end of the write to disk, the progress indication is not as smooth as it could be. This is a known issue and will be addressed in a future change. This implementation may be ~0.5-1.0% slower in rare cases, depending on network and disk speed, but is generally MUCH faster and more robust than its predecessor in all other cases.	2025-03-05 14:48:18 -08:00
Daniel Hiltgen	cae5d4d4ea	Win: doc new rocm zip file (#9367) To stay under the 2G github artifact limit, we're splitting ROCm out like we do on linux.	2025-03-05 14:11:21 -08:00
Michael Yang	05a01fdecb	ml/backend/ggml: consolidate system info logging - output backend system info when initializing the backend. this ensures this information is always present without needing to be called explicitly - convert to structured logging - enumerate devices rather than backends since devices are ordered - track device indices grouped by device name	2025-03-04 15:14:31 -08:00
aritra saha	8fe6f69f28	docs: add granite-3.2 to the readme	2025-03-04 11:10:56 -08:00
Daniel Hiltgen	1fdb351c37	New engine: vision models and auto-fallback (#9113) * Include unified vision layers in memory prediction For newer vision models with a single gguf, include the projection estimates. * Adjust CLI to handle both styles of vision model metadata * Wire up new tokenizers for new engine If we're loading the new engine, utilize the new model text processor instead of calling into cgo wrappers for llama.cpp. This also cleans up some tech debt from the older tokenization flow for the C++ server which was no longer used. This also adjusts the grammar handling logic to pass through to the new engine instead of utilizing the cgo schema to grammar call. * Lay foundation for auto selection of new engine	2025-03-04 09:03:46 -08:00
Blake Mizerany	7a01ad7614	server/internal/registry: reintroduce pruning on model deletion (#9489) This reintroduces aggressive pruning on model deletion as a temporary measure until a more controlled garbage collection (GC) mechanism is implemented. Issues with the current approach: 1. Users may accidentally delete a model (`ollama rm llama3.3` instead of `ollama rm llama3.2`), requiring a full re-download unless another model references the same blobs. 2. Users may assume a deleted model is still referenced elsewhere, but due to prior updates or deletions, the references no longer exist, leading to unnecessary re-downloads. Soon, we should implement a structured GC mechanism to retain unreferenced blobs for a configurable period before removal, which will run on "ollama rm" and other commands we deem appropriate. Users that want to immediately remove unreferenced blobs can use a new prune command that will allow them to specify the age and class of blobs to remove. Example usage: # Run basic blob GC $ ollama prune # Remove unreferenced blobs older than 7 days $ ollama prune --age 7d # Remove all blobs, referenced or not, older than 7 days (and their manifests?) $ ollama prune --age 7d --all # Remove all unreferenced blobs immediately $ ollama prune --age 0 --all # Remove all blobs $ ollama prune --age 0 --all This should provide a safer and more predictable cleanup process. v0.5.13	2025-03-03 19:11:16 -08:00
Blake Mizerany	55ab9f371a	server/.../backoff,syncs: don't break builds without synctest (#9484) Previously, developers without the synctest experiment enabled would see build failures when running tests in some server/internal/internal packages using the synctest package. This change makes the transition to use of the package less painful but guards the use of the synctest package with build tags. synctest is enabled in CI. If a new change will break a synctest package, it will break in CI, even if it does not break locally. The developer docs have been updated to help with any confusion about why package tests pass locally but fail in CI.	2025-03-03 16:45:40 -08:00
KindBrave	fefbf8f74b	docs: add Ollama Android Chat community integration	2025-03-03 16:38:32 -08:00
Michael Yang	b428ddd796	docker: use go version from go.mod	2025-03-03 13:02:02 -08:00
Michael Yang	ba7d31240e	fix: own lib/ollama directory expand backend loading error handling to catch more problems and log them instead of panicking	2025-03-03 13:01:18 -08:00
CYJiang	d25efe3954	cmd: add default err return for stop (#9458)	2025-03-03 12:13:41 -08:00
Mark	36dfb906bb	docs: don't use self-closing tag for anchor element (#9456)	2025-03-03 11:56:34 -08:00
aritra saha	a6f0f908b9	docs: update phi3-mini to phi4-mini (#9424) * Update README.md removed phi 3 mini and added phi4-mini * Update README.md --------- Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2025-03-03 11:09:21 -08:00
İbrahim Çetin	3b1ddb2b3a	docs: add reins to community integrations (#9411)	2025-03-03 11:06:30 -08:00
Jeffrey Morgan	1579c4f06d	build: install binutils alongside gcc in Dockerfile (#9475) v0.5.13-rc6	2025-03-03 01:20:49 -08:00
Blake Mizerany	3519dd1c6e	server/internal/client/ollama: hold DiskCache on Registry (#9463) Previously, using a Registry required a DiskCache to be passed in for use in various methods. This was a bit cumbersome, as the DiskCache is required for most operations, and the DefaultCache is used in most of those cases. This change makes the DiskCache an optional field on the Registry struct. This also changes DefaultCache to initialize on first use. This is to not burden clients with the cost of creating a new cache per use, or having to hold onto a cache for the lifetime of the Registry. Also, slip in some minor docs updates for Trace.	2025-03-02 20:55:44 -08:00
Jeffrey Morgan	e41c4cbea7	build: install ccache manually in Dockerfile (#9464) Reverts ccache installation to be done manually via curl instead of using the dnf package manager as this has side effects of prepending ccache's install directory to the front of the PATH v0.5.13-rc5	2025-03-02 16:48:31 -08:00
Blake Mizerany	ee048b76d4	server/internal/client/ollama: handle extended names in client/ollama (#9454) The extended name format is a superset of the name format that only the client needs to know about, not the server or other dependents of the name package, so move the split logic into the client package. Also, take advantage of knowing about the extended name format to allow the client to use the extended name format when unlinking to verify they are unlinking the manifest with the content they intend.	2025-03-02 13:30:41 -08:00
Soulter	af68d60a58	readme: add AstrBot to community integrations (#9442)	2025-03-01 21:58:34 -08:00
Jesse Gross	21aa666a1e	ml: Enable support for flash attention The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled. Flash attention can be used in the same situations as the llama engine and is enabled by the user in the same way.	2025-03-01 20:53:23 -08:00
Jesse Gross	ee141cc821	ml: Empty tensor constructor for tensors In cases where we allocate a tensor and then fully overwrite it with copied data, it is wasteful to first zero out the memory.	2025-03-01 20:53:23 -08:00
Jesse Gross	55e5776c44	ggml-backend: Store parent backend as part of tensor It can be important for a tensor to know what backend it came from - for example, to know if flash attention is enabled.	2025-03-01 20:53:23 -08:00
Jesse Gross	854a9195f3	attention: Remove unnecessary contiguous operations Prior to performing attention, we need to permute query, key and value. Currently we call Contiguous after each of these permutations, which is correct but expensive. Avoiding the 3 calls to Contiguous increases performance by over 20%. The permutations of query and key do not violate the continuity rules for mulmat and the Contiguous call can be simply removed. Value requires a different permutation and does require Contiguous. However, we can use the copy into the cache as a way to perform this without further overhead. To support this and avoid unexpected tensor shapes that are seen by models, we need tighter integration between attention, cache and backend. Future optimization will also likely need this structure - for example, flash attention has special padding requirements in the cache and other backends may have their own needs. This further contains the operations that go into attention so that these and other optimizations can be handled transparently. Models that have special requirements for attention can still implement their own version of it.	2025-03-01 20:53:23 -08:00
Jeffrey Morgan	96a97adf9b	build: use correct GGML_HIP_NO_VMM compiler definition for ggml-hip (#9451) v0.5.13-rc4	2025-03-01 17:00:31 -08:00
Jeffrey Morgan	e75c6126e9	build: set GGML_CUDA_NO_VMM for ggml-hip target (#9449) v0.5.13-rc3	2025-03-01 14:02:19 -08:00
Blake Mizerany	cda6f5c66c	server/internal/internal/names: validate names (#9400) This commit is a step towards a goal to make names less ceremonial outside of the registry client. Clients of the registry package can treat names as opaque strings, and the registry package will handle parsing, validating, and normalizing names. Ideally we end up with the names package tucked away in an internal package for good. We'll see how things go. Also, this package name is not permanent. This is another step in the ongoing process of refactoring the server code, and at some point it will most likely be renamed/moved.	2025-03-01 13:15:14 -08:00
Bruce MacDonald	bebb6823c0	server: validate local path on safetensor create (#9379) More validation during the safetensor creation process. Properly handle relative paths (like ./model.safetensors) while rejecting absolute paths. Add comprehensive test coverage for various paths. No functionality changes for valid inputs; existing workflows remain unaffected. Leverages Go 1.24's new os.Root functionality for secure containment. v0.5.13-rc2	2025-02-28 16:10:43 -08:00
Michael Yang	31e472baa4	runner: defer context cancel defer the cancel to guarantee it runs	2025-02-28 22:27:28 +00:00
Michael Yang	657685e85d	fix: replace deprecated functions	2025-02-28 21:29:34 +00:00
Jeffrey Morgan	a14912858e	build: add compute capability 12.0 to CUDA 12 preset (#9426) Focuses initial Blackwell support on compute capability 12.0 which includes the 50x series of GeForce cards. In the future additional compute capabilities may be added	2025-02-28 13:12:31 -08:00
Blake Mizerany	eed11ded30	server/.../safetensors: fix offsets and include all model parts (#9427) Also, require the -as flag to be set when importing a model. This prevents the confusing error message "invalid name". Also, allow short names to be used when importing a model and auto-complete the name with the default mask.	2025-02-28 13:08:10 -08:00
Michael Yang	b42aba40ed	cuda: enable flash attention ggml added an option to disable flash attention so explicitly enable it	2025-02-28 19:40:34 +00:00
王贺	25885e5335	docs: Add 1Panel to Community Integrations (#9312)	2025-02-28 09:53:03 -08:00
Jeffrey Morgan	98d44fa39d	llama: add phi4 mini support (#9403) v0.5.13-rc1	2025-02-27 19:30:32 -08:00
Blake Mizerany	2099e2d267	CONTRIBUTING: provide clarity on good commit messages, and bad (#9405) Also, our commit messages have been getting better, but we can do better, and be more consistent. This adds more clarity on how to write commit messages and provides examples of good and bad messages. Also, our contributing guide was lacking helpful guidance on how to start change proposals. This commit adds the start of that section. Soon, we should add a proposal template to the issue tracker with a link back to the proposal section, which should also be expanded upon.	2025-02-27 19:22:26 -08:00
Bruce MacDonald	0c1041ad85	runner: default to greedy sampler for performance (#9407) As we add support for weighted sampling, we have seen some performance regressions; this bypasses the sampler logic for now and defaults to greedy until we can benchmark the new sampler logic.	2025-02-27 16:41:20 -08:00
Parth Sareen	c245b0406f	sample: remove transforms from greedy sampling (#9377)	2025-02-27 15:44:53 -08:00

4081 Commits