ollama

mirror of https://github.com/ollama/ollama.git synced 2025-08-25 00:41:22 +02:00

Author	SHA1	Message	Date
Devon Rifkin	2cb0a580f3	thinking: fix double emit when no opening tag The thinking parser will automatically transition to being a pass-through if non-whitespace is seen before an opening tag. However, we weren't clearing the buffer after the first non-whitespace input, so in practice the first token would be emitted twice. Added a test that demonstrated this, and then fixed the bug.	2025-08-21 21:03:12 -07:00
Parth Sareen	7cce5aac76	harmony: move harmony parsing into a package (#12016 )	2025-08-21 13:56:22 -07:00
Michael Yang	4ae4f47b16	gpt-oss: convert from hugging face format (#11907 )	2025-08-20 15:39:18 -07:00
Jesse Gross	073fa31df5	llm: Don't always evict models in CPU-only mode With old memory estimates, it's currently impossible to load more than one model at a time when no GPUs are available. This is because the check for whether we need to evict a model looks to see if all layers of the new model can be loaded onto GPUs, which is never true if there are no GPUs. Before the memory management changes, there was a special code path for CPU-only systems. This problem does not exist with new memory estimates. Fixes #11974	2025-08-20 14:31:02 -07:00
Michael Yang	91fc3c48e3	openai: remove reasoning as an api.Options (#11993 )	2025-08-20 12:21:42 -07:00
Devon Rifkin	6de62664d9	Merge pull request #11973 from ollama/drifkin/bpe model: fix boundary in bpe	2025-08-19 22:58:33 -07:00
Devon Rifkin	463a6caad8	model: add bpe roundtripping tests	2025-08-19 22:05:48 -07:00
Devon Rifkin	fc5fb09f51	model: fix boundary in bpe 0x007e is a tilde and was getting adjusted (+0x00a2) to 0x0120 in the encode, but then in the decode it was getting adjusted down (-0x0100) to 0x0020. The boundary for the +0x00a2 case has been adjusted to fix this Fixes: #11966	2025-08-19 18:34:49 -07:00
Jesse Gross	05ccb17c6e	kvcache: Use Cast instead of Copy for flash attention masks Flash attention kernels require the mask of the KV cache be a F16 rather than an F32. We can use the GGML operation ggml_cast to do this rather than doing it ourselves, which allows reuse of a preallocated buffer in the graph rather than allocating a new one for each batch. This improves token generation performance with flash attention by 10-30% (with gpt-oss). This also makes performance with flash attention better than without it, as expected.	2025-08-19 12:36:28 -07:00
Michael Yang	f804e8a460	disable output_all (#11959 )	2025-08-18 17:45:40 -07:00
Kostis	9cfbffafc5	readme: add any-agent to community integrations (#11950 )	2025-08-18 14:21:36 -07:00
Ruslan Suleymanov	470d580205	readme: add Andes to community integrations (#11952 )	2025-08-18 14:20:28 -07:00
Devon Rifkin	b517bb1c19	Merge pull request #11910 from ollama/drifkin/harmony-fn-names harmony: convert fn names to be valid ts identifiers	2025-08-18 14:17:47 -07:00
Jesse Gross	e3ade453a8	llm: Check for nil memory data before printing We dump out our best memory estimate after we complete processing for any reason, including errors. This is helpful for finding what stopped us in error conditions, but in some cases we might not have gotten even the first result yet. Fixes #11957	2025-08-18 14:05:22 -07:00
Devon Rifkin	048bd4472a	harmony: convert fn names to be valid ts identifiers In <https://github.com/ollama/ollama/issues/11704#issuecomment-3177380197> I noticed that hyphens in function names could possibly cause the model to become confused. Later in that issue I found other explanations, but at a minimum tool names with spaces in them are confusing to the model because of the prompt format. In this change I create a mapper that converts arbitrary tool names into valid typescript identifiers. It's a little overly strict in that it doesn't allow all unicode characters that might be valid in ts identifiers, but it's still very permissive. Since mappings aren't reversible, we must temporarily store this mapping in order to unmap it if the model comes back with a call. We also handle the case where multiple mappings collide into the same mapping and append a counter to the end to make them unique	2025-08-18 14:05:16 -07:00
Devon Rifkin	ec8bf5e6c5	Merge pull request #11875 from ollama/drifkin/print-template server: add debug option for printing out prompt instead of calling model	2025-08-18 14:03:14 -07:00
Kostis	709bbb0b6d	readme: add any-llm to community integrations (#11956 )	2025-08-18 13:13:26 -07:00
Jody Doolittle	abeec240f9	readme: add Serene Pub to community integrations (#11946 )	2025-08-18 13:12:41 -07:00
Michael Yang	df335aac09	gpt-oss: disable quantized kv cache (#11929 )	2025-08-15 15:01:05 -07:00
Patrick Devine	026bc29237	cli: show the default context length env setting in online help (#11928 )	2025-08-15 14:59:52 -07:00
Thomas Pelster	883d031268	docs: added missing comma in 'Ollama's Javascript library'' (#11915 )	2025-08-15 14:45:01 -07:00
Daniel Hiltgen	5271ff8559	handle cgo flags in docker build (#11909 ) Docker build requires build-args to be defined. This ensures the release.yaml settings will be used.	2025-08-15 14:39:35 -07:00
Daniel Hiltgen	d6f7233a1c	test: improve scheduler/concurrency stress tests (#11906 ) * test: improve scheduler/concurrency stress tests The scheduler test used to use approximate memory figures and would often overshoot or undershoot a system's capacity, leading to flaky test results. This should improve the reliability of this scenario by leveraging ps output to determine exactly how many models it takes to trigger thrashing. The concurrency test is also refined to target num_parallel + 1 and handle timeouts better. With these refinements, TestMultiModelConcurrency was redundant * test: add parallel generate with history TestGenerateWithHistory will help verify caching and context are properly handled while making requests * test: focus embed tests on embedding models remove non-embedding models from the embedding tests	2025-08-15 14:37:54 -07:00
Devon Rifkin	8de1da4767	server: add debug option for printing out prompt instead of calling model	2025-08-15 13:52:50 -07:00
Daniel Hiltgen	d925b5350c	Revert "cuda: leverage JIT for smaller footprint (#11635 )" (#11913 ) This reverts commit `dc5a645434`.	2025-08-14 21:19:23 -07:00
Daniel Hiltgen	6eaf194b85	fix arm linux build when HWCAP2_SVE2 undefined (#11908 )	2025-08-14 16:38:53 -07:00
Jesse Gross	d5a0d8d904	llm: New memory management This changes the memory allocation strategy from upfront estimation to tracking actual allocations done by the engine and reacting to that. The goal is to avoid issues caused by both under-estimation (crashing) and over-estimation (low performance due to under-utilized GPUs). It is currently opt-in and can be enabled for models running on the Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other cases is unchanged and will continue to use the existing estimates.	2025-08-14 15:24:01 -07:00
Michael Yang	ef7d26ba2c	convert: skip reading into memory when possible (#11507 ) if there's no transformation to the tensor and the input and output types match, copy directly into the writer. also read from a bufio with a 32K buffer	2025-08-14 15:03:57 -07:00
Michael Yang	1a19df1f3a	update vendored llama.cpp and ggml (#11823 ) * TEMPORARY: Update the llama.cpp upstream to my fork's Granite Four branch This will be redone once my branch is merged upstream in llama.cpp * feat: Update all patches There are a number that are no longer needed at all: - 0003-embeddings: Embeddings entirely overhauled on master - 0008-ensure-KV-cache-is-fully-defragmented: KV caching entirely overhauled on master - 0019-metal-add-mean-kernel-14267: Merged upstream - 0020-CUDA-add-mean-operation-14313: Merged upstream * feat: Sync llama.cpp and ggml * fix: Update rsync-filter for all moved/new/removed files * fix: Add files missing from sync * fix: Update ggml rsync-filter for new ggml-cpu/arch subdirs * fix: Add ggml files missing from sync * fix: Narrow llama.cpp rsync-filter to not include mtmd main tool cpp files * fix: Remove mtmd main cpp files * fix: Add missing include in sampling_ext.cpp * fix: Update llama.go to use mtmd instead of clip/llava * fix: Add patch for mtmd_input_text * chore: Ignore .patched in the patch directory fix: Fix support for arch-specific ggml-cpu source files with new arrangement In https://github.com/ggml-org/llama.cpp/pull/13892, all arch-specific implementations were split out into a nested tree structure under ggml-cpu/arch. This conflicts with standard CGO layout where all arch-specific source files are expected to live in the same directory as the parent go module and use suffixes based on GOOS and GOARCH. As such, there were really two options for getting this to work: 1. Add a patch on top of the GGML sync to rearrange the files to match the GO layout convention 2. Use CGO directives to conditionally include the nested source files in the compilation units This commit does (2) in order to minimize the set of changes needed on top of the upstream file layout. To get this to work, there are two key things needed: 1. 
In cpu.go, #cgo directives are added to explicitly set __${GOARCH}__ in the preprocessor directives 2. In arch-impls.c\|cpp, use an #ifdef \| #elif defined \| #endif chain to explicitly include the .c\|.cpp files for the given architecture from the nested directory * fix: Use mtmd_helper to correctly load the bitmap for the image * fix: Apply patch for mtmd_text_input * fix: Add missing stb to llama.cpp rsync-filter * fix: Add sync'ed stb vendored header * fix: Use c++17 and include vendor for go wrapper modules * fix: Update patch 0015 for upstream implementation of uuid * feat: Bump to the latest tip of the branch * fix: Update patches for bump * feat: Bump back to the central repo and point at the latest master This includes granite 4 and a number of other model architectures! * fix: Revert changes to ggml export GPU UUID patch * fix: Add patch for GGML_VERSION and GGML_COMMIT constants * feat: Sync all patched code * build: Include cmake/common.cmake in ggml sync * build: Add top-level include for GNUInstallDirs in CMakeLists.txt This is used to populate CMAKE_INSTALL_BINDIR * fix: Add a patch to avoid power throttling API on non-msvc windows builds * fix: Sync patch changes for ggml-cpu.c * feat: Bump llama.cpp to 4a4f42 This picks up support for Kimi K2 and PLaMO-2 * feat: Sync llama.cpp * fix: Handle multi-chunk image encodings from mtmd * fix: Re-number patches after merge with `main` * feat: Bump to 41e78c in the makefile * fix: Fix Solar and argsort/copy patches after bump * fix: Remove Gemma3n CUDA Graphs patch It was implemented upstream: https://github.com/ggml-org/llama.cpp/pull/14741 * feat: Sync llama.cpp / ggml after latest bump * build: Remove unnecessary CFLAGS definitions in cpu.go * fix: Remove unnecessary additions in the rsync-filter * fix: Remove unused vendored code for chat template parsing * Revert "fix: Remove Gemma3n CUDA Graphs patch" This reverts commit `d724caced3`. 
* fix: Update 0020 CUDA Graphs for gemma3n to keep both llama.cpp and ollama fixes https://github.com/ollama/ollama/pull/11195#issuecomment-3137312394 * fix: Sync ggml-cuda.cu after keeping both style cuda graph fixes for gemma3n * unwind mxfp4 patch Prepare to bump ggml with their impl for mxfp4 * bump * fix windows build error * Convert tensors at load time Repack the mxfp4 tensors as ggml's kernels expect them to be. * convert mlp bf16 to f32 * buffer the conversion better * reshape earlier * openai swiglu * add ids * split qkv, gate_up * fix nested alt tags * fast attention * remove debug messages * fix lint * remove redundant test * remap values only if source/target are different * add back i32->i32 copy * refactor cpu quants * clean up vendor * update patch instructions * clean up patches * remove webgpu * update mem * also handle gpt-oss * revert convert changes --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Daniel Hiltgen <daniel@ollama.com>	2025-08-14 14:42:58 -07:00
Daniel Hiltgen	7ccfd97a93	doc: clarify both rocm and main bundle necessary (#11900 ) Some users expect the rocm bundles to be self-sufficient, but they are designed to be additive.	2025-08-14 12:54:55 -07:00
Daniel Hiltgen	c385ca8672	test: add valid responses (#11902 ) some of the new models need a few more valid responses to pass	2025-08-14 11:07:13 -07:00
Daniel Hiltgen	837379a94c	discovery: fix cudart driver version (#11614 ) We prefer the nvcuda library, which reports driver versions. When we dropped cuda v11, we added a safety check for too-old drivers. What we missed was the cudart fallback discovery logic didn't have driver version wired up. This fixes cudart discovery to expose the driver version as well so we no longer reject all GPUs if nvcuda didn't work.	2025-08-13 15:43:33 -07:00
Daniel Hiltgen	a24f90604f	int: adjust a few models for integration tests (#11872 )	2025-08-13 15:42:36 -07:00
Daniel Hiltgen	dc5a645434	cuda: leverage JIT for smaller footprint (#11635 ) Prior to this change our official binaries contained both JIT PTX code and the cubin binary code for our chosen compute capabilities. This change switches to only compile the PTX code and rely on JIT at runtime for generating the cubin specific to the user's GPU. The cubins are cached on the user's system, so they should only see a small lag on the very first model load for a given Ollama release. This also adds the first generation of Blackwell GPUs so they aren't reliant on the Hopper PTX. This change reduces the ggml-cuda.dll from 1.2G to 460M	2025-08-13 15:42:16 -07:00
youzichuan	bb71654ebe	chore: fix some inconsistent function name in comment Signed-off-by: youzichuan <youzichuan6@outlook.com>	2025-08-13 09:50:27 -07:00
Jesse Gross	a343ae53a4	ggml: Use ordinal IDs for AMD GPUs on Linux when UUID is unavailable Some AMD GPUs do not provide UUIDs and report only "XX". In these cases, we should use the ordinal ID as an alternate identifier. This is the same as we always need to do on Windows for AMD. In addition, this prints out the ID for each GPU when enumerating them for easier debugging in the future.	2025-08-12 16:56:14 -07:00
Michael Yang	d0cf6c8281	fix(openai): handle reasoning_effort (#11868 )	2025-08-12 11:02:01 -07:00
Jesse Gross	8f4ec9ab28	discover: CPU supports flash attention We already run flash attention on CPUs in cases where we have partial offloading but were disabling it if running on pure CPU, which is unnecessary.	2025-08-11 15:00:34 -07:00
Devon Rifkin	dbfd7bd027	Merge pull request #11861 from ollama/drifkin/fix-parsing-error server: fix error when parsing bad harmony tool calls	2025-08-11 14:59:57 -07:00
Devon Rifkin	ee04dbba51	server: fix error when parsing bad harmony tool calls Thanks @moll for reporting! Fixes: #11781	2025-08-11 14:09:13 -07:00
Daniel Andersen	ea7657b54a	sched: Add support for grouping GPUs (#10678 ) This patch modifies Ollama to allow grouping GPUs to memory-fit to the requested model, instead of the former algorithm of using one GPU or distributing over all available GPUs. Benefits: - Less (PCIe) bus communication between GPUs, especially when the links are not very high speed - Unallocated GPUs can enter power-saving mode - Significantly reduced VRAM allocation when using more than 2 GPUs in a system - Due to the reduced memory allocation, you can run more models simultaneously.	2025-08-11 13:59:38 -07:00
Michael Vorburger	2c776f0780	CONTRIBUTING: Explicitly note docs:... as a good example (#11755 )	2025-08-09 18:12:30 -07:00
Jesse Gross	79f6376f5b	ggml: No-alloc mode Callers can set a backend buffer type to be no-alloc, meaning that it does not allocate memory for tensors or operations. This can be used for calculating memory requirements. Tensors and graphs must be recreated with no-alloc set to false before loading data. Defaults to false for newly created backend buffer types.	2025-08-08 14:57:13 -07:00
Jesse Gross	756c78cfc7	ggml: Support closing backends In order to iteratively find the best memory allocation, we need to be able to free backend memory so we can try again.	2025-08-08 14:57:13 -07:00
Jesse Gross	d7f4f788d1	ggml: Use GGML's typedef'ed pointer types For many backend data structures, GGML defines a typedef of a pointer type and returns these from functions. In most cases, CGo understands that these are interchangeable but some parts of Go (such as generics) think they are two different types. We should prefer the form that GGML uses.	2025-08-08 14:57:13 -07:00
Daniel Hiltgen	114c3f2265	tests: add integration coverage for oss-gpt (#11696 ) Also wires up support to override the default "smol" model	2025-08-07 15:06:57 -07:00
Jesse Gross	f2e9c9aff5	server: Reduce gpt-oss context length for small VRAM GPUs gpt-oss works best with a context length of at least 8k. However, for GPUs with limited amount of VRAM, there is a significant performance hit to this increased context. In these cases, we switch to the Ollama default of 4k	2025-08-07 14:23:55 -07:00
Devon Rifkin	aa9d889522	Merge pull request #11765 from ollama/drifkin/thinking-without-content openai: always provide reasoning	2025-08-06 19:02:23 -07:00
Devon Rifkin	735c41f9ca	openai: always provide reasoning We were missing passing along thinking if content was nil (as opposed to empty string). Also added a test for content not being passed, which was the real cause of <https://github.com/ollama/ollama/issues/11704>, since with the way `Content` is typed, not passing it and passing an empty string are distinct	2025-08-06 18:54:20 -07:00
Devon Rifkin	223a619468	Merge pull request #11761 from ollama/drifkin/openai-tool-names openai: when converting role=tool messages, propagate the tool name	2025-08-06 17:53:25 -07:00
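
Note on fc5fb09f51 (model: fix boundary in bpe): the off-by-one described in that message can be reproduced with a small, self-contained sketch. The function names below are hypothetical and the ranges are assumptions reconstructed from the constants quoted in the commit message and from the usual GPT-2 style byte-to-rune table; this is not the code in the Ollama tree.

```go
package main

import "fmt"

// encodeByte maps a raw byte to the rune used in a GPT-2 style byte-level
// BPE vocabulary. Ranges are assumptions based on the commit message.
func encodeByte(b byte) rune {
	r := rune(b)
	switch {
	case r == 0x00ad:
		return 0x0143
	case r <= 0x0020:
		return r + 0x0100
	case r >= 0x007f && r <= 0x00a0: // lower bound must be 0x007f, not 0x007e
		return r + 0x00a2
	}
	return r
}

// encodeByteBuggy imitates the old boundary, where 0x007e ('~') was shifted too.
func encodeByteBuggy(b byte) rune {
	if b == 0x7e {
		return rune(b) + 0x00a2 // 0x0120, which decodes as a space
	}
	return encodeByte(b)
}

// decodeRune is the inverse of encodeByte.
func decodeRune(r rune) byte {
	switch {
	case r == 0x0143:
		return 0x00ad
	case r >= 0x0100 && r <= 0x0120:
		return byte(r - 0x0100)
	case r >= 0x0121 && r <= 0x0142:
		return byte(r - 0x00a2)
	}
	return byte(r)
}

// roundTripFailures lists every byte that does not survive encode then decode.
func roundTripFailures(enc func(byte) rune) []byte {
	var bad []byte
	for i := 0; i < 256; i++ {
		if b := byte(i); decodeRune(enc(b)) != b {
			bad = append(bad, b)
		}
	}
	return bad
}

func main() {
	fmt.Printf("fixed boundary failures: %v\n", roundTripFailures(encodeByte))
	fmt.Printf("old boundary failures:   %v\n", roundTripFailures(encodeByteBuggy))
}
```

Checking all 256 byte values, rather than a handful of sample strings, is what makes a boundary error like this one visible, which is presumably the motivation for the roundtripping tests added in 463a6caad8.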

4493 Commits
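
Note on 048bd4472a (harmony: convert fn names to be valid ts identifiers): the mapping described in that message (sanitize a name, remember the mapping so it can be undone, and append a counter on collisions) might look roughly like the sketch below. NameMapper, Map, Unmap and sanitize are hypothetical names chosen for illustration, and the character rules are a deliberately strict ASCII-only assumption; reserved words are not handled. The real package may differ in all of these respects.

```go
package main

import (
	"fmt"
	"strings"
)

// NameMapper (hypothetical) converts arbitrary tool names into identifiers
// and remembers each mapping so a model's tool call can be translated back.
type NameMapper struct {
	toOriginal  map[string]string // identifier -> original name
	toSanitized map[string]string // original name -> identifier
}

func NewNameMapper() *NameMapper {
	return &NameMapper{
		toOriginal:  map[string]string{},
		toSanitized: map[string]string{},
	}
}

// sanitize keeps ASCII letters, digits, '_' and '$', replaces everything
// else with '_', and prefixes '_' when the name starts with a digit.
func sanitize(name string) string {
	var sb strings.Builder
	for i, r := range name {
		switch {
		case r == '_' || r == '$' || (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z'):
			sb.WriteRune(r)
		case r >= '0' && r <= '9':
			if i == 0 {
				sb.WriteRune('_')
			}
			sb.WriteRune(r)
		default:
			sb.WriteRune('_')
		}
	}
	if sb.Len() == 0 {
		return "_"
	}
	return sb.String()
}

// Map returns a unique identifier for original, appending a counter when
// two different names sanitize to the same identifier.
func (m *NameMapper) Map(original string) string {
	if s, ok := m.toSanitized[original]; ok {
		return s
	}
	base := sanitize(original)
	candidate := base
	for n := 2; ; n++ {
		if _, taken := m.toOriginal[candidate]; !taken {
			break
		}
		candidate = fmt.Sprintf("%s_%d", base, n)
	}
	m.toOriginal[candidate] = original
	m.toSanitized[original] = candidate
	return candidate
}

// Unmap translates an identifier produced by Map back to the original name.
func (m *NameMapper) Unmap(identifier string) (string, bool) {
	orig, ok := m.toOriginal[identifier]
	return orig, ok
}

func main() {
	m := NewNameMapper()
	for _, name := range []string{"get-weather", "get weather", "2fast"} {
		fmt.Printf("%q -> %q\n", name, m.Map(name))
	}
}
```

Because sanitizing is lossy, the mapper has to live for the duration of a request: the identifier goes into the prompt, and Unmap is consulted only if the model answers with a tool call.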