ollama

mirror of https://github.com/ollama/ollama.git synced 2025-11-12 00:37:41 +01:00

Author	SHA1	Message	Date
jmorganca	92b96d54ef	Revert "runner: move harmony to runner (#12052 )" This reverts commit `1a558f98e2`.	2025-09-12 20:40:14 -03:00
Parth Sareen	1a558f98e2	runner: move harmony to runner (#12052 )	2025-09-08 15:07:59 -07:00
Jesse Gross	d5a0d8d904	llm: New memory management This changes the memory allocation strategy from upfront estimation to tracking actual allocations done by the engine and reacting to that. The goal is avoid issues caused by both under-estimation (crashing) and over-estimation (low performance due to under-utilized GPUs). It is currently opt-in and can be enabled for models running on the Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other cases is unchanged and will continue to use the existing estimates.	2025-08-14 15:24:01 -07:00
Devon Rifkin	30f8a68c4c	tools: support anyOf types afaik gpt-oss is the first model that meaningfully transforms tool function definitions in its template. We found that relatively common definitions that include `anyOf` were not working because the template was assuming that types were always defined via a `type` field. anyOf allows for fully recursive types, so I exposed a `toTypeScriptType()` function to handle this recursive logic in go and keep the templates cleaner. The gpt-oss templates will need to be updated to use this. We should keep building out our function definition support to more fully support the parts of json schema that make sense for this use case, but in the meantime this will unblock some users (e.g., zed's ollama integration w/ gpt-oss). Probably the most urgent is proper array support	2025-08-05 16:46:24 -07:00
Michael Yang	fa7776fd24	gpt-oss (#11672 ) * bf16 * tests * gpt-oss * enable gptoss for engine * rough estimate * convert to mxfp4 * handle safetensors U8 * clamp glu/linear * update tokenizer * MXFP4 support This implements the Open Compute Microscaling (MX) FP4 format as a tensor type with backend implementations focusing on mulmat and mulmatid on CPU, CUDA, and Metal. * Unit tests for MXFP4 support This exercises various operations and shapes on both CPU and GPU (if detected on the system) * cuda graph * unit test adjustments * cuda: optimize memory access Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4 * mac: fix crash on old macos versions cblas_sgemm is only supported on v13.3 and up, however bf16 is only supported on v14+ so we were falling back to ggml-blas and crashing on bf16 tensors. Checking for the function being null seems to be the simplest way to condittionally avoid registering the backend. * server: Minimum context length for gptoss This model requires a minimum context length of 8192 to function effectively. Users can set higher values through all normal mechanisms but lower values will be silently reset. * ggml: Multiply by numParallel for gptoss sliding window When computing the graph size estimate, the context size is already multiplied by numParallel so estimates reflect that. However, since sliding window models use a smaller, fixed context size, they need to manually take numParallel into account. * gpt-oss integration includes harmony parser and thinking levels, etc. * fix sync * fix tests * fix lint --------- Co-authored-by: Daniel Hiltgen <daniel@ollama.com> Co-authored-by: Jesse Gross <jesse@ollama.com> Co-authored-by: Devon Rifkin <drifkin@drifkin.net>	2025-08-05 12:21:16 -07:00

5 Commits