ollama

mirror of https://github.com/ollama/ollama.git synced 2025-11-11 20:57:58 +01:00

Author	SHA1	Message	Date
fengyuchuanshen	8a7e2055d2	cmd: use slices.Contains to simplify code (#12249 )	2025-09-11 09:57:31 -07:00
Patrick Devine	026bc29237	cli: show the default context length env setting in online help (#11928 )	2025-08-15 14:59:52 -07:00
Michael Yang	fa7776fd24	gpt-oss (#11672 ) * bf16 * tests * gpt-oss * enable gptoss for engine * rough estimate * convert to mxfp4 * handle safetensors U8 * clamp glu/linear * update tokenizer * MXFP4 support This implements the Open Compute Microscaling (MX) FP4 format as a tensor type with backend implementations focusing on mulmat and mulmatid on CPU, CUDA, and Metal. * Unit tests for MXFP4 support This exercises various operations and shapes on both CPU and GPU (if detected on the system) * cuda graph * unit test adjustments * cuda: optimize memory access Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4 * mac: fix crash on old macos versions cblas_sgemm is only supported on v13.3 and up, however bf16 is only supported on v14+ so we were falling back to ggml-blas and crashing on bf16 tensors. Checking for the function being null seems to be the simplest way to conditionally avoid registering the backend. * server: Minimum context length for gptoss This model requires a minimum context length of 8192 to function effectively. Users can set higher values through all normal mechanisms but lower values will be silently reset. * ggml: Multiply by numParallel for gptoss sliding window When computing the graph size estimate, the context size is already multiplied by numParallel so estimates reflect that. However, since sliding window models use a smaller, fixed context size, they need to manually take numParallel into account. * gpt-oss integration includes harmony parser and thinking levels, etc. * fix sync * fix tests * fix lint --------- Co-authored-by: Daniel Hiltgen <daniel@ollama.com> Co-authored-by: Jesse Gross <jesse@ollama.com> Co-authored-by: Devon Rifkin <drifkin@drifkin.net>	2025-08-05 12:21:16 -07:00
Patrick Devine	80b538e312	cli: catch upstream errors gracefully (#11512 )	2025-07-23 22:16:55 -07:00
frob	802ad16ce4	docs: add the no-Modelfile function of `ollama create` (#9077 )	2025-07-16 22:16:10 -07:00
Parth Sareen	d73f8aa8c3	cmd: add default assistant role to message construction (#11431 )	2025-07-16 11:18:16 -07:00
Daniel Hiltgen	34088dbcfb	API/CLI context enhancements (#11331 ) * API: expose context size of loaded models * CLI: add context UX This adds a column in the ps output to show the model's context size.	2025-07-08 11:59:06 -07:00
Devon Rifkin	5f57b0ef42	add thinking support to the api and cli (#10584 ) - Both `/api/generate` and `/api/chat` now accept a `"think"` option that allows specifying whether thinking mode should be on or not - Templates get passed this new option so, e.g., qwen3's template can put `/think` or `/no_think` in the system prompt depending on the value of the setting - Models' thinking support is inferred by inspecting model templates. The prefix and suffix the parser uses to identify thinking support is also automatically inferred from templates - Thinking control & parsing is opt-in via the API to prevent breaking existing API consumers. If the `"think"` option is not specified, the behavior is unchanged from previous versions of ollama - Add parsing for thinking blocks in both streaming/non-streaming mode in both `/generate` and `/chat` - Update the CLI to make use of these changes. Users can pass `--think` or `--think=false` to control thinking, or during an interactive session they can use the commands `/set think` or `/set nothink` - A `--hidethinking` option has also been added to the CLI. This makes it easy to use thinking in scripting scenarios like `ollama run qwen3 --think --hidethinking "my question here"` where you just want to see the answer but still want the benefits of thinking models	2025-05-28 19:38:52 -07:00
Daniel Hiltgen	7359b02707	win: detect background upgrade in progress (#10785 ) Give the user a helpful error instead of showing connection refused errors.	2025-05-21 10:46:56 -07:00
Daniel Hiltgen	27da2cddc5	Fix lingering Q4_0 help reference (#10720 )	2025-05-15 16:33:23 -07:00
Bruce MacDonald	feb8923ada	cmd: add ellipses to truncated show metadata (#10717 ) When a piece of information has been truncated in the show output, add an ellipsis to indicate that more data has not been displayed	2025-05-15 15:45:52 -07:00
Daniel Hiltgen	424810450f	Move quantization to new backend (#10363 ) * Move quantization logic to GGML via new backend This moves the model-aware logic to Go code and calls GGML's quantization code for model creation. * Remove "add model quantizations" This is no longer needed now that quantization is implemented in Go+GGML code directly.	2025-05-06 11:20:48 -07:00
Michael Yang	d931ee8f22	create blobs in parallel (#10135 ) * default max term height * error on out of tree files	2025-05-05 11:59:26 -07:00
Devon Rifkin	dd93e1af85	Revert "increase default context length to 4096 (#10364 )" This reverts commit `424f648632`.	2025-04-28 16:54:11 -07:00
Devon Rifkin	424f648632	increase default context length to 4096 (#10364 ) * increase default context length to 4096 We lower the default numParallel from 4 to 2 and use these "savings" to double the default context length from 2048 to 4096. We're memory neutral in cases when we previously would've used numParallel == 4, but we add the following mitigation to handle some cases where we would have previously fallen back to 1x2048 due to low VRAM: we decide between 2048 and 4096 using a runtime check, choosing 2048 if we're on a one GPU system with total VRAM of <= 4 GB. We purposefully don't check the available VRAM because we don't want the context window size to change unexpectedly based on the available VRAM. We plan on making the default even larger, but this is a relatively low-risk change we can make to quickly double it. * fix tests add an explicit context length so they don't get truncated. The code that converts -1 from being a signal for doing a runtime check isn't running as part of these tests. * tweak small gpu message * clarify context length default also make it actually show up in `ollama serve --help`	2025-04-22 16:33:24 -07:00
Blake Mizerany	1e7f62cb42	cmd: add retry/backoff (#10069 ) This commit adds retry/backoff to the registry client for pull requests. Also, revert progress indication to match original client's until we can "get it right." Also, make WithTrace wrap existing traces instead of clobbering them. This allows clients to compose traces.	2025-04-15 23:24:44 -07:00
frob	ccc8c6777b	cleanup: remove OLLAMA_TMPDIR and references to temporary executables (#10182 ) * cleanup: remove OLLAMA_TMPDIR * cleanup: ollama doesn't use temporary executables anymore --------- Co-authored-by: Richard Lyons <frob@cloudstaff.com>	2025-04-08 15:01:39 -07:00
Bruce MacDonald	9876c9faa4	chore(all): replace instances of interface with any (#10067 ) Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.	2025-04-02 09:44:27 -07:00
Bruce MacDonald	e172f095ba	api: return model capabilities from the show endpoint (#10066 ) With support for multimodal models becoming more varied and common it is important for clients to be able to easily see what capabilities a model has. Returning these from the show endpoint will allow clients to easily see what a model can do.	2025-04-01 15:21:46 -07:00
Patrick Devine	6d1103048e	fix: show correct bool value for kv in verbose show information (#9928 )	2025-03-21 11:13:54 -07:00
Patrick Devine	4bed739259	add verbose mode to the show command (#9640 ) Add metadata and tensor information to the show command to be able to see more information about a model. This outputs the same data as shown on the model details page on ollama.com	2025-03-13 14:24:27 -07:00
Michael Yang	05a01fdecb	ml/backend/ggml: consolidate system info logging - output backend system info when initializing the backend. this ensures this information is always present without needing to be called explicitly - convert to structured logging - enumerate devices rather than backends since devices are ordered - track device indices grouped by device name	2025-03-04 15:14:31 -08:00
Daniel Hiltgen	1fdb351c37	New engine: vision models and auto-fallback (#9113 ) * Include unified vision layers in memory prediction For newer vision models with a single gguf, include the projection estimates. * Adjust CLI to handle both styles of vision model metadata * Wire up new tokenizers for new engine If we're loading the new engine, utilize the new model text processor instead of calling into cgo wrappers for llama.cpp. This also cleans up some tech debt from the older tokenization flow for the C++ server which was no longer used. This also adjusts the grammar handling logic to pass through to the new engine instead of utilizing the cgo schema to grammar call. * Lay foundation for auto selection of new engine	2025-03-04 09:03:46 -08:00
CYJiang	d25efe3954	cmd: add default err return for stop (#9458 )	2025-03-03 12:13:41 -08:00
Jesse Gross	ed443a0393	Runner for Ollama engine This provides integration with the new Ollama engine (`5824541` next ollama runner (#7913)) and the rest of the Ollama infrastructure such as the runner and Ollama server. In addition, it also builds out the KV cache infrastructure to support requirements of how Ollama runs models such as: - Parallel processing - Memory management for defragmentation and shifting - Multi-modal models Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine: Start the server with the OLLAMA_NEW_ENGINE environment variable set: OLLAMA_NEW_ENGINE=1 ./ollama serve Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M: ./ollama run jessegross/llama3.1	2025-02-13 17:09:26 -08:00
Patrick Devine	a420a453b4	fix default modelfile for create (#8452 )	2025-01-16 01:14:04 -08:00
Patrick Devine	32bd37adf8	make the modelfile path relative for `ollama create` (#8380 )	2025-01-10 16:14:08 -08:00
Patrick Devine	8bccae4f92	show a more descriptive error in the client if it is newer than the server (#8351 )	2025-01-09 10:12:30 -08:00
Patrick Devine	86a622cbdc	Update the /api/create endpoint to use JSON (#7935 ) Replaces `POST /api/create` to use JSON instead of a Modelfile. This is a breaking change.	2024-12-31 18:02:30 -08:00
Blake Mizerany	b1fd7fef86	server: more support for mixed-case model names (#8017 ) Fixes #7944	2024-12-11 15:29:59 -08:00
Daniel Hiltgen	4879a234c4	build: Make target improvements (#7499 ) * llama: wire up builtin runner This adds a new entrypoint into the ollama CLI to run the cgo built runner. On Mac arm64, this will have GPU support, but on all other platforms it will be the lowest common denominator CPU build. After we fully transition to the new Go runners more tech-debt can be removed and we can stop building the "default" runner via make and rely on the builtin always. * build: Make target improvements Add a few new targets and help for building locally. This also adjusts the runner lookup to favor local builds, then runners relative to the executable, and finally payloads. * Support customized CPU flags for runners This implements a simplified custom CPU flags pattern for the runners. When built without overrides, the runner name contains the vector flag we check for (AVX) to ensure we don't try to run on unsupported systems and crash. If the user builds a customized set, we omit the naming scheme and don't check for compatibility. This avoids checking requirements at runtime, so that logic has been removed as well. This can be used to build GPU runners with no vector flags, or CPU/GPU runners with additional flags (e.g. AVX512) enabled. * Use relative paths If the user checks out the repo in a path that contains spaces, make gets really confused so use relative paths for everything in-repo to avoid breakage. * Remove payloads from main binary * install: clean up prior libraries This removes support for v0.3.6 and older versions (before the tar bundle) and ensures we clean up prior libraries before extracting the bundle(s). Without this change, runners and dependent libraries could leak when we update and lead to subtle runtime errors.	2024-12-10 09:47:19 -08:00
Parth Sareen	de52b6c2f9	bugfix: "null" value json mode (#7979 )	2024-12-06 14:13:15 -08:00
Parth Sareen	c6c526275d	api: add generate endpoint for structured outputs (#7939 )	2024-12-04 17:37:12 -08:00
Parth Sareen	630e7dc6ff	api: structured outputs - chat endpoint (#7900 ) Adds structured outputs to chat endpoint --------- Co-authored-by: Michael Yang <mxyng@pm.me> Co-authored-by: Hieu Nguyen <hieunguyen1053@outlook.com>	2024-12-04 16:31:19 -08:00
Sam	1bdab9fdb1	llm: introduce k/v context quantization (vRAM improvements) (#6279 )	2024-12-03 15:57:19 -08:00
Bruce MacDonald	a210ec74d2	cmd: print location of model after pushing (#7695 ) After a user pushes their model it is not clear what to do next. Add a link to the output of `ollama push` that tells the user where their model can now be found.	2024-11-25 09:40:16 -08:00
Bruce MacDonald	7b5585b9cb	server: remove out of date anonymous access check (#7785 ) In the past the ollama.com server would return a JWT that contained information about the user being authenticated. This was used to return different error messages to the user. This is no longer possible since the token used to authenticate does not contain information about the user anymore. Removing this code that no longer works. Follow up changes will improve the error messages returned here, but good to clean up first.	2024-11-22 11:57:35 -08:00
Daniel Hiltgen	d88972ea48	Be quiet when redirecting output (#7360 ) This avoids emitting the progress indicators to stderr, and the interactive prompts to the output file or pipe. Running "ollama run model > out.txt" now exits immediately, and "echo hello | ollama run model > out.txt" produces zero stderr output and a typical response in out.txt	2024-11-22 08:04:54 -08:00
Blake Mizerany	67691e410d	cmd: preserve exact bytes when displaying template/system layers (#7586 )	2024-11-13 23:53:30 -08:00
Daniel Hiltgen	35ec7f079f	Fix unicode output on windows with redirect to file (#7358 ) If we're not writing out to a terminal, avoid setting the console mode on windows, which corrupts the output file.	2024-10-25 13:43:16 -07:00
Patrick Devine	d78fb62056	default to "FROM ." if a Modelfile isn't present (#7250 )	2024-10-22 13:32:24 -07:00
Patrick Devine	c7cb0f0602	image processing for llama3.2 (#6963 ) Co-authored-by: jmorganca <jmorganca@gmail.com> Co-authored-by: Michael Yang <mxyng@pm.me> Co-authored-by: Jesse Gross <jesse@ollama.com>	2024-10-18 16:12:35 -07:00
Alex Mavrogiannis	f40bb398f6	Stop model before deletion if loaded (fixed #6957 ) (#7050 )	2024-10-01 15:45:43 -07:00
Patrick Devine	abed273de3	add "stop" command (#6739 )	2024-09-11 16:36:21 -07:00
Michael Yang	ecab6f1cc5	refactor show output fixes line wrapping on long texts	2024-09-11 14:23:09 -07:00
Daniel Hiltgen	6719097649	llm: make load time stall duration configurable via OLLAMA_LOAD_TIMEOUT With the new very large parameter models, some users are willing to wait for a very long time for models to load.	2024-09-05 14:00:08 -07:00
Daniel Hiltgen	b05c9e83d9	Introduce GPU Overhead env var (#5922 ) Provide a mechanism for users to set aside an amount of VRAM on each GPU to make room for other applications they want to start after Ollama, or workaround memory prediction bugs	2024-09-05 13:46:35 -07:00
Vimal Kumar	5f7b4a5e30	fix(cmd): show info may have nil ModelInfo (#6579 )	2024-08-31 21:12:17 -07:00
Patrick Devine	0c819e167b	convert safetensor adapters into GGUF (#6327 )	2024-08-23 11:29:56 -07:00
Michael Yang	beb49eef65	create bert models from cli	2024-08-20 17:27:34 -07:00

285 Commits