ollama

mirror of https://github.com/ollama/ollama.git synced 2025-07-12 13:35:30 +02:00

Author	SHA1	Message	Date
Blake Mizerany	68bac1e0a6	server: group routes by category and purpose (#9270) The route assembly in Handler lacked clear organization, making it difficult to scan for routes and their relationships to each other. This commit aims to fix that by reordering the assembly of routes to group them by category and purpose. Also, be more specific about what "config" refers to (it is about CORS if you were wondering... I was.)	2025-02-21 21:02:26 -08:00
Jesse Gross	f53f4198c3	ml: Abstract attention out of model definitions There are two benefits to doing this: - Provide a library function that models can use, reducing code for each model implementation - Enables a single place to drop in optimized implementations of attention based on the backend or other factors. One is provided for GGML. On CUDA this improves token generation rate by about 3%. It does not have a significant effect on Metal. Co-authored-by: Daniel Hiltgen <daniel@ollama.com>	2025-02-21 13:16:21 -08:00
Michael Yang	2192a28eed	ml/backend/ggml: fix rms norm	2025-02-21 18:34:19 +00:00
Junyan Qin (Chin)	5d81c1a184	docs: add `RockChinQ/LangBot` to integrations list (#9272)	2025-02-21 09:36:55 -08:00
Jesse Gross	5c5535c064	models: Prune unused outputs earlier in the forward pass Currently Rows is called as the last step in a model computation to get the values for the output tokens. However, if we move it earlier in the process then we can trim out computations that never get used. This is similar to how models are defined in llama.cpp. Changing the model definition in this way improves token generation performance by approximately 8%.	2025-02-20 14:49:47 -08:00
Jesse Gross	e5bcc51ae1	ggml-backend: Don't recreate the scheduler for each context We don't need to create and destroy the GGML scheduler for every context. This introduces extra CPU overhead for every forward pass and extra memory for contexts that don't actually get scheduled (for example, KV caches). We can instead just have one scheduler for the backend and reset it each time we call Compute. This improves token generation performance by 1-2% and removes scheduler create/destroy from profile traces.	2025-02-20 14:49:47 -08:00
Jesse Gross	bd6a7d5e64	ollamarunner: Pass runner performance parameters to backends Currently the following parameters are in the runner but not used: - numGPULayers - mainGPU - threads - tensorSplit This passes them through to the backend, which is where they would actually get used. However, the GGML backend does not yet do anything with them.	2025-02-20 13:27:57 -08:00
Bruce MacDonald	14b5a9a150	api: document client stream behavior with a test (#8996) Added unit tests to verify error handling behavior in the Client.stream and Client.do methods. Tests cover various error scenarios including: - Error responses with status codes >= 400 - Error messages with successful status codes - Empty error messages - Successful responses	2025-02-20 13:19:58 -08:00
Michael Yang	ba9ec3d05e	ci: use clang for windows cpu builds clang outputs are faster. we were previously building with clang via gcc wrapper in cgo but this was missed during the build updates so there was a drop in performance v0.5.12-rc1	2025-02-20 20:22:36 +00:00
frob	7c168b08c9	server: add missing function parens to debug log (#9255)	2025-02-20 12:10:15 -08:00
danielekp	3d4cc7833c	docs: Add yla to community integrations	2025-02-20 11:34:24 -08:00
Lucas Hahn	351a85d9ea	openai: add 'timeout' to allowable x-stainless headers (#9237) v0.5.12-rc0	2025-02-19 21:56:18 -08:00
Michael Yang	bda4ef6c56	reorder patches	2025-02-20 03:49:24 +00:00
Michael Yang	1e438b237c	Merge pull request #9203 from ollama/mxyng/sapphirerapids build: remove backend build for sapphirerapids	2025-02-19 21:42:00 +00:00
yuiseki	d721a02e7d	test: add test cases for ListHandler (#9146)	2025-02-19 13:24:27 -08:00
zyxucp	778603a818	docs: Add AntSK to Community Integrations (#9214)	2025-02-19 13:22:48 -08:00
maninhill	3c874df46e	docs: Add MaxKB to Community Integrations (#9212)	2025-02-19 13:20:09 -08:00
Jeffrey Morgan	d2eb226c91	llama: add patch to fix ggml backend reg on Linux with utf-8 characters in the path (#9159)	2025-02-18 22:46:17 -05:00
Michael Yang	e13e7c8d94	Merge pull request #9079 from jeremyschlatter/main cmd: fix flickering in progress bar	2025-02-18 22:59:29 +00:00
Jeremy Schlatter	78f403ff45	address code review comments	2025-02-18 14:50:09 -08:00
Michael Yang	5f8c03189e	build: remove backend build for sapphirerapids sapphire rapids has amx support but it ends up having a negative performance impact. emerald rapids also has amx support with a positive performance impact however there's no reasonable way in ggml to differentiate between the two. the impact is small (~6%) so disable amx entirely for simplicity	2025-02-18 14:47:58 -08:00
Michael Yang	08a299e1d0	cmake: avoid building intel backends on linux	2025-02-18 22:17:00 +00:00
Michael Yang	7b5d916a9a	ci: set owner/group in tarball set owner and group when building the linux tarball so extracted files are consistent. this is the behaviour of release tarballs in version 0.5.7 and lower	2025-02-18 20:11:09 +00:00
benhaotang	33ad61b112	Add OpenDeepResearcher-via-searxng to Community Integrations (#9138)	2025-02-18 11:39:11 -08:00
L. Jiang	716e365615	test: add test cases for HumanNumber (#9108)	2025-02-18 11:35:26 -08:00
innightwolfsleep	3b4424ff98	readme: add LLM Telegram Bot to community integrations (#9150)	2025-02-18 10:04:30 -05:00
Jeremy Schlatter	f9c7ead160	cmd: eliminate flickering with synchronized output	2025-02-17 20:01:03 -08:00
Jeremy Schlatter	5930aaeb1a	cmd: fix cursor flickering in progress bar The previous commit fixed flickering in the progress bar itself. Cursor flickering is harder to address. Cursor flickering could be fixed by hiding the cursor altogether while the progress bar is displayed. The downside of this is that if the program is killed in such a way that it can't clean up its state, it would leave the cursor invisible. Instead, this commit introduces an output buffer. All of the escape codes and content for a single progress update are written to a buffer, which is then flushed to the terminal all at once. This significantly decreases the time during which the terminal has seen the cursor-hiding code but has not yet seen the cursor-showing code, thus minimizing (but not 100% eliminating) cursor flickering. For more context, see: https://gitlab.gnome.org/GNOME/vte/-/issues/2837#note_2269501	2025-02-17 14:56:57 -08:00
Jeremy Schlatter	faf67db089	cmd: fix progress bar flickering Previous code cleared the display before writing new content, creating a window where the terminal could (and in some cases did) render empty lines. Instead, we now write new content over the old content, only clearing the trailing end of lines for cases where the new line is shorter. Fixes #1664	2025-02-17 13:39:02 -08:00
James-William-Kincaid-III	0667baddc6	docs: fix incorrect shortcut key in windows.md (#9098)	2025-02-15 15:38:24 -05:00
Bruce MacDonald	d006e1e09b	model: document high-level model interface (#9122)	2025-02-14 16:01:00 -08:00
Daniel Hiltgen	df2680b4b9	Wire up system info log for new engine (#9123)	2025-02-14 15:55:33 -08:00
Jesse Gross	010313bb63	llamarunner: Init GGML before printing system info We currently print system info before the GGML backends are loaded. This results in only getting information about the default lowest common denominator runner. If we move up the GGML init then we can see what we are actually running. Before: time=2025-02-14T11:15:07.606-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=24 After: time=2025-02-14T11:16:02.936-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=24	2025-02-14 11:41:53 -08:00
Jeffrey Morgan	5296f487a8	llm: attempt to evaluate symlinks, but do not fail (#9089) provides a better approach to #9088 that will attempt to evaluate symlinks (important for macOS where 'ollama' is often a symlink), but use the result of os.Executable() as a fallback in scenarios where filepath.EvalSymlinks fails due to permission errors or other issues	2025-02-13 22:37:59 -08:00
Jeffrey Morgan	f05774b04c	llm: do not evaluate symlink for exe path lookup (#9088) In some cases, the directories in the executable path read by filepath.EvalSymlinks are not accessible, resulting in permission errors which results in an error when running models. It also doesn't work well on long paths on Windows, also resulting in errors. This change removes filepath.EvalSymlinks when accessing os.Executable() altogether	2025-02-13 22:13:00 -08:00
Jeffrey Morgan	6600bd7d91	ml/backend/ggml: stable sort devices by score (#9081)	2025-02-13 18:42:36 -08:00
Jesse Gross	ed443a0393	Runner for Ollama engine This provides integration with the new Ollama engine (`5824541` next ollama runner (#7913)) and the rest of the Ollama infrastructure such as the runner and Ollama server. In addition, it also builds out the KV cache infrastructure to support requirements of how Ollama runs models such as: - Parallel processing - Memory management for defragmentation and shifting - Multi-modal models Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine: Start the server with the OLLAMA_NEW_ENGINE environment variable set: OLLAMA_NEW_ENGINE=1 ./ollama serve Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M: ./ollama run jessegross/llama3.1	2025-02-13 17:09:26 -08:00
Jesse Gross	6945617af5	models: Move model into their own directory This allows there to be a file that is a list of models that is not mixed into the runner code.	2025-02-13 17:09:26 -08:00
Jesse Gross	7916f55009	vocab: Use int32 for special tokens Special tokens are currently read as uint32 from the model metadata. However, all other parts of the system (including the tokenizer) use int32 to represent tokens so it is impossible to represent the high portion of the unsigned range. For consistency and to avoid casts, we should just use int32 everywhere.	2025-02-13 17:09:26 -08:00
Jesse Gross	d650ad398f	model: Load tensors behind an interface Currently, if a model uses an interface for its data structures (as mllama does) then the tensor data in the structs implementing that interface will not get loaded.	2025-02-13 17:09:26 -08:00
Jesse Gross	d223f3b697	ggml-backend: Close on nil should be a no-op	2025-02-13 17:09:26 -08:00
Jesse Gross	60830695c2	ggml-backend: Ensure data is available after async computation We need to sync before retrieving data after async computation. It is also important to ensure that the Go buffer is not moved by the GC across function calls so we do a synchronous copy.	2025-02-13 17:09:26 -08:00
Jesse Gross	01d9a46854	ggml-backend: Let GGML allocate context memory Passing in a Go buffer is not safe because the garbage collector could free or move the memory while the context is still open. However, if we pass in the size and a nil pointer then GGML will allocate it from the C side.	2025-02-13 17:09:26 -08:00
Jesse Gross	d773b7d671	backend: API to support full precision matmul Most tensor backends try to optimize performance by using a lower precision for matmuls. However, some operations (such as kq) on some models are sensitive to this and require full precision.	2025-02-13 17:09:26 -08:00
Jesse Gross	4d4463b2bd	backend: Support graph computation that does not return an output There are two cases where we may not have an output after computing: - Prompt processing where the length of the input exceeds the batch size - Internal memory management operations such as cache defrag and shift	2025-02-13 17:09:26 -08:00
Jesse Gross	0e38297f87	backend: Consistently use int (vs. int64) for tensor shapes Currently there is a mixture of int and int64 used when dealing with tensor dimensions and shapes, which causes unnecessary conversions - they all should be the same type. In general, most interfaces (such as PyTorch) use int64 for generality but most implementations (such as CUDA) use int32 for performance. There isn't much benefit to us to being more flexible than the implementations we are likely to run on. In addition, as a practical matter, a model with a tensor with a single dimension larger than 32 bits is unlikely to run on a 32-bit machine.	2025-02-13 17:09:26 -08:00
Jesse Gross	7e13f568dc	backend: Don't return an error on Close It is not common to return errors with close/free operations - most people won't check it and even if they did there's probably not much that they can do. It's better to not give implementations false expectations.	2025-02-13 17:09:26 -08:00
Michael Yang	58245413f4	next ollama runner (#7913) feat: add new Ollama engine using ggml through cgo This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this. - `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go` - `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go` - `ml.Tensor` defines the interface for a tensor and tensor operations This is the first implementation of the new engine. Follow up PRs will implement more features: - non-greedy sampling (#8410) - integration with Ollama and KV caching (#8301) - more model support (#9080) with more coming soon Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>	2025-02-13 16:31:21 -08:00
Bùi Đức Nhật	8cf16063a5	docs: add ollamazing to the README.md (#9075)	2025-02-13 10:47:09 -08:00
frob	3a4449e2f1	docs: add H200 as supported device. (#9076) Co-authored-by: Richard Lyons <frob@cloudstaff.com>	2025-02-13 10:44:23 -08:00

4001 Commits