109 Commits

Author SHA1 Message Date
Parth Sareen
630e7dc6ff
api: structured outputs - chat endpoint (#7900)
Adds structured outputs to chat endpoint
---------

Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Hieu Nguyen <hieunguyen1053@outlook.com>
2024-12-04 16:31:19 -08:00
Sam
1bdab9fdb1
llm: introduce k/v context quantization (vRAM improvements) (#6279) 2024-12-03 15:57:19 -08:00
Jeffrey Morgan
39e29ae5dd
llama: fix typo and formatting in readme (#7876) 2024-11-28 17:27:11 -08:00
ItzCrazyKns
e3936d4fb3
Support Multiple LoRa Adapters (#7667)
Closes #7627
2024-11-27 11:00:04 -08:00
Jesse Gross
71e6a0d0d1 runner.go: Don't try to extract image tags for text models
When processing a prompt, we look for image tags of the form
[img-0], which are inserted by the Ollama server process.
However, this can cause errors if the original prompt has these
tags - typically an image not found error is returned.

This changes tag searching behavior to be similar to the 0.3.x
series, which will largely avoid these problems. However, they can
still happen when input text with these tags is used with image
models. The correct solution is to escape the tags but this is a
larger issue with special sequences in general so this is an
incremental fix that should avoid the problem for the majority
of cases.
2024-11-26 13:23:24 -08:00
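
A minimal Go sketch of the behavior described above, assuming a hypothetical promptPieces helper and a multimodal flag (the real runner also records which image index belongs between the pieces):

```go
package main

import (
	"fmt"
	"regexp"
)

// imageTagPattern matches the server-inserted markers such as "[img-0]".
var imageTagPattern = regexp.MustCompile(`\[img-(\d+)\]`)

// promptPieces only scans for image tags when the loaded model actually
// supports images; for text-only models the prompt passes through untouched,
// so literal "[img-0]" text typed by a user can no longer trigger an
// "image not found" error.
func promptPieces(prompt string, multimodal bool) []string {
	if !multimodal {
		return []string{prompt}
	}
	return imageTagPattern.Split(prompt, -1)
}

func main() {
	fmt.Println(promptPieces("describe [img-0] please", false)) // passed through unchanged
	fmt.Println(promptPieces("describe [img-0] please", true))  // split around the tag
}
```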
Jesse Gross
2cd11ae365 runner.go: Add unit tests for context shifting
This also makes it easier to truncate long inputs the same as
shifting but does not actually implement it. This type of
truncation has a trade off between quality and time to first
token.
2024-11-26 11:21:35 -08:00
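
A sketch of the kind of table-driven test this commit adds, assuming a hypothetical shiftContext helper that keeps the first numKeep entries and drops the oldest half of the rest (this would live in a _test.go file and run with go test):

```go
package main

import (
	"reflect"
	"testing"
)

// shiftContext frees space by keeping the first numKeep tokens and discarding
// the oldest half of what follows; truncating an over-long input is the same
// operation applied before decoding starts. Illustrative helper only.
func shiftContext(cache []int, numKeep int) []int {
	discard := (len(cache) - numKeep) / 2
	out := make([]int, 0, len(cache)-discard)
	out = append(out, cache[:numKeep]...)
	return append(out, cache[numKeep+discard:]...)
}

func TestShiftContext(t *testing.T) {
	tests := []struct {
		name    string
		cache   []int
		numKeep int
		want    []int
	}{
		{"shift", []int{1, 2, 3, 4, 5, 6, 7, 8}, 2, []int{1, 2, 6, 7, 8}},
		{"nothing to discard", []int{1, 2}, 2, []int{1, 2}},
	}
	for _, tt := range tests {
		if got := shiftContext(tt.cache, tt.numKeep); !reflect.DeepEqual(got, tt.want) {
			t.Errorf("%s: got %v, want %v", tt.name, got, tt.want)
		}
	}
}
```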
Jesse Gross
3478b2cf14 runner.go: Fix deadlock with many concurrent requests
If there are no available slots for new sequences then a request
will not be added to the processing queue but will continue on
to wait for a response that never comes. Besides never giving a
response to the request, this prevents the model from being
unloaded due to the outstanding request.

To prevent this, there are semaphores that prevent more requests
from being processed than there are slots - one in the Ollama
server and one in the runner.
 - The Ollama server's semaphore works, but it is not designed to protect
   the runner's internal data structures, and the runner can return a
   final response before clearing those structures.
 - The runner's internal semaphore has similar behavior: it can be
   released when a response is issued. This is wrong - it should only be
   released after the data structures have been cleared.

In addition, we should return an error if a slot is not found
rather than deadlocking in the event we ever get to this spot.

Fixes #7779
2024-11-22 16:14:51 -08:00
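
A minimal Go sketch of the release ordering described above, using golang.org/x/sync/semaphore; the types and findFreeSlot helper are hypothetical stand-ins for the runner's real structures:

```go
package main

import (
	"context"
	"errors"

	"golang.org/x/sync/semaphore"
)

type Sequence struct{}

type Server struct {
	seqsSem *semaphore.Weighted // capacity equals the number of slots
	seqs    []*Sequence         // one entry per slot; nil means free
}

var errNoSlot = errors.New("no available sequence slot")

func (s *Server) run(ctx context.Context, seq *Sequence) error {
	// Acquire a slot before queueing the sequence.
	if err := s.seqsSem.Acquire(ctx, 1); err != nil {
		return err
	}
	// Release only after the slot's data structures are cleared, not when the
	// final response is written, so a new request can't race into a dirty slot.
	defer s.seqsSem.Release(1)

	i := s.findFreeSlot()
	if i < 0 {
		// Surface an error instead of waiting for a response that never comes.
		return errNoSlot
	}
	s.seqs[i] = seq
	defer func() { s.seqs[i] = nil }() // runs before Release (deferred LIFO)

	// ... decode loop would go here ...
	return nil
}

func (s *Server) findFreeSlot() int {
	for i, seq := range s.seqs {
		if seq == nil {
			return i
		}
	}
	return -1
}

func main() {}
```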
Daniel Hiltgen
b85520bfb9
logs: explain client aborts better (#7783)
Users get confused by "Failed to acquire semaphore" error="context canceled"
messages in the logs, which are actually clients giving up.  While there could be
a legitimate hang bug in the system, sometimes this is just short client timeouts
with an overloaded system, so this should help users understand what's going on
better.
2024-11-22 08:05:32 -08:00
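
A small Go sketch of the logging change described above, assuming the scheduler sees context.Canceled when a client gives up; the function name and message wording are illustrative:

```go
package main

import (
	"context"
	"errors"
	"log/slog"
)

// logAcquireError translates a canceled context into a log line explaining
// that the client gave up, instead of the confusing
// `Failed to acquire semaphore ... context canceled` message.
func logAcquireError(err error) {
	if errors.Is(err, context.Canceled) {
		slog.Info("aborting request in queue, client closed the connection or timed out")
		return
	}
	slog.Error("failed to acquire slot for request", "error", err)
}

func main() {
	logAcquireError(context.Canceled)
}
```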
boessu
1a742f54c9
readme: update AMD ROCm links (#7213) 2024-11-20 23:48:55 -08:00
Jesse Gross
c4b34f2a2a runner.go: Truncate inputs that exceed context rather than shifting
Previous versions of the runner would truncate inputs to the context
window before beginning processing. The main processing loop relied
on this behavior if the context needed to be shifted later (due to
token generation). If truncation did not occur then invariants
would be broken, causing crashes or infinite loops.

Later versions attempted to fix these bugs and make the logic less
subtle so that all inputs could be handled. Truncation was removed
to make things consistent.

However, truncation is much faster than processing and shifting, so
removing it caused performance problems when the input vastly exceeded
the context size. This restores the input truncation as a performance
optimization while keeping the more robust processing logic.

Fixes #7762
2024-11-20 12:49:24 -08:00
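
A minimal sketch of the up-front truncation being restored, assuming a hypothetical truncateInput helper that keeps a numKeep prefix (e.g. BOS and system prompt) plus the most recent tail; the real runner's selection logic may differ:

```go
package main

import "fmt"

// truncateInput shrinks an over-long input to the context window before any
// processing happens, which is much cheaper than decoding and shifting.
func truncateInput(inputs []int, numCtx, numKeep int) []int {
	if len(inputs) <= numCtx {
		return inputs
	}
	out := make([]int, 0, numCtx)
	out = append(out, inputs[:numKeep]...)                    // keep the head
	out = append(out, inputs[len(inputs)-(numCtx-numKeep):]...) // fill with the most recent tokens
	return out
}

func main() {
	fmt.Println(truncateInput([]int{1, 2, 3, 4, 5, 6, 7, 8}, 5, 2)) // [1 2 6 7 8]
}
```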
Jesse Gross
c3ff916431 runner.go: Don't add inputs to cache view until actually processed
We need to track which tokens are in the cache ourselves. We currently
add tokens to the cache tracker when we add them to a batch but they are
not actually in the cache until we call Decode. This can cause
confusion when we are shifting the cache.

Avoids "could not find a KV slot for the batch" issues.

Bug #7545
2024-11-20 12:49:24 -08:00
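
A sketch of tracking the cache view separately from the batch, with tokens staged as pending and only committed after Decode succeeds; the type and method names are illustrative, not the runner's real API:

```go
package main

type InputCache struct {
	Inputs  []int // tokens known to be in the KV cache
	pending []int // tokens in the current batch, not yet decoded
}

// StageForBatch records tokens that have been added to a batch but are not
// yet in the KV cache.
func (c *InputCache) StageForBatch(tokens []int) {
	c.pending = append(c.pending, tokens...)
}

// CommitBatch is called only after a successful Decode, so shifting logic
// never sees tokens that were batched but never made it into the cache.
func (c *InputCache) CommitBatch() {
	c.Inputs = append(c.Inputs, c.pending...)
	c.pending = c.pending[:0]
}

func main() {}
```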
Jesse Gross
3fc1dc0e6f runner.go: Hard fail on errors rather than potentially infinite looping
We try to recover from errors by dropping the tokens that caused the
problem and re-trying. However, dropping the tokens is not correct
and continuing often leads to infinite loops. To avoid this, we
end the sequence if such a condition is detected, which is also
surprising.

At this point, it is better to just report the error. This will make
it easier to find problems and the alternatives are perhaps even more
surprising to users.

This is not a very satisfactory solution either - we should isolate
the error and return it to the user without killing the whole process.
However, this is an incremental step and consistent with most other
failures (which either manifest as abort() or panic).
2024-11-20 12:49:24 -08:00
Jesse Gross
7121dfa309 runner.go: Retry decoding after defragmentation if needed
Fragmentation of the KV cache can occur due to cache shifting or
different sequences getting processed. Decode uses a heuristic to
decide if it should defrag. However, this heuristic isn't 100%
accurate, so decoding can sometimes fail by surprise.

For these cases, if decode indicates that there is no KV cache space,
we should defrag and then try again.
2024-11-20 12:49:24 -08:00
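
A Go sketch of the retry shape described above; the context type and its decode/defrag methods are stand-ins for the llama.cpp bindings, where the error corresponds to llama_decode reporting that no KV slot was found:

```go
package main

import (
	"errors"
	"fmt"
)

var errNoKVSlot = errors.New("could not find a KV slot for the batch")

type context struct{ fragmented bool }

func (c *context) decode() error {
	if c.fragmented {
		return errNoKVSlot
	}
	return nil
}

func (c *context) defrag() { c.fragmented = false }

// decodeWithRetry defragments the KV cache and retries once when decode says
// no slot was available, covering the cases the built-in heuristic misses.
func decodeWithRetry(c *context) error {
	err := c.decode()
	if errors.Is(err, errNoKVSlot) {
		c.defrag()
		err = c.decode()
	}
	return err
}

func main() {
	fmt.Println(decodeWithRetry(&context{fragmented: true})) // <nil>
}
```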
Jesse Gross
5f68fcab12 runner.go: Use correct index when retrieving embedding results
This doesn't have any impact currently because NUM_PARALLEL is forced
to 1 for embeddings, so both indices will always be 0.
2024-11-20 12:49:24 -08:00
Gabe Goodhart
807ace5b1f fix(runner): Set logits to 0 if false on Batch.Add
https://github.com/ollama/ollama/issues/7656
Branch: Granite3StoppingBug-7656

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-11-19 15:45:37 -08:00
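
A simplified Go picture of the fix: the logits array is reused across batches, so every slot must be written on Add, including an explicit 0 when logits are not requested. Field names are illustrative, not the real binding's layout:

```go
package main

type Batch struct {
	n      int
	tokens [512]int32
	logits [512]int8 // reused across batches, so every slot must be written
}

func (b *Batch) Add(token int32, logits bool) {
	b.tokens[b.n] = token
	if logits {
		b.logits[b.n] = 1
	} else {
		b.logits[b.n] = 0 // the fix: previously left untouched when logits was false
	}
	b.n++
}

func main() {
	var b Batch
	b.Add(42, false)
}
```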
Jesse Gross
d875e99e46 runner.go: Propagate panics back to the user.
This is a partial revert of 8a35bb92
"runner.go: Increase survivability of main processing loop", removing
the panic handler.

Although we want to avoid errors taking down the runner, we also
should make the user aware of problems when they happen. In the
future, we can restructure things so both parts are true.
2024-11-15 11:52:25 -08:00
Jesse Gross
8a35bb926e runner.go: Increase survivability of main processing loop
Currently, if an error occurs during the prep stages (such as
tokenizing) of a single request, it will only affect that request.
However, if an error happens during decoding, it can take down the
entire runner.

Instead, it's better to drop the tokens that triggered the error and try to
keep going. However, we also need to stop when we run out of tokens,
otherwise, this just causes an infinite loop. This is likely the cause
of at least some of the hanging issues that have been reported.

Bug #7573
2024-11-14 17:18:41 -08:00
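
A sketch of the recover wrapper this commit adds around each iteration of the main loop (and which the later commit above removes again so failures reach the user); the function and its argument are illustrative:

```go
package main

import (
	"fmt"
	"log/slog"
)

// processBatch runs one unit of work with a recover handler, so a panic
// triggered by a single bad batch is logged instead of killing the runner.
func processBatch(work func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("processing panicked: %v", r)
			slog.Error("recovered from panic in main loop", "error", err)
		}
	}()
	work()
	return nil
}

func main() {
	_ = processBatch(func() { panic("bad batch") })
}
```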
Jesse Gross
c25ffde91d runner.go: Don't trim whitespace from inputs
It's possible to get prompts that consist entirely of whitespace -
this is most likely to happen when generating embeddings. Currently,
we will trim this away, leaving an empty prompt, which will then
generate an error.

Generating embeddings from whitespace should not trigger an error,
as this may break pipelines. It's better to just leave the whitespace
in place and process what we are given. This is consistent with
past versions of Ollama.

Bug #7578
2024-11-14 11:23:06 -08:00
Jesse Gross
17b386a891 runner.go: Enforce NUM_PARALLEL directly in the runner
NUM_PARALLEL is currently enforced by the Ollama server process - it
will only issue requests to the runner if the maximum number of
concurrent requests has not been exceeded. Although this should
be sufficient, it is good for the runner to protect its own data
structures. Currently, if too many requests get through to the
runner, they will just get stuck and never return.

This may help with reports of Ollama hanging, though it is unclear
how it would actually occur.

Bug #7573
2024-11-14 11:21:59 -08:00
Michael Yang
549c2bdfcf
Merge pull request #7657 from ollama/mxyng/sync
fix(mllama): sync backend between batches
2024-11-14 09:40:04 -08:00
Michael Yang
5b3393b6a2 fix(mllama): sync backend between batches 2024-11-13 16:37:21 -08:00
Jesse Gross
d7eb05b936 runner.go: Fix off-by-one for num predicted 2024-11-12 11:35:57 -08:00
Daniel Hiltgen
df011054fa
Jetpack support for Go server (#7217)
This adds support for the Jetson JetPack variants into the Go runner
2024-11-12 10:31:52 -08:00
Jesse Gross
65973ceb64 runner.go: Make KV entry accounting more robust
The structure of the accounting for KV cache shifting was carried
over from the old runner but it now doesn't feel natural with the new
runner. There are a number of invariants that should hold true but
are difficult to reason about. There is at least one bug report
that would imply that the invariants are not holding.

This reduces the number of implicit assumptions and is more forgiving
of unexpected situations. It also improves behavior around which input
tokens are kept when truncation occurs.

Bug #7545
2024-11-11 20:23:03 -08:00
Jesse Gross
c2e8cbaa14 runner.go: Check for zero length images
If we get a request with a zero length image, it will result in
an out-of-bounds error when we pass the data to the image encoder.
2024-11-08 09:39:32 -08:00
Daniel Hiltgen
1618700c5a
Workaround buggy P2P ROCm copy on windows (#7466)
This enables the workaround code only for windows, which should help windows users with multiple AMD GPUs
2024-11-07 14:26:31 -08:00
Daniel Hiltgen
9e83e550e1
Align rocm compiler flags (#7467)
Bring consistency with the old generate script behavior
2024-11-07 10:20:50 -08:00
Daniel Hiltgen
fc2a0715df
Be explicit for gpu library link dir (#7560)
On linux, nvcc doesn't automatically link to the same cuda version.
2024-11-07 09:20:40 -08:00
Jesse Gross
a909417602 runner.go: Remove unused arguments
Now that server.cpp is gone, we don't need to keep passing arguments
that were ignored and kept only for compatibility.
2024-11-06 13:32:18 -08:00
Jesse Gross
312d9de1d1 llama: Improve error handling
Check for NULL return values from llama.cpp in more places and
convert them into Go errors, which should make debugging easier
in the future rather than having hidden surprises in our data
structures.
2024-11-02 13:37:55 -07:00
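
A hedged cgo sketch of the pattern described above; make_context is a stand-in for a llama.cpp call that returns NULL on failure, and a real wrapper would also free or close the returned handle:

```go
package main

/*
#include <stdlib.h>
// Stand-in for a llama.cpp call that returns NULL on failure.
static void *make_context(int fail) { return fail ? NULL : malloc(1); }
*/
import "C"

import (
	"errors"
	"fmt"
	"unsafe"
)

// newContext converts a NULL return from the C side into a Go error at the
// call site, instead of storing a nil pointer and crashing later.
func newContext(fail bool) (unsafe.Pointer, error) {
	f := C.int(0)
	if fail {
		f = 1
	}
	p := C.make_context(f)
	if p == nil {
		return nil, errors.New("llama: failed to create context")
	}
	return p, nil
}

func main() {
	_, err := newContext(true)
	fmt.Println(err) // llama: failed to create context
}
```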
Jesse Gross
a103dae01e runner.go: Only allocate 1 element embedding batches for mllama
Mllama has large embeddings (100 MB per image) and each embedding is
represented as 1 token when passed to llama.cpp. Batches are pre-
allocated for the size of the tokens times the batch size, so this
results in allocations of over 50 GB at the default batch size.
On some systems, these mallocs will fail.

Since an image is represented as a single token and mllama doesn't
support more than 1 image per request, we only need to allocate a
batch size of 1, which is much more reasonable. In addition, for
non-multimodal models, we don't need to allocate the embedding
batches at all.

Fixes #7464
2024-11-02 13:37:55 -07:00
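
A small Go sketch of the sizing decision described above; the helper and its parameters are hypothetical, and the real allocation happens when the batch is constructed in the bindings:

```go
package main

import "fmt"

// embeddingBatchSize picks how many embedding entries to pre-allocate per
// batch: zero for text-only models, one for mllama-style models (a single
// ~100 MB image embedding per request, represented as one token), and the
// full batch size otherwise.
func embeddingBatchSize(batchSize int, isMultimodal, singleImagePerRequest bool) int {
	if !isMultimodal {
		return 0
	}
	if singleImagePerRequest {
		return 1
	}
	return batchSize
}

func main() {
	fmt.Println(embeddingBatchSize(512, true, true)) // 1 for mllama-style models
}
```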
Jesse Gross
26acdcf44e runner.go: Don't set cross attention before sending embeddings
Currently if an input has embeddings at any point then we will set
cross attention to true from the beginning. This means that any
tokens before the embeddings are sent will incorrectly have cross
attention layers applied.

This only sets cross attention when we have an embedding, either
previously in this sequence or in the cache. It also makes cross
attention capable of supporting parallelism at the runner level,
though the mllama implementation doesn't support that yet.
2024-10-31 13:56:08 -07:00
Jesse Gross
c826e57475 runner.go: Better abstract vision model integration
- Update mllama to take the cross attention state as embeddings in
  a batch, more similar to how Llava handles it. This improves
  integration with the input cache.
- Pass locations in a prompt for embeddings using tags similar to Llava.
- Abstract interface to vision models so the main runner accesses Clip
  and Mllama similarly.

Co-authored-by: Michael Yang <mxyng@pm.me>
2024-10-30 14:53:43 -07:00
Daniel Hiltgen
712e99d477
Soften windows clang requirement (#7428)
This will no longer error if built with regular gcc on windows. To help
triage issues related to different compilers, the runner now
reports the compiler used by cgo.
2024-10-30 12:28:36 -07:00
Daniel Hiltgen
b754f5a6a3
Remove submodule and shift to Go server - 0.4.0 (#7157)
* Remove llama.cpp submodule and shift new build to top

* CI: install msys and clang gcc on win

Needed for deepseek to work properly on windows
2024-10-30 10:34:28 -07:00
Daniel Hiltgen
c9ca386131
Switch windows to clang (#7407)
* Switch over to clang for deepseek on windows

The patch for deepseek requires clang on windows. gcc on windows
has a buggy c++ library and can't handle the unicode characters

* Fail fast with wrong compiler on windows

Avoid users mistakenly building with GCC when we need clang
2024-10-29 13:15:04 -07:00
Jesse Gross
de1557a0dc runner.go: Better handle NULL return values from llama.cpp
Llama.cpp sometimes returns NULL to report an error. We should
error. We should explicitly check for this and convert it to a Go
error rather than putting NULL in our data structures and waiting
for it to blow up later.
2024-10-28 18:12:29 -07:00
Daniel Hiltgen
abd5dfd06a
Bump to latest Go 1.22 patch (#7379) 2024-10-26 17:03:37 -07:00
Daniel Hiltgen
099f7077a1
Fix deepseek deseret regex (#7369)
On windows, when compiled with gcc, the c++ regex library failed to handle
these characters
2024-10-26 14:58:54 -07:00
Daniel Hiltgen
5231ae52d9
Fix incremental build file deps (#7361)
The common src/hdr defs should be in the common definitions, not gpu specific.
2024-10-25 11:50:45 -07:00
Daniel Hiltgen
3085c47bea
Improve dependency gathering logic (#7345)
This unifies the rocm/cuda dependency logic into the makefile
and fixes a missing define which broke windows rocm
2024-10-24 09:51:53 -07:00
Daniel Hiltgen
5c44461ccf
Fix rocm windows build and clean up dependency gathering (#7305)
On windows ensure windows version define is properly set for rocm.
Remove duplicate rocm arch flags.
Resolve wildcards in the targets so parallel builds don't race.
Use readlink to resolve rocm dependencies since wildcards omit libelf.
Keep windows rocm deps aligned with unified packaging model.
2024-10-22 12:54:15 -07:00
Jesse Gross
03e40efa51 runner.go: Merge partial unicode characters before sending
We check for partial unicode characters and accumulate them before
sending. However, when we did send, we still sent each individual piece
separately, leading to broken output. This combines everything into
a single group, which is also more efficient.

This also switches to the built-in check for valid unicode characters,
which is stricter. After this, we should never send back an invalid
sequence.

Fixes #7290
2024-10-22 12:07:51 -07:00
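
A Go sketch of the buffering behavior described above, using the standard library's unicode/utf8 validity check; the pieceBuffer type is illustrative:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// pieceBuffer accumulates decoded pieces and only flushes once the buffered
// bytes form valid UTF-8, returning them as a single string rather than as
// the individual pieces that were held back.
type pieceBuffer struct {
	pending []byte
}

func (b *pieceBuffer) add(piece string) (string, bool) {
	b.pending = append(b.pending, piece...)
	if !utf8.Valid(b.pending) {
		return "", false // wait for the rest of a multi-byte character
	}
	out := string(b.pending)
	b.pending = b.pending[:0]
	return out, true
}

func main() {
	var b pieceBuffer
	fmt.Println(b.add("\xe4\xb8")) // partial character: held back
	fmt.Println(b.add("\xad"))     // completes "中", flushed as one piece
}
```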
Patrick Devine
c7cb0f0602
image processing for llama3.2 (#6963)
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Jesse Gross <jesse@ollama.com>
2024-10-18 16:12:35 -07:00
Daniel Hiltgen
bf4018b9ec
llama: Decouple patching script from submodule (#7139)
* Refine llama.cpp vendoring workflow tools

Switch from the sync.sh over to make based tooling

* Run new make sync and patch flow
2024-10-17 15:03:09 -07:00
Daniel Hiltgen
f86d00cd95
llama: add compiler tags for cpu features (#7137)
This adds the ability to customize the default runner with user specified flags
2024-10-17 13:43:20 -07:00
Gabe Goodhart
f2890a4494
IBM granite/granitemoe architecture support (#6760)
* fix(ext_server): Port llama.cpp sampling refactors to ext_server

This was a fairly large changeset. I closely followed the changes here:
df270ef745

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Bump llama.cpp to the latest master with `granite` support

This does not yet have granite MoE support, but that can come in a
follow up PR

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(solar): Update solar patch for llama.cpp bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump llama.cpp for granitemoe support

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(solar): Update the solar-pro patch for latest llama.cpp bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama.cpp): Bump to the latest master of llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(patches): Update all patches for latest bump

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama): Always run sync.sh from the right directory

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/patches): Update llama patches

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(llama)!: Rough sync with llama.cpp submodule

There are a number of changes that will need to be propagated to llama.go
before any of this works!

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/patches): Add a patch and update for missing ggml-impl.h include

This include is where the ggml_cgraph struct is defined. It is included in
many of the .c files to define the forward declaration in ggml.h. It seems
that with the subset of code included here, the import was somehow lost (or
out-of-order) when building, so adding this include to llama.cpp fixes the
missing definition.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Add missing log.cpp

This was added as part of the logging overhaul done in llama.cpp

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Overhaul use of sampling module for llama.cpp changes

The changes here reflect the changes made in the big llama.cpp sampling PR
https://github.com/ggerganov/llama.cpp/pull/9294

The sampling functionality is now broken into the base interface
(llama_sampler) and the generation implementation (gpt_sampler). The
changes here reflect that. Since the sampling.h/sampling.cpp code uses c++
STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow go to
access a pure-C interface.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Fix the impl of SampleTokenGreedy for new sampling

I don't think this method is currently used, so it could probably just be
removed so that all sampling goes through the GPT interface, but in the
interest of doing no harm, this should keep the method working as expected.

Branch: IBMGraniteArchitectureSupport

* fix(llama): Remove unused SampleTokenGreedy

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(sync): Remove bash-specific change to sync.sh

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* chore(gofumpt): Format on llama.go to pass linting

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llm): Fix missing <thread> include in ext_server

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Remove TODO about grammar_first

This feature was not used/needed previously so should be fine without
plumbing it through now.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Better naming for sampling wrapper and args

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Fix patch 05 to use new wrapper api and re-sync

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* runner: Flush pending responses before returning

If there are any pending responses (such as from potential stop
tokens) then we should send them back before ending the sequence.
Otherwise, we can be missing tokens at the end of a response.

Fixes #6707

* fix(llama/sampling): Use gpt_sampler with a forward declaration

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llama): Remove unnecessary patch for gguf impl header

This was caused by an earlier mistake in the embeddings patch that was
dereferencing the pointer instead of using the wrapper API.

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(llm): Remove use of deprecated --log-disable flag

Branch: IBMGraniteArchitectureSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2024-10-17 11:59:52 -07:00
Daniel Hiltgen
7d6eb0d4c3
Move macos v11 support flags to build script (#7203)
Having v11 support hard-coded into the cgo settings causes warnings
for newer Xcode versions.  This should help keep the build clean for users
building from source with the latest tools, while still allowing us to target
the older OS via our CI processes.
2024-10-16 12:49:46 -07:00
Daniel Hiltgen
5dd0477fd4
Fix regression on older macos versions (#7192)
The new cgo compilation requires a flag to target older macos versions
2024-10-13 10:47:42 -07:00
Jesse Gross
0077e22d52 runner.go: Handle truncation of tokens for stop sequences
When a single token contains both text to be returned and a stop
sequence, this causes an out of bounds error when we update the
cache to match our text. This is because we currently assume that
removing the stop sequence will consume at least one token.

This also inverts the logic to deal with positive numbers, rather
than a value to be subtracted, which is easier to reason about.

Fixes #7153
2024-10-09 20:39:04 -07:00
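
A Go sketch of the positive-count approach described above, assuming a hypothetical truncateStop helper; the real runner works on token pieces rather than a single string:

```go
package main

import (
	"fmt"
	"strings"
)

// truncateStop cuts pending output at the stop sequence and reports how many
// trailing characters were removed as a positive count, which the caller then
// trims from the cache; nothing assumes a whole token was consumed.
func truncateStop(pending, stop string) (string, int) {
	i := strings.Index(pending, stop)
	if i < 0 {
		return pending, 0
	}
	return pending[:i], len(pending) - i
}

func main() {
	out, trimmed := truncateStop("Hello<|end|> extra", "<|end|>")
	fmt.Println(out, trimmed) // "Hello" 13
}
```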