3963 Commits

Author SHA1 Message Date
Daniel Hiltgen
e91ae3d47d
Update ROCm (6.3 linux, 6.2 windows) and CUDA v12.8 (#9304)
* Bump cuda and rocm versions

Update ROCm to linux:6.3 win:6.2 and CUDA v12 to 12.8.
Yum has some silent failure modes, so largely switch to dnf.

* Fix windows build script
2025-02-25 13:47:36 -08:00
José Pekkarinen
6ecd7f64ba
docker: upgrade rocm to 6.3.3 (#8211)
centos-7 images have been deprecated upstream and replaced with
almalinux-8 images, which requires some small extra work.

Signed-off-by: José Pekkarinen <jose.pekkarinen@foxhound.fi>
2025-02-25 13:38:08 -08:00
Chuanhui Liu
888855675e
docs: rocm install link (#9346) 2025-02-25 13:15:47 -08:00
Michael Yang
b16367b4b2 fix: add back bf16 support
this was accidentally removed when moving fs/ggml from its previous
location
2025-02-25 19:26:14 +00:00
Pavol Rusnak
a499390648
build: support Compute Capability 5.0, 5.2 and 5.3 for CUDA 12.x (#8567)
CUDA 12.x still supports Compute Capability 5.0, 5.2 and 5.3,
so let's build for these architectures as well
2025-02-25 09:54:19 -08:00
frob
4df98f3eb5
Move cgroups fix out of AMD section. (#9072)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-02-25 08:52:50 -08:00
Blake Mizerany
348b3e0983
server/internal: copy bmizerany/ollama-go to internal package (#9294)
This commit copies (without history) the bmizerany/ollama-go repository
with the intention of integrating it into ollama as a replacement
for pushing and pulling models, and for managing the cache they
are pushed to and pulled from.

New homes for these packages will be determined as they are integrated
and we have a better understanding of proper package boundaries.
2025-02-24 22:39:44 -08:00
Parth Sareen
0b7e1676eb
sample: add sampling package for new engine (#8410) 2025-02-24 17:19:01 -08:00
Parth Sareen
314573bfe8
config: allow setting context length through env var (#8938)
* envconfig: allow setting context length through env var
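
A minimal sketch of how such an override might be read, assuming the
variable is named OLLAMA_CONTEXT_LENGTH and falls back to a default
(names are illustrative, not necessarily the actual envconfig code):

package main

import (
	"fmt"
	"os"
	"strconv"
)

// contextLength returns the context length from the environment,
// falling back to def when the variable is unset or invalid.
// OLLAMA_CONTEXT_LENGTH is assumed here for illustration.
func contextLength(def int) int {
	if v := os.Getenv("OLLAMA_CONTEXT_LENGTH"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return def
}

func main() {
	fmt.Println("context length:", contextLength(2048))
}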
2025-02-24 13:26:35 -08:00
Blake Mizerany
4604b10306
go.mod: bump to go1.24 (#9242) 2025-02-24 13:11:46 -08:00
Jeffrey Morgan
8c13cfa4dd
ml/backend/ggml: fix crash on windows paths with wide characters (#9305) v0.5.12 2025-02-23 19:13:53 -08:00
Jeffrey Morgan
7cfd4aee4d
docs: add additional ROCm docs for building (#9066) 2025-02-22 11:22:59 -08:00
Blake Mizerany
68bac1e0a6
server: group routes by category and purpose (#9270)
The route assembly in Handler lacked clear organization, making it
difficult to scan for routes and their relationships to each other. This
commit aims to fix that by reordering the assembly of routes to group
them by category and purpose.

Also, be more specific about what "config" refers to (it is about CORS
if you were wondering... I was.)
2025-02-21 21:02:26 -08:00
Jesse Gross
f53f4198c3 ml: Abstract attention out of model definitions
There are two benefits to doing this:
 - Provide a library function that models can use, reducing code for
   each model implementation
 - Enables a single place to drop in optimized implementations of
   attention based on the backend or other factors. One is provided for
   GGML.

On CUDA this improves token generation rate by about 3%. It does not
have a significant effect on Metal.
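
A rough Go sketch of the computation such a shared helper centralizes,
single-head scaled dot-product attention; the plain-slice types here are
illustrative and not ollama's actual ml tensor API:

package main

import (
	"fmt"
	"math"
)

// attention computes softmax(Q K^T / sqrt(d)) V for one head.
// A shared helper like this lets every model reuse one implementation
// and gives backends a single place to drop in an optimized version.
func attention(q, k, v [][]float32) [][]float32 {
	d := float64(len(q[0]))
	out := make([][]float32, len(q))
	for i := range q {
		// scores for query i against every key
		scores := make([]float64, len(k))
		max := math.Inf(-1)
		for j := range k {
			var dot float64
			for x := range q[i] {
				dot += float64(q[i][x]) * float64(k[j][x])
			}
			scores[j] = dot / math.Sqrt(d)
			if scores[j] > max {
				max = scores[j]
			}
		}
		// softmax over the scores
		var sum float64
		for j := range scores {
			scores[j] = math.Exp(scores[j] - max)
			sum += scores[j]
		}
		// weighted sum of values
		out[i] = make([]float32, len(v[0]))
		for j := range v {
			w := float32(scores[j] / sum)
			for x := range v[j] {
				out[i][x] += w * v[j][x]
			}
		}
	}
	return out
}

func main() {
	q := [][]float32{{1, 0}, {0, 1}}
	k := [][]float32{{1, 0}, {0, 1}}
	v := [][]float32{{1, 2}, {3, 4}}
	fmt.Println(attention(q, k, v))
}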

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2025-02-21 13:16:21 -08:00
Michael Yang
2192a28eed ml/backend/ggml: fix rms norm 2025-02-21 18:34:19 +00:00
Junyan Qin (Chin)
5d81c1a184
docs: add RockChinQ/LangBot to integrations list (#9272) 2025-02-21 09:36:55 -08:00
Jesse Gross
5c5535c064 models: Prune unused outputs earlier in the forward pass
Currently Rows is called as the last step in a model computation
to get the values for the output tokens. However, if we move it
earlier in the process then we can trim out computations that
never get used. This is similar to how models are defined in
llama.cpp.

Changing the model definition in this way improves token generation
performance by approximately 8%.
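
A hedged sketch of the idea: select only the rows needed for output
tokens before the expensive final projection rather than after (function
names are illustrative, not the actual model code):

package main

import "fmt"

// selectRows keeps only the hidden states whose indices feed the output.
func selectRows(hidden [][]float32, indices []int) [][]float32 {
	out := make([][]float32, 0, len(indices))
	for _, i := range indices {
		out = append(out, hidden[i])
	}
	return out
}

// project stands in for the (comparatively expensive) output projection.
func project(hidden [][]float32) [][]float32 { return hidden }

func main() {
	hidden := [][]float32{{1}, {2}, {3}, {4}}
	outputIndices := []int{3} // only the last token's logits are needed

	// Before: project every row, then pick the ones we need.
	_ = selectRows(project(hidden), outputIndices)

	// After: pick the rows first, so the projection (and any work that
	// depends only on those rows) runs on far fewer tokens.
	fmt.Println(project(selectRows(hidden, outputIndices)))
}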
2025-02-20 14:49:47 -08:00
Jesse Gross
e5bcc51ae1 ggml-backend: Don't recreate the scheduler for each context
We don't need to create and destroy the GGML scheduler for every
context. This introduces extra CPU overhead for every forward
pass and extra memory for contexts that don't actually get scheduled
(for example, KV caches). We can instead just have one scheduler
for the backend and reset it each time we call Compute.

This improves token generation performance by 1-2% and removes
scheduler create/destroy from profile traces.
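
A minimal sketch of the pattern, with hypothetical types standing in for
the GGML scheduler and backend:

package main

import "fmt"

// scheduler stands in for the GGML graph scheduler.
type scheduler struct{ graphs int }

func newScheduler() *scheduler  { return &scheduler{} }
func (s *scheduler) reset()     { s.graphs = 0 }
func (s *scheduler) run() error { s.graphs++; return nil }

// Backend owns a single scheduler for its lifetime instead of
// creating and destroying one per context or forward pass.
type Backend struct{ sched *scheduler }

func NewBackend() *Backend { return &Backend{sched: newScheduler()} }

// Compute resets the shared scheduler and runs the graph.
func (b *Backend) Compute() error {
	b.sched.reset()
	return b.sched.run()
}

func main() {
	b := NewBackend()
	for i := 0; i < 3; i++ {
		if err := b.Compute(); err != nil {
			fmt.Println(err)
		}
	}
	fmt.Println("same scheduler reused across Compute calls")
}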
2025-02-20 14:49:47 -08:00
Jesse Gross
bd6a7d5e64 ollamarunner: Pass runner performance parameters to backends
Currently the following parameters are in the runner but not used:
 - numGPULayers
 - mainGPU
 - threads
 - tensorSplit

This passes them through to the backend, which is where they would
actually get used. However, the GGML backend does not yet do anything
with them.
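
A sketch of passing these through as a parameters struct at backend
construction; field and function names are assumptions for illustration,
not the actual backend API:

package main

import "fmt"

// BackendParams carries runner performance settings through to the
// backend, which is where they can actually take effect.
type BackendParams struct {
	NumGPULayers int       // number of layers to offload to the GPU
	MainGPU      int       // index of the primary GPU
	NumThreads   int       // CPU threads for compute
	TensorSplit  []float32 // fraction of the model per GPU
}

type Backend struct{ params BackendParams }

// New constructs a backend with the runner-provided parameters.
func New(p BackendParams) *Backend { return &Backend{params: p} }

func main() {
	b := New(BackendParams{NumGPULayers: 33, MainGPU: 0, NumThreads: 8, TensorSplit: []float32{0.5, 0.5}})
	fmt.Printf("%+v\n", b.params)
}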
2025-02-20 13:27:57 -08:00
Bruce MacDonald
14b5a9a150
api: document client stream behavior with a test (#8996)
Added unit tests to verify error handling behavior in the Client.stream and Client.do methods.
Tests cover various error scenarios including:
- Error responses with status codes >= 400
- Error messages with successful status codes
- Empty error messages
- Successful responses
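
An illustrative test of the same behavior against a throwaway HTTP
handler; this is not the actual test code, and doRequest is a stand-in
for the client methods under test:

package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// doRequest is a stand-in for a client method: on a non-2xx status it
// decodes {"error": "..."} from the body and returns it as an error.
func doRequest(url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	if resp.StatusCode >= 400 {
		var e struct {
			Error string `json:"error"`
		}
		if json.Unmarshal(body, &e) == nil && e.Error != "" {
			return errors.New(e.Error)
		}
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusBadRequest)
		fmt.Fprint(w, `{"error":"model not found"}`)
	}))
	defer srv.Close()

	fmt.Println(doRequest(srv.URL)) // prints: model not found
}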
2025-02-20 13:19:58 -08:00
Michael Yang
ba9ec3d05e ci: use clang for windows cpu builds
clang outputs are faster. we were previously building with clang via the gcc
wrapper in cgo, but this was missed during the build updates, so there was
a drop in performance
v0.5.12-rc1
2025-02-20 20:22:36 +00:00
frob
7c168b08c9
server: add missing function parens to debug log (#9255) 2025-02-20 12:10:15 -08:00
danielekp
3d4cc7833c
docs: Add yla to community integrations 2025-02-20 11:34:24 -08:00
Lucas Hahn
351a85d9ea
openai: add 'timeout' to allowable x-stainless headers (#9237) v0.5.12-rc0 2025-02-19 21:56:18 -08:00
Michael Yang
bda4ef6c56 reorder patches 2025-02-20 03:49:24 +00:00
Michael Yang
1e438b237c
Merge pull request #9203 from ollama/mxyng/sapphirerapids
build: remove backend build for sapphirerapids
2025-02-19 21:42:00 +00:00
yuiseki
d721a02e7d
test: add test cases for ListHandler (#9146) 2025-02-19 13:24:27 -08:00
zyxucp
778603a818
docs: Add AntSK to Community Integrations (#9214) 2025-02-19 13:22:48 -08:00
maninhill
3c874df46e
docs: Add MaxKB to Community Integrations (#9212) 2025-02-19 13:20:09 -08:00
Jeffrey Morgan
d2eb226c91
llama: add patch to fix ggml backend reg on Linux with utf-8 characters in the path (#9159) 2025-02-18 22:46:17 -05:00
Michael Yang
e13e7c8d94
Merge pull request #9079 from jeremyschlatter/main
cmd: fix flickering in progress bar
2025-02-18 22:59:29 +00:00
Jeremy Schlatter
78f403ff45
address code review comments 2025-02-18 14:50:09 -08:00
Michael Yang
5f8c03189e build: remove backend build for sapphirerapids
sapphire rapids has amx support but it ends up having a negative
performance impact.

emerald rapids also has amx support with a positive performance impact;
however, there's no reasonable way in ggml to differentiate between the
two. the impact is small (~6%), so disable amx entirely for simplicity
2025-02-18 14:47:58 -08:00
Michael Yang
08a299e1d0 cmake: avoid building intel backends on linux 2025-02-18 22:17:00 +00:00
Michael Yang
7b5d916a9a ci: set owner/group in tarball
set owner and group when building the linux tarball so extracted files
are consistent. this is the behaviour of release tarballs in version
0.5.7 and lower
2025-02-18 20:11:09 +00:00
benhaotang
33ad61b112
Add OpenDeepResearcher-via-searxng to Community Integrations (#9138) 2025-02-18 11:39:11 -08:00
L. Jiang
716e365615
test: add test cases for HumanNumber (#9108) 2025-02-18 11:35:26 -08:00
innightwolfsleep
3b4424ff98
readme: add LLM Telegram Bot to community integrations (#9150) 2025-02-18 10:04:30 -05:00
Jeremy Schlatter
f9c7ead160
cmd: eliminate flickering with synchronized output 2025-02-17 20:01:03 -08:00
Jeremy Schlatter
5930aaeb1a
cmd: fix cursor flickering in progress bar
The previous commit fixed flickering in the progress bar itself. Cursor
flickering is harder to address.

Cursor flickering could be fixed by hiding the cursor altogether while
the progress bar is displayed. The downside of this is that if the
program is killed in such a way that it can't clean up its state, it
would leave the cursor invisible.

Instead, this commit introduces an output buffer. All of the escape
codes and content for a single progress update are written to a buffer,
which is then flushed to the terminal all at once. This significantly
decreases the time during which the terminal has seen the cursor-hiding
code but has not yet seen the cursor-showing code, thus minimizing (but
not 100% eliminating) cursor flickering.

For more context, see:
https://gitlab.gnome.org/GNOME/vte/-/issues/2837#note_2269501
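
A minimal sketch of the buffering approach, assuming standard ANSI escape
codes (not the actual progress-bar code):

package main

import (
	"bytes"
	"os"
)

// render writes one full progress update to the terminal in a single
// Write call: hide cursor, draw content, show cursor. Buffering keeps
// the window between the hide and show codes as small as possible.
func render(line string) {
	var buf bytes.Buffer
	buf.WriteString("\x1b[?25l") // hide cursor
	buf.WriteString("\r" + line)
	buf.WriteString("\x1b[?25h") // show cursor
	os.Stdout.Write(buf.Bytes())
}

func main() {
	render("downloading: 42%")
}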
2025-02-17 14:56:57 -08:00
Jeremy Schlatter
faf67db089
cmd: fix progress bar flickering
Previous code cleared the display before writing new content, creating a
window where the terminal could (and in some cases did) render empty lines.

Instead, we now write new content over the old content, only clearing
the trailing end of lines for cases where the new line is shorter.
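
A minimal sketch of the overwrite approach, assuming the standard
clear-to-end-of-line escape code (not the actual implementation):

package main

import (
	"fmt"
	"os"
)

// redraw overwrites the previous line in place rather than clearing the
// display first: return to column 0, write the new content, then clear
// only the trailing characters a shorter new line would leave behind.
func redraw(line string) {
	fmt.Fprint(os.Stdout, "\r"+line+"\x1b[K")
}

func main() {
	redraw("pulling manifest... 100%")
	fmt.Println()
}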

Fixes #1664
2025-02-17 13:39:02 -08:00
James-William-Kincaid-III
0667baddc6
docs: fix incorrect shortcut key in windows.md (#9098) 2025-02-15 15:38:24 -05:00
Bruce MacDonald
d006e1e09b
model: document high-level model interface (#9122) 2025-02-14 16:01:00 -08:00
Daniel Hiltgen
df2680b4b9
Wire up system info log for new engine (#9123) 2025-02-14 15:55:33 -08:00
Jesse Gross
010313bb63 llamarunner: Init GGML before printing system info
We currently print system info before the GGML backends are loaded.
This results in only getting information about the default lowest
common denominator runner. If we move up the GGML init then we can
see what we are actually running.

Before:
time=2025-02-14T11:15:07.606-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=24

After:
time=2025-02-14T11:16:02.936-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=24
2025-02-14 11:41:53 -08:00
Jeffrey Morgan
5296f487a8
llm: attempt to evaluate symlinks, but do not fail (#9089)
provides a better approach to #9088 that will attempt to
evaluate symlinks (important for macOS where 'ollama' is
often a symlink), but use the result of os.Executable()
as a fallback in scenarios where filepath.EvalSymlinks
fails due to permission errors or other issues
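
A sketch of that fallback logic using the standard library (this is the
general shape, not necessarily the exact code):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// executablePath prefers the symlink-resolved path (important on macOS,
// where 'ollama' is often a symlink) but falls back to os.Executable()
// if EvalSymlinks fails, e.g. due to permission errors.
func executablePath() (string, error) {
	exe, err := os.Executable()
	if err != nil {
		return "", err
	}
	if resolved, err := filepath.EvalSymlinks(exe); err == nil {
		return resolved, nil
	}
	return exe, nil
}

func main() {
	p, err := executablePath()
	fmt.Println(p, err)
}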
2025-02-13 22:37:59 -08:00
Jeffrey Morgan
f05774b04c
llm: do not evaluate symlink for exe path lookup (#9088)
In some cases, the directories in the executable path read by
filepath.EvalSymlinks are not accessible, resulting in permission
errors when running models. It also doesn't work well with long
paths on Windows, likewise resulting in errors. This change removes
the filepath.EvalSymlinks call when resolving os.Executable()
altogether
2025-02-13 22:13:00 -08:00
Jeffrey Morgan
6600bd7d91
ml/backend/ggml: stable sort devices by score (#9081) 2025-02-13 18:42:36 -08:00
Jesse Gross
ed443a0393 Runner for Ollama engine
This provides integration with the new Ollama engine
(5824541 next ollama runner (#7913)) and the rest of the Ollama
infrastructure such as the runner and Ollama server.

In addition, it also builds out the KV cache infrastructure to
support requirements of how Ollama runs models such as:
 - Parallel processing
 - Memory management for defragmentation and shifting
 - Multi-modal models

Both old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:

Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve

Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1
2025-02-13 17:09:26 -08:00
Jesse Gross
6945617af5 models: Move models into their own directory
This allows the list of models to live in its own file rather than
being mixed into the runner code.
2025-02-13 17:09:26 -08:00