4037 Commits

Author SHA1 Message Date
Michael Yang
b42aba40ed cuda: enable flash attention
ggml added an option to disable flash attention so explicitly enable it
2025-02-28 19:40:34 +00:00
王贺
25885e5335
docs: Add 1Panel to Community Integrations (#9312) 2025-02-28 09:53:03 -08:00
Jeffrey Morgan
98d44fa39d
llama: add phi4 mini support (#9403) v0.5.13-rc1 2025-02-27 19:30:32 -08:00
Blake Mizerany
2099e2d267
CONTRIBUTING: provide clarity on good commit messages, and bad (#9405)
Also, our commit messages have been getting better, but we can do
better, and be more consistent. This adds more clarity on how to write
commit messages and provides examples of good and bad messages.

Also, our contributing guide was lacking helpful guidance on how to
start change proposals. This commit adds the start of that section.

Soon, we should add a proposal template to the issue tracker with a link
back to the proposal section, which should also be expanded upon.
2025-02-27 19:22:26 -08:00
Bruce MacDonald
0c1041ad85
runner: default to greedy sampler for performance (#9407)
As are adding support for weighted sampling we have seen some performance
regressions, bypassing the sampler logic for now and defaulting to greedy
until we can benchmark the new sampler logic.
2025-02-27 16:41:20 -08:00
Parth Sareen
c245b0406f
sample: remove transforms from greedy sampling (#9377) 2025-02-27 15:44:53 -08:00
Michael Yang
8b194b7520 kvcache: update tests 2025-02-27 22:27:16 +00:00
Michael Yang
3e8b8a1933 ml: update Context.Forward interface
update Context.Forward to accept multiple tensors to match
Context.Compute signature

update Context.Forward to return Context such that it can be chained
with Context.Compute
2025-02-27 22:27:16 +00:00
Blake Mizerany
41dc280491
server/internal/registry: implement CloseNotify and Flush (for now) (#9402)
This fixes panics introduced in 2412adf42b8380748ac79476e273f5b337c3b977
when Gin ungracefully assumes that the http.ResponseWriter implements
http.CloseNotifier and http.Flusher, which our new statusCodeRecorder
does not. This is a temporary fix until we can pour the rest of the Gin
out.
2025-02-27 14:00:37 -08:00
Michael Yang
53d2990d9b model: add bos token if configured 2025-02-27 21:04:59 +00:00
Jesse Gross
e185c08ad9 go.mod: Use full version for go 1.24.0
Otherwise on Linux I get:
go: download go1.24 for linux/amd64: toolchain not available
2025-02-27 13:01:32 -08:00
Blake Mizerany
2412adf42b
server/internal: replace model delete API with new registry handler. (#9347)
This commit introduces a new API implementation for handling
interactions with the registry and the local model cache. The new API is
located in server/internal/registry. The package name is "registry" and
should be considered temporary; it is hidden and not bleeding outside of
the server package. As the commits roll in, we'll start consuming more
of the API and then let reverse osmosis take effect, at which point it
will surface closer to the root level packages as much as needed.
2025-02-27 12:04:53 -08:00
Steven Hartland
be2ac1ed93
docs: fix api examples link (#9360)
Fix the examples link in the go package documentation for the API.
2025-02-27 10:51:12 -08:00
Eries Trisnadi
dc13813a03
server: allow vscode-file origins (#9313) 2025-02-27 10:39:43 -08:00
Michael Yang
d6af13efed runner: simplify tensor split parsing 2025-02-27 18:36:46 +00:00
Michael Yang
a59f665235 ml/backend/ggml: fix debug logging 2025-02-27 18:30:57 +00:00
Daniel Hiltgen
688925aca9
Windows ARM build (#9120)
* Windows ARM build

Skip cmake, and note it's unused in the developer docs.

* Win: only check for ninja when we need it

On windows ARM, the cim lookup fails, but we don't need ninja anyway.
v0.5.13-rc0
2025-02-27 09:02:25 -08:00
Blake Mizerany
76e903cf9d
.github/workflows: swap order of go test and golangci-lint (#9389)
The linter is secondary to the tests, so it should run after the tests,
exposing test failures faster.
2025-02-26 23:03:48 -08:00
Jeffrey Morgan
a5272130c4
ml/backend/ggml: follow on fixes after updating vendored code (#9388)
Fixes sync filters and lowers CUDA version to 11.3 in test.yaml
2025-02-26 22:33:53 -08:00
Jeffrey Morgan
d7d7e99662
llama: update llama.cpp vendor code to commit d7cfe1ff (#9356) 2025-02-26 20:34:44 -08:00
Gordon Kamer
2db96c18e7
readme: add Nichey to community integrations (#9370) 2025-02-26 10:40:53 -08:00
Daniel Hiltgen
e12af460ed
Add cuda Blackwell architecture for v12 (#9350)
* Add cuda Blackwell architecture for v12

* Win: Split rocm out to separate zip file

* Reduce CC matrix

The 6.2 and 7.2 architectures only appear on Jetsons, so they were wasting space.
The 5.0 should be forward compatible with 5.2 and 5.3.
2025-02-26 09:20:52 -08:00
Jeffrey Morgan
3ad4bc8afe
llama: removed unused 'vendoring' file (#9351) 2025-02-25 14:33:03 -08:00
Blake Mizerany
0d694793f2
.github: always run tests, and other helpful fixes (#9348)
During work on our new registry client, I ran into frustrations with CI
where a misspelling in a comment caused the linter to fail, which caused
the tests to not run, which caused the build to not be cached, which
caused the next run to be slow, which caused me to be sad.

This commit address these issues, and pulls in some helpful changes
we've had in CI on ollama.com for some time now.

They are:

* Always run tests, even if the other checks fail.

Tests are the most important part of CI, and should always run. Failures
in tests can be correlated with failures in other checks, and can help
surface the root cause of the failure sooner. This is especially
important when the failure is platform specific, and the tests are not
platform independent.

* Check that `go generate` is clean.

This prevents 'go generate' abuse regressions. This codebase used to use
it to generate platform specific binary build artifacts. Let's make sure
that does not happen again and this powerful tool is used correctly, and
the generated code is checked in.

Also, while adding `go generate` the check, it was revealed that the
generated metal code was putting dates in the comments, resulting in
non-deterministic builds. This is a bad practice, and this commit fixes
that. Git tells us the most important date: the commit date along with
other associated changes.

* Check that `go mod tidy` is clean.

A new job to check that `go mod tidy` is clean was added, to prevent
easily preventable merge conflicts or go.mod changes being deferred to a
future PR that is unrelated to the change that caused the go.mod to
change.

* More robust caching.

We now cache the go build cache, and the go mod download cache
independently. This is because the download cache contains zips that can
be unpacked in parallel faster than they can be fetched and extracted by
tar. This speeds up the build significantly.

The linter is hostile enough. It does not need to also punish us with
longer build times due to small failures like misspellings.
2025-02-25 14:28:07 -08:00
Daniel Hiltgen
e91ae3d47d
Update ROCm (6.3 linux, 6.2 windows) and CUDA v12.8 (#9304)
* Bump cuda and rocm versions

Update ROCm to linux:6.3 win:6.2 and CUDA v12 to 12.8.
Yum has some silent failure modes, so largely switch to dnf.

* Fix windows build script
2025-02-25 13:47:36 -08:00
José Pekkarinen
6ecd7f64ba
docker: upgrade rocm to 6.3.3 (#8211)
centos-7 images have been deprecated upstream and replaced with
almalinux-8 images instead, requiring some small extra work.

Signed-off-by: José Pekkarinen <jose.pekkarinen@foxhound.fi>
2025-02-25 13:38:08 -08:00
Chuanhui Liu
888855675e
docs: rocm install link (#9346) 2025-02-25 13:15:47 -08:00
Michael Yang
b16367b4b2 fix: add back bf16 support
this was accidentally removed when moving fs/ggml from its previous
location
2025-02-25 19:26:14 +00:00
Pavol Rusnak
a499390648
build: support Compute Capability 5.0, 5.2 and 5.3 for CUDA 12.x (#8567)
CUDA 12.x still supports Compute Capability 5.0, 5.2 and 5.3,
so let's build for these architectures as well
2025-02-25 09:54:19 -08:00
frob
4df98f3eb5
Move cgroups fix out of AMD section. (#9072)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-02-25 08:52:50 -08:00
Blake Mizerany
348b3e0983
server/internal: copy bmizerany/ollama-go to internal package (#9294)
This commit copies (without history) the bmizerany/ollama-go repository
with the intention of integrating it into the ollama as a replacement
for the pushing, and pulling of models, and management of the cache they
are pushed and pulled from.

New homes for these packages will be determined as they are integrated
and we have a better understanding of proper package boundaries.
2025-02-24 22:39:44 -08:00
Parth Sareen
0b7e1676eb
sample: add sampling package for new engine (#8410) 2025-02-24 17:19:01 -08:00
Parth Sareen
314573bfe8
config: allow setting context length through env var (#8938)
* envconfig: allow setting context length through env var
2025-02-24 13:26:35 -08:00
Blake Mizerany
4604b10306
go.mod: bump to go1.24 (#9242) 2025-02-24 13:11:46 -08:00
Jeffrey Morgan
8c13cfa4dd
ml/backend/ggml: fix crash on windows paths with wide characters (#9305) v0.5.12 2025-02-23 19:13:53 -08:00
Jeffrey Morgan
7cfd4aee4d
docs: add additional ROCm docs for building (#9066) 2025-02-22 11:22:59 -08:00
Blake Mizerany
68bac1e0a6
server: group routes by category and purpose (#9270)
The route assembly in Handler lacked clear organization making it
difficult scan for routes and their relationships to each other. This
commit aims to fix that by reordering the assembly of routes to group
them by category and purpose.

Also, be more specific about what "config" refers to (it is about CORS
if you were wondering... I was.)
2025-02-21 21:02:26 -08:00
Jesse Gross
f53f4198c3 ml: Abstract attention out of model definitions
There are two benefits to doing this:
 - Provide a library function that models can use, reducing code for
   each model implementation
 - Enables a single place to drop in optimized implementations of
   attention based on the backend or other factors. One is provided for
   GGML.

On CUDA this improves token generation rate by about 3%. It does not
have a significant effect on Metal.

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2025-02-21 13:16:21 -08:00
Michael Yang
2192a28eed ml/backend/ggml: fix rms norm 2025-02-21 18:34:19 +00:00
Junyan Qin (Chin)
5d81c1a184
docs: add RockChinQ/LangBot to integrations list (#9272) 2025-02-21 09:36:55 -08:00
Jesse Gross
5c5535c064 models: Prune unused outputs earlier in the forward pass
Currently Rows is called as the last step in a model computation
to get the values for the output tokens. However, if we move it
earlier in the process then we can trim out computations that
never get used. This is similar to how models are defined in
llama.cpp.

Changing the model definition in this way improves token generation
performance by approximately 8%.
2025-02-20 14:49:47 -08:00
Jesse Gross
e5bcc51ae1 ggml-backend: Don't recreate the scheduler for each context
We don't need to create and destroy the GGML scheduler for every
context. This introduces extra CPU overhead for every forward
pass and extra memory for contexts that don't actually get scheduled
(for example, KV caches). We can instead just have one scheduler
for the backend and reset it each time we call Compute.

This improves token generation performance by 1-2% and removes
scheduler create/destroy from profile traces.
2025-02-20 14:49:47 -08:00
Jesse Gross
bd6a7d5e64 ollamarunner: Pass runner performance parameters to backends
Currently the following parameters are in the runner but not used:
 - numGPULayers
 - mainGPU
 - threads
 - tensorSplit

This passes them through to the backend, which is where they would
actually get used. However, the GGML backend does not yet do anything
with them.
2025-02-20 13:27:57 -08:00
Bruce MacDonald
14b5a9a150
api: document client stream behavior with a test (#8996)
Added unit tests to verify error handling behavior in the Client.stream and Client.do methods.
Tests cover various error scenarios including:
- Error responses with status codes >= 400
- Error messages with successful status codes
- Empty error messages
- Successful responses
2025-02-20 13:19:58 -08:00
Michael Yang
ba9ec3d05e ci: use clang for windows cpu builds
clang outputs are faster. we were previously building with clang via gcc
wrapper in cgo but this was missed during the build updates so there was
a drop in performance
v0.5.12-rc1
2025-02-20 20:22:36 +00:00
frob
7c168b08c9
server: add missing function parens to debug log (#9255) 2025-02-20 12:10:15 -08:00
danielekp
3d4cc7833c
docs: Add yla to community integrations 2025-02-20 11:34:24 -08:00
Lucas Hahn
351a85d9ea
openai: add 'timeout' to allowable x-stainless headers (#9237) v0.5.12-rc0 2025-02-19 21:56:18 -08:00
Michael Yang
bda4ef6c56 reorder patches 2025-02-20 03:49:24 +00:00
Michael Yang
1e438b237c
Merge pull request #9203 from ollama/mxyng/sapphirerapids
build: remove backend build for sapphirerapids
2025-02-19 21:42:00 +00:00