775 Commits

Author SHA1 Message Date
Blake Mizerany
68bac1e0a6
server: group routes by category and purpose (#9270)
The route assembly in Handler lacked clear organization making it
difficult scan for routes and their relationships to each other. This
commit aims to fix that by reordering the assembly of routes to group
them by category and purpose.

Also, be more specific about what "config" refers to (it is about CORS
if you were wondering... I was.)
2025-02-21 21:02:26 -08:00
frob
7c168b08c9
server: add missing function parens to debug log (#9255) 2025-02-20 12:10:15 -08:00
Lucas Hahn
351a85d9ea
openai: add 'timeout' to allowable x-stainless headers (#9237) 2025-02-19 21:56:18 -08:00
Jesse Gross
ed443a0393 Runner for Ollama engine
This provides integration with the new Ollama engine
(5824541 next ollama runner (#7913)) and the rest of the Ollama
infrastructure such as the runner and Ollama server.

In addition, it also builds out the KV cache infrastructure to
support requirements of how Ollama runs models such as:
 - Parallel processing
 - Memory management for defragmentation and shifting
 - Multi-modal modals

Both old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:

Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve

Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1
2025-02-13 17:09:26 -08:00
Jesse Gross
6945617af5 models: Move model into their own directory
This allows there to be a file that is a list of models that is
not mixed into the runner code.
2025-02-13 17:09:26 -08:00
Michael Yang
58245413f4
next ollama runner (#7913)
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations

This is the first implementation of the new engine. Follow up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2025-02-13 16:31:21 -08:00
Yashwanth A
291def6adb
server: increase timeout in stall detection from 5s to 30s (#8831)
In some cases, downloads slow due to disk i/o or other factors,
causing the download to restart a part. This causes the download
to "reverse" in percent completion. By increasing the timeout to 30s,
this should happen less frequently.
2025-02-05 10:00:26 -08:00
Jeffrey Morgan
c852b8e021
server: always print upload/download part info (#8832) 2025-02-04 19:30:49 -08:00
William
d8932c55e7
server: fix out of bounds exception on model download (#8746) 2025-02-04 18:52:47 -08:00
Michael Yang
dcfb7a105c
next build (#8539)
* add build to .dockerignore

* test: only build one arch

* add build to .gitignore

* fix ccache path

* filter amdgpu targets

* only filter if autodetecting

* Don't clobber gpu list for default runner

This ensures the GPU specific environment variables are set properly

* explicitly set CXX compiler for HIP

* Update build_windows.ps1

This isn't complete, but is close.  Dependencies are missing, and it only builds the "default" preset.

* build: add ollama subdir

* add .git to .dockerignore

* docs: update development.md

* update build_darwin.sh

* remove unused scripts

* llm: add cwd and build/lib/ollama to library paths

* default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS

* add additional cmake output vars for msvc

* interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12

* remove unncessary filepath.Dir, cleanup

* add hardware-specific directory to path

* use absolute server path

* build: linux arm

* cmake install targets

* remove unused files

* ml: visit each library path once

* build: skip cpu variants on arm

* build: install cpu targets

* build: fix workflow

* shorter names

* fix rocblas install

* docs: clean up development.md

* consistent build dir removal in development.md

* silence -Wimplicit-function-declaration build warnings in ggml-cpu

* update readme

* update development readme

* llm: update library lookup logic now that there is one runner (#8587)

* tweak development.md

* update docs

* add windows cuda/rocm tests

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
2025-01-29 15:03:38 -08:00
Patrick Devine
2539f2dbf9
Fix absolute path names + gguf detection (#8428) 2025-01-14 19:01:24 -08:00
Patrick Devine
8bccae4f92
show a more descriptive error in the client if it is newer than the server (#8351) 2025-01-09 10:12:30 -08:00
isamu arimoto
6ae2adc1af
openai: accept additional headers to fix CORS errors (#8343) 2025-01-08 11:28:11 -08:00
Patrick Devine
86a622cbdc
Update the /api/create endpoint to use JSON (#7935)
Replaces `POST /api/create` to use JSON instead of a Modelfile.

This is a breaking change.
2024-12-31 18:02:30 -08:00
湛露先生
928de9050e
server: reuse InvalidModelNameErrMsg type (#8163) 2024-12-23 10:38:34 -05:00
Patrick Devine
8c9fb8eb73
imageproc mllama refactor (#7537)
Refactor mllama image processing code, and add pixtral and qwen2vl
2024-12-14 19:50:15 -08:00
Blake Mizerany
b1fd7fef86
server: more support for mixed-case model names (#8017)
Fixes #7944
2024-12-11 15:29:59 -08:00
frob
757eeacc1b
server: lowercase hostname for Host header check (#5851) 2024-12-10 13:43:22 -08:00
Stefan Weil
abfdc4710f
all: fix typos in documentation, code, and comments (#7021) 2024-12-10 12:58:06 -08:00
Daniel Hiltgen
4879a234c4
build: Make target improvements (#7499)
* llama: wire up builtin runner

This adds a new entrypoint into the ollama CLI to run the cgo built runner.
On Mac arm64, this will have GPU support, but on all other platforms it will
be the lowest common denominator CPU build.  After we fully transition
to the new Go runners more tech-debt can be removed and we can stop building
the "default" runner via make and rely on the builtin always.

* build: Make target improvements

Add a few new targets and help for building locally.
This also adjusts the runner lookup to favor local builds, then
runners relative to the executable, and finally payloads.

* Support customized CPU flags for runners

This implements a simplified custom CPU flags pattern for the runners.
When built without overrides, the runner name contains the vector flag
we check for (AVX) to ensure we don't try to run on unsupported systems
and crash.  If the user builds a customized set, we omit the naming
scheme and don't check for compatibility.  This avoids checking
requirements at runtime, so that logic has been removed as well.  This
can be used to build GPU runners with no vector flags, or CPU/GPU
runners with additional flags (e.g. AVX512) enabled.

* Use relative paths

If the user checks out the repo in a path that contains spaces, make gets
really confused so use relative paths for everything in-repo to avoid breakage.

* Remove payloads from main binary

* install: clean up prior libraries

This removes support for v0.3.6 and older versions (before the tar bundle)
and ensures we clean up prior libraries before extracting the bundle(s).
Without this change, runners and dependent libraries could leak when we
update and lead to subtle runtime errors.
2024-12-10 09:47:19 -08:00
Jesse Gross
900f64e6be prompt: Don't trim whitespace from prompts
New lines can be an important part of a user's prompt and trimming
it can alter the results. We previously only trimmed prompts with
images but refactoring brought this behavior to all prompts, where
it became more noticable.

The /generate endpoint adds less whitespace and therefore doesn't
need to trim it out - this brings the same behavior to /chat.

Thanks to @gabe-l-hart for spotting the issue!

Fixes #7795
2024-12-09 11:02:55 -08:00
Parth Sareen
c6c526275d
api: add generate endpoint for structured outputs (#7939) 2024-12-04 17:37:12 -08:00
Parth Sareen
630e7dc6ff
api: structured outputs - chat endpoint (#7900)
Adds structured outputs to chat endpoint
---------

Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Hieu Nguyen <hieunguyen1053@outlook.com>
2024-12-04 16:31:19 -08:00
Jeffrey Morgan
d543b282a7
server: add warning message for deprecated context field (#7878) 2024-11-30 14:05:50 -08:00
Parth Sareen
5f8051180e
Enable index tracking for tools - openai api support (#7888) 2024-11-29 20:00:09 -08:00
Parth Sareen
ce7455a8e1
api: enable tool streaming (#7836) 2024-11-27 13:40:57 -08:00
Blake Mizerany
2b7ed61ca2
server: fix Transport override (#7834)
This changes makeRequest to update the http client Transport if and only
if testMakeRequestDialContext is set. This is to avoid overriding the
default Transport when testMakeRequestDialContext is nil, which broke
existing behavior, included proxies, timeouts, and other behaviors.

Fixes #7829
Fixes #7788
2024-11-25 15:08:34 -08:00
oza6ut0ne
31cb1ca9e5
openai: accept X-Stainless-Retry-Count header (#6910) 2024-11-23 12:39:05 -08:00
Bruce MacDonald
7b5585b9cb
server: remove out of date anonymous access check (#7785)
In the past the ollama.com server would return a JWT that contained
information about the user being authenticated. This was used to return
different error messages to the user. This is no longer possible since the
token used to authenticate does not contain information about the user
anymore. Removing this code that no longer works.

Follow up changes will improve the error messages returned here, but good to
clean up first.
2024-11-22 11:57:35 -08:00
Daniel Hiltgen
f602ab4de4
expose underlying error on embedding failure (#7743)
Avoid a round-trip asking users for logs to see what went wrong.
2024-11-19 16:26:05 -08:00
Blake Mizerany
4b8a2e341a
server: allow mixed-case model names on push, pull, cp, and create (#7676)
This change allows for mixed-case model names to be pushed, pulled,
copied, and created, which was previously disallowed because the Ollama
registry was backed by a Docker registry that enforced a naming
convention that disallowed mixed-case names, which is no longer the
case.

This does not break existing, intended, behaviors.

Also, make TestCase test a story of creating, updating, pulling, and
copying a model with case variations, ensuring the model's manifest is
updated correctly, and not duplicated across different files with
different case variations.
2024-11-19 15:05:57 -08:00
Jeffrey Morgan
8b4b243f5f
server: fix warnings in prompt_test.go (#7710) 2024-11-17 13:01:04 -08:00
Jesse Gross
6cd566872b sched: Lift parallel restriction for multimodal models except mllama
The Go runner does not have a problem with supporting parallel
requests for most multimodal models. Now that we won't be potentially
falling back to server.cpp, this restriction can be lifted.

However, the new mllama model can't support parallel requests, so we
will need to keep a restriction for that.
2024-11-06 13:32:18 -08:00
Daniel Hiltgen
a4c70fe157
One corrupt manifest should not wedge model operations (#7515)
One potential failure mode is an empty file which bubbles up as an EOF error,
leading to all pulls and listing operations failing.  Instead, continue and
warn about the corrupt manifest.  This also allows re-pulling the corrupt
manifest to repair the system.
2024-11-05 14:21:45 -08:00
Jesse Gross
34a75102f7 prompt: Use a single token when estimating mllama context size
Currently we assume that images take 768 tokens of context size for
the purposes of clipping old messages that exceed the context window.
However, our mllama implementation stores the full image embedding
in a single token. As a result, there is significant waste of context
space.

Ideally, we would handle this more generically and have the
implementation report the number of tokens. However, at the moment
this would just result in a similar set of 'if' conditions in the
runner plus APIs to report it back. So for now, we just keep this
simple.
2024-11-05 10:11:50 -08:00
Daniel Hiltgen
4ebfa2cb91
Quiet down debug log of image payload (#7454)
Avoid excessive log spew and make consistent with chat logging
2024-11-04 13:05:16 -08:00
Jesse Gross
c826e57475 runner.go: Better abstract vision model integration
-Update mllama to take the cross attention state as embeddings in
a batch, more similar to how Llava handles it. This improves
integration with the input cache.
-Pass locations in a prompt for embeddings using tags similar to Llava.
-Abstract interface to vision models so the main runner accesses Clip
and Mllama similarly

Co-authored-by: Michael Yang <mxyng@pm.me>
2024-10-30 14:53:43 -07:00
Patrick Devine
db1842b9e1
add more tests for getting the optimal tiled canvas (#7411) 2024-10-29 16:28:02 -07:00
Patrick Devine
084929c293
add mllama image processing to the generate handler (#7384) 2024-10-28 13:51:19 -07:00
Patrick Devine
c7cb0f0602
image processing for llama3.2 (#6963)
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Jesse Gross <jesse@ollama.com>
2024-10-18 16:12:35 -07:00
Daniel Hiltgen
05cd82ef94
Rename gpu package discover (#7143)
Cleaning up go package naming
2024-10-16 17:45:00 -07:00
Jeffrey Morgan
96efd9052f
Re-introduce the llama package (#5034)
* Re-introduce the llama package

This PR brings back the llama package, making it possible to call llama.cpp and
ggml APIs from Go directly via CGo. This has a few advantages:

- C APIs can be called directly from Go without needing to use the previous
  "server" REST API
- On macOS and for CPU builds on Linux and Windows, Ollama can be built without
  a go generate ./... step, making it easy to get up and running to hack on
  parts of Ollama that don't require fast inference
- Faster build times for AVX,AVX2,CUDA and ROCM (a full build of all runners
  takes <5 min on a fast CPU)
- No git submodule making it easier to clone and build from source

This is a big PR, but much of it is vendor code except for:

- llama.go CGo bindings
- example/: a simple example of running inference
- runner/: a subprocess server designed to replace the llm/ext_server package
- Makefile an as minimal as possible Makefile to build the runner package for
  different targets (cpu, avx, avx2, cuda, rocm)

Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>

* cache: Clear old KV cache entries when evicting a slot

When forking a cache entry, if no empty slots are available we
evict the least recently used one and copy over the KV entries
from the closest match. However, this copy does not overwrite
existing values but only adds new ones. Therefore, we need to
clear the old slot first.

This change fixes two issues:
 - The KV cache fills up and runs out of space even though we think
   we are managing it correctly
 - Performance gets worse over time as we use new cache entries that
   are not hot in the processor caches

* doc: explain golang objc linker warning (#6830)

* llama: gather transitive dependencies for rocm for dist packaging (#6848)

* Refine go server makefiles to be more DRY (#6924)

This breaks up the monolithic Makefile for the Go based runners into a
set of utility files as well as recursive Makefiles for the runners.
Files starting with the name "Makefile" are buildable, while files that
end with ".make" are utilities to include in other Makefiles.  This
reduces the amount of nearly identical targets and helps set a pattern
for future community contributions for new GPU runner architectures.

When we are ready to switch over to the Go runners, these files should
move to the top of the repo, and we should add targets for the main CLI,
as well as a helper "install" (put all the built binaries on the local
system in a runnable state) and "dist" target (generate the various
tar/zip files for distribution) for local developer use.

* llama: don't create extraneous directories (#6988)

* llama: Exercise the new build in CI (#6989)

Wire up some basic sanity testing in CI for the Go runner.  GPU runners are not covered yet.

* llama: Refine developer docs for Go server (#6842)

This enhances the documentation for development focusing on the new Go
server.  After we complete the transition further doc refinements
can remove the "transition" discussion.

* runner.go: Allocate batches for all sequences during init

We should tell the model that we could have full batches for all
sequences. We already do this when we allocate the batches but it was
missed during initialization.

* llama.go: Don't return nil from Tokenize on zero length input

Potentially receiving nil in a non-error condition is surprising to
most callers - it's better to return an empty slice.

* runner.go: Remove stop tokens from cache

If the last token is EOG then we don't return this and it isn't
present in the cache (because it was never submitted to Decode).
This works well for extending the cache entry with a new sequence.

However, for multi-token stop sequences, we won't return any of the
tokens but all but the last one will be in the cache. This means
when the conversation continues the cache will contain tokens that
don't overlap with the new prompt.

This works (we will pick up the portion where there is overlap) but
it causes unnecessary cache thrashing because we will fork the original
cache entry as it is not a perfect match.

By trimming the cache to the tokens that we actually return this
issue can be avoided.

* runner.go: Simplify flushing of pending tokens

* runner.go: Update TODOs

* runner.go: Don't panic when processing sequences

If there is an error processing a sequence, we should return a
clean HTTP error back to Ollama rather than panicing. This will
make us more resilient to transient failures.

Panics can still occur during startup as there is no way to serve
requests if that fails.

Co-authored-by: jmorganca <jmorganca@gmail.com>

* runner.go: More accurately capture timings

Currently prompt processing time doesn't capture the that it takes
to tokenize the input, only decoding time. We should capture the
full process to more accurately reflect reality. This is especially
true once we start processing images where the initial processing
can take significant time. This is also more consistent with the
existing C++ runner.

* runner.go: Support for vision models

In addition to bringing feature parity with the C++ runner, this also
incorporates several improvements:
 - Cache prompting works with images, avoiding the need to re-decode
   embeddings for every message in a conversation
 - Parallelism is supported, avoiding the need to restrict to one
   sequence at a time. (Though for now Ollama will not schedule
   them while we might need to fall back to the old runner.)

Co-authored-by: jmorganca <jmorganca@gmail.com>

* runner.go: Move Unicode checking code and add tests

* runner.go: Export external cache members

Runner and cache are in the same package so the change doesn't
affect anything but it is more internally consistent.

* runner.go: Image embedding cache

Generating embeddings from images can take significant time (on
my machine between 100ms and 8s depending on the model). Although
we already cache the result of decoding these images, the embeddings
need to be regenerated every time. This is not necessary if we get
the same image over and over again, for example, during a conversation.

This currently uses a very small cache with a very simple algorithm
but it is easy to improve as is warranted.

* llama: catch up on patches

Carry forward solar-pro and cli-unicode patches

* runner.go: Don't re-allocate memory for every batch

We can reuse memory allocated from batch to batch since batch
size is fixed. This both saves the cost of reallocation as well
keeps the cache lines hot.

This results in a roughly 1% performance improvement for token
generation with Nvidia GPUs on Linux.

* runner.go: Default to classic input cache policy

The input cache as part of the go runner implemented a cache
policy that aims to maximize hit rate in both single and multi-
user scenarios. When there is a cache hit, the response is
very fast.

However, performance is actually slower when there is an input
cache miss due to worse GPU VRAM locality. This means that
performance is generally better overall for multi-user scenarios
(better input cache hit rate, locality was relatively poor already).
But worse for single users (input cache hit rate is about the same,
locality is now worse).

This defaults the policy back to the old one to avoid a regression
but keeps the new one available through an environment variable
OLLAMA_MULTIUSER_CACHE. This is left undocumented as the goal is
to improve this in the future to get the best of both worlds
without user configuration.

For inputs that result in cache misses, on Nvidia/Linux this
change improves performance by 31% for prompt processing and
13% for token generation.

* runner.go: Increase size of response channel

Generally the CPU can easily keep up with handling reponses that
are generated but there's no reason not to let generation continue
and handle things in larger batches if needed.

* llama: Add CI to verify all vendored changes have patches (#7066)

Make sure we don't accidentally merge changes in the vendored code
that aren't also reflected in the patches.

* llama: adjust clip patch for mingw utf-16 (#7065)

* llama: adjust clip patch for mingw utf-16

* llama: ensure static linking of runtime libs

Avoid runtime dependencies on non-standard libraries

* runner.go: Enable llamafile (all platforms) and BLAS (Mac OS)

These are two features that are shown on llama.cpp's system info
that are currently different between the two runners. On my test
systems the performance difference is very small to negligible
but it is probably still good to equalize the features.

* llm: Don't add BOS/EOS for tokenize requests

This is consistent with what server.cpp currently does. It affects
things like token processing counts for embedding requests.

* runner.go: Don't cache prompts for embeddings

Our integration with server.cpp implicitly disables prompt caching
because it is not part of the JSON object being parsed, this makes
the Go runner behavior similarly.

Prompt caching has been seen to affect the results of text completions
on certain hardware. The results are not wrong either way but they
are non-deterministic. However, embeddings seem to be affected even
on hardware that does not show this behavior for completions. For
now, it is best to maintain consistency with the existing behavior.

* runner.go: Adjust debug log levels

Add system info printed at startup and quiet down noisier logging.

* llama: fix compiler flag differences (#7082)

Adjust the flags for the new Go server to more closely match the
generate flow

* llama: refine developer docs (#7121)

* llama: doc and example clean up (#7122)

* llama: doc and example clean up

* llama: Move new dockerfile into llama dir

Temporary home until we fully transition to the Go server

* llama: runner doc cleanup

* llama.go: Add description for Tokenize error case

---------

Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Daniel Hiltgen <dhiltgen@users.noreply.github.com>
2024-10-08 08:53:54 -07:00
Alex Mavrogiannis
f40bb398f6
Stop model before deletion if loaded (fixed #6957) (#7050) 2024-10-01 15:45:43 -07:00
Blake Mizerany
03608cb46e
server: close response body on error (#6986)
This change closes the response body when an error occurs in
makeRequestWithRetry. Previously, the first, non-200 response body was
not closed before reattempting the request. This change ensures that
the response body is closed in all cases where an error occurs,
preventing leaks of file descriptors.

Fixes #6974
2024-09-26 12:00:31 -07:00
Daniel Hiltgen
d632e23fba
Add Windows arm64 support to official builds (#5712)
* Unified arm/x86 windows installer

This adjusts the installer payloads to be architecture aware so we can cary
both amd64 and arm64 binaries in the installer, and install only the applicable
architecture at install time.

* Include arm64 in official windows build

* Harden schedule test for slow windows timers

This test seems to be a bit flaky on windows, so give it more time to converge
2024-09-20 13:09:38 -07:00
Jeffrey Morgan
d05da29912
server: add tool parsing support for nemotron-mini (#6849) 2024-09-17 18:06:16 -07:00
Daniel Hiltgen
cd5c8f6471
Optimize container images for startup (#6547)
* Optimize container images for startup

This change adjusts how to handle runner payloads to support
container builds where we keep them extracted in the filesystem.
This makes it easier to optimize the cpu/cuda vs cpu/rocm images for
size, and should result in faster startup times for container images.

* Refactor payload logic and add buildx support for faster builds

* Move payloads around

* Review comments

* Converge to buildx based helper scripts

* Use docker buildx action for release
2024-09-12 12:10:30 -07:00
Patrick Devine
abed273de3
add "stop" command (#6739) 2024-09-11 16:36:21 -07:00
Daniel Hiltgen
9565fa64a8
Revert "Detect running in a container (#6495)" (#6662)
This reverts commit a60d9b89cec60f960841caa9881c4a48e4a87406.
2024-09-05 14:26:00 -07:00
Daniel Hiltgen
a60d9b89ce
Detect running in a container (#6495) 2024-09-05 13:24:51 -07:00