Commit Graph

3014 Commits

SHA1 Message Date
da8e2a0447 use kvs to detect embedding models 2024-07-01 10:47:43 -07:00
a30915bde1 add capabilities 2024-07-01 10:47:43 -07:00
58e3fff311 rename templates to template 2024-07-01 10:40:54 -07:00
3f0b309ad4 remove ManifestV2 2024-07-01 10:40:54 -07:00
e70610ef06 Merge pull request #5410 from dhiltgen/ctx_cleanup
Fix case for NumCtx
2024-07-01 09:54:20 -07:00
dfded7e075 Merge pull request #5364 from dhiltgen/concurrency_docs
Document concurrent behavior and settings
2024-07-01 09:49:48 -07:00
173b550438 Remove default auto from help message
This may confuse users into thinking "auto" is an acceptable string value - it must be numeric
2024-07-01 09:48:05 -07:00
cff3f44f4a Fix case for NumCtx 2024-07-01 09:43:59 -07:00
3518aaef33 Merge pull request #4218 from dhiltgen/auto_parallel
Enable concurrency by default
2024-07-01 08:32:29 -07:00
1963c00201 Update README.md (#5214)
* Update README.md

Added Mesop example to web & desktop

* Update README.md

---------

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
2024-06-30 22:00:57 -04:00
27402cb7a2 Update gpu.md (#5382)
Runs fine on an NVIDIA GeForce GTX 1050 Ti
2024-06-30 21:48:51 -04:00
c1218199cf Update api.md 2024-06-29 16:22:49 -07:00
717f7229eb Do not shift context for sliding window models (#5368)
* Do not shift context for sliding window models

* truncate prompt > 2/3 tokens

* only target gemma2
v0.1.48
2024-06-28 19:39:31 -07:00
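
A minimal sketch of the skip-shift idea in the commit above, with hypothetical type and field names (the real change lives in the llm server code): sliding window models such as gemma2 cannot reuse cached state older than their attention window, so rather than shifting the context when it fills, the prompt is truncated to 2/3 of num_ctx up front.

```go
// Sketch only: names below (model, usesSlidingWindow, preparePrompt) are
// assumptions for illustration, not the actual implementation.
package main

import "fmt"

type model struct {
	architecture string
	numCtx       int
}

// Only gemma2 is targeted, matching the commit's scope.
func (m *model) usesSlidingWindow() bool {
	return m.architecture == "gemma2"
}

// preparePrompt truncates the prompt to 2/3 of the context window for
// sliding window models, since shifting later is not an option for them.
func preparePrompt(m *model, tokens []int) []int {
	if m.usesSlidingWindow() {
		if limit := m.numCtx * 2 / 3; len(tokens) > limit {
			// Keep the most recent tokens; drop the oldest.
			return tokens[len(tokens)-limit:]
		}
	}
	return tokens
}

func main() {
	m := &model{architecture: "gemma2", numCtx: 8192}
	prompt := make([]int, 9000)
	fmt.Println(len(preparePrompt(m, prompt))) // 5461 (2/3 of 8192)
}
```
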
aae56abb7c Document concurrent behavior and settings 2024-06-28 13:15:57 -07:00
5f034f5b63 Include Show Info in Interactive (#5342) 2024-06-28 13:15:52 -07:00
b910fa9010 Ollama Show: Check for Projector Type (#5307)
* Check that projtype exists

* Maintain Ordering
2024-06-28 11:30:16 -07:00
6d4219083c Update docs (#5312) 2024-06-28 09:58:14 -07:00
1ed4f521c4 Merge pull request #5340 from ollama/mxyng/mem
gemma2 graph
2024-06-27 14:26:49 -07:00
de2163dafd gemma2 graph 2024-06-27 13:34:52 -07:00
2cc7d05012 update readme for gemma 2 (#5333)
* update readme for gemma 2
2024-06-27 12:45:16 -04:00
123a722a6f zip: prevent extracting files into parent dirs (#5314) v0.1.47 2024-06-26 21:38:21 -07:00
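
The zip fix above is the classic "zip slip" hardening. A hedged sketch of the guard such a fix implies (safeJoin is a hypothetical helper, not the actual patch): resolve each archive entry path against the destination directory and reject anything that escapes it via ".." components.

```go
// Sketch of a zip-slip guard; not the code from the commit.
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin joins an archive entry name onto destDir and errors out if the
// cleaned result lands outside destDir.
func safeJoin(destDir, name string) (string, error) {
	p := filepath.Join(destDir, name) // Join also cleans ".." components
	if !strings.HasPrefix(p, filepath.Clean(destDir)+string(filepath.Separator)) {
		return "", fmt.Errorf("illegal path %q escapes %q", name, destDir)
	}
	return p, nil
}

func main() {
	if _, err := safeJoin("/tmp/out", "../../etc/passwd"); err != nil {
		fmt.Println(err) // rejected: would extract into a parent dir
	}
}
```
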
4d311eb731 llm: architecture patch (#5316) 2024-06-26 21:38:12 -07:00
cb42e607c5 llm: speed up gguf decoding by a lot (#5246)
Previously, several costly operations made loading GGUF files and their
metadata and tensor information very slow:

  * Too many allocations when decoding strings
  * Hitting disk for each read of each key and value, resulting in an
    excessive number of syscalls and disk I/O

The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro
M3.

This commit also optionally skips collecting large arrays of values when
decoding GGUFs. When such keys are encountered, their values are set to
null and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.
v0.1.46
2024-06-24 21:47:52 -07:00
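
A sketch of the two techniques the commit message describes, under assumed types (decoder and readString are illustrations, not the real GGUF decoder): one bufio.Reader batches disk reads so each key/value no longer costs a syscall, and a reusable scratch buffer replaces a fresh allocation per decoded string.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

type decoder struct {
	r       *bufio.Reader
	scratch []byte // reused across readString calls
}

func newDecoder(f *os.File) *decoder {
	// A single large buffered reader turns thousands of tiny reads into
	// a handful of big ones.
	return &decoder{r: bufio.NewReaderSize(f, 1<<20)}
}

// readString decodes a length-prefixed string without allocating a new
// read buffer each time.
func (d *decoder) readString() (string, error) {
	var n uint64
	if err := binary.Read(d.r, binary.LittleEndian, &n); err != nil {
		return "", err
	}
	if uint64(cap(d.scratch)) < n {
		d.scratch = make([]byte, n)
	}
	buf := d.scratch[:n]
	if _, err := io.ReadFull(d.r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	f, err := os.Open("model.gguf") // hypothetical file
	if err != nil {
		return
	}
	defer f.Close()
	if s, err := newDecoder(f).readString(); err == nil {
		fmt.Println(s)
	}
}
```
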
2aa91a937b cmd: defer stating model info until necessary (#5248)
This commit changes the 'ollama run' command to defer fetching model
information until it actually needs it - that is, when running in
interactive mode.

It also removes one case where the model information was fetched in
duplicate: once just before calling generateInteractive and then again,
first thing, inside generateInteractive.

This positively impacts the performance of the command:

    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?

    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.168 total
    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?

    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.220 total
    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?

    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.217 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?

    ./after run llama3 'hi'  0.02s user 0.01s system 4% cpu 0.652 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?

    ./after run llama3 'hi'  0.01s user 0.01s system 5% cpu 0.498 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?

    ./after run llama3 'hi'  0.01s user 0.01s system 3% cpu 0.479 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?

    ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?

    ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
2024-06-24 20:14:03 -07:00
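
A minimal sketch of the deferral, with hypothetical names (showModelInfo stands in for the expensive /api/show round trip): the fetch moves out of unconditional startup and into the interactive branch, and it happens exactly once.

```go
package main

import "fmt"

type modelInfo struct{ family string }

// showModelInfo stands in for the costly server round trip.
func showModelInfo(name string) modelInfo {
	fmt.Println("expensive /api/show call for", name)
	return modelInfo{family: "llama"}
}

func run(name, prompt string, interactive bool) {
	if !interactive {
		// One-shot `ollama run model "prompt"` never needs the info,
		// so it skips the round trip entirely.
		fmt.Println("generate:", prompt)
		return
	}
	info := showModelInfo(name) // fetched once, only when needed
	fmt.Println("interactive session:", name, "family:", info.family)
}

func main() {
	run("llama3", "hi", false) // fast path, no show call
	run("llama3", "", true)    // interactive, fetch now
}
```
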
ccef9431c8 Merge pull request #5205 from dhiltgen/modelfile_use_mmap
Fix use_mmap parsing for modelfiles
2024-06-21 16:30:36 -07:00
642cee1342 Sort the ps output
Provide consistent ordering for the ps command - longest duration listed first
2024-06-21 15:59:41 -07:00
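
One plausible reading of that ordering, sketched with a hypothetical struct: sort the running models so the entry with the latest expiry (the longest remaining duration) comes first.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type runningModel struct {
	Name      string
	ExpiresAt time.Time
}

// sortPS orders models longest-remaining-duration first, giving the ps
// command a stable, predictable listing.
func sortPS(models []runningModel) {
	sort.Slice(models, func(i, j int) bool {
		return models[i].ExpiresAt.After(models[j].ExpiresAt)
	})
}

func main() {
	now := time.Now()
	models := []runningModel{
		{"codellama", now.Add(1 * time.Minute)},
		{"llama3", now.Add(5 * time.Minute)},
	}
	sortPS(models)
	fmt.Println(models[0].Name) // llama3
}
```
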
9a9e7d83c4 Docs (#5149) 2024-06-21 15:52:09 -07:00
9929751cc8 Disable concurrency for AMD + Windows
Until ROCm v6.2 ships, we won't be able to get accurate free memory
reporting on Windows, which makes automatic concurrency too risky.
Users can still opt in, but they will need to pay attention to model
sizes; otherwise they may thrash/page VRAM or cause OOM crashes.
All other platforms and GPUs now have accurate VRAM reporting wired
up, so we can turn on concurrency by default.
2024-06-21 15:45:05 -07:00
17b7186cd7 Enable concurrency by default
This adjusts our default settings to enable multiple models and parallel
requests to a single model. Users can still override these via the same
env var settings as before. Parallel has a direct impact on num_ctx,
which in turn can have a significant impact on small VRAM GPUs, so this
change also refines the algorithm: when parallel is not explicitly set by
the user, we try to find a reasonable default that fits the model on
their GPU(s). As before, multiple models will only load concurrently if
they fully fit in VRAM.
2024-06-21 15:45:05 -07:00
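
A toy sketch of that fitting logic, with estimateVRAM as a hypothetical stand-in for the real memory estimator: each parallel slot multiplies the effective num_ctx, so when the user has not set a parallel value the candidate default is stepped down until the estimate fits in free VRAM.

```go
package main

import "fmt"

// estimateVRAM is a toy stand-in: weights plus ~50 KiB of KV cache per
// context token. The real estimator is far more detailed.
func estimateVRAM(modelBytes uint64, numCtx int) uint64 {
	return modelBytes + uint64(numCtx)*50*1024
}

// pickParallel tries the default parallelism first, then steps down until
// the model plus its KV cache fits in free VRAM.
func pickParallel(modelBytes, freeVRAM uint64, baseCtx, defaultParallel int) int {
	for p := defaultParallel; p > 1; p-- {
		if estimateVRAM(modelBytes, baseCtx*p) <= freeVRAM {
			return p
		}
	}
	return 1
}

func main() {
	// 4 GiB model, 6 GiB free, 2048 base context, default of 4 slots.
	fmt.Println(pickParallel(4<<30, 6<<30, 2048, 4))
}
```
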
189a43caa2 Merge pull request #5206 from ollama/mxyng/quantize
fix: quantization with template
2024-06-21 13:44:34 -07:00
e835ef1836 fix: quantization with template 2024-06-21 13:39:25 -07:00
7e7749224c Fix use_mmap parsing for modelfiles
Add the new tristate parsing logic to the modelfile code path, along
with a unit test.
2024-06-21 12:27:19 -07:00
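
A sketch of what "tristate" means here, assuming the shape of the fix rather than quoting it: use_mmap must distinguish true, false, and not-set, so it parses into a *bool whose nil value means "let the runtime decide".

```go
package main

import (
	"fmt"
	"strconv"
)

// parseUseMmap returns nil for an unset value so downstream code can tell
// "user said false" apart from "user said nothing".
func parseUseMmap(val string) (*bool, error) {
	if val == "" {
		return nil, nil // unset: defer to the runtime default
	}
	b, err := strconv.ParseBool(val)
	if err != nil {
		return nil, fmt.Errorf("invalid use_mmap value %q: %w", val, err)
	}
	return &b, nil
}

func main() {
	for _, v := range []string{"", "true", "false"} {
		p, _ := parseUseMmap(v)
		if p == nil {
			fmt.Println("use_mmap: auto")
		} else {
			fmt.Println("use_mmap:", *p)
		}
	}
}
```
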
c7c2f3bc22 Merge pull request #5194 from dhiltgen/linux_mmap_auto
Refine mmap default logic on linux
2024-06-20 11:44:08 -07:00
54a79d6a8a Merge pull request #5125 from dhiltgen/fedora39
Bump latest fedora cuda repo to 39
2024-06-20 11:27:24 -07:00
5bf5aeec01 Refine mmap default logic on linux
If we try to use mmap when the model is larger than the system's free
memory, loading is slower than the no-mmap approach.
2024-06-20 11:07:04 -07:00
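
The heuristic as described, sketched with a hypothetical freeSystemMemory probe (real code would read something like /proc/meminfo on Linux): default to mmap only when the model fits in free memory, and always honor an explicit user setting.

```go
package main

import "fmt"

// freeSystemMemory is a stand-in; pretend the system has 8 GiB free.
func freeSystemMemory() uint64 { return 8 << 30 }

func defaultUseMmap(modelBytes uint64, userSetting *bool) bool {
	if userSetting != nil {
		return *userSetting // an explicit use_mmap always wins
	}
	// A model larger than free memory would page-fault constantly under
	// mmap, so fall back to plain reads.
	return modelBytes <= freeSystemMemory()
}

func main() {
	fmt.Println(defaultUseMmap(4<<30, nil))  // true: fits in free memory
	fmt.Println(defaultUseMmap(16<<30, nil)) // false: would thrash
}
```
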
e01e535cbb Merge pull request #5192 from ollama/mxyng/kv
handle asymmetric embedding KVs
v0.1.45-rc5 v0.1.45
2024-06-20 10:46:24 -07:00
0195d6a2f8 Merge pull request #5188 from ollama/jyan/tmpdir2
fix: skip os.RemoveAll() if PID does not exist
2024-06-20 10:40:59 -07:00
8e0641a9bf handle asymmetric embedding KVs 2024-06-20 09:57:27 -07:00
662568d453 err != nil check 2024-06-20 09:30:59 -07:00
4ebb66c662 reformat error check 2024-06-20 09:23:43 -07:00
23e899f32d skip os.RemoveAll() if PID does not exist 2024-06-20 08:51:35 -07:00
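
A hedged sketch of the PID-liveness probe this kind of temp-dir cleanup relies on; the commit's exact skip condition may differ from what is shown. On Unix, sending signal 0 tests whether a process exists without affecting it.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// pidExists probes a PID with signal 0 (Unix-only technique).
func pidExists(pid int) bool {
	proc, err := os.FindProcess(pid) // always succeeds on Unix
	if err != nil {
		return false
	}
	return proc.Signal(syscall.Signal(0)) == nil
}

// cleanupTmpDir removes a PID-tagged temp dir only when its owner is gone.
func cleanupTmpDir(dir string, pid int) {
	if pidExists(pid) {
		fmt.Println("skip: process", pid, "is still running")
		return
	}
	_ = os.RemoveAll(dir) // stale dir, safe to delete
}

func main() {
	cleanupTmpDir("/tmp/ollama12345", os.Getpid()) // skips: we are alive
}
```
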
fedf71635e Extend api/show and ollama show to return more model info (#4881)
* API Show Extended

* Initial Draft of Information

Co-Authored-By: Patrick Devine <pdevine@sonic.net>

* Clean Up

* Descriptive arg error messages and other fixes

* Second Draft of Show with Projectors Included

* Remove Chat Template

* Touches

* Prevent wrapping from files

* Verbose functionality

* Docs

* Address Feedback

* Lint

* Resolve Conflicts

* Function Name

* Tests for api/show model info

* Show Test File

* Add Projector Test

* Clean routes

* Projector Check

* Move Show Test

* Touches

* Doc update

---------

Co-authored-by: Patrick Devine <pdevine@sonic.net>
v0.1.45-rc4
2024-06-19 14:19:02 -07:00
97c59be653 Merge pull request #5074 from dhiltgen/app_log_rotation
Implement log rotation for tray app
2024-06-19 13:02:24 -07:00
9d8a4988e8 Implement log rotation for tray app 2024-06-19 12:53:34 -07:00
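
The commit carries no detail, but a common shape for desktop-app log rotation looks like the sketch below; the file names and retention count are assumptions, not the tray app's actual scheme.

```go
package main

import (
	"fmt"
	"os"
)

const keep = 5 // assumed retention count

// rotateLogs shifts app.log -> app-1.log -> app-2.log ... on startup,
// oldest first so nothing is overwritten, keeping `keep` old files.
func rotateLogs(base string) {
	for i := keep - 1; i >= 1; i-- {
		os.Rename(fmt.Sprintf("%s-%d.log", base, i), fmt.Sprintf("%s-%d.log", base, i+1))
	}
	os.Rename(base+".log", base+"-1.log")
}

func main() {
	rotateLogs("app")
	if f, err := os.Create("app.log"); err == nil { // start a fresh log
		f.Close()
	}
}
```
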
1ae0750a21 Merge pull request #5147 from ollama/mxyng/cleanup
remove confusing log message
2024-06-19 12:50:31 -07:00
9d91e5e587 remove confusing log message 2024-06-19 11:14:11 -07:00
96624aa412 Merge pull request #5072 from dhiltgen/windows_path
Move libraries out of users path
2024-06-19 09:13:39 -07:00
10f33b8537 Merge pull request #5146 from dhiltgen/backout
Put back temporary intel GPU env var
2024-06-19 09:12:45 -07:00
4a633cc295 Merge pull request #5145 from dhiltgen/bad_loads
Fix bad symbol load detection
2024-06-19 09:12:33 -07:00
d34d88e417 Revert "Revert "gpu: add env var for detecting Intel oneapi gpus (#5076)""
This reverts commit 755b4e4fc2.
2024-06-19 08:57:41 -07:00