Commit Graph

497 Commits

Author SHA1 Message Date
c895a7d13f some gocritic 2024-06-04 11:13:30 -07:00
04f3c12bb7 replace x/exp/slices with slices 2024-06-04 11:13:30 -07:00
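For context on this change: Go 1.21 promoted golang.org/x/exp/slices into the standard library, so the migration is mostly an import swap. A minimal sketch (illustrative, not code from this repo):

```go
package main

import (
	"fmt"
	"slices" // standard library since Go 1.21, replacing golang.org/x/exp/slices
)

func main() {
	shas := []string{"c895a7d", "04f3c12", "829ff87"}
	slices.Sort(shas)                             // same call shape as x/exp/slices
	fmt.Println(slices.Contains(shas, "829ff87")) // true
}
```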
829ff87bd1 revert tokenize ffi (#4761)
* Revert "use `int32_t` for call to tokenize (#4738)"

This reverts commit 763bb65dbb.

* Revert "vocab only"

This reverts commit bf54c845e9.

* Revert "use ffi for tokenizing/detokenizing"

This reverts commit 26a00a0410.
2024-05-31 18:54:21 -07:00
763bb65dbb use int32_t for call to tokenize (#4738)
* use `int32_t` for call to tokenize

* variable naming

* cleanup

* fix crash
2024-05-30 21:43:30 -07:00
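A hedged sketch of what the `int32_t` fix looks like at the cgo boundary; the C function here is a stand-in for llama.cpp's tokenizer, not the real binding:

```go
package main

/*
#include <stdint.h>
#include <stdlib.h>
// stand-in for a llama.cpp-style tokenize(); the real signature lives in llama.h
static int32_t tokenize(const char *text, int32_t text_len,
                        int32_t *tokens, int32_t n_max) {
	int32_t n = text_len < n_max ? text_len : n_max;
	for (int32_t i = 0; i < n; i++) tokens[i] = (int32_t)text[i];
	return n;
}
*/
import "C"

import (
	"fmt"
	"unsafe"
)

func main() {
	cs := C.CString("hello")
	defer C.free(unsafe.Pointer(cs))

	buf := make([]C.int32_t, 8)
	// pass C.int32_t explicitly rather than C.int: int32_t is the width the
	// C side declares, and C.int only happens to match it on some platforms
	n := C.tokenize(cs, C.int32_t(5), &buf[0], C.int32_t(len(buf)))
	fmt.Println(buf[:n]) // [104 101 108 108 111]
}
```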
7ca9605f54 speed up tests by only building static lib (#4740) 2024-05-30 21:43:15 -07:00
eb2c443a79 Merge pull request #4736 from ollama/mxyng/vocab-only
vocab only for tokenize
2024-05-30 17:21:00 -07:00
a50a87a7b8 partial offloading: allow flash attention and disable mmap (#4734)
* partial offloading: allow flash attention and disable mmap

* allow mmap with num_gpu=0
2024-05-30 16:58:01 -07:00
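A sketch of the resulting load-option logic, with hypothetical names (the real logic lives in ollama's llm package):

```go
package main

import "fmt"

// loadOpts and resolve are illustrative, not ollama's actual API
type loadOpts struct {
	NumGPU    int  // layers to offload; 0 means CPU only
	FlashAttn bool // flash attention is now allowed with partial offload
	UseMmap   bool
}

func resolve(o loadOpts, partial bool) loadOpts {
	// mmap keeps weights demand-paged, which hurts when layers are split
	// between CPU and GPU, so disable it for a partial offload...
	if partial && o.NumGPU > 0 {
		o.UseMmap = false
	}
	// ...but keep mmap with num_gpu=0, where it still helps
	return o
}

func main() {
	fmt.Printf("%+v\n", resolve(loadOpts{NumGPU: 20, FlashAttn: true, UseMmap: true}, true))
}
```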
bf54c845e9 vocab only 2024-05-30 16:49:28 -07:00
22f5c12ced Update llama.cpp submodule to 5921b8f0 (#4731)
* update llama.cpp submodule to `5921b8f089d3b7bda86aac5a66825df6a6c10603`

* add patch
2024-05-30 16:20:22 -07:00
de781b37c8 rm unused infill 2024-05-29 11:26:47 -07:00
3e21799377 rm unused system prompt 2024-05-29 11:26:47 -07:00
26a00a0410 use ffi for tokenizing/detokenizing 2024-05-29 11:26:47 -07:00
646371f56d Merge pull request #3278 from zhewang1-intc/rebase_ollama_main
Enabling ollama to run on Intel GPUs with SYCL backend
2024-05-28 16:30:50 -07:00
92c81e8117 Give the final model loading more time
On some systems, 1 minute isn't sufficient to finish the load after it
hits 100%. This creates 2 distinct timers, both set to the same value
for now, so we can refine the timeouts independently later.
2024-05-28 09:08:10 -07:00
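A sketch of the two-timer wait described above; the durations and channel wiring are assumptions for illustration:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// two distinct timers; both values are placeholders and currently equal,
// so they can be tuned independently later
const (
	stallTimeout = time.Minute // budget between progress updates
	finalTimeout = time.Minute // extra budget after progress hits 100%
)

func waitForReady(progress <-chan float32, ready <-chan struct{}) error {
	t := time.NewTimer(stallTimeout)
	defer t.Stop()
	for {
		select {
		case p := <-progress:
			if p >= 1.0 {
				t.Reset(finalTimeout) // switch timers at 100%
			} else {
				t.Reset(stallTimeout)
			}
		case <-ready:
			return nil
		case <-t.C:
			return errors.New("timed out waiting for model to load")
		}
	}
}

func main() {
	progress, ready := make(chan float32), make(chan struct{})
	go func() { progress <- 1.0; close(ready) }()
	fmt.Println(waitForReady(progress, ready))
}
```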
7487229c34 llm/server.go: Fix 2 minor typos (#4661)
Signed-off-by: Lei Jitang <leijitang@outlook.com>
2024-05-27 17:21:10 -07:00
0165ba1651 Merge pull request #4638 from dhiltgen/better_error
Report better warning on client closed abort of load
2024-05-25 14:32:28 -07:00
c4209d6d21 Report better warning on client closed abort of load
If the client closes the connection before we finish loading the model,
we abort, so let's make the log message clearer about why, to help users
understand this failure mode
2024-05-25 09:23:28 -07:00
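A sketch of how the clearer warning can be derived from the request context; doLoad and the wiring are placeholders, not ollama's actual code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"time"
)

// doLoad is a placeholder for the real model load
func doLoad(ctx context.Context) error {
	select {
	case <-time.After(time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func load(ctx context.Context) error {
	if err := doLoad(ctx); err != nil {
		// the client hanging up cancels the request context; report that
		// explicitly instead of a generic load failure
		if errors.Is(ctx.Err(), context.Canceled) {
			slog.Warn("model load aborted: client closed the connection before loading finished")
		}
		return fmt.Errorf("load: %w", err)
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the client disconnecting immediately
	fmt.Println(load(ctx))
}
```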
d51f15257c Update llm/ggml.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
2024-05-24 16:10:43 -07:00
8f440d579a fix q5_0, q5_1 2024-05-24 16:01:46 -07:00
4cc3be3035 Move envconfig and consolidate env vars (#4608) 2024-05-24 14:57:15 -07:00
fd5971be0b support ollama run on Intel GPUs 2024-05-24 11:18:27 +08:00
714adb8bd1 bump (#4597) 2024-05-23 14:16:26 -07:00
95b1133d0c Merge pull request #4547 from dhiltgen/load_progress
Wire up load progress
2024-05-23 14:06:02 -07:00
b37b496a12 Wire up load progress
This doesn't expose any UX yet, but wires up the initial server portion
of progress reporting during load
2024-05-23 13:36:48 -07:00
d6f692ad1a Add support for IQ1_S, IQ3_S, IQ2_S, IQ4_XS, IQ4_NL (#4322)
Co-authored-by: ManniX-ITA <20623405+mann1x@users.noreply.github.com>
2024-05-23 13:21:49 -07:00
38255d2af1 Use flash attention flag for now (#4580)
* put flash attention behind flag for now

* add test

* remove print

* up timeout for scheduler tests
2024-05-22 21:52:09 -07:00
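ollama documents OLLAMA_FLASH_ATTENTION as the gating environment variable; a minimal sketch of such a gate (the parsing details are illustrative):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// flashAttnEnabled keeps flash attention behind an env var until it's
// proven out; off unless explicitly set to a truthy value
func flashAttnEnabled() bool {
	v, err := strconv.ParseBool(os.Getenv("OLLAMA_FLASH_ATTENTION"))
	return err == nil && v
}

func main() {
	fmt.Println("flash attention:", flashAttnEnabled())
}
```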
171eb040fc simplify safetensors reading 2024-05-21 11:28:22 -07:00
bbbd9f20f3 cleanup 2024-05-20 16:13:57 -07:00
547132e820 bpe pretokenizer 2024-05-20 16:13:57 -07:00
c8cf0d94ed llama3 conversion 2024-05-20 16:13:57 -07:00
5cab13739e set llama.cpp submodule commit to 614d3b9 2024-05-20 15:28:17 -07:00
8aadad9c72 updated updateURL 2024-05-20 15:24:32 -07:00
Sam e15307fdf4 feat: add support for flash_attn (#4120)
* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: add flash_attn support
2024-05-20 13:36:03 -07:00
583c1f472c update llama.cpp submodule to 614d3b9 (#4414) 2024-05-16 13:53:09 -07:00
c48c1d7c46 Port cuda/rocm skip build vars to linux
Windows already implements these; carry them over to Linux.
2024-05-15 15:56:43 -07:00
d1692fd3e0 fix the cpu estimatedTotal memory + get the expiry time for loading models (#4461) 2024-05-15 15:43:16 -07:00
853ae490e1 Sanitize the env var debug log
Only dump env vars we care about in the logs
2024-05-15 14:42:57 -07:00
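A sketch of an allowlist-style env dump; the prefix list is an assumption for illustration:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// relevantEnv dumps only the env vars we care about, instead of the whole
// environment (which may contain secrets or noise)
func relevantEnv() []string {
	var out []string
	for _, kv := range os.Environ() {
		if strings.HasPrefix(kv, "OLLAMA_") || strings.HasPrefix(kv, "CUDA_") ||
			strings.HasPrefix(kv, "ROCR_") || strings.HasPrefix(kv, "HIP_") {
			out = append(out, kv)
		}
	}
	return out
}

func main() {
	fmt.Println(relevantEnv())
}
```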
0e331c7168 Merge pull request #4328 from ollama/mxyng/mem
count memory up to NumGPU if set by user
2024-05-14 13:47:44 -07:00
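A sketch of capping the per-layer memory count at the user's num_gpu; shapes and names are hypothetical, not ollama's actual estimator:

```go
package main

import "fmt"

// gpuMemory estimates VRAM for offloaded layers, counting only up to the
// user's num_gpu when one is set (negative means auto: count everything)
func gpuMemory(layerSizes []uint64, numGPU int) uint64 {
	n := len(layerSizes)
	if numGPU >= 0 && numGPU < n { // user capped the offload
		n = numGPU
	}
	var total uint64
	for _, sz := range layerSizes[:n] {
		total += sz
	}
	return total
}

func main() {
	layers := []uint64{512 << 20, 512 << 20, 512 << 20, 512 << 20}
	fmt.Println(gpuMemory(layers, 2)) // counts only the first 2 layers
}
```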
6845988807 Ollama ps command for showing currently loaded models (#4327) 2024-05-13 17:17:36 -07:00
1d359e737e typo 2024-05-13 14:18:34 -07:00
50b9056e09 count memory up to NumGPU 2024-05-13 14:13:10 -07:00
92ca2cca95 Revert "only forward some env vars"
This reverts commit ce3b212d12.
2024-05-10 22:53:21 -07:00
c4014e73a2 Fall back to CPU runner with zero layers 2024-05-10 15:09:48 -07:00
1eb382da5a add phi2 mem 2024-05-10 12:13:28 -07:00
bb6fd02298 Don't clamp ctx size in PredictServerFit (#4317)
* don't clamp ctx size in `PredictServerFit`

* minimum 4 context

* remove context warning
2024-05-10 10:17:12 -07:00
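A sketch of clamping only the lower bound, per the "minimum 4 context" note above (names are illustrative):

```go
package main

import "fmt"

// effectiveCtx honors large user-requested contexts instead of clamping
// them down, but never goes below a floor of 4
func effectiveCtx(requested int) int {
	if requested < 4 {
		return 4
	}
	return requested
}

func main() {
	fmt.Println(effectiveCtx(1), effectiveCtx(8192)) // 4 8192
}
```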
cf442cd57e fix typo 2024-05-09 16:23:37 -07:00
ce3b212d12 only forward some env vars 2024-05-09 15:16:09 -07:00
58876091f7 log clean up 2024-05-09 14:55:36 -07:00
d0425f26cf Merge pull request #4294 from dhiltgen/harden_subprocess_reaping
Harden subprocess reaping
2024-05-09 14:02:16 -07:00
cfa84b8470 add done_reason to the api (#4235) 2024-05-09 13:30:14 -07:00
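A sketch of a response struct carrying the new field; the surrounding fields and example values are assumptions, only done/done_reason reflect the change above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChatResponse is illustrative; done_reason explains why generation ended
type ChatResponse struct {
	Model      string `json:"model"`
	Done       bool   `json:"done"`
	DoneReason string `json:"done_reason,omitempty"` // e.g. "stop", "length"
}

func main() {
	b, _ := json.Marshal(ChatResponse{Model: "llama3", Done: true, DoneReason: "stop"})
	fmt.Println(string(b)) // {"model":"llama3","done":true,"done_reason":"stop"}
}
```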