Commit Graph

693 Commits

bee2f4a3b0 Record GPU usage information
This records more GPU usage information for eventual UX inclusion.
2024-05-08 14:45:39 -07:00
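For context, a minimal Go sketch of the kind of per-GPU record such a change might maintain; the type and field names here are hypothetical, not the ones this commit introduces:

```go
package main

import "fmt"

// GpuUsage is a hypothetical container for the kind of per-GPU
// information a server might record for later display in a UI.
type GpuUsage struct {
	ID          string
	TotalMemory uint64 // bytes
	FreeMemory  uint64 // bytes
}

func main() {
	g := GpuUsage{ID: "GPU-0", TotalMemory: 24 << 30, FreeMemory: 20 << 30}
	fmt.Printf("%s: %d/%d MiB free\n", g.ID, g.FreeMemory>>20, g.TotalMemory>>20)
}
```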
eeb695261f skip if same quantization 2024-05-07 17:44:19 -07:00
72700279e2 Detect noexec and report a better error
This will bubble up a much more informative error message if noexec
is preventing us from running the subprocess
2024-05-07 16:46:15 -07:00
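One way to detect this condition, sketched in Go: scan /proc/mounts for the mount containing the runner directory and check its options. This is a Linux-specific illustration, not the commit's actual implementation:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// hasNoexec reports whether the mount containing dir is flagged noexec.
// It scans /proc/mounts for the longest mount point that is a prefix of
// dir and inspects that mount's options.
func hasNoexec(dir string) (bool, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()

	var best string
	noexec := false
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 4 {
			continue
		}
		mnt, opts := fields[1], fields[3]
		if strings.HasPrefix(dir, mnt) && len(mnt) > len(best) {
			best = mnt
			noexec = false
			for _, o := range strings.Split(opts, ",") {
				if o == "noexec" {
					noexec = true
				}
			}
		}
	}
	return noexec, sc.Err()
}

func main() {
	dir := os.TempDir()
	if ok, err := hasNoexec(dir); err == nil && ok {
		fmt.Printf("%s is mounted noexec; subprocesses cannot run from it\n", dir)
	}
}
```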
1e0a669f75 Merge pull request #3682 from ollama/mxyng/quantize-all-the-things
quantize any fp16/fp32 model
2024-05-07 15:20:49 -07:00
4736391bfb llm: add minimum based on layer size 2024-05-06 17:04:19 -07:00
01811c176a comments 2024-05-06 15:24:01 -07:00
9685c34509 quantize any fp16/fp32 model
- FROM /path/to/{safetensors,pytorch}
- FROM /path/to/fp{16,32}.bin
- FROM model:fp{16,32}
2024-05-06 15:24:01 -07:00
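A Go sketch of how the three FROM forms listed in the commit message might be told apart before deciding whether quantization applies; the suffix checks and labels are illustrative assumptions, not the commit's code:

```go
package main

import (
	"fmt"
	"strings"
)

// classifyFrom distinguishes the three FROM forms from the commit message.
func classifyFrom(from string) string {
	switch {
	case strings.HasSuffix(from, ".safetensors"), strings.HasSuffix(from, ".pth"):
		return "safetensors/pytorch weights"
	case strings.HasSuffix(from, "fp16.bin"), strings.HasSuffix(from, "fp32.bin"):
		return "raw fp16/fp32 blob"
	case strings.Contains(from, ":fp16"), strings.Contains(from, ":fp32"):
		return "existing model tagged fp16/fp32"
	default:
		return "unknown"
	}
}

func main() {
	for _, f := range []string{"/models/llama.safetensors", "/models/fp16.bin", "llama3:fp16"} {
		fmt.Println(f, "->", classifyFrom(f))
	}
}
```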
380378cc80 Use our libraries first
Trying to live off the land for CUDA libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly.
2024-05-06 14:23:29 -07:00
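A minimal sketch of the idea, assuming a Linux loader path: put the directory of bundled libraries ahead of anything already on LD_LIBRARY_PATH so the versions we compiled against win over system copies. The "lib" layout next to the executable is an assumption:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// libraryEnv prepends our bundled library directory to the loader path,
// so the bundled CUDA libraries are found before any system copies.
func libraryEnv(bundledLibDir string) string {
	path := bundledLibDir
	if existing := os.Getenv("LD_LIBRARY_PATH"); existing != "" {
		path = bundledLibDir + string(os.PathListSeparator) + existing
	}
	return "LD_LIBRARY_PATH=" + path
}

func main() {
	exe, _ := os.Executable()
	fmt.Println(libraryEnv(filepath.Join(filepath.Dir(exe), "lib")))
}
```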
ed740a2504 Fix no slots available error with concurrent requests (#4160) 2024-05-06 14:22:53 -07:00
1b0e6c9c0e Fix llava models not working after first request (#4164)
* fix llava models not working after first request

* individual requests only for llava models
2024-05-05 20:50:31 -07:00
f56aa20014 Centralize server config handling
This moves all the env var reading into one central module
and logs the loaded config once at startup, which should
help when troubleshooting user server logs
2024-05-05 16:49:50 -07:00
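A simplified Go sketch of such a central config module, using a few of the server's well-known OLLAMA_* variables; the struct and defaults are illustrative, not the project's actual envconfig:

```go
package main

import (
	"log"
	"os"
	"strconv"
)

// Config gathers env var reading in one place.
type Config struct {
	Host        string
	NumParallel int
	Debug       bool
}

// Load reads every variable once and logs the result at startup.
func Load() Config {
	c := Config{Host: "127.0.0.1:11434", NumParallel: 1}
	if v := os.Getenv("OLLAMA_HOST"); v != "" {
		c.Host = v
	}
	if v, err := strconv.Atoi(os.Getenv("OLLAMA_NUM_PARALLEL")); err == nil {
		c.NumParallel = v
	}
	c.Debug = os.Getenv("OLLAMA_DEBUG") != ""
	log.Printf("config: %+v", c) // logged once so it shows up in user server logs
	return c
}

func main() { _ = Load() }
```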
44869c59d6 omit prompt and generate settings from final response 2024-05-03 17:00:02 -07:00
321d57e1a0 Removing goroutine calling .wait from load. 2024-05-01 18:51:10 +00:00
ba26c7aa00 it will always return an error due to Kill() discarding Wait() errors 2024-05-01 18:51:10 +00:00
63c763685f log when waiting for the process to stop, to help debug when other tasks execute during this wait.
expire timer: clear the timer reference because it will not be reused.
close: clean up expireTimer if calling code has not already done this.
2024-05-01 18:51:10 +00:00
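The timer hygiene described above, sketched in Go: the expiration path drops its timer reference (it will not be reused), and close stops the timer if the caller has not already. Names here follow the commit message but the surrounding type is hypothetical:

```go
package main

import (
	"sync"
	"time"
)

type runner struct {
	mu          sync.Mutex
	expireTimer *time.Timer
}

func (r *runner) scheduleExpire(d time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.expireTimer = time.AfterFunc(d, func() {
		r.mu.Lock()
		defer r.mu.Unlock()
		r.expireTimer = nil // fired; the reference will not be reused
		// ... unload the model here ...
	})
}

func (r *runner) close() {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.expireTimer != nil { // clean up if calling code has not already
		r.expireTimer.Stop()
		r.expireTimer = nil
	}
}

func main() {
	r := &runner{}
	r.scheduleExpire(10 * time.Millisecond)
	r.close()
}
```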
948114e3e3 fix sched to wait for the runner to terminate so the following VRAM check is more accurate 2024-05-01 18:51:10 +00:00
f0c454ab57 gpu: add 512MiB to darwin minimum, metal doesn't have partial offloading overhead (#4068) 2024-05-01 11:46:03 -04:00
fcf4d60eee llm: add back check for empty token cache 2024-04-30 17:38:44 -04:00
e33d5c2dbc update llama.cpp commit to 952d03d 2024-04-30 17:31:20 -04:00
18d9a7e1f1 update llama.cpp submodule to f364eb6 (#4060) 2024-04-30 17:25:39 -04:00
23d23409a0 Update llama.cpp (#4036)
* Bump llama.cpp to b2761

* Adjust types for bump
2024-04-29 23:18:48 -04:00
7aa08a77ca llm: don't cap context window limit to training context window (#3988) 2024-04-29 10:07:30 -04:00
8a65717f55 Do not build AVX runners on ARM64 2024-04-26 23:55:32 -06:00
b438d485f1 Use architecture specific folders in the generate script 2024-04-26 23:34:12 -06:00
86e67fc4a9 Add import declaration for windows,arm64 to llm.go 2024-04-26 23:23:53 -06:00
e4859c4563 Fine grain control over windows generate steps
This will speed up CI, which already tries to only build static for unit tests
2024-04-26 15:49:46 -07:00
0b5c589ca2 Merge pull request #3966 from dhiltgen/bump
Fix target in gen_windows.ps1
2024-04-26 15:36:53 -07:00
65fadddc85 Merge pull request #3964 from ollama/mxyng/weights
fix gemma, command-r layer weights
2024-04-26 15:23:33 -07:00
ed5fb088c4 Fix target in gen_windows.ps1 2024-04-26 15:10:42 -07:00
f81f308118 fix gemma, command-r layer weights 2024-04-26 15:00:55 -07:00
bb31def011 return code 499 when user cancels request while a model is loading (#3955) 2024-04-26 17:38:29 -04:00
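A self-contained Go sketch of returning nginx's non-standard 499 when the client cancels mid-load; the handler and loadModel stand-in are assumptions, not the project's handlers:

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"time"
)

const statusClientClosedRequest = 499 // nginx's non-standard "client closed request"

// loadModel stands in for a potentially slow model load that honors
// request cancellation.
func loadModel(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func handler(w http.ResponseWriter, r *http.Request) {
	if err := loadModel(r.Context()); err != nil {
		if errors.Is(err, context.Canceled) {
			// The user cancelled while the model was loading.
			w.WriteHeader(statusClientClosedRequest)
			return
		}
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/generate", handler)
	http.ListenAndServe("127.0.0.1:8080", nil)
}
```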
5c0c2d1d09 Merge pull request #3954 from dhiltgen/ci_fixes
Put back non-avx CPU build for windows
2024-04-26 13:09:03 -07:00
421c878a2d Put back non-avx CPU build for windows 2024-04-26 12:44:07 -07:00
85801317d1 Fix clip log import 2024-04-26 09:43:46 -07:00
2ed0d65948 Bump llama.cpp to b2737 2024-04-26 09:43:28 -07:00
8671fdeda6 Refactor windows generate for more modular usage 2024-04-26 08:35:50 -07:00
8feb97dc0d Move cuda/rocm dependency gathering into generate script
This will make it simpler for CI to accumulate artifacts from prior steps
2024-04-25 22:38:44 -07:00
de4ded68b0 Merge pull request #3923 from ollama/mxyng/mem
only count output tensors
2024-04-25 16:34:17 -07:00
9b5a3c5991 Merge pull request #3914 from dhiltgen/mac_perf
Improve mac parallel performance
2024-04-25 16:28:31 -07:00
993cf8bf55 llm: limit generation to 10x context size to avoid run-on generations (#3918)
* llm: limit generation to 10x context size to avoid run-on generations

* add comment

* simplify condition statement
2024-04-25 19:02:30 -04:00
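The guard itself is simple; a Go sketch of the stop condition, with a toy generation loop standing in for the real decode path:

```go
package main

import "fmt"

// shouldStop reports a likely run-on generation: the response has
// produced more than 10x the context size in tokens.
func shouldStop(generated, numCtx int) bool {
	return generated > 10*numCtx
}

func main() {
	numCtx := 2048
	for generated := 0; ; generated++ { // stand-in for the decode loop
		if shouldStop(generated, numCtx) {
			fmt.Printf("stopping after %d tokens (ctx %d)\n", generated, numCtx)
			return
		}
	}
}
```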
7bb7cb8a60 only count output tensors 2024-04-25 15:24:08 -07:00
ddf5c09a9b use matrix multiplication kernels in more cases 2024-04-25 13:58:54 -07:00
5f73c08729 Remove trailing spaces (#3889) 2024-04-25 14:32:26 -04:00
6e76348df7 Merge pull request #3834 from dhiltgen/not_found_in_path
Report errors on server lookup instead of path lookup failure
2024-04-24 10:50:48 -07:00
14476d48cc fixes for gguf (#3863) 2024-04-23 20:57:20 -07:00
5445aaa94e Add back memory escape valve
If we get our predictions wrong, this can be used to
set a lower memory limit as a workaround. Recent multi-GPU
refactoring accidentally removed it, so this adds it back.
2024-04-23 17:09:02 -07:00
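A sketch of such an escape valve in Go: trust a user-supplied byte limit over our own estimate. The environment variable name here is an assumption about this commit, not confirmed by the log:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// clampVRAM returns the user's limit if one is set (variable name assumed)
// and it is lower than our own estimate; otherwise the estimate stands.
func clampVRAM(estimated uint64) uint64 {
	v, err := strconv.ParseUint(os.Getenv("OLLAMA_MAX_VRAM"), 10, 64)
	if err == nil && v > 0 && v < estimated {
		return v
	}
	return estimated
}

func main() {
	fmt.Println(clampVRAM(8 << 30)) // estimate of 8 GiB, possibly clamped
}
```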
058f6cd2cc Move nested payloads to installer and zip file on windows
Now that the llm runner is an executable and not just a DLL, more users are facing
problems with security policy configurations on Windows that prevent them from
writing to directories and then executing binaries from the same location.
This change removes payloads from the main executable on Windows and shifts them
over to be packaged in the installer and discovered based on the executable's location.
This also adds a new zip file for people who want to "roll their own" installation model.
2024-04-23 16:14:47 -07:00
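Discovery "based on the executable's location" can be as simple as the following Go sketch; the "runners" directory layout next to the binary is an assumption for illustration:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// payloadDir resolves where the installer dropped the runner payloads,
// relative to the running executable, instead of unpacking to a temp dir.
func payloadDir() (string, error) {
	exe, err := os.Executable()
	if err != nil {
		return "", err
	}
	return filepath.Join(filepath.Dir(exe), "runners"), nil
}

func main() {
	dir, err := payloadDir()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("expecting runners in", dir)
}
```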
58888a74bc Detect and recover if runner removed
Tmp cleaners can nuke the file out from underneath us. This detects the missing
runner and re-initializes the payloads.
2024-04-23 10:05:26 -07:00
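The recovery path sketched in Go: stat the runner before launch and re-extract if a temp cleaner removed it. The extract callback is a hypothetical stand-in for the real payload setup:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// ensureRunner re-initializes the payloads if the runner binary has been
// deleted out from underneath us.
func ensureRunner(path string, extract func() error) error {
	if _, err := os.Stat(path); errors.Is(err, os.ErrNotExist) {
		fmt.Println("runner missing, re-initializing payloads")
		return extract()
	} else if err != nil {
		return err
	}
	return nil
}

func main() {
	_ = ensureRunner("/tmp/ollama/runner", func() error { return nil })
}
```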
cc5a71e0e3 Merge pull request #3709 from remy415/custom-gpu-defs
Adds support for customizing GPU build flags in llama.cpp
2024-04-23 09:28:34 -07:00
e83bcf7f9a Merge pull request #3836 from ollama/mxyng/mixtral
fix: mixtral graph
2024-04-23 09:15:10 -07:00