When CUDA peer access is enabled, multi-GPU inference produces garbage output. This is a known upstream bug in llama.cpp (or the NVIDIA driver). Until the upstream bug is fixed, disable CUDA peer access temporarily to ensure correct output. See #961.
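
A minimal sketch of what the workaround looks like at the CUDA runtime level, assuming peer access may have been enabled earlier in setup. This is an illustration only, not the actual ollama/llama.cpp change; the helper name disable_all_peer_access is hypothetical.

```c
// Hypothetical sketch: walk every device pair and disable CUDA peer access
// so device-to-device transfers fall back to staging through the host,
// sidestepping the peer-access path that corrupts multi-GPU output.
#include <cuda_runtime.h>
#include <stdio.h>

static void disable_all_peer_access(void) {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        cudaSetDevice(dev);
        for (int peer = 0; peer < n; ++peer) {
            if (peer == dev) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, dev, peer);
            if (!can) continue;  // pair cannot peer anyway; nothing to disable
            cudaError_t err = cudaDeviceDisablePeerAccess(peer);
            // cudaErrorPeerAccessNotEnabled just means access was never
            // enabled for this pair, which is fine for our purposes.
            if (err != cudaSuccess && err != cudaErrorPeerAccessNotEnabled) {
                fprintf(stderr, "disable peer %d->%d: %s\n",
                        dev, peer, cudaGetErrorString(err));
            }
            cudaGetLastError();  // clear any sticky error before continuing
        }
    }
}
```

Disabling peer access trades some copy bandwidth for correctness: transfers route through host memory instead of directly over NVLink/PCIe peer mappings, which is acceptable as a temporary measure until the upstream fix lands.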