Mirror of https://github.com/ollama/ollama.git, synced 2025-11-12 15:57:17 +01:00
Allocating, and in particular freeing, CUDA host buffers is expensive and can cause a significant performance hit if we do it for every token. Using normal system memory avoids this cost and also gives the OS more flexibility in managing that memory. This patch has no direct performance impact, positive or negative, but it makes a difference once we start freeing memory correctly.
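As a rough illustration of the idea (not the actual patch), the sketch below allocates a host-side buffer from ordinary system memory with C11 `aligned_alloc` and releases it with `free`. The names `host_buf`, `host_buf_alloc`, `host_buf_free` and the 64-byte `HOST_BUF_ALIGN` are hypothetical and chosen only for this example. A pinned variant would presumably go through CUDA runtime calls such as `cudaMallocHost` and `cudaFreeHost` instead, which is the expensive path described above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical alignment for this sketch; not taken from the patch. */
#define HOST_BUF_ALIGN 64

/* Hypothetical descriptor for a host-side buffer. */
typedef struct {
    void  *data;
    size_t size;
} host_buf;

/*
 * Allocate from ordinary (pageable) system memory.
 * A pinned implementation would call into the CUDA runtime here instead;
 * this version only touches the C allocator, so alloc/free stay cheap
 * and the OS remains free to page the memory.
 * Returns 0 on success, -1 on failure.
 */
static int host_buf_alloc(host_buf *buf, size_t size) {
    assert((HOST_BUF_ALIGN & (HOST_BUF_ALIGN - 1)) == 0); /* power of two */

    /* aligned_alloc expects the size to be a multiple of the alignment. */
    size_t padded = (size + HOST_BUF_ALIGN - 1) / HOST_BUF_ALIGN * HOST_BUF_ALIGN;
    if (padded == 0) {
        padded = HOST_BUF_ALIGN;
    }

    void *p = aligned_alloc(HOST_BUF_ALIGN, padded);
    if (p == NULL) {
        return -1;
    }

    memset(p, 0, padded);
    buf->data = p;
    buf->size = padded;
    return 0;
}

/* Release the buffer with plain free(); no CUDA call is involved. */
static void host_buf_free(host_buf *buf) {
    free(buf->data);
    buf->data = NULL;
    buf->size = 0;
}
```

With this shape, freeing a buffer per token costs no more than an ordinary `free`, which is why the change only pays off once buffers are actually being released.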