Parth Sareen
108fe02165
sample: make mutations in transforms explicit (#9743)
* updated minP to use early exit making use of sorted tokens
2025-03-17 11:24:18 -07:00
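The commit above notes that minP can exit early once tokens are sorted by descending probability. A minimal sketch of that idea — the `token` struct and field names here are illustrative, not the actual `ollama/sample` types:

```go
package main

import "fmt"

// token pairs a vocabulary id with its probability (hypothetical type).
type token struct {
	id   int
	prob float32
}

// minP keeps only tokens whose probability is at least p times the top
// probability. Because the input is sorted descending, the first token
// below the threshold lets us stop scanning entirely.
func minP(tokens []token, p float32) []token {
	if len(tokens) == 0 {
		return tokens
	}
	threshold := tokens[0].prob * p
	for i, t := range tokens {
		if t.prob < threshold {
			return tokens[:i] // early exit: all remaining tokens are smaller
		}
	}
	return tokens
}

func main() {
	sorted := []token{{0, 0.5}, {1, 0.3}, {2, 0.15}, {3, 0.05}}
	// threshold = 0.5 * 0.4 = 0.2, so only the first two tokens survive
	fmt.Println(len(minP(sorted, 0.4))) // → 2
}
```

The early exit turns a full filter pass into a prefix scan, which is why it depends on topK having sorted the tokens first.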
Parth Sareen
5c0b663969
sample: separate softmax and temperature transforms (#9732)
2025-03-13 09:53:27 -07:00
ParthSareen
4aeb67ef4c
sample: do all sorting in topK
2025-03-12 11:59:17 -07:00
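Centralizing sorting in topK means downstream transforms (minP, topP) can assume descending order. A minimal sketch under that assumption — the `token` type and the full-slice sort are illustrative, not the real implementation (which may use a partial-selection algorithm):

```go
package main

import (
	"fmt"
	"sort"
)

// token pairs a vocabulary id with its probability (hypothetical type).
type token struct {
	id   int
	prob float32
}

// topK sorts tokens by descending probability and truncates to the
// first k, so later transforms can rely on sorted input.
func topK(tokens []token, k int) []token {
	sort.Slice(tokens, func(i, j int) bool {
		return tokens[i].prob > tokens[j].prob
	})
	if k < len(tokens) {
		tokens = tokens[:k]
	}
	return tokens
}

func main() {
	toks := []token{{0, 0.1}, {1, 0.6}, {2, 0.3}}
	out := topK(toks, 2)
	fmt.Println(out[0].id, out[1].id) // → 1 2
}
```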
Parth Sareen
7e34f4fbfa
sample: add numerical stability to temperature/softmax transform (#9631)
2025-03-10 14:43:53 -07:00
Jeffrey Morgan
e093db92c4
sample: temporarily use grammars for constrained generation in new engine (#9586)
2025-03-10 16:17:39 +01:00
Parth Sareen
0682dae027
sample: improve ollama engine sampler performance (#9374)
...
This change brings in various interface cleanups along with a large improvement in sampler performance.
Tested with llama3.2 on a local machine.
Improves performance from ~70 tokens/s to ~135 tokens/s with topK(40) enabled.
Without topK, performance is ~110 tokens/s.
2025-03-07 12:37:48 -08:00
Parth Sareen
c245b0406f
sample: remove transforms from greedy sampling (#9377)
2025-02-27 15:44:53 -08:00
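Greedy sampling can skip softmax and temperature entirely, since argmax is invariant under both: dividing by a positive temperature and applying a monotonic softmax never changes which logit is largest. A sketch of greedy selection over raw logits (illustrative, not the actual implementation):

```go
package main

import "fmt"

// greedy returns the index of the largest logit. No transforms are
// needed: temperature and softmax are monotonic, so they preserve the
// argmax.
func greedy(logits []float32) int {
	best := 0
	for i, l := range logits {
		if l > logits[best] {
			best = i
		}
	}
	return best
}

func main() {
	fmt.Println(greedy([]float32{0.1, 2.5, -1.0, 2.4})) // → 1
}
```

Dropping the transforms saves two full passes over the vocabulary per token when temperature is 0.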
Parth Sareen
0b7e1676eb
sample: add sampling package for new engine (#8410)
2025-02-24 17:19:01 -08:00