ollama/model/parsers/qwen3coder.go at 05ba4ca1f4b356df50ed6eede0e2bcdc76b31fb8

mirror of https://github.com/ollama/ollama.git synced 2025-11-11 20:37:31 +01:00


Devon Rifkin 05ba4ca1f4 parsers: fix unicode handling for qwen3-coder

When trimming whitespace at the end of every chunk, we were iterating
backwards over the string byte-by-byte instead of rune-by-rune.

As an example of how this can cause corruption, suppose we have the
multi-byte character ✅ (`"\u2705"`), which is encoded in UTF-8 as the
three bytes `0xE2 0x9C 0x85`. Taken on its own as a code point, `0x85`
is U+0085 (NEL), which passes `unicode.IsSpace()`. Because we were
iterating byte-by-byte, we mistakenly sliced in the middle of the rune,
removing `0x85` and leaving `0xE2 0x9C`, which, beyond being the wrong
place to slice, is not even valid UTF-8.

`trailingWhitespaceLen()` was modified to count from the end in a
rune-aware way. Tests with various multibyte unicode characters were
also added.


Fixes: #12414

2025-09-25 15:47:46 -07:00

13 KiB
