# Benchmark
Go benchmark tests that measure end-to-end performance of a running Ollama server. Run these tests to evaluate model inference performance on your hardware and measure the impact of code changes.
## When to use
Run these benchmarks when:
- Making changes to the model inference engine
- Modifying model loading/unloading logic
- Changing prompt processing or token generation code
- Implementing a new model architecture
- Testing performance across different hardware setups
## Prerequisites
- Ollama server running locally with `ollama serve` on `127.0.0.1:11434`
## Usage and Examples
> [!NOTE]
> All commands must be run from the root directory of the Ollama project.
Basic syntax:

```bash
go test -bench=. ./benchmark/... -m $MODEL_NAME
```
Required flags:

- `-bench=.`: Run all benchmarks
- `-m`: Model name to benchmark
Optional flags:

- `-count N`: Number of times to run the benchmark (useful for statistical analysis)
- `-timeout T`: Maximum time for the benchmark to run (e.g. "10m" for 10 minutes)
Common usage patterns:

Single benchmark run with a model specified:

```bash
go test -bench=. ./benchmark/... -m llama3.3
```
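Multiple runs for statistical analysis, combining the optional flags described above (the count and timeout values here are only examples):

```bash
go test -bench=. ./benchmark/... -m llama3.3 -count 5 -timeout 30m
```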
## Output metrics
The benchmark reports several key metrics:

- `gen_tok/s`: Generated tokens per second
- `prompt_tok/s`: Prompt processing tokens per second
- `ttft_ms`: Time to first token in milliseconds
- `load_ms`: Model load time in milliseconds
- `gen_tokens`: Total tokens generated
- `prompt_tokens`: Total prompt tokens processed
Each benchmark runs two scenarios:
- Cold start: Model is loaded from disk for each test
- Warm start: Model is pre-loaded in memory
Three prompt lengths are tested for each scenario:
- Short prompt (100 tokens)
- Medium prompt (500 tokens)
- Long prompt (1000 tokens)
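For orientation, here is a minimal sketch of how such an end-to-end benchmark can be structured. It is not the project's actual benchmark code: it assumes the server is reachable at `127.0.0.1:11434`, calls the `/api/generate` endpoint with streaming disabled, and reports a generation rate via `b.ReportMetric`. The model name, stand-in prompts, and the single reported metric are illustrative.

```go
package benchmark

import (
	"bytes"
	"encoding/json"
	"net/http"
	"strings"
	"testing"
	"time"
)

// generateResponse captures the response fields this sketch reads from
// Ollama's /api/generate endpoint.
type generateResponse struct {
	EvalCount    int           `json:"eval_count"`    // tokens generated
	EvalDuration time.Duration `json:"eval_duration"` // generation time in nanoseconds
}

func BenchmarkGenerate(b *testing.B) {
	// Stand-in prompts roughly matching the short/medium/long scenarios above.
	prompts := map[string]string{
		"short":  strings.Repeat("word ", 100),
		"medium": strings.Repeat("word ", 500),
		"long":   strings.Repeat("word ", 1000),
	}

	for name, prompt := range prompts {
		b.Run(name, func(b *testing.B) {
			payload, err := json.Marshal(map[string]any{
				"model":  "llama3.3", // supplied via the -m flag in the real tests
				"prompt": prompt,
				"stream": false,
			})
			if err != nil {
				b.Fatal(err)
			}

			var tokens int
			var duration time.Duration

			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				resp, err := http.Post("http://127.0.0.1:11434/api/generate",
					"application/json", bytes.NewReader(payload))
				if err != nil {
					b.Fatal(err)
				}
				var out generateResponse
				if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
					b.Fatal(err)
				}
				resp.Body.Close()

				tokens += out.EvalCount
				duration += out.EvalDuration
			}

			// Aggregate generation rate, analogous to the gen_tok/s metric above.
			if duration > 0 {
				b.ReportMetric(float64(tokens)/duration.Seconds(), "gen_tok/s")
			}
		})
	}
}
```

A sketch like this would be invoked with the same `go test -bench=. ./benchmark/...` command shown earlier.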