From e3f3043f5bc0ca254f7f88a4736091f4e44b5aa1 Mon Sep 17 00:00:00 2001 From: Bruce MacDonald Date: Tue, 25 Feb 2025 14:59:39 -0800 Subject: [PATCH] Update add-a-model.md --- docs/add-a-model.md | 368 ++++++++++++++++++++++++++++---------------- 1 file changed, 239 insertions(+), 129 deletions(-) diff --git a/docs/add-a-model.md b/docs/add-a-model.md index f3d162fa9..1358c4dae 100644 --- a/docs/add-a-model.md +++ b/docs/add-a-model.md @@ -2,7 +2,7 @@ > **Note**: This guide and the Go inference engine are in early development and will be updated as implementation details evolve. -This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to deploying your model to ollama.com. +This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to publishing your model to ollama.com. ## Architecture Overview @@ -12,96 +12,49 @@ Below is a diagram showing Ollama's inference engine architecture layers and how graph TB subgraph Models["Model Layer: LLM Implementations"] direction TB - llama["model/models/llama/model.go"] - mllama["model/models/mllama/model.go"] - qwen["model/models/qwen2/model.go"] - qwen_vl["model/models/qwen2vl/model.go"] + llama["model/models/llama"] + mllama["model/models/mllama"] + qwen["model/models/qwen2"] + etc["...etc"] - note1["Each model implements a specific architecture - - Defines model parameters - - Implements forward pass"] + note1[" Each model implements a
specific architecture:
- Defines model parameters
- Implements forward pass"] end subgraph ML_Ops["Neural Network Operations"] direction TB - nn_ops["nn/ - linear.go - Matrix operations - embedding.go - Token embeddings - normalization.go - Layer normalization - convolution.go - Conv operations"] + nn_ops[" nn/
linear.go: Matrix multiplication
embedding.go: Token embedding lookups
normalization.go: Layer norm operations
convolution.go: Convolutional operations "] - backend["ml/backend.go - Hardware Abstraction Layer - - Defines tensor operations - - Manages computation graphs - - Handles memory allocation"] + backend[" ml/backend.go
Hardware Abstraction Layer:
- Defines tensor operations
- Manages computation graphs
- Handles memory allocation "] - note2["Common neural net operations - used across different models - - Abstracts hardware details - - Provides unified API - - Manages computation flow"] + note2[" Common neural net operations:
- Abstracts hardware details
- Provides unified API
- Manages computation flow "] end - subgraph GGML["Hardware Execution Layer"] + subgraph Hardware["Backend Execution Layer"] direction TB - ggml["ggml.go - CGO Interface - - Bridges Go and C++ - - Handles type conversion - - Manages memory between languages"] - - subgraph Hardware_Specific["Hardware-Specific Implementations"] - direction LR - cpu["ggml-cpu.h - CPU optimized ops"] - cuda["ggml-cuda.h - NVIDIA GPU ops"] - metal["ggml-metal.h - Apple GPU ops"] - vulkan["ggml-vulkan.h - Cross-platform GPU"] - opencl["ggml-opencl.h - OpenCL acceleration"] - end - - note3["GGML provides optimized - implementations for each hardware: - - Automatic dispatch - - Hardware-specific optimizations - - Memory management - - Parallel execution"] + backend_impl[" The backend package provides:
- Unified computation interface
- Automatic hardware selection
- Optimized kernels
- Efficient memory management "] end - %% Connections with explanations - Models --> |"Makes high-level calls - (e.g., self-attention)"| ML_Ops - ML_Ops --> |"Translates to tensor operations - (e.g., matmul, softmax)"| GGML - GGML --> |"Executes optimized code - on target hardware"| Hardware_Specific - - %% Styling - classDef model fill:#fff,stroke:#01579b,stroke-width:2px - classDef ml fill:#fff,stroke:#e65100,stroke-width:2px - classDef hw fill:#fff,stroke:#b71c1c,stroke-width:2px - classDef note fill:#fff,stroke:#666,stroke-dasharray: 5 5 - - class llama,mllama,qwen,qwen_vl,pixtral model - class nn_ops,backend ml - class ggml,cpu,cuda,metal,vulkan,opencl hw - class note1,note2,note3 note - - %% Style subgraphs - style Models fill:#fff,stroke:#01579b,stroke-width:2px - style ML_Ops fill:#fff,stroke:#e65100,stroke-width:2px - style GGML fill:#fff,stroke:#b71c1c,stroke-width:2px - style Hardware_Specific fill:#fff,stroke:#b71c1c,stroke-width:1px + Models --> |" Makes high-level calls
(e.g., self-attention) "| ML_Ops + ML_Ops --> |" Translates to tensor operations
(e.g., matmul, softmax) "| Hardware
 ```
 
 When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.
 
-## Implementation Steps
+## Implementation Process Overview
+
+Here's the high-level process for implementing a new model in Ollama:
+
+1. **Environment Setup**: Clone the repository and set up your development environment
+2. **Research Implementation**: Understand the original model architecture
+3. **Project Structure Setup**: Set up the necessary file structure
+4. **Create Basic Modelfile**: Create a simple Modelfile for testing
+5. **Implement Weight Conversion**: Map from the original format to GGUF
+6. **Open a Draft PR**: Create a draft pull request to establish communication with maintainers
+7. **Implement Model Logic**: Create the model architecture and forward pass
+8. **Quality Check and Final Steps**: Create the final Modelfile, add tests, and verify functionality
+9. **Finalize PR and Publish**: Complete the PR and publish to ollama.com
+
+## Implementation Steps in Detail
 
 ### 1. Environment Setup
 
@@ -121,11 +74,11 @@ Get the original model implementation running. This typically involves:
 Create the necessary file structure by referencing previous model implementations. You'll need:
 
 ```
+convert/
+└── convert_your-model.go  # Weight conversion logic (PyTorch/SafeTensors to GGML)
 model/
 └── your-model/
-    ├── model.go         # Architecture and forward pass implementation
-    ├── convert.go       # Weight conversion logic (PyTorch/SafeTensors to GGML)
-    └── convert_test.go  # Conversion logic tests
+    └── model.go  # Architecture and forward pass implementation
 ```
 
 Add your model to the main paths in [model/models/models.go](https://github.com/ollama/ollama/blob/main/model/models/models.go):
 
@@ -140,45 +93,194 @@ import (
 )
 ```
 
-### 4. Development Process
+### 4. Create a Basic Modelfile
 
-1. **Open a Draft PR**
-   - Create a draft pull request in the `ollama/ollama` repository
-   - Use this as a communication channel with Ollama maintainers
+Create a simple Modelfile early in the process to facilitate testing:
 
-2. **Implement Weight Conversion**
-   - Work on `convert.go`
-   - Reference existing conversion implementations
-   - Create a basic Modelfile:
-     ```
-     FROM /path/to/model
-     ```
-   - Test conversion:
-     ```bash
-     go run . create -f /path/to/Modelfile
+```
+FROM /path/to/model
+TEMPLATE "{{.Prompt}}" # Use a static prompt format for initial testing
+```
+
+This allows you to test your implementation with consistent inputs before finalizing the proper prompt template.
+
+### 5. Implement Weight Conversion
+
+- Work on `convert/convert_your-model.go`
+- Reference existing conversion implementations
+- Conversion involves mapping tensor and metadata names from the PyTorch/SafeTensors format to their GGUF equivalents
+- Understand typical GGUF layout and structure:
+
+  **Typical GGUF Layout:**
+  ```
+  GGUF
+  ├── Metadata Section
+  │   ├── Model Parameters
+  │   │   ├── General architecture parameters
+  │   │   │   ├── "{arch}.vocab_size" (e.g., "llama.vocab_size")
+  │   │   │   ├── "{arch}.context_length" (e.g., "llama.context_length")
+  │   │   │   ├── "{arch}.embedding_length" (e.g., "llama.embedding_length")
+  │   │   │   └── "{arch}.block_count" (e.g., "llama.block_count")
+  │   │   │
+  │   │   └── Architecture-specific parameters
+  │   │       ├── "{arch}.attention.head_count" (e.g., "llama.attention.head_count")
+  │   │       ├── "{arch}.attention.head_count_kv" (e.g., "llama.attention.head_count_kv")
+  │   │       ├── "{arch}.rope.dimension_count" (e.g., "llama.rope.dimension_count")
+  │   │       └── "{arch}.attention.layer_norm_rms_epsilon" (e.g., "llama.attention.layer_norm_rms_epsilon")
+  │   │
+  │   ├── Tokenizer parameters
+  │   │   ├── "tokenizer.ggml.model" (e.g., "llama")
+  │   │   ├── "tokenizer.ggml.tokens" (vocabulary tokens)
+  │   │   ├── "tokenizer.ggml.bos_id" (beginning of sequence token ID)
+  │   │   └── "tokenizer.ggml.eos_id" (end of sequence token ID)
+  │   │
+  │   └── General metadata
+  │       └── "general.architecture" (e.g., "llama", "qwen2", "phi")
+  │
+  └── Tensor Data Section
+      ├── Common tensors:
+      │   ├── "token_embd.weight" (token embedding matrix)
+      │   ├── "rope_freqs.weight" (RoPE frequency weights)
+      │   ├── "output_norm.weight" (final layer normalization)
+      │   └── "output.weight" (output projection)
+      │
+      └── Layer-specific tensors:
+          ├── "blk.{i}.attn_q.weight" (query projection)
+          ├── "blk.{i}.attn_k.weight" (key projection)
+          ├── "blk.{i}.attn_v.weight" (value projection)
+          ├── "blk.{i}.attn_output.weight" (attention output)
+          ├── "blk.{i}.attn_norm.weight" (attention normalization)
+          ├── "blk.{i}.ffn_norm.weight" (feed-forward normalization)
+          ├── "blk.{i}.ffn_up.weight" (FFN up projection)
+          ├── "blk.{i}.ffn_down.weight" (FFN down projection)
+          └── "blk.{i}.ffn_gate.weight" (FFN gate projection)
+  ```
+
+- Key conversion details include:
+  - Linear weight matrices (sometimes need transposition)
+  - Layer normalization weights (might need reshaping)
+  - **Note: In GGML, the FFN tensors correspond to the MLP (multi-layer perceptron) block of the architecture**
+
+- Test conversion:
+  ```bash
+  go run . create -f /path/to/Modelfile
+  ```
+
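+As a rough illustration of the kind of name mapping involved, the sketch below rewrites a few Llama-style tensor names into the GGUF names shown in the layout above. The helper function and the mapping table are hypothetical examples for illustration only, not the actual interfaces of Ollama's `convert` package:
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// ggufTensorName is an illustrative helper that maps PyTorch/SafeTensors
+// tensor names onto the GGUF naming scheme shown in the layout above.
+func ggufTensorName(name string) string {
+	r := strings.NewReplacer(
+		"model.embed_tokens", "token_embd",
+		"model.layers.", "blk.",
+		"self_attn.q_proj", "attn_q",
+		"self_attn.k_proj", "attn_k",
+		"self_attn.v_proj", "attn_v",
+		"self_attn.o_proj", "attn_output",
+		"input_layernorm", "attn_norm",
+		"post_attention_layernorm", "ffn_norm",
+		"mlp.gate_proj", "ffn_gate",
+		"mlp.up_proj", "ffn_up",
+		"mlp.down_proj", "ffn_down",
+		"model.norm", "output_norm",
+		"lm_head", "output",
+	)
+	return r.Replace(name)
+}
+
+func main() {
+	// Prints "blk.0.attn_q.weight"
+	fmt.Println(ggufTensorName("model.layers.0.self_attn.q_proj.weight"))
+}
+```
+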
+### 6. Open a Draft PR
+
+After implementing the initial weight conversion, creating a draft pull request is recommended as it:
+- Establishes a communication channel with Ollama maintainers
+- Allows for early feedback on your approach
+- Makes it easier to track progress and changes
+
+To open a draft PR:
+1. Fork the repository
+2. Create a new branch for your model implementation
+3. Make initial commits with your weight conversion implementation
+4. Open a PR in the `ollama/ollama` repository and mark it as draft
+5. Include a clear description of the model you're implementing
+
+### 7. Implement Model Logic
+
+- Reference existing model implementations
+- Implement `New()` and `Forward()` functions in `model.go`:
+
+  **The `New()` function:**
+  - Creates and initializes your model structure
+  - Loads configuration parameters (embedding size, attention heads, etc.)
+ - Sets up the tokenizer with vocabulary and special tokens + - Initializes all model layers and weights + - **Important**: Sets up the KV cache for efficient inference + - Example: + ```go + func New(c ml.Config) (model.Model, error) { + m := &Model{ + // Initialize tokenizer + BytePairEncoding: model.NewBytePairEncoding(...), + // Create layer arrays + Layers: make([]Layer, c.Uint("block_count")), + // Set model parameters + Options: &Options{...}, + } + // Initialize KV cache for efficient inference + m.Cache = kvcache.NewCausalCache(m.Shift) + return m, nil + } + ``` + + **The `Forward()` function:** + - **What it does**: Defines the computational graph of your model + - **Important**: The graph is NOT executed immediately - it's built first, then executed later when predictions are needed + - Takes input tokens and converts them to embeddings + - Processes inputs through transformer layers (attention and feed-forward networks) + - Creates the path for data flow through your model's components + - Example: + ```go + func (m *Model) Forward(ctx ml.Context, opts model.Options) (ml.Tensor, error) { + // Convert inputs to tensors + inputTensor, _ := ctx.FromIntSlice(opts.Inputs, len(opts.Inputs)) + positionsTensor, _ := ctx.FromIntSlice(opts.Positions, len(opts.Positions)) + + // Initial token embedding + hiddenStates := m.TokenEmbedding.Forward(ctx, inputTensor) + + // Process through transformer layers + for i, layer := range m.Layers { + m.Cache.SetLayer(i) + hiddenStates = layer.Forward(ctx, hiddenStates, positionsTensor, m.Cache, m.Options) + } + + // Final processing and output + normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon) + logits := m.Output.Forward(ctx, normalizedOutput) + + // Return logits for requested positions + outputsTensor, _ := ctx.FromIntSlice(opts.Outputs, len(opts.Outputs)) + return logits.Rows(ctx, outputsTensor), nil + } + ``` + + **Key Components to Implement:** + + 1. **KV Cache**: + - Improves inference performance for text generation + - How it works: Stores previously computed key and value tensors from self-attention, avoiding redundant computations + - Implementation: Use the `kvcache.NewCausalCache()` for autoregressive models + - Important: Must implement the `Shift()` function to handle rotary position embeddings with the cache + + 2. **Self-Attention**: + - Core component that learns contextual relationships between tokens + - Implements query, key, value projections and their interactions + - Must handle positional encoding (usually Rotary Position Embeddings) + - Uses the KV cache to make generation efficient + + 3. **Normalization Layers**: + - Purpose: Stabilizes training and maintains consistent activation distributions + - Types: RMSNorm, LayerNorm, etc. depending on model architecture + - Implementation: Apply before attention and feed-forward networks + - Example: `normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)` + + 4. **Activation Functions**: + - Purpose: Introduces non-linearity into the model + - Common types: SILU (Sigmoid Linear Unit), GELU, ReLU + - Found in feed-forward/MLP blocks + - Example: + ```go + // SwiGLU activation in MLP + gateActivation := mlp.Gate.Forward(ctx, hiddenState).SILU(ctx) + upProjection := mlp.Up.Forward(ctx, hiddenState) + intermediateStates := gateActivation.Mul(ctx, upProjection) ``` +- Run your forward pass: + ```bash + # in the root of the ollama directory + go build . 
+ OLLAMA_DEBUG=1 ./ollama serve + OLLAMA_DEBUG=1 ./ollama run + ``` +- Compare output with research implementation -3. **Implement Model Logic** - - Implement `New()` and `Forward()` functions in `model.go` - - Reference existing model implementations - - Debug forward pass: - ```bash - OLLAMA_DEBUG=1 go run . run - ``` - - Compare output with research implementation - -4. **Tokenizer Implementation** - - Implement a new tokenizer if required - - Ensure compatibility with model architecture - -5. **Text Generation Testing** - - Implement proper prompt formatting - - Test basic generation: - ```bash - go run . run "hello" - ``` - -### 5. Testing +### 8. Quality Check and Final Steps 1. Add comprehensive tests to: - `model_test.go` @@ -189,28 +291,36 @@ import ( - Model initialization - Text generation -### 6. Model Deployment +3. **Create Final Modelfile** + - Replace the static prompt with the proper Go template for your model: + ``` + FROM + TEMPLATE # Add the proper Go template for your model, including tools if needed + LICENSE # Add appropriate license information + # Add additional parameters if needed + ``` + +4. **End-to-end Testing** + - Run your model with your local Ollama build to ensure that it functions as expected + +5. Benchmark + - Run performance benchmarks on your model implementation + ```go + # from the root of the Ollama directory, while a server is running locally + go build . + OLLAMA_DEBUG=1 ./ollama serve + go test -bench=. -m ./... + ``` + +### 9. Finalize PR and Publish to ollama.com 1. **Finalize Pull Request** - Move PR out of draft state - Address reviewer feedback -2. **Deploy to ollama.com** - - Determine model prompt format - - Convert prompt format to Go template - - Create final Modelfile: - ``` - FROM - TEMPLATE - LICENSE - # Add additional parameters if needed - ``` +2. **Publish to ollama.com** - Push to ollama.com: ```bash ollama create / -f /path/to/Modelfile ollama push / - ``` - -3. **Integration Testing** - - Run end-to-end tests - - Verify model behavior in production environment + ``` \ No newline at end of file