Update add-a-model.md

2025-06-14 05:20:57 +02:00 · 2025-02-25 14:59:39 -08:00 · 2025-02-25 14:59:39 -08:00 · e3f3043f5b
commit e3f3043f5b
parent b5fc84c930
1 changed files with 239 additions and 129 deletions
--- a/docs/add-a-model.md
+++ b/docs/add-a-model.md
@ -2,7 +2,7 @@
 > **Note**: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.
-This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to deploying your model to ollama.com.
+This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to publishing your model to ollama.com.
 ## Architecture Overview
@ -12,96 +12,49 @@ Below is a diagram showing Ollama's inference engine architecture layers and how
 graph TB
    subgraph Models["Model Layer: LLM Implementations"]
        direction TB
-        llama["model/models/llama/model.go"]
+        llama["model/models/llama"]
-        mllama["model/models/mllama/model.go"]
+        mllama["model/models/mllama"]
-        qwen["model/models/qwen2/model.go"]
+        qwen["model/models/qwen2"]
-        qwen_vl["model/models/qwen2vl/model.go"]
+        etc["...etc"]
-        note1["Each model implements a specific architecture
+        note1[" Each model implements a<br>specific architecture:<br>- Defines model parameters<br>- Implements forward pass"]
        - Defines model parameters
        - Implements forward pass"]
    end
    subgraph ML_Ops["Neural Network Operations"]
        direction TB
-        nn_ops["nn/
+        nn_ops[" nn/<br>linear.go: Matrix multiplication<br>embedding.go: Token embedding lookups<br>normalization.go: Layer norm operations<br>convolution.go: Convolutional operations "]
            linear.go - Matrix operations
            embedding.go - Token embeddings
            normalization.go - Layer normalization
            convolution.go - Conv operations"]
-        backend["ml/backend.go
+        backend[" ml/backend.go<br>Hardware Abstraction Layer:<br>- Defines tensor operations<br>- Manages computation graphs<br>- Handles memory allocation "]
        Hardware Abstraction Layer
        - Defines tensor operations
        - Manages computation graphs
        - Handles memory allocation"]
-        note2["Common neural net operations
+        note2[" Common neural net operations:<br>- Abstracts hardware details<br>- Provides unified API<br>- Manages computation flow "]
        used across different models
        - Abstracts hardware details
        - Provides unified API
        - Manages computation flow"]
    end
-    subgraph GGML["Hardware Execution Layer"]
+    subgraph Hardware["Backend Execution Layer"]
        direction TB
-        ggml["ggml.go
+        backend_impl[" The backend package provides:<br>- Unified computation interface<br>- Automatic hardware selection<br>- Optimized kernels<br>- Efficient memory management "]
        CGO Interface
        - Bridges Go and C++
        - Handles type conversion
        - Manages memory between languages"]
        subgraph Hardware_Specific["Hardware-Specific Implementations"]
            direction LR
            cpu["ggml-cpu.h
            CPU optimized ops"]
            cuda["ggml-cuda.h
            NVIDIA GPU ops"]
            metal["ggml-metal.h
            Apple GPU ops"]
            vulkan["ggml-vulkan.h
            Cross-platform GPU"]
            opencl["ggml-opencl.h
            OpenCL acceleration"]
        end
        note3["GGML provides optimized 
        implementations for each hardware:
        - Automatic dispatch
        - Hardware-specific optimizations
        - Memory management
        - Parallel execution"]
    end
-    %% Connections with explanations
+    Models --> |" Makes high-level calls<br>(e.g., self-attention) "| ML_Ops
-    Models --> |"Makes high-level calls
+    ML_Ops --> |" Translates to tensor operations<br>(e.g., matmul, softmax) "| Hardware
    (e.g., self-attention)"| ML_Ops
    ML_Ops --> |"Translates to tensor operations
    (e.g., matmul, softmax)"| GGML
    GGML --> |"Executes optimized code
    on target hardware"| Hardware_Specific
    %% Styling
    classDef model fill:#fff,stroke:#01579b,stroke-width:2px
    classDef ml fill:#fff,stroke:#e65100,stroke-width:2px
    classDef hw fill:#fff,stroke:#b71c1c,stroke-width:2px
    classDef note fill:#fff,stroke:#666,stroke-dasharray: 5 5
    class llama,mllama,qwen,qwen_vl,pixtral model
    class nn_ops,backend ml
    class ggml,cpu,cuda,metal,vulkan,opencl hw
    class note1,note2,note3 note
    %% Style subgraphs
    style Models fill:#fff,stroke:#01579b,stroke-width:2px
    style ML_Ops fill:#fff,stroke:#e65100,stroke-width:2px
    style GGML fill:#fff,stroke:#b71c1c,stroke-width:2px
    style Hardware_Specific fill:#fff,stroke:#b71c1c,stroke-width:1px
 ```
 When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.
-## Implementation Steps
+## Implementation Process Overview
 Here's the high-level process for implementing a new model in Ollama:
 1. **Environment Setup**: Clone the repository and set up your development environment
 2. **Research Implementation**: Understand the original model architecture
 3. **Project Structure Setup**: Set up the necessary file structure
 4. **Create Basic Modelfile**: Create a simple Modelfile for testing
 5. **Implement Weight Conversion**: Map from original format to GGUF
 6. **Open a Draft PR**: Create a draft pull request to establish communication with maintainers
 7. **Implement Model Logic**: Create the model architecture and forward pass
 8. **Quality Check and Final Steps**: Create a Modelfile, add tests and ensure functionality
 10. **Finalize PR and Publish**: Complete the PR and publish to ollama.com
 ## Implementation Steps in Detail
 ### 1. Environment Setup
@ -121,11 +74,11 @@ Get the original model implementation running. This typically involves:
 Create the necessary file structure by referencing previous model implementations. You'll need:
 ```
 convert/
 └── convert_your-model.go # Weight conversion logic (PyTorch/SafeTensors to GGML)
 model/
 └── your-model/
-    ├── model.go         # Architecture and forward pass implementation
+    └── model.go         # Architecture and forward pass implementation
    ├── convert.go       # Weight conversion logic (PyTorch/SafeTensors to GGML)
    └── convert_test.go  # Conversion logic tests
 ```
 Add your model to the main paths in [model/models/models.go](https://github.com/ollama/ollama/blob/main/model/models/models.go):
@ -140,45 +93,194 @@ import (
 )
 ```
-### 4. Development Process
+### 4. Create a Basic Modelfile
-1. **Open a Draft PR**
+Create a simple Modelfile early in the process to facilitate testing:
   - Create a draft pull request in the `ollama/ollama` repository
   - Use this as a communication channel with Ollama maintainers
-2. **Implement Weight Conversion**
+```
-   - Work on `convert.go`
+FROM /path/to/model
-   - Reference existing conversion implementations
+TEMPLATE "{{.Prompt}}" # Use a static prompt format for initial testing
-   - Create a basic Modelfile:
+```
-     ```
+
-     FROM /path/to/model
+This allows you to test your implementation with consistent inputs before finalizing the proper prompt template.
-     ```
+
-   - Test conversion:
+### 5. Implement Weight Conversion
-     ```bash
+
-     go run . create <my-model> -f /path/to/Modelfile
+- Work on `convert/convert_your-model.go`
 - Reference existing conversion implementations
 - Conversion involves mapping from PyTorch/SafeTensors naming to GGUF naming as you see fit
 - Understand typical GGUF layout and structure:
  **Typical GGUF Layout:**
  ```
  GGUF
  ├── Metadata Section
  │   ├── Model Parameters
  │   │   ├── General architecture parameters 
  │   │   │   ├── "{arch}.vocab_size" (e.g., "llama.vocab_size") 
  │   │   │   ├── "{arch}.context_length" (e.g., "llama.context_length")
  │   │   │   ├── "{arch}.embedding_length" (e.g., "llama.embedding_length")
  │   │   │   └── "{arch}.block_count" (e.g., "llama.block_count")
  │   │   │
  │   │   └── Architecture-specific parameters
  │   │       ├── "{arch}.attention.head_count" (e.g., "llama.attention.head_count")
  │   │       ├── "{arch}.attention.head_count_kv" (e.g., "llama.attention.head_count_kv")
  │   │       ├── "{arch}.rope.dimension_count" (e.g., "llama.rope.dimension_count")
  │   │       └── "{arch}.attention.layer_norm_rms_epsilon" (e.g., "llama.attention.layer_norm_rms_epsilon")
  │   │
  │   ├── Tokenizer parameters
  │   │   ├── "tokenizer.ggml.model" (e.g., "llama")
  │   │   ├── "tokenizer.ggml.tokens" (vocabulary tokens)
  │   │   ├── "tokenizer.ggml.bos_id" (beginning of sequence token ID)
  │   │   └── "tokenizer.ggml.eos_id" (end of sequence token ID)
  │   │
  │   └── General metadata
  │       └── "general.architecture" (e.g., "llama", "qwen2", "phi")
  │
  └── Tensor Data Section
      ├── Common tensors:
      │   ├── "token_embd.weight" (token embedding matrix)
      │   ├── "rope_freqs.weight" (RoPE frequency weights)
      │   ├── "output_norm.weight" (final layer normalization)
      │   └── "output.weight" (output projection)
      │
      └── Layer-specific tensors:
          ├── "blk.{i}.attn_q.weight" (query projection)
          ├── "blk.{i}.attn_k.weight" (key projection) 
          ├── "blk.{i}.attn_v.weight" (value projection)
          ├── "blk.{i}.attn_output.weight" (attention output)
          ├── "blk.{i}.attn_norm.weight" (attention normalization)
          ├── "blk.{i}.ffn_norm.weight" (feed-forward normalization)
          ├── "blk.{i}.ffn_up.weight" (FFN up projection)
          ├── "blk.{i}.ffn_down.weight" (FFN down projection)
          └── "blk.{i}.ffn_gate.weight" (FFN gate projection)
  ```
  - Key conversion details include:
    - Linear weight matrices (sometimes need transposition)
    - Layer normalization weights (might need reshaping)
    - **Note: In GGML, FFN values are for the MLP (Multi-Layer Perceptron) part of the architecture**
 - Test conversion:
  ```bash
  go run . create <my-model> -f /path/to/Modelfile
  ```
 ### 6. Open a Draft PR
 After implementing the initial weight conversion, creating a draft pull request is recommended as it:
 - Establishes a communication channel with Ollama maintainers
 - Allows for early feedback on your approach
 - Makes it easier to track progress and changes
 To open a draft PR:
 1. Fork the repository
 2. Create a new branch for your model implementation
 3. Make initial commits with your weight conversion implementation
 4. Open a PR in the `ollama/ollama` repository and mark it as draft
 5. Include a clear description of the model you're implementing
 ### 7. Implement Model Logic
 - Reference existing model implementations
 - Implement `New()` and `Forward()` functions in `model.go`:
  **The `New()` function:**
  - Creates and initializes your model structure
  - Loads configuration parameters (embedding size, attention heads, etc.)
  - Sets up the tokenizer with vocabulary and special tokens
  - Initializes all model layers and weights
  - **Important**: Sets up the KV cache for efficient inference
  - Example:
    ```go
    func New(c ml.Config) (model.Model, error) {
        m := &Model{
            // Initialize tokenizer
            BytePairEncoding: model.NewBytePairEncoding(...),
            // Create layer arrays
            Layers: make([]Layer, c.Uint("block_count")),
            // Set model parameters
            Options: &Options{...},
        }
        // Initialize KV cache for efficient inference
        m.Cache = kvcache.NewCausalCache(m.Shift)
        return m, nil
    }
    ```
  **The `Forward()` function:**
  - **What it does**: Defines the computational graph of your model
  - **Important**: The graph is NOT executed immediately - it's built first, then executed later when predictions are needed
  - Takes input tokens and converts them to embeddings
  - Processes inputs through transformer layers (attention and feed-forward networks)
  - Creates the path for data flow through your model's components
  - Example:
    ```go
    func (m *Model) Forward(ctx ml.Context, opts model.Options) (ml.Tensor, error) {
        // Convert inputs to tensors
        inputTensor, _ := ctx.FromIntSlice(opts.Inputs, len(opts.Inputs))
        positionsTensor, _ := ctx.FromIntSlice(opts.Positions, len(opts.Positions))
        // Initial token embedding
        hiddenStates := m.TokenEmbedding.Forward(ctx, inputTensor)
        // Process through transformer layers
        for i, layer := range m.Layers {
            m.Cache.SetLayer(i)
            hiddenStates = layer.Forward(ctx, hiddenStates, positionsTensor, m.Cache, m.Options)
        }
        // Final processing and output
        normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)
        logits := m.Output.Forward(ctx, normalizedOutput)
        // Return logits for requested positions
        outputsTensor, _ := ctx.FromIntSlice(opts.Outputs, len(opts.Outputs))
        return logits.Rows(ctx, outputsTensor), nil
    }
    ```
  **Key Components to Implement:**
  1. **KV Cache**:
     - Improves inference performance for text generation
     - How it works: Stores previously computed key and value tensors from self-attention, avoiding redundant computations
     - Implementation: Use the `kvcache.NewCausalCache()` for autoregressive models
     - Important: Must implement the `Shift()` function to handle rotary position embeddings with the cache
  2. **Self-Attention**:
     - Core component that learns contextual relationships between tokens
     - Implements query, key, value projections and their interactions
     - Must handle positional encoding (usually Rotary Position Embeddings)
     - Uses the KV cache to make generation efficient
  3. **Normalization Layers**:
     - Purpose: Stabilizes training and maintains consistent activation distributions
     - Types: RMSNorm, LayerNorm, etc. depending on model architecture
     - Implementation: Apply before attention and feed-forward networks
     - Example: `normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)`
  4. **Activation Functions**:
     - Purpose: Introduces non-linearity into the model
     - Common types: SILU (Sigmoid Linear Unit), GELU, ReLU
     - Found in feed-forward/MLP blocks
     - Example:
     ```go
     // SwiGLU activation in MLP
     gateActivation := mlp.Gate.Forward(ctx, hiddenState).SILU(ctx)
     upProjection := mlp.Up.Forward(ctx, hiddenState)
     intermediateStates := gateActivation.Mul(ctx, upProjection)
     ```
 - Run your forward pass:
  ```bash
  # in the root of the ollama directory
  go build .
  OLLAMA_DEBUG=1 ./ollama serve
  OLLAMA_DEBUG=1 ./ollama run <my-model>
  ```
 - Compare output with research implementation
-3. **Implement Model Logic**
+### 8. Quality Check and Final Steps
   - Implement `New()` and `Forward()` functions in `model.go`
   - Reference existing model implementations
   - Debug forward pass:
     ```bash
     OLLAMA_DEBUG=1 go run . run <my-model>
     ```
   - Compare output with research implementation
 4. **Tokenizer Implementation**
   - Implement a new tokenizer if required
   - Ensure compatibility with model architecture
 5. **Text Generation Testing**
   - Implement proper prompt formatting
   - Test basic generation:
     ```bash
     go run . run <my-model> "hello"
     ```
 ### 5. Testing
 1. Add comprehensive tests to:
   - `model_test.go`
@ -189,28 +291,36 @@ import (
   - Model initialization
   - Text generation
-### 6. Model Deployment
+3. **Create Final Modelfile**
   - Replace the static prompt with the proper Go template for your model:
     ```
     FROM <converted-gguf>
     TEMPLATE <prompt-template>    # Add the proper Go template for your model, including tools if needed
     LICENSE <license-info>        # Add appropriate license information
     # Add additional parameters if needed
     ```
 4. **End-to-end Testing**
   - Run your model with your local Ollama build to ensure that it functions as expected
 5. Benchmark
   - Run performance benchmarks on your model implementation
   ```go
   # from the root of the Ollama directory, while a server is running locally
   go build .
   OLLAMA_DEBUG=1 ./ollama serve
   go test -bench=. -m <your-model-name> ./...
   ```
 ### 9. Finalize PR and Publish to ollama.com
 1. **Finalize Pull Request**
   - Move PR out of draft state
   - Address reviewer feedback
-2. **Deploy to ollama.com**
+2. **Publish to ollama.com**
   - Determine model prompt format
   - Convert prompt format to Go template
   - Create final Modelfile:
     ```
     FROM <converted-gguf>
     TEMPLATE <prompt-template>
     LICENSE <license-info>
     # Add additional parameters if needed
     ```
   - Push to ollama.com:
     ```bash
     ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
     ollama push <your-namespace>/<your-model>
-     ```
+     ```
 3. **Integration Testing**
   - Run end-to-end tests
   - Verify model behavior in production environment