Guide: Implementing Models in Ollama's Go Inference Engine

Note: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.

This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to publishing your model to ollama.com.

Architecture Overview

Below is a diagram showing Ollama's inference engine architecture layers and how they interact:

graph TB
    subgraph Models["Model Layer: LLM Implementations"]
        direction TB
        llama["model/models/llama"]
        mllama["model/models/mllama"]
        qwen["model/models/qwen2"]
        etc["...etc"]
        
        note1[" Each model implements a<br>specific architecture:<br>- Defines model parameters<br>- Implements forward pass"]
    end

    subgraph ML_Ops["Neural Network Operations"]
        direction TB
        nn_ops[" nn/<br>linear.go: Matrix multiplication<br>embedding.go: Token embedding lookups<br>normalization.go: Layer norm operations<br>convolution.go: Convolutional operations "]
        
        backend[" ml/backend.go<br>Hardware Abstraction Layer:<br>- Defines tensor operations<br>- Manages computation graphs<br>- Handles memory allocation "]

        note2[" Common neural net operations:<br>- Abstracts hardware details<br>- Provides unified API<br>- Manages computation flow "]
    end

    subgraph Hardware["Backend Execution Layer"]
        direction TB
        backend_impl[" The backend package provides:<br>- Unified computation interface<br>- Automatic hardware selection<br>- Optimized kernels<br>- Efficient memory management "]
        
        subgraph Backends["Backend Implementations"]
            direction LR
            cpu["backend/cpu<br>- Pure Go implementation<br>- Fallback for all platforms"]
            
            metal["backend/metal<br>- Apple Silicon (M1/M2/M3)<br>- MLX integration<br>- Leverages Apple Neural Engine"]
            
            onnx["backend/onnx<br>- Cross-platform compatibility<br>- ONNX Runtime integration<br>- Pre-compiled graph execution"]
            
            ggml["backend/ggml<br>- CPU/GPU quantized compute<br>- Low-precision operations<br>- Memory-efficient inferencing"]
        end
    end

    Models --> |" Makes high-level calls<br>(e.g., self-attention) "| ML_Ops
    ML_Ops --> |" Translates to tensor operations<br>(e.g., matmul, softmax) "| Hardware
    backend_impl --> Backends

When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.

Implementation Process Overview

Here's the high-level process for implementing a new model in Ollama:

  1. Environment Setup: Clone the repository and set up your development environment
  2. Research Implementation: Understand the original model architecture
  3. Project Structure Setup: Set up the necessary file structure
  4. Create Basic Modelfile: Create a simple Modelfile for testing
  5. Implement Weight Conversion: Map from original format to GGUF
  6. Open a Draft PR: Create a draft pull request to establish communication with maintainers
  7. Implement Model Logic: Create the model architecture and forward pass
  8. Quality Check and Final Steps: Create the final Modelfile, add tests, and verify functionality
  9. Finalize PR and Publish: Complete the PR and publish to ollama.com

Implementation Steps in Detail

1. Environment Setup

First, clone the Ollama repository and get it running locally. Follow the development setup guide at: https://github.com/ollama/ollama/blob/main/docs/development.md
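
For reference, a typical local setup looks roughly like the following (build details vary by platform; the development guide above is authoritative):

    git clone https://github.com/ollama/ollama.git
    cd ollama
    go build .
    ./ollama serve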

2. Research Implementation

Get the original model implementation running. This typically involves:

  • Cloning the research code repository (usually Python-based)
  • Setting up the required environment
  • Running inference with sample inputs
  • Understanding the model architecture and forward pass

3. Project Structure Setup

Create the necessary file structure by referencing previous model implementations. You'll need:

convert/
└── convert_your-model.go # Weight conversion logic (PyTorch/SafeTensors to GGUF)
model/
└── your-model/
    └── model.go         # Architecture and forward pass implementation

Register your model by adding it to the list of imports in model/models/models.go:

package models

import (
    _ "github.com/ollama/ollama/model/models/llama"
    _ "github.com/ollama/ollama/model/models/mllama"
    _ "github.com/ollama/ollama/model/models/your-model"  // Add your model here
)

4. Create a Basic Modelfile

Create a simple Modelfile early in the process to facilitate testing:

FROM /path/to/model
TEMPLATE "{{.Prompt}}" # Use a static prompt format for initial testing

This allows you to test your implementation with consistent inputs before finalizing the proper prompt template.

5. Implement Weight Conversion

  • Work on convert/convert_your-model.go

  • Reference existing conversion implementations

  • Conversion involves mapping tensor and parameter names from the original PyTorch/SafeTensors layout to the GGUF names your implementation expects

  • Understand typical GGUF layout and structure:

    Typical GGUF Layout:

    GGUF
    ├── Metadata Section
    │   ├── Model Parameters
    │   │   ├── General architecture parameters 
    │   │   │   ├── "{arch}.vocab_size" (e.g., "llama.vocab_size") 
    │   │   │   ├── "{arch}.context_length" (e.g., "llama.context_length")
    │   │   │   ├── "{arch}.embedding_length" (e.g., "llama.embedding_length")
    │   │   │   └── "{arch}.block_count" (e.g., "llama.block_count")
    │   │   │
    │   │   └── Architecture-specific parameters
    │   │       ├── "{arch}.attention.head_count" (e.g., "llama.attention.head_count")
    │   │       ├── "{arch}.attention.head_count_kv" (e.g., "llama.attention.head_count_kv")
    │   │       ├── "{arch}.rope.dimension_count" (e.g., "llama.rope.dimension_count")
    │   │       └── "{arch}.attention.layer_norm_rms_epsilon" (e.g., "llama.attention.layer_norm_rms_epsilon")
    │   │
    │   ├── Tokenizer parameters
    │   │   ├── "tokenizer.ggml.model" (e.g., "llama")
    │   │   ├── "tokenizer.ggml.tokens" (vocabulary tokens)
    │   │   ├── "tokenizer.ggml.bos_token_id" (beginning of sequence token ID)
    │   │   └── "tokenizer.ggml.eos_token_id" (end of sequence token ID)
    │   │
    │   └── General metadata
    │       └── "general.architecture" (e.g., "llama", "qwen2", "phi")
    │
    └── Tensor Data Section
        ├── Common tensors:
        │   ├── "token_embd.weight" (token embedding matrix)
        │   ├── "rope_freqs.weight" (RoPE frequency weights)
        │   ├── "output_norm.weight" (final layer normalization)
        │   └── "output.weight" (output projection)
        │
        └── Layer-specific tensors:
            ├── "blk.{i}.attn_q.weight" (query projection)
            ├── "blk.{i}.attn_k.weight" (key projection) 
            ├── "blk.{i}.attn_v.weight" (value projection)
            ├── "blk.{i}.attn_output.weight" (attention output)
            ├── "blk.{i}.attn_norm.weight" (attention normalization)
            ├── "blk.{i}.ffn_norm.weight" (feed-forward normalization)
            ├── "blk.{i}.ffn_up.weight" (FFN up projection)
            ├── "blk.{i}.ffn_down.weight" (FFN down projection)
            └── "blk.{i}.ffn_gate.weight" (FFN gate projection)
    
    • Key conversion details include:
      • Linear weight matrices (sometimes need transposition)
      • Layer normalization weights (might need reshaping)
      • Note: in GGUF naming, the ffn_* tensors correspond to the MLP (multi-layer perceptron) block of each layer
      • A minimal sketch of the tensor-name mapping appears after this list
  • Test conversion:

    go run . create <my-model> -f /path/to/Modelfile
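
To illustrate the naming step, here is a minimal, self-contained sketch of renaming checkpoint tensors to the GGUF layout shown above using only the Go standard library. The source-side names and the tensorName helper are hypothetical examples; model the real conversion code on the existing files in convert/ (for example, the existing llama converter).

    package main

    import (
        "fmt"
        "strings"
    )

    // tensorName is a hypothetical helper that rewrites checkpoint tensor names
    // into the GGUF layout shown above. Existing converters express this kind of
    // mapping as an ordered list of replacements.
    func tensorName(src string) string {
        r := strings.NewReplacer(
            "model.embed_tokens", "token_embd",
            "model.norm", "output_norm",
            "lm_head", "output",
            "model.layers", "blk",
            "self_attn.q_proj", "attn_q",
            "self_attn.k_proj", "attn_k",
            "self_attn.v_proj", "attn_v",
            "self_attn.o_proj", "attn_output",
            "input_layernorm", "attn_norm",
            "post_attention_layernorm", "ffn_norm",
            "mlp.gate_proj", "ffn_gate",
            "mlp.up_proj", "ffn_up",
            "mlp.down_proj", "ffn_down",
        )
        return r.Replace(src)
    }

    func main() {
        // Prints "blk.0.attn_q.weight"
        fmt.Println(tensorName("model.layers.0.self_attn.q_proj.weight"))
    }

Any transposition or reshaping noted above happens alongside this renaming when the tensor data itself is written out.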
    

6. Open a Draft PR

After implementing the initial weight conversion, creating a draft pull request is recommended as it:

  • Establishes a communication channel with Ollama maintainers
  • Allows for early feedback on your approach
  • Makes it easier to track progress and changes

To open a draft PR:

  1. Fork the repository
  2. Create a new branch for your model implementation
  3. Make initial commits with your weight conversion implementation
  4. Open a PR in the ollama/ollama repository and mark it as draft
  5. Include a clear description of the model you're implementing

7. Implement Model Logic

  • Reference existing model implementations

  • Implement New() and Forward() functions in model.go:

    The New() function:

    • Creates and initializes your model structure
    • Loads configuration parameters (embedding size, attention heads, etc.)
    • Sets up the tokenizer with vocabulary and special tokens
    • Initializes all model layers and weights
    • Important: Sets up the KV cache for efficient inference
    • Example:
      func New(c ml.Config) (model.Model, error) {
          m := &Model{
              // Initialize tokenizer
              BytePairEncoding: model.NewBytePairEncoding(...),
              // Create layer arrays
              Layers: make([]Layer, c.Uint("block_count")),
              // Set model parameters
              Options: &Options{...},
          }
          // Initialize KV cache for efficient inference
          m.Cache = kvcache.NewCausalCache(m.Shift)
          return m, nil
      }
      

    The Forward() function:

    • What it does: Defines the computational graph of your model
    • Important: The graph is NOT executed immediately - it's built first, then executed later when predictions are needed
    • Takes input tokens and converts them to embeddings
    • Processes inputs through transformer layers (attention and feed-forward networks)
    • Creates the path for data flow through your model's components
    • Example:
      func (m *Model) Forward(ctx ml.Context, opts model.Options) (ml.Tensor, error) {
          // Convert inputs to tensors
          inputTensor, _ := ctx.FromIntSlice(opts.Inputs, len(opts.Inputs))
          positionsTensor, _ := ctx.FromIntSlice(opts.Positions, len(opts.Positions))
      
          // Initial token embedding
          hiddenStates := m.TokenEmbedding.Forward(ctx, inputTensor)
      
          // Process through transformer layers
          for i, layer := range m.Layers {
              m.Cache.SetLayer(i)
              hiddenStates = layer.Forward(ctx, hiddenStates, positionsTensor, m.Cache, m.Options)
          }
      
          // Final processing and output
          normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)
          logits := m.Output.Forward(ctx, normalizedOutput)
      
          // Return logits for requested positions
          outputsTensor, _ := ctx.FromIntSlice(opts.Outputs, len(opts.Outputs))
          return logits.Rows(ctx, outputsTensor), nil
      }
      

    Key Components to Implement:

    1. KV Cache:

      • Improves inference performance for text generation
      • How it works: Stores previously computed key and value tensors from self-attention, avoiding redundant computations
      • Implementation: Use kvcache.NewCausalCache() for autoregressive models
      • Important: Must implement the Shift() function to handle rotary position embeddings with the cache
    2. Self-Attention:

      • Core component that learns contextual relationships between tokens
      • Implements query, key, value projections and their interactions
      • Must handle positional encoding (usually Rotary Position Embeddings)
      • Uses the KV cache to make generation efficient (an illustrative sketch of an attention block and its Shift() function appears after this list)
    3. Normalization Layers:

      • Purpose: Stabilizes training and maintains consistent activation distributions
      • Types: RMSNorm, LayerNorm, etc. depending on model architecture
      • Implementation: Apply before attention and feed-forward networks
      • Example: normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)
    4. Activation Functions:

      • Purpose: Introduces non-linearity into the model
      • Common types: SILU (Sigmoid Linear Unit), GELU, ReLU
      • Found in feed-forward/MLP blocks
      • Example:
      // SwiGLU activation in MLP
      gateActivation := mlp.Gate.Forward(ctx, hiddenState).SILU(ctx)
      upProjection := mlp.Up.Forward(ctx, hiddenState)
      intermediateStates := gateActivation.Mul(ctx, upProjection)
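
    To make the self-attention and Shift() pieces above more concrete, here is an illustrative sketch modeled loosely on the existing llama implementation. The field names, Options fields, gguf tags, and tensor-operation signatures (Reshape, RoPE, Permute, and so on) are assumptions based on that implementation and may have changed as the engine evolves; treat model/models/llama/model.go as the authoritative reference.

      // Illustrative sketch only (assumed imports: math plus this repository's ml, nn,
      // and kvcache packages); check those packages for the exact APIs.
      type SelfAttention struct {
          Query  *nn.Linear `gguf:"attn_q"`
          Key    *nn.Linear `gguf:"attn_k"`
          Value  *nn.Linear `gguf:"attn_v"`
          Output *nn.Linear `gguf:"attn_output"`
      }

      func (sa *SelfAttention) Forward(ctx ml.Context, hiddenState, positions ml.Tensor, cache kvcache.Cache, opts *Options) ml.Tensor {
          batchSize := hiddenState.Dim(1)
          headDim := opts.hiddenSize / opts.numHeads

          // Project to query/key/value, split into heads, and apply rotary position embeddings
          q := sa.Query.Forward(ctx, hiddenState)
          q = q.Reshape(ctx, headDim, opts.numHeads, batchSize)
          q = q.RoPE(ctx, positions, opts.RopeFactors, opts.ropeDim, opts.ropeBase, opts.ropeScale)

          k := sa.Key.Forward(ctx, hiddenState)
          k = k.Reshape(ctx, headDim, opts.numKVHeads, batchSize)
          k = k.RoPE(ctx, positions, opts.RopeFactors, opts.ropeDim, opts.ropeBase, opts.ropeScale)

          v := sa.Value.Forward(ctx, hiddenState)
          v = v.Reshape(ctx, headDim, opts.numKVHeads, batchSize)

          // Store this step's keys/values, then read back the full cached sequence and mask
          cache.Put(ctx, k, v)
          k, v, mask := cache.Get(ctx)

          // Scaled dot-product attention over the cached sequence
          q = q.Permute(ctx, 0, 2, 1, 3)
          k = k.Permute(ctx, 0, 2, 1, 3)
          v = v.Permute(ctx, 1, 2, 0, 3).Contiguous(ctx)

          scores := k.Mulmat(ctx, q)
          scores = scores.Scale(ctx, 1.0/math.Sqrt(float64(headDim)))
          scores = scores.Add(ctx, mask)
          scores = scores.Softmax(ctx)

          attended := v.Mulmat(ctx, scores)
          attended = attended.Permute(ctx, 0, 2, 1, 3).Contiguous(ctx)
          attended = attended.Reshape(ctx, opts.hiddenSize, batchSize)

          return sa.Output.Forward(ctx, attended)
      }

      // Shift lets the causal cache re-apply rotary embeddings when cached keys move
      // to new positions; it is registered via kvcache.NewCausalCache(m.Shift) in New().
      func (m *Model) Shift(ctx ml.Context, layer int, key, shift ml.Tensor) (ml.Tensor, error) {
          return key.RoPE(ctx, shift, m.Options.RopeFactors, m.Options.ropeDim, m.Options.ropeBase, m.Options.ropeScale), nil
      }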
      
  • Run your forward pass:

    # in the root of the ollama directory
    go build .
    OLLAMA_DEBUG=1 ./ollama serve
    OLLAMA_DEBUG=1 ./ollama run <my-model>
    
  • Compare output with research implementation

8. Quality Check and Final Steps

  1. Add comprehensive tests to:

    • model_test.go
    • convert_test.go
  2. Ensure tests cover:

    • Weight conversion (a minimal table-driven sketch appears at the end of this step)
    • Model initialization
    • Text generation
  3. Create Final Modelfile

    • Replace the static prompt with the proper Go template for your model:
      FROM <converted-gguf>
      TEMPLATE <prompt-template>    # Add the proper Go template for your model, including tools if needed
      LICENSE <license-info>        # Add appropriate license information
      # Add additional parameters if needed
      
  4. End-to-end Testing

    • Run your model with your local Ollama build to ensure that it functions as expected
  5. Benchmark

    • Run performance benchmarks on your model implementation
    # from the root of the Ollama directory, while a server is running locally
    go build .
    OLLAMA_DEBUG=1 ./ollama serve
    go test -bench=. -m <your-model-name> ./...
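
As an example of the kind of check that belongs in convert_test.go, here is a minimal, self-contained table-driven sketch. The names and the inline replacer are hypothetical; a real test would exercise your converter's own mapping logic instead.

    package yourmodel_test

    import (
        "strings"
        "testing"
    )

    func TestTensorNameMapping(t *testing.T) {
        // Hypothetical mapping; substitute calls into your conversion code in convert/.
        r := strings.NewReplacer(
            "model.layers", "blk",
            "self_attn.q_proj", "attn_q",
            "mlp.down_proj", "ffn_down",
        )

        cases := map[string]string{
            "model.layers.0.self_attn.q_proj.weight": "blk.0.attn_q.weight",
            "model.layers.3.mlp.down_proj.weight":    "blk.3.ffn_down.weight",
        }

        for in, want := range cases {
            if got := r.Replace(in); got != want {
                t.Errorf("mapping %q: got %q, want %q", in, got, want)
            }
        }
    }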
    

9. Finalize PR and Publish to ollama.com

  1. Finalize Pull Request

    • Move PR out of draft state
    • Address reviewer feedback
  2. Publish to ollama.com

    • Push to ollama.com:
      ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
      ollama push <your-namespace>/<your-model>