# Guide: Implementing Models in Ollama's Go Inference Engine
> **Note**: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.
This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to publishing your model to ollama.com.
## Architecture Overview
Below is a diagram showing Ollama's inference engine architecture layers and how they interact:
```mermaid
graph TB
subgraph Models["Model Layer: LLM Implementations"]
direction TB
llama["model/models/llama"]
mllama["model/models/mllama"]
qwen["model/models/qwen2"]
etc["...etc"]
note1[" Each model implements a<br>specific architecture:<br>- Defines model parameters<br>- Implements forward pass"]
end
subgraph ML_Ops["Neural Network Operations"]
direction TB
nn_ops[" nn/<br>linear.go: Matrix multiplication<br>embedding.go: Token embedding lookups<br>normalization.go: Layer norm operations<br>convolution.go: Convolutional operations "]
backend[" ml/backend.go<br>Hardware Abstraction Layer:<br>- Defines tensor operations<br>- Manages computation graphs<br>- Handles memory allocation "]
note2[" Common neural net operations:<br>- Abstracts hardware details<br>- Provides unified API<br>- Manages computation flow "]
end
subgraph Hardware["Backend Execution Layer"]
direction TB
backend_impl[" The backend package provides:<br>- Unified computation interface<br>- Automatic hardware selection<br>- Optimized kernels<br>- Efficient memory management "]
subgraph Backends["Backend Implementations"]
direction LR
cpu["backend/cpu<br>- Pure Go implementation<br>- Fallback for all platforms"]
metal["backend/metal<br>- Apple Silicon (M1/M2/M3)<br>- MLX integration<br>- Leverages Apple Neural Engine"]
onnx["backend/onnx<br>- Cross-platform compatibility<br>- ONNX Runtime integration<br>- Pre-compiled graph execution"]
ggml["backend/ggml<br>- CPU/GPU quantized compute<br>- Low-precision operations<br>- Memory-efficient inferencing"]
end
end
Models --> |" Makes high-level calls<br>(e.g., self-attention) "| ML_Ops
ML_Ops --> |" Translates to tensor operations<br>(e.g., matmul, softmax) "| Hardware
backend_impl --> Backends
```
When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.
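To make that layering concrete, here is a hedged sketch of how a model-layer struct might compose building blocks from the `nn` package, with `gguf` struct tags binding fields to converted tensor names. The import path, types, and tags are assumptions modeled on existing implementations such as `model/models/llama` and may change as the engine evolves.

```go
package yourmodel

// Assumed import path for the neural-network building blocks; verify against
// the current repository layout before relying on it.
import "github.com/ollama/ollama/ml/nn"

// A transformer block sketched from nn/ primitives. The gguf tags are assumed
// to bind each field to the tensor names produced during weight conversion.
type SelfAttention struct {
	Query  *nn.Linear `gguf:"attn_q"`
	Key    *nn.Linear `gguf:"attn_k"`
	Value  *nn.Linear `gguf:"attn_v"`
	Output *nn.Linear `gguf:"attn_output"`
}

type MLP struct {
	Gate *nn.Linear `gguf:"ffn_gate"`
	Up   *nn.Linear `gguf:"ffn_up"`
	Down *nn.Linear `gguf:"ffn_down"`
}

type Layer struct {
	AttentionNorm *nn.RMSNorm `gguf:"attn_norm"`
	SelfAttention *SelfAttention
	MLPNorm       *nn.RMSNorm `gguf:"ffn_norm"`
	MLP           *MLP
}
```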
## Implementation Process Overview
Here's the high-level process for implementing a new model in Ollama:
1. **Environment Setup**: Clone the repository and set up your development environment
2. **Research Implementation**: Understand the original model architecture
3. **Project Structure Setup**: Set up the necessary file structure
4. **Create Basic Modelfile**: Create a simple Modelfile for testing
5. **Implement Weight Conversion**: Map from the original format to GGUF
6. **Open a Draft PR**: Create a draft pull request to establish communication with maintainers
7. **Implement Model Logic**: Create the model architecture and forward pass
8. **Quality Check and Final Steps**: Create the final Modelfile, add tests, and ensure functionality
9. **Finalize PR and Publish**: Complete the PR and publish to ollama.com
## Implementation Steps in Detail
### 1. Environment Setup
First, clone the Ollama repository and get it running locally. Follow the development setup guide at: https://github.com/ollama/ollama/blob/main/docs/development.md
### 2. Research Implementation
Get the original model implementation running. This typically involves:
- Cloning the research code repository (usually Python-based)
- Setting up the required environment
- Running inference with sample inputs
- Understanding the model architecture and forward pass
### 3. Project Structure Setup
Create the necessary file structure by referencing previous model implementations. You'll need:
```
convert/
└── convert_your-model.go    # Weight conversion logic (PyTorch/SafeTensors to GGML)

model/
└── your-model/
    └── model.go             # Architecture and forward pass implementation
```
Add your model to the main paths in `model/models/models.go`:

```go
package models

import (
	_ "github.com/ollama/ollama/model/models/llama"
	_ "github.com/ollama/ollama/model/models/mllama"
	_ "github.com/ollama/ollama/model/models/your-model" // Add your model here
)
```
### 4. Create a Basic Modelfile
Create a simple Modelfile early in the process to facilitate testing:
```
FROM /path/to/model
TEMPLATE "{{.Prompt}}" # Use a static prompt format for initial testing
```
This allows you to test your implementation with consistent inputs before finalizing the proper prompt template.
### 5. Implement Weight Conversion
- Work on `convert/convert_your-model.go`
- Reference existing conversion implementations
- Conversion involves mapping from PyTorch/SafeTensors naming to GGUF naming as you see fit (a hedged mapping sketch follows at the end of this step)
- Understand the typical GGUF layout and structure:

  ```
  GGUF
  ├── Metadata Section
  │   ├── Model Parameters
  │   │   ├── General architecture parameters
  │   │   │   ├── "{arch}.vocab_size" (e.g., "llama.vocab_size")
  │   │   │   ├── "{arch}.context_length" (e.g., "llama.context_length")
  │   │   │   ├── "{arch}.embedding_length" (e.g., "llama.embedding_length")
  │   │   │   └── "{arch}.block_count" (e.g., "llama.block_count")
  │   │   │
  │   │   └── Architecture-specific parameters
  │   │       ├── "{arch}.attention.head_count" (e.g., "llama.attention.head_count")
  │   │       ├── "{arch}.attention.head_count_kv" (e.g., "llama.attention.head_count_kv")
  │   │       ├── "{arch}.rope.dimension_count" (e.g., "llama.rope.dimension_count")
  │   │       └── "{arch}.attention.layer_norm_rms_epsilon" (e.g., "llama.attention.layer_norm_rms_epsilon")
  │   │
  │   ├── Tokenizer parameters
  │   │   ├── "tokenizer.ggml.model" (e.g., "llama")
  │   │   ├── "tokenizer.ggml.tokens" (vocabulary tokens)
  │   │   ├── "tokenizer.ggml.bos_id" (beginning of sequence token ID)
  │   │   └── "tokenizer.ggml.eos_id" (end of sequence token ID)
  │   │
  │   └── General metadata
  │       └── "general.architecture" (e.g., "llama", "qwen2", "phi")
  │
  └── Tensor Data Section
      ├── Common tensors:
      │   ├── "token_embd.weight" (token embedding matrix)
      │   ├── "rope_freqs.weight" (RoPE frequency weights)
      │   ├── "output_norm.weight" (final layer normalization)
      │   └── "output.weight" (output projection)
      │
      └── Layer-specific tensors:
          ├── "blk.{i}.attn_q.weight" (query projection)
          ├── "blk.{i}.attn_k.weight" (key projection)
          ├── "blk.{i}.attn_v.weight" (value projection)
          ├── "blk.{i}.attn_output.weight" (attention output)
          ├── "blk.{i}.attn_norm.weight" (attention normalization)
          ├── "blk.{i}.ffn_norm.weight" (feed-forward normalization)
          ├── "blk.{i}.ffn_up.weight" (FFN up projection)
          ├── "blk.{i}.ffn_down.weight" (FFN down projection)
          └── "blk.{i}.ffn_gate.weight" (FFN gate projection)
  ```
- Key conversion details include:
  - Linear weight matrices (sometimes need transposition)
  - Layer normalization weights (might need reshaping)
  - Note: in GGML, FFN values are for the MLP (Multi-Layer Perceptron) part of the architecture
- Test the conversion:

  ```shell
  go run . create <my-model> -f /path/to/Modelfile
  ```
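As a rough illustration of the mapping work involved, here is a hedged, standalone sketch of the kind of source-to-GGUF tensor-name translation a converter performs for a Llama-style checkpoint. The actual converter interface in `convert/` differs; the helper below and the rename pairs are illustrative assumptions, not the repository's API.

```go
// Hypothetical illustration of source-to-GGUF tensor name mapping for a
// Llama-style checkpoint. The real converters in convert/ use their own
// interfaces; this only shows the kind of renames involved.
package main

import (
	"fmt"
	"strings"
)

// replacements pairs source-name fragments with their GGUF equivalents.
var replacements = []string{
	"model.embed_tokens", "token_embd",
	"model.layers", "blk",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.o_proj", "attn_output",
	"input_layernorm", "attn_norm",
	"post_attention_layernorm", "ffn_norm",
	"mlp.gate_proj", "ffn_gate",
	"mlp.up_proj", "ffn_up",
	"mlp.down_proj", "ffn_down",
	"model.norm", "output_norm",
	"lm_head", "output",
}

func main() {
	r := strings.NewReplacer(replacements...)
	// Prints "blk.0.attn_q.weight"
	fmt.Println(r.Replace("model.layers.0.self_attn.q_proj.weight"))
}
```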
### 6. Open a Draft PR
After implementing the initial weight conversion, creating a draft pull request is recommended as it:
- Establishes a communication channel with Ollama maintainers
- Allows for early feedback on your approach
- Makes it easier to track progress and changes
To open a draft PR:
- Fork the repository
- Create a new branch for your model implementation
- Make initial commits with your weight conversion implementation
- Open a PR in the `ollama/ollama` repository and mark it as a draft
- Include a clear description of the model you're implementing
### 7. Implement Model Logic
- Reference existing model implementations
- Implement the `New()` and `Forward()` functions in `model.go`

The `New()` function:

- Creates and initializes your model structure
- Loads configuration parameters (embedding size, attention heads, etc.)
- Sets up the tokenizer with vocabulary and special tokens
- Initializes all model layers and weights
- Important: sets up the KV cache for efficient inference

Example:

```go
func New(c ml.Config) (model.Model, error) {
	m := &Model{
		// Initialize tokenizer
		BytePairEncoding: model.NewBytePairEncoding(...),

		// Create layer arrays
		Layers: make([]Layer, c.Uint("block_count")),

		// Set model parameters
		Options: &Options{...},
	}

	// Initialize KV cache for efficient inference
	m.Cache = kvcache.NewCausalCache(m.Shift)

	return m, nil
}
```
The `Forward()` function:

- What it does: defines the computational graph of your model
- Important: the graph is NOT executed immediately; it is built first and executed later, when predictions are needed
- Takes input tokens and converts them to embeddings
- Processes inputs through transformer layers (attention and feed-forward networks)
- Creates the path for data flow through your model's components

Example:

```go
func (m *Model) Forward(ctx ml.Context, opts model.Options) (ml.Tensor, error) {
	// Convert inputs to tensors
	inputTensor, _ := ctx.FromIntSlice(opts.Inputs, len(opts.Inputs))
	positionsTensor, _ := ctx.FromIntSlice(opts.Positions, len(opts.Positions))

	// Initial token embedding
	hiddenStates := m.TokenEmbedding.Forward(ctx, inputTensor)

	// Process through transformer layers
	for i, layer := range m.Layers {
		m.Cache.SetLayer(i)
		hiddenStates = layer.Forward(ctx, hiddenStates, positionsTensor, m.Cache, m.Options)
	}

	// Final processing and output
	normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)
	logits := m.Output.Forward(ctx, normalizedOutput)

	// Return logits for requested positions
	outputsTensor, _ := ctx.FromIntSlice(opts.Outputs, len(opts.Outputs))
	return logits.Rows(ctx, outputsTensor), nil
}
```
Key Components to Implement:
- **KV Cache**:
  - Improves inference performance for text generation
  - How it works: stores previously computed key and value tensors from self-attention, avoiding redundant computations
  - Implementation: use `kvcache.NewCausalCache()` for autoregressive models
  - Important: you must implement the `Shift()` function to handle rotary position embeddings with the cache (a hedged sketch follows at the end of this step)
- **Self-Attention**:
  - Core component that learns contextual relationships between tokens
  - Implements query, key, and value projections and their interactions
  - Must handle positional encoding (usually Rotary Position Embeddings)
  - Uses the KV cache to make generation efficient
- **Normalization Layers**:
  - Purpose: stabilizes training and maintains consistent activation distributions
  - Types: RMSNorm, LayerNorm, etc., depending on the model architecture
  - Implementation: apply before attention and feed-forward networks
  - Example:

    ```go
    normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)
    ```
- **Activation Functions**:
  - Purpose: introduces non-linearity into the model
  - Common types: SILU (Sigmoid Linear Unit), GELU, ReLU
  - Found in feed-forward/MLP blocks
  - Example:

    ```go
    // SwiGLU activation in MLP
    gateActivation := mlp.Gate.Forward(ctx, hiddenState).SILU(ctx)
    upProjection := mlp.Up.Forward(ctx, hiddenState)
    intermediateStates := gateActivation.Mul(ctx, upProjection)
    ```
- Run your forward pass:

  ```shell
  # in the root of the ollama directory
  go build .
  OLLAMA_DEBUG=1 ./ollama serve
  OLLAMA_DEBUG=1 ./ollama run <my-model>
  ```
- Compare output with the research implementation
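Because the `Shift()` requirement above is easy to miss, here is a hedged sketch of what it might look like, loosely modeled on existing model implementations. The signature, the `RoPE` tensor helper, and the rope option fields are assumptions and should be checked against the current `kvcache` and `ml` interfaces.

```go
// Hypothetical sketch: when the causal cache shifts entries, it calls back
// into the model so cached keys can be re-rotated to their new positions.
// The signature and RoPE parameters below are assumptions, not the exact API.
func (m *Model) Shift(ctx ml.Context, layer int, key, shift ml.Tensor) (ml.Tensor, error) {
	// Re-apply rotary position embeddings using the position deltas in `shift`.
	// ropeFactors, ropeDim, ropeBase, and ropeScale are assumed option fields.
	return key.RoPE(ctx, shift, m.ropeFactors, m.ropeDim, m.ropeBase, m.ropeScale), nil
}
```

The callback is wired up in `New()` via `kvcache.NewCausalCache(m.Shift)`, as shown in the earlier example.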
### 8. Quality Check and Final Steps
- Add comprehensive tests to `model_test.go` and `convert_test.go` (a minimal test skeleton is sketched at the end of this step)
- Ensure tests cover:
  - Weight conversion
  - Model initialization
  - Text generation
- **Create the final Modelfile**:
  - Replace the static prompt with the proper Go template for your model:

    ```
    FROM <converted-gguf>
    TEMPLATE <prompt-template> # Add the proper Go template for your model, including tools if needed
    LICENSE <license-info>     # Add appropriate license information
    # Add additional parameters if needed
    ```
- **End-to-end testing**:
  - Run your model with your local Ollama build to ensure that it functions as expected
- **Benchmark**:
  - Run performance benchmarks on your model implementation:

    ```shell
    # from the root of the Ollama directory, while a server is running locally
    go build .
    OLLAMA_DEBUG=1 ./ollama serve
    go test -bench=. -m <your-model-name> ./...
    ```
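To illustrate the shape such tests might take, here is a minimal, hypothetical skeleton for `model_test.go`. The package name, test name, and steps in the comments are placeholders rather than the repository's actual test conventions.

```go
package yourmodel

import "testing"

// TestModelIntegration is a hypothetical skeleton only: real tests should
// exercise weight conversion, model initialization, and text generation
// against outputs from the research implementation.
func TestModelIntegration(t *testing.T) {
	t.Skip("skeleton: convert a small reference checkpoint and compare outputs before enabling")

	// 1. Convert a small reference checkpoint (ollama create ... -f Modelfile).
	// 2. Initialize the model and run a short prompt through Forward().
	// 3. Compare the generated tokens/logits with the research implementation.
}
```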
### 9. Finalize PR and Publish to ollama.com
- **Finalize the pull request**:
  - Move the PR out of draft state
  - Address reviewer feedback
- **Publish to ollama.com**:

  ```shell
  ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
  ollama push <your-namespace>/<your-model>
  ```