docs: add basic steps to implement a new model

Add detailed guide for implementing new models in Ollama's Go inference engine.
The guide walks through the full process from initial setup to deployment, including architecture overview,
file structure, conversion process, and testing requirements. This will help new contributors understand how to add models to Ollama.
Bruce MacDonald 2025-02-19 11:17:30 -08:00
parent d2eb226c91
commit 0d15036d82

docs/implement.md (new file)

# Guide: Implementing Models in Ollama's Go Inference Engine

> **Note**: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.

This guide outlines the process of implementing a new model in Ollama's Go GGML inference engine. It covers everything from initial setup to deploying your model to ollama.com.

## Architecture Overview

Below is a diagram showing Ollama's inference engine architecture layers and how they interact:

```mermaid
graph TB
    subgraph Models["Model Layer: LLM Implementations"]
        direction TB
        llama["model/models/llama/model.go"]
        mllama["model/models/mllama/model.go"]
        qwen["model/models/qwen2/model.go"]
        qwen_vl["model/models/qwen2vl/model.go"]

        note1["Each model implements a specific architecture
        - Defines model parameters
        - Implements forward pass"]
    end

    subgraph ML_Ops["Neural Network Operations"]
        direction TB
        nn_ops["nn/
        linear.go - Matrix operations
        embedding.go - Token embeddings
        normalization.go - Layer normalization
        convolution.go - Conv operations"]

        backend["ml/backend.go
        Hardware Abstraction Layer
        - Defines tensor operations
        - Manages computation graphs
        - Handles memory allocation"]

        note2["Common neural net operations
        used across different models
        - Abstracts hardware details
        - Provides unified API
        - Manages computation flow"]
    end

    subgraph GGML["Hardware Execution Layer"]
        direction TB
        ggml["ggml.go
        CGO Interface
        - Bridges Go and C++
        - Handles type conversion
        - Manages memory between languages"]

        subgraph Hardware_Specific["Hardware-Specific Implementations"]
            direction LR
            cpu["ggml-cpu.h
            CPU optimized ops"]
            cuda["ggml-cuda.h
            NVIDIA GPU ops"]
            metal["ggml-metal.h
            Apple GPU ops"]
            vulkan["ggml-vulkan.h
            Cross-platform GPU"]
            opencl["ggml-opencl.h
            OpenCL acceleration"]
        end

        note3["GGML provides optimized
        implementations for each hardware:
        - Automatic dispatch
        - Hardware-specific optimizations
        - Memory management
        - Parallel execution"]
    end

    %% Connections with explanations
    Models --> |"Makes high-level calls
    (e.g., self-attention)"| ML_Ops
    ML_Ops --> |"Translates to tensor operations
    (e.g., matmul, softmax)"| GGML
    GGML --> |"Executes optimized code
    on target hardware"| Hardware_Specific

    %% Styling
    classDef model fill:#fff,stroke:#01579b,stroke-width:2px
    classDef ml fill:#fff,stroke:#e65100,stroke-width:2px
    classDef hw fill:#fff,stroke:#b71c1c,stroke-width:2px
    classDef note fill:#fff,stroke:#666,stroke-dasharray: 5 5

    class llama,mllama,qwen,qwen_vl model
    class nn_ops,backend ml
    class ggml,cpu,cuda,metal,vulkan,opencl hw
    class note1,note2,note3 note

    %% Style subgraphs
    style Models fill:#fff,stroke:#01579b,stroke-width:2px
    style ML_Ops fill:#fff,stroke:#e65100,stroke-width:2px
    style GGML fill:#fff,stroke:#b71c1c,stroke-width:2px
    style Hardware_Specific fill:#fff,stroke:#b71c1c,stroke-width:1px
```
When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.
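As a taste of what that looks like in practice, here is a minimal, illustrative Go sketch of a model-layer type delegating its math to the `nn` layer; the exact types and signatures in `ml` and `ml/nn` may differ from what is shown:

```go
package mymodel

import (
	"github.com/ollama/ollama/ml"
	"github.com/ollama/ollama/ml/nn"
)

// Block is one hypothetical transformer block: the model layer owns the
// architecture, while nn.* and ml.Tensor own the math and, below them,
// GGML owns the hardware dispatch.
type Block struct {
	AttentionNorm *nn.RMSNorm
	Output        *nn.Linear
}

func (b *Block) Forward(ctx ml.Context, hidden ml.Tensor) ml.Tensor {
	// Each nn call lowers to tensor operations executed by the backend.
	hidden = b.AttentionNorm.Forward(ctx, hidden, 1e-5)
	return b.Output.Forward(ctx, hidden)
}
```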
## Implementation Steps

### 1. Environment Setup

First, clone the Ollama repository and get it running locally by following the [development setup guide](https://github.com/ollama/ollama/blob/main/docs/development.md).
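For instance, a typical first run looks something like the following (assuming a Go toolchain plus the C/C++ compiler the cgo-based GGML backend requires; see the development guide for platform specifics):

```bash
git clone https://github.com/ollama/ollama.git
cd ollama
go run . serve
```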
### 2. Research Implementation

Get the original model implementation running. This typically involves:

- Cloning the research code repository (usually Python-based)
- Setting up the required environment
- Running inference with sample inputs
- Understanding the model architecture and forward pass
### 3. Project Structure Setup

Create the necessary file structure by referencing previous model implementations. You'll need:

```
model/
└── your-model/
├── model.go # Architecture and forward pass implementation
├── convert.go # Weight conversion logic (PyTorch/SafeTensors to GGML)
└── convert_test.go # Conversion logic tests
```
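Much of `convert.go` is typically a mapping from the source checkpoint's tensor names (and layouts) to the names the GGML side expects. A hypothetical sketch follows; `mapTensorName` and the name pairs below are illustrative, not Ollama's actual API:

```go
package mymodel

import "strings"

// mapTensorName translates a hypothetical PyTorch/SafeTensors tensor name
// to its GGUF counterpart; real conversions also reshape or requantize data.
func mapTensorName(name string) string {
	replacements := []struct{ from, to string }{
		{"model.embed_tokens", "token_embd"},
		{"model.norm", "output_norm"},
		{"self_attn.q_proj", "attn_q"},
	}
	for _, r := range replacements {
		name = strings.Replace(name, r.from, r.to, 1)
	}
	return name
}
```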
Then add your model to the imports in [model/models/models.go](https://github.com/ollama/ollama/blob/main/model/models/models.go) so it is compiled into the engine:
```go
package models

import (
	_ "github.com/ollama/ollama/model/models/llama"
	_ "github.com/ollama/ollama/model/models/mllama"
	_ "github.com/ollama/ollama/model/models/your-model" // Add your model here
)
```
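The blank import matters because each model package registers itself in an `init()` function, which runs when the package is linked in. A sketch of the pattern — the exact `Register` signature in the `model` package is an assumption here:

```go
package mymodel

import "github.com/ollama/ollama/model"

func init() {
	// Registering under the architecture name lets the engine construct
	// this model (via the New constructor you implement in model.go)
	// when it loads a GGUF with that architecture. The Register
	// signature shown is illustrative.
	model.Register("my-model", New)
}
```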
### 4. Development Process

1. **Open a Draft PR**
   - Create a draft pull request in the `ollama/ollama` repository
   - Use this as a communication channel with Ollama maintainers

2. **Implement Weight Conversion**
   - Work on `convert.go`
   - Reference existing conversion implementations
   - Create a basic Modelfile:

     ```
     FROM /path/to/model
     ```

   - Test conversion:

     ```bash
     go run . create <my-model> -f /path/to/Modelfile
     ```

3. **Implement Model Logic**
   - Implement the `New()` and `Forward()` functions in `model.go` (see the skeleton sketched after this list)
   - Reference existing model implementations
   - Debug the forward pass:

     ```bash
     OLLAMA_DEBUG=1 go run . run <my-model>
     ```

   - Compare the output against the research implementation

4. **Implement the Tokenizer**
   - Implement a new tokenizer if required
   - Ensure compatibility with the model architecture

5. **Test Text Generation**
   - Implement proper prompt formatting
   - Test basic generation:

     ```bash
     go run . run <my-model> "hello"
     ```
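To make step 3 concrete, here is a skeletal `model.go`, assuming interfaces shaped like those of the existing implementations under `model/models/`; treat every name and signature as illustrative:

```go
package mymodel

import (
	"github.com/ollama/ollama/ml"
	"github.com/ollama/ollama/ml/nn"
)

// Model mirrors the shape of existing implementations; fields hold the
// converted weights. Names and signatures here are illustrative.
type Model struct {
	TokenEmbedding *nn.Embedding
	OutputNorm     *nn.RMSNorm
	Output         *nn.Linear
	// ... per-layer attention and feed-forward weights ...
}

// New wires up the model from the converted weights and hyperparameters.
func New() (*Model, error) {
	return &Model{}, nil // weight/config loading omitted in this sketch
}

// Forward embeds the input tokens, runs the transformer blocks, and
// projects the final hidden state to vocabulary logits.
func (m *Model) Forward(ctx ml.Context, tokens ml.Tensor) (ml.Tensor, error) {
	hidden := m.TokenEmbedding.Forward(ctx, tokens)
	// ... attention and feed-forward blocks, one per layer ...
	hidden = m.OutputNorm.Forward(ctx, hidden, 1e-5)
	return m.Output.Forward(ctx, hidden), nil
}
```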
### 5. Testing

1. Add comprehensive tests to:
   - `model_test.go`
   - `convert_test.go`
2. Ensure the tests cover:
   - Weight conversion
   - Model initialization
   - Text generation
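Conversion tests, for example, often take Go's table-driven shape. This minimal sketch assumes the hypothetical `mapTensorName` helper from the conversion sketch above:

```go
package mymodel

import "testing"

// TestMapTensorName verifies that source checkpoint tensor names map to
// the GGUF names the model expects.
func TestMapTensorName(t *testing.T) {
	cases := []struct {
		in, want string
	}{
		{"model.embed_tokens.weight", "token_embd.weight"},
		{"model.norm.weight", "output_norm.weight"},
	}
	for _, tc := range cases {
		if got := mapTensorName(tc.in); got != tc.want {
			t.Errorf("mapTensorName(%q) = %q, want %q", tc.in, got, tc.want)
		}
	}
}
```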
### 6. Model Deployment

1. **Finalize the Pull Request**
   - Move the PR out of draft state
   - Address reviewer feedback

2. **Deploy to ollama.com**
   - Determine the model's prompt format and convert it to a Go template (see the sketch after this list)
   - Create the final Modelfile:

     ```
     FROM <converted-gguf>
     TEMPLATE <prompt-template>
     LICENSE <license-info>
     # Add additional parameters if needed
     ```

   - Push to ollama.com:

     ```bash
     ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
     ollama push <your-namespace>/<your-model>
     ```

3. **Integration Testing**
   - Run end-to-end tests
   - Verify model behavior in a production environment
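The `TEMPLATE` directive uses Go's text/template syntax with variables such as `.System`, `.Prompt`, and `.Response`. Below is a sketch for a hypothetical chat format; the special tokens are placeholders, so substitute the ones your model was actually trained with:

```
FROM ./my-model.gguf
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}<|user|>
{{ .Prompt }}<|end|>
<|assistant|>
"""
PARAMETER stop <|end|>
```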