docs: add basic steps to implement a new model

Add detailed guide for implementing new models in Ollama's Go inference engine.
The guide walks through the full process from initial setup to deployment, including architecture overview,
file structure, conversion process, and testing requirements. This will help new contributors understand how to add models to Ollama.
Bruce MacDonald 2025-02-19 11:17:30 -08:00
parent d2eb226c91
commit 0d15036d82

docs/implement.md (new file)

# Guide: Implementing Models in Ollama's Go Inference Engine

> **Note**: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.

This guide outlines the process of implementing a new model in Ollama's Go GGML inference engine. It covers everything from initial setup to deploying your model to ollama.com.

## Architecture Overview

Below is a diagram showing Ollama's inference engine architecture layers and how they interact:

```mermaid
graph TB
    subgraph Models["Model Layer: LLM Implementations"]
        direction TB
        llama["model/models/llama/model.go"]
        mllama["model/models/mllama/model.go"]
        qwen["model/models/qwen2/model.go"]
        qwen_vl["model/models/qwen2vl/model.go"]

        note1["Each model implements a specific architecture
        - Defines model parameters
        - Implements forward pass"]
    end

    subgraph ML_Ops["Neural Network Operations"]
        direction TB
        nn_ops["nn/
        linear.go - Matrix operations
        embedding.go - Token embeddings
        normalization.go - Layer normalization
        convolution.go - Conv operations"]

        backend["ml/backend.go
        Hardware Abstraction Layer
        - Defines tensor operations
        - Manages computation graphs
        - Handles memory allocation"]

        note2["Common neural net operations
        used across different models
        - Abstracts hardware details
        - Provides unified API
        - Manages computation flow"]
    end

    subgraph GGML["Hardware Execution Layer"]
        direction TB
        ggml["ggml.go
        CGO Interface
        - Bridges Go and C++
        - Handles type conversion
        - Manages memory between languages"]

        subgraph Hardware_Specific["Hardware-Specific Implementations"]
            direction LR
            cpu["ggml-cpu.h
            CPU optimized ops"]
            cuda["ggml-cuda.h
            NVIDIA GPU ops"]
            metal["ggml-metal.h
            Apple GPU ops"]
            vulkan["ggml-vulkan.h
            Cross-platform GPU"]
            opencl["ggml-opencl.h
            OpenCL acceleration"]
        end

        note3["GGML provides optimized
        implementations for each hardware:
        - Automatic dispatch
        - Hardware-specific optimizations
        - Memory management
        - Parallel execution"]
    end

    %% Connections with explanations
    Models --> |"Makes high-level calls
    (e.g., self-attention)"| ML_Ops
    ML_Ops --> |"Translates to tensor operations
    (e.g., matmul, softmax)"| GGML
    GGML --> |"Executes optimized code
    on target hardware"| Hardware_Specific

    %% Styling
    classDef model fill:#fff,stroke:#01579b,stroke-width:2px
    classDef ml fill:#fff,stroke:#e65100,stroke-width:2px
    classDef hw fill:#fff,stroke:#b71c1c,stroke-width:2px
    classDef note fill:#fff,stroke:#666,stroke-dasharray: 5 5

    class llama,mllama,qwen,qwen_vl model
    class nn_ops,backend ml
    class ggml,cpu,cuda,metal,vulkan,opencl hw
    class note1,note2,note3 note

    %% Style subgraphs
    style Models fill:#fff,stroke:#01579b,stroke-width:2px
    style ML_Ops fill:#fff,stroke:#e65100,stroke-width:2px
    style GGML fill:#fff,stroke:#b71c1c,stroke-width:2px
    style Hardware_Specific fill:#fff,stroke:#b71c1c,stroke-width:1px
```
When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.
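As a taste of what that looks like in practice, here is a minimal, illustrative Go sketch of a model-layer type delegating its math to the `nn` layer; the exact types and signatures in `ml` and `ml/nn` may differ from what is shown:

```go
package mymodel

import (
	"github.com/ollama/ollama/ml"
	"github.com/ollama/ollama/ml/nn"
)

// Block is one hypothetical transformer block: the model layer owns the
// architecture, while nn.* and ml.Tensor own the math and, below them,
// GGML owns the hardware dispatch.
type Block struct {
	AttentionNorm *nn.RMSNorm
	Output        *nn.Linear
}

func (b *Block) Forward(ctx ml.Context, hidden ml.Tensor) ml.Tensor {
	// Each nn call lowers to tensor operations executed by the backend.
	hidden = b.AttentionNorm.Forward(ctx, hidden, 1e-5)
	return b.Output.Forward(ctx, hidden)
}
```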
## Implementation Steps

### 1. Environment Setup

First, clone the Ollama repository and get it running locally by following the [development setup guide](https://github.com/ollama/ollama/blob/main/docs/development.md).
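For instance, a typical first run looks something like the following (assuming a Go toolchain plus the C/C++ compiler the cgo-based GGML backend requires; see the development guide for platform specifics):

```bash
git clone https://github.com/ollama/ollama.git
cd ollama
go run . serve
```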
### 2. Research Implementation

Get the original model implementation running. This typically involves:

- Cloning the research code repository (usually Python-based)
- Setting up the required environment
- Running inference with sample inputs
- Understanding the model architecture and forward pass
### 3. Project Structure Setup

Create the necessary file structure by referencing previous model implementations. You'll need:

```
model/
└── your-model/
├── model.go # Architecture and forward pass implementation
├── convert.go # Weight conversion logic (PyTorch/SafeTensors to GGML)
└── convert_test.go # Conversion logic tests
```
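Much of `convert.go` is typically a mapping from the source checkpoint's tensor names (and layouts) to the names the GGML side expects. A hypothetical sketch follows; `mapTensorName` and the name pairs below are illustrative, not Ollama's actual API:

```go
package mymodel

import "strings"

// mapTensorName translates a hypothetical PyTorch/SafeTensors tensor name
// to its GGUF counterpart; real conversions also reshape or requantize data.
func mapTensorName(name string) string {
	replacements := []struct{ from, to string }{
		{"model.embed_tokens", "token_embd"},
		{"model.norm", "output_norm"},
		{"self_attn.q_proj", "attn_q"},
	}
	for _, r := range replacements {
		name = strings.Replace(name, r.from, r.to, 1)
	}
	return name
}
```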
Then add your model to the imports in [model/models/models.go](https://github.com/ollama/ollama/blob/main/model/models/models.go) so it is compiled into the engine:
```go
package models

import (
	_ "github.com/ollama/ollama/model/models/llama"
	_ "github.com/ollama/ollama/model/models/mllama"
	_ "github.com/ollama/ollama/model/models/your-model" // Add your model here
)
```
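The blank import matters because each model package registers itself in an `init()` function, which runs when the package is linked in. A sketch of the pattern — the exact `Register` signature in the `model` package is an assumption here:

```go
package mymodel

import "github.com/ollama/ollama/model"

func init() {
	// Registering under the architecture name lets the engine construct
	// this model (via the New constructor you implement in model.go)
	// when it loads a GGUF with that architecture. The Register
	// signature shown is illustrative.
	model.Register("my-model", New)
}
```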
### 4. Development Process

1. **Open a Draft PR**
   - Create a draft pull request in the `ollama/ollama` repository
   - Use this as a communication channel with Ollama maintainers

2. **Implement Weight Conversion**
   - Work on `convert.go`
   - Reference existing conversion implementations
   - Create a basic Modelfile:

     ```
     FROM /path/to/model
     ```

   - Test conversion:

     ```bash
     go run . create <my-model> -f /path/to/Modelfile
     ```

3. **Implement Model Logic**
   - Implement the `New()` and `Forward()` functions in `model.go` (see the skeleton sketched after this list)
   - Reference existing model implementations
   - Debug the forward pass:

     ```bash
     OLLAMA_DEBUG=1 go run . run <my-model>
     ```

   - Compare the output against the research implementation

4. **Implement the Tokenizer**
   - Implement a new tokenizer if required
   - Ensure compatibility with the model architecture

5. **Test Text Generation**
   - Implement proper prompt formatting
   - Test basic generation:

     ```bash
     go run . run <my-model> "hello"
     ```
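To make step 3 concrete, here is a skeletal `model.go`, assuming interfaces shaped like those of the existing implementations under `model/models/`; treat every name and signature as illustrative:

```go
package mymodel

import (
	"github.com/ollama/ollama/ml"
	"github.com/ollama/ollama/ml/nn"
)

// Model mirrors the shape of existing implementations; fields hold the
// converted weights. Names and signatures here are illustrative.
type Model struct {
	TokenEmbedding *nn.Embedding
	OutputNorm     *nn.RMSNorm
	Output         *nn.Linear
	// ... per-layer attention and feed-forward weights ...
}

// New wires up the model from the converted weights and hyperparameters.
func New() (*Model, error) {
	return &Model{}, nil // weight/config loading omitted in this sketch
}

// Forward embeds the input tokens, runs the transformer blocks, and
// projects the final hidden state to vocabulary logits.
func (m *Model) Forward(ctx ml.Context, tokens ml.Tensor) (ml.Tensor, error) {
	hidden := m.TokenEmbedding.Forward(ctx, tokens)
	// ... attention and feed-forward blocks, one per layer ...
	hidden = m.OutputNorm.Forward(ctx, hidden, 1e-5)
	return m.Output.Forward(ctx, hidden), nil
}
```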
### 5. Testing

1. Add comprehensive tests to:
   - `model_test.go`
   - `convert_test.go`
2. Ensure the tests cover:
   - Weight conversion
   - Model initialization
   - Text generation
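Conversion tests, for example, often take Go's table-driven shape. This minimal sketch assumes the hypothetical `mapTensorName` helper from the conversion sketch above:

```go
package mymodel

import "testing"

// TestMapTensorName verifies that source checkpoint tensor names map to
// the GGUF names the model expects.
func TestMapTensorName(t *testing.T) {
	cases := []struct {
		in, want string
	}{
		{"model.embed_tokens.weight", "token_embd.weight"},
		{"model.norm.weight", "output_norm.weight"},
	}
	for _, tc := range cases {
		if got := mapTensorName(tc.in); got != tc.want {
			t.Errorf("mapTensorName(%q) = %q, want %q", tc.in, got, tc.want)
		}
	}
}
```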
### 6. Model Deployment

1. **Finalize the Pull Request**
   - Move the PR out of draft state
   - Address reviewer feedback

2. **Deploy to ollama.com**
   - Determine the model's prompt format and convert it to a Go template (see the sketch after this list)
   - Create the final Modelfile:

     ```
     FROM <converted-gguf>
     TEMPLATE <prompt-template>
     LICENSE <license-info>
     # Add additional parameters if needed
     ```

   - Push to ollama.com:

     ```bash
     ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
     ollama push <your-namespace>/<your-model>
     ```

3. **Integration Testing**
   - Run end-to-end tests
   - Verify model behavior in a production environment
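The `TEMPLATE` directive uses Go's text/template syntax with variables such as `.System`, `.Prompt`, and `.Response`. Below is a sketch for a hypothetical chat format; the special tokens are placeholders, so substitute the ones your model was actually trained with:

```
FROM ./my-model.gguf
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}<|user|>
{{ .Prompt }}<|end|>
<|assistant|>
"""
PARAMETER stop <|end|>
```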