mirror of
https://github.com/ollama/ollama.git
synced 2025-06-14 05:20:57 +02:00
Update add-a-model.md
parent
b5fc84c930
commit
e3f3043f5b
@ -2,7 +2,7 @@
|
|||||||
|
|
||||||
> **Note**: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.
|
> **Note**: This guide and the Go inference engine are in early development and will be updated as implementation details evolve.
|
||||||
|
|
||||||
This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to deploying your model to ollama.com.
|
This guide outlines the process of implementing a new model in Ollama's inference engine. It covers everything from initial setup to publishing your model to ollama.com.
|
||||||
|
|
||||||
## Architecture Overview
|
## Architecture Overview
|
||||||
|
|
||||||
@ -12,96 +12,49 @@ Below is a diagram showing Ollama's inference engine architecture layers and how
|
|||||||
graph TB
|
graph TB
|
||||||
subgraph Models["Model Layer: LLM Implementations"]
|
subgraph Models["Model Layer: LLM Implementations"]
|
||||||
direction TB
|
direction TB
|
||||||
llama["model/models/llama/model.go"]
|
llama["model/models/llama"]
|
||||||
mllama["model/models/mllama/model.go"]
|
mllama["model/models/mllama"]
|
||||||
qwen["model/models/qwen2/model.go"]
|
qwen["model/models/qwen2"]
|
||||||
qwen_vl["model/models/qwen2vl/model.go"]
|
etc["...etc"]
|
||||||
|
|
||||||
note1["Each model implements a specific architecture
|
note1[" Each model implements a<br>specific architecture:<br>- Defines model parameters<br>- Implements forward pass"]
|
||||||
- Defines model parameters
|
|
||||||
- Implements forward pass"]
|
|
||||||
end
|
end
|
||||||
|
|
||||||
subgraph ML_Ops["Neural Network Operations"]
|
subgraph ML_Ops["Neural Network Operations"]
|
||||||
direction TB
|
direction TB
|
||||||
nn_ops["nn/
|
nn_ops[" nn/<br>linear.go: Matrix multiplication<br>embedding.go: Token embedding lookups<br>normalization.go: Layer norm operations<br>convolution.go: Convolutional operations "]
|
||||||
linear.go - Matrix operations
|
|
||||||
embedding.go - Token embeddings
|
|
||||||
normalization.go - Layer normalization
|
|
||||||
convolution.go - Conv operations"]
|
|
||||||
|
|
||||||
backend["ml/backend.go
|
backend[" ml/backend.go<br>Hardware Abstraction Layer:<br>- Defines tensor operations<br>- Manages computation graphs<br>- Handles memory allocation "]
|
||||||
Hardware Abstraction Layer
|
|
||||||
- Defines tensor operations
|
|
||||||
- Manages computation graphs
|
|
||||||
- Handles memory allocation"]
|
|
||||||
|
|
||||||
note2["Common neural net operations
|
note2[" Common neural net operations:<br>- Abstracts hardware details<br>- Provides unified API<br>- Manages computation flow "]
|
||||||
used across different models
|
|
||||||
- Abstracts hardware details
|
|
||||||
- Provides unified API
|
|
||||||
- Manages computation flow"]
|
|
||||||
end
|
end
|
||||||
|
|
||||||
subgraph GGML["Hardware Execution Layer"]
|
subgraph Hardware["Backend Execution Layer"]
|
||||||
direction TB
|
direction TB
|
||||||
ggml["ggml.go
|
backend_impl[" The backend package provides:<br>- Unified computation interface<br>- Automatic hardware selection<br>- Optimized kernels<br>- Efficient memory management "]
|
||||||
CGO Interface
|
|
||||||
- Bridges Go and C++
|
|
||||||
- Handles type conversion
|
|
||||||
- Manages memory between languages"]
|
|
||||||
|
|
||||||
subgraph Hardware_Specific["Hardware-Specific Implementations"]
|
|
||||||
direction LR
|
|
||||||
cpu["ggml-cpu.h
|
|
||||||
CPU optimized ops"]
|
|
||||||
cuda["ggml-cuda.h
|
|
||||||
NVIDIA GPU ops"]
|
|
||||||
metal["ggml-metal.h
|
|
||||||
Apple GPU ops"]
|
|
||||||
vulkan["ggml-vulkan.h
|
|
||||||
Cross-platform GPU"]
|
|
||||||
opencl["ggml-opencl.h
|
|
||||||
OpenCL acceleration"]
|
|
||||||
end
|
|
||||||
|
|
||||||
note3["GGML provides optimized
|
|
||||||
implementations for each hardware:
|
|
||||||
- Automatic dispatch
|
|
||||||
- Hardware-specific optimizations
|
|
||||||
- Memory management
|
|
||||||
- Parallel execution"]
|
|
||||||
end
|
end
|
||||||
|
|
||||||
%% Connections with explanations
|
Models --> |" Makes high-level calls<br>(e.g., self-attention) "| ML_Ops
|
||||||
Models --> |"Makes high-level calls
|
ML_Ops --> |" Translates to tensor operations<br>(e.g., matmul, softmax) "| Hardware
|
||||||
(e.g., self-attention)"| ML_Ops
|
|
||||||
ML_Ops --> |"Translates to tensor operations
|
|
||||||
(e.g., matmul, softmax)"| GGML
|
|
||||||
GGML --> |"Executes optimized code
|
|
||||||
on target hardware"| Hardware_Specific
|
|
||||||
|
|
||||||
%% Styling
|
|
||||||
classDef model fill:#fff,stroke:#01579b,stroke-width:2px
|
|
||||||
classDef ml fill:#fff,stroke:#e65100,stroke-width:2px
|
|
||||||
classDef hw fill:#fff,stroke:#b71c1c,stroke-width:2px
|
|
||||||
classDef note fill:#fff,stroke:#666,stroke-dasharray: 5 5
|
|
||||||
|
|
||||||
class llama,mllama,qwen,qwen_vl,pixtral model
|
|
||||||
class nn_ops,backend ml
|
|
||||||
class ggml,cpu,cuda,metal,vulkan,opencl hw
|
|
||||||
class note1,note2,note3 note
|
|
||||||
|
|
||||||
%% Style subgraphs
|
|
||||||
style Models fill:#fff,stroke:#01579b,stroke-width:2px
|
|
||||||
style ML_Ops fill:#fff,stroke:#e65100,stroke-width:2px
|
|
||||||
style GGML fill:#fff,stroke:#b71c1c,stroke-width:2px
|
|
||||||
style Hardware_Specific fill:#fff,stroke:#b71c1c,stroke-width:1px
|
|
||||||
```
|
```
|
||||||
|
|
||||||
When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.
|
When implementing a new model, you'll primarily work in the model layer, interfacing with the neural network operations layer.
|
||||||
|
|
||||||
## Implementation Steps
|
## Implementation Process Overview
|
||||||
|
|
||||||
|
Here's the high-level process for implementing a new model in Ollama:
|
||||||
|
|
||||||
|
1. **Environment Setup**: Clone the repository and set up your development environment
|
||||||
|
2. **Research Implementation**: Understand the original model architecture
|
||||||
|
3. **Project Structure Setup**: Set up the necessary file structure
|
||||||
|
4. **Create Basic Modelfile**: Create a simple Modelfile for testing
|
||||||
|
5. **Implement Weight Conversion**: Map from original format to GGUF
|
||||||
|
6. **Open a Draft PR**: Create a draft pull request to establish communication with maintainers
|
||||||
|
7. **Implement Model Logic**: Create the model architecture and forward pass
|
||||||
|
8. **Quality Check and Final Steps**: Create a Modelfile, add tests and ensure functionality
|
||||||
|
9. **Finalize PR and Publish**: Complete the PR and publish to ollama.com
|
||||||
|
|
||||||
|
## Implementation Steps in Detail
|
||||||
|
|
||||||
### 1. Environment Setup
|
### 1. Environment Setup
|
||||||
|
|
||||||
@ -121,11 +74,11 @@ Get the original model implementation running. This typically involves:
|
|||||||
Create the necessary file structure by referencing previous model implementations. You'll need:
|
Create the necessary file structure by referencing previous model implementations. You'll need:
|
||||||
|
|
||||||
```
|
```
|
||||||
|
convert/
|
||||||
|
└── convert_your-model.go # Weight conversion logic (PyTorch/SafeTensors to GGML)
|
||||||
model/
|
model/
|
||||||
└── your-model/
|
└── your-model/
|
||||||
├── model.go # Architecture and forward pass implementation
|
└── model.go # Architecture and forward pass implementation
|
||||||
├── convert.go # Weight conversion logic (PyTorch/SafeTensors to GGML)
|
|
||||||
└── convert_test.go # Conversion logic tests
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Add your model to the main paths in [model/models/models.go](https://github.com/ollama/ollama/blob/main/model/models/models.go):
|
Add your model to the main paths in [model/models/models.go](https://github.com/ollama/ollama/blob/main/model/models/models.go):
|
||||||
@ -140,45 +93,194 @@ import (
|
|||||||
)
|
)
|
||||||
```
|
```
|
||||||
|
|
||||||
### 4. Development Process
|
### 4. Create a Basic Modelfile
|
||||||
|
|
||||||
1. **Open a Draft PR**
|
Create a simple Modelfile early in the process to facilitate testing:
|
||||||
- Create a draft pull request in the `ollama/ollama` repository
|
|
||||||
- Use this as a communication channel with Ollama maintainers
|
|
||||||
|
|
||||||
2. **Implement Weight Conversion**
|
```
|
||||||
- Work on `convert.go`
|
FROM /path/to/model
|
||||||
- Reference existing conversion implementations
|
TEMPLATE "{{.Prompt}}" # Use a static prompt format for initial testing
|
||||||
- Create a basic Modelfile:
|
```
|
||||||
```
|
|
||||||
FROM /path/to/model
|
This allows you to test your implementation with consistent inputs before finalizing the proper prompt template.
|
||||||
```
|
|
||||||
- Test conversion:
|
### 5. Implement Weight Conversion
|
||||||
```bash
|
|
||||||
go run . create <my-model> -f /path/to/Modelfile
|
- Work on `convert/convert_your-model.go`
|
||||||
|
- Reference existing conversion implementations
|
||||||
|
- Conversion maps PyTorch/SafeTensors tensor names to GGUF tensor names; choose the mapping that best fits your model
|
||||||
|
- Understand typical GGUF layout and structure:
|
||||||
|
|
||||||
|
**Typical GGUF Layout:**
|
||||||
|
```
|
||||||
|
GGUF
|
||||||
|
├── Metadata Section
|
||||||
|
│ ├── Model Parameters
|
||||||
|
│ │ ├── General architecture parameters
|
||||||
|
│ │ │ ├── "{arch}.vocab_size" (e.g., "llama.vocab_size")
|
||||||
|
│ │ │ ├── "{arch}.context_length" (e.g., "llama.context_length")
|
||||||
|
│ │ │ ├── "{arch}.embedding_length" (e.g., "llama.embedding_length")
|
||||||
|
│ │ │ └── "{arch}.block_count" (e.g., "llama.block_count")
|
||||||
|
│ │ │
|
||||||
|
│ │ └── Architecture-specific parameters
|
||||||
|
│ │ ├── "{arch}.attention.head_count" (e.g., "llama.attention.head_count")
|
||||||
|
│ │ ├── "{arch}.attention.head_count_kv" (e.g., "llama.attention.head_count_kv")
|
||||||
|
│ │ ├── "{arch}.rope.dimension_count" (e.g., "llama.rope.dimension_count")
|
||||||
|
│ │ └── "{arch}.attention.layer_norm_rms_epsilon" (e.g., "llama.attention.layer_norm_rms_epsilon")
|
||||||
|
│ │
|
||||||
|
│ ├── Tokenizer parameters
|
||||||
|
│ │ ├── "tokenizer.ggml.model" (e.g., "llama")
|
||||||
|
│ │ ├── "tokenizer.ggml.tokens" (vocabulary tokens)
|
||||||
|
│ │ ├── "tokenizer.ggml.bos_id" (beginning of sequence token ID)
|
||||||
|
│ │ └── "tokenizer.ggml.eos_id" (end of sequence token ID)
|
||||||
|
│ │
|
||||||
|
│ └── General metadata
|
||||||
|
│ └── "general.architecture" (e.g., "llama", "qwen2", "phi")
|
||||||
|
│
|
||||||
|
└── Tensor Data Section
|
||||||
|
├── Common tensors:
|
||||||
|
│ ├── "token_embd.weight" (token embedding matrix)
|
||||||
|
│ ├── "rope_freqs.weight" (RoPE frequency weights)
|
||||||
|
│ ├── "output_norm.weight" (final layer normalization)
|
||||||
|
│ └── "output.weight" (output projection)
|
||||||
|
│
|
||||||
|
└── Layer-specific tensors:
|
||||||
|
├── "blk.{i}.attn_q.weight" (query projection)
|
||||||
|
├── "blk.{i}.attn_k.weight" (key projection)
|
||||||
|
├── "blk.{i}.attn_v.weight" (value projection)
|
||||||
|
├── "blk.{i}.attn_output.weight" (attention output)
|
||||||
|
├── "blk.{i}.attn_norm.weight" (attention normalization)
|
||||||
|
├── "blk.{i}.ffn_norm.weight" (feed-forward normalization)
|
||||||
|
├── "blk.{i}.ffn_up.weight" (FFN up projection)
|
||||||
|
├── "blk.{i}.ffn_down.weight" (FFN down projection)
|
||||||
|
└── "blk.{i}.ffn_gate.weight" (FFN gate projection)
|
||||||
|
```
|
||||||
|
|
||||||
|
- Key conversion details include:
|
||||||
|
- Linear weight matrices (sometimes need transposition)
|
||||||
|
- Layer normalization weights (might need reshaping)
|
||||||
|
- **Note**: In GGML naming, the `ffn_*` tensors hold the weights of the model's MLP (Multi-Layer Perceptron) block
|
||||||
|
|
||||||
|
- Test conversion:
|
||||||
|
```bash
|
||||||
|
go run . create <my-model> -f /path/to/Modelfile
|
||||||
|
```
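The name mapping described in this step can be sketched with the standard library alone. This is a minimal illustration, not the converter's actual code: the left-hand names assume a Hugging Face style Llama checkpoint, and `hfToGGUF` and `GGUFName` are hypothetical helpers introduced here. The right-hand names follow the GGUF layout shown above.

```go
package main

import (
	"fmt"
	"strings"
)

// hfToGGUF is a hypothetical replacement table. Each pair maps an assumed
// source substring (left) to the GGUF substring from the layout above (right).
var hfToGGUF = strings.NewReplacer(
	"model.embed_tokens", "token_embd",
	"model.norm", "output_norm",
	"lm_head", "output",
	"model.layers", "blk",
	"input_layernorm", "attn_norm",
	"post_attention_layernorm", "ffn_norm",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.o_proj", "attn_output",
	"mlp.gate_proj", "ffn_gate",
	"mlp.up_proj", "ffn_up",
	"mlp.down_proj", "ffn_down",
)

// GGUFName rewrites one source tensor name into its GGUF equivalent.
func GGUFName(name string) string {
	return hfToGGUF.Replace(name)
}

func main() {
	for _, name := range []string{
		"model.embed_tokens.weight",
		"model.layers.0.self_attn.q_proj.weight",
		"model.layers.3.mlp.down_proj.weight",
		"lm_head.weight",
	} {
		fmt.Printf("%s -> %s\n", name, GGUFName(name))
	}
}
```

A substring table keeps the mapping declarative, so adding a tensor for a new architecture is a one-line change. Existing files in `convert/` remain the reference for the real conventions.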
|
||||||
|
|
||||||
|
### 6. Open a Draft PR
|
||||||
|
|
||||||
|
After implementing the initial weight conversion, creating a draft pull request is recommended as it:
|
||||||
|
- Establishes a communication channel with Ollama maintainers
|
||||||
|
- Allows for early feedback on your approach
|
||||||
|
- Makes it easier to track progress and changes
|
||||||
|
|
||||||
|
To open a draft PR:
|
||||||
|
1. Fork the repository
|
||||||
|
2. Create a new branch for your model implementation
|
||||||
|
3. Make initial commits with your weight conversion implementation
|
||||||
|
4. Open a PR in the `ollama/ollama` repository and mark it as draft
|
||||||
|
5. Include a clear description of the model you're implementing
|
||||||
|
|
||||||
|
### 7. Implement Model Logic
|
||||||
|
|
||||||
|
- Reference existing model implementations
|
||||||
|
- Implement `New()` and `Forward()` functions in `model.go`:
|
||||||
|
|
||||||
|
**The `New()` function:**
|
||||||
|
- Creates and initializes your model structure
|
||||||
|
- Loads configuration parameters (embedding size, attention heads, etc.)
|
||||||
|
- Sets up the tokenizer with vocabulary and special tokens
|
||||||
|
- Initializes all model layers and weights
|
||||||
|
- **Important**: Sets up the KV cache for efficient inference
|
||||||
|
- Example:
|
||||||
|
```go
|
||||||
|
func New(c ml.Config) (model.Model, error) {
|
||||||
|
m := &Model{
|
||||||
|
// Initialize tokenizer
|
||||||
|
BytePairEncoding: model.NewBytePairEncoding(...),
|
||||||
|
// Create layer arrays
|
||||||
|
Layers: make([]Layer, c.Uint("block_count")),
|
||||||
|
// Set model parameters
|
||||||
|
Options: &Options{...},
|
||||||
|
}
|
||||||
|
// Initialize KV cache for efficient inference
|
||||||
|
m.Cache = kvcache.NewCausalCache(m.Shift)
|
||||||
|
return m, nil
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**The `Forward()` function:**
|
||||||
|
- **What it does**: Defines the computational graph of your model
|
||||||
|
- **Important**: The graph is NOT executed immediately - it's built first, then executed later when predictions are needed
|
||||||
|
- Takes input tokens and converts them to embeddings
|
||||||
|
- Processes inputs through transformer layers (attention and feed-forward networks)
|
||||||
|
- Creates the path for data flow through your model's components
|
||||||
|
- Example:
|
||||||
|
```go
|
||||||
|
func (m *Model) Forward(ctx ml.Context, opts model.Options) (ml.Tensor, error) {
|
||||||
|
// Convert inputs to tensors
|
||||||
|
inputTensor, err := ctx.FromIntSlice(opts.Inputs, len(opts.Inputs))
if err != nil {
	return nil, err
}
|
||||||
|
positionsTensor, err := ctx.FromIntSlice(opts.Positions, len(opts.Positions))
if err != nil {
	return nil, err
}
|
||||||
|
|
||||||
|
// Initial token embedding
|
||||||
|
hiddenStates := m.TokenEmbedding.Forward(ctx, inputTensor)
|
||||||
|
|
||||||
|
// Process through transformer layers
|
||||||
|
for i, layer := range m.Layers {
|
||||||
|
m.Cache.SetLayer(i)
|
||||||
|
hiddenStates = layer.Forward(ctx, hiddenStates, positionsTensor, m.Cache, m.Options)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Final processing and output
|
||||||
|
normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)
|
||||||
|
logits := m.Output.Forward(ctx, normalizedOutput)
|
||||||
|
|
||||||
|
// Return logits for requested positions
|
||||||
|
outputsTensor, err := ctx.FromIntSlice(opts.Outputs, len(opts.Outputs))
if err != nil {
	return nil, err
}
|
||||||
|
return logits.Rows(ctx, outputsTensor), nil
|
||||||
|
}
|
||||||
|
```
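The build-first, execute-later behavior of `Forward()` can be illustrated with a toy graph. This is a sketch of the general idea only: `Node`, `Const`, `Add`, `Mul`, and `Compute` are hypothetical names and do not correspond to the `ml` package API.

```go
package main

import "fmt"

// Node is one recorded operation in a toy computation graph (hypothetical type).
type Node struct {
	op     string
	value  float64
	inputs []*Node
}

// The constructors below only record what should happen; they compute nothing.
func Const(v float64) *Node { return &Node{op: "const", value: v} }
func Add(a, b *Node) *Node  { return &Node{op: "add", inputs: []*Node{a, b}} }
func Mul(a, b *Node) *Node  { return &Node{op: "mul", inputs: []*Node{a, b}} }

// Compute walks the recorded graph and evaluates it. No arithmetic runs
// before this call.
func Compute(n *Node) float64 {
	switch n.op {
	case "const":
		return n.value
	case "add":
		return Compute(n.inputs[0]) + Compute(n.inputs[1])
	case "mul":
		return Compute(n.inputs[0]) * Compute(n.inputs[1])
	}
	panic("unknown op: " + n.op)
}

func main() {
	// Graph construction: analogous to what a Forward() call does.
	graph := Add(Mul(Const(2), Const(3)), Const(4))

	// Graph execution: happens later, when a result is actually needed.
	fmt.Println(Compute(graph))
}
```

Separating construction from execution lets a backend see the whole graph before running it, which is what makes scheduling and memory planning possible.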
|
||||||
|
|
||||||
|
**Key Components to Implement:**
|
||||||
|
|
||||||
|
1. **KV Cache**:
|
||||||
|
- Improves inference performance for text generation
|
||||||
|
- How it works: Stores previously computed key and value tensors from self-attention, avoiding redundant computations
|
||||||
|
- Implementation: Use the `kvcache.NewCausalCache()` for autoregressive models
|
||||||
|
- Important: Must implement the `Shift()` function to handle rotary position embeddings with the cache
|
||||||
|
|
||||||
|
2. **Self-Attention**:
|
||||||
|
- Core component that learns contextual relationships between tokens
|
||||||
|
- Implements query, key, value projections and their interactions
|
||||||
|
- Must handle positional encoding (usually Rotary Position Embeddings)
|
||||||
|
- Uses the KV cache to make generation efficient
|
||||||
|
|
||||||
|
3. **Normalization Layers**:
|
||||||
|
- Purpose: Stabilizes training and maintains consistent activation distributions
|
||||||
|
- Types: RMSNorm, LayerNorm, etc. depending on model architecture
|
||||||
|
- Implementation: Apply before attention and feed-forward networks
|
||||||
|
- Example: `normalizedOutput := m.OutputNorm.Forward(ctx, hiddenStates, m.modelEpsilon)`
|
||||||
|
|
||||||
|
4. **Activation Functions**:
|
||||||
|
- Purpose: Introduces non-linearity into the model
|
||||||
|
- Common types: SILU (Sigmoid Linear Unit), GELU, ReLU
|
||||||
|
- Found in feed-forward/MLP blocks
|
||||||
|
- Example:
|
||||||
|
```go
|
||||||
|
// SwiGLU activation in MLP
|
||||||
|
gateActivation := mlp.Gate.Forward(ctx, hiddenState).SILU(ctx)
|
||||||
|
upProjection := mlp.Up.Forward(ctx, hiddenState)
|
||||||
|
intermediateStates := gateActivation.Mul(ctx, upProjection)
|
||||||
```
|
```
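To sanity-check the gated activation above, the same arithmetic can be written on plain slices. This is a minimal sketch assuming the standard definition SiLU(x) = x · sigmoid(x); `SiLU` and `SwiGLU` are hypothetical helpers, not part of Ollama's tensor API.

```go
package main

import (
	"fmt"
	"math"
)

// SiLU computes x * sigmoid(x), written as x / (1 + e^-x).
func SiLU(x float64) float64 {
	return x / (1 + math.Exp(-x))
}

// SwiGLU multiplies SiLU(gate[i]) by up[i] element-wise, mirroring
// Gate -> SILU -> Mul(Up) in the MLP example. It panics on a length mismatch.
func SwiGLU(gate, up []float64) []float64 {
	if len(gate) != len(up) {
		panic("gate and up must have the same length")
	}
	out := make([]float64, len(gate))
	for i := range gate {
		out[i] = SiLU(gate[i]) * up[i]
	}
	return out
}

func main() {
	gate := []float64{0, 1, -1}
	up := []float64{5, 2, 2}
	fmt.Println(SwiGLU(gate, up))
}
```

A scalar reference like this is handy when comparing intermediate values against the research implementation one layer at a time.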
|
||||||
|
- Run your forward pass:
|
||||||
|
```bash
|
||||||
|
# in the root of the ollama directory
|
||||||
|
go build .
|
||||||
|
OLLAMA_DEBUG=1 ./ollama serve
|
||||||
|
OLLAMA_DEBUG=1 ./ollama run <my-model>
|
||||||
|
```
|
||||||
|
- Compare output with research implementation
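One way to make the comparison with the research implementation concrete is to dump logits from both sides and measure the largest element-wise difference. The helper below is a hypothetical sketch; `MaxAbsDiff` and the sample values are illustrative and not part of the Ollama codebase.

```go
package main

import (
	"fmt"
	"math"
)

// MaxAbsDiff returns the largest absolute element-wise difference between
// two logit vectors, or an error if their lengths differ.
func MaxAbsDiff(a, b []float32) (float64, error) {
	if len(a) != len(b) {
		return 0, fmt.Errorf("length mismatch: %d vs %d", len(a), len(b))
	}
	var maxDiff float64
	for i := range a {
		d := math.Abs(float64(a[i]) - float64(b[i]))
		if d > maxDiff {
			maxDiff = d
		}
	}
	return maxDiff, nil
}

func main() {
	// Illustrative values only; real logits would come from both implementations.
	reference := []float32{1, 2, 3}
	candidate := []float32{1, 2.5, 3}

	diff, err := MaxAbsDiff(reference, candidate)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("max abs diff:", diff)
}
```

What counts as an acceptable difference depends on precision and quantization, so any threshold is a per-model judgment call.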
|
||||||
|
|
||||||
3. **Implement Model Logic**
|
### 8. Quality Check and Final Steps
|
||||||
- Implement `New()` and `Forward()` functions in `model.go`
|
|
||||||
- Reference existing model implementations
|
|
||||||
- Debug forward pass:
|
|
||||||
```bash
|
|
||||||
OLLAMA_DEBUG=1 go run . run <my-model>
|
|
||||||
```
|
|
||||||
- Compare output with research implementation
|
|
||||||
|
|
||||||
4. **Tokenizer Implementation**
|
|
||||||
- Implement a new tokenizer if required
|
|
||||||
- Ensure compatibility with model architecture
|
|
||||||
|
|
||||||
5. **Text Generation Testing**
|
|
||||||
- Implement proper prompt formatting
|
|
||||||
- Test basic generation:
|
|
||||||
```bash
|
|
||||||
go run . run <my-model> "hello"
|
|
||||||
```
|
|
||||||
|
|
||||||
### 5. Testing
|
|
||||||
|
|
||||||
1. Add comprehensive tests to:
|
1. Add comprehensive tests to:
|
||||||
- `model_test.go`
|
- `model_test.go`
|
||||||
@ -189,28 +291,36 @@ import (
|
|||||||
- Model initialization
|
- Model initialization
|
||||||
- Text generation
|
- Text generation
|
||||||
|
|
||||||
### 6. Model Deployment
|
3. **Create Final Modelfile**
|
||||||
|
- Replace the static prompt with the proper Go template for your model:
|
||||||
|
```
|
||||||
|
FROM <converted-gguf>
|
||||||
|
TEMPLATE <prompt-template> # Add the proper Go template for your model, including tools if needed
|
||||||
|
LICENSE <license-info> # Add appropriate license information
|
||||||
|
# Add additional parameters if needed
|
||||||
|
```
|
||||||
|
|
||||||
|
4. **End-to-end Testing**
|
||||||
|
- Run your model with your local Ollama build to ensure that it functions as expected
|
||||||
|
|
||||||
|
5. **Benchmark**
|
||||||
|
- Run performance benchmarks on your model implementation
|
||||||
|
```bash
|
||||||
|
# from the root of the Ollama directory, while a server is running locally
|
||||||
|
go build .
|
||||||
|
OLLAMA_DEBUG=1 ./ollama serve
|
||||||
|
go test -bench=. -m <your-model-name> ./...
|
||||||
|
```
|
||||||
|
|
||||||
|
### 9. Finalize PR and Publish to ollama.com
|
||||||
|
|
||||||
1. **Finalize Pull Request**
|
1. **Finalize Pull Request**
|
||||||
- Move PR out of draft state
|
- Move PR out of draft state
|
||||||
- Address reviewer feedback
|
- Address reviewer feedback
|
||||||
|
|
||||||
2. **Deploy to ollama.com**
|
2. **Publish to ollama.com**
|
||||||
- Determine model prompt format
|
|
||||||
- Convert prompt format to Go template
|
|
||||||
- Create final Modelfile:
|
|
||||||
```
|
|
||||||
FROM <converted-gguf>
|
|
||||||
TEMPLATE <prompt-template>
|
|
||||||
LICENSE <license-info>
|
|
||||||
# Add additional parameters if needed
|
|
||||||
```
|
|
||||||
- Push to ollama.com:
|
- Push to ollama.com:
|
||||||
```bash
|
```bash
|
||||||
ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
|
ollama create <your-namespace>/<your-model> -f /path/to/Modelfile
|
||||||
ollama push <your-namespace>/<your-model>
|
ollama push <your-namespace>/<your-model>
|
||||||
```
|
```
|
||||||
|
|
||||||
3. **Integration Testing**
|
|
||||||
- Run end-to-end tests
|
|
||||||
- Verify model behavior in production environment
|
|