From 9245c8a1df52c5edf47609b75c1ca673797941dd Mon Sep 17 00:00:00 2001 From: Matt Williams Date: Thu, 12 Oct 2023 15:34:57 -0700 Subject: [PATCH 1/3] add how to quantize doc Signed-off-by: Matt Williams --- docs/modelfile.md | 3 +- docs/quantize.md | 80 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 82 insertions(+), 1 deletion(-) create mode 100644 docs/quantize.md diff --git a/docs/modelfile.md b/docs/modelfile.md index 64cebcb56..15072b0f7 100644 --- a/docs/modelfile.md +++ b/docs/modelfile.md @@ -124,6 +124,7 @@ PARAMETER | repeat_last_n | Sets how far back for the model to look back to prevent repetition. (Default: 64, 0 = disabled, -1 = num_ctx) | int | repeat_last_n 64 | | repeat_penalty | Sets how strongly to penalize repetitions. A higher value (e.g., 1.5) will penalize repetitions more strongly, while a lower value (e.g., 0.9) will be more lenient. (Default: 1.1) | float | repeat_penalty 1.1 | | temperature | The temperature of the model. Increasing the temperature will make the model answer more creatively. (Default: 0.8) | float | temperature 0.7 | +| seed | Sets the random number seed to use for generation. Setting this to a specific number will make the model generate the same text for the same prompt. (Default: 0) | int | seed 42 | | stop | Sets the stop sequences to use. | string | stop "AI assistant:" | | tfs_z | Tail free sampling is used to reduce the impact of less probable tokens from the output. A higher value (e.g., 2.0) will reduce the impact more, while a value of 1.0 disables this setting. (default: 1) | float | tfs_z 1 | | num_predict | Maximum number of tokens to predict when generating text. (Default: 128, -1 = infinite generation, -2 = fill context) | int | num_predict 42 | @@ -132,7 +133,7 @@ PARAMETER ### TEMPLATE -`TEMPLATE` of the full prompt template to be passed into the model. It may include (optionally) a system prompt and a user's prompt. This is used to create a full custom prompt, and syntax may be model specific. +`TEMPLATE` of the full prompt template to be passed into the model. It may include (optionally) a system prompt and a user's prompt. This is used to create a full custom prompt, and syntax may be model specific. You can usually find the template for a given model in the read #### Template Variables diff --git a/docs/quantize.md b/docs/quantize.md new file mode 100644 index 000000000..a5d20e9b2 --- /dev/null +++ b/docs/quantize.md @@ -0,0 +1,80 @@ +# How to Quantize a Model + +Sometimes the model you want to work with is not available at [https://ollama.ai/library](https://ollama.ai/library). If you want to try out that model before we have a chance to quantize it, you can use this process. + +## Figure out if we can run the model? + +Not all models will work with Ollama. There are a number of factors that go into whether we are able to work with the next cool model. First it has to work with llama.cpp. Then we have to have implemented the features of llama.cpp that it requires. And then, sometimes, even with both of those, the model might not work… + +1. What is the model you want to convert and upload? +2. Visit the model’s page on HuggingFace. +3. Switch to the **Files and versions** tab. +4. Click on the **config.json** file. If there is no config.json file, it may not work. +5. Take note of the **architecture** list in the json file. +6. Does any entry in the list match one of the following architectures? + 1. LlamaForCausalLM + 2. MistralForCausalLM + 3. RWForCausalLM + 4. FalconForCausalLM + 5. 
GPTNeoXForCausalLM + 6. GPTBigCodeForCausalLM +7. If the answer is yes, then there is a good chance the model will run after being converted and quantized. +8. An alternative to this process is to visit [https://caniquant.tvl.st](https://caniquant.tvl.st) and enter the org/modelname in the box and submit. + +## Clone llama.cpp to your machine + +If we know the model has a chance of working, then we need to convert and quantize. This is a matter of running two separate scripts in the llama.cpp project. + +1. Decide where you want the llama.cpp repository on your machine. +2. Navigate to that location and then run: + [`git clone https://github.com/ggerganov/llama.cpp.git`](https://github.com/ggerganov/llama.cpp.git) + 1. If you don’t have git installed, download this zip file and unzip it to that location: https://github.com/ggerganov/llama.cpp/archive/refs/heads/master.zip +3. Install the Python dependencies: `pip install torch transformers sentencepiece` + +## Convert the model to GGUF + +1. Decide on the right convert script to run. What was the model architecture you found in the first section. + 1. LlamaForCausalLM or MistralForCausalLM: + run `python3 convert.py ` + No need to specify fp16 or fp32. + 2. FalconForCausalLM or RWForCausalLM: + run `python3 convert-falcon-hf-to-gguf.py ` + fpsize depends on the weight size. 1 for fp16, 0 for fp32 + 3. GPTNeoXForCausalLM: + run `python3 convert-gptneox-hf-to-gguf.py ` + fpsize depends on the weight size. 1 for fp16, 0 for fp32 + 4. GPTBigCodeForCausalLM: + run `python3 convert-starcoder-hf-to-gguf.py ` + fpsize depends on the weight size. 1 for fp16, 0 for fp32 + +## Quantize the model + +If the model converted successfully, there is a good chance it will also quantize successfully. Now you need to decide on the quantization to use. We will always try to create all the quantizations and upload them to the library. You should decide which level is more important to you and quantize accordingly. + +The quantization options are as follows. Note that some architectures such as Falcon do not support K quants. + +- Q4_0 +- Q4_1 +- Q5_0 +- Q5_1 +- Q2_K +- Q3_K +- Q3_K_S +- Q3_K_M +- Q3_K_L +- Q4_K +- Q4_K_S +- Q4_K_M +- Q5_K +- Q5_K_S +- Q5_K_M +- Q6_K +- Q8_0 +- F16 +- F32 + +Run the following command `quantize ` + +## Now Create the Model + +Now you can create the Ollama model. Refer to the [modelfile](./modelfile.md) doc for more information on doing that. \ No newline at end of file From 3c975f898f0b775de91b8aab80c0b4613ebe64bd Mon Sep 17 00:00:00 2001 From: Matt Williams Date: Thu, 12 Oct 2023 15:57:50 -0700 Subject: [PATCH 2/3] update doc to refer to docker image Signed-off-by: Matt Williams --- docs/quantize.md | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/docs/quantize.md b/docs/quantize.md index a5d20e9b2..89bb7a6ec 100644 --- a/docs/quantize.md +++ b/docs/quantize.md @@ -21,17 +21,32 @@ Not all models will work with Ollama. There are a number of factors that go into 7. If the answer is yes, then there is a good chance the model will run after being converted and quantized. 8. An alternative to this process is to visit [https://caniquant.tvl.st](https://caniquant.tvl.st) and enter the org/modelname in the box and submit. -## Clone llama.cpp to your machine +At this point there are two processes you can use. You can either use a Docker container to convert and quantize, OR you can manually run the scripts. 
The Docker container is the easiest way to do it, but it requires you to have Docker installed on your machine. If you don't have Docker installed, you can follow the manual process. + +## Convert and Quantize with Docker + +Run `docker run --rm -v /path/to/model/repo:/repo ollama/quantize -q quantlevel /repo`. For instance, if you have downloaded the latest Mistral 7B model, then clone it to your machine. Then change into that directory and you can run +```shell +docker run --rm -v .:/repo ollama/quantize -q q4_0 /repo +``` + +You can find the different quantization levels below under **Quantize the Model**. + +This will output two files into the directory. First is a f16.bin file that is the model converted to GGUF. The second file is a q4_0.bin file which is the model quantized to a 4 bit quantization. You should rename it to something more descriptive. + +## Convert and Quantize Manually + +### Clone llama.cpp to your machine If we know the model has a chance of working, then we need to convert and quantize. This is a matter of running two separate scripts in the llama.cpp project. 1. Decide where you want the llama.cpp repository on your machine. 2. Navigate to that location and then run: [`git clone https://github.com/ggerganov/llama.cpp.git`](https://github.com/ggerganov/llama.cpp.git) - 1. If you don’t have git installed, download this zip file and unzip it to that location: https://github.com/ggerganov/llama.cpp/archive/refs/heads/master.zip + 1. If you don't have git installed, download this zip file and unzip it to that location: https://github.com/ggerganov/llama.cpp/archive/refs/heads/master.zip 3. Install the Python dependencies: `pip install torch transformers sentencepiece` -## Convert the model to GGUF +### Convert the model to GGUF 1. Decide on the right convert script to run. What was the model architecture you found in the first section. 1. LlamaForCausalLM or MistralForCausalLM: @@ -47,7 +62,7 @@ If we know the model has a chance of working, then we need to convert and quanti run `python3 convert-starcoder-hf-to-gguf.py ` fpsize depends on the weight size. 1 for fp16, 0 for fp32 -## Quantize the model +### Quantize the model If the model converted successfully, there is a good chance it will also quantize successfully. Now you need to decide on the quantization to use. We will always try to create all the quantizations and upload them to the library. You should decide which level is more important to you and quantize accordingly. From b2974a709524ed4a7e4948ab2b74128cc2e4577d Mon Sep 17 00:00:00 2001 From: Matt Williams Date: Sat, 14 Oct 2023 08:29:24 -0700 Subject: [PATCH 3/3] applied mikes comments Signed-off-by: Matt Williams --- docs/modelfile.md | 2 +- docs/quantize.md | 17 +++++++++-------- 2 files changed, 10 insertions(+), 9 deletions(-) diff --git a/docs/modelfile.md b/docs/modelfile.md index 15072b0f7..7b33d8c6d 100644 --- a/docs/modelfile.md +++ b/docs/modelfile.md @@ -133,7 +133,7 @@ PARAMETER ### TEMPLATE -`TEMPLATE` of the full prompt template to be passed into the model. It may include (optionally) a system prompt and a user's prompt. This is used to create a full custom prompt, and syntax may be model specific. You can usually find the template for a given model in the read +`TEMPLATE` of the full prompt template to be passed into the model. It may include (optionally) a system prompt and a user's prompt. This is used to create a full custom prompt, and syntax may be model specific. 
You can usually find the template for a given model in the readme for that model. #### Template Variables diff --git a/docs/quantize.md b/docs/quantize.md index 89bb7a6ec..afe0e78d5 100644 --- a/docs/quantize.md +++ b/docs/quantize.md @@ -4,10 +4,10 @@ Sometimes the model you want to work with is not available at [https://ollama.ai ## Figure out if we can run the model? -Not all models will work with Ollama. There are a number of factors that go into whether we are able to work with the next cool model. First it has to work with llama.cpp. Then we have to have implemented the features of llama.cpp that it requires. And then, sometimes, even with both of those, the model might not work… +Not all models will work with Ollama. There are a number of factors that go into whether we are able to work with the next cool model. First it has to work with llama.cpp. Then we have to have implemented the features of llama.cpp that it requires. And then, sometimes, even with both of those, the model might not work... 1. What is the model you want to convert and upload? -2. Visit the model’s page on HuggingFace. +2. Visit the model's page on HuggingFace. 3. Switch to the **Files and versions** tab. 4. Click on the **config.json** file. If there is no config.json file, it may not work. 5. Take note of the **architecture** list in the json file. @@ -25,8 +25,9 @@ At this point there are two processes you can use. You can either use a Docker c ## Convert and Quantize with Docker -Run `docker run --rm -v /path/to/model/repo:/repo ollama/quantize -q quantlevel /repo`. For instance, if you have downloaded the latest Mistral 7B model, then clone it to your machine. Then change into that directory and you can run -```shell +Run `docker run --rm -v /path/to/model/repo:/repo ollama/quantize -q quantlevel /repo`. For instance, if you have downloaded the latest Mistral 7B model, then clone it to your machine. Then change into that directory and you can run: + +```shell docker run --rm -v .:/repo ollama/quantize -q q4_0 /repo ``` @@ -34,6 +35,8 @@ You can find the different quantization levels below under **Quantize the Model* This will output two files into the directory. First is a f16.bin file that is the model converted to GGUF. The second file is a q4_0.bin file which is the model quantized to a 4 bit quantization. You should rename it to something more descriptive. +You can find the repository for the Docker container here: [https://github.com/mxyng/quantize](https://github.com/mxyng/quantize) + ## Convert and Quantize Manually ### Clone llama.cpp to your machine @@ -49,7 +52,7 @@ If we know the model has a chance of working, then we need to convert and quanti ### Convert the model to GGUF 1. Decide on the right convert script to run. What was the model architecture you found in the first section. - 1. LlamaForCausalLM or MistralForCausalLM: + 1. LlamaForCausalLM or MistralForCausalLM: run `python3 convert.py ` No need to specify fp16 or fp32. 2. FalconForCausalLM or RWForCausalLM: @@ -85,11 +88,9 @@ The quantization options are as follows. Note that some architectures such as Fa - Q5_K_M - Q6_K - Q8_0 -- F16 -- F32 Run the following command `quantize ` ## Now Create the Model -Now you can create the Ollama model. Refer to the [modelfile](./modelfile.md) doc for more information on doing that. \ No newline at end of file +Now you can create the Ollama model. Refer to the [modelfile](./modelfile.md) doc for more information on doing that.
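Taken together, the quantize.md added above describes a workflow of checking the architecture, converting, quantizing, and then creating the Ollama model. As a concrete, non-authoritative illustration of the first step, the sketch below fetches a model's config.json from Hugging Face and prints its `architectures` list; the repository name `mistralai/Mistral-7B-v0.1` is only an example, not something prescribed by the docs above.

```shell
# Example only: substitute the org/model you actually want to convert.
# Assumes the repository is public and serves config.json at the usual
# Hugging Face "resolve" path.
MODEL=mistralai/Mistral-7B-v0.1

curl -sL "https://huggingface.co/${MODEL}/resolve/main/config.json" \
  | python3 -c 'import json, sys; print(json.load(sys.stdin).get("architectures"))'

# For this particular model the list is ['MistralForCausalLM'], which is one
# of the supported architectures, so conversion has a good chance of working.
```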
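If that check passes, the Docker path described in quantize.md is the shorter route. The sketch below is one plausible end-to-end invocation, assuming Docker and git-lfs are already installed; the `ollama/quantize` image and its `-q` flag come from the doc above, while the clone location and the renamed output file are assumptions.

```shell
# Download the model weights (large; git-lfs required), then convert and
# quantize them in one step with the container described above.
git lfs install
git clone https://huggingface.co/mistralai/Mistral-7B-v0.1

docker run --rm -v "$(pwd)/Mistral-7B-v0.1:/repo" ollama/quantize -q q4_0 /repo

# Per the doc, the container leaves f16.bin (converted) and q4_0.bin
# (quantized) in the mounted directory; rename the one you plan to keep.
mv Mistral-7B-v0.1/q4_0.bin Mistral-7B-v0.1/mistral-7b-v0.1.q4_0.bin
```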
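The manual path follows the same shape. The sketch below assumes a Llama- or Mistral-style model (so the plain `convert.py` script applies) and a 2023-era llama.cpp checkout; the filename written by the convert script can vary between versions, so check the script's output before quantizing.

```shell
# Clone llama.cpp next to the downloaded model and install the Python
# dependencies listed in the doc above.
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
pip install torch transformers sentencepiece

# Convert the Hugging Face checkpoint to GGUF; for Llama/Mistral models
# there is no need to specify fp16 or fp32.
python3 convert.py ../Mistral-7B-v0.1

# Build the tools, then quantize the converted file to q4_0. The input
# filename below is the convert script's usual default and may differ.
make
./quantize ../Mistral-7B-v0.1/ggml-model-f16.gguf ../Mistral-7B-v0.1/mistral-7b-v0.1.q4_0.gguf q4_0
```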
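Finally, the quantized file can be wrapped in an Ollama model, which also exercises the `seed` parameter and `TEMPLATE` instruction documented in modelfile.md above. The Modelfile below is a minimal sketch: the model name and file path are carried over from the examples above, and the template shown is illustrative only; as the doc notes, the right prompt format is usually in the model's readme.

```shell
# Write a minimal Modelfile pointing at whichever quantized file you produced.
cat > Modelfile <<'EOF'
FROM ./Mistral-7B-v0.1/mistral-7b-v0.1.q4_0.bin

# Parameters documented in modelfile.md; a fixed seed makes runs repeatable.
PARAMETER temperature 0.7
PARAMETER seed 42

# The prompt template is model specific; check the model's readme.
TEMPLATE """{{ .System }}
USER: {{ .Prompt }}
ASSISTANT: """
EOF

# Create the model and try a prompt.
ollama create mistral-7b-q4 -f Modelfile
ollama run mistral-7b-q4 "Why is the sky blue?"
```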