Quantization¶
For design details, quantization scheme tradeoffs and why Q4_K_M is used, see the Training Pipeline architecture page.
Prerequisites¶
- The merged model from the fuse step (see the Training guide) at
ground_segment/training/orion_merged/. - llama.cpp built from source (for the
llama-quantizebinary). - ~11 GB free disk space for all intermediate artifacts. See Compute Budgets for details.
Step 1: Build llama.cpp (if not built already)¶
cd ground_segment/llama.cpp
cmake -B build \
-DBUILD_SHARED_LIBS=OFF \
-DLLAMA_BUILD_TESTS=OFF \
-DLLAMA_BUILD_EXAMPLES=OFF \
-DLLAMA_BUILD_SERVER=OFF
cmake --build build -j$(nproc)
This produces the llama-quantize binary in the build directory.
Step 2: Convert to GGUF (FP16)¶
Use the convert_hf_to_gguf.py script to convert the merged Hugging Face model to GGUF format:
# make sure you are in the uv environment
# created during data generation
cd ground_segment/training/
# using `uv run` does not work here
python ../llama.cpp/convert_hf_to_gguf.py \
./orion_merged/ \
--outfile orion-f16.gguf \
--outtype f16
This produces an FP16 GGUF file. For VLMs with a multimodal projector, the conversion also generates a separate projector file.
Step 3: Extract the Multimodal Projector¶
For vision-language models, the multimodal projector (mmproj) must be extracted separately. The conversion script typically outputs this automatically. If not, you can extract it:
python ../llama.cpp/convert_hf_to_gguf.py \
./orion_merged/ \
--outfile orion-mmproj-f16.gguf \
--outtype f16 \
--mmproj
The projector file stays in FP16 as it is small and does not need quantization.
Step 4: Quantize to Q4_K_M¶
Quantize the main model file (not the projector):
../llama.cpp/build/bin/llama-quantize orion-f16.gguf orion-q4_k_m.gguf Q4_K_M
Output Files¶
After quantization, you should have two files ready for deployment:
orion-q4_k_m.gguf: Quantized language model (Q4_K_M), approximately 730 MB.orion-mmproj-f16.gguf: Multimodal projector (FP16), approximately 814 MB.
Troubleshooting¶
Quantized model produces degraded output: Try Q5_K_M for higher quality at the cost of larger file size. Compare outputs against the FP16 GGUF to isolate whether the issue is from quantization or the fine-tuning itself.
Out of memory on Pi 5: Ensure no other large processes are running. The Q4_K_M model uses ~1.75 GB RSS (measured) on the 8 GB Pi 5, leaving ample headroom. If memory is tight, consider Q4_K_S for a slightly smaller footprint.