Skip to content

Quantization

For design details, quantization scheme tradeoffs and why Q4_K_M is used, see the Training Pipeline architecture page.

Prerequisites

  • The merged model from the fuse step (see the Training guide) at ground_segment/training/orion_merged/.
  • llama.cpp built from source (for the llama-quantize binary).
  • ~11 GB free disk space for all intermediate artifacts. See Compute Budgets for details.

Step 1: Build llama.cpp (if not built already)

cd ground_segment/llama.cpp
cmake -B build \
  -DBUILD_SHARED_LIBS=OFF \
  -DLLAMA_BUILD_TESTS=OFF \
  -DLLAMA_BUILD_EXAMPLES=OFF \
  -DLLAMA_BUILD_SERVER=OFF
cmake --build build -j$(nproc)

This produces the llama-quantize binary in the build directory.

Step 2: Convert to GGUF (FP16)

Use the convert_hf_to_gguf.py script to convert the merged Hugging Face model to GGUF format:

# make sure you are in the uv environment
# created during data generation
cd ground_segment/training/

# using `uv run` does not work here
python ../llama.cpp/convert_hf_to_gguf.py \
    ./orion_merged/ \
    --outfile orion-f16.gguf \
    --outtype f16

This produces an FP16 GGUF file. For VLMs with a multimodal projector, the conversion also generates a separate projector file.

Step 3: Extract the Multimodal Projector

For vision-language models, the multimodal projector (mmproj) must be extracted separately. The conversion script typically outputs this automatically. If not, you can extract it:

python ../llama.cpp/convert_hf_to_gguf.py \
    ./orion_merged/ \
    --outfile orion-mmproj-f16.gguf \
    --outtype f16 \
    --mmproj

The projector file stays in FP16 as it is small and does not need quantization.

Step 4: Quantize to Q4_K_M

Quantize the main model file (not the projector):

../llama.cpp/build/bin/llama-quantize orion-f16.gguf orion-q4_k_m.gguf Q4_K_M

Output Files

After quantization, you should have two files ready for deployment:

  • orion-q4_k_m.gguf: Quantized language model (Q4_K_M), approximately 730 MB.
  • orion-mmproj-f16.gguf: Multimodal projector (FP16), approximately 814 MB.

Troubleshooting

Quantized model produces degraded output: Try Q5_K_M for higher quality at the cost of larger file size. Compare outputs against the FP16 GGUF to isolate whether the issue is from quantization or the fine-tuning itself.

Out of memory on Pi 5: Ensure no other large processes are running. The Q4_K_M model uses ~1.75 GB RSS (measured) on the 8 GB Pi 5, leaving ample headroom. If memory is tight, consider Q4_K_S for a slightly smaller footprint.