Quantization¶
For design details, including quantization scheme tradeoffs and why Q4_K_M is used, see the Training Pipeline architecture page.
Prerequisites¶
- The merged model from the fuse step (see the Training guide) at ground_segment/training/orion_merged/.
- Python 3.10+ with the gguf package installed (pip install gguf).
- llama.cpp built from source (for the llama-quantize binary).
- ~11 GB of free disk space for all intermediate artifacts. See Compute Budgets for details.
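As a quick sanity check before you start, you can confirm the Python-side prerequisites and disk headroom from a shell (a minimal sketch; the path follows the layout above):
# Confirm the Python version and that the gguf package imports
python --version
python -c "import gguf; print(gguf.__file__)"
# Check free space where the intermediate artifacts will be written
df -h ground_segment/training/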
Step 1: Build llama.cpp¶
cd ground_segment/llama.cpp
mkdir build && cd build
cmake ..
cmake --build . --config Release
This produces the llama-quantize binary in the build/bin directory.
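To confirm the build succeeded, run the binary with no arguments from the llama.cpp root; it prints usage information, including the list of supported quantization types (Q4_K_M, Q5_K_M, and so on):
./build/bin/llama-quantize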
Step 2: Convert to GGUF (FP16)¶
Use the convert_hf_to_gguf.py script to convert the merged Hugging Face model to GGUF format:
cd ground_segment/llama.cpp
python convert_hf_to_gguf.py \
../training/orion_merged/ \
--outfile orion-f16.gguf \
--outtype f16
This produces an FP16 GGUF file. For VLMs with a multimodal projector, the conversion may also generate a separate projector file.
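To sanity-check the conversion before quantizing, you can inspect the file's metadata with the gguf-dump tool that ships with the gguf Python package (assuming a recent gguf release; --no-tensors skips the per-tensor listing):
# Print GGUF metadata such as the architecture and file type
gguf-dump --no-tensors orion-f16.gguf
The output should report the merged model's architecture and an f16 file type.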
Step 3: Extract the Multimodal Projector¶
For vision-language models, the multimodal projector (mmproj) must be extracted separately. The conversion script typically outputs this automatically. If not, you can extract it:
python convert_hf_to_gguf.py \
../training/orion_merged/ \
--outfile orion-mmproj-f16.gguf \
--outtype f16 \
--mmproj
The projector file stays in FP16; it is small and does not need quantization.
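At this point both FP16 artifacts should exist side by side; a quick listing confirms their presence and relative sizes (the projector is much smaller than the main model):
ls -lh orion-f16.gguf orion-mmproj-f16.gguf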
Step 4: Quantize to Q4_K_M¶
Quantize the main model file (not the projector):
./build/bin/llama-quantize orion-f16.gguf orion-q4_k_m.gguf Q4_K_M
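Comparing file sizes is a quick check that quantization ran as expected: the Q4_K_M file should be roughly a quarter the size of the FP16 input. If you want a higher-quality variant for A/B testing (see Troubleshooting below), the same FP16 file can be quantized again:
ls -lh orion-f16.gguf orion-q4_k_m.gguf
# Optional: a higher-quality Q5_K_M variant for comparison
./build/bin/llama-quantize orion-f16.gguf orion-q5_k_m.gguf Q5_K_M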
Output Files¶
After quantization, you should have two files ready for deployment:
- orion-q4_k_m.gguf: Quantized language model (Q4_K_M), approximately 1 GB.
- orion-mmproj-f16.gguf: Multimodal projector (FP16), approximately 100 MB.
Step 5: Deploy to the Raspberry Pi 5¶
Transfer both GGUF files to the Pi 5's filesystem:
scp orion-q4_k_m.gguf orion-mmproj-f16.gguf pi@<pi-address>:/path/to/orion/models/
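For multi-gigabyte transfers it is worth verifying integrity on both ends; a minimal sketch assuming SSH access and the destination path above:
# The hashes printed locally and on the Pi should match
sha256sum orion-q4_k_m.gguf orion-mmproj-f16.gguf
ssh pi@<pi-address> 'cd /path/to/orion/models && sha256sum orion-q4_k_m.gguf orion-mmproj-f16.gguf'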
Verifying the Deployment¶
You can test the quantized model on the Pi using llama.cpp's CLI:
./llama-cli \
-m /path/to/orion-q4_k_m.gguf \
--mmproj /path/to/orion-mmproj-f16.gguf \
-p "You are an autonomous orbital triage assistant..." \
--image /path/to/test-image.png \
-n 200
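Beyond a functional check, you can measure raw throughput on the Pi with llama.cpp's llama-bench tool. This exercises the language model only (it does not load the projector); the token counts and thread count below are illustrative, with four threads matching the Pi 5's four cores:
./llama-bench -m /path/to/orion-q4_k_m.gguf -p 512 -n 128 -t 4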
Troubleshooting¶
Conversion fails with "unsupported model architecture": Ensure you are using a recent version of llama.cpp that supports the LFM2.5-VL architecture. Pull the latest ground_segment/llama.cpp submodule.
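A typical way to do that from the repository root (assuming the standard git submodule workflow):
git submodule update --remote ground_segment/llama.cpp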
Quantized model produces degraded output: Try Q5_K_M for higher quality at the cost of larger file size. Compare outputs against the FP16 GGUF to isolate whether the issue is from quantization or the fine-tuning itself.
Out of memory on Pi 5: Ensure no other large processes are running. The Q4_K_M model should fit within 4 GB RAM alongside the inference runtime. If memory is tight, consider Q4_K_S for a slightly smaller footprint.
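To see how much headroom is actually available before launching inference, a generic Linux memory check on the Pi is enough (not specific to llama.cpp):
# Snapshot of free memory and swap
free -h
# Or watch usage live while the model loads in another terminal
watch -n 1 free -h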