
The ORION model card is available on Hugging Face.


---
base_model: LiquidAI/LFM2.5-VL-1.6B
library_name: peft
pipeline_tag: image-text-to-text
license: apache-2.0
tags:
  - satellite-imagery
  - remote-sensing
  - orbital-triage
  - qlora
  - peft
  - llama-cpp
  - fprime
language:
  - en
---


ORION: Orbital Triage LoRA Adapter

QLoRA fine-tune of LiquidAI/LFM2.5-VL-1.6B for autonomous satellite image triage. Classifies 512×512 RGB frames captured in LEO as HIGH (strategic anomaly, downlink immediately), MEDIUM (human infrastructure, store for bulk transfer), or LOW (featureless terrain, discard).

Developed for ORION, an autonomous LEO satellite triage system running on a Raspberry Pi 5 via NASA F-Prime. The Q4_K_M GGUF quantization of this adapter is deployed on-board and runs inference at 51-82 s/frame (mean ~69s across 1,443 frames from 3 end-to-end runs) entirely on CPU.

Uses

Intended use: on-board orbital triage on a satellite OBC. The model receives a 512×512 RGB satellite tile (optionally with GPS coordinates in the prompt) and returns a JSON object with a triage verdict and visual reasoning.

Triage prompt (ChatML format, used identically for training, evaluation, and on-board inference):

<|im_start|>user
<image>
You are an autonomous orbital triage assistant. Analyze this
high-resolution RGB satellite image captured at Longitude: {lon},
Latitude: {lat}.
Strictly use one of these categories based on visual morphology:
- HIGH: Extreme-scale strategic anomalies, dense geometric cargo/vessel
  infrastructure, massive cooling towers, sprawling runways, or distinct
  geological/artificial chokepoints.
- MEDIUM: Standard human civilization. Ordinary urban grids, low-density
  suburban sprawl, regular checkerboard agriculture, or localized
  infrastructure.
- LOW: Complete absence of human infrastructure. Featureless deep oceans,
  unbroken canopy, barren deserts, or purely natural geological formations.
You MUST output your response as a valid JSON object. To ensure accurate
visual reasoning, you must output the "reason" key FIRST, followed by
the "category" key.<|im_end|>
<|im_start|>assistant

The model responds with {"reason": "...", "category": "HIGH|MEDIUM|LOW"}. Reason-first ordering forces the model to commit to visual evidence before selecting a label. During training, half the samples omit the Longitude/Latitude line (coordinate dropout augmentation).
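For illustration, a minimal sketch of how a downstream consumer might validate this response. The parse_verdict helper and the LOW fallback policy are assumptions for this example, not part of the ORION flight code:

```python
import json

VALID_CATEGORIES = {"HIGH", "MEDIUM", "LOW"}

def parse_verdict(raw_text: str) -> dict:
    """Extract {"reason", "category"} from model output; fall back to LOW if malformed."""
    try:
        # Tolerate stray tokens around the JSON object.
        start, end = raw_text.index("{"), raw_text.rindex("}") + 1
        verdict = json.loads(raw_text[start:end])
        category = str(verdict.get("category", "")).strip().upper()
        if category not in VALID_CATEGORIES:
            raise ValueError(f"unexpected category: {category!r}")
        return {"reason": str(verdict.get("reason", "")), "category": category}
    except (ValueError, json.JSONDecodeError):
        # Conservative fallback (assumed policy): treat unparseable output as discardable.
        return {"reason": "unparseable model output", "category": "LOW"}
```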

Out of scope: multispectral analysis, change detection, object detection with bounding boxes, real-time video, or any use case requiring sub-60-second latency without CUDA acceleration.

Dataset

The adapter was trained on the ORION dataset: 360 curated target locations organized by triage priority and visual morphology, each fetched as a 512×512 RGB tile from SimSat's Mapbox API.

| Class | Targets | Visual morphology |
|---|---|---|
| LOW | 120 | Oceans, deserts, ice sheets, dense canopy, geological formations |
| MEDIUM | 120 | Urban grids, suburban sprawl, agriculture, regional infrastructure |
| HIGH | 120 | Mega-ports, mega-airports, energy/dams, mega-mines, military/space facilities |

Hard negatives are included in LOW: coastlines and geological formations that mimic artificial structures (calderas, salt flat fractals, river deltas).

Split (deterministic, random.seed(42)):

| Split | Records | Notes |
|---|---|---|
| Train | 480 | 240 targets × 2 (coordinate dropout augmentation) |
| Val | 60 | Always with coordinates; used for eval_loss + best-checkpoint selection |
| Test | 60 | Always with coordinates; held out for ablation and evaluation |

Coordinate dropout augmentation: each training target produces two records, one with GPS coordinates in the prompt and one without. This teaches the model to classify from pixels alone when GPS is unavailable or spoofed.
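A sketch of the split and augmentation logic described above. Field names and the record layout are assumptions; only the seed, the 240/60/60 target split, and the two-records-per-target scheme come from this card:

```python
import random

def build_splits(targets):
    """targets: list of dicts with 'image_path', 'lon', 'lat', 'label' (360 entries)."""
    random.seed(42)                      # deterministic split
    shuffled = targets[:]
    random.shuffle(shuffled)
    train_targets = shuffled[:240]
    val_targets   = shuffled[240:300]    # always keep coordinates
    test_targets  = shuffled[300:]       # always keep coordinates

    train_records = []
    for t in train_targets:
        # Coordinate dropout: one record with GPS coords in the prompt, one without.
        train_records.append({**t, "use_coords": True})
        train_records.append({**t, "use_coords": False})

    return train_records, val_targets, test_targets   # 480 / 60 / 60 records
```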

Training Procedure

Base model

LiquidAI/LFM2.5-VL-1.6B loaded in 4-bit NF4 quantization via bitsandbytes.

LoRA configuration

| Parameter | Value |
|---|---|
| Rank (r) | 16 |
| Alpha | 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| Dropout | 0.05 |
| Bias | none |
| Task type | CAUSAL_LM |

Training arguments

| Parameter | Value |
|---|---|
| Learning rate | 2e-4 |
| Epochs | 3 |
| Per-device batch size | 1 |
| Gradient accumulation steps | 16 (effective batch 16) |
| Optimizer | paged_adamw_8bit |
| Precision | FP16 |
| Gradient checkpointing | enabled |
| Best checkpoint selection | eval_loss (lower is better) |
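The tables above map onto a standard bitsandbytes/peft/transformers QLoRA setup. A sketch of the corresponding configuration, assuming standard library argument names; anything not listed in the tables (output directory, warmup, logging) is omitted or assumed:

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                 # 4-bit NF4 base weights
    bnb_4bit_compute_dtype=torch.float16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="orion_lora_weights",           # assumed to match the adapter artifact name
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,            # effective batch size 16
    optim="paged_adamw_8bit",
    fp16=True,
    gradient_checkpointing=True,
    eval_strategy="epoch",                     # `evaluation_strategy` on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
```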

Hardware

| Component | Spec |
|---|---|
| GPU | NVIDIA GeForce RTX 4070 Ti, 12 GB VRAM |
| CUDA | 12.2 |
| Driver | 535.x |
| OS | Linux |

Training time

| Metric | Value |
|---|---|
| Time per epoch | ~830 s |
| Total training time | ~2,492 s |

Model Artifacts

| Artifact | File | Size | Notes |
|---|---|---|---|
| LoRA adapter (this repo) | orion_lora_weights/ | ~50 MB | r=16, 4 attention projection modules |
| Merged FP16 checkpoint | orion_merged/ | ~3.2 GB | merge_and_unload() output |
| FP16 GGUF | orion-f16.gguf | ~3.2 GB | Intermediate conversion step |
| Q4_K_M GGUF | orion-q4_k_m.gguf | ~730 MB | Deployed to Pi 5 (8 GB RAM) |
| Vision projector | orion-mmproj-f16.gguf | ~814 MB | FP16, deployed alongside Q4 model |

Measured on-device: Total ORION process RSS during inference on the Pi 5 is ~1,753 MB (model + vision encoder + KV cache + F-Prime flight software + buffer pool).

The Q4_K_M GGUF + mmproj pair is the deployed artifact. Pre-built files are available on Hugging Face.
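A sketch of how the artifact chain above can be produced: merge the adapter into the base model with peft's merge_and_unload(), then convert and quantize with llama.cpp tooling. The loader class and the llama.cpp script/binary names are assumptions that may differ by version; see the quantization guide for the actual commands:

```python
import torch
from transformers import AutoModelForImageTextToText   # loader class for this VLM is an assumption
from peft import PeftModel

base = AutoModelForImageTextToText.from_pretrained(
    "LiquidAI/LFM2.5-VL-1.6B", torch_dtype=torch.float16
)
merged = PeftModel.from_pretrained(base, "orion_lora_weights").merge_and_unload()
merged.save_pretrained("orion_merged")                  # ~3.2 GB FP16 checkpoint

# GGUF conversion and quantization then run outside Python (llama.cpp tooling, names may vary):
#   python convert_hf_to_gguf.py orion_merged --outfile orion-f16.gguf
#   llama-quantize orion-f16.gguf orion-q4_k_m.gguf Q4_K_M
```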

Evaluation

Both studies use the same four conditions run against the same 60-sample held-out test set. The ablation (ablation.py) tests the unmodified base model; the evaluation (evaluate.py) tests the fine-tuned adapter. Running both against identical inputs isolates the exact lift from fine-tuning.

Refer to Training Pipeline for more details on how to read this result.

| Condition | Input | Purpose |
|---|---|---|
| A: Full system | Real image + real GPS coords | Nominal operating condition |
| B: Vision only | Real image + no coords | GPS-denied or noisy environment |
| C: Blind LLM | Gaussian noise image + real coords | Coordinates-only baseline (tests GPS reliance) |
| D: Sensor conflict | Real image + spoofed coords | Adversarial GPS; tests which modality the model trusts |
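A sketch of how the four condition inputs can be constructed for this protocol. The noise parameters and the coordinate-spoofing scheme are assumptions for illustration; the actual ablation.py/evaluate.py implementation may differ:

```python
import numpy as np
from PIL import Image

def make_condition_input(condition, image_path, lon, lat):
    """Return (PIL image, coordinate string or None) for conditions A-D."""
    real = Image.open(image_path).convert("RGB").resize((512, 512))
    if condition == "A":                       # real image + real coords
        return real, f"Longitude: {lon}, Latitude: {lat}"
    if condition == "B":                       # real image, GPS-denied
        return real, None
    if condition == "C":                       # Gaussian noise image + real coords
        noise = np.clip(np.random.normal(128, 40, (512, 512, 3)), 0, 255).astype(np.uint8)
        return Image.fromarray(noise), f"Longitude: {lon}, Latitude: {lat}"
    if condition == "D":                       # real image + spoofed coords
        fake_lon, fake_lat = -lon, -lat        # illustrative spoof only
        return real, f"Longitude: {fake_lon}, Latitude: {fake_lat}"
    raise ValueError(f"unknown condition: {condition}")
```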

Ablation study: base model (ablation.py)

| Condition | Overall accuracy | Notes |
|---|---|---|
| A: Vision + GPS coords | 58.3% | |
| B: Vision only (no coords) | 60.0% | Slightly better: coords can mislead base model |
| C: Blind LLM (Gaussian noise + coords) | 35.0% | Predicts LOW for everything; GPS alone is unreliable |
| D: Sensor conflict | N/A | Trusts incorrect coords 20.0% of the time |

Full log:

--- Condition A: Full System (Vision + Coords) ---
HIGH : 8/14 (57.1% Recall) | Precision: 8/17 (47.1%)
MEDIUM: 9/25 (36.0% Recall) | Precision: 9/13 (69.2%)
LOW : 18/21 (85.7% Recall) | Precision: 18/30 (60.0%)
TOTAL : 35/60 (58.3% Overall Accuracy)

--- Condition B: Vision Only (No Coords) ---
HIGH : 9/14 (64.3% Recall) | Precision: 9/16 (56.2%)
MEDIUM: 8/25 (32.0% Recall) | Precision: 8/11 (72.7%)
LOW : 19/21 (90.5% Recall) | Precision: 19/33 (57.6%)
TOTAL : 36/60 (60.0% Overall Accuracy)

--- Condition C: Blind LLM (Gaussian Noise + Coords) ---
HIGH : 0/14 (0.0% Recall) | Precision: 0/0 (0.0%)
MEDIUM: 0/25 (0.0% Recall) | Precision: 0/0 (0.0%)
LOW : 21/21 (100.0% Recall) | Precision: 21/60 (35.0%)
TOTAL : 21/60 (35.0% Overall Accuracy)

--- Condition D: Sensor Conflict (Real Vision + Fake Coords) ---
Model trusted Vision (Correct) : 35/60 (58.3%)
Model trusted Coords (Failure) : 12/60 (20.0%)
Model got Confused (Neither)   : 13/60 (21.7%)

Fine-tuned model evaluation (evaluate.py)

| Condition | Overall accuracy | Notes |
|---|---|---|
| A: Vision + GPS coords | 58.3% | |
| B: Vision only (no coords) | 65.0% | Improved over base (+5 pp) |
| C: Blind LLM (Gaussian noise + coords) | 43.3% | Predicts MEDIUM for most noise inputs |
| D: Sensor conflict | N/A | Trusts incorrect coords 16.7% of the time (down from 20.0%) |

Per-class accuracy (condition A)

| Class | Precision | Recall | F1 |
|---|---|---|---|
| HIGH | 46.7% | 50.0% | 48.3% |
| MEDIUM | 66.7% | 40.0% | 50.0% |
| LOW | 60.0% | 85.7% | 70.6% |

Full log:

--- Condition A: Full System (Vision + Coords) ---
HIGH  :  7/14 (50.0% Recall) | Precision:  7/15 (46.7%)
MEDIUM: 10/25 (40.0% Recall) | Precision: 10/15 (66.7%)
LOW   : 18/21 (85.7% Recall) | Precision: 18/30 (60.0%)
TOTAL : 35/60 (58.3% Overall Accuracy)

--- Condition B: Vision Only (No Coords) ---
HIGH  :  9/14 (64.3% Recall) | Precision:  9/15 (60.0%)
MEDIUM: 12/25 (48.0% Recall) | Precision: 12/17 (70.6%)
LOW   : 18/21 (85.7% Recall) | Precision: 18/28 (64.3%)
TOTAL : 39/60 (65.0% Overall Accuracy)

--- Condition C: Blind LLM (Gaussian Noise + Coords) ---
HIGH  :  1/14 ( 7.1% Recall) | Precision:  1/ 1 (100.0%)
MEDIUM: 25/25 (100.0% Recall) | Precision: 25/59 (42.4%)
LOW   :  0/21 ( 0.0% Recall) | Precision:  0/ 0   (0.0%)
TOTAL : 26/60 (43.3% Overall Accuracy)

--- Condition D: Sensor Conflict (Real Vision + Fake Coords) ---
Model trusted Vision (Correct) : 37/60 (61.7%)
Model trusted Coords (Failure) : 10/60 (16.7%)
Model got Confused   (Neither) : 13/60 (21.7%)

Quantized GGUF evaluation (evaluate.py --quantized-model)

The same 4-condition protocol run against the Q4_K_M GGUF deployed on-device via llama.cpp's HTTP server. This measures accuracy degradation from quantization using the exact same test set.
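A sketch of querying the deployed model through llama.cpp's OpenAI-compatible HTTP endpoint. The server launch line in the comment, the host/port, and the sampling settings are assumptions; see the deployment guide for the actual invocation:

```python
import base64
import requests

# Assumed server launch (llama.cpp): llama-server -m orion-q4_k_m.gguf --mmproj orion-mmproj-f16.gguf
SERVER = "http://127.0.0.1:8080"

def triage(image_path: str, prompt: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 200,       # matches the 200-token on-board generation cap
        "temperature": 0.0,      # assumed: deterministic decoding for evaluation
    }
    r = requests.post(f"{SERVER}/v1/chat/completions", json=payload, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]
```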

| Condition | Overall accuracy | Notes |
|---|---|---|
| A: Vision + GPS coords | 55.0% | −3.3 pp from FP16 fine-tuned |
| B: Vision only (no coords) | 63.3% | −1.7 pp from FP16 fine-tuned |
| C: Blind LLM (Gaussian noise + coords) | 28.3% | Predicts HIGH for most noise inputs |
| D: Sensor conflict | N/A | Trusts incorrect coords 15.0% of the time (down from 16.7%) |

Full log:

--- Condition A: Full System (Vision + Coords) ---
HIGH  :  7/14 (50.0% Recall) | Precision:  7/16 (43.8%)
MEDIUM:  8/25 (32.0% Recall) | Precision:  8/13 (61.5%)
LOW   : 18/21 (85.7% Recall) | Precision: 18/31 (58.1%)
TOTAL : 33/60 (55.0% Overall Accuracy)

--- Condition B: Vision Only (No Coords) ---
HIGH  :  8/14 (57.1% Recall) | Precision:  8/13 (61.5%)
MEDIUM: 10/25 (40.0% Recall) | Precision: 10/12 (83.3%)
LOW   : 20/21 (95.2% Recall) | Precision: 20/35 (57.1%)
TOTAL : 38/60 (63.3% Overall Accuracy)

--- Condition C: Blind LLM (Gaussian Noise + Coords) ---
HIGH  :  8/14 (57.1% Recall) | Precision:  8/41 (19.5%)
MEDIUM:  2/25 ( 8.0% Recall) | Precision:  2/ 4 (50.0%)
LOW   :  7/21 (33.3% Recall) | Precision:  7/15 (46.7%)
TOTAL : 17/60 (28.3% Overall Accuracy)

--- Condition D: Sensor Conflict (Real Vision + Fake Coords) ---
Model trusted Vision (Correct) : 37/60 (61.7%)
Model trusted Coords (Failure) :  9/60 (15.0%)
Model got Confused   (Neither) : 14/60 (23.3%)

Fine-tuning and quantization impact

| Condition | Base model | Fine-tuned (FP16) | Q4_K_M GGUF | Δ (fine-tune) | Δ (quantization) |
|---|---|---|---|---|---|
| A: Vision + GPS coords | 58.3% | 58.3% | 55.0% | 0 pp | −3.3 pp |
| B: Vision only (no coords) | 60.0% | 65.0% | 63.3% | +5.0 pp | −1.7 pp |
| C: Blind LLM (noise + coords) | 35.0% | 43.3% | 28.3% | +8.3 pp | −15.0 pp |

Sensor conflict (Condition D): coordinate-trust failure drops from 20.0% (base) to 16.7% (fine-tuned FP16) to 15.0% (Q4_K_M GGUF). Quantization does not degrade GPS robustness.

Quantization impact on operational conditions (A and B): accuracy loss from Q4_K_M quantization is modest (−3.3 pp and −1.7 pp respectively), confirming that the deployed GGUF retains most of the fine-tuned model's capability. The large drop on Condition C (noise inputs) is not operationally relevant since the model never receives noise images in deployment.

Discussion

Fine-tuning produces measurable improvements on Conditions B, C, and D, but Condition A (the nominal operating condition with both image and GPS) shows no gain on this 360-target dataset. The most likely explanation is the breadth of the HIGH category: mega-ports, mega-airports, energy infrastructure, open-pit mines, and military facilities are all grouped into a single label. The model can learn to output the correct JSON format quickly (training loss drops to 0.18 in ~41 minutes), but 240 training images spread across five visually heterogeneous HIGH sub-types is not enough for the visual encoder to learn a reliable decision boundary.

This is a prototype demonstrating that on-board VLM inference on a Pi 5 is technically viable. The approach will improve significantly with:

  • Narrower taxonomy: splitting HIGH into mission-specific sub-classes (e.g., ports only, or energy infrastructure only) and training a specialist adapter
  • Larger corpus: 240 training images is a minimal dataset for a 3-class VLM task; 1,000-5,000 images per class is a more realistic target for robust generalization
  • Higher-resolution tiles: 512×512 Mapbox tiles lose fine-grained texture that distinguishes, e.g., a cargo terminal from a large parking lot at altitude

Deployment

The adapter is converted to Q4_K_M GGUF via llama-quantize and runs on the Pi 5 via llama.cpp's multimodal (mtmd) API:

Vision encoding (mtmd):      ~10-15 s
Token generation (200 max):  ~40-55 s
Total per frame:             ~51-82 s  (CPU only, Cortex-A76, mean ~69 s, 1,443 frames from 3 end-to-end runs)

See the quantization guide and deployment guide for full instructions.

Limitations

  • Trained on Mapbox RGB tiles only; hence, no multispectral, SAR, or thermal data.
  • 512×512 pixel resolution matches the Pi 5 inference pipeline; different resolutions require re-cropping.
  • Three-class taxonomy (HIGH / MEDIUM / LOW) is fixed at training time. Mission-specific priorities require fine-tuning on a new labeled dataset.
  • Inference at 51-82 s/frame (mean ~69s across 1,443 frames from 3 end-to-end runs) sets a hard floor on capture interval: the auto-capture timer is 85s to avoid saturating the VLM queue, limiting throughput to ~24 frames per 35-min eclipse. Burst imaging, real-time video, or sub-minute revisit rates are not feasible without faster hardware (GPU/NPU) or a smaller model.
  • Coordinate dropout improves GPS robustness but does not eliminate coord-biased errors on hard edge cases.
  • Blank/missing tile hallucination: Mapbox returns blank white tiles at extreme latitudes (|lat| > 75°) where no satellite imagery exists. The model hallucinates strategic significance onto these featureless images (3 out of 8 HIGH classifications across 1,443 frames were blank tiles). These blank tiles are visually distinct from the ocean and ice sheet tiles in the training set. Mitigation: add blank/white tile detection before inference (see the sketch after this list), or include polar blank tiles as explicit LOW training examples.
  • Natural feature false positives: Coastlines, cloud cover, and geological formations (e.g., river deltas, glacial terrain) can be misclassified as HIGH due to visual similarity to trained HIGH morphologies (e.g., coastlines as "artificial formations," clouds as "volcanic eruptions"). The hard-negative training set mitigates some of this, but edge cases remain.
  • Training data was generated at 500 km simulated altitude; Pi 5 runs used the SimSat TLE orbit at ~802 km (~0.7 Mapbox zoom levels difference). The model generalized across this mismatch without degradation, but accuracy may differ at significantly different altitudes.
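A minimal sketch of the blank-tile guard suggested above, assuming a simple brightness/variance heuristic; the thresholds are illustrative and would need tuning against real Mapbox blank tiles:

```python
import numpy as np
from PIL import Image

def is_blank_tile(image_path: str, min_std: float = 5.0, min_brightness: float = 240.0) -> bool:
    """Heuristic: skip near-white, near-uniform tiles before VLM inference."""
    pixels = np.asarray(Image.open(image_path).convert("L"), dtype=np.float32)
    return pixels.std() < min_std and pixels.mean() > min_brightness
```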