Orion::VlmInferenceEngine Component¶

1. Introduction¶

The Orion::VlmInferenceEngine component runs the LFM2.5-VL-1.6B vision-language model on the satellite's CPU. It receives raw 512x512 RGB image frames from CameraManager, constructs a ChatML-formatted prompt with fused GPS coordinates, executes the llama.cpp forward pass, parses the JSON output into a triage verdict (HIGH/MEDIUM/LOW), and emits the result to TriageRouter.

The model is a ~700 MB Q4_K_M GGUF file loaded into RAM on demand. Inference takes 50-60 seconds per frame on the Pi 5's Cortex-A76 cores (CPU-only, no GPU). The component runs on a dedicated low-priority thread to avoid blocking other flight software.

2. Requirements¶

Requirement	Description	Verification Method
ORION-VLM-001	VlmInferenceEngine shall load the GGUF model and mmproj vision encoder on MEASURE entry	System test
ORION-VLM-002	VlmInferenceEngine shall unload the model on transition to IDLE or SAFE	System test
ORION-VLM-003	VlmInferenceEngine shall keep the model loaded during DOWNLINK (short pass, reload is expensive)	Inspection
ORION-VLM-004	VlmInferenceEngine shall classify each frame as HIGH, MEDIUM, or LOW and emit the result to TriageRouter	System test
ORION-VLM-005	VlmInferenceEngine shall return buffers to the pool on inference failure	Inspection
ORION-VLM-006	VlmInferenceEngine shall drop all frames in SAFE mode without inference	System test
ORION-VLM-007	VlmInferenceEngine shall abort and recover from any inference exceeding `INFERENCE_TIMEOUT_S` (120 s)	System test

3. Design¶

3.1 Data Flow¶

flowchart LR
    CM[CameraManager] -->|buffer + lat/lon| VLM[VlmInferenceEngine]
    VLM -->|verdict + reason + buffer| TR[TriageRouter]
    VLM -->|buffer return on failure| BM[BufferManager]
    EA[EventAction] -->|modeChangeIn| VLM

3.2 Inference Pipeline¶

For each frame, runInference() executes the following stages:

Prompt construction — ChatML format matching the fine-tuning template:

<|im_start|>user
<image>
You are an autonomous orbital triage assistant...
captured at Longitude: X, Latitude: Y.
...<|im_end|>
<|im_start|>assistant

Image encoding — mtmd_bitmap_init() wraps the raw RGB buffer, mtmd_tokenize() replaces the image marker with vision encoder tokens
KV cache evaluation — mtmd_helper_eval_chunks() processes all prompt chunks (text + vision tokens) into the context. Timeout checked after eval.
Autoregressive generation — Greedy sampling up to MAX_RESPONSE_TOKENS (200 tokens), stopping on EOG token. Timeout checked per token.
KV cache reset — llama_memory_clear() and llama_sampler_reset() prepare for the next frame
JSON parsing — parseVerdictJson() extracts "category" and "reason" from the response

3.3 Inference Timeout¶

A self-watchdog checks elapsed time at two points during inference:

After prompt eval — catches cases where vision encoding + context evaluation exceeds the limit
Per token in generation loop — catches slow or stuck token generation

If elapsed time exceeds INFERENCE_TIMEOUT_S (120s), the inference is aborted: KV cache is cleared, sampler is reset, InferenceTimeout event is logged, and the frame is dropped. The model stays loaded and ready for the next frame — no restart required.

3.3 JSON Parser¶

The model is fine-tuned to output:

{ "reason": "Dense geometric infrastructure...", "category": "HIGH" }

The parser:

Searches the entire response for "HIGH", "MEDIUM" (case-sensitive, quoted) to determine category. Defaults to LOW if neither found.
Finds the "reason" key, extracts the string value while handling escaped quotes
Falls back to the raw response as the reason if no "reason" key is found
Falls back to "Empty model response" if the response is empty

3.4 Model Lifecycle¶

stateDiagram-v2
    [*] --> Unloaded
    Unloaded --> Loaded : MEASURE entry / LOAD_MODEL cmd
    Loaded --> Unloaded : IDLE or SAFE entry / UNLOAD_MODEL cmd
    Loaded --> Loaded : DOWNLINK entry (stays loaded)

The model auto-loads on MEASURE entry and auto-unloads on IDLE or SAFE entry. During DOWNLINK, the model stays loaded to avoid the ~15s reload penalty on the Pi. Manual LOAD_MODEL and UNLOAD_MODEL commands are available for ground control.

3.5 Port Diagram¶

Port	Direction	Type	Description
`inferenceRequestIn`	async input	`InferenceRequestPort`	Receives image buffer + GPS from CameraManager (queue depth 5)
`modeChangeIn`	async input	`ModeChangePort`	Receives mode broadcasts from EventAction
`triageDecisionOut`	output	`TriageDecisionPort`	Emits verdict + reason + buffer to TriageRouter
`bufferReturnOut`	output	`Fw.BufferSend`	Returns buffer to pool on inference failure

3.6 Commands¶

Command	Opcode	Behavior
`LOAD_MODEL`	0x00	Loads GGUF text model + mmproj vision encoder. Idempotent. Rejected if not in MEASURE or DOWNLINK.
`UNLOAD_MODEL`	0x01	Frees all llama.cpp state from RAM.

3.7 Events¶

Event	Severity	Description
`ModelLoaded`	ACTIVITY_HI	Model and vision encoder loaded into RAM
`ModelUnloaded`	ACTIVITY_HI	Model freed from RAM
`ModelLoadFailed`	WARNING_HI	GGUF file or mmproj failed to load (with path)
`InferenceFailed`	WARNING_HI	Tokenization, eval, or generation failed for a frame
`FrameDroppedModelNotLoaded`	WARNING_LO	Frame arrived but model not loaded — buffer returned
`LoadModelRejectedWrongMode`	WARNING_LO	LOAD_MODEL rejected — not in MEASURE or DOWNLINK
`InferenceComplete`	ACTIVITY_HI	Successful classification with category, reason, and time in ms

3.8 Telemetry¶

Channel	Type	Description
`InferenceTime_Ms`	U32	Wall-clock time of the most recent inference pass
`TotalInferences`	U32	Running total of successful classifications
`InferenceFailures`	U32	Running total of failed inference attempts

3.9 Configuration Constants¶

Constant	Value	Description
`IMAGE_W` / `IMAGE_H`	512	Expected input image dimensions
`N_CTX`	4096	KV cache context size in tokens
`N_BATCH`	512	Batch size for prompt evaluation
`N_THREADS`	4	CPU threads for inference (Pi 5 quad-core)
`MAX_RESPONSE_TOKENS`	200	Maximum tokens to generate per frame
`IMAGE_MAX_TOKENS`	1024	Cap on vision encoder output tokens
`INFERENCE_TIMEOUT_S`	120	Abort inference after this many seconds

3.10 Environment Variables¶

Variable	Default	Description
`ORION_GGUF_PATH`	`./orion-q4_k_m.gguf`	Path to the Q4_K_M quantized text model
`ORION_MMPROJ_PATH`	`orion-mmproj-f16.gguf`	Path to the FP16 vision encoder projection

4. Change Log¶

Date	Description
2026-04-17	Initial implementation: llama.cpp integration, ChatML prompt, JSON parser
2026-04-18	Fixed chat template (Phi-3 to ChatML), token limit, auto-lifecycle
2026-04-18	Fixed model not unloading on DOWNLINK → SAFE transition
2026-04-20	Added mode gating, FrameDroppedModelNotLoaded, LoadModelRejectedWrongMode
2026-04-24	Removed health ping; added 120s self-watchdog with InferenceTimeout event