
Orion::VlmInferenceEngine Component

1. Introduction

The Orion::VlmInferenceEngine component runs the LFM2.5-VL-1.6B vision-language model on the satellite's CPU. It receives raw 512x512 RGB image frames from CameraManager, constructs a ChatML-formatted prompt with fused GPS coordinates, executes the llama.cpp forward pass, parses the JSON output into a triage verdict (HIGH/MEDIUM/LOW), and emits the result to TriageRouter.

The model is a ~730 MB Q4_K_M GGUF file loaded into RAM on demand (total process RSS measured at ~1,753 MB on the 8 GB Pi 5). Inference takes 51-82 seconds per frame (mean ~69s across 1,443 frames from 3 end-to-end runs) on the Pi 5's Cortex-A76 cores (CPU-only, no GPU). The component runs on a dedicated low-priority thread to avoid blocking other flight software.

2. Requirements

| Requirement | Description | Verification Method |
| --- | --- | --- |
| ORION-VLM-001 | VlmInferenceEngine shall load the GGUF model and mmproj vision encoder on MEASURE entry | System test |
| ORION-VLM-002 | VlmInferenceEngine shall unload the model on transition to IDLE or SAFE | System test |
| ORION-VLM-003 | VlmInferenceEngine shall keep the model loaded during DOWNLINK (short pass, reload is expensive) | Inspection |
| ORION-VLM-004 | VlmInferenceEngine shall classify each frame as HIGH, MEDIUM, or LOW and emit the result to TriageRouter | System test |
| ORION-VLM-005 | VlmInferenceEngine shall return buffers to the pool on inference failure | Inspection |
| ORION-VLM-006 | VlmInferenceEngine shall drop all frames in SAFE mode without inference | System test |
| ORION-VLM-007 | VlmInferenceEngine shall abort and recover from any inference exceeding INFERENCE_TIMEOUT_S (120 s) | System test |

3. Design

3.1 Data Flow

```mermaid
flowchart LR
    CM[CameraManager] -->|buffer + lat/lon| VLM[VlmInferenceEngine]
    VLM -->|verdict + reason + buffer| TR[TriageRouter]
    VLM -->|buffer return on failure| BM[BufferManager]
    EA[EventAction] -->|modeChangeIn| VLM
```

3.2 Inference Pipeline

For each frame, runInference() executes the following stages (a sketch of the prompt-construction stage follows the list):

  1. Prompt construction: ChatML format matching the fine-tuning template:

```
<|im_start|>user
<image>
You are an autonomous orbital triage assistant...
captured at Longitude: X, Latitude: Y.
...<|im_end|>
<|im_start|>assistant
```

  2. Image encoding: mtmd_bitmap_init() wraps the raw RGB buffer; mtmd_tokenize() replaces the image marker with vision encoder tokens.

  3. KV cache evaluation: mtmd_helper_eval_chunks() processes all prompt chunks (text + vision tokens) into the context. The timeout is checked after evaluation.

  4. Autoregressive generation: greedy sampling up to MAX_RESPONSE_TOKENS (200 tokens), stopping on the EOG token. The timeout is checked per token.

  5. KV cache reset: llama_memory_clear() and llama_sampler_reset() prepare for the next frame.

  6. JSON parsing: parseVerdictJson() extracts "category" and "reason" from the response.
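
As a sketch of stage 1 only, the prompt might be assembled as shown below. buildPrompt() is a hypothetical helper, not the component's actual interface, and the instruction text is elided exactly as in the template above.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper illustrating stage 1: build the ChatML prompt with the
// fused GPS coordinates. The full triage instruction text is elided ("...").
static std::string buildPrompt(double lat, double lon) {
    char coords[96];
    // Format the coordinates into the phrasing used by the fine-tuning template.
    std::snprintf(coords, sizeof(coords),
                  "captured at Longitude: %.4f, Latitude: %.4f.", lon, lat);

    std::string prompt;
    prompt += "<|im_start|>user\n";
    prompt += "<image>\n";                   // replaced by vision tokens in mtmd_tokenize()
    prompt += "You are an autonomous orbital triage assistant...\n";
    prompt += coords;
    prompt += "\n...<|im_end|>\n";
    prompt += "<|im_start|>assistant\n";     // generation starts after this tag
    return prompt;
}
```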

3.3 Inference Timeout

A self-watchdog checks elapsed time at two points during inference:

  • After prompt eval: catches cases where vision encoding + context evaluation exceeds the limit
  • Per token in generation loop: catches slow or stuck token generation

If elapsed time exceeds INFERENCE_TIMEOUT_S (120 s), the inference is aborted: the KV cache is cleared, the sampler is reset, an InferenceTimeout event is logged, and the frame is dropped. The model stays loaded and ready for the next frame, so no restart is required.
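
As a concrete illustration, the generation loop with its per-token check might look like the sketch below. This is not the flight code: generateWithTimeout is a hypothetical name, and the llama.cpp calls (llama_sampler_sample, llama_vocab_is_eog, llama_batch_get_one, llama_memory_clear) assume a recent version of the library.

```cpp
#include <chrono>
#include <string>
#include "llama.h"

// Sketch of the per-token watchdog in the generation loop. ctx, model, and
// sampler stand for the component's llama.cpp handles; start is captured when
// the inference (prompt eval included) began. Returns false on timeout or error.
static bool generateWithTimeout(llama_context* ctx, const llama_model* model,
                                llama_sampler* sampler,
                                std::chrono::steady_clock::time_point start,
                                std::string& out) {
    const llama_vocab* vocab = llama_model_get_vocab(model);
    // The first token is sampled from the logits left by prompt evaluation.
    llama_token tok = llama_sampler_sample(sampler, ctx, -1);

    for (int i = 0; i < 200 /* MAX_RESPONSE_TOKENS */; ++i) {
        if (llama_vocab_is_eog(vocab, tok)) {
            return true;                                     // normal end of generation
        }
        // Per-token timeout check (INFERENCE_TIMEOUT_S = 120).
        const auto elapsed = std::chrono::steady_clock::now() - start;
        if (std::chrono::duration_cast<std::chrono::seconds>(elapsed).count() >= 120) {
            llama_memory_clear(llama_get_memory(ctx), true); // drop the KV cache
            llama_sampler_reset(sampler);                    // ready for the next frame
            return false;                                    // caller logs InferenceTimeout
        }
        // Detokenize and append the piece to the response text.
        char piece[64];
        const int n = llama_token_to_piece(vocab, tok, piece, sizeof(piece), 0, false);
        if (n > 0) {
            out.append(piece, static_cast<size_t>(n));
        }
        // Feed the token back and greedily sample the next one.
        llama_batch batch = llama_batch_get_one(&tok, 1);
        if (llama_decode(ctx, batch) != 0) {
            return false;
        }
        tok = llama_sampler_sample(sampler, ctx, -1);
    }
    return true;
}
```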

3.4 JSON Parser

The model is fine-tuned to output:

{ "reason": "Dense geometric infrastructure...", "category": "HIGH" }

The parser (a sketch follows this list):

  • Searches the entire response for "HIGH", "MEDIUM" (case-sensitive, quoted) to determine category. Defaults to LOW if neither found.
  • Finds the "reason" key, extracts the string value while handling escaped quotes
  • Falls back to the raw response as the reason if no "reason" key is found
  • Falls back to "Empty model response" if the response is empty
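
A minimal sketch of this fallback chain is shown below; the signature is illustrative, and the category detection and escaped-quote handling are simplified relative to the flight code.

```cpp
#include <string>

// Sketch of the verdict parser's fallback chain (simplified).
static void parseVerdictJson(const std::string& resp,
                             std::string& category, std::string& reason) {
    // Category: look for the quoted keywords anywhere in the response; default to LOW.
    if (resp.find("\"HIGH\"") != std::string::npos)        category = "HIGH";
    else if (resp.find("\"MEDIUM\"") != std::string::npos) category = "MEDIUM";
    else                                                   category = "LOW";

    // Reason: extract the string value of the "reason" key if present.
    const std::string key = "\"reason\"";
    const size_t keyPos = resp.find(key);
    if (keyPos != std::string::npos) {
        const size_t colon = resp.find(':', keyPos + key.size());
        const size_t open  = (colon == std::string::npos) ? std::string::npos
                                                          : resp.find('"', colon + 1);
        if (open != std::string::npos) {
            size_t close = open;
            while (true) {
                close = resp.find('"', close + 1);
                if (close == std::string::npos || resp[close - 1] != '\\') break; // skip escaped quotes
            }
            if (close != std::string::npos) {
                reason = resp.substr(open + 1, close - open - 1);
                return;
            }
        }
    }
    // Fallbacks: raw response, or a placeholder when the response is empty.
    reason = resp.empty() ? "Empty model response" : resp;
}
```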

3.5 Model Lifecycle

```mermaid
stateDiagram-v2
    [*] --> Unloaded
    Unloaded --> Loaded : MEASURE entry / LOAD_MODEL cmd
    Loaded --> Unloaded : IDLE or SAFE entry / UNLOAD_MODEL cmd
    Loaded --> Loaded : DOWNLINK entry (stays loaded)
```

The model auto-loads on MEASURE entry and auto-unloads on IDLE or SAFE entry. During DOWNLINK, the model stays loaded to avoid the ~21s reload penalty on the Pi. Manual LOAD_MODEL and UNLOAD_MODEL commands are available for ground control.
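
The auto-lifecycle rules reduce to a small switch on the broadcast mode. The sketch below is illustrative only: OpMode and the load/unload helpers are stand-ins, not the component's FPP-generated interface.

```cpp
// Illustrative auto-lifecycle sketch; names are hypothetical stand-ins.
enum class OpMode { IDLE, MEASURE, DOWNLINK, SAFE };

struct LifecycleSketch {
    bool modelLoaded = false;

    void loadModel()   { modelLoaded = true;  }  // stands in for the GGUF + mmproj load (~21 s on the Pi)
    void unloadModel() { modelLoaded = false; }  // stands in for freeing all llama.cpp state

    void onModeChange(OpMode newMode) {
        switch (newMode) {
            case OpMode::MEASURE:
                if (!modelLoaded) loadModel();   // auto-load on MEASURE entry
                break;
            case OpMode::DOWNLINK:
                break;                           // stay loaded through the short pass
            case OpMode::IDLE:
            case OpMode::SAFE:
                if (modelLoaded) unloadModel();  // auto-unload, frees the model's RAM
                break;
        }
    }
};
```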

3.6 Port Diagram

| Port | Direction | Type | Description |
| --- | --- | --- | --- |
| inferenceRequestIn | async input | InferenceRequestPort | Receives image buffer + GPS from CameraManager (queue depth 5) |
| modeChangeIn | async input | ModeChangePort | Receives mode broadcasts from EventAction |
| triageDecisionOut | output | TriageDecisionPort | Emits verdict + reason + buffer to TriageRouter |
| bufferReturnOut | output | Fw.BufferSend | Returns buffer to pool on inference failure |

3.7 Commands

| Command | Opcode | Behavior |
| --- | --- | --- |
| LOAD_MODEL | 0x00 | Loads GGUF text model + mmproj vision encoder. Idempotent. Rejected if not in MEASURE or DOWNLINK. |
| UNLOAD_MODEL | 0x01 | Frees all llama.cpp state from RAM. |

3.8 Events

| Event | Severity | Description |
| --- | --- | --- |
| ModelLoaded | ACTIVITY_HI | Model and vision encoder loaded into RAM |
| ModelUnloaded | ACTIVITY_HI | Model freed from RAM |
| ModelLoadFailed | WARNING_HI | GGUF file or mmproj failed to load (with path) |
| InferenceFailed | WARNING_HI | Tokenization, eval, or generation failed for a frame |
| FrameDroppedModelNotLoaded | WARNING_LO | Frame arrived but model not loaded; buffer returned |
| LoadModelRejectedWrongMode | WARNING_LO | LOAD_MODEL rejected; not in MEASURE or DOWNLINK |
| InferenceTimeout | WARNING_HI | Inference exceeded INFERENCE_TIMEOUT_S; frame dropped, model stays loaded |
| InferenceComplete | ACTIVITY_HI | Successful classification with category, reason, and time in ms |

3.9 Telemetry

| Channel | Type | Description |
| --- | --- | --- |
| InferenceTime_Ms | U32 | Wall-clock time of the most recent inference pass |
| TotalInferences | U32 | Running total of successful classifications |
| InferenceFailures | U32 | Running total of failed inference attempts |

3.10 llama.cpp Integration

The inference engine uses the llama.cpp C API to run the quantized VLM entirely on CPU. The integration involves three layers:

Static linking. llama.cpp is built as a set of static libraries (libllama.a, libmtmd.a, libggml.a, libggml-base.a, libggml-cpu.a) from the ground_segment/llama.cpp submodule. The component's CMakeLists.txt links these directly into the F-Prime binary. On macOS, Metal and Accelerate frameworks are additionally linked for GPU/BLAS acceleration; on Linux/Pi 5, OpenMP (gomp) provides CPU parallelism.

Header vendoring. llama.cpp's public headers (llama.h, mtmd.h, mtmd-helper.h) are included directly from the submodule source tree via include_directories() in CMake. The headers define opaque struct types (llama_model, llama_context, mtmd_context, llama_sampler) that the component forward-declares in its own .hpp to avoid exposing llama.cpp includes to F-Prime's autocoded headers. A DEPRECATED macro conflict between F-Prime and llama.cpp is resolved with #pragma push_macro / #pragma pop_macro in the .cpp file.
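
Roughly, the vendoring pattern looks like the sketch below; it is illustrative, not the component's verbatim source, and the exact placement of the macro guard may differ.

```cpp
// VlmInferenceEngine.hpp -- forward-declare llama.cpp's opaque types so the
// autocoded F-Prime headers never see llama.h / mtmd.h.
struct llama_model;
struct llama_context;
struct llama_sampler;
struct mtmd_context;

// VlmInferenceEngine.cpp -- resolve the DEPRECATED macro clash between
// F-Prime and llama.cpp around the vendored includes.
#pragma push_macro("DEPRECATED")
#undef DEPRECATED
#include "llama.h"
#include "mtmd.h"
#include "mtmd-helper.h"
#pragma pop_macro("DEPRECATED")
```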

API surface used. The component uses four llama.cpp subsystems:

| Subsystem | Key functions | Purpose |
| --- | --- | --- |
| Model loading | llama_model_load_from_file, llama_context_init_from_model | Load GGUF text model and create inference context |
| Vision encoder | mtmd_init_from_file, mtmd_bitmap_init, mtmd_tokenize, mtmd_helper_eval_chunks | Load mmproj, wrap raw RGB buffer, tokenize image, evaluate into KV cache |
| Sampling | llama_sampler_chain_init, llama_sampler_chain_add, llama_sampler_sample | Greedy autoregressive token generation |
| State management | llama_memory_clear, llama_sampler_reset | Reset KV cache and sampler between frames |

3.11 F-Prime Constant Overrides

The InferenceComplete event carries the VLM's reason string (up to 400 characters). F-Prime's default FW_LOG_STRING_MAX_SIZE (200) truncates this in the GDS event log. The project overrides four framework constants via config/FpConstants.fpp (registered as a CONFIGURATION_OVERRIDES target in CMake):

| Constant | Default | Override | Reason |
| --- | --- | --- | --- |
| FW_LOG_STRING_MAX_SIZE | 200 | 400 | Match the InferenceComplete reason field size |
| FW_COM_BUFFER_MAX_SIZE | 512 | 768 | Accommodate larger log buffers within CCSDS TmFramer limits (payload capacity 1016, overhead 13) |
| FW_LOG_TEXT_BUFFER_SIZE | 256 | 600 | Fit the fully formatted event text |
| FW_FIXED_LENGTH_STRING_SIZE | 256 | 400 | Must be >= FW_LOG_STRING_MAX_SIZE per framework static_assert |

3.12 Configuration Constants

| Constant | Value | Description |
| --- | --- | --- |
| IMAGE_W / IMAGE_H | 512 | Expected input image dimensions |
| N_CTX | 4096 | KV cache context size in tokens |
| N_BATCH | 512 | Batch size for prompt evaluation |
| N_THREADS | 4 | CPU threads for inference (Pi 5 quad-core) |
| MAX_RESPONSE_TOKENS | 200 | Maximum tokens to generate per frame |
| IMAGE_MAX_TOKENS | 1024 | Cap on vision encoder output tokens |
| INFERENCE_TIMEOUT_S | 120 | Abort inference after this many seconds |

3.13 Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| ORION_GGUF_PATH | ./orion-q4_k_m.gguf | Path to the Q4_K_M quantized text model |
| ORION_MMPROJ_PATH | ./orion-mmproj-f16.gguf | Path to the FP16 vision encoder projection |
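
Resolving these paths can be a plain std::getenv lookup with the defaults above as fallbacks; the helper name below is hypothetical.

```cpp
#include <cstdlib>
#include <string>

// Resolve a model path from the environment, falling back to the defaults
// listed above when the variable is unset or empty.
static std::string pathFromEnv(const char* var, const char* fallback) {
    const char* value = std::getenv(var);
    return (value != nullptr && value[0] != '\0') ? std::string(value) : std::string(fallback);
}

// Usage:
//   const std::string ggufPath   = pathFromEnv("ORION_GGUF_PATH",   "./orion-q4_k_m.gguf");
//   const std::string mmprojPath = pathFromEnv("ORION_MMPROJ_PATH", "./orion-mmproj-f16.gguf");
```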

4. Change Log

| Date | Description |
| --- | --- |
| 2026-04-17 | Initial implementation: llama.cpp integration, ChatML prompt, JSON parser |
| 2026-04-18 | Fixed chat template (Phi-3 to ChatML), token limit, auto-lifecycle |
| 2026-04-18 | Fixed model not unloading on DOWNLINK → SAFE transition |
| 2026-04-20 | Added mode gating, FrameDroppedModelNotLoaded, LoadModelRejectedWrongMode |
| 2026-04-24 | Removed health ping; added 120 s self-watchdog with InferenceTimeout event |
| 2026-05-01 | Improved JSON parser: extract category value by key instead of global search |
| 2026-05-02 | InferenceComplete event reason field reduced from string size 512 to 400 |
| 2026-05-03 | Fixed SDD cross-reference links for mkdocs; corrected model size to ~730 MB |
| 2026-05-03 | Added llama.cpp integration design, header vendoring, and FPP constant overrides sections |