Data Generation
For design details - target definitions, triage categories, and morphology sub-types - see the Dataset architecture page.
Prerequisites
- SimSat must be running and serving images at http://localhost:9005/data/image/mapbox.
- Python 3.10+ with the requests package installed.
Running the Data Generator
cd ground_segment/data
python data_gen.py
What Happens
- Deduplication: Targets within 2 km of each other are removed.
- Shuffle and Split: The deduplicated targets are shuffled and split 80/20 into train and test sets.
- Image Capture: For each target, a 512x512 RGB satellite image is fetched from SimSat.
- Prompt Generation: A triage prompt is generated with longitude and latitude telemetry.
- Coordinate Dropout (Training Only): Each training sample is duplicated - once with coordinates in the prompt, once without. Test samples always include coordinates.
- JSONL Writing: Records are appended to the appropriate JSONL file.
The generator waits 0.5 seconds between requests to avoid overloading SimSat.
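The shuffle/split, coordinate-dropout, and JSONL-writing steps listed above can be sketched as follows. This is a minimal illustration, not the code in data_gen.py: the function names, the fixed seed, the rounding of the 80/20 cut, and the abbreviated prompt wording are all assumptions made for the example. Only the record layout and the target field names (name, lon, lat, cat, reason) come from this page.

```python
import json
import random


def split_targets(targets, train_fraction=0.8, seed=0):
    """Shuffle a copy of the targets and split it into (train, test).

    The seed and the int() rounding of the cut are assumptions.
    """
    shuffled = list(targets)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def build_prompt(target, with_coords):
    # Hypothetical, abbreviated prompt; the real wording is longer.
    where = ""
    if with_coords:
        where = f" captured at Longitude: {target['lon']}, Latitude: {target['lat']}"
    return f"<image>\nAnalyze this high-resolution RGB satellite image{where}."


def build_record(target, with_coords, image_dir="orion_dataset/images"):
    # "reason" is emitted before "category", matching the documented answer format.
    answer = json.dumps({"reason": target["reason"], "category": target["cat"]})
    return {
        "image": f"{image_dir}/{target['name']}.png",
        "conversations": [
            {"role": "user", "content": build_prompt(target, with_coords)},
            {"role": "assistant", "content": answer},
        ],
    }


def build_records(train, test):
    """Apply coordinate dropout to training targets only."""
    train_records = []
    for target in train:
        train_records.append(build_record(target, with_coords=True))
        train_records.append(build_record(target, with_coords=False))
    test_records = [build_record(target, with_coords=True) for target in test]
    return train_records, test_records


def append_jsonl(path, records):
    """Append one JSON object per line to the given file."""
    with open(path, "a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
```

With ten targets, this sketch yields 8 training targets (16 training records after dropout) and 2 test records, all of which carry coordinates.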
Expected Output
After running, the following directory structure is created:
orion_dataset/
    images/
        high_port_rotterdam.png
        low_ocean_pacific_nemo.png
        med_city_chicago.png
        ...
    train_dataset.jsonl
    test_dataset.jsonl
Each line in the JSONL files is a JSON object with this structure:
{
  "image": "orion_dataset/images/high_port_rotterdam.png",
  "conversations": [
    {
      "role": "user",
      "content": "<image>\nYou are an autonomous orbital triage assistant. Analyze this high-resolution RGB satellite image captured at Longitude: 4.05, Latitude: 51.95.\nStrictly use one of these categories based on visual morphology:\n- HIGH: ...\n- MEDIUM: ...\n- LOW: ...\nYou MUST output your response as a valid JSON object. To ensure accurate visual reasoning, you must output the \"reason\" key FIRST, followed by the \"category\" key."
    },
    {
      "role": "assistant",
      "content": "{\"reason\": \"Extreme-density geometric cargo terminals and massive vessel berthing.\", \"category\": \"HIGH\"}"
    }
  ]
}
The <image> token in the user content marks where the image will be injected during training.
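To sanity-check a generated file, each line can be parsed and its nested assistant answer decoded. The sketch below assumes the two-turn layout shown above; parse_record is a hypothetical helper, and the "exactly one <image> token" check is an assumption about what a training loader would expect, not something the generator is documented to enforce.

```python
import json


def parse_record(line):
    """Parse one JSONL line and return (image_path, category, reason)."""
    record = json.loads(line)
    user, assistant = record["conversations"]  # expects exactly two turns
    if user["role"] != "user" or assistant["role"] != "assistant":
        raise ValueError("unexpected conversation roles")
    if user["content"].count("<image>") != 1:
        raise ValueError("expected exactly one <image> token in the user prompt")
    # The assistant content is itself a JSON string, so it is decoded a second time.
    answer = json.loads(assistant["content"])
    if list(answer) != ["reason", "category"]:
        raise ValueError("answer must contain 'reason' first, then 'category'")
    return record["image"], answer["category"], answer["reason"]
```

Calling parse_record on every line of train_dataset.jsonl and test_dataset.jsonl is a quick way to catch malformed records before training.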
Customizing Targets
To add new targets, edit the target lists in ground_segment/data/data.py:
HIGH_TARGETS = [
    # Add your target
    {
        "name": "high_my_new_target",
        "lon": -77.05,
        "lat": 38.87,
        "cat": "HIGH",
        "reason": "Describe the visual morphology that justifies this classification.",
    },
    # ... existing targets
]
Guidelines for adding targets:
- Use a descriptive name prefixed with the category abbreviation (high_, med_, low_).
- The reason should describe the visual morphology an orbital observer would see, not domain knowledge about the location.
- Ensure new targets are at least 2 km from existing targets to avoid being filtered out by the proximity check.
- For robust training, consider adding targets that are visually challenging for their category (e.g., natural formations that resemble human infrastructure for LOW).
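Before adding a target, the naming and spacing guidelines can be checked with a short script. The sketch below is hypothetical: check_new_target and haversine_km are not part of the project, the great-circle (haversine) distance is an assumption about how the 2 km proximity check measures distance, and "MEDIUM" is assumed to be the cat value for med_ targets.

```python
import math

EARTH_RADIUS_KM = 6371.0
# Assumed mapping from the "cat" value to the required name prefix.
PREFIXES = {"HIGH": "high_", "MEDIUM": "med_", "LOW": "low_"}


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def check_new_target(candidate, existing, min_km=2.0):
    """Return a list of guideline violations; an empty list means no issues found."""
    problems = []
    prefix = PREFIXES.get(candidate["cat"])
    if prefix is None:
        problems.append(f"unknown category {candidate['cat']!r}")
    elif not candidate["name"].startswith(prefix):
        problems.append(f"name should start with {prefix!r}")
    for other in existing:
        dist = haversine_km(candidate["lon"], candidate["lat"], other["lon"], other["lat"])
        if dist < min_km:
            problems.append(f"only {dist:.2f} km from {other['name']}")
    return problems
```

For instance, a candidate at latitude 38.87 and an existing target at 38.88 on the same longitude are roughly 1.1 km apart, so the candidate would be flagged as too close.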