
data_gen

data.data_gen

ORION dataset generator - fetches satellite tiles from SimSat and writes JSONL splits.

For each target in data.ALL_TARGETS that survives the proximity filter (see filter_overlaps), this script:

  1. Fetches a 512x512 Mapbox satellite tile from SimSat's static image API.
  2. Assigns the target to a deterministic train/val/test split (seeded shuffle).
  3. Writes a conversation-format JSONL record suitable for LLaVA-style fine-tuning.

Train records are augmented with coordinate dropout: each train target produces two records, one with GPS coordinates in the prompt and one without, so that the model learns to classify from imagery alone when telemetry is unavailable. Val and test targets produce a single record each, always with coordinates.

Usage:

cd ground_segment/data
uv run data_gen.py        # requires SimSat running on localhost:9005

Output structure:

orion_dataset/
    images/              # 512x512 PNG tiles
    train_dataset.jsonl  # 2x train targets (coord augmentation)
    val_dataset.jsonl    # eval-loss tracking during training
    test_dataset.jsonl   # held-out evaluation set
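
As a sanity check on the layout above, the following is a minimal sketch of reading a split back and confirming the 2x augmentation. The helper names `load_jsonl` and `images_without_pair` are hypothetical and not part of data_gen.py; the only assumption about the schema is that every record carries an `image` key, as described for make_record.

```python
import json
from collections import Counter


def load_jsonl(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def images_without_pair(records):
    """Return {image_path: count} for images that do not appear exactly twice.

    In the train split every image is expected twice (with and without
    coordinates), so an empty result means the augmentation looks complete.
    """
    counts = Counter(r["image"] for r in records)
    return {img: n for img, n in counts.items() if n != 2}
```

Calling `images_without_pair(load_jsonl("orion_dataset/train_dataset.jsonl"))` after a run should return an empty dict if every train tile was written in both variants.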

fetch_image(lon, lat, filename)

Fetch a satellite tile from SimSat's static Mapbox API and save it to disk.

Parameters:

Name      Description                           Default
lon       Target longitude.                     required
lat       Target latitude.                      required
filename  Output file path for the PNG image.   required

Returns:

True on success, False if the request failed.

Source code in ground_segment/data/data_gen.py
def fetch_image(lon, lat, filename):
    """Fetch a satellite tile from SimSat's static Mapbox API and save it to disk.

    Args:
        lon: Target longitude.
        lat: Target latitude.
        filename: Output file path for the PNG image.

    Returns:
        ``True`` on success, ``False`` if the request failed.
    """
    params = {
        "lon_target": lon,
        "lat_target": lat,
        "lon_satellite": lon,
        "lat_satellite": lat,
        "alt_satellite": 500.0,
    }
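    try:
        res = requests.get(SIMSAT_STATIC_API, params=params, timeout=10)
        res.raise_for_status()  # treat HTTP 4xx/5xx as a failed fetch
        with open(filename, "wb") as f:
            f.write(res.content)
        return True
    except (requests.RequestException, OSError):
        # Network/HTTP errors and file-write errors both count as failure.
        return False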

filter_overlaps(targets, min_dist_km=2.0)

Remove targets whose coordinates are within min_dist_km of an already-kept target.

Uses a greedy first-come-first-kept strategy in list order. This prevents near-duplicate tiles from inflating a single geographic area in the dataset.

Parameters:

Name         Description                                             Default
targets      List of target dicts (must contain lon and lat keys).   required
min_dist_km  Minimum separation in km.                               2.0

Returns:

Filtered list of targets with overlaps removed.

Source code in ground_segment/data/data_gen.py
def filter_overlaps(targets, min_dist_km=2.0):
    """Remove targets whose coordinates are within *min_dist_km* of an already-kept target.

    Uses a greedy first-come-first-kept strategy in list order. This prevents
    near-duplicate tiles from inflating a single geographic area in the dataset.

    Args:
        targets: List of target dicts (must contain ``lon`` and ``lat`` keys).
        min_dist_km: Minimum separation in km (default 2.0).

    Returns:
        Filtered list of targets with overlaps removed.
    """
    unique_targets = []
    skipped = 0
    for t in targets:
        is_too_close = False
        for u in unique_targets:
            if haversine(t["lon"], t["lat"], u["lon"], u["lat"]) < min_dist_km:
                is_too_close = True
                break
        if not is_too_close:
            unique_targets.append(t)
        else:
            skipped += 1
    print(
        f" Proximity Filter: Kept {len(unique_targets)} targets, skipped {skipped} overlaps."
    )
    return unique_targets
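
Because the filter is greedy, the result depends on list order. The sketch below illustrates this with three invented targets on the equator. `keep_separated` is a hypothetical, print-free re-implementation of the same strategy, not the function from data_gen.py, and the coordinates are made up for the example.

```python
import math


def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km (same formula as data_gen.haversine)."""
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(math.radians(lat1))
        * math.cos(math.radians(lat2))
        * math.sin(dlon / 2) ** 2
    )
    return 6371 * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def keep_separated(targets, min_dist_km=2.0):
    """Greedy filter: keep a target only if it is far enough from every kept one."""
    kept = []
    for t in targets:
        if all(
            haversine(t["lon"], t["lat"], u["lon"], u["lat"]) >= min_dist_km
            for u in kept
        ):
            kept.append(t)
    return kept


# Hypothetical targets; 0.01 degrees of longitude is about 1.1 km at the equator.
targets = [
    {"name": "A", "lon": 0.00, "lat": 0.0},
    {"name": "B", "lon": 0.01, "lat": 0.0},
    {"name": "C", "lon": 0.05, "lat": 0.0},
]
print([t["name"] for t in keep_separated(targets)])        # ['A', 'C']
print([t["name"] for t in keep_separated(targets[::-1])])  # ['C', 'B']
```

In forward order B is dropped for being about 1.1 km from A; in reverse order B is kept (it is about 4.4 km from C) and A is dropped instead. Which member of a close pair survives is therefore decided by the order of the input list.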

get_prompt(lon, lat, include_coords=True)

Build the ChatML user prompt for a single satellite image.

Parameters:

Name            Description                                                 Default
lon             Longitude of the capture location.                          required
lat             Latitude of the capture location.                           required
include_coords  If False, omit GPS coordinates from the prompt              True
                (coordinate dropout for training augmentation).

Returns:

The triage instruction prompt as a string.

Source code in ground_segment/data/data_gen.py
def get_prompt(lon, lat, include_coords=True):
    """Build the ChatML user prompt for a single satellite image.

    Args:
        lon: Longitude of the capture location.
        lat: Latitude of the capture location.
        include_coords: If ``False``, omit GPS coordinates from the prompt
            (coordinate dropout for training augmentation).

    Returns:
        The triage instruction prompt as a string.
    """
    if include_coords:
        telemetry_str = f" captured at Longitude: {lon}, Latitude: {lat}"
    else:
        telemetry_str = ""

    return f"""You are an autonomous orbital triage assistant. Analyze this high-resolution RGB satellite image{telemetry_str}.
Strictly use one of these categories based on visual morphology:
- HIGH: Extreme-scale strategic anomalies, dense geometric cargo/vessel infrastructure, massive cooling towers, sprawling runways, or distinct geological/artificial chokepoints.
- MEDIUM: Standard human civilization. Ordinary urban grids, low-density suburban sprawl, regular checkerboard agriculture, or localized infrastructure (malls, regional strips).
- LOW: Complete absence of human infrastructure. Featureless deep oceans, unbroken canopy, barren deserts, or purely natural geological formations (craters, natural cliffs).
You MUST output your response as a valid JSON object. To ensure accurate visual reasoning, you must output the "reason" key FIRST, followed by the "category" key."""
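
The prompt asks for a JSON object with "reason" first and "category" second, with the category drawn from HIGH, MEDIUM, or LOW. A consumer of the model's replies could check that contract roughly as follows. `parse_triage` is a hypothetical helper, not part of data_gen.py, and rejecting extra keys is an assumption that is stricter than the prompt itself.

```python
import json

ALLOWED_CATEGORIES = {"HIGH", "MEDIUM", "LOW"}


def parse_triage(text):
    """Return (reason, category), or raise ValueError if the reply breaks the contract."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    # json.loads preserves key order, so list(obj) reflects the order in the reply.
    if not isinstance(obj, dict) or list(obj) != ["reason", "category"]:
        raise ValueError('expected the keys "reason" then "category"')
    if obj["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {obj['category']!r}")
    return obj["reason"], obj["category"]


reply = '{"reason": "Dense container cranes along a quay.", "category": "HIGH"}'
print(parse_triage(reply)[1])  # HIGH
```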

haversine(lon1, lat1, lon2, lat2)

Compute the great-circle distance in km between two points using the Haversine formula.

Parameters:

Name  Description                                 Default
lon1  Longitude of the first point in degrees.    required
lat1  Latitude of the first point in degrees.     required
lon2  Longitude of the second point in degrees.   required
lat2  Latitude of the second point in degrees.    required

Returns:

Distance in kilometres.

Source code in ground_segment/data/data_gen.py
def haversine(lon1, lat1, lon2, lat2):
    """Compute the great-circle distance in km between two points using the Haversine formula.

    Args:
        lon1: Longitude of the first point in degrees.
        lat1: Latitude of the first point in degrees.
        lon2: Longitude of the second point in degrees.
        lat2: Latitude of the second point in degrees.

    Returns:
        Distance in kilometres.
    """
    R = 6371
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(math.radians(lat1))
        * math.cos(math.radians(lat2))
        * math.sin(dlon / 2) ** 2
    )
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c
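
As a quick worked check of the formula, one degree of latitude along a meridian should come out at R times pi/180, about 111.19 km for R = 6371. The sketch below repeats the function so that it runs on its own; the sample coordinates are arbitrary.

```python
import math


def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    R = 6371  # mean Earth radius in km
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(math.radians(lat1))
        * math.cos(math.radians(lat2))
        * math.sin(dlon / 2) ** 2
    )
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return R * c


# One degree of latitude on the prime meridian.
print(round(haversine(0.0, 0.0, 0.0, 1.0), 2))  # 111.19

# One degree of longitude at 60 degrees north is roughly half that,
# because meridians converge towards the poles.
print(haversine(0.0, 60.0, 1.0, 60.0))
```

Note the argument order: longitude comes before latitude in every function in this module.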

main()

Generate the full ORION training dataset.

Shuffles all targets with a fixed seed, splits into train/val/test, fetches each tile from SimSat, and writes the corresponding JSONL records. Train targets are augmented with coordinate dropout (2x records).

Source code in ground_segment/data/data_gen.py
def main():
    """Generate the full ORION training dataset.

    Shuffles all targets with a fixed seed, splits into train/val/test,
    fetches each tile from SimSat, and writes the corresponding JSONL records.
    Train targets are augmented with coordinate dropout (2x records).
    """
    setup_dirs()
    random.seed(SPLIT_SEED)

    # 1. Combine OLD + NEW (via ALL_TARGETS) and filter proximity overlaps once.
    clean_targets = filter_overlaps(ALL_TARGETS)

    # 2. Deterministic shuffle, then carve fixed-size test and val sets off the front.
    #    Remaining samples become train. IID so no distributional gap between splits.
    random.shuffle(clean_targets)
    test_set = clean_targets[:TEST_SIZE]
    val_set = clean_targets[TEST_SIZE : TEST_SIZE + VAL_SIZE]
    train_set = clean_targets[TEST_SIZE + VAL_SIZE :]

    print(
        f" Splits: {len(train_set)} train | {len(val_set)} val | {len(test_set)} test"
    )

    # Process every sample: fetch image once, write to the appropriate JSONL.
    all_samples = train_set + val_set + test_set
    train_names = {s["name"] for s in train_set}
    val_names = {s["name"] for s in val_set}

    for idx, sample in enumerate(all_samples):
        img_filename = f"{sample['name']}.png"
        img_path = os.path.join(IMAGES_DIR, img_filename)

        # \033[K = ANSI "erase from cursor to end of line", prevents leftover
        # characters when a shorter sample name follows a longer one.
        print(
            f"\r\033[K[{idx + 1}/{len(all_samples)}] Fetching {sample['name']}...",
            end="",
            flush=True,
        )

        if fetch_image(sample["lon"], sample["lat"], img_path):
            if sample["name"] in train_names:
                # TRAIN: augment with both coords-present and coords-absent variants.
                for include_coords in [True, False]:
                    with open(TRAIN_FILE, "a") as f:
                        f.write(
                            json.dumps(make_record(sample, img_path, include_coords))
                            + "\n"
                        )
            elif sample["name"] in val_names:
                # VAL: single record with coords. Used for eval_loss tracking
                # during training (early-stopping / best-checkpoint selection).
                with open(VAL_FILE, "a") as f:
                    f.write(json.dumps(make_record(sample, img_path, True)) + "\n")
            else:
                # TEST: held-out split, always with coords. The evaluate.py
                # ablation script handles coord stripping for Conditions B/C/D.
                with open(TEST_FILE, "a") as f:
                    f.write(json.dumps(make_record(sample, img_path, True)) + "\n")

        time.sleep(0.5)  # SimSat should handle 2 req/sec

    print("\n\n Dataset generated successfully.")
    print(f"   Train (augmented): {len(train_set) * 2} records → {TRAIN_FILE}")
    print(f"   Val:               {len(val_set)} records → {VAL_FILE}")
    print(f"   Test (held-out):   {len(test_set)} records → {TEST_FILE}")
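
The split logic in main() can be isolated as a seeded shuffle followed by slicing. The sketch below is one way to write it, under stated assumptions: `split_targets` is a hypothetical helper, the sizes and seed are invented (the real TEST_SIZE, VAL_SIZE, and SPLIT_SEED are defined elsewhere in the module), and it uses a private `random.Random` instance where main() seeds the global generator.

```python
import random


def split_targets(targets, test_size, val_size, seed):
    """Shuffle a copy with a private RNG, then slice off test, val, and train in that order."""
    rng = random.Random(seed)  # local RNG: leaves the global random state untouched
    shuffled = list(targets)   # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    test = shuffled[:test_size]
    val = shuffled[test_size : test_size + val_size]
    train = shuffled[test_size + val_size :]
    return train, val, test


# Hypothetical target names and sizes, for illustration only.
names = [f"target_{i:02d}" for i in range(10)]
train, val, test = split_targets(names, test_size=2, val_size=3, seed=42)
print(len(train), len(val), len(test))  # 5 3 2
```

Because the shuffle is seeded, rerunning with the same input list and seed reproduces the same split. Adding, removing, or reordering targets changes the shuffle and can move existing targets between splits.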

make_record(sample, img_path, include_coords=True)

Build a single conversation-format JSONL record for LLaVA-style fine-tuning.

Parameters:

Name            Description                                         Default
sample          Target dict with name, lon, lat, cat, and reason.   required
img_path        Path to the saved satellite tile image.             required
include_coords  Whether to include GPS coordinates in the prompt.   True

Returns:

A dict with image and conversations keys matching the training schema.

Source code in ground_segment/data/data_gen.py
def make_record(sample, img_path, include_coords=True):
    """Build a single conversation-format JSONL record for LLaVA-style fine-tuning.

    Args:
        sample: Target dict with ``name``, ``lon``, ``lat``, ``cat``, and ``reason``.
        img_path: Path to the saved satellite tile image.
        include_coords: Whether to include GPS coordinates in the prompt.

    Returns:
        A dict with ``image`` and ``conversations`` keys matching the training schema.
    """
    return {
        "image": img_path,
        "conversations": [
            {
                "role": "user",
                "content": f"<image>\n{get_prompt(sample['lon'], sample['lat'], include_coords)}",
            },
            {
                "role": "assistant",
                "content": json.dumps(
                    {"reason": sample["reason"], "category": sample["cat"]}
                ),
            },
        ],
    }
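
Note that the assistant turn is itself a JSON string nested inside the JSON record, so reading a label back takes two decoding steps. The sketch below shows the round trip on a hand-written record in the shape make_record returns. The path, prompt text, and label values are invented for the example.

```python
import json

# Hypothetical record in the shape make_record returns; all values are invented.
record = {
    "image": "orion_dataset/images/example_target.png",
    "conversations": [
        {"role": "user", "content": "<image>\nAnalyze this satellite image."},
        {
            "role": "assistant",
            "content": json.dumps(
                {"reason": "Open ocean, no structures.", "category": "LOW"}
            ),
        },
    ],
}

line = json.dumps(record)     # what one line of the JSONL file looks like
decoded = json.loads(line)    # first decode: the record itself
label = json.loads(decoded["conversations"][1]["content"])  # second decode: the label
print(label["category"])  # LOW
```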

setup_dirs()

Create the output directory structure and clear any previous JSONL files.

Source code in ground_segment/data/data_gen.py
def setup_dirs():
    """Create the output directory structure and clear any previous JSONL files."""
    os.makedirs(IMAGES_DIR, exist_ok=True)
    for f in [TRAIN_FILE, VAL_FILE, TEST_FILE]:
        if os.path.exists(f):
            os.remove(f)