We propose SatBEV, a cross-view geospatial foundation model that learns dense lane-level correspondence between satellite imagery and street-level driving observations. Instead of treating satellite images as auxiliary features for HD map construction, we train a dual-encoder architecture that aligns overhead and perspective views in a shared geometric embedding space at the granularity of individual lane instances.
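The dual-encoder alignment described above can be sketched as a symmetric InfoNCE loss over paired satellite and BEV embeddings. This is an illustrative numpy stand-in, not the model's actual implementation; the function name, temperature value, and shapes are assumptions.

```python
import numpy as np

def info_nce(sat_emb, bev_emb, temperature=0.07):
    """Symmetric InfoNCE over paired satellite/BEV embeddings.

    sat_emb, bev_emb: (N, D) arrays; row i of each encodes the same
    location, so the (N, N) similarity matrix has positives on its
    diagonal. (Hypothetical sketch of the alignment objective.)
    """
    # L2-normalize so dot products are cosine similarities.
    sat = sat_emb / np.linalg.norm(sat_emb, axis=1, keepdims=True)
    bev = bev_emb / np.linalg.norm(bev_emb, axis=1, keepdims=True)
    logits = sat @ bev.T / temperature
    idx = np.arange(len(logits))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()             # -log p(correct pair)

    # Average both retrieval directions: sat->BEV and BEV->sat.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Matched pairs should score lower loss than shuffled pairs, which is the property the shared embedding space is trained to satisfy.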
The model is pretrained on OpenSatMap’s 38k satellite images across 60 cities using a masked autoencoder objective, then fine-tuned on the spatially aligned nuScenes and Argoverse 2 subsets with three joint objectives: contrastive satellite-BEV alignment, lane-level cross-view correspondence prediction via GPS-supervised cross-attention, and a map transfer objective that predicts ego-vehicle-frame HD maps from satellite tiles alone.
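The masked-autoencoder pretraining step can be sketched as follows. The `encode`/`decode` callables stand in for the actual encoder and decoder networks, and their signatures, the mask ratio, and the patch layout are all assumptions for illustration only.

```python
import numpy as np

def mae_pretrain_step(patches, encode, decode, mask_ratio=0.75, rng=None):
    """One masked-autoencoder step over a patchified satellite tile.

    patches: (P, D) array of flattened image patches.
    encode/decode: caller-supplied stand-ins for the network
    (hypothetical signatures, for illustration).
    """
    rng = rng or np.random.default_rng()
    n_mask = int(mask_ratio * len(patches))
    order = rng.permutation(len(patches))
    masked, visible = order[:n_mask], order[n_mask:]

    latent = encode(patches[visible])           # encoder sees visible patches only
    recon = decode(latent, masked, visible)     # decoder predicts the masked ones

    # Standard MAE recipe: reconstruction loss on masked patches only.
    return np.mean((recon - patches[masked]) ** 2)
```

The key design point, following the MAE recipe, is that the encoder never sees masked patches and the loss is computed only on them.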
The key technical contribution is a cross-view transformer with GPS-derived positional encoding that produces dense correspondence fields mapping satellite lane pixels to BEV lane pixels, handling projective distortion and partial observability. At inference, this enables HD map generation for any location with satellite coverage, with no driving data or onboard sensors required.
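A minimal sketch of the two pieces named above: a sinusoidal encoding of metric GPS offsets, and a soft correspondence field computed as cross-attention weights from BEV pixel queries to satellite pixel keys. The encoding scheme, dimensions, and temperature are assumptions, not the paper's exact design.

```python
import numpy as np

def gps_positional_encoding(xy_meters, dim=16, max_range=100.0):
    """Sinusoidal encoding of metric east/north offsets from a tile's
    GPS anchor. xy_meters: (N, 2); returns (N, 2*dim) features.
    (Illustrative stand-in for the GPS-derived positional encoding.)
    """
    freqs = max_range ** (-np.arange(dim // 2) / (dim // 2))  # (dim/2,)
    phase = xy_meters[:, :, None] * freqs                     # (N, 2, dim/2)
    enc = np.concatenate([np.sin(phase), np.cos(phase)], axis=-1)
    return enc.reshape(len(xy_meters), -1)

def correspondence_field(bev_q, sat_k, temperature=1.0):
    """Soft correspondence: softmax attention weights from each BEV
    pixel query to all satellite pixel keys; each row sums to 1."""
    logits = bev_q @ sat_k.T / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)
```

Because the encoding depends only on metric offsets, pixels at the same geographic position get identical features regardless of which view they come from, which is what lets attention act as a correspondence prior.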
We evaluate on zero-shot HD map prediction (IoU on nuScenes val for divider/crossing/boundary), few-shot city adaptation with 10 street-level samples, and ablations on whether OpenSatMap’s instance-level annotations improve transfer over coarse semantic labels. The baseline to beat is SatforHDMap at 50.9 IoU; the target is generalization to cities entirely absent from driving datasets.
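The per-class IoU metric used in the evaluation can be sketched as below, assuming rasterized boolean BEV masks with one channel per map element class (the channel order and raster shapes are assumptions).

```python
import numpy as np

def per_class_iou(pred, gt, classes=("divider", "crossing", "boundary")):
    """IoU per map element class from rasterized BEV masks.

    pred, gt: (C, H, W) boolean arrays, one channel per class.
    Returns {class_name: IoU}, with NaN where the union is empty.
    """
    ious = {}
    for c, name in enumerate(classes):
        inter = np.logical_and(pred[c], gt[c]).sum()
        union = np.logical_or(pred[c], gt[c]).sum()
        ious[name] = inter / union if union else float("nan")
    return ious
```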