Key takeaways §
- Universal remote sensing foundation model
- Trained on one million spectral remote sensing Sentinel-2 images
- 3 model sizes: Base (100M parameters), Large (300M), Huge (600M)
- 3D masking strategy: an encoder learns visual representations from spatial-spectral mixed tokens, and a decoder with multi-target reconstruction preserves spectrally sequential characteristics
- Masked autoencoder (MAE) framework with a 90% masking rate (see the masking sketch after this list)
- New benchmark dataset, SegMunich, focused on urban areas
- Pretraining: 700k images (12 spectral bands) from fMoW-S2 for 200 epochs, then 350k BigEarthNet Sentinel-2 images for 100 epochs; 8 NVIDIA RTX 4090 GPUs for pretraining, 4 GPUs for fine-tuning
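A minimal sketch of the 3D masking step, assuming a PyTorch-style (B, C, H, W) Sentinel-2 tensor; the helper name `random_3d_mask` and all sizes (band groups of 3, 8×8 patches) are illustrative assumptions, not the paper's implementation:

```python
import torch

def random_3d_mask(x, patch=8, spec_group=3, mask_ratio=0.9):
    """Split a multispectral image into spatial-spectral cube tokens
    and randomly mask 90% of them, MAE-style."""
    B, C, H, W = x.shape
    tokens = (
        x.unfold(1, spec_group, spec_group)  # group bands: (B, C/g, H, W, g)
         .unfold(2, patch, patch)            # split height into patches
         .unfold(3, patch, patch)            # -> (B, C/g, H/p, W/p, g, p, p)
    )
    tokens = tokens.reshape(B, -1, spec_group * patch * patch)  # (B, N, D)

    N = tokens.shape[1]
    n_keep = int(N * (1 - mask_ratio))        # keep only ~10% of the tokens
    noise = torch.rand(B, N)                  # random score per token
    keep_idx = noise.argsort(dim=1)[:, :n_keep]
    visible = torch.gather(
        tokens, 1,
        keep_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    return visible, keep_idx                  # encoder sees only `visible`

x = torch.randn(2, 12, 96, 96)                # toy batch: 12-band images
visible, keep_idx = random_3d_mask(x)
print(visible.shape)                          # (2, 57, 192): 57 of 576 cubes kept
```

The 90% rate means the encoder processes roughly one token in ten, which is what keeps MAE-style pretraining tractable at this scale.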
Downstream tasks §
- Single/multi-label scene classification: EuroSAT Sentinel-2 images, 10 land-cover classes, 13 bands, 3k labeled images, 150 epochs (see the fine-tuning sketch after this list)
- Semantic segmentation
- Change detection
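A hedged fine-tuning sketch for the scene classification task, mainly to show the single-label vs. multi-label distinction; the flat `encoder` is a stand-in for the pretrained backbone, and the 19-class multi-label head and all sizes are assumptions:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(12 * 96 * 96, 768))  # placeholder backbone
head = nn.Linear(768, 19)                  # e.g. 19 multi-label classes
criterion = nn.BCEWithLogitsLoss()         # multi-label: independent sigmoids
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

x = torch.randn(4, 12, 96, 96)             # toy batch of 12-band images
y = torch.randint(0, 2, (4, 19)).float()   # multi-hot labels

logits = head(encoder(x))                  # (4, 19) per-class logits
loss = criterion(logits, y)
loss.backward()
optimizer.step()
```

For single-label EuroSAT, the same setup would instead use a 10-way head with `nn.CrossEntropyLoss` and integer class labels.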
Prev works §
- Momentum contrast (MoCo): Introduces momentum updates to improve the contrastive learning process
- Simple contrastive learning (SimCLR): Leverages data augmentations to enhance the variety and complexity of the image pairs used for contrastive learning (see the contrastive-loss sketch after this list)
- Generative learning based on masked image modeling (MIM)
- Example MIM architecture: Bidirectional Encoder representation from Image Transformers (BEiT), built on top of Vision Transformers (ViT). MIM also allows flexible use of various deep architectures as network backbones (ViT, Swin Transformer)
- MIM takes all image patches as input, which incurs high computational cost and limits certain applications
- MIM alternative: Masked autoencoders (MAE), where unmasked patches or pixels are used to reconstruct those that are masked. Computationally more efficient, since the heavy encoder processes only the visible patches (see the MAE sketch after this list)
- SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery
- fMoW-S2 dataset: Training from scratch on 700k images, with 3D tensor-based random weight initialization
- BigEarthNet-S2 dataset: Progressive pretraining on 350k images with varying image sizes, resolutions, time-series information, and geographic regions. Fine-tuning data: 35k labeled images
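To make the MoCo/SimCLR bullets concrete, a minimal InfoNCE-style contrastive loss, simplified to one direction; `info_nce` is a hypothetical helper, not either paper's exact objective:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1[i] and z2[i] embed two augmented views of image i;
    every other pair in the batch acts as a negative."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature       # (B, B) cosine similarities
    labels = torch.arange(z1.shape[0])     # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)  # toy projected embeddings
print(info_nce(z1, z2))
```

MoCo's contribution on top of this is a momentum-updated key encoder with a queue of negatives; SimCLR instead relies on strong augmentations and large in-batch negative sets.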
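And a sketch of the MAE efficiency argument from the bullets above: the heavy encoder attends only over the ~10% visible tokens, while a lightweight decoder fills the masked slots with a learned mask token before reconstruction. Module sizes here are illustrative assumptions:

```python
import torch
import torch.nn as nn

D, N, n_vis = 192, 576, 57                     # token dim, total vs. visible tokens
encoder = nn.TransformerEncoder(               # heavy: sees visible tokens only
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=4)
decoder = nn.TransformerEncoder(               # light: sees all token slots
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=1)
mask_token = nn.Parameter(torch.zeros(1, 1, D))

visible = torch.randn(2, n_vis, D)             # output of the 90% masking step
latent = encoder(visible)                      # attention cost ~ n_vis^2, not N^2
full = mask_token.expand(2, N, D).clone()      # every slot starts as a mask token
full[:, :n_vis] = latent                       # (index unshuffling omitted)
recon = decoder(full)                          # reconstruct the masked cubes
print(recon.shape)                             # (2, 576, 192)
```

Full-input MIM (BEiT-style) runs the heavy backbone on all N tokens every step, which is where the computational-cost difference noted above comes from.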