DeepSwinLite: A Swin Transformer-Based Light Deep Learning Model for Building Extraction Using VHR Aerial Imagery
Abstract
Accurate extraction of building features from remotely sensed data is essential for research and applications in urban planning, land management, transportation infrastructure development, and disaster monitoring. Although deep learning is the state-of-the-art (SOTA) approach to building extraction, substantial challenges remain, largely because of the diversity of building structures and the complexity of background features. To address these issues, this study introduces DeepSwinLite, a lightweight architecture based on the Swin Transformer and designed to extract building footprints from very high-resolution (VHR) imagery. The model integrates a novel local-global attention module that improves the interpretation of objects across varying spatial resolutions and facilitates effective information exchange between feature abstraction levels. The architecture comprises three modules: multi-scale feature aggregation (MSFA), which improves recognition across varying object sizes; a multi-level feature pyramid (MLFP), which fuses detailed and semantic features; and AuxHead, which provides auxiliary supervision to stabilize and enhance learning. Experimental evaluations on the Massachusetts and WHU Building Datasets show that DeepSwinLite outperforms existing SOTA models. On the Massachusetts dataset, the model attained an overall accuracy (OA) of 92.54% and an intersection over union (IoU) of 77.94%, while on the WHU dataset it achieved an OA of 98.32% and an IoU of 92.02%. Following the correction of errors identified in the Massachusetts ground truth and subsequent iterative enhancement, the model's performance on that dataset improved further, reaching 94.63% OA and 79.86% IoU. A key advantage of DeepSwinLite is its computational efficiency: it requires fewer floating-point operations (FLOPs) and fewer parameters than other SOTA models, which makes it particularly suitable for deployment on mobile and resource-constrained systems.
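The abstract reports OA and IoU without stating how they are computed. The following is a minimal sketch of the standard pixel-wise definitions for a binary building mask, OA = (TP + TN) / (TP + FP + FN + TN) and IoU = TP / (TP + FP + FN). It assumes that IoU is reported for the building class only and that masks are nested lists of 0/1 values. The function names (`confusion_counts`, `overall_accuracy`, `building_iou`) are hypothetical and are not taken from the DeepSwinLite code.

```python
from __future__ import annotations

from typing import Sequence

# Assumed mask format: 2-D nested lists, 1 = building, 0 = background.
Mask = Sequence[Sequence[int]]


def confusion_counts(pred: Mask, truth: Mask) -> tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) pixel counts for two equally sized binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of rows")
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        if len(pred_row) != len(truth_row):
            raise ValueError("masks must have the same number of columns")
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1  # building predicted as building
            elif p:
                fp += 1  # background predicted as building
            elif t:
                fn += 1  # building pixel missed
            else:
                tn += 1  # background predicted as background
    return tp, fp, fn, tn


def overall_accuracy(pred: Mask, truth: Mask) -> float:
    """OA: fraction of all pixels that are labelled correctly."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    return (tp + tn) / (tp + fp + fn + tn)


def building_iou(pred: Mask, truth: Mask) -> float:
    """IoU of the building class: |pred ∩ truth| / |pred ∪ truth|."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    union = tp + fp + fn
    # Undefined when neither mask contains a building pixel.
    return tp / union if union else float("nan")


if __name__ == "__main__":
    pred = [[1, 1, 0], [0, 1, 0], [0, 0, 0]]
    truth = [[1, 0, 0], [0, 1, 1], [0, 0, 0]]
    print(confusion_counts(pred, truth))  # (2, 1, 1, 5)
    print(f"OA  = {overall_accuracy(pred, truth):.4f}")  # OA  = 0.7778
    print(f"IoU = {building_iou(pred, truth):.4f}")  # IoU = 0.5000
```

OA counts the many correctly classified background pixels, whereas IoU ignores true negatives entirely. This is why OA values in the abstract (for example 98.32% on WHU) are consistently higher than the corresponding IoU values (92.02%).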