Unmanned Aerial Vehicles (UAVs) increasingly rely on accurate object detection, yet current technology struggles with the gap between standard image datasets and the unique perspective of aerial imagery. Zhenhai Weng of Chongqing University, Zhongliang Yu, and their colleagues tackle this challenge by creating resources specifically designed to improve UAV-based object detection. The team addresses the performance gap by constructing two new datasets, UAVDE-2M and UAVCAP-15K, which together contain over two million labelled objects across a wide range of categories. They further enhance detection capabilities with a novel cross-modal fusion module, integrated into an existing real-time detection framework, and demonstrate significant improvements on standard UAV imagery benchmarks, paving the way for more reliable automated analysis in applications such as remote sensing and environmental monitoring.
Aerial Image Datasets for Object Detection
This research surveys the rapidly evolving field of computer vision applied to drone technology, focusing on the datasets and methods driving advances in aerial image analysis. Researchers are developing robust, accurate, and efficient algorithms for a wide range of applications, from disaster response to precision agriculture. A significant portion of this work centers on the creation and use of specialized datasets designed to train and evaluate these algorithms. Numerous datasets are available, categorized by their focus. General-purpose object detection datasets such as COCO and Objects365 provide broad coverage of common objects, though their images are predominantly ground-level.
DOTA and UCAS-AeroCap specifically address the challenges of small object detection in aerial views. Specialized datasets cater to specific needs, such as Air-to-Air Visual Detection of Micro-UAVs for drone detection and RescueNet for damage assessment during disasters. Other datasets focus on maritime scenes, wildfire management, and various specialized tasks. Researchers employ a variety of computer vision techniques to analyze drone imagery. Object detection, a primary focus, uses methods such as YOLO, Faster R-CNN, and DETR to locate and classify objects (see the usage sketch after this section).
Semantic and instance segmentation assign labels to pixels and individual objects, respectively. Multi-modal learning combines data from different sensors, while cross-view image understanding analyzes images from multiple perspectives. Addressing the unique challenges of aerial imagery, scientists also explore small object detection, uncertainty-aware learning, and robust performance under varying environmental conditions.
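To make the detectors named above concrete, here is a minimal sketch of running one of them, Faster R-CNN via torchvision's pre-trained weights, on a single image. The image path is a placeholder, and note that COCO-trained weights like these are exactly the kind the article argues degrade on aerial viewpoints.

```python
# Minimal off-the-shelf detection sketch with torchvision's Faster R-CNN.
# "aerial_scene.jpg" is a placeholder path, not a file from any dataset here.
import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights).eval()

img = read_image("aerial_scene.jpg")          # uint8 CHW tensor
batch = [weights.transforms()(img)]           # preset handles dtype/normalization
with torch.no_grad():
    out = model(batch)[0]                     # dict: boxes, labels, scores

keep = out["scores"] > 0.5                    # simple confidence threshold
names = [weights.meta["categories"][i] for i in out["labels"][keep]]
print(list(zip(names, out["boxes"][keep].tolist())))
```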
UAV Dataset Creation for Open-Vocabulary Detection
Researchers tackled the challenge of open-vocabulary object detection (OVD) in unmanned aerial vehicle (UAV) imagery by developing a refined methodology for dataset creation. Recognizing that existing datasets, primarily composed of ground-level images, are inadequate for UAV applications, the team engineered the UAV-Label Engine to generate accurate object detection annotations and corresponding caption datasets. This engine facilitates the creation of datasets specifically tailored for UAV-based tasks. Leveraging the UAV-Label Engine, scientists constructed two novel datasets: UAVDE-2M, a large-scale object detection dataset containing over 2,000,000 instance annotations across more than 1,800 object categories, and UAVCAP-15K, a caption dataset comprising over 10,000 image-text pairs.
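To make the engine's two outputs concrete, the sketch below shows what a detection record and a caption record might look like. The field names, paths, and values are illustrative assumptions, not the released schema.

```python
# Hypothetical record layouts for the two dataset types; every field name
# and value here is an assumption for illustration only.
detection_record = {
    "image": "uav/000123.jpg",
    "annotations": [
        # open-vocabulary categories are free-form strings, not a fixed label set
        {"category": "excavator", "bbox": [412, 87, 96, 54]},          # [x, y, w, h]
        {"category": "rooftop solar panel", "bbox": [750, 310, 40, 28]},
    ],
}

caption_record = {
    "image": "uav/000123.jpg",
    "caption": "An aerial view of a construction site with an excavator "
               "next to a building with rooftop solar panels.",
}
```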
To effectively integrate textual information into the visual processing stream, the team proposed a Cross-Attention Gated Enhancement (CAGE) fusion module. This module employs a dual-path architecture: cross-attention guided by a gate mechanism provides precise spatial grounding, while a parallel FiLM (feature-wise linear modulation) layer handles global feature modulation. The two pathways are combined via a residual connection, enhancing robustness and feature representation. Scientists then integrated the CAGE module into the YOLO-World-v2 architecture, creating a system capable of high-performance, real-time OVD. Extensive experiments on the VisDrone and SIMD datasets demonstrate the effectiveness of this approach, showing significant improvements over existing real-time OVD methods in UAV-based imagery and remote sensing applications.
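A minimal PyTorch sketch of a CAGE-style dual-path fusion block follows. The layer sizes, the sigmoid gating formulation, and the exact way the two paths feed the residual connection are assumptions based on the description above, not the paper's implementation.

```python
# Sketch of a dual-path text/vision fusion block in the spirit of CAGE.
# Gating, pooling, and residual details are assumptions, not the paper's design.
import torch
import torch.nn as nn


class CAGEFusion(nn.Module):
    def __init__(self, vis_dim: int, txt_dim: int, num_heads: int = 8):
        super().__init__()
        # Path 1: visual tokens attend to text embeddings (cross-attention),
        # then a learned sigmoid gate scales the attended output.
        # vis_dim must be divisible by num_heads.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=vis_dim, kdim=txt_dim, vdim=txt_dim,
            num_heads=num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(vis_dim, vis_dim), nn.Sigmoid())
        # Path 2: FiLM predicts a per-channel scale and shift from a
        # mean-pooled text embedding, for global feature modulation.
        self.film = nn.Linear(txt_dim, 2 * vis_dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, vis_dim) flattened visual tokens
        # txt: (B, T, txt_dim) text-prompt embeddings
        attn_out, _ = self.cross_attn(query=vis, key=txt, value=txt)
        grounded = self.gate(attn_out) * attn_out            # spatial path
        gamma, beta = self.film(txt.mean(dim=1)).chunk(2, dim=-1)
        modulated = gamma.unsqueeze(1) * vis + beta.unsqueeze(1)  # global path
        return vis + grounded + modulated                    # residual fusion


if __name__ == "__main__":
    fuse = CAGEFusion(vis_dim=256, txt_dim=512)
    vis = torch.randn(2, 400, 256)   # e.g. a flattened 20x20 feature map
    txt = torch.randn(2, 32, 512)    # e.g. 32 encoded category prompts
    print(fuse(vis, txt).shape)      # torch.Size([2, 400, 256])
```

The residual connection keeps the original visual features intact, so the block can fall back to them if either the grounding or modulation path contributes little, which is one plausible reading of the robustness claim above.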
UAV Imagery Boosts Open-Vocabulary Object Detection
Scientists have achieved a significant advancement in open-vocabulary object detection (OVD) specifically for unmanned aerial vehicle (UAV) imagery, overcoming a critical performance gap that previously hindered real-world applications. Researchers discovered that existing large-scale datasets, primarily composed of ground-level images, perform poorly when applied to UAV perspectives due to differences in background clutter and object scale. To address this, the team developed a refined UAV-Label Engine, enabling the creation of two new datasets: UAVDE-2M, containing over 2,000,000 instances across 1,800 categories, and UAVCAP-15K, a caption dataset with over 10,000 image-text pairs. Experiments reveal that these datasets are crucial for pre-training OVD models, allowing them to accurately identify objects in aerial imagery.
The team further enhanced performance by proposing a novel Cross-Attention Gated Enhancement (CAGE) module and integrating it into the YOLO-World-v2 architecture. This module uses a dual-path design, employing cross-attention and a FiLM layer to precisely ground semantic understanding and enhance visual features. Comparative tests on the VisDrone dataset show that the proposed model outperforms existing real-time OVD methods, with particularly strong gains in challenging UAV scenarios involving small objects and occlusions. The approach effectively bridges the domain gap between ground-level training data and UAV imagery: models pre-trained on the new datasets demonstrably surpass those trained solely on standard datasets, unlocking practical applications in areas such as intelligent surveillance, robotic vision, and autonomous driving.
Experiments on both VisDrone and SIMD datasets confirm the effectiveness of this approach, with substantial gains in mean Average Precision (mAP) achieved when using the new datasets for pre-training. Ablation studies demonstrate that UAVDE-2M provides the primary performance boost, while UAVCAP-15K offers complementary information, further enhancing detection capabilities. The results highlight the importance of domain-specific data for achieving robust performance in UAV-based object detection and remote sensing applications. Future work could explore the application of these datasets to other object detection models and investigate their potential for transfer learning to related remote sensing tasks.
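For reference, mAP figures like those reported above are typically produced with a COCO-style evaluation loop. The sketch below uses pycocotools; both file names are placeholders, and VisDrone ground truth would first need conversion to COCO-format JSON.

```python
# Minimal COCO-style mAP evaluation sketch with pycocotools.
# Both file names are placeholders; VisDrone annotations must be
# converted to COCO JSON before this runs.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("visdrone_val_coco.json")        # ground-truth annotations
coco_dt = coco_gt.loadRes("detections.json")    # model predictions to score
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints the AP/AR table, including mAP@[.5:.95]
```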
👉 More information
🗞 Cross-Modal Enhancement and Benchmark for UAV-based Open-Vocabulary Object Detection
🧠 ArXiv: https://arxiv.org/abs/2509.06011
