Researchers have created a network that integrates 3D LiDAR and 2D image data, aiming to enhance the detection of small objects. Robots and autonomous vehicles often utilize 3D point clouds from LiDAR sensors combined with camera images for 3D object detection. However, existing methods that merge these data types have shown limitations in detecting small objects accurately. A team from Japan has developed a network, DPPFA-Net, to address these challenges, particularly those caused by occlusion and noise in adverse weather conditions.
In the fields of robotics and autonomous vehicles, accurate perception of the surroundings is crucial. Most 3D object detection methods rely on LiDAR sensors to generate 3D point clouds. While effective, LiDAR is sensitive to noise, especially in challenging weather conditions such as rain.
To address this, researchers have explored multi-modal 3D object detection methods that combine 3D LiDAR data with 2D RGB images from standard cameras. While this fusion has improved 3D detection accuracy, accurately detecting small objects remains a challenge. The difficulty largely stems from aligning the semantic information extracted from the 2D and 3D data, a process complicated by calibration errors and occlusion.
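To see why alignment matters, consider how a LiDAR point is usually matched to an image pixel: the point is projected onto the image plane using the sensors' calibration. The sketch below illustrates this standard projection step in NumPy; the projection matrix and point coordinates are placeholder values for illustration only, not figures from the paper. Even a small calibration error at this step can map a point onto a pixel that belongs to a different or occluding object, which is especially harmful for small objects that cover only a few pixels.

```python
# Minimal sketch of the usual point-to-pixel alignment step: a 3D point is
# projected into the image with a 3 x 4 projection matrix that composes the
# camera intrinsics with the LiDAR-to-camera extrinsics (as in KITTI-style
# calibration). All numbers below are placeholders for illustration.
import numpy as np

def project_points_to_image(points: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Project N x 3 points into N x 2 pixel coordinates with a 3 x 4 matrix."""
    homo = np.hstack([points, np.ones((points.shape[0], 1))])  # homogeneous coords
    cam = homo @ proj.T                                         # (u*z, v*z, z) per point
    return cam[:, :2] / cam[:, 2:3]                             # perspective division

# Placeholder projection matrix: plain pinhole intrinsics, i.e. the points are
# assumed to already be expressed in the camera frame (z = depth, in metres).
P = np.array([[700.0,   0.0, 620.0, 0.0],
              [  0.0, 700.0, 190.0, 0.0],
              [  0.0,   0.0,   1.0, 0.0]])

pts = np.array([[5.0, 1.0, 20.0],     # a point 20 m ahead
                [2.0, -0.5, 8.0]])    # a point 8 m ahead
print(project_points_to_image(pts, P))   # pixel (u, v) for each point
```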
A research team, led by Professor Hiroyuki Tomiyama from Ritsumeikan University, Japan, has developed a new approach to enhance the accuracy and robustness of multi-modal 3D object detection. This method, named “Dynamic Point-Pixel Feature Alignment Network” (DPPFA-Net), was detailed in a paper published in the IEEE Internet of Things Journal on November 3, 2023.
The model consists of three newly designed modules: the Memory-based Point-Pixel Fusion (MPPF) module, the Deformable Point-Pixel Fusion (DPPF) module, and the Semantic Alignment Evaluator (SAE) module. The MPPF module enables feature interactions both within and across the 2D and 3D modalities. In it, the 2D image serves as a memory bank, which reduces the network’s learning difficulty and makes it more robust to noise in the 3D point cloud. The DPPF module performs interactions only at key pixel positions, selected through a sampling strategy, which allows high-resolution feature fusion at low computational cost. The SAE module ensures semantic alignment between the two representations during fusion.
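As a rough illustration of the memory-bank idea, the sketch below shows one common way point features can query image features through cross-attention in PyTorch. It is not the authors' implementation; the class name, feature dimensions, and the use of standard multi-head attention are assumptions made to keep the example self-contained.

```python
# Illustrative sketch (not the authors' code): image features act as a memory
# bank that LiDAR point features query via cross-attention.
import torch
import torch.nn as nn

class CrossModalMemoryFusion(nn.Module):
    def __init__(self, point_dim: int = 64, pixel_dim: int = 64, num_heads: int = 4):
        super().__init__()
        # Point features attend to pixel features (the "memory bank").
        self.attn = nn.MultiheadAttention(point_dim, num_heads,
                                          kdim=pixel_dim, vdim=pixel_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(point_dim)

    def forward(self, point_feats: torch.Tensor, pixel_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (B, N_points, point_dim); pixel_feats: (B, N_pixels, pixel_dim)
        fused, _ = self.attn(query=point_feats, key=pixel_feats, value=pixel_feats)
        # Residual connection keeps the original 3D features, so noisy or
        # missing image evidence degrades the fusion gracefully.
        return self.norm(point_feats + fused)

# Toy usage with random tensors standing in for backbone features.
fusion = CrossModalMemoryFusion()
points = torch.randn(2, 1024, 64)     # features of 1024 LiDAR points per sample
pixels = torch.randn(2, 4096, 64)     # features of a 64 x 64 image feature map
print(fusion(points, pixels).shape)   # torch.Size([2, 1024, 64])
```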
In tests using the KITTI Vision Benchmark, DPPFA-Net showed average precision improvements of up to 7.18% under various noise conditions. The researchers also created a new dataset with artificial multi-modal noise simulating rainfall to further evaluate their model. This testing indicated that DPPFA-Net outperformed existing models in scenarios with severe occlusion and under different weather conditions. Prof. Tomiyama stated that their experiments demonstrate the network’s advanced capabilities.
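As a point of reference, this kind of multi-modal corruption is often simulated by perturbing both sensor streams directly, for example by dropping and jittering LiDAR points and adding pixel noise to the image. The sketch below shows such a generic scheme; it is an illustrative stand-in, not the noise model used by the researchers.

```python
# Generic multi-modal noise simulation (illustrative only): rain-like
# corruption is approximated by dropping LiDAR points, jittering the
# survivors, and adding Gaussian noise to the camera image.
import numpy as np

def add_multimodal_noise(points: np.ndarray, image: np.ndarray,
                         drop_prob: float = 0.1, jitter_std: float = 0.02,
                         pixel_std: float = 5.0, seed: int = 0):
    rng = np.random.default_rng(seed)
    keep = rng.random(points.shape[0]) > drop_prob                 # randomly drop points
    noisy_points = points[keep] + rng.normal(0.0, jitter_std, (keep.sum(), 3))
    noisy_image = np.clip(image + rng.normal(0.0, pixel_std, image.shape), 0, 255)
    return noisy_points, noisy_image.astype(image.dtype)

pts = np.random.rand(2048, 3) * 50.0                               # toy point cloud (metres)
img = np.random.randint(0, 256, (375, 1242, 3), dtype=np.uint8)    # toy KITTI-sized image
noisy_pts, noisy_img = add_multimodal_noise(pts, img)
print(noisy_pts.shape, noisy_img.shape)
```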
Accurate 3D object detection has potential applications in enhancing the safety and efficiency of self-driving cars and in improving how robots understand and adapt to their working environments. It could also reduce the cost of manual annotation for deep-learning perception systems by pre-labeling raw data.