Paper ID: 2003.08162
3D Crowd Counting via Geometric Attention-guided Multi-View Fusion
Qi Zhang, Antoni B. Chan
Recently multi-view crowd counting using deep neural networks has been proposed to enable counting in large and wide scenes using multiple cameras. The current methods project the camera-view features to the average-height plane of the 3D world, and then fuse the projected multi-view features to predict a 2D scene-level density map on the ground (i.e., birds-eye view). Unlike the previous research, we consider the variable height of the people in the 3D world and propose to solve the multi-view crowd counting task through 3D feature fusion with 3D scene-level density maps, instead of the 2D density map on the ground plane. Compared to 2D fusion, the 3D fusion extracts more information of the people along the z-dimension (height), which helps to address the scale variations across multiple views. The 3D density maps still preserve the 2D density maps property that the sum is the count, while also providing 3D information about the crowd density. Furthermore, instead of using the standard method of copying the features along the view ray in the 2D-to-3D projection, we propose an attention module based on a height estimation network, which forces each 2D pixel to be projected to one 3D voxel along the view ray. We also explore the projection consistency among the 3D prediction and the ground truth in the 2D views to further enhance the counting performance. The proposed method is tested on the synthetic and real-world multiview counting datasets and achieves better or comparable counting performance to the state-of-the-art.
Submitted: Mar 18, 2020