Diverse Video Captioning by Adaptive Spatio-temporal Attention [2208.09266]