Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [2210.16428]