A Self-Adjusting Fusion Representation Learning Model for Unaligned Text-Audio Sequences [2212.11772]