Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation [2406.10970]