Learning Joint Representation of Human Motion and Language [2210.15187]