Speech Text Manifold Mixup

Speech Text Manifold Mixup (STMM) techniques aim to improve the performance of models that process both speech and text, particularly when labeled data is scarce or the two modalities differ substantially in their representations. Current research adapts mixup methods, which create synthetic training examples by interpolating existing samples, to bridge this cross-modal gap, often combining them with self-learning or semi-supervised training and leveraging pre-trained language models. These techniques show promise for improving the robustness and generalization of speech-to-text translation and other cross-modal tasks, potentially leading to more accurate and efficient natural language processing systems.
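As a rough illustration of the interpolation idea these methods adapt, the sketch below mixes paired speech and text embeddings with a Beta-distributed coefficient, the core mixup operation. The function name, embedding shapes, and `alpha` value are illustrative assumptions, not any specific paper's method; real STMM systems typically align the two modalities (e.g. at the frame or token level) before mixing, and apply the same coefficient to the training targets.

```python
import numpy as np

def manifold_mixup(speech_emb, text_emb, alpha=0.5, seed=None):
    """Convex combination of two modality embeddings (generic mixup sketch).

    Draws lambda ~ Beta(alpha, alpha) and interpolates the paired
    representations; the same lambda would also be used to mix the
    corresponding training labels. Assumes the embeddings are already
    aligned and share the same shape.
    """
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)
    mixed = lam * speech_emb + (1.0 - lam) * text_emb
    return mixed, lam

# Toy example: aligned speech and text embeddings of shape (T, D).
speech = np.random.default_rng(1).standard_normal((10, 256))
text = np.random.default_rng(2).standard_normal((10, 256))
mixed, lam = manifold_mixup(speech, text, alpha=0.5, seed=0)
print(mixed.shape, 0.0 < lam < 1.0)
```

Smaller `alpha` values push lambda toward 0 or 1 (mixed samples stay close to one modality), while larger values yield more even blends, a common knob when tuning how aggressively the modalities are interpolated.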

Papers