Text-Only Training

Text-only training aims to build machine learning models for tasks that traditionally require paired image-text or audio-speech data, while using only text data during training. Current research focuses on adapting pre-trained models such as CLIP and transformer language models to tasks like image captioning, visual storytelling, and audio-to-intent classification, often through training strategies such as noise injection that help bridge the gap between text embeddings and the embeddings of other modalities. Because no paired multimodal data needs to be collected, this approach significantly reduces data acquisition costs and enables model development in low-resource settings, with applications in fields including medical image analysis, speech recognition, and natural language understanding.
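
To make the noise-injection idea concrete, the sketch below shows one common pattern from text-only captioning work: a frozen text encoder (e.g., CLIP's text tower) embeds a caption, Gaussian noise is added to the embedding so the decoder becomes robust to the small offset between text and image embeddings, and a caption decoder is trained to reconstruct the text. This is a minimal, hypothetical PyTorch illustration, not any specific paper's implementation; the toy GRU decoder, the random tensors standing in for CLIP features, and all dimensions and hyperparameters are placeholder assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of text-only training with noise injection.
# Assumptions: random tensors stand in for frozen text-encoder features
# (e.g., CLIP text embeddings); the decoder is a toy GRU captioner.

EMB_DIM, VOCAB, HID = 512, 1000, 256
NOISE_STD = 0.1  # strength of injected Gaussian noise (hyperparameter)

class ToyCaptionDecoder(nn.Module):
    """Decodes a single embedding vector into a token sequence (teacher forcing)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMB_DIM, HID)
        self.embed = nn.Embedding(VOCAB, HID)
        self.gru = nn.GRU(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, text_emb, tokens):
        h0 = torch.tanh(self.proj(text_emb)).unsqueeze(0)  # init hidden state from embedding
        x = self.embed(tokens[:, :-1])                     # shifted caption tokens as inputs
        y, _ = self.gru(x, h0)
        return self.out(y)                                 # next-token logits

decoder = ToyCaptionDecoder()
opt = torch.optim.Adam(decoder.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(3):                                      # toy training loop on text only
    tokens = torch.randint(0, VOCAB, (8, 12))              # placeholder caption token ids
    with torch.no_grad():
        text_emb = torch.randn(8, EMB_DIM)                 # stand-in for frozen text features

    # Noise injection: perturb the text embedding so the decoder also tolerates
    # image embeddings at inference time (text and image features from a joint
    # encoder are close but not identical -- the "modality gap").
    noisy_emb = text_emb + NOISE_STD * torch.randn_like(text_emb)

    logits = decoder(noisy_emb, tokens)
    loss = loss_fn(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

At inference time the decoder would be fed image embeddings from the same joint encoder instead of text embeddings, which is what allows captioning without ever training on paired image-text data.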

Papers