Speech to Image

Speech-to-image generation aims to create realistic images directly from spoken descriptions, bridging the auditory and visual modalities. Current research focuses on improving both the efficiency and the quality of these models. Two directions stand out: single-stage architectures that avoid the limitations of multi-stage approaches, and pre-trained vision-language models that are used to strengthen image and speech understanding. Together, these advances are improving the accuracy and speed of image generation, with applications ranging from accessibility tools for visually impaired users to creative content generation.

Papers