Lip to Speech

Lip-to-speech (LTS) research aims to synthesize natural-sounding speech directly from silent video of lip movements, overcoming the inherent ambiguity of lipreading (many distinct sounds share the same lip shape). Current efforts focus on improving speech quality and intelligibility by incorporating text-based information (e.g., predictions from lipreading models) to disambiguate homophones and by modeling diverse speech styles, typically with deep neural networks and, increasingly, diffusion models. The technology holds significant promise for assistive applications, such as restoring a voice to people with speech impairments, and deepens our understanding of audio-visual speech processing more broadly.
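To make the typical pipeline concrete, here is a minimal PyTorch sketch: a 3D-convolutional front-end encodes lip motion, a recurrent model tracks its dynamics over time, and a linear head predicts mel-spectrogram frames. All module names and dimensions are illustrative assumptions, not any particular published model; a real system would add text conditioning, upsample temporally (mel frames usually outnumber video frames), and render waveforms with a neural vocoder.

```python
import torch
import torch.nn as nn

class LipToSpeechSketch(nn.Module):
    """Hypothetical lip-to-speech model: silent lip video in, mel-spectrogram out."""

    def __init__(self, mel_bins=80, hidden=256):
        super().__init__()
        # Visual front-end: 3D convolutions capture lip shape and motion across frames.
        self.visual_frontend = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.Conv3d(64, hidden, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool spatial dims, keep the time axis
        )
        # Temporal model: maps per-frame visual features to a speech representation.
        self.temporal = nn.GRU(hidden, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Decoder head: predicts mel frames; a vocoder would turn these into audio.
        self.mel_head = nn.Linear(2 * hidden, mel_bins)

    def forward(self, video):
        # video: (batch, channels=3, frames, height, width)
        feats = self.visual_frontend(video)    # (B, hidden, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1)  # (B, hidden, T)
        feats = feats.transpose(1, 2)          # (B, T, hidden)
        out, _ = self.temporal(feats)          # (B, T, 2*hidden)
        return self.mel_head(out)              # (B, T, mel_bins)

model = LipToSpeechSketch()
silent_clip = torch.randn(1, 3, 75, 96, 96)    # 75 frames of a 96x96 mouth crop
mel = model(silent_clip)
print(mel.shape)                               # torch.Size([1, 75, 80])
```

For simplicity the sketch emits one mel frame per video frame; published systems instead supervise against ground-truth spectrograms at audio frame rate and score outputs with intelligibility metrics such as word error rate from an off-the-shelf recognizer.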
