Image to Video

Image-to-video (I2V) generation aims to create video sequences from a single input image, typically with text prompts for additional control. Current research relies heavily on diffusion models, adapting architectures such as UNets and incorporating techniques like classifier-free guidance and various control signals (e.g., pose, sketch, motion reference videos) to improve realism, fidelity, and controllability. The field matters for its applications in content creation, video editing, and animation, and it is driving advances in both model architectures and evaluation metrics that better align with human perception of video quality and motion. The development of open-source models and standardized benchmarks is another key focus, fostering wider accessibility and more rigorous comparisons.
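
To make the classifier-free guidance step mentioned above concrete, here is a minimal PyTorch sketch. The denoiser, conditioning tensors, and tensor shapes are hypothetical placeholders for illustration, not the API of any particular I2V model: real systems use large UNet or transformer denoisers conditioned on image (and text) embeddings.

```python
import torch

def cfg_noise_pred(denoiser, latents, t, cond, null_cond, guidance_scale=3.0):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the image/text-conditional one."""
    eps_cond = denoiser(latents, t, cond)        # conditioned on the input image (and prompt)
    eps_uncond = denoiser(latents, t, null_cond) # conditioning dropped ("null" embedding)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy usage with a stand-in denoiser over video latents of shape
# (batch, frames, channels, height, width); all shapes are illustrative.
denoiser = lambda x, t, c: 0.1 * x + 0.01 * c.mean()
latents = torch.randn(1, 14, 4, 32, 32)
cond = torch.randn(1, 77, 768)        # e.g. an image/text embedding (hypothetical shape)
null_cond = torch.zeros_like(cond)    # "null" conditioning used for the unconditional pass
eps = cfg_noise_pred(denoiser, latents, torch.tensor(500), cond, null_cond)
print(eps.shape)  # torch.Size([1, 14, 4, 32, 32])
```

Higher guidance scales push samples closer to the conditioning (sharper adherence to the input image and prompt) at the cost of diversity and, in video, sometimes temporal artifacts, which is why many I2V pipelines tune or schedule this scale.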

Papers