Paper ID: 2410.19825

Automating Video Thumbnails Selection and Generation with Multimodal and Multistage Analysis

Elia Fantini

This thesis presents an innovative approach to automate video thumbnail selection for traditional broadcast content. Our methodology establishes stringent criteria for diverse, representative, and aesthetically pleasing thumbnails, considering factors like logo placement space, incorporation of vertical aspect ratios, and accurate recognition of facial identities and emotions. We introduce a sophisticated multistage pipeline that can select candidate frames or generate novel images by blending video elements or using diffusion models. The pipeline incorporates state-of-the-art models for various tasks, including downsampling, redundancy reduction, automated cropping, face recognition, closed-eye and emotion detection, shot scale and aesthetic prediction, segmentation, matting, and harmonization. It also leverages large language models and visual transformers for semantic consistency. A GUI tool facilitates rapid navigation of the pipeline's output. To evaluate our method, we conducted comprehensive experiments. In a study of 69 videos, 53.6% of our proposed sets included thumbnails chosen by professional designers, with 73.9% containing similar images. A survey of 82 participants showed a 45.77% preference for our method, compared to 37.99% for manually chosen thumbnails and 16.36% for an alternative method. Professional designers reported a 3.57-fold increase in valid candidates compared to the alternative method, confirming that our approach meets established criteria. In conclusion, our findings affirm that the proposed method accelerates thumbnail creation while maintaining high-quality standards and fostering greater user engagement.

Submitted: Oct 18, 2024