Paper ID: 2503.08377 • Published Mar 11, 2025
Layton: Latent Consistency Tokenizer for 1024-pixel Image Reconstruction and Generation by 256 Tokens
Qingsong Xie, Zhao Zhang, Zhe Huang, Yanhao Zhang, Haonan Lu, Zhenyu Yang
OPPO AI Center•ByteDance•Tsinghua University
Image tokenization has significantly advanced visual generation and
multimodal modeling, particularly when paired with autoregressive models.
However, current methods face challenges in balancing efficiency and fidelity:
high-resolution image reconstruction either requires an excessive number of
tokens or compromises critical details through token reduction. To resolve
this, we propose Latent Consistency Tokenizer (Layton) that bridges discrete
visual tokens with the compact latent space of pre-trained Latent Diffusion
Models (LDMs), enabling efficient representation of 1024×1024 images using only
256 tokens, a 16× compression over VQGAN. Layton integrates a transformer
encoder, a quantized codebook, and a latent consistency decoder. Directly
applying an LDM as the decoder introduces color and brightness discrepancies,
so we convert it into a latent consistency decoder, reducing multi-step
sampling to 1-2 steps and enabling direct pixel-level supervision.
Experiments demonstrate Layton's superiority in high-fidelity reconstruction,
achieving a reconstruction Fréchet Inception Distance of 10.8 on the
MSCOCO-2017 5K benchmark for 1024×1024 image reconstruction. We also extend
Layton to an autoregressive text-to-image generation model, LaytonGen, which
achieves a score of 0.73 on the GenEval benchmark, surpassing current
state-of-the-art methods. The code and model will be released.
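The quantized-codebook step the abstract describes — mapping each continuous encoder latent to its nearest entry in a learned codebook to obtain discrete tokens — can be sketched as follows. This is an illustrative vector-quantization sketch, not the paper's implementation; the function name, shapes, and codebook size are assumptions.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    latents:  (N, D) array of encoder outputs
    codebook: (K, D) array of learned code vectors
    Returns (indices, quantized): discrete token ids and the code
    vectors that would be passed on to the decoder.
    """
    # Squared L2 distance between every latent and every code vector.
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)      # discrete token ids, shape (N,)
    quantized = codebook[indices]    # (N, D) quantized latents
    return indices, quantized

# Toy example: 256 tokens (one 1024x1024 image in Layton's setting)
# drawn from a small illustrative codebook of 8-dim codes.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))
latents = rng.normal(size=(256, 8))
tokens, quantized = quantize(latents, codebook)
```

In Layton these 256 token ids are what the autoregressive model (LaytonGen) predicts, while the quantized latents are decoded back to pixels by the latent consistency decoder in 1-2 steps.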