---
license: apache-2.0
pipeline_tag: feature-extraction
---

# UniTok: A Unified Tokenizer for Visual Generation and Understanding

This repository contains UniTok, a unified visual tokenizer for both image generation and understanding tasks, as presented in [UniTok: A Unified Tokenizer for Visual Generation and Understanding](https://hf.co/papers/2502.20321).

Project Page: https://foundationvision.github.io/UniTok/

Code: https://github.com/FoundationVision/UniTok

UniTok encodes fine-grained details for generation and captures high-level semantics for understanding. It is compatible with autoregressive generative models (e.g., LlamaGen), multimodal understanding models (e.g., LLaVA), and unified MLLMs (e.g., Chameleon and Liquid).

Built upon UniTok, we construct an MLLM capable of both multimodal generation and understanding, which sets a new state of the art among unified autoregressive MLLMs. The weights of our MLLM will be released soon.

## Performance

<table>
<thead>
<tr>
<th>Method</th>
<th>#Tokens</th>
<th>rFID ↓</th>
<th>ImageNet zero-shot accuracy ↑</th>
</tr>
</thead>
<tbody>
<tr>
<td colspan="4"><i>VQVAE Model</i></td>
</tr>
<tr align="center">
<td>VQ-GAN</td>
<td>256</td>
<td>4.98</td>
<td>--</td>
</tr>
<tr align="center">
<td>RQ-VAE</td>
<td>256</td>
<td>1.30</td>
<td>--</td>
</tr>
<tr align="center">
<td>VAR</td>
<td>680</td>
<td>0.90</td>
<td>--</td>
</tr>
<tr>
<td colspan="4"><i>CLIP Model</i></td>
</tr>
<tr align="center">
<td>CLIP</td>
<td>256</td>
<td>--</td>
<td>76.2</td>
</tr>
<tr align="center">
<td>SigLIP</td>
<td>256</td>
<td>--</td>
<td>80.5</td>
</tr>
<tr align="center">
<td>ViTamin</td>
<td>256</td>
<td>--</td>
<td>81.2</td>
</tr>
<tr>
<td colspan="4"><i>Unified Model</i></td>
</tr>
<tr align="center">
<td>TokenFlow †</td>
<td>680</td>
<td>1.37</td>
<td>--</td>
</tr>
<tr align="center">
<td>VILA-U †</td>
<td>256</td>
<td>1.80</td>
<td>73.3</td>
</tr>
<tr align="center">
<td>UniTok</td>
<td>256</td>
<td>0.39</td>
<td>70.5</td>
</tr>
<tr align="center">
<td>UniTok †</td>
<td>256</td>
<td>0.38</td>
<td>78.6</td>
</tr>
</tbody>
</table>

† indicates that the model is initialized from pretrained CLIP weights. Although CLIP initialization boosts ImageNet zero-shot accuracy (78.6 vs. 70.5 for UniTok), we find that random initialization leads to better downstream understanding performance. We therefore release the UniTok checkpoint that is trained from scratch.

## Model Weights

| Model | Res. | #Tokens | Code Shape | rFID | Checkpoint |
|:------------:|:----:|:------:|:-------------------------:|:----:|:------------:|
| UniTok-Large | 256 | 256 | 16 $\times$ 16 $\times$ 8 | 0.39 | [Download](https://huggingface.co/FoundationVision/UniTok/blob/main/unitok_tokenizer.pth) |
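
As a rough illustration of how the columns in this table relate to each other, the sketch below derives the token count and the spatial downsampling factor from the input resolution and the code shape. It assumes the code shape reads as height × width × codes per position; that reading, along with the names `CodeShape`, `num_codes`, and `downsample_factor`, is illustrative and not taken from the UniTok codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeShape:
    """Hypothetical helper describing a square grid of discrete codes."""

    height: int
    width: int
    depth: int  # assumed: number of codes stored at each spatial position

    @property
    def num_tokens(self) -> int:
        # One token per spatial position of the grid.
        return self.height * self.width

    @property
    def num_codes(self) -> int:
        # Total discrete indices needed to represent one image.
        return self.num_tokens * self.depth

    def downsample_factor(self, resolution: int) -> int:
        # Ratio between the input side length and the grid side length.
        if self.height != self.width or resolution % self.height != 0:
            raise ValueError("expected a square grid that divides the resolution")
        return resolution // self.height


# Values from the UniTok-Large row: Res. 256, code shape 16 x 16 x 8.
unitok_large = CodeShape(height=16, width=16, depth=8)
print(unitok_large.num_tokens)              # 256
print(unitok_large.num_codes)               # 2048
print(unitok_large.downsample_factor(256))  # 16
```

Under this reading, the 256 tokens in the `#Tokens` column correspond to the 16 × 16 grid, and each token carries 8 codes.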

## Usage

(... rest of README content ...)
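
For fetching the released checkpoint programmatically, a minimal sketch is given below. It is not the official usage from the code repository: it only parses the checkpoint link from the Model Weights table and hands the pieces to `hf_hub_download` from the `huggingface_hub` package. The helper names `parse_hub_url` and `download_checkpoint` are hypothetical.

```python
from urllib.parse import urlsplit

# Checkpoint link as listed in the Model Weights table.
CHECKPOINT_PAGE = (
    "https://huggingface.co/FoundationVision/UniTok/blob/main/unitok_tokenizer.pth"
)


def parse_hub_url(url: str) -> tuple[str, str, str]:
    """Split a Hub file URL into (repo_id, revision, filename)."""
    parts = urlsplit(url).path.strip("/").split("/")
    # Expected layout: <owner>/<repo>/<blob|resolve>/<revision>/<path...>
    if len(parts) < 5 or parts[2] not in ("blob", "resolve"):
        raise ValueError(f"not a Hub file URL: {url}")
    return "/".join(parts[:2]), parts[3], "/".join(parts[4:])


def download_checkpoint(url: str = CHECKPOINT_PAGE) -> str:
    """Download the file into the local Hub cache and return its path."""
    # Imported lazily so that parse_hub_url works without the dependency.
    from huggingface_hub import hf_hub_download

    repo_id, revision, filename = parse_hub_url(url)
    return hf_hub_download(repo_id=repo_id, filename=filename, revision=revision)


repo_id, revision, filename = parse_hub_url(CHECKPOINT_PAGE)
print(repo_id, revision, filename)
```

Calling `download_checkpoint()` requires network access and an installed `huggingface_hub`. How the downloaded weights are loaded into the tokenizer is defined in the code repository and is not covered by this sketch.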

## Citation

```bibtex
@article{unitok,
  title={UniTok: A Unified Tokenizer for Visual Generation and Understanding},
  author={Ma, Chuofan and Jiang, Yi and Wu, Junfeng and Yang, Jihan and Yu, Xin and Yuan, Zehuan and Peng, Bingyue and Qi, Xiaojuan},
  journal={arXiv preprint arXiv:2502.20321},
  year={2025}
}
```