TextME
TextME: Text-only Training for Modality Expansion (arXiv 2026)
Official projection checkpoints and offset vectors for TextME, a text-only modality expansion framework that projects diverse modalities into LLM embedding space without paired cross-modal data.
TextME trains lightweight projection heads (2-layer MLP, ~10M params each) to map pretrained encoder embeddings into a unified Qwen3-Embedding-4B anchor space (2560-dim). Training uses only text descriptions — no paired multimodal data is needed.
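The projection head described above can be sketched as follows. This is a minimal illustration, not the released architecture: the hidden width, layer ordering, and placement of BatchNorm/GELU are assumptions chosen to roughly match the stated ~10M parameter budget (see the training-details table for the "2-layer MLP with GELU, BatchNorm" description).

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Hypothetical 2-layer MLP projection head.

    Maps a frozen encoder's embedding (e.g. CLIP, 1024-dim) into the
    2560-dim anchor space. Hidden width is illustrative, not official.
    """
    def __init__(self, in_dim: int, out_dim: int = 2560, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

head = ProjectionHead(in_dim=1024)   # e.g. CLIP image embeddings
x = torch.randn(4, 1024)             # a toy batch of encoder outputs
y = head(x)
print(tuple(y.shape))                # (4, 2560)
```

With these illustrative dims the head has roughly 7M parameters, in the same ballpark as the ~10M figure above.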
```
├── projections/
│   ├── languagebind/                        # Source text encoder projections (per-domain)
│   │   ├── languagebind_coco.pt             # Image domain (59M)
│   │   ├── languagebind_audiocaps.pt        # Audio domain (59M)
│   │   ├── languagebind_objaverse.pt        # 3D domain (59M)
│   │   ├── languagebind_chestxray.pt        # X-ray domain (59M)
│   │   ├── languagebind_pubchem.pt          # Molecule domain (59M)
│   │   ├── languagebind_remoteclip_ret3.pt  # Remote sensing domain (59M)
│   │   └── languagebind_internvid.pt        # Video domain (59M)
│   └── target_encoders/                     # Target modality encoder projections
│       ├── clip.pt                          # CLIP → image (85M)
│       ├── viclip.pt                        # ViCLIP → video (59M)
│       ├── clap.pt                          # CLAP → audio (37M)
│       ├── uni3d.pt                         # Uni3D → 3D point cloud (85M)
│       ├── cxr_clip.pt                      # CXR-CLIP → X-ray (37M)
│       ├── moleculestm.pt                   # MoleculeSTM → molecule (17M)
│       ├── remoteclip.pt                    # RemoteCLIP → remote sensing (59M)
│       └── languagebind.pt                  # LanguageBind → multi-modal (59M)
└── offsets/                                 # Precomputed modality-gap offset vectors
    ├── clip_coco/
    ├── clap_audiocaps/
    ├── uni3d_objaverse/
    ├── cxr_clip_chestxray/
    ├── moleculestm_pubchem/
    ├── remoteclip_ret3/
    ├── languagebind_coco/
    └── viclip_internvid/
```
| Modality | Source Encoder | Target Encoder | Embedding Dim |
|---|---|---|---|
| Image | LanguageBind (768) | CLIP (1024) | → 2560 |
| Video | LanguageBind (768) | ViCLIP (768) | → 2560 |
| Audio | LanguageBind (768) | CLAP (512) | → 2560 |
| 3D | LanguageBind (768) | Uni3D (1024) | → 2560 |
| X-ray | LanguageBind (768) | CXR-CLIP (512) | → 2560 |
| Molecule | LanguageBind (768) | MoleculeSTM (256) | → 2560 |
| Remote Sensing | LanguageBind (768) | RemoteCLIP (768) | → 2560 |
```python
from huggingface_hub import hf_hub_download
import torch

# Download a projection checkpoint
ckpt_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="projections/target_encoders/clip.pt",
)

# Load the checkpoint on CPU
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Download the precomputed offset vectors
offset_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="offsets/clip_coco/text_embed_mean.pkl",
)
```
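How the offset vectors are consumed is not spelled out here, so the following is only a sketch under an assumption: the offset file is taken to hold a mean text embedding, and modality embeddings are translated by the difference of means to close the modality gap (mean-shift correction). The function name and the exact recipe are hypothetical; consult the GitHub repository for the official procedure.

```python
import numpy as np

def apply_offset(embeds: np.ndarray,
                 src_mean: np.ndarray,
                 tgt_mean: np.ndarray) -> np.ndarray:
    """Hypothetical mean-shift gap correction: recenter embeddings from
    the source modality's mean onto the target (text) mean."""
    return embeds - src_mean + tgt_mean

# Toy example with random vectors standing in for real embeddings
rng = np.random.default_rng(0)
img_embeds = rng.standard_normal((2, 2560))   # stand-in image embeddings
img_mean = img_embeds.mean(axis=0)
txt_mean = rng.standard_normal(2560)          # stand-in for text_embed_mean.pkl
shifted = apply_offset(img_embeds, img_mean, txt_mean)
print(shifted.shape)  # (2, 2560)
```

After the shift, the batch mean of the image embeddings coincides with the text mean, which is the intended effect of a mean-shift correction.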
See the GitHub repository for full evaluation and training code.
| Parameter | Value |
|---|---|
| Anchor Space | Qwen3-Embedding-4B (2560-dim) |
| Projection | 2-layer MLP with GELU, BatchNorm |
| Batch size | 512 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning rate | 5×10⁻⁴ (target) / 5×10⁻² (LanguageBind) |
| Epochs | 50 |
| Temperature | 0.07 |
| Training data | ~100K text descriptions per modality |
| Offset samples | 5,000 per modality |
| GPU | Single NVIDIA A6000 (48GB) |
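The temperature entry above (0.07) suggests a temperature-scaled contrastive objective between projected embeddings and anchor-space text embeddings. As a sketch only — the exact TextME loss is not specified in this card, so this symmetric InfoNCE formulation is an assumption:

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(a: torch.Tensor, b: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings: each row of
    `a` should match the same-index row of `b` against all other rows."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # cosine similarities, scaled
    labels = torch.arange(a.size(0))          # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

torch.manual_seed(0)
anchor = torch.randn(8, 2560)                    # stand-in anchor (text) embeddings
proj = anchor + 0.1 * torch.randn(8, 2560)       # stand-in projected embeddings
loss = symmetric_info_nce(proj, anchor)
```

In a real run, `loss` would be minimized with the AdamW settings from the table over the projection head's parameters only, with both encoders frozen.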
| Image (Flickr) | Video (MSVD) | Audio (ACaps) | Molecule (Drug) |
|---|---|---|---|
| 51.66 (PPR 66.5%) | 45.82 (PPR 89.7%) | 15.35 (PPR 68.3%) | 34.75 (PPR 43.9%) |
| 3D (MN40) | 3D (Scan) | Audio (ESC) | X-ray (RSNA) |
|---|---|---|---|
| 70.86 (PPR 104.6%) | 42.15 (PPR 99.9%) | 77.25 (PPR 90.7%) | 46.59 (PPR 88.5%) |
```bibtex
@article{hong2026textme,
  title={TextME: Bridging Unseen Modalities Through Text Descriptions},
  author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk},
  journal={arXiv preprint arXiv:2602.03098},
  year={2026}
}
```
MIT License