TextME
TextME: Text-only Training for Modality Expansion (arXiv 2026)
Official projection checkpoints and offset vectors for TextME, a text-only modality expansion framework that projects diverse modalities into LLM embedding space without paired cross-modal data.
TextME trains lightweight projection heads (2-layer MLP, ~10M params each) to map pretrained encoder embeddings into a unified Qwen3-Embedding-4B anchor space (2560-dim). Training uses only text descriptions — no paired multimodal data is needed.
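The projection head described above can be sketched as follows. This is a minimal illustration, not the released architecture: the hidden width, layer ordering, and placement of BatchNorm/GELU are assumptions chosen to roughly match the stated ~10M parameter budget (see the training-details table for the "2-layer MLP with GELU, BatchNorm" description).

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Hypothetical 2-layer MLP projection head.

    Maps a frozen encoder's embedding (e.g. CLIP, 1024-dim) into the
    2560-dim anchor space. Hidden width is illustrative, not official.
    """
    def __init__(self, in_dim: int, out_dim: int = 2560, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

head = ProjectionHead(in_dim=1024)   # e.g. CLIP image embeddings
x = torch.randn(4, 1024)             # a toy batch of encoder outputs
y = head(x)
print(tuple(y.shape))                # (4, 2560)
```

With these illustrative dims the head has roughly 7M parameters, in the same ballpark as the ~10M figure above.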
```
├── projections/
│   ├── languagebind/                        # Source text encoder projections (per-domain)
│   │   ├── languagebind_coco.pt             # Image domain (59M)
│   │   ├── languagebind_audiocaps.pt        # Audio domain (59M)
│   │   ├── languagebind_objaverse.pt        # 3D domain (59M)
│   │   ├── languagebind_chestxray.pt        # X-ray domain (59M)
│   │   ├── languagebind_pubchem.pt          # Molecule domain (59M)
│   │   ├── languagebind_remoteclip_ret3.pt  # Remote sensing domain (59M)
│   │   └── languagebind_internvid.pt        # Video domain (59M)
│   └── target_encoders/                     # Target modality encoder projections
│       ├── clip.pt                          # CLIP → image (85M)
│       ├── viclip.pt                        # ViCLIP → video (59M)
│       ├── clap.pt                          # CLAP → audio (37M)
│       ├── uni3d.pt                         # Uni3D → 3D point cloud (85M)
│       ├── cxr_clip.pt                      # CXR-CLIP → X-ray (37M)
│       ├── moleculestm.pt                   # MoleculeSTM → molecule (17M)
│       ├── remoteclip.pt                    # RemoteCLIP → remote sensing (59M)
│       └── languagebind.pt                  # LanguageBind → multi-modal (59M)
└── offsets/                                 # Precomputed modality-gap offset vectors
    ├── clip_coco/
    ├── clap_audiocaps/
    ├── uni3d_objaverse/
    ├── cxr_clip_chestxray/
    ├── moleculestm_pubchem/
    ├── remoteclip_ret3/
    ├── languagebind_coco/
    └── viclip_internvid/
```
| Modality | Source Encoder | Target Encoder | Embedding Dim |
|---|---|---|---|
| Image | LanguageBind (768) | CLIP (1024) | → 2560 |
| Video | LanguageBind (768) | ViCLIP (768) | → 2560 |
| Audio | LanguageBind (768) | CLAP (512) | → 2560 |
| 3D | LanguageBind (768) | Uni3D (1024) | → 2560 |
| X-ray | LanguageBind (768) | CXR-CLIP (512) | → 2560 |
| Molecule | LanguageBind (768) | MoleculeSTM (256) | → 2560 |
| Remote Sensing | LanguageBind (768) | RemoteCLIP (768) | → 2560 |
```python
from huggingface_hub import hf_hub_download
import torch

# Download a projection checkpoint
ckpt_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="projections/target_encoders/clip.pt",
)

# Load the checkpoint on CPU
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Download the precomputed offset vectors
offset_path = hf_hub_download(
    repo_id="SoyeonHH/TextME",
    filename="offsets/clip_coco/text_embed_mean.pkl",
)
```
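How the offset vectors are consumed is not spelled out here, so the following is only a sketch under an assumption: the offset file is taken to hold a mean text embedding, and modality embeddings are translated by the difference of means to close the modality gap (mean-shift correction). The function name and the exact recipe are hypothetical; consult the GitHub repository for the official procedure.

```python
import numpy as np

def apply_offset(embeds: np.ndarray,
                 src_mean: np.ndarray,
                 tgt_mean: np.ndarray) -> np.ndarray:
    """Hypothetical mean-shift gap correction: recenter embeddings from
    the source modality's mean onto the target (text) mean."""
    return embeds - src_mean + tgt_mean

# Toy example with random vectors standing in for real embeddings
rng = np.random.default_rng(0)
img_embeds = rng.standard_normal((2, 2560))   # stand-in image embeddings
img_mean = img_embeds.mean(axis=0)
txt_mean = rng.standard_normal(2560)          # stand-in for text_embed_mean.pkl
shifted = apply_offset(img_embeds, img_mean, txt_mean)
print(shifted.shape)  # (2, 2560)
```

After the shift, the batch mean of the image embeddings coincides with the text mean, which is the intended effect of a mean-shift correction.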
See the GitHub repository for full evaluation and training code.
| Parameter | Value |
|---|---|
| Anchor Space | Qwen3-Embedding-4B (2560-dim) |
| Projection | 2-layer MLP with GELU, BatchNorm |
| Batch size | 512 |
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning rate | 5×10⁻⁴ (target) / 5×10⁻² (LanguageBind) |
| Epochs | 50 |
| Temperature | 0.07 |
| Training data | ~100K text descriptions per modality |
| Offset samples | 5,000 per modality |
| GPU | Single NVIDIA A6000 (48GB) |
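The temperature entry above (0.07) suggests a temperature-scaled contrastive objective between projected embeddings and anchor-space text embeddings. As a sketch only — the exact TextME loss is not specified in this card, so this symmetric InfoNCE formulation is an assumption:

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(a: torch.Tensor, b: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings: each row of
    `a` should match the same-index row of `b` against all other rows."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # cosine similarities, scaled
    labels = torch.arange(a.size(0))          # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

torch.manual_seed(0)
anchor = torch.randn(8, 2560)                    # stand-in anchor (text) embeddings
proj = anchor + 0.1 * torch.randn(8, 2560)       # stand-in projected embeddings
loss = symmetric_info_nce(proj, anchor)
```

In a real run, `loss` would be minimized with the AdamW settings from the table over the projection head's parameters only, with both encoders frozen.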
| Image (Flickr) | Video (MSVD) | Audio (ACaps) | Molecule (Drug) |
|---|---|---|---|
| 51.66 (PPR 66.5%) | 45.82 (PPR 89.7%) | 15.35 (PPR 68.3%) | 34.75 (PPR 43.9%) |
| 3D (MN40) | 3D (Scan) | Audio (ESC) | X-ray (RSNA) |
|---|---|---|---|
| 70.86 (PPR 104.6%) | 42.15 (PPR 99.9%) | 77.25 (PPR 90.7%) | 46.59 (PPR 88.5%) |
```bibtex
@article{hong2026textme,
  title={TextME: Bridging Unseen Modalities Through Text Descriptions},
  author={Hong, Soyeon and Kim, Jinchan and You, Jaegook and Choi, Seungtaek and Kwak, Suha and Cho, Hyunsouk},
  journal={arXiv preprint arXiv:2602.03098},
  year={2026}
}
```
MIT License