UTRBERT-4mer

A BERT-base language model pre-trained on human 3' UTR sequences using 4-mer tokenization. Part of the 3UTRBERT model family introduced in Yang et al. (2024).

Architecture

Parameter            Value
Layers               12
Attention heads      12
Embedding dimension  768
Intermediate size    3072
Vocabulary size      261 (5 special tokens + RNA 4-mers)
Positional encoding  Learned absolute (BERT-style)
Architecture         BERT-base
Max sequence length  512 tokens (~515 nucleotides for 4-mer)

Tokenization: the tokenizer converts T to U in the raw RNA (or DNA) sequence, then splits it into overlapping 4-mers (stride 1), so a sequence of L nucleotides yields L-3 4-mer tokens. It then prepends [CLS] and appends [SEP], giving L-1 tokens in total.
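The tokenization rule can be illustrated with a few lines of plain Python. This is a minimal sketch of the scheme as described here, not the tokenizer code shipped with the model: the helper names (kmer_tokenize, add_special_tokens, max_nucleotides) are hypothetical, and whether the 512-token limit counts [CLS] and [SEP] is treated as an open assumption, so the length helper takes the number of special tokens as a parameter.

```python
def kmer_tokenize(sequence: str, k: int = 4) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1)."""
    # DNA input is mapped to RNA by replacing T with U, as described above.
    rna = sequence.replace("T", "U")
    if len(rna) < k:
        return []  # too short to form a single k-mer
    return [rna[i:i + k] for i in range(len(rna) - k + 1)]


def add_special_tokens(kmers: list[str]) -> list[str]:
    """Prepend [CLS] and append [SEP] around the k-mer tokens."""
    return ["[CLS]", *kmers, "[SEP]"]


def max_nucleotides(max_tokens: int = 512, k: int = 4, n_special: int = 2) -> int:
    """Longest input (in nucleotides) whose tokens fit in max_tokens.

    n_special is the number of special tokens assumed to count toward
    the limit (hypothetical parameter; 2 for [CLS] + [SEP], 0 for none).
    """
    return (max_tokens - n_special) + k - 1


tokens = add_special_tokens(kmer_tokenize("ATGCATGC"))
print(tokens)                         # 5 overlapping 4-mers between [CLS] and [SEP]
print(max_nucleotides(n_special=0))   # limit if only 4-mer tokens are counted
print(max_nucleotides(n_special=2))   # limit if [CLS] and [SEP] are counted too
```

If only the 4-mer tokens count toward the limit, 512 tokens correspond to 515 nucleotides; if [CLS] and [SEP] count as well, the figure is 513.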

Pretraining

  • Objective: Masked Language Modeling (MLM) on 4-mer tokens
  • Data: Human 3' UTR sequences
  • Source checkpoint: 4-new-12w-0/pytorch_model.bin from figshare article 22851119

Checkpoint selection

The only publicly released pre-trained checkpoint for the 4-mer variant is 4-new-12w-0.

Parity Verification

Hidden-state representations match the original BertForMaskedLM implementation exactly (max abs diff = 0.00) at all 13 representation levels (embedding output + 12 transformer layers). Verification was run on GPU with PyTorch 2.7 / CUDA 12.6. The SDPA attention path was also checked against eager attention (max diff < 2e-5).

Related Models

See the full UTRBERT collection.

Model         k-mer  Vocab size  Notes
UTRBERT-3mer  3      69
UTRBERT-4mer  4      261         This model
UTRBERT-5mer  5      1029
UTRBERT-6mer  6      4101
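The vocabulary sizes above follow from counting all k-mers over a four-letter RNA alphabet and adding the special tokens. The sketch below reproduces the numbers; it assumes that every model in the family uses the same five special tokens stated for the 4-mer model, which is consistent with the listed sizes but not confirmed here for the other variants. The function name vocab_size is hypothetical.

```python
ALPHABET = "ACGU"        # RNA alphabet after T->U conversion
N_SPECIAL_TOKENS = 5     # stated for the 4-mer model; assumed for the others


def vocab_size(k: int) -> int:
    """Number of distinct k-mers over ALPHABET plus the special tokens."""
    return len(ALPHABET) ** k + N_SPECIAL_TOKENS


for k in (3, 4, 5, 6):
    print(f"UTRBERT-{k}mer: {vocab_size(k)}")
```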

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/UTRBERT-4mer", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/UTRBERT-4mer")
model.eval()

sequences = ["AUGCAUGCAUGCAUGCAUGC", "GCGCGCGCGCGCGCGCGCGC"]
enc = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True, max_length=512)

with torch.no_grad():
    out = model(**enc)

cls_emb   = out.last_hidden_state[:, 0, :]   # (batch, 768) -- CLS token
token_emb = out.last_hidden_state             # (batch, seq_len, 768)

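# Intermediate layers: hidden_states[0] is the embedding output, hidden_states[i] the output of layer i
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer6_emb = out_all.hidden_states[6]         # (batch, seq_len, 768) -- output of layer 6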

Fine-tuning

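Standard HF conventions apply. For sequence-level tasks, use the CLS token embedding as input to a classification or regression head. Because the checkpoint contains only the pre-trained encoder, the classification head loaded below is newly initialized and must be trained on the downstream task.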

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "Taykhoom/UTRBERT-4mer",
    num_labels=2,
)

Implementation Notes

This is a minimal HF port using standard BertModel with no custom modeling code. The original checkpoint (BertForMaskedLM) was converted by stripping the bert. prefix and dropping the cls.* MLM head. trust_remote_code=True is required only for the tokenizer (k-mer splitting), not for the model.
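The key renaming described above can be sketched as follows. This is an illustration of the stated rule (strip the bert. prefix, drop cls.* entries), not the actual conversion script: the function name convert_state_dict_keys is hypothetical, and the example uses placeholder strings in place of tensors so it runs without loading the checkpoint.

```python
def convert_state_dict_keys(state_dict: dict) -> dict:
    """Map BertForMaskedLM parameter names to plain BertModel names."""
    converted = {}
    for name, value in state_dict.items():
        if name.startswith("cls."):
            continue  # MLM head parameters are not part of BertModel
        if name.startswith("bert."):
            name = name[len("bert."):]  # strip the wrapper prefix
        converted[name] = value
    return converted


# Placeholder values stand in for the real weight tensors.
example = {
    "bert.embeddings.word_embeddings.weight": "tensor-0",
    "bert.encoder.layer.0.attention.self.query.weight": "tensor-1",
    "cls.predictions.bias": "tensor-2",
}
print(sorted(convert_state_dict_keys(example)))
```

With a real checkpoint, the same function would be applied to the dictionary returned by torch.load on pytorch_model.bin before passing the result to BertModel.load_state_dict.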

Citation

@article{yang2024_utrbert,
  title   = {Deciphering 3'{UTR} Mediated Gene Regulation Using Interpretable Deep Representation Learning},
  author  = {Yang, Yuning and Li, Gen and Pang, Kuan and Cao, Wuxinhao and Zhang, Zhaolei and Li, Xiangtao},
  journal = {Advanced Science},
  volume  = {11},
  number  = {39},
  pages   = {e2407013},
  year    = {2024},
  doi     = {10.1002/advs.202407013}
}

Credits

Original model and code by Yang et al. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

MIT, following the original repository.
