RNABERT

A small BERT-style RNA language model pretrained on non-coding RNA sequences from Rfam 14.3, using Masked Language Modeling (MLM) and Structural Alignment Learning (SAL). Designed for RNA clustering and structural alignment tasks.

Architecture

Parameter	Value
Layers	6
Attention heads	12
Embedding dimension	120
FFN intermediate size	40
Vocabulary size	6 (PAD, MASK, A, U, G, C)
Positional encoding	Learned absolute
Architecture	Post-LN BERT encoder
Max sequence length	440
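
As a sanity check on the table above, the parameter count of a standard BERT encoder with these hyperparameters can be worked out by hand. The sketch below assumes the stock Hugging Face BertModel layout, including two token-type embeddings and a pooler layer, neither of which is listed in the table. `count_bert_params` is a hypothetical helper written for this illustration, not part of any library.

```python
def count_bert_params(hidden=120, ffn=40, layers=6, vocab=6, max_pos=440,
                      type_vocab=2, with_pooler=True):
    """Count weights and biases of a standard post-LN BERT encoder.

    type_vocab and with_pooler are assumptions about the stock BertModel
    layout; they are not stated in the architecture table.
    """
    layer_norm = 2 * hidden  # scale + bias
    # Token, position and token-type embeddings, followed by one LayerNorm
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Q, K, V and output projections (weights + biases), then LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Up-projection, down-projection, then LayerNorm
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + layers * (attention + feed_forward) + pooler


total = count_bert_params()
print(f"{total:,} parameters")  # 478,440 parameters
```

Under these assumptions the total is 478,440 parameters, and each of the 12 attention heads has dimension 120 / 12 = 10.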

Vocabulary:

Token	ID
`<pad>`	0
`<mask>`	1
A	2
U	3
G	4
C	5

No CLS or EOS tokens are added. Sequences are tokenized character-by-character; T is silently converted to U.
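
The tokenization rule described above can be sketched in a few lines of plain Python. This is an illustration of the vocabulary mapping only, not the tokenizer shipped with the model: upper-casing the input, raising on unknown characters, and right-padding are assumptions made here, and `encode` / `encode_batch` are hypothetical names.

```python
VOCAB = {"<pad>": 0, "<mask>": 1, "A": 2, "U": 3, "G": 4, "C": 5}


def encode(seq):
    """Map a nucleotide string to token IDs, one ID per character."""
    seq = seq.upper().replace("T", "U")  # upper-casing is an assumption; T becomes U
    return [VOCAB[base] for base in seq]  # KeyError on characters outside A/U/G/C


def encode_batch(seqs):
    """Encode several sequences and right-pad them to a common length."""
    ids = [encode(s) for s in seqs]
    width = max(len(x) for x in ids)
    input_ids = [x + [VOCAB["<pad>"]] * (width - len(x)) for x in ids]
    attention_mask = [[1] * len(x) + [0] * (width - len(x)) for x in ids]
    return input_ids, attention_mask


print(encode("ATGC"))  # [2, 3, 4, 5]
```

Because no CLS or EOS tokens are added, the number of token IDs equals the number of nucleotides in the input.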

Pretraining

Objective: Masked Language Modeling (MLM) + Structural Alignment Learning (SAL, a pairwise structural alignment contrastive objective)
Data: Rfam 14.3 non-coding RNA sequences (maximum length ~440 nt)
Source checkpoint: bert_mul_2.pth (distributed inside RNABERT_pretrained.pth zip)
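
For readers unfamiliar with the MLM objective, the following is a generic sketch of BERT-style masking over the token IDs used by this model. It is not the original RNABERT training code: the 15% masking rate is the common BERT default and is an assumption here, `mask_tokens` is a hypothetical helper, and the SAL objective is not covered.

```python
import random

PAD_ID, MASK_ID = 0, 1
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_tokens(input_ids, mask_prob=0.15, rng=None):
    """Return (masked_ids, labels) for one tokenized sequence.

    mask_prob is an assumed value, not taken from the RNABERT paper.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in input_ids:
        if tok != PAD_ID and rng.random() < mask_prob:
            masked.append(MASK_ID)       # hide the nucleotide from the model
            labels.append(tok)           # the model must recover the original ID
        else:
            masked.append(tok)           # token is left visible
            labels.append(IGNORE_INDEX)  # position does not contribute to the loss
    return masked, labels
```

Padding positions are never masked, so they never contribute to the loss.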

Checkpoint selection

The original repository publishes a single pretrained checkpoint; this model is converted from it.

Parity Verification

Hidden-state representations match the original implementation to within floating-point tolerance (max abs diff = 2.2e-6) at all 7 representation levels (embedding output + 6 transformer layers), for inputs with and without padding. Verified on GPU with PyTorch 2.7 / transformers 4.57.6.
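
A comparison of this kind can be sketched as follows. This is not the script used for the verification reported above; `max_abs_diff` is a hypothetical helper, and excluding padded positions through the attention mask is an optional choice made for this illustration.

```python
import torch


def max_abs_diff(ref_states, port_states, attention_mask=None):
    """Largest element-wise gap between two lists of hidden-state tensors.

    ref_states / port_states: sequences of (batch, seq_len, hidden) tensors,
    one per representation level (e.g. embedding output plus each layer).
    attention_mask: optional (batch, seq_len) tensor with 1 for real tokens;
    when given, padded positions are excluded from the comparison.
    """
    ref_states, port_states = list(ref_states), list(port_states)
    if len(ref_states) != len(port_states):
        raise ValueError("both models must expose the same number of levels")
    worst = 0.0
    for ref, port in zip(ref_states, port_states):
        diff = (ref.float() - port.float()).abs()
        if attention_mask is not None:
            diff = diff * attention_mask.unsqueeze(-1).to(diff.dtype)
        worst = max(worst, diff.max().item())
    return worst
```

With a Hugging Face model, the second argument could be the `hidden_states` tuple returned when `output_hidden_states=True` is passed; the first would have to be collected from the reference implementation by other means.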

Related Models

See the full RNABERT collection.

Model	Notes
Taykhoom/RNABERT	This model

Usage

Embedding generation

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Taykhoom/RNABERT", trust_remote_code=True)
model = AutoModel.from_pretrained("Taykhoom/RNABERT")
model.eval()

sequences = ["AUGCAUGCAUGC", "GCUAGCUAGCUA"]
enc = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**enc)

# Token-level embeddings
token_emb = out.last_hidden_state   # (batch, seq_len, 120)

# Mean-pool over non-padding positions
mask = enc["attention_mask"].unsqueeze(-1).float()
mean_emb = (token_emb * mask).sum(1) / mask.sum(1)  # (batch, 120)

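# Intermediate layers: hidden_states[0] is the embedding output,
# hidden_states[1..6] are the outputs of the 6 transformer layers
with torch.no_grad():
    out_all = model(**enc, output_hidden_states=True)
layer3_emb = out_all.hidden_states[3]   # (batch, seq_len, 120)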

Fine-tuning

Fine-tuning follows standard HF conventions. The model has no CLS token, so use mean pooling over non-padding positions for sequence-level tasks.
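
One way to set up a sequence-level head with mean pooling is sketched below. `MeanPoolClassifier` is a hypothetical wrapper written for this illustration, and the single linear layer and `num_labels=2` are arbitrary choices. It only assumes that the encoder returns an object with a `last_hidden_state` attribute, as the model loaded in the Usage section does.

```python
import torch
from torch import nn


class MeanPoolClassifier(nn.Module):
    """Linear head on top of mask-aware mean pooling of encoder outputs."""

    def __init__(self, encoder, hidden_size=120, num_labels=2):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                      # (batch, seq_len, hidden)
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        # Average over real tokens only; clamp guards against all-padding rows
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.head(pooled)                 # (batch, num_labels) logits
```

With the objects from the Usage section, this could be instantiated as `MeanPoolClassifier(model)` and called with `enc["input_ids"]` and `enc["attention_mask"]`; the resulting logits can then be trained with a standard cross-entropy loss.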

Implementation Notes

This model uses the standard HuggingFace BertModel (model_type: "bert") with custom hyperparameters matching the original RNABERT architecture. No custom modeling code is required; trust_remote_code=True is only needed for the tokenizer.

The original implementation uses standard scaled dot-product attention in a post-LN BERT encoder. This HF port additionally supports attn_implementation="sdpa" and attn_implementation="flash_attention_2" through the standard HF dispatch mechanism; neither option was part of the original codebase.

Citation

@article{akiyama2022_rnabert,
  title   = {Informative {RNA} base embedding for {RNA} structural alignment and clustering by deep representation learning},
  author  = {Akiyama, Manato and Sakakibara, Yasubumi},
  journal = {NAR Genomics and Bioinformatics},
  volume  = {4},
  number  = {1},
  pages   = {lqac012},
  year    = {2022},
  doi     = {10.1093/nargab/lqac012}
}

Credits

Original model and code by Akiyama and Sakakibara. Source: GitHub. The HF conversion code was authored primarily by Claude Code and reviewed manually by Taykhoom Dalal.

License

No license is specified in the original repository. Please contact the authors before redistributing or using in commercial settings.
