Hack90/virus_dna_dataset
Biosaic (Bio-Mosaic) is a tokenizer library built for Enigma2. It contains a tokenizer and an embedder for DNA and amino-acid protein sequences. Its encoders are based on the VQ-VAE and Evoformer architectures and convert sequences into embeddings and back, for use in model training.
It has two different models. The VQ-VAE has around 160M parameters (for now it is only around 40M, for test runs). The EvoFormer has around 136M parameters (still in training).
class ModelConfig:  # VQ-VAE
    d_model: int = 768
    in_dim: int = 4
    beta: float = 0.15
    dropout: float = 0.25
    n_heads: int = 16
    n_layers: int = 12
import torch

class ModelConfig:  # EvoFormer
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    A = 4        # DNA alphabet size
    C = 21       # amino-acid alphabet size (21 letters); DNA uses 4
    d_msa = 768
    d_pair = 256
    n_heads = 32
    n_blocks = 28
The VQ-VAE and EvoFormer models are trained in batches. Each has its own separate Dataset class that takes the sequence strings as input, one-hot encodes the DNA sequences, and then fills them into batches according to the train and validation splits; the validation split is around 20% of the full dataset.
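The encoding and splitting steps described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not Biosaic's actual Dataset class: the names `DNA_ALPHABET`, `one_hot`, and `split_and_batch` are hypothetical, the `ACGT` channel order is an assumption, and so is the choice to encode unknown symbols (such as `N`) as all-zero rows. The 20% figure is read here as the validation fraction.

```python
import random

DNA_ALPHABET = "ACGT"  # assumed channel order; 4 channels, matching in_dim = 4


def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot rows."""
    index = {base: i for i, base in enumerate(DNA_ALPHABET)}
    rows = []
    for base in seq.upper():
        row = [0.0] * len(DNA_ALPHABET)
        if base in index:  # assumption: unknown symbols stay all-zero
            row[index[base]] = 1.0
        rows.append(row)
    return rows


def split_and_batch(seqs, block_size=256, batch_size=6, val_frac=0.2, seed=0):
    """Cut sequences into fixed-length blocks, split train/val, then batch."""
    # Non-overlapping windows; a trailing remainder shorter than block_size is dropped.
    blocks = [
        one_hot(s[i:i + block_size])
        for s in seqs
        for i in range(0, len(s) - block_size + 1, block_size)
    ]
    random.Random(seed).shuffle(blocks)
    n_val = int(len(blocks) * val_frac)
    val, train = blocks[:n_val], blocks[n_val:]

    def to_batches(items):
        return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

    return to_batches(train), to_batches(val)
```

Each batch is a nested list of shape (batch, block_size, 4); in a real training loop it would typically be converted to a float tensor before being passed to the model.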
import torch

class TrainConfig:  # VQ-VAE
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    learning_rate = 1e-4   # bumped from 1e-5
    weight_decay = 1e-4
    amsgrad = True
    warmup_epochs = 50     # linear warm-up
    epochs = 2000
    eval_interval = 100
    eval_iters = 30
    batch_size = 6
    block_size = 256
import torch

class TrainConfig:  # EvoFormer
    DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    LR = 1e-4
    WD = 1e-4
    AMS = True
    WARMUP = 50
    EPOCHS = 500
    BATCH = 8
    MSA_SEQ = 32     # number of sequences in each MSA
    L_SEQ = 256      # length of each sequence
    EVAL_ITERS = 5
    EVAL_INTV = 50
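Both training configs pair a base learning rate of 1e-4 with a 50-epoch linear warm-up. As a rough sketch of what such a schedule computes, the hypothetical helper `lr_at` below ramps the rate linearly over the warm-up epochs. Holding the rate constant afterwards is an assumption, since the configs do not say what happens after warm-up.

```python
def lr_at(epoch, base_lr=1e-4, warmup_epochs=50):
    """Learning rate for a 0-indexed epoch under linear warm-up.

    Ramps from base_lr / warmup_epochs up to base_lr, then stays at
    base_lr (assumption: no decay after warm-up).
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr


# Print the schedule at a few epochs around the warm-up boundary.
for epoch in (0, 24, 49, 50, 499):
    print(epoch, lr_at(epoch))
```

In PyTorch, the same shape could be obtained by passing a multiplier function such as `lambda e: min(1.0, (e + 1) / 50)` to `torch.optim.lr_scheduler.LambdaLR`, with the optimizer created at the base learning rate.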