---
language: en
tags:
- protein
- protbert
- masked-language-modeling
- bioinformatics
- sequence-prediction
datasets:
- custom
license: mit
library_name: transformers
pipeline_tag: fill-mask
---

# ProtBERT-Unmasking

This model is a fine-tuned version of ProtBERT for unmasking protein sequences: it predicts masked amino acids from the surrounding sequence context.

## Model Description

- **Base Model**: ProtBERT
- **Task**: Protein sequence unmasking
- **Training**: Fine-tuned on masked protein sequences
- **Use Case**: Predicting missing or masked amino acids in protein sequences
- **Optimal Use**: Performs best on E. coli sequences in which the amino acids K, C, Y, H, S, and M are known

For the training methodology and approach, see our paper: [https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)

## Usage

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model_id = "faceless-void/protbert-sequence-unmasking"
model = AutoModelForMaskedLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Example sequence with two masked residues.
# ProtBERT's tokenizer expects residues separated by spaces.
sequence = "M A L N [MASK] K F G P [MASK] L V R K"
inputs = tokenizer(sequence, return_tensors="pt")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits  # shape: (batch, sequence_length, vocab_size)
```
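
The `predictions` tensor above holds raw logits, so turning it into amino-acid guesses takes one more step. The sketch below is one way to do that, not the authors' own post-processing: it applies a softmax at each masked position and keeps the `k` most probable tokens. The helper name `top_k_at_masks` and the toy vocabulary are illustrative assumptions; it works on plain Python lists so it can be tried without downloading the model.

```python
import math


def top_k_at_masks(logits, input_ids, mask_token_id, id_to_token, k=3):
    """Return [(position, [(token, probability), ...]), ...] for each masked position.

    logits: per-position lists of vocabulary scores for ONE sequence.
    input_ids: the token ids that were fed to the model for that sequence.
    id_to_token: callable mapping a token id to its string.
    """
    results = []
    for pos, token_id in enumerate(input_ids):
        if token_id != mask_token_id:
            continue
        row = logits[pos]
        # Numerically stable softmax over the vocabulary
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Indices of the k highest-probability tokens
        best = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
        results.append((pos, [(id_to_token(i), probs[i]) for i in best]))
    return results


# Toy demo with a made-up 4-token vocabulary (not the real ProtBERT vocabulary)
toy_vocab = ["[MASK]", "K", "C", "Y"]
toy_logits = [[0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 3.0]]
toy_ids = [1, 0]  # the second position is masked
for pos, candidates in top_k_at_masks(toy_logits, toy_ids, 0, toy_vocab.__getitem__, k=2):
    print(pos, candidates)
```

With the objects from the Usage snippet, a call might look like `top_k_at_masks(predictions[0].detach().tolist(), inputs["input_ids"][0].tolist(), tokenizer.mask_token_id, tokenizer.convert_ids_to_tokens)`.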

## Inference API

The model is optimized for:

- **Organism**: E. coli
- **Known Amino Acids**: K, C, Y, H, S, M
- **Task**: Predicting unknown amino acids in a sequence

Example usage with the `fill-mask` pipeline:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="faceless-void/protbert-sequence-unmasking")

# Example with known amino acids K, Y, H, S (residues separated by spaces)
sequence = "K [MASK] Y H S [MASK]"
results = unmasker(sequence)

# With more than one [MASK], the pipeline returns one candidate list per mask
for position, candidates in enumerate(results, start=1):
    for candidate in candidates:
        print(f"Mask {position}: predicted amino acid: {candidate['token_str']}, Score: {candidate['score']:.3f}")
```
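
To build such an input from a raw sequence, one possible approach is to keep only the residues listed under "Known Amino Acids" and mask everything else. The sketch below shows that reading of the setup. The helper `mask_unknown_residues` is hypothetical, the space-separated output follows the upstream ProtBERT input convention, and the paper linked above remains the authority on how inputs were actually prepared.

```python
def mask_unknown_residues(sequence, known="KCYHSM", mask_token="[MASK]"):
    """Keep residues found in `known`, mask all others, and space-separate the result."""
    # Normalise case and drop any whitespace already present in the input
    residues = [aa for aa in sequence.upper() if not aa.isspace()]
    return " ".join(aa if aa in known else mask_token for aa in residues)


print(mask_unknown_residues("MKAYHS"))  # M K [MASK] Y H S
```

The returned string can be passed to the tokenizer or to the `fill-mask` pipeline in place of a hand-written sequence.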

## Limitations and Biases

- This model is designed specifically for unmasking E. coli protein sequences.
- It performs best on sequences in which the amino acids K, C, Y, H, S, and M are known.
- The model may not perform optimally for:
  - Sequences from other organisms
  - Sequences without the specified known amino acids
  - Other protein-related tasks
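
A quick pre-check can show how much of a sequence falls inside the known-residue set before the model is run on it. The sketch below is only an illustrative heuristic: the helper `known_residue_coverage` is hypothetical, the model card does not define any coverage threshold, and the function expects a raw sequence without `[MASK]` tokens.

```python
KNOWN_RESIDUES = "KCYHSM"


def known_residue_coverage(sequence, known=KNOWN_RESIDUES):
    """Return the fraction of residues in `sequence` that belong to `known`."""
    # Ignore whitespace and any other non-letter characters
    residues = [aa for aa in sequence.upper() if aa.isalpha()]
    if not residues:
        return 0.0
    return sum(aa in known for aa in residues) / len(residues)


print(known_residue_coverage("KCAA"))  # 0.5
```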

## Training Details

Details of the training methodology, dataset preparation, and model evaluation are in our paper: [https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)