| library_name: transformers | |
| tags: | |
| - citation | |
| - text-classification | |
| - science | |
| license: apache-2.0 | |
| language: | |
| - af | |
| - am | |
| - ar | |
| - as | |
| - az | |
| - be | |
| - bg | |
| - bn | |
| - br | |
| - bs | |
| - ca | |
| - cs | |
| - cy | |
| - da | |
| - de | |
| - el | |
| - en | |
| - eo | |
| - es | |
| - et | |
| - eu | |
| - fa | |
| - fi | |
| - fr | |
| - fy | |
| - ga | |
| - gd | |
| - gl | |
| - gu | |
| - ha | |
| - he | |
| - hi | |
| - hr | |
| - hu | |
| - hy | |
| - id | |
| - is | |
| - it | |
| - ja | |
| - jv | |
| - ka | |
| - kk | |
| - km | |
| - kn | |
| - ko | |
| - ku | |
| - ky | |
| - la | |
| - lo | |
| - lt | |
| - lv | |
| - mg | |
| - mk | |
| - ml | |
| - mn | |
| - mr | |
| - ms | |
| - my | |
| - ne | |
| - nl | |
| - 'no' | |
| - om | |
| - or | |
| - pa | |
| - pl | |
| - ps | |
| - pt | |
| - ro | |
| - ru | |
| - sa | |
| - sd | |
| - si | |
| - sk | |
| - sl | |
| - so | |
| - sq | |
| - sr | |
| - su | |
| - sv | |
| - sw | |
| - ta | |
| - te | |
| - th | |
| - tl | |
| - tr | |
| - ug | |
| - uk | |
| - ur | |
| - uz | |
| - vi | |
| - xh | |
| - yi | |
| - zh | |
| base_model: | |
| - distilbert/distilbert-base-multilingual-cased | |
| # Citation Pre-Screening | |
| ## Overview | |
| <details> | |
| <summary>Click to expand</summary> | |
| - **Model type:** Language Model | |
| - **Architecture:** DistilBERT | |
| - **Language:** Multilingual | |
| - **License:** Apache 2.0 | |
| - **Task:** Binary Classification (Citation Pre-Screening) | |
| - **Dataset:** SIRIS-Lab/citation-parser-TYPE | |
| - **Additional Resources:** | |
| - [GitHub](https://github.com/sirisacademic/citation-parser) | |
| </details> | |
| ## Model description | |
| The **Citation Pre-Screening** model is part of the [`Citation Parser`](https://github.com/sirisacademic/citation-parser) package and is fine-tuned to classify citation texts as valid or invalid. Based on **DistilBERT**, it is designed for automated citation-processing workflows and acts as the pre-screening component of the **Citation Parser** tool for citation metadata extraction and validation. | |
| The model was trained on a dataset of citation texts labeled `True` (valid citation) or `False` (invalid citation). The dataset contains 3599 training samples and 400 test samples; each example pairs a citation-related text with its label. | |
| The model was fine-tuned from the **DistilBERT-base-multilingual-cased** checkpoint, so it can process multilingual text, but it was evaluated on English citation data. | |
| ## Intended Usage | |
| This model is intended to classify raw citation text as either a valid or an invalid citation. It is suited to automating the pre-screening step in citation databases or manuscript workflows. | |
| ## How to use | |
| ```python | |
| from transformers import pipeline | |
| # Load the model | |
| citation_classifier = pipeline("text-classification", model="sirisacademic/citation-pre-screening") | |
| # Example citation text | |
| citation_text = "MURAKAMI, H等: 'Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate', 《EUROPEAN POLYMER JOURNAL》" | |
| # Classify the citation | |
| result = citation_classifier(citation_text) | |
| print(result) | |
| ``` | |
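The pipeline call above returns a list with one `{"label": ..., "score": ...}` dictionary per input. A minimal sketch of using that output to keep only citations predicted as valid is shown below. The helper name `filter_valid`, the `threshold` parameter, and the assumption that the label strings are literally `"True"` and `"False"` are illustrative and not confirmed by this card; check `model.config.id2label` for the actual label names. The predictions in the example are hand-written, not real model output.

```python
from typing import Dict, List

# Assumption: label strings match the `True`/`False` labels described in this card.
# Verify against model.config.id2label before relying on this value.
VALID_LABEL = "True"


def filter_valid(
    citations: List[str],
    predictions: List[Dict],
    threshold: float = 0.5,
) -> List[str]:
    """Keep citations predicted as valid with a score of at least `threshold`."""
    kept = []
    for text, pred in zip(citations, predictions):
        if str(pred["label"]) == VALID_LABEL and pred["score"] >= threshold:
            kept.append(text)
    return kept


# Hand-written predictions in the pipeline's output format (not real model output).
example_citations = [
    "Doe, J. (2020). A study. Journal of Examples, 1(2), 3-4.",
    "Click here to subscribe",
]
example_predictions = [
    {"label": "True", "score": 0.98},
    {"label": "False", "score": 0.91},
]

print(filter_valid(example_citations, example_predictions, threshold=0.9))
```

In practice, `predictions` would be the result of calling the classifier on the same list of citation strings.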
| ## Training | |
| The model was trained using the **Citation Pre-Screening Dataset** consisting of: | |
| - **Training data**: 3599 samples | |
| - **Test data**: 400 samples | |
| The following hyperparameters were used for training: | |
| - **Model Path**: `distilbert/distilbert-base-multilingual-cased` | |
| - **Batch Size**: 32 | |
| - **Number of Epochs**: 4 | |
| - **Learning Rate**: 2e-5 | |
| - **Max Sequence Length**: 512 | |
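As a rough illustration of what these settings imply, the sketch below works out the number of optimizer steps from the sample count, batch size, and epoch count listed above. It assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; none of these details are stated in this card.

```python
import math

# Values taken from the Training section of this card.
train_samples = 3599
batch_size = 32
num_epochs = 4

# Assumptions: single device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
last_batch_size = train_samples - (steps_per_epoch - 1) * batch_size

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"last batch size: {last_batch_size}")
```

Under those assumptions, training runs for a few hundred update steps in total, which is typical for fine-tuning on a dataset of this size.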
| ## Evaluation Metrics | |
| The model's performance was evaluated on the test set, and the following results were obtained: | |
| | Metric | Value | | |
| |----------------------|--------| | |
| | **Accuracy** | 0.95 | | |
| | **Macro avg F1** | 0.94 | | |
| | **Weighted avg F1** | 0.95 | | |
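For readers unfamiliar with the difference between the two F1 averages, the sketch below computes accuracy, macro-averaged F1, and weighted-averaged F1 from scratch on a small made-up set of labels. It is not the evaluation script used for this model, and the toy labels are not drawn from the test set. Macro F1 averages the per-class F1 scores equally, while weighted F1 weights each class by its number of true examples.

```python
from typing import Dict, List


def f1_for_class(y_true: List[bool], y_pred: List[bool], cls: bool) -> float:
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(y_true: List[bool], y_pred: List[bool]) -> Dict[str, float]:
    """Return accuracy, macro F1, and support-weighted F1."""
    classes = sorted(set(y_true))
    per_class = {c: f1_for_class(y_true, y_pred, c) for c in classes}
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro_f1 = sum(per_class.values()) / len(classes)
    weighted_f1 = sum(per_class[c] * y_true.count(c) for c in classes) / len(y_true)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "weighted_f1": weighted_f1}


# Toy labels (made up): 6 valid and 4 invalid citations, with 2 misclassified.
toy_true = [True] * 6 + [False] * 4
toy_pred = [True] * 5 + [False] + [False] * 3 + [True]

metrics = summarize(toy_true, toy_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When the classes are imbalanced, macro F1 is pulled toward the weaker class, which is consistent with the macro average reported above being slightly below the weighted average.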
| ## Additional information | |
| ### Authors | |
| - SIRIS Lab, Research Division of SIRIS Academic. | |
| ### License | |
| This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). | |
| ### Contact | |
| For further information, send an email to either [nicolau.duransilva@sirisacademic.com](mailto:nicolau.duransilva@sirisacademic.com) or [info@sirisacademic.com](mailto:info@sirisacademic.com). | |