| library_name: transformers | |
| license: apache-2.0 | |
| datasets: | |
| - SIRIS-Lab/erc-classification-dataset | |
| base_model: | |
| - allenai/specter2_base | |
| pipeline_tag: text-classification | |
| # ERC Panels Classifier | |
| This model is a fine-tuned version of **allenai/specter2_base** for multilabel scientific domain classification aligned with **ERC panel taxonomy**. | |
| It achieves the following results on the held-out test set: | |
| - **Best validation loss:** 0.0361 | |
| - **Micro F1:** 0.9386 | |
| - **Micro ROC-AUC:** 0.9718 | |
| - **Subset accuracy:** 0.7943 | |
| --- | |
| ## Model description | |
| This model is a fine-tuned variant of **SPECTER2** (`allenai/specter2_base`) adapted for **multilabel classification of scientific documents** into ERC research panels. | |
| The model takes as input the **title and abstract** of a scientific publication and predicts **one or more research panels**. | |
| Since scientific outputs may legitimately span multiple domains, the model is trained using **sigmoid activation** with **binary cross-entropy loss**, allowing independent assignment of multiple labels. | |
| ### Key characteristics | |
| - **Base model:** allenai/specter2_base | |
| - **Task:** multilabel document classification | |
| - **Labels:** 28 ERC scientific panels | |
| - **Activation:** sigmoid (independent scores per label) | |
| - **Loss:** BCEWithLogitsLoss | |
| - **Output:** list of predicted panels with associated probabilities | |
| - **Decision threshold:** 0.5 (tunable) | |
| This model enables automatic research-domain tagging aligned with the ERC panel structure. | |
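The sigmoid-plus-threshold decision rule listed above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `sigmoid` and `predict_panels` and the logit values are hypothetical and are not taken from the model's code.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1), avoiding overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def predict_panels(logits, labels, threshold=0.5):
    """Return (label, probability) pairs whose probability reaches the threshold."""
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    return [(label, p) for label, p in scored if p >= threshold]

# Hypothetical logits for three of the 28 panels (illustrative values only).
labels = ["Mathematics", "Computer Science and Informatics", "Universe Sciences"]
logits = [2.0, 0.3, -3.0]
predicted = predict_panels(logits, labels)
```

Because each label is scored independently, zero, one, or several panels can pass the threshold for the same document. Raising `threshold` trades recall for precision.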
| --- | |
| ## Intended uses & limitations | |
| ### Intended uses | |
| This model is designed for: | |
| - Automatic assignment of ERC research panels | |
| - Metadata enrichment for: | |
| - research project databases | |
| - institutional repositories | |
| - funding and grant analysis pipelines | |
| - Large-scale analytics such as: | |
| - portfolio mapping | |
| - thematic analysis of research outputs | |
| - monitoring disciplinary coverage of funded projects | |
| - Predicting subject areas for documents lacking structured domain metadata | |
| The model supports: | |
| - title only | |
| - abstract only | |
| - **title + abstract (recommended)** | |
| ### Limitations | |
| - ERC panels are **high-level categories** and do not represent fine-grained subdisciplines | |
| - Labels are derived from curated datasets and semi-automatically annotated data | |
| - Class imbalance may affect recall for underrepresented panels | |
| - The model does not encode explicit hierarchical relationships between panels | |
| Not suited for: | |
| - fine-grained subfield classification | |
| - journal recommendation | |
| - evaluation of research quality or impact | |
| - clinical, legal, or regulatory decision-making | |
| Predictions should be treated as **supportive metadata**, not authoritative classifications. | |
| --- | |
| ## How to use | |
| ``` | |
| from transformers import pipeline | |
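| # Model repository on the Hugging Face Hub | |
| MODEL_NAME = "SIRIS-Lab/erc-classifiers" | |
| classifier = pipeline(task="text-classification", model=MODEL_NAME, tokenizer=MODEL_NAME) | |
| # Recommended input is title + ". " + abstract; a short title is used here for brevity | |
| text = ["Climate change impacts on Arctic ecosystems."] | |
| # top_k=None returns a score for every panel; sigmoid keeps the scores independent per label | |
| classifier(text, top_k=None, function_to_apply="sigmoid") | |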
| ``` | |
| --- | |
| ## Training and evaluation data | |
| ### Training data | |
| - Scientific documents with ERC-style panel annotations | |
| - Inputs: | |
| - title | |
| - abstract | |
| - Task type: **multilabel classification** | |
| ### Dataset characteristics | |
| | Property | Value | | |
| |--------|------| | |
| | Documents | ~40k | | |
| | Labels | 28 panels | | |
| | Input fields | Title, Abstract | | |
| | Task type | Multilabel | | |
| | License | Dataset-dependent | | |
| --- | |
| ## Training procedure | |
| ### Preprocessing | |
| - Input text constructed as: | |
| `title + ". " + abstract` | |
| - Tokenization using the SPECTER2 tokenizer | |
| - Maximum sequence length: **512 tokens** | |
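A minimal sketch of the input construction described above, extended to cover the title-only and abstract-only cases. The function name `build_input` and the whitespace stripping are assumptions for illustration; the card only specifies the `title + ". " + abstract` concatenation.

```python
def build_input(title=None, abstract=None):
    """Join title and abstract as title + ". " + abstract.

    Falls back to whichever field is present when the other is missing.
    Stripping surrounding whitespace is an assumption, not part of the card.
    """
    title = (title or "").strip()
    abstract = (abstract or "").strip()
    if title and abstract:
        return title + ". " + abstract
    return title or abstract

text = build_input(
    "Climate change impacts on Arctic ecosystems",
    "We review observed shifts in species ranges.",
)
```

The 512-token limit is applied at tokenization time rather than here, for example by calling the tokenizer with `truncation=True, max_length=512`.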
| ### Model | |
| - Base model: `allenai/specter2_base` | |
| - Classification head: linear → sigmoid | |
| - Loss function: BCEWithLogitsLoss | |
| - Predictions: independent probability per label | |
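To make the loss concrete, here is a small pure-Python sketch of binary cross-entropy computed from logits, averaged over labels. It is meant to mirror what PyTorch's `BCEWithLogitsLoss` computes with its default mean reduction, but it is an illustration written for this card, not the training code; the function name `bce_with_logits` is hypothetical.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly from logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Two labels: the first is a true panel, the second is not.
loss_uninformed = bce_with_logits([0.0, 0.0], [1.0, 0.0])   # logits carry no signal
loss_confident = bce_with_logits([4.0, -4.0], [1.0, 0.0])   # confident and correct
```

Each label contributes its own term, which is why a document can be pushed toward several panels at once without the labels competing as they would under softmax.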
| ### Training hyperparameters | |
| | Hyperparameter | Value | | |
| |--------------|------| | |
| | Learning rate | 2e-5 | | |
| | Train batch size | 16 | | |
| | Eval batch size | 16 | | |
| | Epochs | 6 | | |
| | Weight decay | 0.01 | | |
| | Optimizer | AdamW | | |
| | Metric for best model | Micro F1 | | |
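For readers who want to reproduce a similar setup, the table could be expressed as keyword arguments for `transformers.TrainingArguments`. This is a sketch under several assumptions, since the card does not state that the Hugging Face `Trainer` was used: the batch sizes are taken to be per device, evaluation is taken to run once per epoch, and `"micro_f1"` is a hypothetical metric key that a user-supplied `compute_metrics` function would have to return.

```python
# Keyword names follow transformers.TrainingArguments; values come from the table above.
training_kwargs = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,    # assumption: table value is per device
    "per_device_eval_batch_size": 16,     # assumption: table value is per device
    "num_train_epochs": 6,
    "weight_decay": 0.01,
    "optim": "adamw_torch",               # assumption: PyTorch AdamW implementation
    "eval_strategy": "epoch",             # assumption: one evaluation per epoch
    "save_strategy": "epoch",             # must match eval_strategy to load the best model
    "load_best_model_at_end": True,       # assumption
    "metric_for_best_model": "micro_f1",  # hypothetical key from compute_metrics
}
```

With a compatible Transformers version installed, these could be passed as `TrainingArguments(**training_kwargs)`.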
| --- | |
| ## Training results | |
| | Epoch | Training Loss | Validation Loss | Micro F1 | Micro ROC-AUC | Subset accuracy | |
| |------|---------------|-----------------|----------|---------|----------| | |
| | 1 | 0.2089 | 0.0968 | 0.7576 | 0.8347 | 0.4043 | | |
| | 2 | 0.0961 | 0.0713 | 0.8231 | 0.8888 | 0.5171 | | |
| | 3 | 0.0719 | 0.0578 | 0.8614 | 0.9209 | 0.5829 | | |
| | 4 | 0.0579 | 0.0458 | 0.9072 | 0.9546 | 0.7029 | | |
| | 5 | 0.0479 | 0.0390 | 0.9264 | 0.9620 | 0.7614 | | |
| | 6 | 0.0407 | 0.0361 | **0.9386** | **0.9718** | **0.7943** | | |
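The gap between micro F1 and subset accuracy in the table follows from how the two are defined: micro F1 pools individual label decisions, while subset accuracy only counts a document as correct when its whole label set matches. The sketch below illustrates both on a toy example with binary indicator rows. The function names and data are hypothetical, and the card does not say which library produced the reported numbers.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label set matches the true set exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)

def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives and false negatives pooled over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy data: 3 documents, 3 labels; the second document misses one of its two labels.
y_true = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
acc = subset_accuracy(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
```

In this toy case a single missed label lowers subset accuracy to 2/3 while micro F1 stays at 8/9, the same pattern seen in the table, where subset accuracy is consistently the stricter metric.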
| --- | |
| ## Evaluation results (multilabel test set) | |
| | Panel | Precision | Recall | F1-score | Support | | |
| |------|-----------|--------|----------|---------| | |
| | Biotechnology and Biosystems Engineering | 0.88 | 0.70 | 0.78 | 30 | | |
| | Cell Biology, Development, Stem Cells and Regeneration | 0.98 | 0.94 | 0.96 | 54 | | |
| | Computer Science and Informatics | 0.96 | 0.98 | 0.97 | 95 | | |
| | Condensed Matter Physics | 0.97 | 0.99 | 0.98 | 68 | | |
| | Earth System Science | 0.94 | 0.98 | 0.96 | 64 | | |
| | Environmental Biology, Ecology and Evolution | 0.91 | 0.96 | 0.94 | 54 | | |
| | Fundamental Constituents of Matter | 0.97 | 0.94 | 0.95 | 32 | | |
| | Human Mobility, Environment, and Space | 0.81 | 0.81 | 0.81 | 21 | | |
| | Immunity, Infection and Immunotherapy | 1.00 | 0.97 | 0.99 | 40 | | |
| | Individuals, Markets and Organisations | 0.94 | 0.98 | 0.96 | 48 | | |
| | Institutions, Governance and Legal Systems | 0.89 | 0.92 | 0.91 | 26 | | |
| | Integrative Biology: from Genes and Genomes to Systems | 0.91 | 0.98 | 0.94 | 49 | | |
| | Materials Engineering | 0.81 | 0.93 | 0.87 | 75 | | |
| | Mathematics | 1.00 | 1.00 | 1.00 | 36 | | |
| | Molecules of Life: Biological Mechanisms, Structures and Functions | 0.94 | 0.98 | 0.96 | 111 | | |
| | Neuroscience and Disorders of the Nervous System | 1.00 | 1.00 | 1.00 | 30 | | |
| | Physical and Analytical Chemical Sciences | 0.89 | 0.93 | 0.91 | 94 | | |
| | Physiology in Health, Disease and Ageing | 0.94 | 1.00 | 0.97 | 34 | | |
| | Prevention, Diagnosis and Treatment of Human Diseases | 0.97 | 0.96 | 0.96 | 68 | | |
| | Products and Processes Engineering | 0.90 | 0.97 | 0.93 | 109 | | |
| | Studies of Cultures and Arts | 1.00 | 0.78 | 0.88 | 9 | | |
| | Synthetic Chemistry and Materials | 0.82 | 0.77 | 0.79 | 47 | | |
| | Systems and Communication Engineering | 0.94 | 0.97 | 0.95 | 87 | | |
| | Texts and Concepts | 0.87 | 0.93 | 0.90 | 14 | | |
| | The Human Mind and Its Complexity | 1.00 | 0.93 | 0.97 | 30 | | |
| | The Social World and Its Interactions | 0.97 | 0.94 | 0.96 | 34 | | |
| | The Study of the Human Past | 0.89 | 0.94 | 0.91 | 17 | | |
| | Universe Sciences | 1.00 | 1.00 | 1.00 | 25 | | |
| **Overall performance** | |
| | | Precision | Recall | F1-score | Support | | |
| |------|-----------|--------|----------|---------| | |
| | **Micro avg** | **0.93** | **0.95** | **0.94** | **1401** | | |
| | **Macro avg** | **0.93** | **0.94** | **0.93** | **1401** | | |
| | **Weighted avg** | **0.93** | **0.95** | **0.94** | **1401** | | |
| | **Samples avg** | **0.93** | **0.94** | **0.93** | **1401** | | |
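The macro and weighted rows differ only in how per-panel scores are combined: macro averaging treats every panel equally, while weighted averaging scales each panel by its support. A short sketch using three rows from the per-panel table above makes the difference visible. It covers only this three-panel subset, so the results are not expected to match the overall figures, and the helper names are hypothetical.

```python
# (panel, F1-score, support) copied from three rows of the per-panel table.
rows = [
    ("Mathematics", 1.00, 36),
    ("Studies of Cultures and Arts", 0.88, 9),
    ("Biotechnology and Biosystems Engineering", 0.78, 30),
]

def macro_average(rows):
    """Unweighted mean of per-panel scores: every panel counts equally."""
    return sum(score for _, score, _ in rows) / len(rows)

def weighted_average(rows):
    """Mean of per-panel scores weighted by the number of test documents per panel."""
    total_support = sum(support for _, _, support in rows)
    return sum(score * support for _, score, support in rows) / total_support

macro = macro_average(rows)
weighted = weighted_average(rows)
```

Small panels such as Studies of Cultures and Arts (support 9) influence the macro average as much as large ones, which makes it the more sensitive indicator of performance on underrepresented panels.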
| --- | |
| ## ERC-funded projects evaluation (multiclass recall) | |
| This evaluation uses **ERC-funded projects**, where each project belongs to **exactly one panel**. | |
| Only **recall** is reported. | |
| | Panel | Recall | | |
| |------|--------| | |
| | Biotechnology and Biosystems Engineering | 0.26 | | |
| | Cell Biology, Development, Stem Cells and Regeneration | 0.81 | | |
| | Computer Science and Informatics | 1.00 | | |
| | Condensed Matter Physics | 0.77 | | |
| | Earth System Science | 0.92 | | |
| | Environmental Biology, Ecology and Evolution | 0.85 | | |
| | Fundamental Constituents of Matter | 0.84 | | |
| | Human Mobility, Environment, and Space | 0.61 | | |
| | Immunity, Infection and Immunotherapy | 0.83 | | |
| | Individuals, Markets and Organisations | 0.96 | | |
| | Institutions, Governance and Legal Systems | 0.58 | | |
| | Integrative Biology: from Genes and Genomes to Systems | 0.73 | | |
| | Materials Engineering | 0.75 | | |
| | Mathematics | 0.96 | | |
| | Molecules of Life: Biological Mechanisms, Structures and Functions | 0.95 | | |
| | Neuroscience and Disorders of the Nervous System | 0.92 | | |
| | Physical and Analytical Chemical Sciences | 0.83 | | |
| | Physiology in Health, Disease and Ageing | 0.60 | | |
| | Prevention, Diagnosis and Treatment of Human Diseases | 0.94 | | |
| | Products and Processes Engineering | 0.58 | | |
| | Studies of Cultures and Arts | 0.27 | | |
| | Synthetic Chemistry and Materials | 0.67 | | |
| | Systems and Communication Engineering | 0.75 | | |
| | Texts and Concepts | 0.62 | | |
| | The Human Mind and Its Complexity | 0.85 | | |
| | The Social World and Its Interactions | 0.73 | | |
| | The Study of the Human Past | 0.83 | | |
| | Universe Sciences | 1.00 | | |
| **Overall recall** | |
| - **Micro recall:** 0.77 | |
| - **Macro recall:** 0.76 | |
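The card does not specify how a multilabel prediction is scored against a single true panel. One plausible reading, assumed here purely for illustration, is that a project counts as recalled when its true panel appears anywhere in the predicted set. Under that assumption, per-panel, micro, and macro recall could be computed as follows (function name and toy data are hypothetical):

```python
from collections import defaultdict

def recall_single_label(true_panels, predicted_sets):
    """Per-panel, micro, and macro recall when each project has exactly one true panel.

    Assumption: a project is a hit if its true panel is among the predicted panels.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true_panel, predicted in zip(true_panels, predicted_sets):
        totals[true_panel] += 1
        if true_panel in predicted:
            hits[true_panel] += 1
    per_panel = {panel: hits[panel] / totals[panel] for panel in totals}
    micro = sum(hits.values()) / sum(totals.values())
    macro = sum(per_panel.values()) / len(per_panel)
    return per_panel, micro, macro

# Toy example with five projects (not real evaluation data).
true_panels = ["Mathematics", "Mathematics", "Mathematics",
               "Universe Sciences", "Texts and Concepts"]
predicted_sets = [
    {"Mathematics"},
    {"Mathematics"},
    {"Computer Science and Informatics"},
    {"Universe Sciences", "Fundamental Constituents of Matter"},
    set(),
]
per_panel, micro, macro = recall_single_label(true_panels, predicted_sets)
```

Micro recall is the share of all projects that were recalled, while macro recall averages the per-panel values, so panels with few projects weigh as much as large ones.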
| ## Citation | |
| ``` | |
| @inproceedings{bovenzi2022mapping, | |
| title={Mapping STI ecosystems via Open Data: Overcoming the limitations of conflicting taxonomies. A case study for Climate Change Research in Denmark}, | |
| author={Bovenzi, Nicandro and Duran-Silva, Nicolau and Massucci, Francesco Alessandro and Multari, Francesco and Parra-Rojas, C{\'e}sar and Pujol-Llatse, Josep}, | |
| booktitle={International Conference on Theory and Practice of Digital Libraries (TPDL)}, | |
| pages={495--499}, | |
| year={2022}, | |
| publisher={Springer International Publishing} | |
| } | |
| ``` | |
| --- | |
| ## Framework versions | |
| - **Transformers:** 4.57.x | |
| - **PyTorch:** 2.8.0 | |
| - **Datasets:** 3.x | |
| - **Tokenizers:** 0.22.x |