---
library_name: transformers
tags:
- topic
- multi-sentiment
license: mit
datasets:
- valurank/Topic_Classification
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- distilbert/distilbert-base-uncased
---

# Model Card for Topic Classification Model

A fine-tuned DistilBERT model for multi-class topic classification. It predicts the most relevant topic label for an input text from a predefined label set. The model was trained using 🤗 Transformers and PyTorch on a custom dataset derived from academic and news-style corpora.

## Model Details

### Model Description

This model was developed by Daniel (@AfroLogicInsect) to classify text into one of several predefined topics. It builds on the `distilbert-base-uncased` architecture and was fine-tuned for multi-class classification using a softmax output layer.

- **Developed by:** Daniel 🇳🇬 (@AfroLogicInsect)
- **Model type:** DistilBERT-based multi-class sequence classifier
- **Language(s):** English
- **License:** MIT
- **Finetuned from:** distilbert-base-uncased

### Model Sources

- **Repository:** [AfroLogicInsect/topic-model-analysis-model](https://huggingface.co/AfroLogicInsect/topic-model-analysis-model)
- **Paper:** arXiv:1910.01108 (DistilBERT, Sanh et al.)
- **Demo:** [Coming soon]

## Uses

### Direct Use

- Classify academic or news-style text into topics such as AI, finance, sports, climate, etc.
- Embed in dashboards or content-moderation tools for automatic tagging

### Downstream Use

- Can be extended to hierarchical topic classification
- Useful for building recommendation engines or content filters

### Out-of-Scope Use

- Not suitable for sentiment or emotion classification
- May not generalize well to informal or slang-heavy text

## Bias, Risks, and Limitations

- Trained on curated corpora, so it may reflect biases in the source material
- The topic set is predefined and static, so emerging topics may be misclassified
- Confidence scores are probabilistic, not definitive

### Recommendations

- Retrieve scores for every label (`return_all_scores=True`, or `top_k=None` on newer versions of `transformers`) and inspect the top few predictions rather than relying on the single best label
- Consider fine-tuning on domain-specific data for improved accuracy

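Once the pipeline returns a score per label, picking the top predictions is just a sort. A minimal, self-contained sketch over a hypothetical all-scores output (the labels and scores below are made up for illustration, not the model's real label set):

```python
# Hypothetical all-scores output for one input: one dict per topic label.
scores = [
    {"label": "finance", "score": 0.050},
    {"label": "AI", "score": 0.810},
    {"label": "sports", "score": 0.020},
    {"label": "climate", "score": 0.090},
    {"label": "health", "score": 0.030},
]

def top_k(results, k=5):
    """Return the k highest-scoring labels, best first."""
    return sorted(results, key=lambda r: r["score"], reverse=True)[:k]

for rank, res in enumerate(top_k(scores, k=3), start=1):
    print(f"Top {rank}: {res['label']} ({res['score']:.3f})")
# Top 1: AI (0.810)
# Top 2: climate (0.090)
# Top 3: finance (0.050)
```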
## How to Get Started

```python
from transformers import pipeline

# Ask for a score for every topic label, not just the top one.
# (`return_all_scores=True` is the older spelling; newer transformers
# versions prefer `top_k=None`.)
classifier = pipeline(
    "text-classification",
    model="AfroLogicInsect/topic-model-analysis-model",
    tokenizer="AfroLogicInsect/topic-model-analysis-model",
    return_all_scores=True,
)

text = "New AI breakthrough in natural language processing"
results = classifier(text)  # one list of {label, score} dicts per input

# Keep the five most likely topics, highest score first
top_5 = sorted(results[0], key=lambda x: x["score"], reverse=True)[:5]
for i, res in enumerate(top_5):
    print(f"Top {i+1}: {res['label']} ({res['score']:.3f})")
```

## Training Details

### Dataset

- Custom multi-class topic dataset based on arXiv abstracts and news articles
- Labels include domains such as AI, finance, sports, climate, etc.

### Hyperparameters

- Epochs: 3
- Batch size: 16
- Learning rate: 2e-5
- Evaluation every 200 steps
- Model-selection metric: F1 score

### Trainer Setup

Used the Hugging Face `Trainer` API with `TrainingArguments` configured for early stopping and best-model selection.

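A configuration sketch of such a setup, assuming a recent `transformers` version: the numeric values mirror the hyperparameters listed above, while `output_dir` and the early-stopping patience are illustrative assumptions, and the model, datasets, and `compute_metrics` function (which must report an `"f1"` key) are supplied separately.

```python
from transformers import EarlyStoppingCallback, TrainingArguments

# Pass `args` and the callback to Trainer alongside model and datasets.
args = TrainingArguments(
    output_dir="topic-model",        # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    eval_strategy="steps",           # `evaluation_strategy` on older versions
    eval_steps=200,                  # evaluate every 200 steps
    save_strategy="steps",           # checkpoint cadence must match eval cadence
    save_steps=200,
    load_best_model_at_end=True,     # reload the best checkpoint after training
    metric_for_best_model="f1",
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)  # assumed patience
```

`load_best_model_at_end=True` is what makes both early stopping and best-model selection work: the callback compares the `"f1"` value at each evaluation step and halts after the configured number of non-improving evaluations.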
## Evaluation

The model achieved strong performance across multiple topic categories. Evaluation metrics include:

- **Accuracy:** ~90.8%
- **F1 Score:** ~0.91
- **Precision:** ~0.89
- **Recall:** ~0.93

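For reference, these metrics relate as follows in the multi-class setting: accuracy is the fraction of exact matches, while precision, recall, and F1 are computed per label and then averaged. A small self-contained sketch on toy labels (illustrative only; it does not reproduce the numbers above):

```python
def macro_scores(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over all labels."""
    labels = sorted(set(y_true) | set(y_pred))
    per_label = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_label.append((prec, rec, f1))
    n = len(labels)
    return tuple(sum(vals) / n for vals in zip(*per_label))

# Toy labels for illustration only; these are not the model's predictions.
y_true = ["AI", "finance", "sports", "AI", "climate"]
y_pred = ["AI", "finance", "AI", "AI", "climate"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.80 precision=0.67 recall=0.75 f1=0.70
```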
## Environmental Impact

- **Hardware:** Google Colab (NVIDIA T4 GPU)
- **Training Time:** ~2.5 hours
- **Carbon Emitted:** ~0.3 kg CO₂eq (estimated via the [ML Impact Calculator](https://mlco2.github.io/impact#compute))

## Citation

```bibtex
@misc{afrologicinsect2025topicmodel,
  title = {AfroLogicInsect Topic Classification Model},
  author = {Akan Daniel},
  year = {2025},
  howpublished = {\url{https://huggingface.co/AfroLogicInsect/topic-model-analysis-model}},
}
```

## Contact

- Name: Daniel (@AfroLogicInsect)
- Location: Lagos, Nigeria
- Contact: GitHub / Hugging Face / email (danielamahtoday@gmail.com)