---
language:
- ar
metrics:
- bleu
- accuracy
library_name: transformers
pipeline_tag: text-classification
tags:
- t5
- Classification
- ArabicT5
- Text Classification
widget:
- example_title: >
    الديني
- text: >
    الحمد لله رب العالمين والصلاة والسلام على سيد المرسلين نبينا محمد وآله وصحبه أجمعين،وبعد:فإنه يجب على العبد أن يتجنب الذنوب كلها دقها وجلها صغيرها وكبيرها وأن يتعاهد نفسه بالتوبة الصادقة والإنابة إلى ربه. قال تعالى: (وَتُوبُوا إِلَى اللَّهِ جَمِيعًا أَيُّهَا الْمُؤْمِنُونَ لَعَلَّكُمْ تُفْلِحُونَ)النور 31.
---

## Arabic text classification using deep learning (ArabicT5)

## Our experiment

- The category mapping:

      category_mapping = {
          'Politics': 1,
          'Finance': 2,
          'Medical': 3,
          'Sports': 4,
          'Culture': 5,
          'Tech': 6,
          'Religion': 7
      }

- Training parameters

  | Parameter             | Value  |
  | :-------------------: | :----: |
  | Training batch size   | `8`    |
  | Evaluation batch size | `8`    |
  | Learning rate         | `1e-4` |
  | Max input length      | `200`  |
  | Max target length     | `3`    |
  | Number of workers     | `4`    |
  | Epochs                | `2`    |

- Results

  | Metric          | Value    |
  | :-------------: | :------: |
  | Validation loss | `0.0479` |
  | Accuracy        | `96.49%` |
  | BLEU            | `96.49%` |

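The card does not include the evaluation code. As a minimal sketch, assuming accuracy is measured as an exact match between the generated label string and the reference label string, the metric could be computed as follows. The function name `exact_match_accuracy` and the sample predictions are hypothetical and are not taken from the original experiment.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of generated label strings that equal their reference label.

    Assumption: both lists hold label strings such as "1" .. "7".
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # Strip whitespace so that " 5" and "5" count as the same label.
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical example: three of the four predictions match.
preds = ["1", "5", "7", "3"]
refs = ["1", "5", "2", "3"]
print(f"{exact_match_accuracy(preds, refs):.2%}")  # 75.00%
```
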
## SANAD: Single-label Arabic News Articles Dataset for automatic text categorization

- Paper: <https://www.researchgate.net/publication/333605992_SANAD_Single-Label_Arabic_News_Articles_Dataset_for_Automatic_Text_Categorization>
- Dataset: <https://data.mendeley.com/datasets/57zpx667y9/2>

## Arabic text classification using deep learning models

- Paper: <https://www.sciencedirect.com/science/article/abs/pii/S0306457319303413>
- Their experiment:

  > "Our experimental results showed that all models did very well on SANAD corpus with a minimum accuracy of 93.43%, achieved by CGRU, and top performance of 95.81%, achieved by HANGRU."

  | Model  | Accuracy |
  | :----: | :------: |
  | CGRU   | 93.43%   |
  | HANGRU | 95.81%   |

## Example usage

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "Hezam/ArabicT5_Classification"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

text = "الزين فيك القناه الاولي المغربيه الزين فيك القناه الاولي المغربيه اخبارنا المغربيه متابعه تفاجا زوار موقع القناه الاولي المغربي"

# Tokenize with the maximum input length used in training (200).
tokens = tokenizer(
    text,
    max_length=200,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

# The target is a short label string, so generation is capped at 3 tokens.
# The attention mask tells the model to ignore the padding positions.
output = model.generate(
    tokens["input_ids"],
    attention_mask=tokens["attention_mask"],
    max_length=3,
    length_penalty=10,
)

# Decode each generated sequence into its label string.
output = [
    tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    for ids in output
]
print(output)
```

Output:

```text
['5']
```
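
The model returns the label as a string. To turn it into a readable category, the category mapping from the experiment section can be inverted. This is a minimal sketch; the helper name `decode_predictions` is hypothetical and not part of the model or the `transformers` library.

```python
# Category mapping as listed in the experiment section of this card.
category_mapping = {
    "Politics": 1,
    "Finance": 2,
    "Medical": 3,
    "Sports": 4,
    "Culture": 5,
    "Tech": 6,
    "Religion": 7,
}

# Invert the mapping: generated label string -> category name.
id_to_category = {str(idx): name for name, idx in category_mapping.items()}


def decode_predictions(outputs):
    """Map decoded model outputs such as ['5'] to category names.

    Labels that are not in the mapping are returned as None.
    """
    return [id_to_category.get(text.strip()) for text in outputs]


print(decode_predictions(["5"]))  # ['Culture']
```

Returning `None` for an unrecognized string keeps unexpected generations visible to the caller instead of raising an error.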