Update model card: add research links, base models, and official citation
#1
by nielsr (HF Staff) - opened
README.md
CHANGED

@@ -1,28 +1,34 @@
---
language:
- zh
- en
- de
- fr
library_name: transformers
license: mit
pipeline_tag: feature-extraction
tags:
- embeddings
- lora
- sociology
- retrieval
- feature-extraction
- sentence-transformers
- peft
base_model:
- Qwen/Qwen3-Embedding-0.6B
- Qwen/Qwen3-Embedding-4B
---

# THETA: Textual Hybrid Embedding–based Topic Analysis

[Paper](https://huggingface.co/papers/2603.05972) | [GitHub](https://github.com/CodeSoul-co/THETA)

## Model Description

THETA (Textual Hybrid Embedding-based Topic Analysis) is a domain-specific embedding framework designed for scalable qualitative research in sociology and the social sciences. This repository contains LoRA adapters fine-tuned on top of Qwen3-Embedding models (0.6B and 4B) using **Domain-Adaptive Fine-tuning (DAFT)**.

The model is optimized to capture semantic vector structures within specific social contexts, making it suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG).

**Base Models:**
- [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)

@@ -34,7 +40,7 @@ The model is suitable for tasks such as semantic search, similarity computation,

## Intended Use

This model is intended for text embedding generation, semantic similarity computation, document retrieval, and downstream NLP tasks requiring dense representations in the sociology and social science domains.

It is **not** designed for text generation or decision-making in high-risk scenarios.
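
A minimal usage sketch (not part of the original card): it loads the Qwen3-Embedding-0.6B base model, attaches the LoRA adapter with PEFT, and compares two texts. The `adapter_id` value and the left-padded last-token pooling are assumptions; point `adapter_id` at the actual adapter folder in this repository and follow the authors' recommended pooling if it differs.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen3-Embedding-0.6B"
adapter_id = "CodeSoulco/THETA"  # placeholder: point this at the adapter folder in this repo

# Left padding so the last position of every sequence is its final real token.
tokenizer = AutoTokenizer.from_pretrained(base_id, padding_side="left")
model = AutoModel.from_pretrained(base_id)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the DAFT LoRA adapter
model.eval()

def embed(texts):
    """Encode texts into L2-normalised embeddings via last-token pooling."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)
    pooled = hidden[:, -1]                         # last token of each left-padded sequence
    return F.normalize(pooled, p=2, dim=1)

texts = ["collective action in climate protests",
         "mobilisation during environmental demonstrations"]
emb = embed(texts)
print(emb.shape)                 # (2, output_dim); the table below lists 896 for the 0.6B variant
print((emb[0] @ emb[1]).item())  # cosine similarity of the two normalised embeddings
```

For the larger variant, swap in Qwen3-Embedding-4B and the corresponding adapter.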

@@ -45,7 +51,7 @@ It is **not** designed for text generation or decision-making in high-risk scena

| Base model | Qwen3-Embedding (0.6B / 4B) |
| Fine-tuning | LoRA (Low-Rank Adaptation) |
| Output dimension | 896 (0.6B) / 2560 (4B) |
| Framework | Transformers + PEFT (PyTorch) |

## Repository Structure

@@ -64,7 +70,7 @@ Pre-computed embeddings are available in a separate dataset repo: [CodeSoulco/TH

## Training Details

- **Fine-tuning method:** LoRA (DAFT); see the configuration sketch below
- **Training domain:** Sociology and social science texts
- **Datasets:** germanCoal, FCPB, socialTwitter, hatespeech, mental_health
- **Objective:** Improve domain-specific semantic representation
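
The card does not publish the exact LoRA hyperparameters, so the following is only an illustrative PEFT configuration for the DAFT setup summarised above; the rank, alpha, dropout, and target modules are assumptions rather than the authors' settings.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModel

# Illustrative values only; the published adapters may use different hyperparameters.
lora_config = LoraConfig(
    task_type=TaskType.FEATURE_EXTRACTION,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

base = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B")  # frozen base model
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter matrices are updated
```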

@@ -111,11 +117,11 @@ This model is released under the **MIT License**.

## Citation

```bibtex
@article{duan2026theta,
  title={THETA: A Textual Hybrid Embedding-based Topic Analysis Framework and AI Scientist Agent for Scalable Computational Social Science},
  author={Duan, Zhenke and Pan, Jiqun and Li, Xin},
  journal={arXiv preprint arXiv:2603.05972},
  year={2026},
  doi={10.48550/arXiv.2603.05972}
}
```
|