Update model card: add research links, base models, and official citation

#1 by nielsr (HF Staff), opened

Files changed (1): README.md (+29, -23)
@@ -1,28 +1,34 @@
  ---
- language:
- - zh
- - en
- - de
- - fr
+ language:
+ - zh
+ - en
+ - de
+ - fr
+ library_name: transformers
  license: mit
  pipeline_tag: feature-extraction
- library_name: transformers
  tags:
- - embeddings
- - lora
- - sociology
- - retrieval
- - feature-extraction
- - sentence-transformers
+ - embeddings
+ - lora
+ - sociology
+ - retrieval
+ - feature-extraction
+ - sentence-transformers
+ - peft
+ base_model:
+ - Qwen/Qwen3-Embedding-0.6B
+ - Qwen/Qwen3-Embedding-4B
  ---

  # THETA: Textual Hybrid Embedding–based Topic Analysis

+ [Paper](https://huggingface.co/papers/2603.05972) | [GitHub](https://github.com/CodeSoul-co/THETA)
+
  ## Model Description

- THETA is a domain-specific embedding model fine-tuned using LoRA on top of Qwen3-Embedding models (0.6B and 4B). It is designed to generate dense vector representations for texts in the sociology and social science domain.
+ THETA (Textual Hybrid Embedding-based Topic Analysis) is a domain-specific embedding framework designed for scalable qualitative research in sociology and the social sciences. This repository contains LoRA adapters fine-tuned on top of Qwen3-Embedding models (0.6B and 4B) using **Domain-Adaptive Fine-tuning (DAFT)**.

- The model is suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG).
+ The model is optimized to capture semantic vector structures within specific social contexts, making it suitable for tasks such as semantic search, similarity computation, clustering, and retrieval-augmented generation (RAG).

  **Base Models:**
  - [Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
@@ -34,7 +40,7 @@ The model is suitable for tasks such as semantic search, similarity computation,

  ## Intended Use

- This model is intended for text embedding generation, semantic similarity computation, document retrieval, and downstream NLP tasks requiring dense representations.
+ This model is intended for text embedding generation, semantic similarity computation, document retrieval, and downstream NLP tasks requiring dense representations in the sociology and social science domains.

  It is **not** designed for text generation or decision-making in high-risk scenarios.

@@ -45,7 +51,7 @@ It is **not** designed for text generation or decision-making in high-risk scena
  | Base model | Qwen3-Embedding (0.6B / 4B) |
  | Fine-tuning | LoRA (Low-Rank Adaptation) |
  | Output dimension | 896 (0.6B) / 2560 (4B) |
- | Framework | Transformers (PyTorch) |
+ | Framework | Transformers + PEFT (PyTorch) |

  ## Repository Structure

@@ -64,7 +70,7 @@ Pre-computed embeddings are available in a separate dataset repo: [CodeSoulco/TH

  ## Training Details

- - **Fine-tuning method:** LoRA
+ - **Fine-tuning method:** LoRA (DAFT)
  - **Training domain:** Sociology and social science texts
  - **Datasets:** germanCoal, FCPB, socialTwitter, hatespeech, mental_health
  - **Objective:** Improve domain-specific semantic representation
@@ -111,11 +117,11 @@ This model is released under the **MIT License**.
  ## Citation

  ```bibtex
- @misc{theta2026,
- title={THETA: Textual Hybrid Embedding--based Topic Analysis},
- author={CodeSoul},
+ @article{duan2026theta,
+ title={THETA: A Textual Hybrid Embedding-based Topic Analysis Framework and AI Scientist Agent for Scalable Computational Social Science},
+ author={Duan, Zhenke and Pan, Jiqun and Li, Xin},
+ journal={arXiv preprint arXiv:2603.05972},
  year={2026},
- publisher={Hugging Face},
- url={https://huggingface.co/CodeSoulco/THETA}
+ doi={10.48550/arXiv.2603.05972}
  }
- ```
+ ```
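
---

Editor's note on the use cases the card lists (semantic search, similarity computation, retrieval): once document and query embeddings have been produced by the model (e.g. via `transformers` with the LoRA adapter applied through `peft`), retrieval reduces to ranking by cosine similarity over the vectors. The sketch below uses made-up 3-dimensional vectors purely for illustration; real outputs would be 896-dimensional (0.6B) or 2560-dimensional (4B), and the embedding values shown are not actual THETA outputs.

```python
import math

# Hypothetical, pre-computed document embeddings (stand-ins for real THETA
# outputs; real vectors are 896-dim for the 0.6B model, 2560-dim for the 4B).
docs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.1],
    "doc_c": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]


def cosine(u, v):
    """Cosine similarity: dot product of u and v over the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Rank documents by similarity to the query (higher = more similar).
ranking = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranking[0])  # doc_a: its vector points closest to the query's direction
```

The same ranking logic underlies the clustering and RAG use cases mentioned in the card: clustering groups vectors by pairwise similarity, and RAG feeds the top-ranked documents to a generator.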