Reubencf committed on
Commit ff4f3fa · verified · 1 Parent(s): 34bf0cd

Deploy interactive dataset explorer (static SDK)

Replaces the docker-based markdown README with the static HTML dashboard. Donut charts (D3), Voronoi language treemap, GSAP animations.
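
The donut renderer in index.html supports three slice-sizing modes. Its "log" mode is implemented as a power compression, `Math.pow(v + 1, 0.38)`, so that small datasets stay visible next to the multi-million-row pools. The following is a minimal sketch of that effect using three row counts from the dataset catalog; the `shares` helper is hypothetical and not part of the commit, and the sketch ignores the pad angle between slices.

```javascript
// Row counts copied from the DATASETS catalog in index.html.
const rows = {
  PolyglotText: 13400000,
  PolyglotAudio: 1200000,
  "frontend-coding": 500,
};

// Same transforms as sizeValue() inside renderDonut().
const transforms = {
  linear: v => v,
  sqrt: v => Math.sqrt(v + 1),
  log: v => Math.pow(v + 1, 0.38), // power compression, not a true logarithm
};

// Hypothetical helper: fraction of the ring each dataset gets under a mode.
function shares(mode) {
  const sized = Object.entries(rows).map(([name, v]) => [name, transforms[mode](v)]);
  const total = sized.reduce((sum, [, s]) => sum + s, 0);
  return Object.fromEntries(sized.map(([name, s]) => [name, s / total]));
}

for (const mode of ["linear", "sqrt", "log"]) {
  const line = Object.entries(shares(mode))
    .map(([name, f]) => `${name}=${(f * 100).toFixed(3)}%`)
    .join("  ");
  console.log(mode.padEnd(6), line);
}
```

With only these three datasets, the 500-row slice goes from well under 0.01% of the ring in linear mode to more than 1% in "log" mode, while PolyglotText remains the largest slice in every mode.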

Files changed (4)
  1. .gitattributes +1 -0
  2. README.md +9 -57
  3. Reubensdataset.png +3 -0
  4. index.html +1376 -0
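
index.html, the largest file in this commit, abbreviates row counts with small formatting helpers. The sketch below copies `formatShort` from the commit so it can run on its own, then feeds it a few row counts taken from the dataset catalog. The sample calls are illustrative only and say nothing about where the page itself calls the helper.

```javascript
// Copied from index.html: abbreviate a row count with a k or M suffix,
// dropping a trailing ".0" so 29000 reads "29k" rather than "29.0k".
function formatShort(n) {
  if (n >= 1_000_000) return (n / 1_000_000).toFixed(1).replace(/\.0$/, '') + 'M';
  if (n >= 1_000) return (n / 1_000).toFixed(1).replace(/\.0$/, '') + 'k';
  return String(n);
}

console.log(formatShort(464));      // "464"   (below 1,000: unchanged)
console.log(formatShort(29000));    // "29k"   (trailing ".0" removed)
console.log(formatShort(68677));    // "68.7k" (rounded to one decimal)
console.log(formatShort(13400000)); // "13.4M"
```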
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ Reubensdataset.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,66 +1,18 @@
  ---
  title: README
- emoji: 🏒
+ emoji: "\U0001F4CA"
  colorFrom: purple
  colorTo: yellow
- sdk: docker
+ sdk: static
  pinned: false
  ---

  # Reuben Data Lab

- > 🏆 Work here was produced for the
- > **[Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)**
- > hosted by **[Adaption Labs](https://www.adaptionlabs.ai)** - credit to
- > **Adaptive Data by Adaption** for organizing the hackathon.
-
- Building **open, underserved datasets** for training and evaluating modern
- audio, speech, and multimodal models. Every release is open-sourced on
- Hugging Face with permissive licensing and rich metadata, targeting the three
- criteria the Uncharted Data Challenge cares about: **under-served problem
- domains**, **scarce open-source data**, and **under-resourced languages**.
-
- ## Datasets
-
- ### 🎵 [FMA Labeled - Multi-Attribute Music Dataset](https://huggingface.co/datasets/Reubencf/fma-labeled)
- 29k Creative-Commons tracks from the Free Music Archive, automatically
- annotated with **lyrics, genre, sub-genres, mood, instruments, BPM, key,
- vocal type, energy, era, and audio quality** using `gemini-flash-latest`.
- Paired audio + text for music tagging, music-LM training, and auto-lyric
- research.
-
- ### 🗣️ [Multilingual Synthetic TTS (Qwen3)](https://huggingface.co/datasets/Reubencf/multilingual-synthetic-tts)
- ~69k synthetic speech clips across **9 languages** (en, ja, zh, ko, de, es,
- fr, ru, pt) generated with Qwen3-TTS-12Hz via zero-shot voice cloning from a
- rotating pool of reference speakers. Covers conversational, informational,
- technical, emotional, and proverb-style utterances - useful for TTS
- fine-tuning, ASR augmentation, and cross-lingual voice-conversion research.
-
- ## Focus Areas
-
- - **Under-resourced languages** - expanding speech and text coverage beyond
- English-only datasets.
- - **Rich supervision** - datasets ship with detailed structured metadata
- (genre/mood/BPM/key for music; language/style/voice for speech), not just
- audio + class labels.
- - **Permissive licensing** - Creative Commons / CC0 where possible; synthetic
- outputs released for open research.
- - **Reproducibility** - generation pipelines and labeling scripts are
- open-sourced alongside the data.
-
- ## Tooling & Pipeline
-
- - **Labeling**: Google Gemini (`gemini-flash-latest`) via Flex and Batch APIs.
- - **Speech synthesis**: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with zero-shot
- voice cloning.
- - **Infra**: Hyperbolic GPU rentals, custom stall-watchers for long-running
- multi-GPU jobs, Hugging Face Hub for distribution.
-
- ## Get In Touch
-
- - Hugging Face: [@Reubencf](https://huggingface.co/Reubencf)
- - Datasets home: [ReubenDataLab](https://huggingface.co/ReubenDataLab)
-
- ---
-
- *More datasets coming soon as part of the Uncharted Data Challenge submission.*
+ An interactive landing for the [ReubenDataLab](https://huggingface.co/ReubenDataLab)
+ dataset organization. The raw open-source corpus is visualized as a donut
+ chart alongside the
+ [Adaption-remastered](https://huggingface.co/collections/Reubencf/proper-adaption)
+ versions, with a Voronoi treemap showing every language that appears
+ across the corpus. Click any slice or cell to drill into the dataset /
+ language details.
 
Reubensdataset.png ADDED

Git LFS Details

  • SHA256: ac90374fa00abc80c56ef84153e7aa69239012ce1006752ccbb4d8722c59c54e
  • Pointer size: 132 Bytes
  • Size of remote file: 3.57 MB
index.html ADDED
@@ -0,0 +1,1376 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8" />
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
6
+ <title>ReubenDataLab · Dataset Explorer</title>
7
+
8
+ <link rel="preconnect" href="https://fonts.googleapis.com">
9
+ <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
10
+ <link href="https://fonts.googleapis.com/css2?family=Geist:wght@100..900&family=Google+Sans:ital,opsz,wght@0,17..18,400..700;1,17..18,400..700&display=swap" rel="stylesheet">
11
+
12
+ <script src="https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js"></script>
13
+ <script src="https://cdn.jsdelivr.net/npm/d3-weighted-voronoi@1"></script>
14
+ <script src="https://cdn.jsdelivr.net/npm/d3-voronoi-map@2"></script>
15
+ <script src="https://cdn.jsdelivr.net/npm/d3-voronoi-treemap@1"></script>
16
+ <script src="https://cdnjs.cloudflare.com/ajax/libs/gsap/3.12.5/gsap.min.js"></script>
17
+
18
+ <style>
19
+ :root {
20
+ --bg: #000000;
21
+ --fg: #ffffff;
22
+ --muted: #8a8a94;
23
+ --card: #141414;
24
+ --card-alt: #1c1c1e;
25
+ --border: #262626;
26
+ --divider: #2e2e2e;
27
+ --tooltip-bg: rgba(20, 20, 20, 0.96);
28
+
29
+ --palette-1: #3b82f6;
30
+ --palette-2: #10b981;
31
+ --palette-3: #ef4444;
32
+ --palette-4: #f59e0b;
33
+ --palette-5: #8b5cf6;
34
+ --palette-6: #ec4899;
35
+ --palette-7: #06b6d4;
36
+ --palette-8: #84cc16;
37
+ --palette-9: #f97316;
38
+ --palette-10: #14b8a6;
39
+ --palette-11: #a855f7;
40
+ --palette-12: #eab308;
41
+ }
42
+
43
+ * { box-sizing: border-box; }
44
+ html, body {
45
+ margin: 0; padding: 0;
46
+ background: var(--bg);
47
+ color: var(--fg);
48
+ font-family: "Geist", "Google Sans", -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif;
49
+ font-weight: 400;
50
+ min-height: 100vh;
51
+ -webkit-font-smoothing: antialiased;
52
+ letter-spacing: 0.005em;
53
+ }
54
+
55
+ a { color: var(--fg); text-decoration: none; }
56
+ a:hover { opacity: 0.7; }
57
+
58
+ /* Header / hero image */
59
+ header {
60
+ max-width: 1440px;
61
+ margin: 0 auto;
62
+ padding: 32px 24px 8px 24px;
63
+ text-align: center;
64
+ }
65
+ .hero-img {
66
+ display: block;
67
+ max-width: 900px;
68
+ width: 100%;
69
+ height: auto;
70
+ margin: 0 auto;
71
+ border-radius: 14px;
72
+ }
73
+
74
+ /* Hero stats banner */
75
+ .hero-stats {
76
+ max-width: 1440px;
77
+ margin: 24px auto 0 auto;
78
+ padding: 0 24px;
79
+ display: grid;
80
+ grid-template-columns: repeat(5, 1fr);
81
+ gap: 14px;
82
+ }
83
+ .stat {
84
+ background: var(--card);
85
+ border: 1px solid var(--border);
86
+ border-radius: 16px;
87
+ padding: 18px 14px;
88
+ text-align: center;
89
+ }
90
+ .stat .num {
91
+ display: block;
92
+ font-size: 1.75rem;
93
+ font-weight: 700;
94
+ color: var(--fg);
95
+ letter-spacing: -0.015em;
96
+ line-height: 1.05;
97
+ }
98
+ .stat .num .decimal { font-size: 0.55em; font-weight: 500; opacity: 0.75; margin-left: 1px; }
99
+ .stat .lbl {
100
+ display: block;
101
+ font-size: 0.68rem;
102
+ color: var(--muted);
103
+ text-transform: uppercase;
104
+ letter-spacing: 0.13em;
105
+ margin-top: 8px;
106
+ font-weight: 500;
107
+ }
108
+ .stat .sub {
109
+ display: block;
110
+ font-size: 0.6rem;
111
+ color: var(--muted);
112
+ font-weight: 400;
113
+ letter-spacing: 0.04em;
114
+ margin-top: 4px;
115
+ opacity: 0.65;
116
+ text-transform: none;
117
+ }
118
+
119
+ /* Chart sections */
120
+ .charts {
121
+ max-width: 1440px;
122
+ margin: 0 auto;
123
+ display: grid;
124
+ grid-template-columns: 1fr 1fr;
125
+ gap: 24px;
126
+ padding: 24px;
127
+ }
128
+ .chart-card {
129
+ background: var(--card);
130
+ border: 1px solid var(--border);
131
+ border-radius: 20px;
132
+ padding: 24px 20px 16px 20px;
133
+ }
134
+ .chart-card h2 {
135
+ text-align: center;
136
+ margin: 0 0 4px 0;
137
+ font-size: 1.1rem;
138
+ font-weight: 600;
139
+ color: var(--fg);
140
+ letter-spacing: -0.005em;
141
+ }
142
+ .chart-card .subtitle {
143
+ text-align: center;
144
+ margin: 0 0 14px 0;
145
+ font-size: 0.82rem;
146
+ color: var(--muted);
147
+ font-weight: 400;
148
+ }
149
+
150
+ /* Donut */
151
+ .donut-wrap {
152
+ position: relative;
153
+ width: 100%;
154
+ max-width: 560px;
155
+ aspect-ratio: 1;
156
+ margin: 0 auto;
157
+ }
158
+ .donut-wrap.small { max-width: 400px; }
159
+ .donut-svg {
160
+ width: 100%;
161
+ height: 100%;
162
+ display: block;
163
+ overflow: visible;
164
+ }
165
+ .donut-slice { cursor: pointer; transition: filter 0.2s ease; }
166
+ .donut-slice:hover { filter: brightness(1.25) drop-shadow(0 0 10px rgba(255,255,255,0.15)); }
167
+
168
+ .donut-center {
169
+ position: absolute;
170
+ inset: 0;
171
+ display: flex;
172
+ flex-direction: column;
173
+ align-items: center;
174
+ justify-content: center;
175
+ pointer-events: none;
176
+ padding: 18%;
177
+ text-align: center;
178
+ }
179
+ .donut-center.small { padding: 22%; }
180
+ .center-item { width: 100%; }
181
+ .center-label {
182
+ font-size: 0.65rem;
183
+ font-weight: 500;
184
+ color: var(--muted);
185
+ letter-spacing: 0.18em;
186
+ text-transform: uppercase;
187
+ display: flex;
188
+ align-items: center;
189
+ justify-content: center;
190
+ gap: 6px;
191
+ }
192
+ .center-label .icon { font-size: 0.85rem; opacity: 0.9; }
193
+ .center-number {
194
+ font-size: clamp(1.8rem, 4.5vw, 2.75rem);
195
+ font-weight: 700;
196
+ color: var(--fg);
197
+ line-height: 1;
198
+ letter-spacing: -0.03em;
199
+ margin: 4px 0;
200
+ }
201
+ .center-number .decimal {
202
+ font-size: 0.55em;
203
+ font-weight: 500;
204
+ color: var(--fg);
205
+ opacity: 0.72;
206
+ margin-left: 1px;
207
+ }
208
+ .center-divider {
209
+ width: 42%;
210
+ border: none;
211
+ border-top: 1px solid rgba(255, 255, 255, 0.08);
212
+ margin: 10px auto;
213
+ }
214
+
215
+ /* Details card */
216
+ .details {
217
+ max-width: 1440px;
218
+ margin: 0 auto 32px auto;
219
+ padding: 0 24px;
220
+ }
221
+ .details-card {
222
+ background: var(--card);
223
+ border: 1px solid var(--border);
224
+ border-radius: 20px;
225
+ padding: 26px 28px;
226
+ min-height: 140px;
227
+ }
228
+ .details-card h3 {
229
+ margin: 0 0 8px 0;
230
+ font-size: 1.35rem;
231
+ color: var(--fg);
232
+ display: flex;
233
+ align-items: center;
234
+ gap: 12px;
235
+ font-weight: 600;
236
+ letter-spacing: -0.01em;
237
+ }
238
+ .details-card h3 .swatch { display: inline-block; width: 14px; height: 14px; border-radius: 50%; }
239
+ .details-card h3 a { color: var(--fg); font-size: 1.05rem; opacity: 0.85; }
240
+ .details-card h3 a:hover { opacity: 1; text-decoration: underline; }
241
+ .details-card .tagline { color: var(--muted); font-size: 0.95rem; margin: 0 0 18px 0; }
242
+ .kv-grid {
243
+ display: grid;
244
+ grid-template-columns: repeat(auto-fit, minmax(220px, 1fr));
245
+ gap: 12px 24px;
246
+ }
247
+ .kv .k { color: var(--muted); font-size: 0.75rem; text-transform: uppercase; letter-spacing: 0.1em; margin-bottom: 4px; font-weight: 500; }
248
+ .kv .v { color: var(--fg); font-size: 0.9rem; }
249
+ .kv .v a { border-bottom: 1px dashed var(--muted); }
250
+ .kv .v strong { font-weight: 600; }
251
+ .schema-list { display: flex; flex-wrap: wrap; gap: 6px; margin-top: 6px; }
252
+ .schema-list code {
253
+ background: var(--card-alt);
254
+ color: var(--fg);
255
+ padding: 3px 8px;
256
+ border-radius: 6px;
257
+ font-size: 0.78rem;
258
+ font-family: "SF Mono", Consolas, monospace;
259
+ border: 1px solid var(--border);
260
+ }
261
+
262
+ /* Extras (modality + treemap) */
263
+ .extras {
264
+ max-width: 1440px;
265
+ margin: 8px auto 0 auto;
266
+ padding: 0 24px 24px 24px;
267
+ display: grid;
268
+ grid-template-columns: 1fr 2fr;
269
+ gap: 24px;
270
+ }
271
+ .plot-treemap { width: 100%; height: 900px; position: relative; }
272
+ .plot-treemap svg { width: 100%; height: 100%; display: block; }
273
+
274
+ /* Voronoi */
275
+ .voronoi-cell {
276
+ cursor: pointer;
277
+ transition: filter 0.18s ease, opacity 0.18s ease;
278
+ }
279
+ .voronoi-cell:hover { filter: brightness(1.35) drop-shadow(0 0 8px rgba(255,255,255,0.35)); }
280
+ .voronoi-label {
281
+ font-family: "Geist", "Google Sans", sans-serif;
282
+ font-weight: 600;
283
+ fill: #ffffff;
284
+ pointer-events: none;
285
+ text-anchor: middle;
286
+ user-select: none;
287
+ }
288
+ .voronoi-label .code { font-weight: 400; opacity: 0.8; fill: #ffffff; }
289
+ .voronoi-tooltip {
290
+ position: absolute;
291
+ pointer-events: none;
292
+ background: var(--tooltip-bg);
293
+ border: 1px solid var(--border);
294
+ border-radius: 10px;
295
+ padding: 10px 14px;
296
+ font-size: 0.85rem;
297
+ color: var(--fg);
298
+ box-shadow: 0 12px 32px rgba(0,0,0,0.7);
299
+ opacity: 0;
300
+ transition: opacity 0.12s ease;
301
+ white-space: nowrap;
302
+ z-index: 20;
303
+ font-family: "Geist", sans-serif;
304
+ }
305
+ .voronoi-tooltip .t-name { font-weight: 700; color: var(--fg); font-size: 0.95rem; }
306
+ .voronoi-tooltip .t-code { color: var(--muted); font-size: 0.72rem; margin-left: 4px; }
307
+ .voronoi-tooltip .t-rows { color: var(--fg); font-weight: 600; margin-top: 4px; opacity: 0.9; }
308
+
309
+ /* Donut tooltip (shared style) */
310
+ .donut-tooltip {
311
+ position: fixed;
312
+ pointer-events: none;
313
+ background: var(--tooltip-bg);
314
+ border: 1px solid var(--border);
315
+ border-radius: 10px;
316
+ padding: 10px 14px;
317
+ font-size: 0.85rem;
318
+ color: var(--fg);
319
+ box-shadow: 0 12px 32px rgba(0,0,0,0.7);
320
+ opacity: 0;
321
+ transition: opacity 0.12s ease;
322
+ white-space: nowrap;
323
+ z-index: 50;
324
+ font-family: "Geist", sans-serif;
325
+ }
326
+ .donut-tooltip .t-name { font-weight: 700; font-size: 0.95rem; }
327
+ .donut-tooltip .t-meta { color: var(--muted); font-size: 0.78rem; margin-top: 4px; }
328
+
329
+ footer {
330
+ max-width: 1440px;
331
+ margin: 0 auto 32px auto;
332
+ padding: 0 24px;
333
+ text-align: center;
334
+ color: var(--muted);
335
+ font-size: 0.8rem;
336
+ font-weight: 400;
337
+ }
338
+ footer a { border-bottom: 1px dashed var(--muted); }
339
+
340
+ @media (max-width: 900px) {
341
+ .hero-stats { grid-template-columns: repeat(2, 1fr); }
342
+ .extras { grid-template-columns: 1fr; }
343
+ }
344
+ @media (max-width: 780px) {
345
+ .charts { grid-template-columns: 1fr; }
346
+ }
347
+ </style>
348
+ </head>
349
+ <body>
350
+
351
+ <header>
352
+ <img src="Reubensdataset.png" alt="Reuben's Data Lab" class="hero-img" />
353
+ </header>
354
+
355
+ <section class="hero-stats">
356
+ <div class="stat">
357
+ <span class="num" data-value="12"></span>
358
+ <span class="lbl">Raw datasets</span>
359
+ <span class="sub">in four HF collections</span>
360
+ </div>
361
+ <div class="stat">
362
+ <span class="num" data-value="14.8M"></span>
363
+ <span class="lbl">Total rows</span>
364
+ <span class="sub">every row, every dataset</span>
365
+ </div>
366
+ <div class="stat">
367
+ <span class="num" data-value="130+"></span>
368
+ <span class="lbl">Languages</span>
369
+ <span class="sub">many rarely seen online</span>
370
+ </div>
371
+ <div class="stat">
372
+ <span class="num" data-value="4"></span>
373
+ <span class="lbl">Modalities</span>
374
+ <span class="sub">audio, text, images, code</span>
375
+ </div>
376
+ <div class="stat">
377
+ <span class="num" data-value="17"></span>
378
+ <span class="lbl">Days to build</span>
379
+ <span class="sub">April 8 to April 24, 2026</span>
380
+ </div>
381
+ </section>
382
+
383
+ <section class="charts">
384
+ <div class="chart-card">
385
+ <h2>Raw corpus</h2>
386
+ <div class="subtitle">Every dataset I've created in the <a href="https://huggingface.co/ReubenDataLab/collections" target="_blank" rel="noopener">ReubenDataLab collections</a></div>
387
+ <div class="donut-wrap">
388
+ <svg id="chart-raw" class="donut-svg"></svg>
389
+ <div class="donut-center" id="center-raw"></div>
390
+ </div>
391
+ </div>
392
+ <div class="chart-card">
393
+ <h2>Adaption-remastered</h2>
394
+ <div class="subtitle">Improved datasets after running them through <a href="https://adaptionlabs.ai" target="_blank" rel="noopener">adaptionlabs.ai</a></div>
395
+ <div class="donut-wrap">
396
+ <svg id="chart-adaption" class="donut-svg"></svg>
397
+ <div class="donut-center" id="center-adaption"></div>
398
+ </div>
399
+ </div>
400
+ </section>
401
+
402
+ <div class="details">
403
+ <div id="details-card" class="details-card" style="display: none;"></div>
404
+ </div>
405
+
406
+ <section class="extras">
407
+ <div class="chart-card">
408
+ <h2>Modality split</h2>
409
+ <div class="subtitle">Share of the corpus by data type</div>
410
+ <div class="donut-wrap small">
411
+ <svg id="chart-modality" class="donut-svg"></svg>
412
+ <div class="donut-center small" id="center-modality"></div>
413
+ </div>
414
+ </div>
415
+ <div class="chart-card">
416
+ <h2>Languages across the corpus</h2>
417
+ <div class="subtitle">Every language that appears in any raw dataset, sized (log-scale) by total row count. Hover for exact numbers.</div>
418
+ <div id="chart-treemap" class="plot-treemap">
419
+ <div id="voronoi-tooltip" class="voronoi-tooltip"></div>
420
+ </div>
421
+ </div>
422
+ </section>
423
+
424
+ <div id="donut-tooltip" class="donut-tooltip"></div>
425
+
426
+ <footer>
427
+ Data self-reported from HF dataset pages · Built for the
428
+ <a href="https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge" target="_blank">Uncharted Data Challenge</a>
429
+ · Author <a href="https://huggingface.co/Reubencf" target="_blank">@Reubencf</a>
430
+ </footer>
431
+
432
+ <script>
433
+ // ===========================================================================
434
+ // PALETTE - pulled from CSS variables so tokens stay single-source-of-truth
435
+ // ===========================================================================
436
+ const CSS = getComputedStyle(document.documentElement);
437
+ const PALETTE = [
438
+ CSS.getPropertyValue('--palette-1').trim(),
439
+ CSS.getPropertyValue('--palette-2').trim(),
440
+ CSS.getPropertyValue('--palette-3').trim(),
441
+ CSS.getPropertyValue('--palette-4').trim(),
442
+ CSS.getPropertyValue('--palette-5').trim(),
443
+ CSS.getPropertyValue('--palette-6').trim(),
444
+ CSS.getPropertyValue('--palette-7').trim(),
445
+ CSS.getPropertyValue('--palette-8').trim(),
446
+ CSS.getPropertyValue('--palette-9').trim(),
447
+ CSS.getPropertyValue('--palette-10').trim(),
448
+ CSS.getPropertyValue('--palette-11').trim(),
449
+ CSS.getPropertyValue('--palette-12').trim(),
450
+ ];
451
+
452
+ /** Brighten an HSL-mapped color for the inner tick mark (and voronoi hovers). */
453
+ function luminousVariant(hex, lightnessBoost = 0.4, saturationBoost = 0.12) {
454
+ const c = d3.hsl(hex);
455
+ c.l = Math.min(0.9, c.l + lightnessBoost);
456
+ c.s = Math.min(1, c.s + saturationBoost);
457
+ return c.formatHex();
458
+ }
459
+
460
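+ /** Format a row count as HTML: the whole part, then a <span class="decimal"> with the fractional digit and an M or k suffix (counts under 1,000 are returned unchanged). */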
461
+ function formatRows(n) {
462
+ if (n >= 1_000_000) {
463
+ const v = n / 1_000_000;
464
+ const [whole, frac] = v.toFixed(1).split('.');
465
+ return `${whole}<span class="decimal">.${frac}M</span>`;
466
+ }
467
+ if (n >= 1_000) {
468
+ const v = n / 1_000;
469
+ const [whole, frac] = v.toFixed(1).split('.');
470
+ return `${whole}<span class="decimal">.${frac}k</span>`;
471
+ }
472
+ return String(n);
473
+ }
474
+ function formatShort(n) {
475
+ if (n >= 1_000_000) return (n / 1_000_000).toFixed(1).replace(/\.0$/, '') + 'M';
476
+ if (n >= 1_000) return (n / 1_000).toFixed(1).replace(/\.0$/, '') + 'k';
477
+ return String(n);
478
+ }
479
+
480
+ // Upgrade hero stat numbers to support a decimal span
481
+ document.querySelectorAll('.stat .num').forEach(el => {
482
+ const raw = el.getAttribute('data-value');
483
+ const m = raw && raw.match(/^([~≈]?[\d,]+)(\.[\d]+)?([A-Za-z+]+)?$/);
484
+ if (!m) { el.textContent = raw || ''; return; }
485
+ const [, whole, frac = '', suffix = ''] = m;
486
+ el.innerHTML = frac || suffix
487
+ ? `${whole}<span class="decimal">${frac}${suffix}</span>`
488
+ : whole;
489
+ });
490
+
491
+ // ===========================================================================
492
+ // DATASET CATALOG
493
+ // ===========================================================================
494
+ const DATASETS = [
495
+ {
496
+ key: "speech",
497
+ title: "Multilingual Synthetic Speech",
498
+ tagline: "Zero-shot voice cloning with Qwen3-TTS across 9 languages",
499
+ raw: { repo: "Reubencf/multilingual-synthetic-tts", rows: 68677 },
500
+ adaption: { repo: "Reubencf/Adaption-multilingual-speech", rows: 10274 },
501
+ languages: "en, ja, zh, ko, de, es, fr, ru, pt",
502
+ modality: "audio + text",
503
+ license: "open / synthetic",
504
+ schema: ["audio", "text", "language", "language_name", "style", "voice", "sample_rate"],
505
+ model: "Qwen3-TTS-12Hz-1.7B-Base",
506
+ group: "paired"
507
+ },
508
+ {
509
+ key: "sentences",
510
+ title: "Multilingual Sentences (text-only)",
511
+ tagline: "Text projection of the TTS corpus - ready for Adaption",
512
+ raw: null,
513
+ adaption: { repo: "Reubencf/Adaption-multilingual-sentences", rows: 10000 },
514
+ languages: "ja, ru, ko, de, es, pt, zh, en, fr + 114 more",
515
+ modality: "text",
516
+ license: "open",
517
+ schema: ["text", "enhanced_prompt", "enhanced_completion", "language", "voice", "style"],
518
+ group: "paired"
519
+ },
520
+ {
521
+ key: "music",
522
+ title: "Music - FMA Labeled",
523
+ tagline: "Creative-Commons music tracks with lyrics, genre, mood, BPM, key",
524
+ raw: { repo: "Reubencf/fma-labeled", rows: 29000 },
525
+ adaption: { repo: "Reubencf/adaption-music-style-prompts", rows: 9950 },
526
+ languages: "en",
527
+ modality: "audio + text",
528
+ license: "CC-BY / CC0 (source-dependent)",
529
+ schema: ["audio", "lyrics", "genre", "sub_genres", "mood", "instruments", "bpm", "key", "vocal_type", "energy", "era", "quality"],
530
+ model: "gemini-flash-latest",
531
+ group: "paired"
532
+ },
533
+ {
534
+ key: "street",
535
+ title: "StreetView Global",
536
+ tagline: "Globally-sampled Mapillary street images with scene classification",
537
+ raw: { repo: "Reubencf/streetview-global", rows: 30000 },
538
+ adaption: { repo: "Reubencf/adaption-street-scene-descriptions", rows: 10100 },
539
+ languages: "en",
540
+ modality: "image + text",
541
+ license: "CC-BY-SA-4.0",
542
+ schema: ["image", "scene_description", "setting", "weather", "time_of_day", "road_type", "infrastructure", "lat", "lon", "compass"],
543
+ group: "paired"
544
+ },
545
+ {
546
+ key: "magazines",
547
+ title: "Magazines Multilingual VQA",
548
+ tagline: "Public-domain magazine OCR in 40+ source languages (including low-resource)",
549
+ raw: { repo: "Reubencf/magazines-multilingual-vqa", rows: 29039 },
550
+ adaption: { repo: "Reubencf/adaption-multilingual-doc-qa", rows: 8800 },
551
+ languages: "ar, de, en, es, fr, hi, it, ja, pt, zh + 35 more (Afrikaans, Amharic, Yoruba, Yiddish, Bengali, Santali, Somali, Vietnamese, Russian, Maithili, Tibetan, …)",
552
+ modality: "image + text",
553
+ license: "CC-BY-4.0",
554
+ schema: ["image", "ocr_text", "english_description", "question", "answer", "target_language", "page_type"],
555
+ model: "Gemma 4 31B via vLLM",
556
+ group: "paired"
557
+ },
558
+ {
559
+ key: "lowresource",
560
+ title: "Low-Resource Doc Q/A",
561
+ tagline: "Low-resource-language slice of the magazines corpus",
562
+ raw: null,
563
+ adaption: { repo: "Reubencf/Adaption-low-resource-doc-qa", rows: 10200 },
564
+ languages: "Afrikaans, Amharic, Yoruba, Yiddish, Bengali, Santali, Somali, Vietnamese, Maithili, Tigrinya, Meitei, Lao, …",
565
+ modality: "image + text",
566
+ license: "CC-BY-4.0",
567
+ schema: ["image", "ocr_text", "question", "answer", "source_language"],
568
+ group: "paired"
569
+ },
570
+ {
571
+ key: "captions",
572
+ title: "Multilingual Image Captions",
573
+ tagline: "English + multilingual captions with bounding-box visualizations",
574
+ raw: { repo: "Reubencf/multilingual-image-annotations", rows: 464 },
575
+ adaption: { repo: "Reubencf/adaption-multilingual-image-captions", rows: 462 },
576
+ languages: "en, es, fr, hi, zh, ar, pt",
577
+ modality: "image + text",
578
+ license: "CC-BY-4.0",
579
+ schema: ["image", "boxed_image", "description_en", "descriptions", "vqa", "detections"],
580
+ model: "Gemma 4 31B",
581
+ group: "paired"
582
+ },
583
+ {
584
+ key: "frontend",
585
+ title: "Frontend Coding",
586
+ tagline: "Hand-curated HTML / Tailwind / JS prompts and completions",
587
+ raw: { repo: "Reubencf/frontend-coding", rows: 500 },
588
+ adaption: { repo: "Reubencf/frontend-html-tailwind-js", rows: 145 },
589
+ languages: "en",
590
+ modality: "text (code)",
591
+ license: "MIT",
592
+ schema: ["prompt", "previous_code", "code", "reasoning"],
593
+ group: "paired"
594
+ },
595
+ {
596
+ key: "news2026",
597
+ title: "Current Affairs 2026",
598
+ tagline: "2026 Wikipedia current-events Q/A with RAG grounding (through Apr 9, 2026)",
599
+ raw: { repo: "Reubencf/future-news-events-2026", rows: 5447 },
600
+ adaption: { repo: "Reubencf/current-affairs-2026", rows: 5339 },
601
+ languages: "en",
602
+ modality: "text",
603
+ license: "open",
604
+ schema: ["question", "answer", "enhanced_prompt", "enhanced_completion", "reasoning_trace", "date", "event_id", "section", "source"],
605
+ model: "Cohere Command R + RAG",
606
+ group: "paired"
607
+ },
608
+ {
609
+ key: "news2025",
610
+ title: "Current Affairs 2025",
611
+ tagline: "2025 global events Q/A",
612
+ raw: { repo: "Reubencf/2025_events", rows: 5390 },
613
+ adaption: { repo: "Reubencf/current-affairs-2025", rows: 5390 },
614
+ languages: "en",
615
+ modality: "text",
616
+ license: "open",
617
+ schema: ["question", "answer", "enhanced_prompt", "enhanced_completion"],
618
+ group: "paired"
619
+ },
620
+ {
621
+ key: "news2024",
622
+ title: "Current Affairs 2024",
623
+ tagline: "2024 global events Q/A",
624
+ raw: { repo: "Reubencf/2024_events", rows: 5190 },
625
+ adaption: { repo: "Reubencf/current-affairs-2024", rows: 5190 },
626
+ languages: "en",
627
+ modality: "text",
628
+ license: "open",
629
+ schema: ["question", "answer", "enhanced_prompt", "enhanced_completion"],
630
+ group: "paired"
631
+ },
632
+ {
633
+ key: "news2023",
634
+ title: "Current Affairs 2023",
635
+ tagline: "2023 global events Q/A",
636
+ raw: { repo: "Reubencf/2023_events", rows: 4667 },
637
+ adaption: { repo: "Reubencf/current-affairs-2023", rows: 4667 },
638
+ languages: "en",
639
+ modality: "text",
640
+ license: "open",
641
+ schema: ["question", "answer", "enhanced_prompt", "enhanced_completion"],
642
+ group: "paired"
643
+ },
644
+ // Pre-training pools - now included in the raw donut too.
645
+ {
646
+ key: "polyaudio",
647
+ title: "PolyglotAudio",
648
+ tagline: "Broad multilingual audio pre-training pool",
649
+ raw: { repo: "Reubencf/PolyglotAudio", rows: 1200000 },
650
+ adaption: null,
651
+ languages: "multilingual",
652
+ modality: "audio + text",
653
+ license: "open",
654
+ schema: ["audio", "text", "language"],
655
+ group: "paired"
656
+ },
657
+ {
658
+ key: "polytext",
659
+ title: "PolyglotText",
660
+ tagline: "Large multilingual text pre-training pool",
661
+ raw: { repo: "Reubencf/PolyglotText", rows: 13400000 },
662
+ adaption: null,
663
+ languages: "multilingual",
664
+ modality: "text",
665
+ license: "open",
666
+ schema: ["text", "language"],
667
+ group: "paired"
668
+ },
669
+ ];
670
+
671
+ // Stable color per dataset key - cycles through PALETTE
672
+ const datasetColor = d3.scaleOrdinal(PALETTE).domain(DATASETS.map(d => d.key));
673
+ DATASETS.forEach(d => { d.color = datasetColor(d.key); });
674
+
675
+ // ===========================================================================
676
+ // DONUT CHART (D3 - used for both Raw/Adaption donuts and the modality donut)
677
+ // ===========================================================================
678
+ const tooltipEl = document.getElementById('donut-tooltip');
679
+
680
+ // Per-SVG selection state for drill-down (scale-up selected, dim others).
681
+ const donutState = new Map(); // svgId -> { selectedKey, paths, arcGen }
682
+
683
+ function renderDonut({ svgId, centerId, field, datasets, getValue, getKey, getTitle, getColor, getMeta, colorScale, topLabel, bottomLabel, topIcon, bottomIcon, sizing = 'linear' }) {
684
+ const svg = d3.select('#' + svgId);
685
+ svg.selectAll('*').remove();
686
+
687
+ const bbox = svg.node().getBoundingClientRect();
688
+ const size = Math.min(bbox.width, bbox.height);
689
+ const outerR = size / 2 - 6;
690
+ const innerR = outerR * 0.62;
691
+ svg.attr('viewBox', `${-size / 2} ${-size / 2} ${size} ${size}`);
692
+
693
+ const filtered = datasets.filter(d => getValue(d) > 0);
694
+ const total = d3.sum(filtered, getValue);
695
+ const count = filtered.length;
696
+
697
+ // Sizing strategy for the arc:
698
+ // "linear": true proportions (small slices can vanish)
699
+ // "log": power-compressed ((v + 1)^0.38, not a true logarithm) so tiny
700
+ // datasets stay visible while the big ones (PolyglotText 13M+,
701
+ // PolyglotAudio 1M+) still read as clearly the largest slices
702
+ // "sqrt": lighter square-root compression
703
+ // Tooltip + center numbers always show real values.
704
+ const sizeValue = d => {
705
+ const v = getValue(d);
706
+ if (sizing === 'log') return Math.pow(v + 1, 0.38);
707
+ if (sizing === 'sqrt') return Math.sqrt(v + 1);
708
+ return v;
709
+ };
710
+
711
+ const pie = d3.pie().value(sizeValue).sort(null).padAngle(0.022);
712
+ const arcs = pie(filtered);
713
+ const arcGen = d3.arc().innerRadius(innerR).outerRadius(outerR).cornerRadius(3);
714
+ const resolveColor = getColor || (x => datasetColor(getKey(x)));
715
+
716
+ const g = svg.append('g');
717
+
718
+ // Slice paths
719
+ const paths = g.selectAll('path')
720
+ .data(arcs)
721
+ .join('path')
722
+ .attr('class', 'donut-slice')
723
+ .attr('fill', d => resolveColor(d.data))
724
+ .attr('stroke', '#000000')
725
+ .attr('stroke-width', 2)
726
+ .attr('stroke-linejoin', 'round');
727
+
728
+ // Radial sweep: interpolate endAngle from startAngle → target, so arcs
729
+ // literally grow around the ring from 0° of arc to their full sweep.
730
+ paths.each(function (d) {
731
+ const [cx, cy] = arcGen.centroid(d);
732
+ this._centroid = [cx, cy];
733
+ this._current = { startAngle: d.startAngle, endAngle: d.startAngle, padAngle: d.padAngle };
734
+ });
735
+ paths.transition()
736
+ .delay((d, i) => i * 80)
737
+ .duration(1100)
738
+ .ease(d3.easeCubicOut)
739
+ .attrTween('d', function (d) {
740
+ const interp = d3.interpolate(this._current, d);
741
+ this._current = interp(1);
742
+ return t => arcGen(interp(t));
743
+ });
744
+
745
+
746
+
747
+ // Hover tooltip + click drill-down
748
+ paths
749
+ .on('mouseenter', function (ev, d) {
750
+ const nm = getTitle(d.data);
751
+ const v = getValue(d.data);
752
+ const meta = getMeta ? getMeta(d.data) : '';
753
+ tooltipEl.innerHTML =
754
+ `<div class="t-name">${nm}</div>` +
755
+ `<div class="t-meta">${v.toLocaleString()} rows` +
756
+ (meta ? ` · ${meta}` : '') + `</div>`;
757
+ gsap.to(tooltipEl, { opacity: 1, duration: 0.15, overwrite: true });
758
+ })
759
+ .on('mousemove', function (ev) {
760
+ tooltipEl.style.left = (ev.clientX + 14) + 'px';
761
+ tooltipEl.style.top = (ev.clientY + 14) + 'px';
762
+ })
763
+ .on('mouseleave', function () {
764
+ gsap.to(tooltipEl, { opacity: 0, duration: 0.12, overwrite: true });
765
+ })
766
+ .on('click', function (ev, d) {
767
+ const key = getKey(d.data);
768
+ focusDonutSlice(svgId, this, key);
769
+ if (typeof showDetails === 'function') showDetails(key);
770
+ });
771
+
772
+ // Cache for drill-down reset logic.
773
+ donutState.set(svgId, { paths, arcGen, resolveColor });
774
+
775
+ // Center content: start at 0 and count up with GSAP.
776
+ if (centerId) {
777
+ const centerEl = document.getElementById(centerId);
778
+ const topIconHtml = topIcon ? `<span class="icon">${topIcon}</span>` : '';
779
+ const bottomIconHtml = bottomIcon ? `<span class="icon">${bottomIcon}</span>` : '';
780
+ centerEl.innerHTML =
781
+ `<div class="center-item top">
782
+ <div class="center-label">${topIconHtml}${topLabel}</div>
783
+ <div class="center-number js-count-top">0</div>
784
+ </div>
785
+ <div class="center-item bottom">
786
+ <div class="center-number js-count-bottom">0</div>
787
+ <div class="center-label">${bottomLabel}${bottomIconHtml}</div>
788
+ </div>`;
789
+
790
+ const topEl = centerEl.querySelector('.js-count-top');
791
+ const bottomEl = centerEl.querySelector('.js-count-bottom');
792
+
793
+ const topObj = { v: 0 };
794
+ gsap.to(topObj, {
795
+ v: count,
796
+ duration: 1.0,
797
+ ease: 'power2.out',
798
+ delay: 0.55,
799
+ onUpdate: () => { topEl.textContent = Math.floor(topObj.v); },
800
+ onComplete: () => { topEl.textContent = count; }
801
+ });
802
+
803
+ const bottomObj = { v: 0 };
804
+ gsap.to(bottomObj, {
805
+ v: total,
806
+ duration: 1.6,
807
+ ease: 'power2.out',
808
+ delay: 0.65,
809
+ onUpdate: () => { bottomEl.innerHTML = formatRows(Math.floor(bottomObj.v)); },
810
+ onComplete: () => { bottomEl.innerHTML = formatRows(total); }
811
+ });
812
+
813
+ gsap.from(`#${centerId} .center-label`, { y: 14, opacity: 0, duration: 0.55, ease: 'power3.out', delay: 0.45, stagger: 0.15 });
814
+ gsap.from(`#${centerId} .center-number`, { y: 10, opacity: 0, duration: 0.55, ease: 'power3.out', delay: 0.55, stagger: 0.15 });
815
+
816
+ // Space out the top/bottom blocks since the divider is gone.
817
+ centerEl.querySelector('.center-item.bottom').style.marginTop = '14px';
818
+ }
819
+
820
+ return paths;
821
+ }
822
+
823
+ /** Click drill-down: scale up clicked slice, dim the rest, toggle on re-click. */
824
+ function focusDonutSlice(svgId, clickedEl, clickedKey) {
825
+ const state = donutState.get(svgId);
826
+ if (!state) return;
827
+ const { paths, arcGen } = state;
828
+
829
+ // Toggle off if clicking the already-selected slice
830
+ if (state.selectedKey === clickedKey) {
831
+ resetDonutFocus(svgId);
832
+ return;
833
+ }
834
+ state.selectedKey = clickedKey;
835
+
836
+ paths.nodes().forEach((node, i) => {
837
+ const d = paths.data()[i];
838
+ const isSelected = node === clickedEl;
839
+ if (isSelected) {
840
+ const [cx, cy] = node._centroid || arcGen.centroid(d);
841
+ gsap.to(node, {
842
+ scale: 1.08,
843
+ opacity: 1,
844
+ svgOrigin: `${cx} ${cy}`,
845
+ filter: 'drop-shadow(0 0 14px rgba(255,255,255,0.35)) brightness(1.15)',
846
+ duration: 0.45,
847
+ ease: 'power2.out',
848
+ overwrite: 'auto'
849
+ });
850
+ } else {
851
+ gsap.to(node, {
852
+ scale: 1,
853
+ opacity: 0.3,
854
+ filter: 'none',
855
+ duration: 0.35,
856
+ ease: 'power2.out',
857
+ overwrite: 'auto'
858
+ });
859
+ }
860
+ });
861
+ }
862
+
863
+ function resetDonutFocus(svgId) {
864
+ const state = donutState.get(svgId);
865
+ if (!state) return;
866
+ state.selectedKey = null;
867
+ state.paths.nodes().forEach(node => {
868
+ gsap.to(node, {
869
+ scale: 1, opacity: 1, filter: 'none',
870
+ duration: 0.35, ease: 'power2.out', overwrite: 'auto'
871
+ });
872
+ });
873
+ }
874
+
875
+ // ---- Raw vs Adaption donuts ----
876
+ // The Raw donut shows every dataset that has a raw repo (including the
877
+ // PolyglotText / PolyglotAudio pre-training pools). The Adaption donut shows
878
+ // every dataset with an Adaption-remastered version. renderDonut() filters
879
+ // out zero-value entries automatically.
880
+ renderDonut({
881
+ svgId: 'chart-raw',
882
+ centerId: 'center-raw',
883
+ field: 'raw',
884
+ datasets: DATASETS,
885
+ getValue: d => (d.raw && d.raw.rows) || 0,
886
+ getKey: d => d.key,
887
+ getTitle: d => d.title,
888
+ getMeta: d => d.raw ? d.raw.repo : '',
889
+ topLabel: 'RAW DATASETS',
890
+ bottomLabel: 'ROWS',
891
+ topIcon: '',
892
+ bottomIcon: '',
893
+ sizing: 'log', // compress so tiny datasets still get a visible slice
894
+ });
895
+
896
+ renderDonut({
897
+ svgId: 'chart-adaption',
898
+ centerId: 'center-adaption',
899
+ field: 'adaption',
900
+ datasets: DATASETS,
901
+ getValue: d => (d.adaption && d.adaption.rows) || 0,
902
+ getKey: d => d.key,
903
+ getTitle: d => d.title,
904
+ getMeta: d => d.adaption ? d.adaption.repo : '',
905
+ topLabel: 'ADAPTION SETS',
906
+ bottomLabel: 'ROWS',
907
+ topIcon: '',
908
+ bottomIcon: '',
909
+ });
910
+
911
+ // ---- Modality donut ----
912
+ const MODALITIES = [
913
+ { key: 'text', name: 'Text', count: 5 },
914
+ { key: 'audio', name: 'Audio', count: 3 },
915
+ { key: 'image', name: 'Image', count: 3 },
916
+ { key: 'code', name: 'Code', count: 1 },
917
+ ];
918
+ const modalityColor = d3.scaleOrdinal(PALETTE).domain(MODALITIES.map(m => m.key));
919
+
920
+ renderDonut({
921
+ svgId: 'chart-modality',
922
+ centerId: 'center-modality',
923
+ field: 'count',
924
+ datasets: MODALITIES,
925
+ getValue: d => d.count,
926
+ getKey: d => d.key,
927
+ getTitle: d => d.name,
928
+ getColor: d => modalityColor(d.key),
929
+ getMeta: d => `${d.count} datasets`,
930
+ topLabel: 'MODALITIES',
931
+ bottomLabel: 'DATASETS',
932
+ topIcon: '',
933
+ bottomIcon: '',
934
+ });
935
+
936
+ // ===========================================================================
937
+ // DETAILS CARD: rendered on slice click with GSAP reveal
938
+ // ===========================================================================
939
+ function hideLanguageDetails() {
940
+ // No-op placeholder: the single details card is currently shared, so
941
+ // clicking another cell switches its content. Kept as an explicit symbol for future extension.
942
+ }
943
+
944
+ function showLanguageDetails(langData, color) {
945
+ const card = document.getElementById('details-card');
946
+ card.style.display = '';
947
+
948
+ // Per-dataset breakdown for this language.
949
+ const breakdown = DATASET_LANGS
950
+ .map(d => ({ dataset: d.name, key: d.key, rows: d.langs[langData.code] || 0 }))
951
+ .filter(d => d.rows > 0)
952
+ .sort((a, b) => b.rows - a.rows);
953
+
954
+ const rows = breakdown.map(b =>
955
+ `<div class="kv"><div class="k">${b.dataset}</div><div class="v"><strong>${formatShort(b.rows)}</strong>
956
+ <span style="color:var(--muted);font-size:0.85em">(${b.rows.toLocaleString()})</span></div></div>`
957
+ ).join('');
958
+
959
+ card.innerHTML = `
960
+ <h3>
961
+ <span class="swatch" style="background:${color}"></span>
962
+ ${langData.name} <span style="color:var(--muted);font-weight:400;font-size:0.9rem">(${langData.code})</span>
963
+ </h3>
964
+ <p class="tagline">Total across the raw corpus: <strong>${langData.value.toLocaleString()}</strong> rows.</p>
965
+ <div class="kv-grid">${rows}</div>
966
+ `;
967
+
968
+ gsap.fromTo(card,
969
+ { y: 80, opacity: 0 },
970
+ { y: 0, opacity: 1, duration: 0.75, ease: 'power4.out' }
971
+ );
972
+ gsap.from(card.querySelectorAll('.kv'), {
973
+ y: 18, opacity: 0, duration: 0.45, ease: 'power3.out',
974
+ stagger: 0.05, delay: 0.2
975
+ });
976
+
977
+ card.scrollIntoView({ behavior: 'smooth', block: 'nearest' });
978
+ }
979
+
980
+ function showDetails(key) {
981
+ const d = DATASETS.find(x => x.key === key);
982
+ if (!d) return;
983
+ const card = document.getElementById('details-card');
984
+ card.style.display = '';
985
+
986
+ const repoLink = info => info
987
+ ? `<a href="https://huggingface.co/datasets/${info.repo}" target="_blank" rel="noopener noreferrer">${info.repo}</a>`
988
+ : `<span style="color:var(--muted);">&mdash;</span>`;
989
+ const rowsCell = info => info
990
+ ? `<strong>${formatShort(info.rows)}</strong> <span style="color:var(--muted);font-size:0.85em">(${info.rows.toLocaleString()})</span>`
991
+ : `<span style="color:var(--muted);">&mdash;</span>`;
992
+
993
+ card.innerHTML = `
994
+ <h3>
995
+ <span class="swatch" style="background:${d.color}"></span>
996
+ ${d.title}
997
+ </h3>
998
+ <p class="tagline">${d.tagline}</p>
999
+ <div class="kv-grid">
1000
+ <div class="kv"><div class="k">Raw repo</div><div class="v">${repoLink(d.raw)}</div></div>
1001
+ <div class="kv"><div class="k">Raw rows</div><div class="v">${rowsCell(d.raw)}</div></div>
1002
+ <div class="kv"><div class="k">Adaption repo</div><div class="v">${repoLink(d.adaption)}</div></div>
1003
+ <div class="kv"><div class="k">Adaption rows</div><div class="v">${rowsCell(d.adaption)}</div></div>
1004
+ <div class="kv"><div class="k">Modality</div><div class="v">${d.modality}</div></div>
1005
+ <div class="kv"><div class="k">License</div><div class="v">${d.license}</div></div>
1006
+ ${d.model ? `<div class="kv"><div class="k">Annotator</div><div class="v">${d.model}</div></div>` : ''}
1007
+ <div class="kv"><div class="k">Languages</div><div class="v">${d.languages}</div></div>
1008
+ </div>
1009
+ <div class="kv" style="margin-top:18px;">
1010
+ <div class="k">Schema</div>
1011
+ <div class="schema-list">${d.schema.map(c => `<code>${c}</code>`).join('')}</div>
1012
+ </div>
1013
+ `;
1014
+
1015
+ // Elegant slide-in from the bottom with power4.out.
1016
+ gsap.fromTo(card,
1017
+ { y: 80, opacity: 0 },
1018
+ { y: 0, opacity: 1, duration: 0.75, ease: 'power4.out' }
1019
+ );
1020
+ gsap.from(card.querySelectorAll('.kv'), {
1021
+ y: 18, opacity: 0, duration: 0.45, ease: 'power3.out',
1022
+ stagger: 0.05, delay: 0.2
1023
+ });
1024
+
1025
+ card.scrollIntoView({ behavior: 'smooth', block: 'nearest' });
1026
+ }
1027
+
1028
+ // ===========================================================================
1029
+ // INITIAL PAGE LOAD: hero image → hero stats → chart cards, staggered.
1030
+ // ===========================================================================
1031
+ const loadTl = gsap.timeline();
1032
+ loadTl
1033
+ .from('.hero-img', { y: 20, opacity: 0, duration: 0.9, ease: 'power3.out' })
1034
+ .from('.stat', { y: 20, opacity: 0, duration: 0.7, ease: 'power3.out', stagger: 0.15 }, '-=0.5')
1035
+ .from('.chart-card', { y: 20, opacity: 0, duration: 0.8, ease: 'power3.out', stagger: 0.15 }, '-=0.3');
1036
+
1037
+ // ===========================================================================
1038
+ // LANGUAGE DATA (for the Voronoi treemap)
1039
+ // ===========================================================================
1040
+ const LANG_NAMES = {
1041
+ tr: "Turkish", ru: "Russian", it: "Italian", en: "English", eo: "Esperanto",
1042
+ hu: "Hungarian", de: "German", fr: "French", pt: "Portuguese", mk: "Macedonian",
1043
+ es: "Spanish", he: "Hebrew", fi: "Finnish", ber: "Berber", nl: "Dutch",
1044
+ pl: "Polish", sr: "Serbian", mr: "Marathi", el: "Greek", da: "Danish",
1045
+ cs: "Czech", sv: "Swedish", bg: "Bulgarian", la: "Latin", zh: "Mandarin",
1046
+ ro: "Romanian", ia: "Interlingua", ja: "Japanese", tok: "Toki Pona",
1047
+ lfn: "Lingua Franca Nova", uk: "Ukrainian", tt: "Tatar", tl: "Tagalog",
1048
+ id: "Indonesian", nb: "Norwegian B.", lt: "Lithuanian", az: "Azerbaijani",
1049
+ ie: "Interlingue", tlh: "Klingon", jbo: "Lojban", mhr: "Meadow Mari",
1050
+ bn: "Bengali", fa: "Persian", br: "Breton", ilo: "Ilocano", ar: "Arabic",
1051
+ ceb: "Cebuano", hi: "Hindi", vi: "Vietnamese", pam: "Kapampangan",
1052
+ hy: "Armenian", be: "Belarusian", ko: "Korean", yue: "Cantonese",
1053
+ ca: "Catalan", kab: "Kabyle", af: "Afrikaans", am: "Amharic", yi: "Yiddish",
1054
+ sat: "Santali", so: "Somali", te: "Telugu", ne: "Nepali", pa: "Punjabi",
1055
+ ur: "Urdu", ta: "Tamil", ml: "Malayalam", th: "Thai", or: "Odia",
1056
+ sd: "Sindhi", gu: "Gujarati", kn: "Kannada", my: "Burmese", bo: "Tibetan",
1057
+ lo: "Lao", mni: "Meitei", kk: "Kazakh", oc: "Occitan", hr: "Croatian",
1058
+ sk: "Slovak", et: "Estonian", sl: "Slovenian", is: "Icelandic", ms: "Malay",
1059
+ sq: "Albanian", hsb: "Upper Sorbian", dsb: "Lower Sorbian", mai: "Maithili",
1060
+ kha: "Khasi", dtp: "Kadazan", yo: "Yoruba", sw: "Swahili", cy: "Welsh",
1061
+ ga: "Irish", gd: "Scottish Gaelic", ti: "Tigrinya", os: "Ossetian",
1062
+ sa: "Sanskrit", ug: "Uyghur", uz: "Uzbek", ka: "Georgian", eu: "Basque",
1063
+ vo: "Volapük", ido: "Ido", nov: "Novial", avk: "Kotava", ldn: "Láadan",
1064
+ afh: "Afrihili", lzh: "Classical Chinese", non: "Old Norse", ang: "Old English",
1065
+ grc: "Ancient Greek", sux: "Sumerian", fro: "Old French", cbk: "Chavacano",
1066
+ zsm: "Standard Malay", war: "Waray", kw: "Cornish", nah: "Nahuatl",
1067
+ kek: "Q'eqchi'", hif: "Fiji Hindi", crh: "Crimean Tatar", sah: "Sakha",
1068
+ ext: "Extremaduran", csb: "Kashubian", sgs: "Samogitian", cha: "Chamorro",
1069
+ tvl: "Tuvaluan", mi: "Maori", lin: "Lingala", arq: "Algerian Arabic",
1070
+ arz: "Egyptian Arabic", orv: "Old East Slavic", prg: "Old Prussian",
1071
+ chv: "Chuvash", bar: "Bavarian", pms: "Piedmontese", egl: "Emilian",
1072
+ jav: "Javanese", sun: "Sundanese", hoc: "Ho", zza: "Zaza",
1073
+ rif: "Riffian Berber", nog: "Nogai", km: "Khmer", nds: "Low German",
1074
+ };
1075
+
1076
+ const DATASET_LANGS = [
1077
+ {
1078
+ key: "polytext", name: "PolyglotText",
1079
+ langs: {
1080
+ tr: 1767000, ru: 1695000, it: 1588000, en: 1337000, eo: 1171000,
1081
+ hu: 817000, de: 675000, fr: 520000, pt: 470000, mk: 398000,
1082
+ es: 358000, he: 272000, fi: 263000, ber: 180000, nl: 125000,
1083
+ pl: 118000, sr: 106000, mr: 96000, el: 94000, da: 90000,
1084
+ cs: 72000, sv: 71000, bg: 70000, la: 66000, zh: 58000, ro: 56000,
1085
+ ia: 54000, ja: 43000, tok: 39000, lfn: 38000, uk: 38000, tt: 33000,
1086
+ tl: 31000, id: 31000, nb: 31000, lt: 29000, az: 25000, ie: 24000,
1087
+ tlh: 23000, jbo: 21000, mhr: 19000, bn: 19000, fa: 17000, br: 17000,
1088
+ ilo: 17000, ar: 16000, ceb: 15000, hi: 13000, vi: 11000, pam: 11000,
1089
+ hy: 9000, be: 9000, ko: 9000,
1090
+ cbk: 19000, sk: 8000, vo: 8000, oc: 8000, et: 8000,
1091
+ war: 6700, ms: 6700, hr: 6700, eu: 6700, yi: 5400, af: 5400,
1092
+ km: 4000, ca: 4000, kha: 4000, dtp: 4000, zza: 4000, is: 4000,
1093
+ avk: 4000, ga: 4000, hoc: 4000, sl: 4000, sq: 4000, chv: 4000,
1094
+ kw: 4000, sux: 2700, ang: 2700, pms: 2700, prg: 2700, ug: 2700,
1095
+ lzh: 2700, egl: 2700, ur: 2700, sah: 2700, nds: 2700, mi: 2700,
1096
+ tvl: 1400, cha: 1400, th: 1400, cy: 1400, non: 1400, yo: 1400,
1097
+ lin: 1400, grc: 1400, arq: 1400, orv: 1400, sw: 1400, rif: 1400,
1098
+ crh: 1400, hif: 1400, jav: 1400, sun: 1400, hsb: 1400, dsb: 1400,
1099
+ am: 1400, csb: 1400, sgs: 1400, ext: 1400, nov: 1400, nog: 1400,
1100
+ arz: 1400, nah: 1400, ido: 1400, afh: 1400, kk: 1400,
1101
+ }
1102
+ },
1103
+ {
1104
+ key: "polyaudio", name: "PolyglotAudio",
1105
+ langs: {
1106
+ en: 698000, es: 261000, eo: 105000, de: 32000, fr: 16000,
1107
+ ru: 9200, pl: 8800, ber: 6600, nl: 5900, it: 5900,
1108
+ yue: 4300, pt: 3300, ja: 1400, mr: 1200, ca: 505,
1109
+ cs: 410, zh: 110, fi: 93, hu: 87, uk: 38,
1110
+ he: 16, tok: 5, kab: 5,
1111
+ }
1112
+ },
1113
+ {
1114
+ key: "tts", name: "multilingual-synthetic-tts",
1115
+ langs: {
1116
+ ja: 13951, ru: 9105, de: 8972, ko: 8129, es: 7917,
1117
+ pt: 5438, zh: 5417, en: 5157, fr: 4551,
1118
+ }
1119
+ },
1120
+ {
1121
+ key: "magazines", name: "magazines-multilingual-vqa",
1122
+ langs: {
1123
+ de: 4412, fr: 3279, ru: 2762, pt: 2047, vi: 1637, bn: 1598,
1124
+ en: 1004, af: 826, ar: 156, it: 136, fa: 132, te: 132,
1125
+ ja: 130, ne: 108, pa: 100, ur: 98, nl: 95, tr: 85, zh: 83,
1126
+ ta: 68, ml: 64, id: 43, th: 47, am: 123, yi: 108, sat: 85,
1127
+ so: 25, hi: 15, es: 10, mr: 9, kn: 8, or: 17, sd: 3,
1128
+ mai: 1021, la: 146, bo: 25, be: 8, da: 6, ko: 4, bg: 4,
1129
+ os: 3, sa: 3, my: 3, oc: 1, gd: 1, ti: 1, hy: 1, pl: 1,
1130
+ mni: 1, uk: 1, lo: 1, kk: 1,
1131
+ }
1132
+ },
1133
+ { key: "fma", name: "fma-labeled", langs: { en: 29000 } },
1134
+ { key: "streetview", name: "streetview-global", langs: { en: 30000 } },
1135
+ { key: "current_affairs", name: "current-affairs (raw, 2023-26)", langs: { en: 20694 } },
1136
+ { key: "frontend", name: "frontend-coding", langs: { en: 500 } },
1137
+ {
1138
+ key: "image_ann", name: "multilingual-image-annotations",
1139
+ langs: { en: 464, es: 464, fr: 464, hi: 464, zh: 464, ar: 464, pt: 464 }
1140
+ },
1141
+ ];
1142
+
1143
+ // Aggregate totals across the raw corpus
1144
+ const langTotals = {};
1145
+ for (const d of DATASET_LANGS) {
1146
+ for (const [lang, n] of Object.entries(d.langs)) {
1147
+ langTotals[lang] = (langTotals[lang] || 0) + n;
1148
+ }
1149
+ }
1150
+ const langEntries = Object.entries(langTotals)
1151
+ .sort((a, b) => b[1] - a[1])
1152
+ .map(([code, n]) => ({
1153
+ code,
1154
+ name: LANG_NAMES[code] || code,
1155
+ value: n,
1156
+ sizeValue: Math.log10(n + 10),
1157
+ }));
1158
+
1159
+ // ===========================================================================
1160
+ // VORONOI TREEMAP (D3): palette quantile coloring, black strokes
1161
+ // ===========================================================================
1162
+ (function renderVoronoi() {
1163
+ const container = document.getElementById('chart-treemap');
1164
+ const tooltip = document.getElementById('voronoi-tooltip');
1165
+
1166
+ if (typeof d3 === 'undefined' || !d3.voronoiTreemap) {
1167
+ container.insertAdjacentHTML('beforeend',
1168
+ `<div style="color:#ff8a8a;padding:24px;text-align:center;font-size:0.9rem">
1169
+ Voronoi treemap libraries failed to load. Check your network / CSP.
1170
+ </div>`);
1171
+ return;
1172
+ }
1173
+
1174
+ // Map language size buckets onto PALETTE via quantiles, so cells naturally
1175
+ // cluster by tonal groups.
1176
+ const colorScale = d3.scaleQuantile()
1177
+ .domain(langEntries.map(e => e.sizeValue))
1178
+ .range(PALETTE);
1179
+
1180
+ function draw() {
1181
+ container.querySelectorAll('svg').forEach(s => s.remove());
1182
+
1183
+ const rect = container.getBoundingClientRect();
1184
+ const width = Math.max(320, rect.width);
1185
+ const height = Math.max(320, rect.height);
1186
+
1187
+ // Circular clip polygon
1188
+ const clipPad = 6;
1189
+ const cx = width / 2, cy = height / 2;
1190
+ const r = Math.min(width, height) / 2 - clipPad;
1191
+ const N = 96;
1192
+ const clipPolygon = d3.range(N).map(i => [
1193
+ cx + r * Math.cos((i / N) * 2 * Math.PI),
1194
+ cy + r * Math.sin((i / N) * 2 * Math.PI),
1195
+ ]);
1196
+
1197
+ const root = d3.hierarchy({ name: 'root', children: langEntries })
1198
+ .sum(d => d.sizeValue);
1199
+
1200
+ const treemap = d3.voronoiTreemap()
1201
+ .clip(clipPolygon)
1202
+ .convergenceRatio(0.005)
1203
+ .maxIterationCount(120)
1204
+ .minWeightRatio(0.01);
1205
+ treemap(root);
1206
+
1207
+ const svg = d3.select(container).append('svg')
1208
+ .attr('viewBox', `0 0 ${width} ${height}`)
1209
+ .attr('preserveAspectRatio', 'xMidYMid meet');
1210
+ const g = svg.append('g');
1211
+
1212
+ // Outer circle outline
1213
+ g.append('circle')
1214
+ .attr('cx', cx).attr('cy', cy).attr('r', r + 1)
1215
+ .attr('fill', 'none')
1216
+ .attr('stroke', '#1f1f1f')
1217
+ .attr('stroke-width', 1);
1218
+
1219
+ const leaves = root.leaves();
1220
+
1221
+ const cells = g.selectAll('path.voronoi-cell')
1222
+ .data(leaves)
1223
+ .join('path')
1224
+ .attr('class', 'voronoi-cell')
1225
+ .attr('d', d => 'M' + d.polygon.map(p => p.join(',')).join('L') + 'Z')
1226
+ .attr('fill', d => colorScale(d.data.sizeValue))
1227
+ .attr('stroke', '#000000')
1228
+ .attr('stroke-width', 1.2)
1229
+ .attr('stroke-linejoin', 'round');
1230
+
1231
+ // Record centroid per cell so we can scale from the cell's own center.
1232
+ cells.each(function (leaf) {
1233
+ const [cx, cy] = leaf.site || d3.polygonCentroid(leaf.polygon);
1234
+ this._centroid = [cx, cy];
1235
+ });
1236
+
1237
+ // Mosaic build: fade + scale up from each cell's own centroid, staggered.
1238
+ cells.nodes().forEach((node, i) => {
1239
+ const [cx, cy] = node._centroid;
1240
+ gsap.fromTo(node,
1241
+ { scale: 0, opacity: 0, svgOrigin: `${cx} ${cy}` },
1242
+ { scale: 1, opacity: 1, svgOrigin: `${cx} ${cy}`,
1243
+ duration: 0.55, ease: 'power3.out', delay: i * 0.012 }
1244
+ );
1245
+ });
1246
+
1247
+ // Labels, sized by cell area: smaller cells hide the name, tiny ones only show on hover.
1248
+ function polyArea(pts) {
1249
+ let a = 0;
1250
+ for (let i = 0, n = pts.length; i < n; i++) {
1251
+ const [x1, y1] = pts[i], [x2, y2] = pts[(i + 1) % n];
1252
+ a += x1 * y2 - x2 * y1;
1253
+ }
1254
+ return Math.abs(a) / 2;
1255
+ }
1256
+
1257
+ leaves.forEach(leaf => {
1258
+ const area = polyArea(leaf.polygon);
1259
+ const side = Math.sqrt(area);
1260
+ const [x, y] = leaf.site || d3.polygonCentroid(leaf.polygon);
1261
+ const d = leaf.data;
1262
+
1263
+ if (side >= 44) {
1264
+ const nameSize = Math.max(10, Math.min(18, side / 6));
1265
+ const codeSize = Math.max(9, Math.min(13, side / 9));
1266
+ const text = g.append('text')
1267
+ .datum(leaf.data)
1268
+ .attr('class', 'voronoi-label')
1269
+ .attr('x', x).attr('y', y - 2)
1270
+ .attr('font-size', nameSize);
1271
+ text.append('tspan').text(d.name);
1272
+ text.append('tspan')
1273
+ .attr('class', 'code')
1274
+ .attr('x', x).attr('dy', nameSize * 0.95)
1275
+ .attr('font-size', codeSize)
1276
+ .text(`${d.code} · ${formatShort(d.value)}`);
1277
+ } else if (side >= 22) {
1278
+ const sz = Math.max(8, Math.min(11, side / 3));
1279
+ g.append('text')
1280
+ .datum(leaf.data)
1281
+ .attr('class', 'voronoi-label')
1282
+ .attr('x', x).attr('y', y + sz / 3)
1283
+ .attr('font-size', sz)
1284
+ .text(d.code);
1285
+ }
1286
+ });
1287
+
1288
+ // Hover tooltip
1289
+ cells
1290
+ .on('mouseenter', (ev, d) => {
1291
+ tooltip.innerHTML =
1292
+ `<span class="t-name">${d.data.name}</span>` +
1293
+ `<span class="t-code">(${d.data.code})</span>` +
1294
+ `<div class="t-rows">${d.data.value.toLocaleString()} rows</div>`;
1295
+ gsap.to(tooltip, { opacity: 1, duration: 0.12, overwrite: true });
1296
+ })
1297
+ .on('mousemove', (ev) => {
1298
+ const bb = container.getBoundingClientRect();
1299
+ tooltip.style.left = (ev.clientX - bb.left + 12) + 'px';
1300
+ tooltip.style.top = (ev.clientY - bb.top + 12) + 'px';
1301
+ })
1302
+ .on('mouseleave', () => {
1303
+ gsap.to(tooltip, { opacity: 0, duration: 0.1, overwrite: true });
1304
+ });
1305
+
1306
+ // Click drill-down: highlight the clicked cell, dim the rest, surface a
1307
+ // language detail card below the treemap.
1308
+ let selectedCell = null;
1309
+ cells.on('click', function (ev, d) {
1310
+ const [cx, cy] = this._centroid;
1311
+ const sameAgain = selectedCell === this;
1312
+ if (sameAgain) {
1313
+ // reset
1314
+ selectedCell = null;
1315
+ cells.nodes().forEach(node => {
1316
+ const [ncx, ncy] = node._centroid;
1317
+ gsap.to(node, {
1318
+ scale: 1, opacity: 1,
1319
+ svgOrigin: `${ncx} ${ncy}`,
1320
+ filter: 'none',
1321
+ duration: 0.35, ease: 'power2.out', overwrite: 'auto',
1322
+ });
1323
+ });
1324
+ g.selectAll('.voronoi-label').each(function() {
1325
+ gsap.to(this, { opacity: 1, duration: 0.35, overwrite: 'auto' });
1326
+ });
1327
+ hideLanguageDetails();
1328
+ return;
1329
+ }
1330
+ selectedCell = this;
1331
+ cells.nodes().forEach(node => {
1332
+ const [ncx, ncy] = node._centroid;
1333
+ if (node === this) {
1334
+ gsap.to(node, {
1335
+ scale: 1.1, opacity: 1,
1336
+ svgOrigin: `${ncx} ${ncy}`,
1337
+ filter: 'drop-shadow(0 0 10px rgba(255,255,255,0.45)) brightness(1.35)',
1338
+ duration: 0.45, ease: 'power2.out', overwrite: 'auto',
1339
+ });
1340
+ } else {
1341
+ gsap.to(node, {
1342
+ scale: 1, opacity: 0,
1343
+ svgOrigin: `${ncx} ${ncy}`,
1344
+ filter: 'none',
1345
+ duration: 0.35, ease: 'power2.out', overwrite: 'auto',
1346
+ });
1347
+ }
1348
+ });
1349
+ g.selectAll('.voronoi-label').each(function(ld) {
1350
+ if (ld && ld.code === d.data.code) {
1351
+ gsap.to(this, { opacity: 1, duration: 0.45, overwrite: 'auto' });
1352
+ } else {
1353
+ gsap.to(this, { opacity: 0, duration: 0.35, overwrite: 'auto' });
1354
+ }
1355
+ });
1356
+ showLanguageDetails(d.data, colorScale(d.data.sizeValue));
1357
+
1358
+ // Also emit the custom event so external code can react.
1359
+ container.dispatchEvent(new CustomEvent('voronoi-drilldown', {
1360
+ detail: { code: d.data.code, name: d.data.name, rows: d.data.value }
1361
+ }));
1362
+ });
1363
+ }
1364
+
1365
+ draw();
1366
+
1367
+ let t;
1368
+ window.addEventListener('resize', () => {
1369
+ clearTimeout(t);
1370
+ t = setTimeout(draw, 200);
1371
+ });
1372
+ })();
1373
+ </script>
1374
+
1375
+ </body>
1376
+ </html>
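
As a rough illustration of the `sizing` option in `renderDonut`, the sketch below compares slice shares under the three modes. It is standalone and not part of the commit: `sliceShares` is a hypothetical helper, and the row counts are illustrative (the two pre-training pool sizes from `DATASETS` plus a small 500-row set).

```javascript
// Same compression rules as the sizeValue() closure inside renderDonut,
// rewritten to take the raw row count directly.
const sizeValue = (v, sizing = 'linear') => {
  if (sizing === 'log') return Math.pow(v + 1, 0.38); // power compression, not a true log
  if (sizing === 'sqrt') return Math.sqrt(v + 1);
  return v;
};

// Hypothetical helper: fraction of the ring each dataset would occupy.
function sliceShares(rows, sizing) {
  const weights = rows.map(v => sizeValue(v, sizing));
  const total = weights.reduce((a, b) => a + b, 0);
  return weights.map(w => w / total);
}

// Illustrative inputs: PolyglotText, PolyglotAudio, and a 500-row dataset.
const rows = [13400000, 1200000, 500];

for (const mode of ['linear', 'sqrt', 'log']) {
  const shares = sliceShares(rows, mode).map(s => (s * 100).toFixed(3) + '%');
  console.log(mode.padEnd(6), shares.join('  '));
}
```

Under `'linear'` the 500-row set gets a vanishingly thin slice, while the 0.38 exponent widens it to a visible arc without changing the ordering of the slices. Tooltips and center totals are unaffected because they read `getValue` directly.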