Update research.html
research.html (+2 -2)
@@ -253,7 +253,7 @@
 
 <ul>
 <li><strong>The Setup:</strong> We are training an ultra-lean <strong>5M parameter Llama model</strong> using Hugging Face Transformers.</li>
-<li><strong>The Data:</strong> Exactly <strong>
+<li><strong>The Data:</strong> Exactly <strong>500 Million tokens</strong> total per run, testing four configurations:
 <br>1. 100% <code>FineWeb-Edu</code>
 <br>2. 100% <code>DCLM-Edu</code>
 <br>3. 100% <code>Cosmopedia-v2</code>
@@ -284,7 +284,7 @@
 <p>The standard convention for LLMs is "one epoch and move on" to avoid overfitting, popularized by several landmark papers. But small models training on high-quality educational data might be a completely different beast. Can they chew on the same high-signal data multiple times?</p>
 
 <ul>
-<li><strong>The Setup:</strong> A <strong>10M parameter Llama model</strong> trained on exactly <strong>
+<li><strong>The Setup:</strong> A <strong>10M parameter Llama model</strong> trained on exactly <strong>500 Million tokens</strong> of <code>FineWeb-Edu</code>.</li>
 <li><strong>The Epoch Matrix:</strong> We are running 5 identical setups, changing only the epoch count: <strong>1 Epoch vs. 2, 3, 4, and 5 Epochs</strong>.</li>
 </ul>
 <p><strong>The Goal:</strong> Pinpoint exactly where overfitting begins for an SLM. If performance on <code>lm-eval</code> keeps scaling up past epoch 2 or 3 without destroying perplexity, it could mean data-scarcity solutions for edge AI are much easier than we think.</p>
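To make the epoch matrix in the second hunk concrete, here is a minimal sketch of the token arithmetic behind the five runs. It assumes the 500 Million tokens are the unique dataset and that each extra epoch repeats that same set; the page does not say this outright. The function name `epoch_matrix` and the tokens-per-parameter column are illustrative additions, not part of `research.html`.

```python
# Assumption: 500M is the number of UNIQUE FineWeb-Edu tokens, and each
# additional epoch repeats that same set (the diff does not confirm this).
UNIQUE_TOKENS = 500_000_000   # "500 Million tokens" from the diff
MODEL_PARAMS = 10_000_000     # "10M parameter Llama model" from the diff
EPOCH_COUNTS = [1, 2, 3, 4, 5]


def epoch_matrix(unique_tokens, model_params, epoch_counts):
    """Return one row per run: epochs, total tokens seen, tokens seen per parameter."""
    rows = []
    for epochs in epoch_counts:
        tokens_seen = unique_tokens * epochs
        rows.append(
            {
                "epochs": epochs,
                "tokens_seen": tokens_seen,
                "tokens_seen_per_param": tokens_seen / model_params,
            }
        )
    return rows


if __name__ == "__main__":
    for row in epoch_matrix(UNIQUE_TOKENS, MODEL_PARAMS, EPOCH_COUNTS):
        print(
            f"{row['epochs']} epoch(s): "
            f"{row['tokens_seen']:,} tokens seen, "
            f"{row['tokens_seen_per_param']:.0f} per parameter"
        )
```

Under that assumption, the unique data stays fixed at 50 tokens per parameter across all five runs while the tokens seen grow from 50 to 250 per parameter, so any divergence between `lm-eval` scores and perplexity can be attributed to repetition rather than to new data.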