---
license: apache-2.0
language:
- pt
pipeline_tag: text-generation
---

# MiniText-v1.0

## Model Summary

MiniText-v1.0 is a tiny **character-level language model** trained from scratch
to learn basic Portuguese text patterns.

The goal of this project is to explore the **minimum viable neural architecture**
capable of producing structured natural language, without pretraining,
instruction tuning, or external corpora.

This model is intended for **research, education, and experimentation**.

---

## Model Details

- **Architecture:** Custom MiniText (character-level)
- **Training Objective:** Next-character prediction
- **Vocabulary:** Byte-level (0–255)
- **Language:** Portuguese (basic)
- **Initialization:** Random (no pretrained weights)
- **Training:** Single-stream autoregressive training
- **Parameters:** ~10K

This is a **base model**, not a chat model.
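
The exact architecture is not published in this card. As a rough illustration, a byte-level autoregressive model on the order of ~10K parameters could be sketched as follows (PyTorch; the class name and layer sizes are assumptions, chosen only to land near the stated parameter budget):

```python
# Illustrative sketch only: a ~10K-parameter byte-level autoregressive model.
# The real MiniText architecture may differ.
import torch
import torch.nn as nn

class TinyByteLM(nn.Module):
    def __init__(self, vocab_size=256, d_model=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # 256 * 16 = 4,096 params
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)  # ~1,632 params
        self.head = nn.Linear(d_model, vocab_size)  # 16 * 256 + 256 = 4,352 params

    def forward(self, x):                # x: (batch, seq) of byte ids in 0-255
        h, _ = self.rnn(self.embed(x))   # hidden state at every position
        return self.head(h)              # logits over the next byte

model = TinyByteLM()
print(sum(p.numel() for p in model.parameters()))  # ~10,080, i.e. roughly 10K
```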

---

## Training Data

The model was trained on a **synthetic Portuguese dataset** designed to emphasize:

- Simple sentence structure
- Common verbs and nouns
- Basic grammar patterns
- Repetition and reinforcement

The dataset intentionally avoids:

- Instruction-following
- Dialog formatting
- Reasoning traces

This design allows clear observation of **language emergence** in small models.
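
The dataset itself is not included in this card. A toy generator in the same spirit (short, repetitive sentences drawn from a small pool of common words) might look like this; the word lists are invented for illustration:

```python
# Toy synthetic-corpus generator in the spirit described above; the word
# pools are invented examples, not the actual MiniText training data.
import random

nouns = ["gato", "cachorro", "menino", "peixe"]               # common nouns
verbs = ["é", "come", "vê", "tem"]                            # common verbs
complements = ["um animal", "a comida", "a bola", "um amigo"]

random.seed(0)
sentences = [
    f"o {random.choice(nouns)} {random.choice(verbs)} {random.choice(complements)}"
    for _ in range(1000)  # repetition reinforces the basic patterns
]
data = "\n".join(sentences).encode("utf-8")  # byte-level tokens (0-255)
```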

---

## Training Procedure

- Optimizer: Adam
- Learning rate: 3e-4
- Sequence length: 64
- Epochs: 12000
- Loss function: Cross-entropy
- Hardware: AMD Ryzen 5 5600G CPU, 32 GB RAM (~0.72 TFLOPS)

Training includes checkpointing and continuation support.
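
A minimal sketch of how these hyperparameters fit together, reusing the hypothetical `TinyByteLM` and `data` from the snippets above (the checkpoint format, filename, and one-window-per-step loop are assumptions, not the project's actual code):

```python
# Minimal training-loop sketch using the hyperparameters listed above.
import os
import torch
import torch.nn.functional as F

SEQ_LEN, LR, EPOCHS = 64, 3e-4, 12000
ids = torch.tensor(list(data), dtype=torch.long)  # byte stream as token ids

model = TinyByteLM()
opt = torch.optim.Adam(model.parameters(), lr=LR)

# Continuation support: resume from an existing checkpoint if present.
start = 0
if os.path.exists("minitext.pt"):
    ckpt = torch.load("minitext.pt")
    model.load_state_dict(ckpt["model"])
    opt.load_state_dict(ckpt["opt"])
    start = ckpt["epoch"]

for epoch in range(start, EPOCHS):
    # One random window per step keeps the sketch short; a real run would
    # iterate over the whole stream.
    i = torch.randint(0, len(ids) - SEQ_LEN - 1, (1,)).item()
    x = ids[i : i + SEQ_LEN].unsqueeze(0)          # input bytes
    y = ids[i + 1 : i + SEQ_LEN + 1].unsqueeze(0)  # next-byte targets
    loss = F.cross_entropy(model(x).transpose(1, 2), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if (epoch + 1) % 1000 == 0:  # periodic checkpointing
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "epoch": epoch + 1}, "minitext.pt")
```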

---

## Intended Use

### Supported Use Cases

- Educational experiments
- Language modeling research
- Studying emergent structure in small neural networks
- Baseline comparisons for future MiniText versions

### Out-of-Scope Use Cases

- Conversational agents
- Instruction-following systems
- Reasoning or math tasks
- Production deployment

---

## Example Output

Prompt: `o gato é`

Sample generation: `o gato é um animal`

Note: Output quality varies due to the minimal size of the model.
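
Sampling from a byte-level model of this kind can be sketched as follows (temperature sampling; the `generate` helper and the `model` instance refer to the illustrative snippets above, not to released code):

```python
# Sampling sketch for a byte-level model.
import torch

@torch.no_grad()
def generate(model, prompt, n_bytes=32, temperature=0.8):
    ids = torch.tensor(list(prompt.encode("utf-8")), dtype=torch.long).unsqueeze(0)
    for _ in range(n_bytes):
        logits = model(ids)[0, -1] / temperature       # logits for the next byte
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)  # sample one byte id
        ids = torch.cat([ids, nxt.view(1, 1)], dim=1)
    # Byte-level sampling can emit invalid UTF-8, hence errors="replace".
    return bytes(ids[0].tolist()).decode("utf-8", errors="replace")

print(generate(model, "o gato é"))
```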

---

## Limitations

- Limited vocabulary and coherence
- No reasoning or factual understanding
- Susceptible to repetition and noise
- Not aligned or safety-tuned

These limitations are **expected and intentional**.

---

## Ethical Considerations

This model does not include safety filtering or alignment mechanisms.
It should not be used in applications involving sensitive or high-risk domains.

---

## Future Work

Planned extensions of the MiniText family include:

- MiniText-v1.1-Lang (improved Portuguese fluency)
- MiniText-Math (symbolic pattern learning)
- MiniText-Chat (conversation fine-tuning)
- MiniText-Reasoning (structured token experiments)

Each version will remain linked to this base model.

---

## Citation

If you use MiniText-v1.0 in research or educational material, please cite the project repository.

---

## License
Apache-2.0 (matching the `license` field in the metadata above).

Made by: Arthur Samuel (loboGOAT)