All HF Hub posts

mmhamdy
posted an update 2 days ago
What if you could train a model on just 10 images instead of 60,000 and still get close to the same performance?

Traditional machine learning requires thousands, even millions, of data points to achieve high accuracy. But what if we could "distill" the entire dataset into just a few synthetic samples?

This is what Dataset Distillation offers. Unlike traditional knowledge distillation, we keep the model fixed and distill the knowledge contained in a massive training set into a tiny set of synthetic distilled images.

The goal is to train a model on this ultra-small set and achieve performance that almost matches what the same model would get when trained on the massive original dataset.

For example, training on only 10 distilled MNIST images (this is equivalent to a single image per class) yields 94% accuracy, compared to 99% when training on the full 60,000 images.
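To make the idea concrete, here is a minimal sketch of dataset distillation on a toy problem. It is not the method behind the MNIST numbers above: it assumes a linear regression model, a single inner gradient step from zero weights, and hand-derived gradients. All names (`train_one_step`, `distill`, `make_real_data`) are hypothetical. The synthetic points are updated so that a model trained only on them does well on the real data.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def train_one_step(syn_x, syn_y, lr):
    """Inner loop: one gradient step from w = 0 on the synthetic set only.

    For the loss (1/(2m)) * sum_i (w.x_i - y_i)^2, the gradient at w = 0 is
    -(1/m) * sum_i y_i * x_i, so the step gives w_j = (lr/m) * sum_i y_i * x_ij.
    """
    m, d = len(syn_x), len(syn_x[0])
    return [lr / m * sum(syn_y[i] * syn_x[i][j] for i in range(m)) for j in range(d)]

def real_loss_and_grad(w, xs, ys):
    """Outer objective: mean squared error on the real data, plus its gradient in w."""
    n = len(xs)
    residuals = [dot(w, x) - y for x, y in zip(xs, ys)]
    loss = sum(r * r for r in residuals) / n
    grad = [2.0 / n * sum(r * x[j] for r, x in zip(residuals, xs)) for j in range(len(w))]
    return loss, grad

def distill(xs, ys, n_syn=2, inner_lr=1.0, outer_lr=0.05, steps=500, seed=0):
    """Learn n_syn synthetic (x, y) points by gradient descent on the real-data loss."""
    rng = random.Random(seed)
    d = len(xs[0])
    syn_x = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_syn)]
    syn_y = [rng.gauss(0, 1) for _ in range(n_syn)]
    history = []
    scale = inner_lr / n_syn
    for _ in range(steps):
        w = train_one_step(syn_x, syn_y, inner_lr)
        loss, g = real_loss_and_grad(w, xs, ys)
        history.append(loss)
        # Chain rule through w_j = scale * sum_i syn_y[i] * syn_x[i][j].
        grad_y = [scale * dot(syn_x[i], g) for i in range(n_syn)]
        grad_x = [[scale * syn_y[i] * g[j] for j in range(d)] for i in range(n_syn)]
        for i in range(n_syn):
            syn_y[i] -= outer_lr * grad_y[i]
            for j in range(d):
                syn_x[i][j] -= outer_lr * grad_x[i][j]
    return syn_x, syn_y, history

def make_real_data(n=100, seed=1):
    """Toy 'large' dataset: noiseless linear targets with made-up true weights."""
    rng = random.Random(seed)
    w_true = [1.5, -2.0, 0.5]
    xs = [[rng.gauss(0, 1) for _ in w_true] for _ in range(n)]
    ys = [dot(w_true, x) for x in xs]
    return xs, ys

if __name__ == "__main__":
    xs, ys = make_real_data()
    syn_x, syn_y, history = distill(xs, ys)
    print(f"real-data loss with initial synthetic points: {history[0]:.4f}")
    print(f"real-data loss with distilled points:         {history[-1]:.4f}")
```

The learned points need not resemble any real sample. They only have to push the weights to the right place in one step, which is the same reason distilled images look unnatural.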

Interestingly, these distilled images look significantly different (as you can see in the image below) from natural images because they are optimized for model training rather than for matching the correct data distribution.

But that's not all.

Most importantly, this same method opens the door to a potent form of data poisoning. Because distilled images are specifically optimized for rapid learning, an attacker can create a tiny set of adversarial distilled images to cause a well-trained model to forget or misclassify a specific category.

What I find fascinating about dataset distillation is this: it mimics human-like learning by letting a model grasp a concept from a single example, but it does so using alien synthetic images that mean absolutely nothing to a human eye!

What about you? What are your thoughts on it?
KingNish
posted an update 2 days ago
We trained an open-source, Mythos-like cybersecurity LLM for the Build Small Hackathon: meet OpenMythos.

Trained in two stages: SFT on ~1.84K filtered ArXiv cs.CR papers + real CVE data, then RLVR using GitHub repos paired with past vulnerabilities, with a verifier model checking outputs against ground truth.

Trained on: H100s from Modal

The RLVR stage made the biggest difference: responses got more precise and less prone to confusing similar vulnerability classes.
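As a rough illustration of what "checking outputs against ground truth" can look like, here is a rule-based stand-in for a verifier. The post describes a verifier model, not this function. Using CWE identifiers as the ground-truth label and the names `extract_cwe_ids` and `verifiable_reward` are assumptions made for the sketch.

```python
import re
from typing import Set

CWE_PATTERN = re.compile(r"CWE-(\d+)", re.IGNORECASE)

def extract_cwe_ids(text: str) -> Set[str]:
    """Collect every CWE identifier mentioned in a model response."""
    return {f"CWE-{num}" for num in CWE_PATTERN.findall(text)}

def verifiable_reward(response: str, ground_truth_cwe: str) -> float:
    """Hypothetical reward: 0 if the true class is missing, otherwise 1 divided
    by the number of distinct classes named, so hedging across several similar
    classes earns less than a single precise answer."""
    predicted = extract_cwe_ids(response)
    if ground_truth_cwe.upper() not in predicted:
        return 0.0
    return 1.0 / len(predicted)
```

A reward shaped like this would discourage naming several neighbouring vulnerability classes at once, which is one plausible way such a stage could reduce confusion between them.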

Everything is open:
🤖 Demo → build-small-hackathon/OpenMythos
🧠 Model → build-small-hackathon/OpenMythos
📦 CVE Dataset → build-small-hackathon/CVE_Vulnerailities_Detailed
📄 ArXiv Dataset → himanshu17HF/ArvixImport-Filtered-Final

Try it out and let us know where it breaks 🙏
danielhanchen
posted an update 2 days ago
ovi054
posted an update 3 days ago
Qwen3-14B Manim Expert LoRA

For "Build Small Hackathon", I built a Gradio app that turns any concept into a Manim explainer video.

It is powered by Qwen3-14B plus a Manim LoRA that I trained on a synthetic 10k-example dataset I generated.

👉 Try it now: build-small-hackathon/anim-vid-ai
kanaria007
posted an update 2 days ago
✅ Article highlight: *Institutional Memory & Forgetting for Learning Worlds* (art-60-172, v0.1)

TL;DR:
This article argues that if a living world becomes training data, memory becomes infrastructure.

Logs, dialogue, labels, releases, feature stores, and model weights can turn a world into something that cannot honestly forget. Article 172 turns deletion, redaction, exclusion, forgetting requests, SANITIZED/PUBLIC releases, and unlearning claims into receipted governance lifecycles.

Read:
kanaria007/agi-structural-intelligence-protocols

Why it matters:
• prevents learning worlds from becoming “unforgettable worlds”
• separates deletion, redaction, and future extraction exclusion
• makes right-to-be-forgotten requests caseable and appealable
• preserves canon facts without preserving every memory surface
• blocks public promises like “guaranteed deletion everywhere”

What’s inside:
• retention policy contracts for what may be kept, copied, trained on, or released
• corpus segment manifests and propagation indexes for known controlled copies
• forgetting request, adjudication, remedy, deletion, redaction, and exclusion receipts
• tombstone manifests and semantic preservation receipts for canon-safe forgetting
• use eligibility receipts for deciding whether a segment may train a future run
• release contracts, redaction maps, and irreversibility disclosures for SANITIZED/PUBLIC releases
• bounded unlearning contracts and post-unlearning verification receipts

Key idea:
Do not say:

*“we deleted it, so it is forgotten.”*

Say:

*“this subject was handled under this retention policy, propagation index, adjudication path, remedy contract, tombstone, semantic preservation receipt, extraction exclusion receipt, and bounded public claim.”*

Forgetting is not a button.

It is governance with receipts.
prithivMLmods
posted an update 3 days ago
Wan2.2-I2V-Fast with highly upscaled sequential frame sampling is now available as a Spaces demo, built using Wan2.2-I2V and FLUX.2-Klein. Try the demo using the links below. 👇

➠ wan2.2-i2v-fast : prithivMLmods/wan2.2-i2v-fast
➠ github: https://github.com/prithivsakthiur/wan2.2-i2v-fast
➠ collection: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection

⤷ To learn more, visit the app page or the respective model pages.
Kasualdad
posted an update 3 days ago
From Plain English to DuckDB SQL: Building LFEDS
🏫 I just shipped Local First Education Data Stack, a plain-English-to-SQL assistant for school district analytics, for the HF Build Small Hackathon.

The problem: school staff have useful data (attendance, grades, enrollment, discipline) but no fast, private way to ask questions. Most AI tools send that data to a cloud API. LFED doesn't.

What it does:
→ Type a question like "What's the average GPA for chronically absent students in 2023-2024?"
→ A fine-tuned Qwen2.5-Coder-14B model generates DuckDB SQL
→ A validation layer rejects anything that isn't a SELECT
→ Results come back as a summary, table, CSV download, and the SQL itself
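The validation step in that pipeline could look roughly like the sketch below. This is an assumption about the general shape of such a check, not the project's actual code. The function name `is_safe_select` and the keyword list are hypothetical, and the check is deliberately conservative: it uses pattern matching, not a real SQL parser, so it will also reject some harmless queries (for example, ones containing a semicolon or a blocked word inside a string literal).

```python
import re

# Hypothetical blocklist of statements that modify data, schema, or engine state.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|detach|copy|"
    r"install|load|pragma|export|import|call)\b",
    re.IGNORECASE,
)

def is_safe_select(sql: str) -> bool:
    """Return True only for a single statement that starts with SELECT or WITH
    and contains no blocked keyword, no comment marker, and no extra semicolon."""
    statement = sql.strip()
    # Allow exactly one trailing semicolon.
    if statement.endswith(";"):
        statement = statement[:-1].rstrip()
    if not statement:
        return False
    # Any remaining semicolon may hide a second statement; comments may hide text.
    if ";" in statement or "--" in statement or "/*" in statement:
        return False
    if not re.match(r"(select|with)\b", statement, re.IGNORECASE):
        return False
    return FORBIDDEN.search(statement) is None
```

Rejecting comment markers outright, instead of stripping them, avoids the case where the validator and the database disagree about what the query text contains. A stricter setup would additionally run queries over a read-only connection.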

Two flavors:
- Live Space demo: transformers + PEFT on HF ZeroGPU
- Local-first: llama.cpp + GGUF Q4_K_M on your own machine, so no data leaves

The fine-tune:
- 27,859 synthetic NL→SQL pairs
- Unsloth QLoRA r=32 on Qwen2.5-Coder-14B
- Trained on Modal A10G

The hardest lessons were not about model training:
1. Scope the model's job tightly: schema + few-shots + SELECT only.
2. Validate before executing. Always.
3. ZeroGPU is PyTorch-only; llama.cpp won't work there.
4. Gradio's scoped Svelte CSS beats generic selectors, so inspect the live DOM.
5. modal deploy + fn.spawn() is fire-and-forget; modal run dies if your terminal drops.
6. Data artifacts matter as much as the model: Parquet seeds, dataset card, model card.

I also published the training dataset: 25,886 question→SQL pairs on the Hub.

Links:
- Demo: https://youtu.be/cE0yp4qmFIA
- Live Space: build-small-hackathon/Kasualdad_LFED
- LoRA adapter: build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora
- GGUF: build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf
- Dataset: build-small-hackathon/lfed-training-data

#BuildSmallHackathon #BackyardAI #HuggingFace #TextToSQL #DuckDB #LocalFirst #EdTech #Qwen #QLoRA #LLM
loay
posted an update 3 days ago
I built EchoYard for the build-small-hackathon: a tiny listen-and-repeat language practice app.

Pick a language, level, and voice style, listen to a short reference voice, record yourself, then get simple speaking feedback and a next practice step.

Built with openbmb VoxCPM2 for multilingual reference audio and MiniCPM5-1B for friendly feedback.

Try it here: https://build-small-hackathon-echoyard.hf.space

Would love feedback, especially on the recording flow and how useful the speaking tips feel.
nevmenandr
posted an update 3 days ago
🔥 New Russian Stylometry Dataset!

Russian Stylometric Dataset (RSD): 322 texts from the 19th to early 20th centuries (16 million words), prepared for analysis in stylo (R) and machine learning (Python).

📚 What's inside?

Fiction, journalism, scientific texts, drama, poetry

Grouped by author, gender, age, genre, literary movements (Romanticism/Realism)

Character speech (Tolstoy, Gogol, Ostrovsky)

Generated texts (LSTM, GPT)

📊 Use cases: authorship attribution, clustering, classification, benchmarking methods.

🔓 Public domain + GPL-3.0 license.

👉 Learn more: https://github.com/nevmenandr/RSD

DOI: 10.5281/zenodo.20701309
ykirpichev
posted an update 3 days ago
Glass-Box Agent for Build Small Hackathon.

A tiny ReAct-style agent where the trace is the interface: click a thought, retry a branch, label weak/useful nodes, and export preference pairs for DPO/RL-style training.
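One way the export step could work is sketched below. This is not taken from the Space: the `TraceNode` structure, the label strings, and `export_preference_pairs` are hypothetical. The sketch assumes that sibling branches under the same parent share a prompt, so every "useful" sibling can be paired against every "weak" one in the prompt/chosen/rejected layout commonly used for DPO data.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List, Optional

@dataclass
class TraceNode:
    """One step in an agent trace (thought, action, or observation)."""
    text: str
    label: Optional[str] = None  # "useful", "weak", or None if unlabeled
    children: List["TraceNode"] = field(default_factory=list)

def export_preference_pairs(node: TraceNode, context: str = "") -> List[Dict[str, str]]:
    """Walk the trace tree and emit one pair per (useful, weak) sibling combination.

    The prompt for a pair is the text of all ancestors down to the shared parent.
    """
    prompt = (context + "\n" + node.text).strip()
    useful = [c for c in node.children if c.label == "useful"]
    weak = [c for c in node.children if c.label == "weak"]
    pairs = [
        {"prompt": prompt, "chosen": good.text, "rejected": bad.text}
        for good, bad in product(useful, weak)
    ]
    # Recurse so that retried branches deeper in the trace also contribute pairs.
    for child in node.children:
        pairs.extend(export_preference_pairs(child, prompt))
    return pairs

if __name__ == "__main__":
    root = TraceNode("Question: what is 2 + 2?", children=[
        TraceNode("Thought: add the numbers", label="useful", children=[
            TraceNode("Action: calculator(2 + 2)", label="useful"),
            TraceNode("Action: search the web", label="weak"),
        ]),
        TraceNode("Thought: guess 5", label="weak"),
    ])
    for pair in export_preference_pairs(root):
        print(pair)
```

Unlabeled nodes are skipped, so only branches a person has explicitly judged end up in the training data.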

Space: build-small-hackathon/glass-box-agent
Demo: included in the Space at assets/glass-box-agent-demo.mp4
Track: An Adventure in Thousand Token Wood

#BuildSmallHackathon #Gradio #SmallModels