zero-gpu-explorers (ZeroGPU Explorers)

ybelkada

authored a paper 20 days ago

Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

Paper • 2601.04890 • Published 21 days ago • 41

Nymbo

posted an update 21 days ago

Genuine recommendation: You should really use this AutoHotKey macro. Save the file as macros.ahk and run it. Before sending a prompt to your coding agent, press Ctrl + Alt + 1, then paste your prompt after the inserted text in any regular chatbot. Then send the chatbot's output to the agent. This is the actual, boring, real way to "10x your prompting". Use the other number keys to avoid repeating yourself over and over again. I use this macro probably 100-200 times per day. AutoHotKey isn't as new or hyped as a lot of other workflows, but there's a reason it's still widely used after 17 years. Don't overcomplicate it.

; Requires AutoHotkey v1.1 (v1 command syntax; this script will not load under v2)

; All macros are `Ctrl + Alt + <variable>`

^!1::
    Send, Please help me more clearly articulate what I mean with this message (write the message in a code block):
return

^!2::
    Send, Please make the following changes:
return

^!3::
    Send, It seems you got cut off by the maximum response limit. Please continue by picking up where you left off.
return

In my experience over the past few months, Ctrl + Alt + 1 works best with Instruct (non-thinking) models. Reasoning causes some models to ramble and miss the point. I've just been using GPT-5.x for this.

mapooon

submitted a paper to Daily Papers 22 days ago

ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

Paper • 2601.02359 • Published 23 days ago • 4

mapooon

authored 6 papers 22 days ago

ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

Paper • 2601.02359 • Published 23 days ago • 4

Nymbo

posted an update about 1 month ago

🚨 New tool for the Nymbo/Tools MCP server: The new Agent_Skills tool provides full support for Agent Skills (Claude Skills but open-source).

How it works: The tool exposes the standard discover/info/resources/validate actions. Skills live in /Skills under the same File_System root, and any bundled scripts run through Shell_Command, so no new infrastructure is required.

Agent_Skills(action="discover")  # List all available skills
Agent_Skills(action="info", skill_name="music-downloader")  # Full SKILL.md
Agent_Skills(action="resources", skill_name="music-downloader")  # Scripts, refs, assets

I've included a music-downloader skill as a working demo; it wraps yt-dlp for YouTube/SoundCloud audio extraction.

Caveat: On HF Spaces, Shell_Command works for most tasks, but some operations (like YouTube downloads) are restricted due to the container environment. For full functionality, run the server locally on your machine.

Try it out ~ https://www.nymbo.net/nymbot

toshas

posted an update about 1 month ago

Introducing StereoSpace -- our new end-to-end method for turning photos into stereo images without explicit geometry or depth maps. This makes it especially robust with thin structures and transparencies. Try the demo below:

🌐 Project: prs-eth/stereospace_web
📕 Paper: StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space (2512.10959)
🐙 Code: https://github.com/prs-eth/stereospace
🤗 Demo: toshas/stereospace
🤗 Weights: prs-eth/stereospace-v1-0

By ETH Zürich ( @behretj , @Bingxin , @konradschindler ), University of Bologna ( @fabiotosi92 , @mpoggi ), HUAWEI Bayer Lab ( @toshas ).

toshas

authored a paper about 1 month ago

StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

Paper • 2512.10959 • Published Dec 11, 2025 • 12

IliaLarchenko

submitted a paper to Daily Papers about 2 months ago

Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge

Paper • 2512.06951 • Published Dec 7, 2025 • 4

toshas

submitted a paper to Daily Papers about 2 months ago

StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

Paper • 2512.10959 • Published Dec 11, 2025 • 12

toshas

posted an update about 2 months ago

Introducing 🇨🇭WindowSeat🇨🇭 –– our new method for removing reflections from photos taken through windows, on planes, in malls, offices, and other glass-filled environments.

Finetuning a foundation diffusion transformer for reflection removal quickly runs up against the limits of what existing datasets and techniques can offer. To fill that gap, we generate physically accurate examples in Blender that simulate realistic glass and reflection effects. This data enables strong performance on both established benchmarks and previously unseen images.

To make this practical, the open-source Apache-2 model builds on Qwen-Image-Edit-2509, a 20B image-editing diffusion transformer that runs on a single GPU and can be fine-tuned in about a day. WindowSeat keeps its use of the underlying DiT cleanly separated from the data and training recipe, allowing future advances in base models to be incorporated with minimal friction.

Try it out with your own photos in this interactive demo:
🤗 toshas/windowseat-reflection-removal

Other resources:
🌎 Website: huawei-bayerlab/windowseat-reflection-removal-web
🎓 Paper: Reflection Removal through Efficient Adaptation of Diffusion Transformers (2512.05000)
🤗 Model: huawei-bayerlab/windowseat-reflection-removal-v1-0
🐙 Code: https://github.com/huawei-bayerlab/windowseat-reflection-removal

Team: Daniyar Zakarin ( @daniyarzt )*, Thiemo Wandel ( @thiemo-wandel )*, Anton Obukhov ( @toshas ), Dengxin Dai.
*Work done during internships at HUAWEI Bayer Lab

Aurelien-Morgan

posted an update about 2 months ago

Hey, I went to Hangzhou to talk about retrain-pipelines at the GOSIM Foundation's conference last September.
The recording just got released. Go check it out!
https://www.youtube.com/watch?v=nmrMachM5aM
The slides are here:
https://docs.google.com/presentation/d/1hnAzHJ0SbeAOtGJir-iH84RBtXT1OxVT/

IliaLarchenko

posted an update about 2 months ago

🏆 BEHAVIOR Challenge 1st Place – Solution Summary

My team recently won 1st place in the BEHAVIOR Challenge at NeurIPS.
The competition focused on training a single policy to complete 50 long-horizon household tasks in simulation.

We built an end-to-end policy based on Pi0.5 with a bunch of custom modifications. Everything is open-sourced, and it should be useful for anyone exploring VLAs or adapting them to specific tasks.

Key Architecture Changes:
- Replaced language model with 50 trainable task embeddings (no text at all)
- Correlated noise for Flow Matching: ϵ ∼ N(0, 0.5I + 0.5Σ) using dataset action covariance
- Learnable mixed-layer attention: each action expert layer attends to a trainable mix of all VLM layers
- System 2 stage tracking: model predicts task stage, we smooth it with voting and feed it back as context
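
The correlated-noise bullet above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the team's implementation: the function name `sample_correlated_noise`, the flattened `(N, D)` action layout, and the toy dataset are all hypothetical. It draws ϵ ∼ N(0, 0.5I + 0.5Σ) by Cholesky-factoring the blended covariance.

```python
import numpy as np

def sample_correlated_noise(actions, n_samples, beta=0.5, rng=None):
    """Draw eps ~ N(0, (1 - beta) * I + beta * Sigma).

    actions: (N, D) array of flattened dataset actions (hypothetical layout).
    Sigma is their empirical covariance; beta=0.5 gives the 0.5*I + 0.5*Sigma
    mix mentioned in the post.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = actions.shape[1]
    sigma = np.cov(actions, rowvar=False)            # (D, D) empirical covariance
    mixed = (1.0 - beta) * np.eye(d) + beta * sigma  # positive definite for beta < 1
    chol = np.linalg.cholesky(mixed)                 # mixed == chol @ chol.T
    white = rng.standard_normal((n_samples, d))      # uncorrelated N(0, I) samples
    return white @ chol.T                            # covariance of result is `mixed`

# Toy dataset: dimensions 0 and 1 are strongly correlated, dimension 2 is independent.
rng = np.random.default_rng(0)
base = rng.standard_normal((2000, 1))
actions = np.hstack([
    base,
    base + 0.1 * rng.standard_normal((2000, 1)),
    rng.standard_normal((2000, 1)),
])
eps = sample_correlated_noise(actions, n_samples=20_000, rng=rng)
```

The identity term keeps the blended matrix positive definite even when the dataset covariance is rank-deficient, which is what makes the Cholesky factorization safe here.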

Training:
- Multi-sample Flow Matching: 15 FM samples per VLM pass to reduce gradient variance
- Delta action space + per-timestamp normalization
- FAST auxiliary loss and stage prediction loss
- Trained on 224×224 RGB + proprioception only
- We use 4 fine-tuned checkpoints, all derived from a multi-task model trained on all 50 tasks

Inference Optimizations:
- Soft inpainting: predict 30 actions, execute 26, use 4 as an input for the next chunk
- Correlation-aware guidance of inpainting to keep action chunks smooth
- 1.3× speedup via cubic spline compression
- General correction rule: reopen gripper after failed grasps
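
The 30/26/4 chunking in the first bullet can be sketched as a simple control loop. This is a rough sketch, not the released code: `run_chunked` and `toy_policy` are hypothetical names, and the toy policy copies the carried-over actions exactly (hard inpainting), whereas the post describes a soft, guided version.

```python
CHUNK = 30                # actions predicted per policy call (from the post)
EXECUTE = 26              # actions executed from each chunk (from the post)
CARRY = CHUNK - EXECUTE   # 4 leftover actions fed into the next prediction

def run_chunked(predict_chunk, n_chunks):
    """Execute EXECUTE actions per chunk and pass the rest on as a prefix."""
    executed = []
    prefix = []                            # nothing to condition on at the start
    for _ in range(n_chunks):
        chunk = predict_chunk(prefix)      # policy sees the carried-over actions
        assert len(chunk) == CHUNK
        executed.extend(chunk[:EXECUTE])   # run the first 26 actions
        prefix = list(chunk[EXECUTE:])     # keep the last 4 for the next call
    return executed

def toy_policy(prefix):
    """Stand-in policy: a counter that continues from the first prefix action."""
    start = prefix[0] if prefix else 0.0
    return [start + i for i in range(CHUNK)]

trajectory = run_chunked(toy_policy, n_chunks=3)
```

Because each new chunk starts from the carried-over actions, consecutive chunks join without a jump, which is the smoothness the inpainting is meant to provide.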

🔗 Code and Models:
- Code: https://github.com/IliaLarchenko/behavior-1k-solution
- Weights: IliaLarchenko/behavior_submission
- Paper: Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge (2512.06951)

IliaLarchenko

authored a paper about 2 months ago

Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge

Paper • 2512.06951 • Published Dec 7, 2025 • 4

flozi00

posted an update about 2 months ago

We have covered Tensor Parallelism for slicing matrices and Pipeline Parallelism for stacking layers. But what if your model isn't just deep or wide, but a sprawling Mixture-of-Experts (MoE) architecture like Mixtral or DeepSeek, with tens to hundreds of billions of parameters that are mostly idle for any given token?

Replicating those experts wastes VRAM. Slicing them with TP wastes bandwidth. The solution is Expert Parallelism (EP), which distributes the experts themselves across GPUs and routes tokens to wherever their "chosen" expert lives.

The hardware catch? It is not matrix splitting or pipeline bubbles—it's the "Router's Dilemma." You must shuffle massive volumes of tokens across the cluster using All-to-All communication, and any imbalance can leave expensive GPUs idle.

My latest guide dives into the mechanics of EP and why the interconnect becomes the ultimate bottleneck.

In this breakdown, we explore:

The Token Routing Lifecycle
A four-step hardware flow: Local routing to pick experts, Dispatch (All-to-All shuffle), Expert computation on the "home" GPU, and Combine (another All-to-All to return results).
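
The four steps can be traced in a single-process simulation, with plain Python lists standing in for GPUs. This is a sketch under simplifying assumptions and is not taken from the guide: `ep_forward` is a hypothetical name, routing is top-1, expert `e` is assumed to live on rank `e`, and gate weighting of the outputs is omitted.

```python
def ep_forward(tokens_per_rank, router, experts):
    """Simulate route -> dispatch -> expert compute -> combine on lists."""
    n_ranks = len(experts)

    # 1) Local routing + 2) Dispatch: send[src][dst] holds (position, token)
    #    pairs that rank `src` must ship to the rank hosting the chosen expert.
    send = [[[] for _ in range(n_ranks)] for _ in range(n_ranks)]
    for src, tokens in enumerate(tokens_per_rank):
        for pos, tok in enumerate(tokens):
            send[src][router(tok)].append((pos, tok))
    # The first All-to-All is a transpose of the send buffers.
    recv = [[send[src][dst] for src in range(n_ranks)] for dst in range(n_ranks)]

    # 3) Expert computation on the "home" rank of each expert.
    done = [[[(pos, experts[dst](tok)) for pos, tok in recv[dst][src]]
             for src in range(n_ranks)] for dst in range(n_ranks)]

    # 4) Combine: the second All-to-All returns results to their source rank
    #    and restores the original token order.
    out = [[None] * len(tokens) for tokens in tokens_per_rank]
    for dst in range(n_ranks):
        for src in range(n_ranks):
            for pos, val in done[dst][src]:
                out[src][pos] = val

    load = [sum(len(buf) for buf in recv[dst]) for dst in range(n_ranks)]
    return out, load

# Two simulated ranks: even tokens go to expert 0, odd tokens to expert 1.
experts = [lambda x: x * 10, lambda x: x + 100]
outputs, load = ep_forward([[1, 2, 3], [4, 6, 8]], lambda tok: tok % 2, experts)
```

The returned `load` list shows how many tokens each expert rank received, so an uneven router is immediately visible as an uneven load.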

The All-to-All Primitive
Unlike the ring-based syncs in TP, All-to-All creates a dense mesh of personalized data transfers. We compare it to All-Reduce and show why uneven token distribution (load imbalance) causes network congestion and compute skew.

Load Balancing: The Hardware Nightmare
If one expert gets 90% of the tokens, its GPU bottlenecks while others stall. We discuss mitigation strategies like token dropping and auxiliary losses to keep utilization high.
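
Both mitigations can be sketched in pure Python. The guide's exact formulation is not shown here, so this is an assumption: the auxiliary loss below is the Switch Transformer style term (number of experts times the sum over experts of token fraction times mean router probability), and `load_balance_loss` and `drop_over_capacity` are hypothetical names.

```python
def load_balance_loss(router_probs):
    """Auxiliary loss: n_experts * sum_e(frac_e * mean_prob_e); 1.0 when balanced.

    router_probs: one list of per-expert probabilities for each token.
    """
    n_tokens = len(router_probs)
    n_experts = len(router_probs[0])
    counts = [0] * n_experts
    mean_prob = [0.0] * n_experts
    for probs in router_probs:
        top = max(range(n_experts), key=probs.__getitem__)  # top-1 expert
        counts[top] += 1
        for e, p in enumerate(probs):
            mean_prob[e] += p / n_tokens
    frac = [c / n_tokens for c in counts]
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))

def drop_over_capacity(assignments, n_experts, capacity):
    """Token dropping: keep at most `capacity` tokens per expert, in arrival order."""
    used = [0] * n_experts
    kept = []
    for tok_idx, expert in enumerate(assignments):
        if used[expert] < capacity:
            used[expert] += 1
            kept.append(tok_idx)
    return kept

balanced = load_balance_loss([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]])
skewed = load_balance_loss([[0.9, 0.1]] * 4)
kept = drop_over_capacity([0, 0, 0, 1], n_experts=2, capacity=2)
```

The loss is lowest when tokens and router probability are spread evenly, so adding it to the training objective pushes the router away from the one-hot-expert failure mode; capacity-based dropping bounds the worst case at inference or training time regardless of what the router does.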

The article includes a raw PyTorch implementation of an EP layer using torch.distributed.all_to_all_single to reveal exactly how the data shuffles and where the stalls happen.

Read the full hardware-centric guide here:
https://flozi.net/en/guides/ai/scaling/expert_parallel

toshas

authored a paper about 2 months ago

The Fourth Monocular Depth Estimation Challenge

Paper • 2504.17787 • Published Apr 24, 2025

ZeroGPU Explorers

AI & ML interests

Recent Activity

Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

ExposeAnyone: Personalized Audio-to-Expression Diffusion Models Are Robust Zero-Shot Face Forgery Detectors

BlendFace: Re-designing Identity Encoders for Face-Swapping

Face2Diffusion for Fast and Editable Face Personalization

PetFace: A Large-Scale Dataset and Benchmark for Animal Identification

OpenAnimalTracks: A Dataset for Animal Track Recognition

Detecting Deepfakes with Self-Blended Images

StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge

The Fourth Monocular Depth Estimation Challenge

Team members 751

zero-gpu-explorers's activity