Embedl Parakeet Tdt 0.6B V3 (Quantized for TensorRT)

Deployable INT8-quantized version of nvidia/parakeet-tdt-0.6b-v3, optimized with embedl-deploy for low-latency NVIDIA TensorRT speech recognition on edge GPUs.

Upstream Model

Open nvidia/parakeet-tdt-0.6b-v3 in hfviewer

Highlights

Mixed-precision INT8/FP16 quantization of the Conformer encoder via embedl-deploy — the TDT decoder stays in FP32 (small, autoregressive).
Drop-in replacement for the upstream nvidia/parakeet-tdt-0.6b-v3 encoder: same input, a log-mel spectrogram of 3000 frames × 128 bins.
Validated accuracy within about 0.16 pp average WER of the FP32 baseline on matched Open ASR Leaderboard subsets.
Ships an ONNX model (for TensorRT) and a runnable inference script. The TensorRT engine built on the first run is cached next to the ONNX file.
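
The 3000 × 128 input shape follows from the 30 s encoder window. The sketch below reproduces the arithmetic, assuming 16 kHz audio (as in the demo section) and a 10 ms hop between mel frames. The hop length is an assumption and is not stated on this card.

```python
SAMPLE_RATE_HZ = 16_000   # the demo audio is decoded to 16 kHz mono
HOP_SECONDS = 0.010       # ASSUMPTION: 10 ms frame hop, not stated on this card
WINDOW_SECONDS = 30       # encoder window length
N_MEL_BINS = 128          # mel bins per frame


def encoder_input_shape(window_seconds=WINDOW_SECONDS):
    """Return (batch, frames, mel_bins) for one fixed-length window."""
    hop_samples = round(SAMPLE_RATE_HZ * HOP_SECONDS)
    n_frames = (window_seconds * SAMPLE_RATE_HZ) // hop_samples
    return (1, n_frames, N_MEL_BINS)


print(encoder_input_shape())
```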

Quick Start

Install the latest version of transformers and the additional packages for processing audio files. You need an NVIDIA GPU with driver ≥ 525 (CUDA 12.x) and the TensorRT Python package, which is usually installed system-wide on NVIDIA Orin devices (the virtual environment below is created with --system-site-packages so that it can see that package). Be careful when installing Torch and TensorRT: a plain pip install pulls CUDA 13 builds, which might not work on the Orin. The CPU wheel for torch is sufficient, because TensorRT does the heavy lifting during inference.

python3 -m venv venv --system-site-packages
source venv/bin/activate
pip install transformers
pip install soundfile librosa
pip install --force-reinstall --no-cache-dir "Pillow>=10.0.0"
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

Download the quantized ONNX model and run inference on a sample audio file. The first run builds the TensorRT engine from the quantized ONNX file.

python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/parakeet-tdt-0.6b-v3-quantized-tensorrt', local_dir='.')"
python infer_trt.py --audio path/to/speech.wav

Files

File	Purpose
`embedl_parakeet-tdt-0.6b-v3_int8.onnx`	INT8-quantized encoder ONNX with Q/DQ nodes.
`infer_trt.py`	Build a TRT engine from the ONNX and transcribe a WAV.

Demo: Nehru "Tryst with Destiny" (1947)

A 4 min 41 s archival speech in English with a strong regional accent and period audio quality, which makes it a stress test for any modern ASR model. The demo MP3 is decoded to 16 kHz mono and split into 28 s chunks (below the 30 s encoder window) before being fed through the INT8-quantized encoder.
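
The chunking step can be sketched as follows. This is a minimal illustration that assumes consecutive, non-overlapping 28 s chunks; the shipped script may split differently (for example with overlap or at silences), and `chunk_bounds` is a hypothetical helper name.

```python
def chunk_bounds(n_samples, sample_rate=16_000, chunk_seconds=28.0):
    """Return (start, end) sample indices of consecutive non-overlapping chunks."""
    chunk_len = int(round(chunk_seconds * sample_rate))
    return [(start, min(start + chunk_len, n_samples))
            for start in range(0, n_samples, chunk_len)]


# The demo clip lasts 280.9 s at 16 kHz.
n_samples = int(round(280.9 * 16_000))
bounds = chunk_bounds(n_samples)
print(len(bounds), "chunks; last chunk:", bounds[-1])
```

Under these assumptions the 280.9 s clip yields ten full 28 s chunks plus a final chunk of 0.9 s.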

Mel-spectrogram of the Nehru Tryst-with-Destiny speech

Result (Embedl Parakeet INT8 encoder + upstream TDT decoder, against the verified ground-truth transcript — Whisper-style normalized):

Metric	Value
Audio duration	280.9 s (4 min 41 s)
Ground-truth words	520
Parakeet hypothesis words	526
Word Error Rate	5.58 %
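
For reference, word error rate is the word-level edit distance between the normalized reference and hypothesis, divided by the number of reference words. On 520 reference words, 5.58 % corresponds to about 29 word errors. The sketch below is a generic implementation with a toy sentence pair, not the evaluation code used for this card.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word) # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate relative to the number of reference words."""
    return word_errors(reference, hypothesis) / len(reference.split())


# Toy example: one substitution and one insertion against 4 reference words.
print(wer("the quick brown fox", "the quick brown box jumps"))
```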

The Parakeet INT8 output transcript is provided for direct comparison.

Try it yourself:

curl -O https://huggingface.co/datasets/embedl/documentation-images/resolve/main/parakeet-tdt-0.6b-v3-quantized-tensorrt/nehru_tryst.mp3
python infer_trt.py --audio nehru_tryst.mp3

Performance

Encoder benchmarked via trtexec, GPU compute time only (--noDataTransfers), CUDA Graph + Spin Wait enabled, 1000 iterations after a 2 s warm-up. Input is a static (1, 3000, 128) log-mel spectrogram (30 s window). Engine size is the on-disk .engine file; peak GPU memory is the engine plus the per-context activation pool reported by trtexec.
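
The settings above map onto standard trtexec options roughly as shown below. This is a sketch rather than the exact command behind the tables: the engine file name is hypothetical, and the card does not list the full flag set that was used.

```python
import shlex


def trtexec_benchmark_cmd(engine_path):
    """Assemble a trtexec invocation matching the benchmark description."""
    return [
        "trtexec",
        f"--loadEngine={engine_path}",  # hypothetical path to a prebuilt engine
        "--noDataTransfers",            # GPU compute time only
        "--useCudaGraph",               # CUDA Graph enabled
        "--useSpinWait",                # Spin Wait enabled
        "--iterations=1000",            # at least 1000 timed iterations
        "--warmUp=2000",                # warm-up time in milliseconds (2 s)
    ]


print(shlex.join(trtexec_benchmark_cmd("encoder_int8.engine")))
```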

NVIDIA L4 (TensorRT 10.16)

Precision	Mean Latency	p95	Throughput	Engine Size	Peak GPU Memory	Speedup vs FP32
FP32	32.82 ms	33.62 ms	30.5 qps	2364 MiB	2570 MiB	1.00×
FP16	16.83 ms	17.10 ms	59.4 qps	1177 MiB	1280 MiB	1.95×
Embedl INT8 + FP16	14.57 ms	14.74 ms	68.6 qps	758 MiB	862 MiB	2.25×

NVIDIA Jetson AGX Orin (TensorRT 10.3, JetPack 6)

Precision	Mean Latency	p95	Throughput	Engine Size	Peak GPU Memory	Speedup vs FP32
FP32	62.40 ms	62.46 ms	16.0 qps	2331 MiB	2526 MiB	1.00×
FP16	35.62 ms	35.66 ms	28.1 qps	1174 MiB	1274 MiB	1.75×
Embedl INT8 + FP16	34.32 ms	34.35 ms	29.1 qps	706 MiB	806 MiB	1.82×

The Embedl INT8 + FP16 engine is 2.25× faster than FP32 on L4 and 1.82× faster on AGX Orin, with a 3.1×–3.3× smaller engine than FP32 across both targets.
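
The derived columns can be recomputed from the mean latency and engine size in the tables. The sketch below assumes batch size 1, so throughput is simply the reciprocal of mean latency; `summarize` is a hypothetical helper.

```python
# (mean latency in ms, engine size in MiB), copied from the tables above
ROWS = {
    "L4": {"FP32": (32.82, 2364), "FP16": (16.83, 1177), "INT8+FP16": (14.57, 758)},
    "AGX Orin": {"FP32": (62.40, 2331), "FP16": (35.62, 1174), "INT8+FP16": (34.32, 706)},
}


def summarize(device):
    """Recompute throughput, speedup, and engine-size ratio relative to FP32."""
    base_ms, base_mib = ROWS[device]["FP32"]
    return {
        precision: {
            "qps": round(1000.0 / ms, 1),          # batch size 1
            "speedup": round(base_ms / ms, 2),
            "size_ratio": round(base_mib / mib, 1),
        }
        for precision, (ms, mib) in ROWS[device].items()
    }


for device in ROWS:
    print(device, summarize(device)["INT8+FP16"])
```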

Accuracy (Open ASR Leaderboard)

Evaluated on the Open ASR Leaderboard test suites with the official Whisper-style English text normalizer (lowercase, number expansion, filler-word removal). Lower WER is better.
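
To illustrate what normalization does before scoring, here is a deliberately simplified stand-in. It is not the official normalizer: it only lowercases, strips punctuation, and drops a small illustrative set of filler words, and it skips number handling entirely. The filler list is an assumption.

```python
import re

# ASSUMPTION: small illustrative subset, not the official filler-word list
FILLERS = {"uh", "um", "hmm", "mm", "mhm"}


def simple_normalize(text):
    """Lowercase, replace punctuation with spaces, and drop filler words."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    words = [word for word in text.split() if word not in FILLERS]
    return " ".join(words)


print(simple_normalize("Um, Hello -- World!"))
```

Applying the same normalizer to both reference and hypothesis keeps casing, punctuation, and hesitation sounds from being counted as recognition errors.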

Full-dataset FP32 baseline (83,173 samples, 167.9 h audio)

Reference run with the upstream nvidia/parakeet-tdt-0.6b-v3 in FP32:

Dataset	WER	Samples	Audio
AMI	12.02%	12,643	8.7 h
Earnings22	11.83%	2,731	5.3 h
GigaSpeech	9.78%	19,931	35.4 h
LibriSpeech (clean)	1.99%	2,611	5.3 h
LibriSpeech (other)	3.65%	2,932	5.3 h
SPGISpeech	3.87%	39,341	100.0 h
TEDLIUM	3.07%	1,154	2.6 h
VoxPopuli	6.13%	1,830	4.8 h
Average	6.54%	83,173	167.9 h

FP32 vs Embedl INT8 — matched 500 samples per dataset

Both paths are evaluated on identical sample indices (evenly spaced across each dataset) to isolate quantization accuracy loss from sampling variance. Lower WER is better.
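
One way to pick evenly spaced indices is sketched below. The exact spacing rule used for this evaluation is not stated, so the formula here is an assumption, and `evenly_spaced_indices` is a hypothetical helper.

```python
def evenly_spaced_indices(n_total, n_pick=500):
    """Pick n_pick distinct indices spread evenly over range(n_total)."""
    if n_total <= n_pick:
        return list(range(n_total))
    # ASSUMPTION: floor-based spacing; other rounding rules are equally valid
    return [(i * n_total) // n_pick for i in range(n_pick)]


# AMI has 12,643 samples in the full test set.
indices = evenly_spaced_indices(12_643)
print(len(indices), indices[0], indices[-1])
```

Because the indices depend only on the dataset size, the FP32 and INT8 runs see exactly the same utterances.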

Dataset	FP32 WER	Embedl INT8 WER	Δ WER
AMI	12.54%	13.54%	+1.00%
Earnings22	12.55%	11.72%	−0.83%
GigaSpeech	9.97%	10.21%	+0.23%
LibriSpeech (clean)	2.00%	2.10%	+0.10%
LibriSpeech (other)	3.32%	3.70%	+0.38%
SPGISpeech	4.04%	4.15%	+0.11%
TEDLIUM	2.68%	2.76%	+0.08%
VoxPopuli	6.07%	6.25%	+0.18%
Average	6.65%	6.80%	+0.16%

Embedl INT8 quantization adds only +0.16 pp absolute WER on average, well within deployment tolerance. The AMI +1.00 pp outlier is within the expected variance for a 500-sample evaluation on spontaneous meeting speech (the highest natural variance of all 8 datasets). The Earnings22 −0.83 pp is likewise best read as small-sample noise rather than a real accuracy gain from quantization: both paths were scored on the same 500 samples, so a difference of this size can come from a few utterances being decoded differently.
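
The averages in the table are unweighted means over the eight datasets. As a check, the sketch below recomputes them from the rounded per-dataset values shown above.

```python
# Per-dataset WER (%) on the matched 500-sample subsets, in table order:
# AMI, Earnings22, GigaSpeech, LibriSpeech clean/other, SPGISpeech, TEDLIUM, VoxPopuli
FP32_WER = [12.54, 12.55, 9.97, 2.00, 3.32, 4.04, 2.68, 6.07]
INT8_WER = [13.54, 11.72, 10.21, 2.10, 3.70, 4.15, 2.76, 6.25]


def mean(values):
    return sum(values) / len(values)


fp32_avg = mean(FP32_WER)
int8_avg = mean(INT8_WER)
delta_pp = int8_avg - fp32_avg  # absolute difference in percentage points

print(f"FP32 {fp32_avg:.2f} %, INT8 {int8_avg:.2f} %, delta {delta_pp:+.2f} pp")
```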

Creating Your Own Optimized Models

This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. You can apply the same workflow to your own models — see the documentation for installation and usage.

License

Component	License
Optimized model artifacts (this repo)	Embedl Models Community Licence v1.0 — no redistribution as a hosted service
Upstream architecture and weights	Parakeet Tdt 0.6B V3 License

Contact

We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.

Community & support

Need help with this model? Chat with the Embedl team and other engineers on Discord.

Quantization gotchas, hardware questions, fine-tuning tips — bring them all.

Join our Discord →
