Embedl Parakeet Tdt 0.6B V3 (Quantized for TensorRT)
Deployable INT8-quantized version of nvidia/parakeet-tdt-0.6b-v3,
optimized with embedl-deploy
for low-latency NVIDIA TensorRT speech recognition on edge GPUs.
Upstream Model
Highlights
- Mixed-precision INT8/FP16 quantization of the Conformer encoder via embedl-deploy; the TDT decoder stays in FP32 because it is small and autoregressive.
- Drop-in replacement for the upstream nvidia/parakeet-tdt-0.6b-v3 encoder, with the same input: a log-mel spectrogram of 3000 frames × 128 bins.
- Validated accuracy within about 0.16 pp average WER of the FP32 baseline on the Open ASR Leaderboard.
- Ships the quantized ONNX model (for TensorRT) and a runnable inference script. The TensorRT engine built on the first run is cached next to the ONNX file.
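The caching behaviour in the last bullet can be pictured with a small sketch. It assumes the engine sits beside the ONNX file with an `.engine` suffix and is rebuilt when it is missing or older than the ONNX file. The helper names `engine_path_for` and `needs_build` are hypothetical, and the shipped `infer_trt.py` may use a different file name or staleness rule.

```python
from pathlib import Path


def engine_path_for(onnx_path: Path) -> Path:
    """Hypothetical convention: same directory and stem, '.engine' suffix."""
    return onnx_path.with_suffix(".engine")


def needs_build(onnx_path: Path) -> bool:
    """Return True when no cached engine exists or the ONNX file is newer."""
    engine_path = engine_path_for(onnx_path)
    if not engine_path.exists():
        return True
    return engine_path.stat().st_mtime < onnx_path.stat().st_mtime
```

A caller would run the slow TensorRT build only when `needs_build(...)` returns `True` and otherwise deserialize the cached engine. Deleting the `.engine` file forces a rebuild, which is typically needed after a TensorRT upgrade or a move to a different GPU.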
Quick Start
Install the latest version of transformers plus the packages needed to read audio files. You need an NVIDIA GPU with driver ≥ 525 (CUDA 12.x) and the TensorRT Python package, which is usually installed system-wide on NVIDIA Jetson Orin devices. Take care when installing Torch and TensorRT: a plain pip install pulls CUDA 13 builds, which may not work on the Orin. The CPU wheel for torch is sufficient because TensorRT does the heavy lifting during inference.
python3 -m venv venv --system-site-packages
source venv/bin/activate
pip install transformers
pip install soundfile librosa
pip install --force-reinstall --no-cache-dir "Pillow>=10.0.0"
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cpu
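After these commands it can help to confirm that the virtual environment actually sees every package, in particular the system-wide TensorRT that `--system-site-packages` exposes. The following is a minimal sketch using only the standard library. The module list is an assumption derived from the commands in this section, and `check_environment` is a hypothetical helper, not part of the repository.

```python
import importlib.util
from importlib import metadata

# Import names of the packages this Quick Start relies on (assumed list).
REQUIRED_MODULES = ["tensorrt", "torch", "transformers", "soundfile", "librosa", "huggingface_hub"]


def check_environment(modules=REQUIRED_MODULES):
    """Map each module name to its installed version, or None if it is missing."""
    report = {}
    for name in modules:
        if importlib.util.find_spec(name) is None:
            report[name] = None
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this name
            # (for example a system package installed outside pip).
            report[name] = "unknown"
    return report


for name, version in check_environment().items():
    print(f"{name:16s} {'MISSING' if version is None else version}")
```

A `MISSING` entry for `tensorrt` usually means the venv was created without `--system-site-packages` or that TensorRT is not installed on the system.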
Download the quantized ONNX model and run inference on a sample audio file. The first run builds the TensorRT engine from the quantized ONNX file.
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/parakeet-tdt-0.6b-v3-quantized-tensorrt', local_dir='.')"
python infer_trt.py --audio path/to/speech.wav
Files
| File | Purpose |
|---|---|
| embedl_parakeet-tdt-0.6b-v3_int8.onnx | INT8-quantized encoder ONNX with Q/DQ nodes. |
| infer_trt.py | Build a TRT engine from the ONNX and transcribe a WAV. |
Demo: Nehru "Tryst with Destiny" (1947)
A 4 min 41 s archival speech in English with a strong regional accent and period audio quality, which makes it a stress test for any modern ASR model. The demo MP3 is decoded to 16 kHz mono and split into 28 s chunks (below the 30 s encoder window) before being fed through the INT8-quantized encoder.
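The chunking arithmetic can be sketched as follows. This assumes consecutive, non-overlapping chunks and a frame rate of 100 frames per second (3000 frames per 30 s window, as stated for the encoder input). The real `infer_trt.py` may overlap chunks or pad differently, and `chunk_bounds` and `frames_for` are hypothetical helper names.

```python
import math

SAMPLE_RATE = 16_000   # Hz, decoded audio rate stated above
CHUNK_SECONDS = 28     # chunk length used in the demo
WINDOW_SECONDS = 30    # encoder window
WINDOW_FRAMES = 3000   # encoder input frames for one window


def chunk_bounds(num_samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Return (start, end) sample indices of consecutive, non-overlapping chunks."""
    chunk_len = chunk_seconds * sample_rate
    return [(start, min(start + chunk_len, num_samples))
            for start in range(0, num_samples, chunk_len)]


def frames_for(num_samples, sample_rate=SAMPLE_RATE):
    """Approximate log-mel frame count, assuming 100 frames per second."""
    frames_per_second = WINDOW_FRAMES / WINDOW_SECONDS
    return math.ceil(num_samples / sample_rate * frames_per_second)


total = round(280.9 * SAMPLE_RATE)  # nominal demo duration from the results table
bounds = chunk_bounds(total)
print(len(bounds), "chunks; last chunk is", (bounds[-1][1] - bounds[-1][0]) / SAMPLE_RATE, "s")
print("frames in a full chunk:", frames_for(CHUNK_SECONDS * SAMPLE_RATE), "of", WINDOW_FRAMES)
```

Under these assumptions a full 28 s chunk fills 2800 of the 3000 input frames, so each chunk has to be padded up to the static encoder shape before inference.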
Result (Embedl Parakeet INT8 encoder + upstream TDT decoder, scored against the verified ground-truth transcript after Whisper-style normalization):
| Metric | Value |
|---|---|
| Audio duration | 280.9 s (4 min 41 s) |
| Ground-truth words | 520 |
| Parakeet hypothesis words | 526 |
| Word Error Rate | 5.58 % |
The Parakeet INT8 output transcript is provided for direct comparison.
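For readers who want to reproduce the metric, here is a minimal sketch of word error rate as word-level edit distance divided by the reference word count. It assumes both strings are already normalized. The sentence pair is a made-up toy example, not taken from the demo output. Under this definition, 5.58 % against 520 reference words corresponds to roughly 29 word errors (29 / 520 ≈ 5.58 %).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Toy pair for illustration only (hypothesis invented for this sketch).
ref = "long years ago we made a tryst with destiny"
hyp = "long years ago we made tryst with the destiny"
print(f"{100 * word_error_rate(ref, hyp):.2f} %")
```

Because insertions count as errors, the hypothesis can contain more words than the reference (526 vs 520 in the table) and WER can in principle exceed 100 %.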
Try it yourself:
curl -O https://huggingface.co/datasets/embedl/documentation-images/resolve/main/parakeet-tdt-0.6b-v3-quantized-tensorrt/nehru_tryst.mp3
python infer_trt.py --audio nehru_tryst.mp3
Performance
Encoder benchmarked via trtexec, GPU compute time only
(--noDataTransfers), CUDA Graph + Spin Wait enabled, 1000 iterations
after a 2 s warm-up. Input is a static (1, 3000, 128) log-mel
spectrogram (30 s window). Engine size is the on-disk .engine file;
peak GPU memory is the engine plus the per-context activation pool
reported by trtexec.
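A benchmark with these settings could be launched roughly as sketched below. The exact command line used for the published numbers is not given here, so treat this as an assumption-laden reconstruction: the flags are standard `trtexec` options, but the precision flags for the "Embedl INT8 + FP16" row (`--int8 --fp16`) are a guess, and `trtexec_command` is a hypothetical helper.

```python
import shlex


def trtexec_command(onnx_path, fp16=False, int8=False, save_engine=None):
    """Assemble a trtexec argv list mirroring the benchmark settings described above."""
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        "--noDataTransfers",  # time GPU compute only
        "--useCudaGraph",     # CUDA Graph enabled
        "--useSpinWait",      # Spin Wait enabled
        "--warmUp=2000",      # warm-up in milliseconds (2 s)
        "--iterations=1000",  # run at least 1000 timed iterations
    ]
    if fp16:
        cmd.append("--fp16")
    if int8:
        cmd.append("--int8")
    if save_engine:
        cmd.append(f"--saveEngine={save_engine}")
    return cmd


cmd = trtexec_command("embedl_parakeet-tdt-0.6b-v3_int8.onnx", fp16=True, int8=True)
print(shlex.join(cmd))
```

On a machine with TensorRT installed, the list can be passed to `subprocess.run(cmd, check=True)`. Building the argv as a list rather than a shell string avoids quoting problems with file paths.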
NVIDIA L4 (TensorRT 10.16)
| Precision | Mean Latency | p95 | Throughput | Engine Size | Peak GPU Memory | Speedup vs FP32 |
|---|---|---|---|---|---|---|
| FP32 | 32.82 ms | 33.62 ms | 30.5 qps | 2364 MiB | 2570 MiB | 1.00× |
| FP16 | 16.83 ms | 17.10 ms | 59.4 qps | 1177 MiB | 1280 MiB | 1.95× |
| Embedl INT8 + FP16 | 14.57 ms | 14.74 ms | 68.6 qps | 758 MiB | 862 MiB | 2.25× |
NVIDIA Jetson AGX Orin (TensorRT 10.3, JetPack 6)
| Precision | Mean Latency | p95 | Throughput | Engine Size | Peak GPU Memory | Speedup vs FP32 |
|---|---|---|---|---|---|---|
| FP32 | 62.40 ms | 62.46 ms | 16.0 qps | 2331 MiB | 2526 MiB | 1.00× |
| FP16 | 35.62 ms | 35.66 ms | 28.1 qps | 1174 MiB | 1274 MiB | 1.75× |
| Embedl INT8 + FP16 | 34.32 ms | 34.35 ms | 29.1 qps | 706 MiB | 806 MiB | 1.82× |
The Embedl INT8 + FP16 engine is 2.25× faster than FP32 on L4 and 1.82× faster on AGX Orin, with an engine 3.1× to 3.3× smaller than FP32 across both targets.
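The derived figures follow directly from the table columns, as this short sketch shows. It assumes speedup is the ratio of mean latencies, throughput is approximately the reciprocal of mean latency, and the activation pool is peak GPU memory minus engine size; `derived_metrics` is a hypothetical helper.

```python
# (mean latency ms, engine MiB, peak GPU MiB), copied from the tables above.
RESULTS = {
    "L4": {"FP32": (32.82, 2364, 2570), "FP16": (16.83, 1177, 1280), "INT8+FP16": (14.57, 758, 862)},
    "AGX Orin": {"FP32": (62.40, 2331, 2526), "FP16": (35.62, 1174, 1274), "INT8+FP16": (34.32, 706, 806)},
}


def derived_metrics(device, precision):
    """Return speedup vs FP32, throughput (qps), engine shrink factor, activation pool (MiB)."""
    base_latency, base_engine, _ = RESULTS[device]["FP32"]
    latency, engine, peak = RESULTS[device][precision]
    return {
        "speedup": base_latency / latency,
        "qps": 1000.0 / latency,
        "engine_shrink": base_engine / engine,
        "activation_mib": peak - engine,
    }


for device in RESULTS:
    m = derived_metrics(device, "INT8+FP16")
    print(f"{device}: {m['speedup']:.2f}x faster, {m['qps']:.1f} qps, "
          f"{m['engine_shrink']:.1f}x smaller engine, {m['activation_mib']} MiB activations")
```

The activation pool is about 100 MiB for both the FP16 and INT8 + FP16 engines, so most of the memory saving over FP16 comes from the smaller weights.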
Accuracy (Open ASR Leaderboard)
Evaluated on the Open ASR Leaderboard test suites with the official Whisper-style English text normalizer (lowercase, number expansion, filler-word removal). Lower WER is better.
Full-dataset FP32 baseline (83,173 samples, 167.9 h audio)
Reference run with the upstream nvidia/parakeet-tdt-0.6b-v3 in FP32:
| Dataset | WER | Samples | Audio |
|---|---|---|---|
| AMI | 12.02% | 12,643 | 8.7 h |
| Earnings22 | 11.83% | 2,731 | 5.3 h |
| GigaSpeech | 9.78% | 19,931 | 35.4 h |
| LibriSpeech (clean) | 1.99% | 2,611 | 5.3 h |
| LibriSpeech (other) | 3.65% | 2,932 | 5.3 h |
| SPGISpeech | 3.87% | 39,341 | 100.0 h |
| TEDLIUM | 3.07% | 1,154 | 2.6 h |
| VoxPopuli | 6.13% | 1,830 | 4.8 h |
| Average | 6.54% | 83,173 | 167.9 h |
FP32 vs Embedl INT8: matched 500 samples per dataset
Both paths are evaluated on identical sample indices (evenly spaced across each dataset) to isolate quantization accuracy loss from sampling variance. Lower WER is better.
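One way to pick evenly spaced indices is sketched below. The exact selection rule used for this evaluation is not stated, so the integer-stride formula and the helper name `evenly_spaced_indices` are assumptions. The point is that the selection is deterministic, so the FP32 and INT8 runs score exactly the same utterances.

```python
def evenly_spaced_indices(dataset_size, num_samples=500):
    """Pick num_samples distinct indices spread uniformly over range(dataset_size)."""
    if dataset_size <= num_samples:
        return list(range(dataset_size))
    return [(i * dataset_size) // num_samples for i in range(num_samples)]


# AMI test set size taken from the full-dataset table above.
ami = evenly_spaced_indices(12_643)
print(len(ami), ami[:4], ami[-1])
```

Compared with random sampling, a fixed stride needs no seed and covers the whole dataset order, which matters when a test set is grouped by speaker or recording.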
| Dataset | FP32 WER | Embedl INT8 WER | Δ WER |
|---|---|---|---|
| AMI | 12.54% | 13.54% | +1.00% |
| Earnings22 | 12.55% | 11.72% | −0.83% |
| GigaSpeech | 9.97% | 10.21% | +0.23% |
| LibriSpeech (clean) | 2.00% | 2.10% | +0.10% |
| LibriSpeech (other) | 3.32% | 3.70% | +0.38% |
| SPGISpeech | 4.04% | 4.15% | +0.11% |
| TEDLIUM | 2.68% | 2.76% | +0.08% |
| VoxPopuli | 6.07% | 6.25% | +0.18% |
| Average | 6.65% | 6.80% | +0.16% |
Embedl INT8 quantization adds only +0.16 pp absolute WER on average, well within deployment tolerance. The AMI +1.00 pp outlier is within the expected variance for a 500-sample evaluation on spontaneous meeting speech (the highest natural variance of all 8 datasets). The Earnings22 −0.83 pp improvement is most likely small-sample noise rather than a real accuracy gain from quantization, since both runs score the same 500 samples.
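The averages can be checked from the per-dataset rows. This sketch assumes the Average row is an unweighted (macro) mean over the eight datasets, which reproduces the 6.65 % and 6.80 % figures. Because the inputs are rounded to two decimals, the recomputed delta agrees with the reported +0.16 pp only to within rounding.

```python
# Per-dataset WER in percent (FP32, Embedl INT8), copied from the table above.
WER = {
    "AMI": (12.54, 13.54),
    "Earnings22": (12.55, 11.72),
    "GigaSpeech": (9.97, 10.21),
    "LibriSpeech (clean)": (2.00, 2.10),
    "LibriSpeech (other)": (3.32, 3.70),
    "SPGISpeech": (4.04, 4.15),
    "TEDLIUM": (2.68, 2.76),
    "VoxPopuli": (6.07, 6.25),
}

fp32_avg = sum(fp32 for fp32, _ in WER.values()) / len(WER)
int8_avg = sum(int8 for _, int8 in WER.values()) / len(WER)
delta_pp = int8_avg - fp32_avg  # absolute difference in percentage points
relative = delta_pp / fp32_avg  # relative WER increase

print(f"FP32 {fp32_avg:.2f} %, INT8 {int8_avg:.2f} %, "
      f"delta {delta_pp:+.3f} pp ({100 * relative:.1f} % relative)")
```

Reporting the difference in percentage points (pp) rather than percent keeps the absolute change distinct from the relative increase, which is a little over 2 % of the FP32 WER here.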
Creating Your Own Optimized Models
This artifact was produced with embedl-deploy, Embedl's open-source PyTorch-to-TensorRT deployment library. You can apply the same workflow to your own models; see the documentation for installation and usage.
License
| Component | License |
|---|---|
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 (no redistribution as a hosted service) |
| Upstream architecture and weights | Parakeet Tdt 0.6B V3 License |
Contact
We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.