---
license: mit
library_name: mlx
tags:
- mlx
- audio
- speech
- feature-extraction
- contentvec
- hubert
- voice-conversion
- rvc
datasets:
- librispeech_asr
language:
- en
pipeline_tag: feature-extraction
---

# MLX ContentVec / HuBERT Base

MLX-converted weights for the ContentVec/HuBERT base model, optimized for Apple Silicon.

This model extracts speaker-agnostic semantic features from audio and is primarily used as the feature-extraction backbone for [RVC (Retrieval-based Voice Conversion)](https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI).

## Model Details

- **Architecture**: HuBERT Base (12 transformer layers)
- **Parameters**: ~90M
- **Input**: 16 kHz mono audio
- **Output**: 768-dimensional features (~50 frames/second)
- **Framework**: [MLX](https://github.com/ml-explore/mlx)
- **Format**: SafeTensors (float32)

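The ~50 frames/second rate comes from HuBERT's convolutional front end, which downsamples 16 kHz audio by a total stride of 320. A minimal sketch of the frame-count arithmetic, assuming the standard HuBERT Base kernel/stride configuration (not verified against this repository's implementation):

```python
# Standard HuBERT Base conv front end: (kernel, stride) per layer.
# Total stride is 5 * 2**6 = 320, i.e. 16000 / 320 ≈ 50 frames/second.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(num_samples: int) -> int:
    """Number of output frames for a mono waveform of `num_samples` samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        length = (length - kernel) // stride + 1  # valid (unpadded) convolution
    return length

print(num_frames(16000))  # one second of 16 kHz audio -> 49 frames
```

Because the convolutions are unpadded, one second of audio yields 49 frames rather than exactly 50.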
## Usage

```python
import mlx.core as mx
import librosa

from mlx_contentvec import ContentvecModel

# Load the model and weights
model = ContentvecModel(encoder_layers_1=0)
model.load_weights("contentvec_base.safetensors")
model.eval()

# Load audio at 16 kHz mono
audio, sr = librosa.load("input.wav", sr=16000, mono=True)
source = mx.array(audio).reshape(1, -1)

# Extract features
result = model(source)
features = result["x"]  # Shape: (1, num_frames, 768)
```

## Installation

```bash
pip install git+https://github.com/example/mlx-contentvec.git
```

|
| | ## Download Weights |
| |
|
| | ```python |
| | from huggingface_hub import hf_hub_download |
| | |
| | weights_path = hf_hub_download( |
| | repo_id="lexandstuff/mlx-contentvec", |
| | filename="contentvec_base.safetensors" |
| | ) |
| | ``` |

## Validation

These weights match the outputs of the original PyTorch implementation to within float32 precision:

| Metric | Value |
|--------|-------|
| Max absolute difference | 7.3e-6 |
| Cosine similarity | 1.000000 |

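The two metrics above can be reproduced with a simple comparison helper. A minimal sketch using NumPy; `parity_metrics` is an illustrative helper, not part of this repository's API, and the inputs are assumed to be the MLX and PyTorch feature tensors converted to NumPy arrays:

```python
import numpy as np

def parity_metrics(a: np.ndarray, b: np.ndarray) -> tuple[float, float]:
    """Max absolute difference and cosine similarity of two feature tensors."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    max_abs_diff = float(np.max(np.abs(a - b)))
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max_abs_diff, cosine
```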
## Source Weights

Converted from [hubert_base.pt](https://huggingface.co/lj1995/VoiceConversionWebUI/resolve/main/hubert_base.pt) (MD5: `b76f784c1958d4e535cd0f6151ca35e4`).

## Use Cases

- **Voice Conversion**: feature extraction for the RVC pipeline
- **Speech Similarity**: speaker-agnostic, content-based audio embeddings
- **Speech Analysis**: semantic feature extraction

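For embedding-style use cases, the frame-level features are commonly mean-pooled into a single vector per utterance. A hedged sketch of that convention (mean pooling plus L2 normalization is a common choice, not something this repository prescribes):

```python
import numpy as np

def utterance_embedding(features: np.ndarray) -> np.ndarray:
    """Mean-pool (1, num_frames, 768) frame features into one L2-normalized vector."""
    pooled = features.mean(axis=1).squeeze(0)  # (768,)
    return pooled / np.linalg.norm(pooled)

def similarity(e1: np.ndarray, e2: np.ndarray) -> float:
    """Cosine similarity between two L2-normalized utterance embeddings."""
    return float(e1 @ e2)
```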
## Citation

```bibtex
@inproceedings{qian2022contentvec,
  title={ContentVec: An Improved Self-Supervised Speech Representation by Disentangling Speakers},
  author={Qian, Kaizhi and Zhang, Yang and Gao, Heting and Ni, Junrui and Lai, Cheng-I and Cox, David and Hasegawa-Johnson, Mark and Chang, Shiyu},
  booktitle={International Conference on Machine Learning},
  year={2022}
}

@article{hsu2021hubert,
  title={HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units},
  author={Hsu, Wei-Ning and others},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2021}
}
```

## License

MIT