---
license: mit
tags:
- llama-cpp-python
- cuda
- gemma
- gemma-3
- windows
- wheel
- prebuilt
- whl
- local-llm
---

# llama-cpp-python Prebuilt Wheel (Windows x64, CUDA 12.8, Gemma 3 Support)

---

🛠️ **Built with** [llama.cpp (b5192)](https://github.com/ggml-org/llama.cpp) + [CUDA 12.8](https://developer.nvidia.com/cuda-toolkit)

---

**Prebuilt `.whl` for llama-cpp-python 0.3.8 with CUDA 12.8 acceleration and full Gemma 3 model support (Windows x64).**

This repository provides a prebuilt Python wheel (`.whl`) file for **llama-cpp-python**, specifically compiled for Windows 10/11 (x64) with NVIDIA CUDA 12.8 acceleration enabled.

Building `llama-cpp-python` with CUDA support on Windows can be a complex process involving specific Visual Studio configurations, CUDA Toolkit setup, and environment variables. This prebuilt wheel aims to simplify installation for users with compatible systems.

This build uses version `0.3.8` of the **llama-cpp-python** bindings and the underlying **llama.cpp** source code as of **April 26, 2025** (release b5192). It has been verified to work with **Gemma 3 models**, correctly offloading layers to the GPU.

---

## Features

- **Prebuilt for Windows x64**: Ready to install using `pip` on 64-bit Windows systems.
- **CUDA 12.8 Accelerated**: Leverages your NVIDIA GPU for faster inference.
- **Gemma 3 Support**: Verified compatibility with Gemma 3 models.
- **llama-cpp-python `0.3.8`**: Based on version `0.3.8` of the Python bindings.
- **llama.cpp b5192**: Built from [llama.cpp release b5192](https://github.com/ggml-org/llama.cpp/releases/tag/b5192) from April 26, 2025.

---

## Compatibility & Prerequisites

To use this wheel, you must have:

- An **NVIDIA GPU**.
- NVIDIA drivers compatible with **CUDA 12.8** installed (see the quick check below).
- **Windows 10 or Windows 11 (x64)**.
- **Python 3.11** (the wheel is tagged `cp311` and will only install on Python 3.11).
- The **Visual C++ Redistributable for Visual Studio 2015-2022** installed.
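
Before installing, you can sanity-check the driver and interpreter with standard tools (these are general checks, not specific to this wheel):

```bash
# Reports your driver version and the highest CUDA version it supports
nvidia-smi

# Should report a 64-bit Python 3.11.x
python --version
python -c "import platform; print(platform.architecture()[0])"
```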

---

## Installation

It is highly recommended to install this wheel within a Python virtual environment.

1. Ensure you have met all the prerequisites listed above.
2. Create and activate a Python virtual environment:

```bash
python -m venv venv_llama
.\venv_llama\Scripts\activate
```

3. Download the `.whl` file from this repository's **Releases** section.
4. Open your Command Prompt or PowerShell.
5. Navigate to the directory where you downloaded the `.whl` file.
6. Install the wheel using `pip`:

```bash
pip install llama_cpp_python-0.3.8+cu128.gemma3-cp311-cp311-win_amd64.whl
```
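
As a quick post-install check (a general pattern, not specific to this build; recent llama-cpp-python releases expose the version as `llama_cpp.__version__`), confirm that the package imports and reports the expected version:

```bash
python -c "import llama_cpp; print(llama_cpp.__version__)"
```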

---

## Verification (Check CUDA Usage)

To verify that `llama-cpp-python` is using your GPU via CUDA after installation:

```bash
python -c "from llama_cpp import Llama; model = Llama(model_path='path/to/a/small/model.gguf', n_gpu_layers=-1, verbose=True); print('Initialization attempted. Check the log output above for GPU offload messages.')"
```

Note: Replace `path/to/a/small/model.gguf` with the actual path to a small `.gguf` model file.

Look for log messages indicating layers being offloaded to the GPU, such as `assigned to device CUDA0`, or CUDA memory buffer reports.

## Alternative Verification: Python Script

If you prefer, you can verify that `llama-cpp-python` is correctly using CUDA by running a small Python script inside your virtual environment.

Replace the placeholder paths below with your actual `.dll` and `.gguf` file locations:

```python
import os

# Note: LLAMA_CPP_LIB must be set BEFORE importing llama_cpp,
# because the library DLL is loaded at import time.
os.environ['LLAMA_CPP_LIB'] = r'PATH_TO_YOUR_CUSTOM_LLAMA_DLL'

from llama_cpp import Llama

try:
    print('Attempting to initialize Llama with GPU offload (-1 layers)...')

    # Initialize the Llama model with full GPU offloading
    model = Llama(
        model_path=r'PATH_TO_YOUR_MODEL_FILE.gguf',
        n_gpu_layers=-1,
        verbose=True
    )

    print('Initialization attempted. Check the output above for CUDA device assignments (e.g., CUDA0, CUDA1).')

except FileNotFoundError:
    print('Error: Model file not found. Please double-check your model_path.')
except Exception as e:
    print(f'An error occurred during initialization: {e}')
```

**What to look for in the output:**

- Lines like `assigned to device CUDA0` or `assigned to device CUDA1`.
- VRAM buffer allocations such as `CUDA0 model buffer size = ... MiB`.
- Confirmation that your GPU(s) are being used for model layer offloading.

## Usage

Once installed and verified, you can use `llama-cpp-python` in your projects as you normally would. Refer to the [official llama-cpp-python documentation](https://llama-cpp-python.readthedocs.io/) for detailed usage instructions; a minimal example follows.
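
For example, a minimal sketch that loads a GGUF model with full GPU offload and runs a short completion (the model path is a placeholder you must replace):

```python
from llama_cpp import Llama

# Load a GGUF model, offloading all layers to the GPU
llm = Llama(model_path=r'PATH_TO_YOUR_MODEL_FILE.gguf', n_gpu_layers=-1)

# Run a short completion and print the generated text
output = llm('Q: What is the capital of France? A:', max_tokens=32)
print(output['choices'][0]['text'])
```
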
## Acknowledgments

This prebuilt wheel is based on the excellent [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) project by Andrei Betlen (@abetlen). All credit for the core library and Python bindings goes to the original maintainers, and to [llama.cpp](https://github.com/ggml-org/llama.cpp) by Georgi Gerganov (@ggerganov) and the ggml team.

This specific wheel was built by Bernard Peter Fitzgerald (@boneylizardwizard) using the source code from abetlen/llama-cpp-python, compiled with CUDA 12.8 support for Windows x64 systems, and verified for Gemma 3 model compatibility.

## License

This prebuilt wheel is distributed under the MIT License, the same license as the original llama-cpp-python project.

## Reporting Issues

If you encounter issues specifically with installing this prebuilt wheel or getting CUDA offloading to work using this wheel, please report them on this repository's Issue Tracker.

For general issues with llama-cpp-python itself, please report them upstream at the [official llama-cpp-python GitHub Issues page](https://github.com/abetlen/llama-cpp-python/issues).