Instructions to use TheBloke/WizardCoder-Python-34B-V1.0-GPTQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use TheBloke/WizardCoder-Python-34B-V1.0-GPTQ with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="TheBloke/WizardCoder-Python-34B-V1.0-GPTQ")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TheBloke/WizardCoder-Python-34B-V1.0-GPTQ")
model = AutoModelForCausalLM.from_pretrained("TheBloke/WizardCoder-Python-34B-V1.0-GPTQ")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use TheBloke/WizardCoder-Python-34B-V1.0-GPTQ with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "TheBloke/WizardCoder-Python-34B-V1.0-GPTQ"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TheBloke/WizardCoder-Python-34B-V1.0-GPTQ",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
```shell
docker model run hf.co/TheBloke/WizardCoder-Python-34B-V1.0-GPTQ
```
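The curl example above can also be issued from Python using only the standard library. A minimal sketch, assuming the vLLM server from the previous step is listening on localhost:8000 (the helper names `build_completion_request` and `complete` are illustrative, not part of vLLM):

```python
import json
import urllib.request

def build_completion_request(model, prompt, max_tokens=512, temperature=0.5):
    # Same fields as the curl example above.
    return {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": temperature}

def complete(base_url, payload):
    # POST to the OpenAI-compatible completions endpoint and parse the JSON reply.
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the server running:
# body = build_completion_request("TheBloke/WizardCoder-Python-34B-V1.0-GPTQ",
#                                 "Once upon a time,")
# print(complete("http://localhost:8000", body)["choices"][0]["text"])
```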
- SGLang
How to use TheBloke/WizardCoder-Python-34B-V1.0-GPTQ with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "TheBloke/WizardCoder-Python-34B-V1.0-GPTQ" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TheBloke/WizardCoder-Python-34B-V1.0-GPTQ",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "TheBloke/WizardCoder-Python-34B-V1.0-GPTQ" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TheBloke/WizardCoder-Python-34B-V1.0-GPTQ",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use TheBloke/WizardCoder-Python-34B-V1.0-GPTQ with Docker Model Runner:
```shell
docker model run hf.co/TheBloke/WizardCoder-Python-34B-V1.0-GPTQ
```
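Whichever serving path you pick, the WizardCoder family is trained on an Alpaca-style prompt, so raw prompts like "Once upon a time," work but instruction-formatted prompts give better results. A minimal sketch of the template as published on the WizardCoder model cards (verify against the card of the exact checkpoint you are serving):

```python
# Alpaca-style template used by WizardCoder models.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def build_prompt(instruction: str) -> str:
    # Wrap a plain instruction in the template before sending it to the server.
    return PROMPT_TEMPLATE.format(instruction=instruction)
```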
The model does not load in text-generation-webui: out-of-memory error
```
2023-10-07 00:24:03 INFO:Loading TheBloke_WizardCoder-Python-34B-V1.0-GPTQ_gptq-4bit-64g-actorder_True...
2023-10-07 00:24:03 INFO:The AutoGPTQ params are: {'model_basename': 'model', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': {0: '24500MiB', 'cpu': '32600MiB'}, 'quantize_config': None, 'use_cuda_fp16': True, 'disable_exllama': False}
2023-10-07 00:24:35 ERROR:Failed to load the model.
Traceback (most recent call last):
  File "E:\ai\ruai\saiga\text-generation-webui\modules\ui_model_menu.py", line 194, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "E:\ai\ruai\saiga\text-generation-webui\modules\models.py", line 75, in load_model
    output = load_func_map[loader](model_name)
  File "E:\ai\ruai\saiga\text-generation-webui\modules\models.py", line 316, in AutoGPTQ_loader
    return modules.AutoGPTQ_loader.load_quantized(model_name)
  File "E:\ai\ruai\saiga\text-generation-webui\modules\AutoGPTQ_loader.py", line 57, in load_quantized
    model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
  File "C:\Users\remot\.conda\envs\textgen2\lib\site-packages\auto_gptq\modeling\auto.py", line 108, in from_quantized
    return quant_func(
  File "C:\Users\remot\.conda\envs\textgen2\lib\site-packages\auto_gptq\modeling\_base.py", line 875, in from_quantized
    accelerate.utils.modeling.load_checkpoint_in_model(
  File "C:\Users\remot\.conda\envs\textgen2\lib\site-packages\accelerate\utils\modeling.py", line 1335, in load_checkpoint_in_model
    checkpoint = load_state_dict(checkpoint_file, device_map=device_map)
  File "C:\Users\remot\.conda\envs\textgen2\lib\site-packages\accelerate\utils\modeling.py", line 1164, in load_state_dict
    return safe_load_file(checkpoint_file, device=list(device_map.values())[0])
  File "C:\Users\remot\.conda\envs\textgen2\lib\site-packages\safetensors\torch.py", line 311, in load_file
    result[k] = f.get_tensor(k)
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 90177536 bytes.
```
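As a sanity check on the error itself: the failed allocation is small, and it comes from `DefaultCPUAllocator`, meaning system RAM/commit charge ran out while loading the checkpoint shards onto the CPU, not GPU VRAM. When an allocation this small fails, the process has usually already consumed nearly all available memory on earlier tensors:

```python
# Size of the allocation that failed, taken from the RuntimeError above.
failed_alloc = 90_177_536  # bytes

# Convert to MiB: it is exactly 86 MiB, tiny compared to the ~17 GB of
# quantized weights that have to pass through CPU RAM during loading.
print(failed_alloc / 2**20)  # 86.0
```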
Initially it failed trying to allocate 120 MB; after I increased the swap file, it failed at 90 MB. I suspect this isn't actually a memory problem, but I may be wrong.
I expanded the swap file from 32 GB to 64 GB.
My system specs:
Windows 10 (Miniconda)
RTX 3090 (24 GB VRAM)
32 GB RAM
Swap file: 70 GB (drive C (auto) + drive D (64 GB))
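For reference, a rough back-of-envelope on whether this model fits in 24 GB of VRAM, assuming ~34B parameters at 4 bits per weight (the figures are an estimate, not from the model card; quantization metadata, KV cache, and activations add several GiB on top, consistent with the ~21 GB reported below):

```python
params = 34e9          # roughly 34B parameters
bits_per_weight = 4    # GPTQ 4-bit quantization

# Weight storage alone, before KV cache and activation overhead.
weights_gib = params * bits_per_weight / 8 / 2**30
print(f"{weights_gib:.1f} GiB")  # ~15.8 GiB
```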
I am using Linux and loading with ExLlama_HF. It takes roughly 21 GB of VRAM at 16-17 tokens/s.
@DanekBigLike
I would recommend using ExLlama (for GPTQ) or ExLlama v2 (for the EXL2 quant format, which is slightly higher quality and faster than GPTQ), since both of them use less VRAM and are much, much faster than AutoGPTQ, just like donymorph said.
@donymorph any reason you are using ExLlama_HF? ExLlama is considerably faster; the only other difference is that ExLlama_HF supports a few more samplers.