Instructions to use TheBloke/WizardCoder-Python-34B-V1.0-GPTQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use TheBloke/WizardCoder-Python-34B-V1.0-GPTQ with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="TheBloke/WizardCoder-Python-34B-V1.0-GPTQ")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("TheBloke/WizardCoder-Python-34B-V1.0-GPTQ")
model = AutoModelForCausalLM.from_pretrained("TheBloke/WizardCoder-Python-34B-V1.0-GPTQ")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use TheBloke/WizardCoder-Python-34B-V1.0-GPTQ with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "TheBloke/WizardCoder-Python-34B-V1.0-GPTQ"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TheBloke/WizardCoder-Python-34B-V1.0-GPTQ",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
```shell
docker model run hf.co/TheBloke/WizardCoder-Python-34B-V1.0-GPTQ
```
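The curl example above can also be issued from Python using only the standard library. A minimal sketch, assuming the vLLM server from the previous step is listening on localhost:8000 (the helper names `build_completion_request` and `complete` are illustrative, not part of vLLM):

```python
import json
import urllib.request

def build_completion_request(model, prompt, max_tokens=512, temperature=0.5):
    # Same fields as the curl example above.
    return {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": temperature}

def complete(base_url, payload):
    # POST to the OpenAI-compatible completions endpoint and parse the JSON reply.
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the server running:
# body = build_completion_request("TheBloke/WizardCoder-Python-34B-V1.0-GPTQ",
#                                 "Once upon a time,")
# print(complete("http://localhost:8000", body)["choices"][0]["text"])
```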
- SGLang
How to use TheBloke/WizardCoder-Python-34B-V1.0-GPTQ with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "TheBloke/WizardCoder-Python-34B-V1.0-GPTQ" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TheBloke/WizardCoder-Python-34B-V1.0-GPTQ",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "TheBloke/WizardCoder-Python-34B-V1.0-GPTQ" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "TheBloke/WizardCoder-Python-34B-V1.0-GPTQ",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use TheBloke/WizardCoder-Python-34B-V1.0-GPTQ with Docker Model Runner:
```shell
docker model run hf.co/TheBloke/WizardCoder-Python-34B-V1.0-GPTQ
```
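Whichever serving path you pick, the WizardCoder family is trained on an Alpaca-style prompt, so raw prompts like "Once upon a time," work but instruction-formatted prompts give better results. A minimal sketch of the template as published on the WizardCoder model cards (verify against the card of the exact checkpoint you are serving):

```python
# Alpaca-style template used by WizardCoder models.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def build_prompt(instruction: str) -> str:
    # Wrap a plain instruction in the template before sending it to the server.
    return PROMPT_TEMPLATE.format(instruction=instruction)
```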
The model does not load in text-generation-webui: out-of-memory error
```
2023-10-07 00:24:03 INFO:Loading TheBloke_WizardCoder-Python-34B-V1.0-GPTQ_gptq-4bit-64g-actorder_True...
2023-10-07 00:24:03 INFO:The AutoGPTQ params are: {'model_basename': 'model', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': {0: '24500MiB', 'cpu': '32600MiB'}, 'quantize_config': None, 'use_cuda_fp16': True, 'disable_exllama': False}
2023-10-07 00:24:35 ERROR:Failed to load the model.
Traceback (most recent call last):
  File "E:\ai\ruai\saiga\text-generation-webui\modules\ui_model_menu.py", line 194, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "E:\ai\ruai\saiga\text-generation-webui\modules\models.py", line 75, in load_model
    output = load_func_map[loader](model_name)
  File "E:\ai\ruai\saiga\text-generation-webui\modules\models.py", line 316, in AutoGPTQ_loader
    return modules.AutoGPTQ_loader.load_quantized(model_name)
  File "E:\ai\ruai\saiga\text-generation-webui\modules\AutoGPTQ_loader.py", line 57, in load_quantized
    model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
  File "C:\Users\remot\.conda\envs\textgen2\lib\site-packages\auto_gptq\modeling\auto.py", line 108, in from_quantized
    return quant_func(
  File "C:\Users\remot\.conda\envs\textgen2\lib\site-packages\auto_gptq\modeling\_base.py", line 875, in from_quantized
    accelerate.utils.modeling.load_checkpoint_in_model(
  File "C:\Users\remot\.conda\envs\textgen2\lib\site-packages\accelerate\utils\modeling.py", line 1335, in load_checkpoint_in_model
    checkpoint = load_state_dict(checkpoint_file, device_map=device_map)
  File "C:\Users\remot\.conda\envs\textgen2\lib\site-packages\accelerate\utils\modeling.py", line 1164, in load_state_dict
    return safe_load_file(checkpoint_file, device=list(device_map.values())[0])
  File "C:\Users\remot\.conda\envs\textgen2\lib\site-packages\safetensors\torch.py", line 311, in load_file
    result[k] = f.get_tensor(k)
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 90177536 bytes.
```
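As a sanity check on the error itself: the failed allocation is small, and it comes from `DefaultCPUAllocator`, meaning system RAM/commit charge ran out while loading the checkpoint shards onto the CPU, not GPU VRAM. When an allocation this small fails, the process has usually already consumed nearly all available memory on earlier tensors:

```python
# Size of the allocation that failed, taken from the RuntimeError above.
failed_alloc = 90_177_536  # bytes

# Convert to MiB: it is exactly 86 MiB, tiny compared to the ~17 GB of
# quantized weights that have to pass through CPU RAM during loading.
print(failed_alloc / 2**20)  # 86.0
```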
Initially it failed trying to allocate 120 MB; after I increased the swap file, it failed at 90 MB. I suspect this isn't actually a memory problem, but I may be wrong.
I expanded the swap file from 32 GB to 64 GB.
My system specs:
Windows 10 (Miniconda)
RTX 3090 (24 GB VRAM)
32 GB RAM
Swap file: 70 GB (drive C (auto) + drive D (64 GB))
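For reference, a rough back-of-envelope on whether this model fits in 24 GB of VRAM, assuming ~34B parameters at 4 bits per weight (the figures are an estimate, not from the model card; quantization metadata, KV cache, and activations add several GiB on top, consistent with the ~21 GB reported below):

```python
params = 34e9          # roughly 34B parameters
bits_per_weight = 4    # GPTQ 4-bit quantization

# Weight storage alone, before KV cache and activation overhead.
weights_gib = params * bits_per_weight / 8 / 2**30
print(f"{weights_gib:.1f} GiB")  # ~15.8 GiB
```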
I am using Linux and loading with ExLlama_HF. It takes roughly 21 GB of VRAM at 16-17 tokens/s.
@DanekBigLike
I would recommend using ExLlama (for GPTQ) or ExLlama v2 (for the EXL2 quant format, which is slightly higher quality and faster than GPTQ), since both of them use less VRAM and are much, much faster than AutoGPTQ, just like donymorph said.
@donymorph any reason you are using ExLlama_HF? ExLlama is considerably faster; the only other difference is that ExLlama_HF supports a few more samplers.