How to use abideen/Bitnet-Llama-70M with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="abideen/Bitnet-Llama-70M")

# Or load the tokenizer and model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("abideen/Bitnet-Llama-70M")
model = AutoModelForCausalLM.from_pretrained("abideen/Bitnet-Llama-70M")
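A quick generation call through the pipeline looks like this (the prompt and sampling settings are illustrative, not from the model card):

# Generate a short continuation; sampling settings are illustrative
output = pipe("Once upon a time,", max_new_tokens=64, do_sample=True, temperature=0.5)
print(output[0]["generated_text"])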
How to use abideen/Bitnet-Llama-70M with vLLM:

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "abideen/Bitnet-Llama-70M"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "abideen/Bitnet-Llama-70M",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
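Because the endpoint is OpenAI-compatible, you can also call it from Python with the openai client (pip install openai). A minimal sketch; the api_key is a dummy value, since vLLM does not check it by default:

from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="abideen/Bitnet-Llama-70M",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)

The same pattern works for the SGLang server below; just change base_url to http://localhost:30000/v1.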
How to use abideen/Bitnet-Llama-70M with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "abideen/Bitnet-Llama-70M" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "abideen/Bitnet-Llama-70M",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'

# Or run the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "abideen/Bitnet-Llama-70M" \
--host 0.0.0.0 \
--port 30000
# Call the server with the same curl command shown above.
How to use abideen/Bitnet-Llama-70M with Docker Model Runner:
docker model run hf.co/abideen/Bitnet-Llama-70M
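Instead of the interactive chat, you can pass a one-shot prompt as a trailing argument (assuming a current Docker Model Runner CLI):

docker model run hf.co/abideen/Bitnet-Llama-70M "Once upon a time,"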
Bitnet-Llama-70M is a 70M-parameter model trained using the method described in The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits.
It was trained on a subset of the HuggingFaceTB/cosmopedia dataset for 2 epochs on a single A100, as a small experiment to try out BitNet.
Because of the small model size and limited training, this model is only an experiment and you should not expect good results when chatting with it.
[Figure: Wandb training report — see the original model card.]
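The conversion code below assumes a BitLinear layer is in scope; the model card does not ship one, so here is a minimal sketch following the paper's 1.58-bit training recipe (absmean ternary weight quantization, 8-bit absmax activation quantization, straight-through estimator). Treat it as an illustration, not necessarily the exact layer the model was trained with:

import torch
import torch.nn as nn
import torch.nn.functional as F

def activation_quant(x):
    # Per-token absmax quantization of activations to 8 bits
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp_(min=1e-5)
    return (x * scale).round().clamp_(-128, 127) / scale

def weight_quant(w):
    # Absmean ternary (1.58-bit) weight quantization to {-1, 0, +1} / scale
    scale = 1.0 / w.abs().mean().clamp_(min=1e-5)
    return (w * scale).round().clamp_(-1, 1) / scale

class BitLinear(nn.Linear):
    # Drop-in replacement for nn.Linear with quantized weights and activations
    def forward(self, x):
        # Parameter-free RMSNorm before quantization; this is why the
        # conversion below can replace input_layernorm with nn.Identity
        x = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + 1e-6)
        # Straight-through estimator: quantized forward, full-precision backward
        x_q = x + (activation_quant(x) - x).detach()
        w_q = self.weight + (weight_quant(self.weight) - self.weight).detach()
        return F.linear(x_q, w_q, self.bias)

With that in place, the snippet below loads the checkpoint, swaps the Llama linear projections for BitLinear, and runs a short generation: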
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.llama.modeling_llama import (
    LlamaDecoderLayer,
    LlamaMLP,
    LlamaRMSNorm,
    LlamaSdpaAttention,
)

# Load the pretrained BitNet checkpoint
model_id = "abideen/Bitnet-Llama-70M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def convert_to_bitnet(model, copy_weights):
    for name, module in model.named_modules():
        # Replace the linear projections in attention and MLP blocks with BitLinear
        if isinstance(module, (LlamaSdpaAttention, LlamaMLP)):
            for child_name, child_module in module.named_children():
                if isinstance(child_module, nn.Linear):
                    bitlinear = BitLinear(
                        child_module.in_features,
                        child_module.out_features,
                        child_module.bias is not None,
                    ).to(device="cuda:0")
                    if copy_weights:
                        bitlinear.weight = child_module.weight
                        if child_module.bias is not None:
                            bitlinear.bias = child_module.bias
                    setattr(module, child_name, bitlinear)
        # Remove the redundant input_layernorm (BitLinear normalizes its own input)
        elif isinstance(module, LlamaDecoderLayer):
            for child_name, child_module in module.named_children():
                if isinstance(child_module, LlamaRMSNorm) and child_name == "input_layernorm":
                    setattr(module, child_name, nn.Identity().to(device="cuda:0"))

convert_to_bitnet(model, copy_weights=True)
model.to(device="cuda:0")

prompt = "What is Machine Learning?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generate_ids = model.generate(inputs.input_ids, max_length=100)
print(tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])