Instructions for using nur-dev/llama-1.9B-kaz-instruct with libraries and local apps.

Libraries

How to use nur-dev/llama-1.9B-kaz-instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nur-dev/llama-1.9B-kaz-instruct")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nur-dev/llama-1.9B-kaz-instruct")
model = AutoModelForCausalLM.from_pretrained("nur-dev/llama-1.9B-kaz-instruct")

Local Apps

vLLM

How to use nur-dev/llama-1.9B-kaz-instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nur-dev/llama-1.9B-kaz-instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/llama-1.9B-kaz-instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
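
The same request can be issued from Python. The following is a minimal sketch that uses only the standard library and assumes the server started above is listening on localhost:8000; the helper names `build_completion_request` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "nur-dev/llama-1.9B-kaz-instruct"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    # OpenAI-compatible completions responses carry the text in choices[0].text
    return body["choices"][0]["text"]
```

Calling `complete("Once upon a time,")` mirrors the curl command. For an SGLang server on port 30000, pass `base_url="http://localhost:30000"`.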

Use Docker

docker model run hf.co/nur-dev/llama-1.9B-kaz-instruct

SGLang

How to use nur-dev/llama-1.9B-kaz-instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nur-dev/llama-1.9B-kaz-instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/llama-1.9B-kaz-instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nur-dev/llama-1.9B-kaz-instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/llama-1.9B-kaz-instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

LLaMA 1.9B Kazakh Instruct Model

This repository contains the LLaMA 1.9B model fine-tuned on a Kazakh-language dataset for instruction following. The model is trained to give helpful, relevant, and context-aware responses to prompts in Kazakh, and is particularly suited to answering questions, providing explanations, and assisting in educational and professional contexts.

The model ships with an integrated chat template that structures conversations into the input format the model expects. The tokenizer applies this template, so messages can be formatted before they are passed to the model.

The template follows this structure:

{%- if messages[0]['role'] == 'system' %}
    {%- set offset = 1 %}
{%- else %}
    {%- set offset = 0 %}
{%- endif %}
<|begin_of_text|>
{%- for message in messages %}
    {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' }}
{%- endfor %}
{{- '<|start_header_id|>' + 'көмекші' + '<|end_header_id|>\n\n' }}
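
To make the resulting prompt layout concrete, here is a plain-Python sketch of what the template produces. It assumes the Jinja whitespace control strips every newline between blocks, so the pieces are concatenated directly; `render_prompt` is a hypothetical helper written for illustration, and `tokenizer.apply_chat_template` remains the supported way to format input. Note that the template sets `offset` but never uses it, so the sketch ignores it.

```python
ASSISTANT_ROLE = "көмекші"  # role name hard-coded at the end of the template


def render_prompt(messages):
    """Mirror the chat template: one header/content/<|eot_id|> group per message."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip()  # Jinja's `trim` filter strips whitespace
            + "<|eot_id|>"
        )
    # The template always ends with an open assistant header for the model to continue
    parts.append("<|start_header_id|>" + ASSISTANT_ROLE + "<|end_header_id|>\n\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_prompt([{"role": "пайдаланушы", "content": "Сәлем!"}]))
```

Because the assistant header is appended unconditionally, the prompt always ends ready for generation, whether or not a generation-prompt flag is passed.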

Model Details

Model Name: LLaMA 1.9B Kazakh Instruct
Model ID: nur-dev/llama-1.9B-kaz-instruct
Parameters: 1.94 billion
Architecture: Causal Language Model (LLaMA)
Tokenizer: LLaMA tokenizer
Language: Kazakh
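
As a rough guide to hardware requirements, the parameter count can be turned into a weight-memory estimate. This is a back-of-the-envelope sketch, assuming 4 bytes per parameter for the F32 tensor type listed for the checkpoint and 2 bytes if the weights are cast to half precision; it covers weights only, not activations or the KV cache. `weight_memory_gib` is an illustrative helper.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


PARAMS = 1.94e9  # parameter count from the model details

fp32 = weight_memory_gib(PARAMS, 4)  # float32: 4 bytes per parameter
fp16 = weight_memory_gib(PARAMS, 2)  # float16 / bfloat16: 2 bytes per parameter
print(f"float32: {fp32:.2f} GiB, half precision: {fp16:.2f} GiB")
```

At full precision the weights come to a little over 7 GiB, which is worth knowing before choosing between CPU and a smaller GPU.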

Training Data

The model was fine-tuned on a dataset of 22,000 samples designed for instruction-based tasks. The dataset includes a diverse set of prompts and responses to help the model handle a wide range of topics, from everyday queries to specialized questions.

How to Use

Using the Model Directly for Inference

This example uses the LlamaForCausalLM and AutoTokenizer classes to load the model, format a conversation with the chat template, and generate a response with sampling parameters such as top_k, top_p, and temperature.

from transformers import LlamaForCausalLM, AutoTokenizer
import torch

# Load the model and tokenizer
model_directory = "nur-dev/llama-1.9B-kaz-instruct"
model = LlamaForCausalLM.from_pretrained(model_directory)
tokenizer = AutoTokenizer.from_pretrained(model_directory)

# Set the model to evaluation mode and move to appropriate device
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

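# Example conversation in Kazakh. Role names are Kazakh as well:
# "пайдаланушы" means "user".
conversation_history = [
    # System: "You are a reliable AI assistant that answers questions and provides information."
    {"role": "system", "content": "Сіз сұрақтарға жауап беріп, ақпарат ұсынатын сенімді AI көмекшісісіз."},
    # User: "What changes can artificial intelligence bring to healthcare?"
    {"role": "пайдаланушы", "content": "Жасанды интеллект денсаулық сақтау саласына қандай өзгерістер енгізе алады?"}
]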

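# Format the conversation with the tokenizer's built-in chat template
formatted_conversation = tokenizer.apply_chat_template(conversation_history, tokenize=False)

# Tokenize input. The template already emits <|begin_of_text|>, so skip the
# tokenizer's own special tokens to avoid a possible duplicate BOS token.
input_ids = tokenizer.encode(formatted_conversation, add_special_tokens=False, return_tensors="pt").to(device)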

# Generate a response from the model
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=1000,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        no_repeat_ngram_size=2,
        do_sample=True,
        top_k=10,
        top_p=0.5,
        eos_token_id=tokenizer.eos_token_id,
        temperature=1.3
    )

# Decode and print the model's response
response = tokenizer.decode(output[0], skip_special_tokens=False)
print(response)
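
Because the output is decoded with `skip_special_tokens=False`, the printed string contains the whole prompt plus the special tokens. The sketch below shows one way to pull out only the assistant's reply. It assumes the decoded text reproduces the assistant header from the chat template exactly; `extract_reply` is a hypothetical helper, not part of Transformers.

```python
ASSISTANT_HEADER = "<|start_header_id|>көмекші<|end_header_id|>\n\n"
END_OF_TURN = "<|eot_id|>"


def extract_reply(decoded):
    """Return the text after the last assistant header, up to the end-of-turn token."""
    _, found, tail = decoded.rpartition(ASSISTANT_HEADER)
    if not found:
        # Header not present: fall back to the full text
        return decoded.strip()
    reply, _, _ = tail.partition(END_OF_TURN)
    return reply.strip()
```

With the example above, `print(extract_reply(response))` would show only the generated answer. An alternative that avoids string matching is to decode only the tokens after the prompt, `output[0][input_ids.shape[1]:]`.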

Using the Pipeline for Text Generation

The pipeline API abstracts much of the setup, so you can generate responses with less boilerplate. In this example the system prompt sets up a reliable AI assistant, and the user asks how artificial intelligence could change healthcare.

from transformers import pipeline

# Initialize the text generation pipeline
pipe = pipeline("text-generation", model="nur-dev/llama-1.9B-kaz-instruct")

# Define the conversation messages
messages = [
    {"role": "system", "content": "Сіз сұрақтарға жауап беріп, ақпарат ұсынатын сенімді AI көмекшісісіз."},
    {"role": "пайдаланушы", "content": "Жасанды интеллект денсаулық сақтау саласына қандай өзгерістер енгізе алады?"}
]

response = pipe(messages, max_new_tokens=128)[0]['generated_text']

print(response)
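
When the pipeline receives a list of chat messages, recent Transformers versions return `generated_text` as the full conversation with the new reply appended as the last message, rather than as a plain string. The following sketch handles both shapes; `last_reply` is an illustrative helper, and the exact return format is an assumption that may differ across library versions.

```python
def last_reply(generated_text):
    """Return the model's reply from a pipeline `generated_text` value."""
    if isinstance(generated_text, str):
        # Plain-string prompts yield a plain string
        return generated_text
    # Chat input yields a list of {"role", "content"} dicts; the reply is last
    return generated_text[-1]["content"]
```

Applied to the example above, `print(last_reply(response))` prints only the assistant's answer.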

@misc{nurgali_kadyrbek_2024,
    author    = { {NURGALI Kadyrbek} },
    title     = { llama-1.9B-kaz-instruct (Revision 4059a4e) },
    year      = 2024,
    url       = { https://huggingface.co/nur-dev/llama-1.9B-kaz-instruct },
    doi       = { 10.57967/hf/3114 },
    publisher = { Hugging Face }
}

Model size: 2B params (Safetensors)
Tensor type: F32
Base model: nur-dev/llama-1.9B-kaz (this model is a fine-tune of it)