---
license: llama3
datasets:
- mzbac/glaive-function-calling-v2-llama-3-format
language:
- en
---
# Model
This model is a fine-tune of meta-llama/Meta-Llama-3-8B-Instruct, trained with mlx-lm.

**Note:** The glaive-function-calling-v2 dataset contains some invalid JSON and uses single quotes around some argument values. I have retrained the model on cleaned-up data. If you run into problems with the function-calling JSON format, try the new version: https://huggingface.co/mzbac/llama-3-8B-Instruct-function-calling-v0.2
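
To illustrate the kind of data problem described in the note, here is a minimal sketch that normalizes an arguments string to valid JSON. It is not the cleanup script used for the retrained model; the function name `normalize_arguments` and the fallback to `ast.literal_eval` are assumptions made for this example.

```python
import ast
import json


def normalize_arguments(raw):
    """Return `raw` re-serialized as valid JSON, or None if it cannot be recovered.

    Hypothetical helper: it first tries strict JSON, then falls back to
    parsing a Python-style literal (which tolerates single quotes).
    """
    try:
        return json.dumps(json.loads(raw))
    except json.JSONDecodeError:
        pass
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return None
    try:
        return json.dumps(value)
    except TypeError:
        # The literal parsed, but holds something JSON cannot represent (e.g. a set).
        return None


print(normalize_arguments("{'city': 'Melbourne'}"))  # {"city": "Melbourne"}
print(normalize_arguments("{city: Melbourne"))       # None
```

Strings that neither parser accepts are returned as `None` so they can be dropped or inspected by hand rather than silently kept.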
## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mzbac/llama-3-8B-Instruct-function-calling"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Function definition that is passed to the model through the system prompt.
tool = {
    "name": "search_web",
    "description": "Perform a web search for a given search terms.",
    "parameter": {
        "type": "object",
        "properties": {
            "search_terms": {
                "type": "array",
                "items": {"type": "string"},
                "description": "The search queries for which the search is performed.",
                "required": True,
            }
        }
    },
}

messages = [
    {
        "role": "system",
        "content": f"You are a helpful assistant with access to the following functions. Use them if required - {str(tool)}",
    },
    {"role": "user", "content": "Today's news in Melbourne, just for your information, today is April 27, 2014."},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop on either the regular EOS token or Llama 3's end-of-turn token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.1,
)
response = outputs[0]
# Decodes the prompt together with the generated tokens.
print(tokenizer.decode(response))

# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
# You are a helpful assistant with access to the following functions. Use them if required - {'name':'search_web', 'description': 'Perform a web search for a given search terms.', 'parameter': {'type': 'object', 'properties': {'search_terms': {'type': 'array', 'items': {'type':'string'}, 'description': 'The search queries for which the search is performed.','required': True}}}}<|eot_id|><|start_header_id|>user<|end_header_id|>
# Today's news in Melbourne, just for your information, today is April 27, 2014.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
# <functioncall> {"name": "search_web", "arguments": '{"search_terms": ["Melbourne news", "April 27, 2014"]}'}<|eot_id|>
```
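
The sample output above wraps `arguments` in single quotes, so the whole `<functioncall>` payload is not valid JSON as it stands. The following is a minimal sketch of one way to extract the call from the decoded text. The helper name `parse_function_call` and the regular expression are assumptions based only on that sample, not an official parser for this model.

```python
import json
import re

# Matches a payload whose "arguments" value is a single-quoted JSON string,
# as in the sample output. Assumed format; adjust if your outputs differ.
_CALL_RE = re.compile(
    r"""\{\s*"name"\s*:\s*"(?P<name>[^"]+)"\s*,\s*"arguments"\s*:\s*'(?P<args>.*)'\s*\}""",
    re.DOTALL,
)


def parse_function_call(text):
    """Return (name, arguments) if `text` contains a function call, else None."""
    marker = "<functioncall>"
    start = text.find(marker)
    if start == -1:
        return None
    payload = text[start + len(marker):].replace("<|eot_id|>", "").strip()

    # First try strict JSON, in case the payload is already well formed.
    try:
        call = json.loads(payload)
        args = call["arguments"]
        if isinstance(args, str):
            args = json.loads(args)
        return call["name"], args
    except (json.JSONDecodeError, KeyError, TypeError):
        pass

    # Fall back to the single-quoted form shown in the sample output.
    match = _CALL_RE.search(payload)
    if match is None:
        return None
    try:
        return match.group("name"), json.loads(match.group("args"))
    except json.JSONDecodeError:
        return None


sample = (
    '<functioncall> {"name": "search_web", "arguments": '
    '\'{"search_terms": ["Melbourne news", "April 27, 2014"]}\'}<|eot_id|>'
)
print(parse_function_call(sample))
```

A return value of `None` means either the model answered in plain text or the payload could not be parsed, so the caller can decide whether to retry or show the raw response.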
## Training hyperparameters
lora_config.yaml
```yaml
# The path to the local model directory or Hugging Face repo.
model: "meta-llama/Meta-Llama-3-8B-Instruct"
# Whether or not to train (boolean)
train: true

# Directory with {train, valid, test}.jsonl files
data: "data"

# The PRNG seed
seed: 0

# Number of layers to fine-tune
lora_layers: 32

# Minibatch size.
batch_size: 1

# Iterations to train for.
iters: 6000

# Number of validation batches, -1 uses the entire validation set.
val_batches: 25

# Adam learning rate.
learning_rate: 1e-6

# Number of training steps between loss reporting.
steps_per_report: 10

# Number of training steps between validations.
steps_per_eval: 200

# Load path to resume training with the given adapter weights.
resume_adapter_file: null

# Save/load path for the trained adapter weights.
adapter_path: "adapters"

# Save the model every N iterations.
save_every: 1000

# Evaluate on the test set after training
test: false

# Number of test set batches, -1 uses the entire test set.
test_batches: 100

# Maximum sequence length.
max_seq_length: 8192

# Use gradient checkpointing to reduce memory use.
grad_checkpoint: false

# LoRA parameters can only be specified in a config file
lora_parameters:
  # The layer keys to apply LoRA to.
  # These will be applied for the last lora_layers
  keys: ['mlp.gate_proj', 'mlp.down_proj', 'self_attn.q_proj', 'mlp.up_proj', 'self_attn.o_proj','self_attn.v_proj', 'self_attn.k_proj']
  rank: 128
  alpha: 256
  scale: 10.0
  dropout: 0.05

# Schedule can only be specified in a config file, uncomment to use.
#lr_schedule:
#  name: cosine_decay
#  warmup: 100 # 0 for no warmup
#  warmup_init: 1e-7 # 0 if not specified
#  arguments: [1e-6, 1000, 1e-7] # passed to scheduler
```
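
For a rough sense of scale, the sketch below estimates how many trainable parameters the configuration above adds. It assumes each adapted linear layer gets two low-rank matrices of shapes `(in, rank)` and `(rank, out)` with no extra bias. The layer dimensions are taken from the public Llama 3 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128) and are not stated in this card.

```python
# Assumed Llama 3 8B dimensions (from the public model config, not from this card).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # 8 key/value heads x 128 head dimension

# Values from lora_config.yaml above.
RANK = 128
LORA_LAYERS = 32

# (input dim, output dim) of each linear layer listed under lora_parameters.keys.
SHAPES = {
    "self_attn.q_proj": (HIDDEN, HIDDEN),
    "self_attn.k_proj": (HIDDEN, KV_DIM),
    "self_attn.v_proj": (HIDDEN, KV_DIM),
    "self_attn.o_proj": (HIDDEN, HIDDEN),
    "mlp.gate_proj": (HIDDEN, INTERMEDIATE),
    "mlp.up_proj": (HIDDEN, INTERMEDIATE),
    "mlp.down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_dim, out_dim, rank):
    """Parameters in one adapter: an (in, rank) matrix plus a (rank, out) matrix."""
    return rank * (in_dim + out_dim)


per_layer = sum(lora_params(i, o, RANK) for i, o in SHAPES.values())
total = per_layer * LORA_LAYERS
print(f"{per_layer:,} per layer, {total:,} total")
# 10,485,760 per layer, 335,544,320 total
```

Under these assumptions the adapters hold roughly 335M trainable parameters, a sizeable fraction of the 8B base model, which follows from the high rank and from adapting every attention and MLP projection in all 32 layers.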