| --- |
| language: |
| - en |
| tags: |
| - llava |
| - multimodal |
| - qwen |
| license: apache-2.0 |
| --- |
| # nanoLLaVA - Sub 1B Vision-Language Model |
|
|
| <p align="center"> |
| <img src="https://i.postimg.cc/d15k3YNG/nanollava.webp" alt="Logo" width="350"> |
| </p> |
|
|
| ## Description |
| nanoLLaVA is a "small but mighty" 1B vision-language model designed to run efficiently on edge devices. |
| - **Base LLM**: [Quyen-SE-v0.1](https://huggingface.co/vilm/Quyen-SE-v0.1) (Qwen1.5-0.5B) |
| - **Vision Encoder**: [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) |
|
|
| | Benchmark | **VQA v2** | **TextVQA** | **ScienceQA** | **POPE** | **MMMU (Test)** | **MMMU (Eval)** | **GQA** | **MM-Vet** | |
| |---------|--------|---------|-----------|------|-------------|-------------|------|--------| |
| | Score | 70.84 | 46.71 | 58.97 | 84.1 | 28.6 | 30.4 | 54.79| 23.9 | |
|
|
| ## Training Data |
| The training data will be released later, as I am still writing a paper on this model. Expect the final version to be much more powerful than the current one. |
|
|
| ## Finetuning Code |
| Coming soon! |
|
|
| ## Usage |
| You can use the model with `transformers` by running the following script: |
|
|
| ```bash |
| pip install -U transformers accelerate flash_attn |
| ``` |
|
|
| ```python |
| import torch |
| import transformers |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| from PIL import Image |
| import warnings |
| |
| # disable some warnings |
| transformers.logging.set_verbosity_error() |
| transformers.logging.disable_progress_bar() |
| warnings.filterwarnings('ignore') |
| |
| # set device |
| torch.set_default_device('cuda') # or 'cpu' |
| |
| # create model |
| model = AutoModelForCausalLM.from_pretrained( |
| 'qnguyen3/nanoLLaVA', |
| torch_dtype=torch.float16, |
| device_map='auto', |
| trust_remote_code=True) |
| tokenizer = AutoTokenizer.from_pretrained( |
| 'qnguyen3/nanoLLaVA', |
| trust_remote_code=True) |
| |
| # text prompt |
| prompt = 'Describe this image in detail' |
| |
| messages = [ |
| {"role": "user", "content": f'<image>\n{prompt}'} |
| ] |
| text = tokenizer.apply_chat_template( |
| messages, |
| tokenize=False, |
| add_generation_prompt=True |
| ) |
| |
| print(text) |
| |
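| # split the templated prompt at the <image> placeholder and tokenize each half |
| text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')] |
| # join the halves with -200, the placeholder id that marks the image position |
| input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0) |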
| |
| # load the image (sample images can be found in the images folder) |
| image = Image.open('/path/to/image.png') |
| image_tensor = model.process_images([image], model.config).to(dtype=model.dtype) |
| |
| # generate |
| output_ids = model.generate( |
| input_ids, |
| images=image_tensor, |
| max_new_tokens=2048, |
| use_cache=True)[0] |
| |
| print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()) |
| ``` |
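
The `input_ids` step in the script above splits the templated prompt at `<image>` and places the placeholder id `-200` between the tokenized pieces. The following is a minimal sketch of that step as a standalone function. The names `build_input_ids`, `IMAGE_TOKEN_INDEX`, and `toy_tokenize` are hypothetical and introduced only for illustration; they are not part of the model's remote code. The sketch also accepts more than one `<image>` marker, which is an assumption: this card only demonstrates a single image per prompt.

```python
from typing import Callable, List

IMAGE_TOKEN_INDEX = -200  # placeholder id taken from the usage script above
IMAGE_PLACEHOLDER = "<image>"


def build_input_ids(text: str, tokenize: Callable[[str], List[int]]) -> List[int]:
    """Tokenize `text`, inserting IMAGE_TOKEN_INDEX wherever <image> appears."""
    chunks = [tokenize(chunk) for chunk in text.split(IMAGE_PLACEHOLDER)]
    input_ids: List[int] = list(chunks[0])
    for chunk in chunks[1:]:
        # One placeholder id per <image> marker, then the text that follows it.
        input_ids.append(IMAGE_TOKEN_INDEX)
        input_ids.extend(chunk)
    return input_ids


def toy_tokenize(s: str) -> List[int]:
    """Stand-in tokenizer for this demo only: one id per character."""
    return [ord(c) for c in s]


ids = build_input_ids("ab<image>\ncd", toy_tokenize)
print(ids)  # [97, 98, -200, 10, 99, 100]
```

With the real tokenizer, the same function could be called as `build_input_ids(text, lambda s: tokenizer(s).input_ids)` and the result wrapped in `torch.tensor(..., dtype=torch.long).unsqueeze(0)`, as in the script above.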
|
|
| ## Prompt Format |
| The model follows the ChatML standard, but without a `\n` after `<|im_end|>`: |
| ``` |
| <|im_start|>system |
| Answer the question<|im_end|><|im_start|>user |
| <image> |
| What is the picture about?<|im_end|><|im_start|>assistant |
| ``` |
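
To make the format concrete, here is a minimal sketch that assembles the prompt string by hand. In practice `tokenizer.apply_chat_template` (used in the script above) should produce the prompt; the function name `format_chatml_no_newline` is hypothetical, and the trailing `\n` after `assistant` in the generation prompt is an assumption based on the usual ChatML layout.

```python
from typing import Dict, List


def format_chatml_no_newline(
    messages: List[Dict[str, str]], add_generation_prompt: bool = True
) -> str:
    """Render messages as ChatML turns with no newline after <|im_end|>."""
    # Each turn is "<|im_start|>{role}\n{content}<|im_end|>", joined directly.
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    )
    if add_generation_prompt:
        # Assumed: the assistant header ends with a newline before generation.
        text += "<|im_start|>assistant\n"
    return text


messages = [
    {"role": "system", "content": "Answer the question"},
    {"role": "user", "content": "<image>\nWhat is the picture about?"},
]
prompt = format_chatml_no_newline(messages)
print(prompt)
```

Comparing this string with the output of `tokenizer.apply_chat_template(..., tokenize=False)` is a quick way to check that a hand-built prompt matches what the model expects.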
|
|
| --- |
| | Image | Example | |
| |--------------------------------------|---------------------------------------------------------------------------------------------| |
| |  | **What is the text saying?** <br> "Small but mighty". <br>**How does the text correlate to the context of the image?** <br> The text seems to be a playful or humorous representation of a small but mighty figure, possibly a mouse or a mouse toy, holding a weightlifting bar. | |
| --- |
|
|
| The model is trained using a modified version of [Bunny](https://github.com/BAAI-DCAI/Bunny/tree/main/bunny). |