| license: cc-by-nc-sa-4.0 | |
| language: | |
| - en | |
| base_model: | |
| - lmms-lab/llava-onevision-qwen2-0.5b-ov | |
| pipeline_tag: video-text-to-text | |
| tags: | |
| - Action | |
| - Video | |
| - MQA | |
| - multimodal | |
| - MLLMs | |
| - LLaVAction | |
| metrics: | |
| - accuracy | |
| library_name: transformers | |
| # LLaVAction-0.5B | |
| <div align="center"> | |
| <h2>LLaVAction: evaluating and training multi-modal large language models for action recognition | |
| </h2> | |
| [Shaokai Ye](https://yeshaokai.github.io/)<sup>1**</sup> | |
| [Haozhe Qi](https://people.epfl.ch/haozhe.qi)<sup>1**</sup> | |
| [Alexander Mathis](https://mathislab.org/)<sup>1</sup><sup>†</sup> | |
| [Mackenzie Weygandt Mathis](https://www.mackenziemathislab.org/mackenziemathis)<sup>1</sup><sup>†</sup><sup>‡</sup> | |
| <sup>1</sup> EPFL | |
| <sup>**</sup> First authors <sup>†</sup> Senior Authors <sup>‡</sup> Corresponding Author | |
| \[[arXiv Paper](https://arxiv.org/abs/2503.18712)\] \[[Project Page](https://mmathislab.github.io/llavaction/)\] \[[GitHub Repo](https://github.com/AdaptiveMotorControlLab/LLaVAction)\] | |
| </div> | |
| ## Model Summary | |
| The LLaVAction-0.5B model is trained on EPIC-KITCHENS-100-MQA and is based on the Qwen2 language model, which has a context window of 32K tokens. | |
| - **Project Page**: [https://mmathislab.github.io/llavaction/](https://mmathislab.github.io/llavaction/) | |
| - **Paper**: For more details, see our [paper](https://arxiv.org/abs/2503.18712) | |
| - **Repository**: [https://github.com/AdaptiveMotorControlLab/LLaVAction](https://github.com/AdaptiveMotorControlLab/LLaVAction) | |
| - **Point of Contact**: [Mackenzie Mathis](https://people.epfl.ch/mackenzie.mathis) | |
| - **Languages**: English | |
| ## Usage | |
| ### Intended use | |
| The model was trained on EPIC-KITCHENS-100-MQA. It is intended for videos similar to those in EPIC-KITCHENS-100, i.e. egocentric recordings of kitchen activities. | |
| ### Generation | |
| We provide a simple generation example for using our model. For more details, refer to our [GitHub repository](https://github.com/AdaptiveMotorControlLab/LLaVAction). | |
| ```python | |
| # Install the package first, e.g. in a notebook cell: !pip install llavaction | |
| from llavaction.model.builder import load_pretrained_model | |
| from llavaction.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token | |
| from llavaction.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX | |
| from llavaction.conversation import conv_templates, SeparatorStyle | |
| from PIL import Image | |
| import requests | |
| import copy | |
| import torch | |
| import sys | |
| import warnings | |
| from decord import VideoReader, cpu | |
| import numpy as np | |
| warnings.filterwarnings("ignore") | |
| # Path to your video (an egocentric viewpoint is assumed) | |
| video_path = "XXXX" | |
| #These are the prompts we trained with, but you can test others: | |
| perspective_prompt = "You are seeing this video from egocentric view and you are the person. Your hands are sometimes interacting with objects. What action are you doing?" | |
| task_prompt = "Describe in details what you see from the video frames." | |
| def load_video(video_path, max_frames_num, fps=1, force_sample=False): | |
|     # Always return (frames, frame_time, video_time) so callers can unpack three values. | |
|     if max_frames_num == 0: | |
|         return np.zeros((1, 336, 336, 3)), [], 0.0 | |
|     vr = VideoReader(video_path, ctx=cpu(0), num_threads=1) | |
|     total_frame_num = len(vr) | |
|     video_time = total_frame_num / vr.get_avg_fps() | |
|     # Stride that keeps roughly `fps` frames per second; at least 1 to avoid a zero step. | |
|     step = max(1, round(vr.get_avg_fps() / fps)) | |
|     frame_idx = [i for i in range(0, len(vr), step)] | |
|     # Sample uniformly when there are too many frames or when sampling is forced. | |
|     if len(frame_idx) > max_frames_num or force_sample: | |
|         sample_fps = max_frames_num | |
|         uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int) | |
|         frame_idx = uniform_sampled_frames.tolist() | |
|     # Timestamp (in seconds) of each selected frame. | |
|     frame_time = [i / vr.get_avg_fps() for i in frame_idx] | |
|     spare_frames = vr.get_batch(frame_idx).asnumpy() | |
|     return spare_frames, frame_time, video_time | |
| pretrained = "MLAdaptiveIntelligence/LLaVAction-0.5B" | |
| model_name = "llava_qwen" | |
| device = "cuda" | |
| device_map = "auto" | |
| tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map) # Add any other thing you want to pass in llava_model_args | |
| model.eval() | |
| max_frames_num = 64 | |
| video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True) | |
| video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16) | |
| video = [video] | |
| conv_template = "qwen_1_5" # Make sure you use correct chat template for different models | |
| time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. " | |
| question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n{perspective_prompt} {task_prompt}" | |
| conv = copy.deepcopy(conv_templates[conv_template]) | |
| conv.append_message(conv.roles[0], question) | |
| conv.append_message(conv.roles[1], None) | |
| prompt_question = conv.get_prompt() | |
| input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device) | |
| cont = model.generate( | |
|     input_ids, | |
|     images=video, | |
|     modalities=["video"], | |
|     do_sample=False, | |
|     temperature=0, | |
|     max_new_tokens=4096, | |
| ) | |
| text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip() | |
| print(text_outputs) | |
| ``` | |
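The frame selection inside `load_video` can be checked without decoding a video. The sketch below reproduces only the index arithmetic, assuming the same stride-then-uniform-sampling logic as the example above. The helper name `sample_frame_indices` and the 300-frame, 30 fps clip are hypothetical and not part of the `llavaction` package.

```python
import numpy as np

def sample_frame_indices(total_frames, avg_fps, max_frames, target_fps=1.0, force_sample=False):
    """Hypothetical helper: pick frame indices the way load_video does, without reading a video."""
    # Stride that keeps roughly `target_fps` frames per second (guarded against a zero step).
    step = max(1, round(avg_fps / target_fps))
    frame_idx = list(range(0, total_frames, step))
    # Switch to uniform sampling when there are too many frames or when it is forced.
    if len(frame_idx) > max_frames or force_sample:
        frame_idx = np.linspace(0, total_frames - 1, max_frames, dtype=int).tolist()
    # Timestamp in seconds of each selected frame.
    frame_time = [i / avg_fps for i in frame_idx]
    return frame_idx, frame_time

# Hypothetical 10-second clip recorded at 30 fps (300 frames).
idx, times = sample_frame_indices(300, 30.0, max_frames=64, force_sample=True)
print(len(idx), idx[0], idx[-1])
```

With `force_sample=True`, as in the generation example, exactly `max_frames` indices are returned, spanning the first to the last frame. Without it, a short clip keeps about one frame per second.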
| ## Training | |
| See details in Ye et al. 2025: [arxiv.org/abs/2503.18712](https://arxiv.org/abs/2503.18712) | |
| ### Model | |
| - **Architecture**: SO400M + Qwen2 | |
| - **Initialized Model**: lmms-lab/llava-onevision-qwen2-0.5b-ov | |
| - **Data**: EPIC-KITCHENS-100-MQA, 2 epochs, full model | |
| - **Precision**: bfloat16 | |
| ### Hardware & Software | |
| - **GPUs**: 32 × NVIDIA GH200 (for whole model series training) | |
| - **Orchestration**: HuggingFace Trainer | |
| - **Neural networks**: PyTorch | |
| ## Citation | |
| arXiv: [arxiv.org/abs/2503.18712](https://arxiv.org/abs/2503.18712) | |
| ```bibtex | |
| @article{YeQi2025llavaction, | |
|   title={LLaVAction: evaluating and training multi-modal large language models for action recognition}, | |
|   author={Ye, Shaokai and Qi, Haozhe and Mathis, Alexander and Mathis, Mackenzie W.}, | |
|   journal={arXiv preprint arXiv:2503.18712}, | |
|   year={2025} | |
| } | |
| ``` |