llava by haotian-liu

Description: [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

View haotian-liu/llava on GitHub ↗

Summary Information

Updated 1 hour ago
Added to GitGenius on July 18th, 2025
Created on April 17th, 2023
Open Issues/Pull Requests: 1,135 (+0)
Number of forks: 2,732
Total Stargazers: 24,478 (+1)
Total Subscribers: 156 (+0)
Detailed Description

LLaVA (Large Language and Vision Assistant) is an open-source, end-to-end multimodal model developed by researchers at the University of Wisconsin–Madison and Microsoft Research. It bridges the gap between Large Language Models (LLMs) and visual inputs, enabling a conversational AI system that can "see" and reason about images. The core innovation lies in its efficient training approach, which achieves strong performance with relatively limited visual instruction tuning data. Essentially, LLaVA takes a pre-trained LLM (such as Vicuna, Llama 2, or Mistral), connects it to a pre-trained vision encoder (such as CLIP), and then fine-tunes the combined system on a carefully curated dataset of image-text pairs.

The repository provides the code, weights, and instructions for training, evaluating, and deploying LLaVA. A key component is the "visual instruction tuning" dataset, which consists of roughly 158K instruction-following samples generated using GPT-4 (alongside about 558K filtered image-text pairs used for the feature-alignment pretraining stage). This dataset is crucial because it gives the model examples of how to respond to diverse visual prompts, ranging from simple object recognition to complex reasoning and detailed descriptions. The dataset generation process is itself a significant contribution: it leverages a powerful LLM to create high-quality training data without extensive manual annotation. The repository details the methodology used for dataset creation, including prompt engineering strategies that elicit informative and varied responses from GPT-4.
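To make the data format concrete, here is a minimal sketch of what a single visual instruction-tuning record looks like in the conversation-style JSON that LLaVA's training scripts consume. The field values (id, image path, answer text) are illustrative placeholders, not taken from the actual dataset:

```python
# Hypothetical visual instruction-tuning record; the "<image>" token marks
# where the projected visual tokens are spliced into the text sequence.
record = {
    "id": "000000123",
    "image": "coco/train2017/000000123.jpg",  # illustrative path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this image?"},
        {"from": "gpt", "value": "A man is ironing clothes on the back of a moving taxi."},
    ],
}

# Turns alternate between the human prompt and the GPT-4-generated response,
# so a single record can also hold a multi-round dialogue about one image.
print(record["conversations"][0]["from"])  # human
```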

LLaVA’s architecture is relatively straightforward. It uses a projection layer to map the visual features extracted by the vision encoder (CLIP ViT-L/14 is commonly used) into the embedding space of the LLM. This allows the LLM to process visual information alongside text. During training, the model learns to align the visual and textual representations, enabling it to generate coherent and relevant responses to multimodal prompts. The repository supports different LLM backends, offering flexibility in terms of model size and performance. It also includes implementations for various training techniques, such as LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.

The repository offers comprehensive evaluation metrics and benchmarks. LLaVA is evaluated on a range of multimodal benchmarks, including Visual Question Answering (VQA), Document Visual Question Answering (DocVQA), and ScienceQA. The results demonstrate that LLaVA achieves competitive performance compared to other state-of-the-art multimodal models, often surpassing them with significantly fewer parameters. Furthermore, the repository provides tools for human evaluation, allowing users to assess the quality and relevance of the model's responses in a more subjective manner.

Beyond the core model and training pipeline, the repository includes a demo interface for interacting with LLaVA. This allows users to easily test the model's capabilities and explore its potential applications. The repository is actively maintained and updated, with ongoing contributions from the open-source community. It serves as a valuable resource for researchers and developers interested in building and deploying multimodal AI systems, and it represents a significant step towards creating more versatile and intelligent conversational agents that can understand and interact with the world around them.
