llava by haotian-liu

Description: [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

View haotian-liu/llava on GitHub ↗

Summary Information

Updated 1 hour ago
Added to GitGenius on July 18th, 2025
Created on April 17th, 2023
Open Issues/Pull Requests: 1,135 (+0)
Number of forks: 2,732
Total Stargazers: 24,478 (+1)
Total Subscribers: 156 (+0)
Detailed Description

LLaVA (Large Language and Vision Assistant) is an open-source, end-to-end multimodal model developed by researchers at the University of Wisconsin–Madison and Microsoft Research. It bridges the gap between Large Language Models (LLMs) and visual inputs, enabling a conversational AI system that can "see" and reason about images. The core innovation lies in its efficient training approach, which achieves strong performance with relatively limited visual instruction tuning data. Essentially, LLaVA takes a pre-trained LLM (such as Vicuna, Llama 2, or Mistral), connects it to a pre-trained vision encoder (such as CLIP), and then fine-tunes the combined system on a carefully curated dataset of image-text pairs.

The repository provides the code, weights, and instructions for training, evaluating, and deploying LLaVA. A key component is the "visual instruction tuning" dataset, which consists of roughly 158K instruction-following samples generated using GPT-4 (alongside about 558K filtered image-text pairs used for the feature-alignment pretraining stage). This dataset is crucial because it gives the model examples of how to respond to diverse visual prompts, ranging from simple object recognition to complex reasoning and detailed descriptions. The dataset generation process is itself a significant contribution: it leverages a powerful LLM to create high-quality training data without extensive manual annotation. The repository details the methodology used for dataset creation, including prompt engineering strategies that elicit informative and varied responses from GPT-4.
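To make the data format concrete, here is a minimal sketch of what a single visual instruction-tuning record looks like in the conversation-style JSON that LLaVA's training scripts consume. The field values (id, image path, answer text) are illustrative placeholders, not taken from the actual dataset:

```python
# Hypothetical visual instruction-tuning record; the "<image>" token marks
# where the projected visual tokens are spliced into the text sequence.
record = {
    "id": "000000123",
    "image": "coco/train2017/000000123.jpg",  # illustrative path
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this image?"},
        {"from": "gpt", "value": "A man is ironing clothes on the back of a moving taxi."},
    ],
}

# Turns alternate between the human prompt and the GPT-4-generated response,
# so a single record can also hold a multi-round dialogue about one image.
print(record["conversations"][0]["from"])  # human
```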

LLaVA’s architecture is relatively straightforward. It uses a projection layer to map the visual features extracted by the vision encoder (CLIP ViT-L/14 is commonly used) into the embedding space of the LLM. This allows the LLM to process visual information alongside text. During training, the model learns to align the visual and textual representations, enabling it to generate coherent and relevant responses to multimodal prompts. The repository supports different LLM backends, offering flexibility in terms of model size and performance. It also includes implementations for various training techniques, such as LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.

The repository offers comprehensive evaluation metrics and benchmarks. LLaVA is evaluated on a range of multimodal benchmarks, including Visual Question Answering (VQA), Document Visual Question Answering (DocVQA), and ScienceQA. The results demonstrate that LLaVA achieves competitive performance compared to other state-of-the-art multimodal models, often surpassing them with significantly fewer parameters. Furthermore, the repository provides tools for human evaluation, allowing users to assess the quality and relevance of the model's responses in a more subjective manner.

Beyond the core model and training pipeline, the repository includes a demo interface for interacting with LLaVA. This allows users to easily test the model's capabilities and explore its potential applications. The repository is actively maintained and updated, with ongoing contributions from the open-source community. It serves as a valuable resource for researchers and developers interested in building and deploying multimodal AI systems, and it represents a significant step towards creating more versatile and intelligent conversational agents that can understand and interact with the world around them.
