MiniCPM-V
by
OpenBMB

Description: A Pocket-Sized MLLM for Ultra-Efficient Image and Video Understanding on Your Phone

View OpenBMB/MiniCPM-V on GitHub ↗

Summary Information

Updated 45 seconds ago
Added to GitGenius on May 21st, 2026
Created on January 29th, 2024
Open Issues & Pull Requests: 49 (+0)
Number of forks: 1,971
Total Stargazers: 25,186 (+0)
Total Subscribers: 162 (+0)

Issue Activity (beta)

Open issues: 28
New in 7 days: 4
Closed in 7 days: 2
Avg open age: 75 days
Stale 30+ days: 19
Stale 90+ days: 3

Recent activity

Opened in 7 days: 4
Closed in 7 days: 2
Comments in 7 days: 6
Events in 7 days: 14

Top labels

  • question (156)
  • feature (21)
  • Finetune (18)
  • inference (10)
  • documentation (5)
  • duplicate (3)
  • llamacpp (3)
  • SPECIAL ATTENTION (1)

Detailed Description

The openbmb/minicpm-v repository provides MiniCPM-V and MiniCPM-o, a series of pocket-sized multimodal large language models (MLLMs) designed for ultra-efficient image and video understanding, specifically optimized for deployment on mobile devices such as smartphones and tablets. The primary goal of these models is to deliver strong vision-language capabilities while maintaining low computational requirements, enabling real-time multimodal AI experiences directly on consumer hardware.

MiniCPM-V 4.6 is the latest and most efficient model in the MiniCPM-V family, featuring only 1.3 billion parameters. Despite its compact size, it outperforms larger models like Gemma4-E2B-it and demonstrates superior efficiency compared to smaller models such as Qwen3.5-0.8B, achieving approximately 1.5 times the token throughput. MiniCPM-V 4.6 leverages the intra-ViT early compression technique from LLaVA-UHD v4, which reduces visual encoding computation costs by over 50%. This model supports mixed 4x/16x visual token compression rates, allowing users to balance performance and efficiency according to their needs. Its architecture is based on SigLIP2-400M and Qwen3.5-0.8B, inheriting robust single-image, multi-image, and video understanding capabilities.

MiniCPM-V 4.6 is highly optimized for edge deployment, supporting iOS, Android, and HarmonyOS platforms. The repository provides open-source edge adaptation code, making it straightforward for developers to reproduce the on-device experience. The model is compatible with popular inference frameworks like SGLang, vLLM, llama.cpp, and Ollama, and supports fine-tuning ecosystems such as SWIFT and LLaMA-Factory. Multiple quantized variants are available in formats like GGUF, BNB, AWQ, and GPTQ, facilitating efficient customization and deployment on consumer-grade GPUs.

Performance-wise, MiniCPM-V 4.6 achieves a score of 13 on the Artificial Analysis Intelligence Index benchmark, outperforming Qwen3.5-0.8B and Ministral 3 3B, while using significantly fewer tokens. It demonstrates strong multimodal capabilities, surpassing Qwen3.5-0.8B on most vision-language tasks and reaching Qwen3.5 2B-level performance on benchmarks such as OpenCompass, RefCOCO, HallusionBench, MUIRBench, and OCRBench. The model’s efficiency is highlighted by its reduced visual encoding FLOPs and flexible compression rates, enabling high throughput and low latency even on mobile devices.

MiniCPM-o 4.5 extends the family toward real-time, end-to-end omnimodal interaction, supporting streaming video and audio inputs as well as text and speech outputs. With 9 billion parameters, it approaches the performance of Gemini 2.5 Flash in vision, speech, and full-duplex multimodal live streaming. MiniCPM-o 4.5 enables simultaneous input and output streams, allowing the model to see, listen, and speak in real-time conversations and perform proactive interactions.

The repository also offers comprehensive documentation, technical reports, and a cookbook for diverse user scenarios, along with a public API service for MiniCPM-V 4.6. Community support is available via Discord and Feishu, and the models are integrated with various open-source frameworks for easy adoption. Overall, openbmb/minicpm-v provides state-of-the-art, efficient, and versatile multimodal models for image, video, and real-time multimodal understanding, with a strong focus on mobile and edge deployment.

MiniCPM-V
by
OpenBMBOpenBMB/MiniCPM-V

Repository Details

Fetching additional details & charts...