minicpm-v2 notebook (#2180)
eaidova authored Jul 10, 2024
1 parent a89d8ed commit 149c153
Showing 5 changed files with 2,030 additions and 0 deletions.
1 change: 1 addition & 0 deletions .ci/ignore_treon_docker.txt
@@ -70,3 +70,4 @@ notebooks/hunyuan-dit-image-generation/hunyuan-dit-image-generation.ipynb
notebooks/stable-diffusion-v3/stable-diffusion-v3.ipynb
notebooks/pixart/pixart.ipynb
notebooks/llm-rag-llamaindex/llm-rag-llamaindex.ipynb
notebooks/minicpm-v-multimodal-chatbot/minicpm-v-multimodal-chatbot.ipynb
8 changes: 8 additions & 0 deletions .ci/skipped_notebooks.yml
@@ -539,3 +539,11 @@
  skips:
    - device:
        - cpu
- notebook: notebooks/minicpm-v-multimodal-chatbot/minicpm-v-multimodal-chatbot.ipynb
  skips:
    - os:
        - macos-12
        - ubuntu-20.04
        - ubuntu-22.04
        - windows-2019

12 changes: 12 additions & 0 deletions .ci/spellcheck/.pyspelling.wordlist.txt
@@ -292,6 +292,7 @@ Gu
Gutendex
Hafner
HugginFaceH
HalBench
HandBrake
heatmap
HC
@@ -429,6 +430,7 @@ Markovian
Martyniuk
maskrcnn
mathbf
MathVista
MatMul
MBs
MediaPipe
@@ -445,6 +447,7 @@ MiniCPM
MiniLM
mistralai
MLS
MMB
mms
MMS
MLLM
@@ -513,6 +516,7 @@ OASST
OBB
obb
ocr
OCRBench
OCRv
odometry
OMZ
@@ -571,6 +575,7 @@ parsers
perceptron
Patil
PEFT
perceiver
performant
PersonaGPT
PGI
@@ -645,6 +650,7 @@ qwen
Qwen
Radiopaedia
Radosavovic
Raito
Raj
Ranftl
RASPP
@@ -669,6 +675,8 @@ reproducibility
rerank
Rerank
reranker
resampler
Resampler
rescale
rescaling
Rescaling
@@ -724,6 +732,7 @@ Shutterstock
siggraph
sigmoid
SigLIP
SigLip
siglip
SISR
SlimOrca
@@ -792,6 +801,7 @@ TartanAir
tbb
TensorBoard
tensorflow
TextVQA
tf
TFLite
tflite
@@ -827,6 +837,7 @@ tunable
tv
TypeScript
Udnie
UHD
UI
UIs
UINT
@@ -868,6 +879,7 @@ VegaRT
videpth
VIO
virtualenv
VisCPM
ViT
vit
vits
30 changes: 30 additions & 0 deletions notebooks/minicpm-v-multimodal-chatbot/README.md
@@ -0,0 +1,30 @@
# Visual-language assistant with MiniCPM-V2 and OpenVINO

MiniCPM-V 2.0 is a strong multimodal large language model for efficient end-side deployment. The model is built on SigLip-400M and MiniCPM-2.4B, connected by a perceiver resampler. MiniCPM-V 2.0 has several notable features:
* **Outperforming many popular models on many benchmarks** (including OCRBench, TextVQA, MME, MMB, MathVista, etc.). It has strong OCR capability, achieving performance comparable to Gemini Pro in scene-text understanding.
* **Trustworthy Behavior.** Multimodal LLMs are known to suffer from hallucination, often generating text that is not factually grounded in the image. MiniCPM-V 2.0 is the first end-side multimodal LLM aligned via multimodal RLHF for trustworthy behavior (using the recent [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] series of techniques). This allows the model to match GPT-4V in preventing hallucinations on Object HalBench.
* **High-Resolution Images at Any Aspect Ratio.** MiniCPM-V 2.0 accepts images of up to 1.8 million pixels (e.g., 1344x1344) at any aspect ratio. This enables better perception of fine-grained visual information such as small objects and optical characters, and is achieved via a recent technique from [LLaVA-UHD](https://arxiv.org/pdf/2403.11703).
* **High Efficiency.** For visual encoding, the model compresses the image representations into far fewer tokens via a perceiver resampler. This allows MiniCPM-V 2.0 to operate with favorable memory cost and speed during inference, even when dealing with high-resolution images.
* **Bilingual Support.** MiniCPM-V 2.0 provides strong multimodal capabilities in both English and Chinese. This is enabled by generalizing multimodal capabilities across languages, a technique from [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24].
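
To make the resolution figure in the list above concrete, the short Python sketch below checks the arithmetic behind "1.8 million pixels" and shows one way an image of arbitrary aspect ratio could be scaled to fit such a budget. The helper name `fit_to_pixel_budget` and the choice of `1344 * 1344` as the budget are illustrative assumptions; this is not the preprocessing used by MiniCPM-V or LLaVA-UHD.

```python
from __future__ import annotations

import math

# Assumption: the "1.8 million pixels" budget is taken to be 1344 * 1344,
# the example resolution quoted in the feature list.
PIXEL_BUDGET = 1344 * 1344


def fit_to_pixel_budget(width: int, height: int, budget: int = PIXEL_BUDGET) -> tuple[int, int]:
    """Scale (width, height) down to roughly `budget` pixels, keeping the aspect ratio.

    Hypothetical helper for illustration only; the model's own
    preprocessing may differ.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width * height <= budget:
        return width, height  # already within budget, leave untouched
    # Scaling both sides by sqrt(budget / area) scales the area to the budget.
    scale = math.sqrt(budget / (width * height))
    # Round down so the scaled size stays within the budget; keep >= 1 pixel per side.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


if __name__ == "__main__":
    print(PIXEL_BUDGET)
    print(fit_to_pixel_budget(4000, 1000))
```

Scaling both sides by the same factor is what preserves the aspect ratio; only the total pixel count is constrained.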

In this tutorial, we show how to convert and optimize the MiniCPM-V2 model to create a multimodal chatbot. Additionally, we demonstrate how to apply a stateful transformation to the LLM part and how to use model optimization techniques such as weight compression with [NNCF](https://github.com/openvinotoolkit/nncf).

## Notebook contents
The tutorial consists of the following steps:

- Install requirements
- Download PyTorch model
- Convert model to OpenVINO Intermediate Representation (IR)
- Compress Language Model weights
- Prepare Inference Pipeline
- Run OpenVINO model inference
- Launch Interactive demo
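
The "Compress Language Model weights" step relies on NNCF in the notebook. As a rough illustration of the underlying idea only, the sketch below applies symmetric 8-bit quantization to a small list of weights in plain Python. It is not NNCF's implementation or API, and the function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
from __future__ import annotations


def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map float weights to integers in [-127, 127] plus one shared scale.

    Illustrative only: real weight-compression tools typically work per
    channel or per group and support lower bit widths as well.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Avoid division by zero when all weights are zero (or the list is empty).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    original = [0.5, -1.0, 0.25, 0.0]
    q, s = quantize_int8(original)
    restored = dequantize_int8(q, s)
    # Each restored value differs from the original by at most about half a step.
    print(q, s, restored)
```

Storing one small integer per weight plus a shared scale is what reduces the memory footprint; the price is a rounding error of at most about half a quantization step per weight.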

In this demonstration, you'll create an interactive chatbot that can answer questions about the content of a provided image. The image below shows an example of the model's output.
![](https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/7b0919ea-6fe4-4c8f-8395-cb0ee6e87937)


## Installation instructions
This is a self-contained example that relies solely on its own code.<br>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to the [Installation Guide](../../README.md).