minicpm-v2 notebook (#2180)

openvinotoolkit · Jul 10, 2024 · 149c153
1 parent a89d8ed
commit 149c153

Showing 5 changed files with 2,030 additions and 0 deletions.
diff --git a/.ci/ignore_treon_docker.txt b/.ci/ignore_treon_docker.txt
@@ -70,3 +70,4 @@ notebooks/hunyuan-dit-image-generation/hunyuan-dit-image-generation.ipynb
 notebooks/stable-diffusion-v3/stable-diffusion-v3.ipynb
 notebooks/pixart/pixart.ipynb
 notebooks/llm-rag-llamaindex/llm-rag-llamaindex.ipynb
+notebooks/minicpm-v-multimodal-chatbot/minicpm-v-multimodal-chatbot.ipynb
diff --git a/.ci/skipped_notebooks.yml b/.ci/skipped_notebooks.yml
@@ -539,3 +539,11 @@
   skips:
     - device:
         - cpu
+- notebook: notebooks/minicpm-v-multimodal-chatbot/minicpm-v-multimodal-chatbot.ipynb
+  skips:
+    - os:
+        - macos-12
+        - ubuntu-20.04
+        - ubuntu-22.04
+        - windows-2019
+
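Note on the entry above: the code that consumes `skipped_notebooks.yml` is not part of this commit, so the following is only a minimal sketch of how such an entry might be evaluated. The helper `is_skipped`, the `runner` dictionary, and the matching rule (every key in a skip rule must match the runner) are assumptions made for illustration, not the repository's actual CI logic.

```python
# Hypothetical helper: the real CI logic is not shown in this commit.
NOTEBOOK = "notebooks/minicpm-v-multimodal-chatbot/minicpm-v-multimodal-chatbot.ipynb"

# The YAML entry added by this commit, written out as plain Python data.
ENTRIES = [
    {
        "notebook": NOTEBOOK,
        "skips": [
            {"os": ["macos-12", "ubuntu-20.04", "ubuntu-22.04", "windows-2019"]},
        ],
    }
]


def is_skipped(entries, notebook, runner):
    """Return True if any skip rule listed for `notebook` matches `runner`."""
    for entry in entries:
        if entry["notebook"] != notebook:
            continue
        for rule in entry.get("skips", []):
            # Assumed semantics: every key in a rule must match the runner.
            if all(runner.get(key) in allowed for key, allowed in rule.items()):
                return True
    return False


if __name__ == "__main__":
    print(is_skipped(ENTRIES, NOTEBOOK, {"os": "ubuntu-22.04", "device": "cpu"}))
```

Under these assumed semantics, a runner whose `os` value appears in the list is skipped regardless of device, because the rule names no `device` key.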
diff --git a/.ci/spellcheck/.pyspelling.wordlist.txt b/.ci/spellcheck/.pyspelling.wordlist.txt
@@ -292,6 +292,7 @@ Gu
 Gutendex
 Hafner
 HugginFaceH
+HalBench
 HandBrake
 heatmap
 HC
@@ -429,6 +430,7 @@ Markovian
 Martyniuk
 maskrcnn
 mathbf
+MathVista
 MatMul
 MBs
 MediaPipe
@@ -445,6 +447,7 @@ MiniCPM
 MiniLM
 mistralai
 MLS
+MMB
 mms
 MMS
 MLLM
@@ -513,6 +516,7 @@ OASST
 OBB
 obb
 ocr
+OCRBench
 OCRv
 odometry
 OMZ
@@ -571,6 +575,7 @@ parsers
 perceptron
 Patil
 PEFT
+perceiver
 performant
 PersonaGPT
 PGI
@@ -645,6 +650,7 @@ qwen
 Qwen
 Radiopaedia
 Radosavovic
+Raito
 Raj
 Ranftl
 RASPP
@@ -669,6 +675,8 @@ reproducibility
 rerank
 Rerank
 reranker
+resampler
+Resampler
 rescale
 rescaling
 Rescaling
@@ -724,6 +732,7 @@ Shutterstock
 siggraph
 sigmoid
 SigLIP
+SigLip
 siglip
 SISR
 SlimOrca
@@ -792,6 +801,7 @@ TartanAir
 tbb
 TensorBoard
 tensorflow
+TextVQA
 tf
 TFLite
 tflite
@@ -827,6 +837,7 @@ tunable
 tv
 TypeScript
 Udnie
+UHD
 UI
 UIs
 UINT
@@ -868,6 +879,7 @@ VegaRT
 videpth
 VIO
 virtualenv
+VisCPM
 ViT
 vit
 vits

diff --git a/notebooks/minicpm-v-multimodal-chatbot/README.md b/notebooks/minicpm-v-multimodal-chatbot/README.md
@@ -0,0 +1,30 @@
+# Visual-language assistant with MiniCPM-V2 and OpenVINO
+
+MiniCPM-V 2.0 is a strong multimodal large language model for efficient end-side deployment. The model is built on SigLip-400M and MiniCPM-2.4B, connected by a perceiver resampler. MiniCPM-V 2.0 has several notable features:
+* **Outperforming many popular models on many benchmarks** (including OCRBench, TextVQA, MME, MMB, MathVista, etc.). It also has strong OCR capability, achieving performance comparable to Gemini Pro in scene-text understanding.
+* **Trustworthy Behavior.** Multimodal LLMs are known to suffer from hallucination, often generating text that is not factually grounded in the images. MiniCPM-V 2.0 is the first end-side LLM aligned via multimodal RLHF for trustworthy behavior (using the recent [RLHF-V](https://rlhf-v.github.io/) [CVPR'24] series technique). This allows the model to match GPT-4V in preventing hallucinations on Object HalBench.
+* **High-Resolution Images at Any Aspect Ratio.** MiniCPM-V 2.0 can accept images of up to 1.8 million pixels (e.g., 1344x1344) at any aspect ratio. This enables better perception of fine-grained visual information such as small objects and optical characters, and is achieved via a recent technique from [LLaVA-UHD](https://arxiv.org/pdf/2403.11703).
+* **High Efficiency.** For visual encoding, the model compresses the image representations into far fewer tokens via a perceiver resampler. This allows MiniCPM-V 2.0 to operate with favorable memory cost and speed during inference, even when dealing with high-resolution images.
+* **Bilingual Support.** MiniCPM-V 2.0 has strong multimodal capabilities in both English and Chinese. This is enabled by generalizing multimodal capabilities across languages, a technique from [VisCPM](https://arxiv.org/abs/2308.12038) [ICLR'24].
+
+In this tutorial, we consider how to convert and optimize the MiniCPM-V2 model to create a multimodal chatbot. Additionally, we demonstrate how to apply the stateful transformation to the LLM part and how to use model optimization techniques such as weight compression with [NNCF](https://github.com/openvinotoolkit/nncf).
+
+## Notebook contents
+The tutorial consists of the following steps:
+
+- Install requirements
+- Download PyTorch model
+- Convert model to OpenVINO Intermediate Representation (IR)
+- Compress Language Model weights
+- Prepare Inference Pipeline
+- Run OpenVINO model inference
+- Launch Interactive demo
+
+In this demonstration, you'll create an interactive chatbot that can answer questions about the content of a provided image. The image below shows an example of the model's output.
+![](https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/7b0919ea-6fe4-4c8f-8395-cb0ee6e87937)
+
+
+## Installation instructions
+This is a self-contained example that relies solely on its own code.<br>
+We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
+For details, please refer to the [Installation Guide](../../README.md).
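Note on the "High-Resolution Images" bullet in the README above: the 1.8 million pixel figure follows from the 1344x1344 example (1344 * 1344 = 1,806,336). The sketch below only illustrates that arithmetic. The function `fit_to_budget` and the uniform downscaling it performs are assumptions made for illustration; this is not the model's actual preprocessing, which the README attributes to the LLaVA-UHD technique.

```python
import math

# Pixel budget taken from the README's 1344x1344 example.
MAX_PIXELS = 1344 * 1344  # 1,806,336, i.e. roughly 1.8 million pixels


def fit_to_budget(width, height, max_pixels=MAX_PIXELS):
    """Return (width, height), scaled down uniformly if the area exceeds max_pixels.

    Hypothetical helper: it keeps the aspect ratio and only shrinks the image.
    """
    if width * height <= max_pixels:
        return width, height
    # Scaling both sides by sqrt(budget / area) brings the area down to the budget.
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


if __name__ == "__main__":
    for size in [(1344, 1344), (2688, 2688), (4000, 1000)]:
        print(size, "->", fit_to_budget(*size))
```

An image at or below the budget is returned unchanged; a larger one is shrunk on both sides by the same factor, so a wide 4000x1000 input stays wide.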