Skip to content

Latest commit

 

History

History
144 lines (93 loc) · 5.97 KB

Vision_Inference.md

File metadata and controls

144 lines (93 loc) · 5.97 KB

本地推理 Phi-3-Vision

Phi-3-vision-128k-instruct 使 Phi-3 不仅能理解语言,还能视觉地看世界。通过 Phi-3-vision-128k-instruct,我们可以解决不同的视觉问题,如 OCR、表格分析、物体识别、描述图片等。我们可以轻松完成以前需要大量数据训练的任务。以下是 Phi-3-vision-128k-instruct 引用的相关技术和应用场景。

0. 准备工作

请确保在使用前已安装以下 Python 库(推荐使用 Python 3.10+)

pip install transformers -U
pip install datasets -U
pip install torch -U

推荐使用 CUDA 11.6+ 并安装 flatten

pip install flash-attn --no-build-isolation

创建一个新的 Notebook。为了完成示例,建议您首先创建以下内容。

from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

kwargs = {}
kwargs['torch_dtype'] = torch.bfloat16

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype="auto").cuda()

user_prompt = '<|user|>\n'
assistant_prompt = '<|assistant|>\n'
prompt_suffix = "<|end|>\n"

1. 使用 Phi-3-Vision 分析图像

我们希望 AI 能够分析我们图片的内容并给出相关描述。

prompt = f"{user_prompt}<|image_1|>\nCould you please introduce this stock to me?{prompt_suffix}{assistant_prompt}"


url = "https://g.foolcdn.com/editorial/images/767633/nvidiadatacenterrevenuefy2017tofy2024.png"

image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs, 
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(generate_ids, 
                                  skip_special_tokens=True, 
                                  clean_up_tokenization_spaces=False)[0]

我们可以通过在 Notebook 中执行以下脚本来获取相关答案。

Certainly! Nvidia Corporation is a global leader in advanced computing and artificial intelligence (AI). The company designs and develops graphics processing units (GPUs), which are specialized hardware accelerators used to process and render images and video. Nvidia's GPUs are widely used in professional visualization, data centers, and gaming. The company also provides software and services to enhance the capabilities of its GPUs. Nvidia's innovative technologies have applications in various industries, including automotive, healthcare, and entertainment. The company's stock is publicly traded and can be found on major stock exchanges.

2. 使用 Phi-3-Vision 进行 OCR

除了分析图像外,我们还可以从图像中提取信息。这是以前需要编写复杂代码来完成的 OCR 过程。

prompt = f"{user_prompt}<|image_1|>\nHelp me get the title and author information of this book?{prompt_suffix}{assistant_prompt}"

url = "https://marketplace.canva.com/EAFPHUaBrFc/1/0/1003w/canva-black-and-white-modern-alone-story-book-cover-QHBKwQnsgzs.jpg"

image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs, 
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(generate_ids, 
                                  skip_special_tokens=False, 
                                  clean_up_tokenization_spaces=False)[0]

结果是

The title of the book is "ALONE" and the author is Morgan Maxwell.

3. 多图像比较

Phi-3 Vision 支持多图像比较。我们可以使用这个模型来找出图像之间的差异。

prompt = f"{user_prompt}<|image_1|>\n<|image_2|>\n What is difference in this two images?{prompt_suffix}{assistant_prompt}"

print(f">>> Prompt\n{prompt}")

url = "https://hinhnen.ibongda.net/upload/wallpaper/doi-bong/2012/11/22/arsenal-wallpaper-free.jpg"

image_1 = Image.open(requests.get(url, stream=True).raw)

url = "https://assets-webp.khelnow.com/d7293de2fa93b29528da214253f1d8d0/news/uploads/2021/07/Arsenal-1024x576.jpg.webp"

image_2 = Image.open(requests.get(url, stream=True).raw)

images = [image_1, image_2]

inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs, 
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

结果是

The first image shows a group of soccer players from the Arsenal Football Club posing for a team photo with their trophies, while the second image shows a group of soccer players from the Arsenal Football Club celebrating a victory with a large crowd of fans in the background. The difference between the two images is the context in which the photos were taken, with the first image focusing on the team and their trophies, and the second image capturing a moment of celebration and victory.

免责声明: 本文档已使用基于机器的人工智能翻译服务进行翻译。尽管我们努力确保准确性,但请注意,自动翻译可能包含错误或不准确之处。应将原始文档的母语版本视为权威来源。对于关键信息,建议进行专业人工翻译。对于因使用此翻译而产生的任何误解或误读,我们不承担任何责任。