Florence 2 Example

Introduction

Florence-2 is a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks, developed by Microsoft.

This guide shows how to use Florence-2 in X-AnyLabeling to perform a range of vision tasks.

Let's get started!

Installation

Before you begin, make sure you have the following prerequisites installed:

Step 0: Download and install Miniconda from the official website.

Step 1: Create a new Conda environment with Python version 3.10 or higher, and activate it:

conda create -n x-anylabeling-transformers python=3.10 -y
conda activate x-anylabeling-transformers

Step 2: Install required dependencies.

First, follow the instructions here to install both PyTorch and TorchVision dependencies.

Then, install the transformers package via:

pip install transformers
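Before moving on, it can help to confirm that the environment is complete. The following is a minimal sketch, not part of X-AnyLabeling: the function name `check_environment` and the package list are illustrative assumptions based on the steps above.

```python
import importlib.util
import sys

# Packages installed in Step 2 (assumed import names).
REQUIRED_PACKAGES = ("torch", "torchvision", "transformers")


def check_environment(packages=REQUIRED_PACKAGES, min_python=(3, 10)):
    """Report whether the Python version and each package are available."""
    report = {"python_ok": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'MISSING'}")
```

Running the script inside the activated Conda environment prints one line per check, so a missing package is visible before X-AnyLabeling is launched.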

Finally, go back to the installation guide (Simplified Chinese | English) and install the necessary dependencies for X-AnyLabeling (v2.5.0+).

Getting Started

Tip

We recommend that you download the model before loading it.

You can download the model from Hugging Face by running:

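import torch
from transformers import AutoProcessor, AutoModelForCausalLM

# Use the first GPU in half precision when CUDA is available; otherwise use the CPU in full precision.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# The first call downloads the checkpoint. trust_remote_code=True lets transformers
# run the modeling code shipped in the model repository.
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large-ft", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large-ft", trust_remote_code=True)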

The model is downloaded automatically on first use and cached in the Hugging Face cache directory (by default ~/.cache/huggingface/hub), so later loads do not download it again.

If you have an unstable network connection, you can:

  1. Manually download the model files
  2. Update the model_path parameter in the configuration file
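One possible way to carry out the manual download is with the `huggingface_hub` package, which is installed alongside transformers. This is a sketch under assumptions: the target directory `models/Florence-2-large-ft` and the function name `download_florence2` are hypothetical, and the exact location of `model_path` depends on your X-AnyLabeling configuration file.

```python
from pathlib import Path

# Same checkpoint as in the loading snippet above.
REPO_ID = "microsoft/Florence-2-large-ft"


def download_florence2(local_dir="models/Florence-2-large-ft"):
    """Download all files of the checkpoint into local_dir and return its path."""
    # Imported lazily so this module can be loaded without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = Path(local_dir).expanduser().resolve()
    target.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target))
    return target


if __name__ == "__main__":
    path = download_florence2()
    # Hypothetical follow-up: copy this path into model_path in the config file.
    print(f"Set model_path to: {path}")
```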

Image-level Captioning Tasks

The image-level captioning task includes three sub-tasks:

  • Caption: Generate a concise caption for the image
  • Detailed caption: Generate a detailed caption for the image
  • More detailed caption: Generate a more detailed caption for the image
Florence2-caption.mp4
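Outside the GUI, the three caption levels correspond to the Florence-2 task prompts `<CAPTION>`, `<DETAILED_CAPTION>`, and `<MORE_DETAILED_CAPTION>`. The sketch below follows the usage pattern from the Florence-2 model card and assumes `model`, `processor`, `device`, and `torch_dtype` were created as in the Getting Started snippet. The function name `generate_caption` and the generation settings are illustrative, not X-AnyLabeling's internal implementation.

```python
# Mapping from the sub-task names above to Florence-2 task prompts.
CAPTION_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
}


def generate_caption(model, processor, image, level, device, torch_dtype):
    """Return the caption string for a PIL image at the requested level."""
    task = CAPTION_PROMPTS[level]
    # Tokenize the prompt and preprocess the image, then move both to the device.
    inputs = processor(text=task, images=image, return_tensors="pt").to(device, torch_dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3,
    )
    # Keep special tokens: the post-processing step parses them.
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        text, task=task, image_size=(image.width, image.height)
    )
    return parsed[task]
```

A call might look like `generate_caption(model, processor, image, "detailed_caption", device, torch_dtype)`, where `image` is an RGB `PIL.Image` that you have opened yourself.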

Region-level Tasks

The region-level tasks include:

  • Object detection

Florence2-od

  • Region proposal

Florence2-region-proposal

  • Dense region caption

Florence2-dense-region-caption
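For reference, these three tasks use the Florence-2 prompts `<OD>`, `<REGION_PROPOSAL>`, and `<DENSE_REGION_CAPTION>`, and the post-processed result is a dictionary with parallel `bboxes` and `labels` lists. The sketch below converts such a result into a list of rectangles. The output layout (`label`, `points`) is a simplified, hypothetical structure for illustration and is not X-AnyLabeling's actual annotation format.

```python
REGION_PROMPTS = {
    "object_detection": "<OD>",
    "region_proposal": "<REGION_PROPOSAL>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
}


def detections_to_rectangles(parsed, task="<OD>"):
    """Convert a post-processed Florence-2 result into simple rectangle records."""
    result = parsed[task]
    rectangles = []
    for (x1, y1, x2, y2), label in zip(result["bboxes"], result["labels"]):
        rectangles.append(
            {
                # Region proposals carry empty labels; dense captions carry short phrases.
                "label": label,
                # Top-left and bottom-right corners in pixel coordinates.
                "points": [[x1, y1], [x2, y2]],
            }
        )
    return rectangles


if __name__ == "__main__":
    # Hand-written sample in the documented result shape, not real model output.
    sample = {"<OD>": {"bboxes": [[10.0, 20.0, 110.0, 220.0]], "labels": ["car"]}}
    print(detections_to_rectangles(sample))
```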

Note

The following tasks require additional box input.

  • Region to category: Assign a category to the region
  • Region to description: Generate a description for the region
  • Region to segmentation: Generate a segmentation mask for the region
Florence2-region-based-task.mp4
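When calling the model directly, the box input for these tasks is passed as text in the form `<loc_x1><loc_y1><loc_x2><loc_y2>`, appended to the task prompt (`<REGION_TO_CATEGORY>`, `<REGION_TO_DESCRIPTION>`, or `<REGION_TO_SEGMENTATION>`). The sketch below assumes the coordinates are quantized into 1000 bins relative to the image size, which matches the published Florence-2 examples. The helper name `box_to_loc_tokens` is hypothetical, and X-AnyLabeling may perform this conversion differently.

```python
import math


def box_to_loc_tokens(box, image_width, image_height, bins=1000):
    """Encode a pixel-space box (x1, y1, x2, y2) as Florence-2 location tokens."""
    x1, y1, x2, y2 = box

    def quantize(value, size):
        # Map a pixel coordinate to a bin index and clamp it to [0, bins - 1].
        index = math.floor(value * bins / size)
        return max(0, min(bins - 1, index))

    coords = (
        quantize(x1, image_width),
        quantize(y1, image_height),
        quantize(x2, image_width),
        quantize(y2, image_height),
    )
    return "".join(f"<loc_{c}>" for c in coords)


if __name__ == "__main__":
    # A box on a 100x100 image, combined with a region-level task prompt.
    prompt = "<REGION_TO_CATEGORY>" + box_to_loc_tokens((50, 25, 75, 100), 100, 100)
    print(prompt)
```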

Phrase Grounding & OVD

Note

Both phrase grounding and open vocabulary detection tasks require additional text input.

  • Caption to phrase grounding
Florence2-caption-to-parse-grounding.mp4
  • Referring expression segmentation
Florence2-referring-expression-segmentation.mp4
  • Open vocabulary detection

Florence2-ovd
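At the model level, the text input for these tasks is appended directly to the task prompt, and the combined string is passed as the `text` argument of the processor. The sketch below shows that prompt construction. The dictionary keys and the function name `build_text_prompt` are illustrative assumptions; the task prompt strings themselves are the ones defined by Florence-2.

```python
TEXT_TASKS = {
    "caption_to_phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_expression_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "open_vocabulary_detection": "<OPEN_VOCABULARY_DETECTION>",
}


def build_text_prompt(task_name, text):
    """Combine a task prompt with the user-supplied text input."""
    text = text.strip()
    if not text:
        # These tasks are undefined without a phrase or caption to ground.
        raise ValueError("This task requires a non-empty text input.")
    return TEXT_TASKS[task_name] + text


if __name__ == "__main__":
    print(build_text_prompt("open_vocabulary_detection", "a green car"))
```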

Optical Character Recognition

  • OCR

Florence2-ocr

  • OCR with region

Florence2-ocr-with-region
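The two OCR modes correspond to the Florence-2 prompts `<OCR>` (plain text output) and `<OCR_WITH_REGION>` (text plus a quadrilateral per text line). As a rough sketch, the region result can be reshaped into four-point polygons as shown below. It assumes the post-processed result holds `quad_boxes` as flat lists of eight numbers with matching `labels`, and that labels may carry a leading `</s>` token, as seen in published Florence-2 examples. The function name and output layout are hypothetical.

```python
def quads_to_polygons(parsed, task="<OCR_WITH_REGION>"):
    """Convert flat quad boxes [x1, y1, ..., x4, y4] into lists of [x, y] points."""
    result = parsed[task]
    polygons = []
    for quad, text in zip(result["quad_boxes"], result["labels"]):
        # Pair up consecutive values into four corner points.
        points = [[quad[i], quad[i + 1]] for i in range(0, 8, 2)]
        # Assumption: strip an end-of-sequence token if the label contains one.
        clean_text = text.replace("</s>", "").strip()
        polygons.append({"text": clean_text, "points": points})
    return polygons


if __name__ == "__main__":
    # Hand-written sample in the assumed result shape, not real model output.
    sample = {
        "<OCR_WITH_REGION>": {
            "quad_boxes": [[0, 0, 40, 0, 40, 10, 0, 10]],
            "labels": ["</s>HELLO"],
        }
    }
    print(quads_to_polygons(sample))
```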