Florence 2 Example

Introduction

Florence-2 is a vision foundation model developed by Microsoft. It uses a unified, prompt-based representation to handle a variety of computer vision and vision-language tasks.

This guide shows how to use Florence-2 in X-AnyLabeling to perform these tasks.

Let's get started!

Installation

Before you begin, make sure you have the following prerequisites installed:

Step 0: Download and install Miniconda from the official website.

Step 1: Create a new Conda environment with Python version 3.10 or higher, and activate it:

conda create -n x-anylabeling-transformers python=3.10 -y
conda activate x-anylabeling-transformers

Step 2: Install required dependencies.

First, follow the official PyTorch installation instructions to install PyTorch and TorchVision.

Then, install the transformers package via:

pip install transformers

Finally, return to the X-AnyLabeling installation guide (Simplified Chinese | English) and install the remaining dependencies for X-AnyLabeling (v2.5.0+).
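Once the steps above are complete, a quick sanity check can confirm that the environment matches the stated requirements. The sketch below uses only the standard library; the helper name check_environment is hypothetical and is not part of X-AnyLabeling.

```python
import sys
from importlib.util import find_spec

# Packages named in the installation steps above.
REQUIRED_PACKAGES = ("torch", "torchvision", "transformers")


def check_environment(min_python=(3, 10)):
    """Report whether the Python version and required packages are in place."""
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for name in REQUIRED_PACKAGES:
        # find_spec returns None when a top-level package is not installed.
        report[name] = find_spec(name) is not None
    return report


for key, ok in check_environment().items():
    print(f"{key}: {'OK' if ok else 'MISSING'}")
```

Any entry reported as MISSING points back to the corresponding installation step.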

Getting Started

Tip

We recommend that you download the model before loading it.

You can download the model from HuggingFace via:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Use the first GPU with half precision when CUDA is available;
# otherwise fall back to the CPU with full precision.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# trust_remote_code=True is required because Florence-2 ships its own model code.
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large-ft", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large-ft", trust_remote_code=True)

The model is downloaded automatically on first load and cached in the Hugging Face cache directory (~/.cache/huggingface/hub by default).

If you have an unstable network connection, you can:

Manually download the model files
Update the model_path parameter in the configuration file
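For the manual route, one option is to fetch the whole model repository into a folder of your choice with huggingface_hub, which is installed as a dependency of transformers. This is a minimal sketch: the helper name download_model and the destination folder are hypothetical, and the returned path is what you would then set as model_path.

```python
from pathlib import Path

# Repository ID taken from the loading snippet above.
REPO_ID = "microsoft/Florence-2-large-ft"


def download_model(local_dir, repo_id=REPO_ID):
    """Download all files of the model repository into local_dir and return its path."""
    # Imported lazily so the module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = Path(local_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=repo_id, local_dir=str(target))


# Example call (hypothetical destination folder, requires network access):
# model_dir = download_model("models/Florence-2-large-ft")
```

Re-running the call after an interrupted download should pick up the remaining files rather than starting from scratch.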

Image-level Captioning Tasks

The image-level captioning task includes three sub-tasks:

Caption: Generate a concise caption for the image
Detailed caption: Generate a detailed caption for the image
More detailed caption: Generate a more detailed caption for the image
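Outside the GUI, the three sub-tasks correspond to the task tokens listed on the Florence-2 model card. The sketch below follows the inference pattern from that model card and assumes model and processor were loaded as shown under Getting Started; generate_caption is a hypothetical helper name, not X-AnyLabeling's internal code.

```python
# Task tokens as listed on the Florence-2 model card.
CAPTION_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
}


def generate_caption(model, processor, image, level="caption", max_new_tokens=1024):
    """Run one captioning sub-task on a PIL image and return the caption text."""
    task = CAPTION_PROMPTS[level]
    inputs = processor(text=task, images=image, return_tensors="pt")
    # Move the tensors to the model's device and match its floating-point dtype.
    inputs = inputs.to(model.device, model.dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=max_new_tokens,
        num_beams=3,
    )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # The processor converts the raw generation into a {task_token: result} dict.
    parsed = processor.post_process_generation(
        text, task=task, image_size=(image.width, image.height)
    )
    return parsed[task]
```

With a PIL image opened via Image.open(...), calling generate_caption(model, processor, image, "detailed_caption") would return the detailed caption as a string.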

Florence2-caption.mp4

Region-level Tasks

The region-level tasks include:

Object detection

Region proposal

Dense region caption

Note

The following tasks require additional box input.

Region to category: Assign a category to the region
Region to description: Generate a description for the region
Region to segmentation: Generate a segmentation mask for the region
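When calling the model directly, the box input for these three tasks is passed as location tokens appended to the task token, as in the Florence-2 sample notebook. The sketch below assumes coordinates are quantized into 1000 bins per axis with floor rounding; that binning is an assumption about the processor's behavior, and the helper names are hypothetical.

```python
import math

# Task tokens as listed on the Florence-2 model card.
REGION_PROMPTS = {
    "region_to_category": "<REGION_TO_CATEGORY>",
    "region_to_description": "<REGION_TO_DESCRIPTION>",
    "region_to_segmentation": "<REGION_TO_SEGMENTATION>",
}

NUM_BINS = 1000  # assumed quantization resolution: <loc_0> .. <loc_999>


def box_to_loc_tokens(box, image_width, image_height):
    """Convert a pixel box (x1, y1, x2, y2) into four <loc_N> tokens."""
    x1, y1, x2, y2 = box

    def quantize(value, size):
        # Scale to the bin range, round down, and clamp to valid bin indices.
        bin_index = math.floor(value * NUM_BINS / size)
        return min(max(bin_index, 0), NUM_BINS - 1)

    coords = (
        quantize(x1, image_width),
        quantize(y1, image_height),
        quantize(x2, image_width),
        quantize(y2, image_height),
    )
    return "".join(f"<loc_{c}>" for c in coords)


def build_region_prompt(task, box, image_width, image_height):
    """Join the task token with the location tokens for the given box."""
    return REGION_PROMPTS[task] + box_to_loc_tokens(box, image_width, image_height)
```

In X-AnyLabeling the box is drawn in the GUI, so this conversion only matters when scripting the model yourself.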

Florence2-region-based-task.mp4

Phrase Grounding & OVD

Note

Both phrase grounding and open vocabulary detection tasks require additional text input.

Caption to phrase grounding

Florence2-caption-to-parse-grounding.mp4

Referring expression segmentation

Florence2-referring-expression-segmentation.mp4

Open vocabulary detection
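For the text-driven tasks in this section, the Florence-2 model card builds the prompt by appending the user's text directly to the task token. A minimal sketch of that step follows; build_text_prompt is a hypothetical helper, and rejecting empty text is a design choice made here rather than documented behavior.

```python
# Task tokens as listed on the Florence-2 model card.
TEXT_PROMPTS = {
    "caption_to_phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_expression_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "open_vocabulary_detection": "<OPEN_VOCABULARY_DETECTION>",
}


def build_text_prompt(task, text):
    """Return the task token followed by the user-supplied text."""
    text = text.strip()
    if not text:
        # These tasks have nothing to ground or detect without text input.
        raise ValueError(f"Task '{task}' requires a non-empty text input.")
    return TEXT_PROMPTS[task] + text
```

For example, build_text_prompt("open_vocabulary_detection", "a green car") yields the string that would be passed to the processor as its text argument.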

Optical Character Recognition

OCR

OCR with region
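According to the Florence-2 model card, the two OCR tasks use the tokens <OCR> and <OCR_WITH_REGION>, and the post-processed result of the latter holds a quad_boxes list (eight numbers per text region) alongside a labels list. The sketch below reshapes such a result into text and corner-point pairs; parse_ocr_with_region is a hypothetical helper, and the sample values are illustrative rather than real model output.

```python
def parse_ocr_with_region(result):
    """Pair each recognized text with the four corner points of its quad box."""
    payload = result["<OCR_WITH_REGION>"]
    regions = []
    for quad, label in zip(payload["quad_boxes"], payload["labels"]):
        # Each quad is a flat list [x1, y1, x2, y2, x3, y3, x4, y4].
        points = [(quad[i], quad[i + 1]) for i in range(0, 8, 2)]
        regions.append({"text": label, "points": points})
    return regions


# Illustrative input in the post-processed format (not real model output).
sample = {
    "<OCR_WITH_REGION>": {
        "quad_boxes": [[10, 20, 110, 20, 110, 60, 10, 60]],
        "labels": ["HELLO"],
    }
}
regions = parse_ocr_with_region(sample)
```

Each entry in regions carries the recognized text plus a four-point polygon, which is a convenient shape for drawing or exporting annotations.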
