Florence-2 is a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks, developed by Microsoft.
Here, we will show you how to use Florence-2 in X-AnyLabeling to perform various vision tasks.
Let's get started!
Before you begin, make sure you have the following prerequisites installed:
Step 0: Download and install Miniconda from the official website.
Step 1: Create a new Conda environment with Python 3.10 or higher, and activate it:
conda create -n x-anylabeling-transformers python=3.10 -y
conda activate x-anylabeling-transformers
Step 2: Install required dependencies.
First, follow the instructions here to install both PyTorch and TorchVision dependencies.
Then, install the transformers package via:
pip install transformers
Finally, return to the installation guide (简体中文 | English) to install the remaining dependencies for X-AnyLabeling (v2.5.0+).
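Once the steps above are done, a quick check can confirm that the environment matches what this guide expects. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of X-AnyLabeling.

```python
import sys
from importlib import metadata

# Packages this guide asks you to install (Step 2).
REQUIRED_PACKAGES = ("torch", "torchvision", "transformers")


def installed_versions(packages=REQUIRED_PACKAGES):
    """Return {package: version or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Step 1 asks for Python 3.10 or higher.
python_ok = sys.version_info >= (3, 10)
report = installed_versions()

print(f"Python {sys.version.split()[0]}: {'OK' if python_ok else 'needs 3.10+'}")
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it inside the activated `x-anylabeling-transformers` environment; any line reporting `NOT INSTALLED` points back to the corresponding install step.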
Tip
We recommend that you download the model before loading it.
You can download the model from Hugging Face via:
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large-ft", torch_dtype=torch_dtype, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large-ft", trust_remote_code=True)
The model will be automatically downloaded and cached in the local Hugging Face cache directory (by default ~/.cache/huggingface/hub).
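To see where the downloaded files are likely to be stored, the sketch below resolves the Hugging Face hub cache location from the `HF_HUB_CACHE` and `HF_HOME` environment variables. It is an approximation under stated assumptions: it ignores other overrides such as `XDG_CACHE_HOME`, and the helper names `hub_cache_dir` and `cached_model_dir` are hypothetical.

```python
import os
from pathlib import Path

MODEL_ID = "microsoft/Florence-2-large-ft"


def hub_cache_dir(env=os.environ):
    """Best-effort guess of the Hugging Face hub cache directory."""
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    # Assumed default when HF_HOME is not set: ~/.cache/huggingface
    hf_home = env.get("HF_HOME") or (Path.home() / ".cache" / "huggingface")
    return Path(hf_home) / "hub"


def cached_model_dir(repo_id, env=os.environ):
    """Folder where a model repo is cached, e.g. models--org--name."""
    return hub_cache_dir(env) / ("models--" + repo_id.replace("/", "--"))


model_dir = cached_model_dir(MODEL_ID)
print(f"Expected cache folder: {model_dir}")
print(f"Already downloaded: {model_dir.exists()}")
```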
If you have an unstable network connection, you can:
- Manually download the model files
- Update the model_path parameter in the configuration file
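The second option can be sketched as a small text edit. This example assumes the configuration file is YAML-like with a top-level `model_path:` line; the sample keys, the local path, and the helper `set_model_path` are hypothetical, so check them against your actual configuration file.

```python
import re


def set_model_path(config_text: str, new_path: str) -> str:
    """Replace the value of the top-level `model_path` key in YAML-like text."""
    pattern = re.compile(r"^model_path:.*$", flags=re.MULTILINE)
    if not pattern.search(config_text):
        raise KeyError("no top-level 'model_path' entry found")
    # A function replacement avoids backslash escapes in Windows paths.
    return pattern.sub(lambda _: f"model_path: {new_path}", config_text, count=1)


# Hypothetical config content, for illustration only.
example_config = (
    "type: florence2\n"
    "model_path: microsoft/Florence-2-large-ft\n"
)

updated = set_model_path(example_config, "/data/models/Florence-2-large-ft")
print(updated)
```

In practice you would read the real configuration file, pass its text through the function, and write the result back, with `new_path` pointing at the folder holding the manually downloaded model files.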
The image-level captioning task includes three sub-tasks:
- Caption: Generate a concise caption for the image
- Detailed caption: Generate a detailed caption for the image
- More detailed caption: Generate a more detailed caption for the image
Florence2-caption.mp4
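X-AnyLabeling drives these sub-tasks from its interface. For reference, when Florence-2 is called directly through transformers, each sub-task is selected by a task token in the prompt. The sketch below follows the usage pattern from the Florence-2 model card; the function name `run_task` and its parameter list are assumptions made for this example, and `model`, `processor`, `device`, and `dtype` are expected to come from the loading snippet shown earlier.

```python
# Task tokens for the three captioning sub-tasks.
CAPTION_TASKS = {
    "Caption": "<CAPTION>",
    "Detailed caption": "<DETAILED_CAPTION>",
    "More detailed caption": "<MORE_DETAILED_CAPTION>",
}


def run_task(model, processor, image, task, device, dtype, text_input=""):
    """Run one Florence-2 task on a PIL image and return the parsed result."""
    prompt = task + text_input
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device, dtype)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    # Keep special tokens: the post-processor needs them to parse the output.
    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        raw_text, task=task, image_size=(image.width, image.height)
    )


for label, token in CAPTION_TASKS.items():
    print(f"{label}: {token}")
```

A call such as `run_task(model, processor, image, CAPTION_TASKS["Caption"], device, torch_dtype)` would then return the parsed caption for a loaded PIL image.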
The region-level tasks include:
- Object detection
- Region proposal
- Dense region caption
Note
The following tasks require additional box input.
- Region to category: Assign a category to the region
- Region to description: Generate a description for the region
- Region to segmentation: Generate a segmentation mask for the region
Florence2-region-based-task.mp4
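Inside X-AnyLabeling the box is drawn on the canvas. When Florence-2 is prompted directly, the box is passed as four location tokens appended to the task token. The sketch below shows one way to build such a prompt, assuming pixel coordinates are quantized into 1000 bins per axis; the helper names `box_to_loc_tokens` and `region_prompt` are hypothetical.

```python
import math

# Tasks that run on the whole image (no extra input).
IMAGE_REGION_TASKS = {
    "Object detection": "<OD>",
    "Region proposal": "<REGION_PROPOSAL>",
    "Dense region caption": "<DENSE_REGION_CAPTION>",
}

# Tasks that need an additional box input.
BOX_TASKS = {
    "Region to category": "<REGION_TO_CATEGORY>",
    "Region to description": "<REGION_TO_DESCRIPTION>",
    "Region to segmentation": "<REGION_TO_SEGMENTATION>",
}


def box_to_loc_tokens(box, image_width, image_height, bins=1000):
    """Convert a pixel box (x1, y1, x2, y2) into four <loc_N> tokens."""
    x1, y1, x2, y2 = box

    def quantize(value, size):
        # Scale to the bin range and clamp to valid indices 0..bins-1.
        index = math.floor(value * bins / size)
        return max(0, min(bins - 1, index))

    coords = (
        quantize(x1, image_width),
        quantize(y1, image_height),
        quantize(x2, image_width),
        quantize(y2, image_height),
    )
    return "".join(f"<loc_{c}>" for c in coords)


def region_prompt(task_name, box, image_width, image_height):
    """Build the full prompt for a box-based task."""
    return BOX_TASKS[task_name] + box_to_loc_tokens(box, image_width, image_height)


example = region_prompt("Region to category", (100, 50, 400, 300), 800, 600)
print(example)  # <REGION_TO_CATEGORY><loc_125><loc_83><loc_500><loc_500>
```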
Note
Both phrase grounding and open vocabulary detection tasks require additional text input.
- Caption to phrase grounding
Florence2-caption-to-parse-grounding.mp4
- Referring expression segmentation
Florence2-referring-expression-segmentation.mp4
- Open vocabulary detection
- OCR
- OCR with region
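For reference, the text-driven tasks and the OCR tasks above correspond to the following Florence-2 task tokens when the model is prompted directly. In this sketch the text input is appended after the task token, while the OCR tasks take no extra input; the helper `build_prompt` is a hypothetical name introduced for illustration.

```python
# Tasks that require an additional text input.
TEXT_TASKS = {
    "Caption to phrase grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "Referring expression segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "Open vocabulary detection": "<OPEN_VOCABULARY_DETECTION>",
}

# OCR tasks run on the image alone.
OCR_TASKS = {
    "OCR": "<OCR>",
    "OCR with region": "<OCR_WITH_REGION>",
}


def build_prompt(task_name, text=""):
    """Return the prompt string for a text-driven or OCR task."""
    if task_name in TEXT_TASKS:
        cleaned = text.strip()
        if not cleaned:
            raise ValueError(f"'{task_name}' requires a text input")
        return TEXT_TASKS[task_name] + cleaned
    if task_name in OCR_TASKS:
        return OCR_TASKS[task_name]
    raise KeyError(f"unknown task: {task_name}")


print(build_prompt("Open vocabulary detection", "a green car"))
print(build_prompt("OCR with region"))
```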