[OpenSora-HPCAI] PLLaVA Captioner #673

Open · wants to merge 9 commits into `master`
66 changes: 66 additions & 0 deletions examples/opensora_hpcai/tools/PLLaVA/.gitignore
@@ -0,0 +1,66 @@
# local #
tmp*/
cache/*
*/cache*/
tmp*.py
tmp*
*pickle
data/

# Zip Files/Packages #
*.7z
*.dmg
*.gz
*.iso
*.jar
*.rar
*.tar
*.zip

# Logs and databases #
*.log
*.sql
*.sqlite
.ipynb_checkpoints/
*.swp
*.vscode/
*.idea/
*.pyc
__pycache__
slurm*out

# OS files #
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db


.vim-arsync
scratch.norg
sync_to_red.sh

anno/
wandb/
logs/
accelerate_config/
*.pth
hf_*

# local folders
MODELS
DATAS
SAVED
EXPERIMENTS
REMOTE_HF
TEST

test_results
test_training
test_hdfs.py
magic_video_outputs/llava*
magic_video_outputs
pllava_video_outputs/
59 changes: 59 additions & 0 deletions examples/opensora_hpcai/tools/PLLaVA/README.md
@@ -0,0 +1,59 @@
# PLLaVA based on MindSpore

MindSpore implementation of
[PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning](https://arxiv.org/abs/2404.16994).

## Dependencies

- CANN: 8.0.RC2 or later
- Python: 3.9 or later
- MindSpore: 2.3.1

## Getting Started
### Downloading Pretrained Checkpoints

Please download the pretrained model from [llava-hf/llava-v1.6-vicuna-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf).
Place it under `./models` (the default location) or in a directory of your choice.
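
If you prefer to script the download, below is a minimal sketch using `huggingface_hub` (assuming it is installed; the target directory simply mirrors the default `./models` layout above):

```python
# Minimal download sketch; assumes `huggingface_hub` is installed.
# The local_dir below mirrors the default ./models layout described above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="llava-hf/llava-v1.6-vicuna-7b-hf",
    local_dir="./models/llava-v1.6-vicuna-7b-hf",
)
```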

### Requirements

Run the following command to install the required packages:
```bash
pip install -r requirements.txt
```

## Inference

To run inference, use `pllavarun.py`:

```bash
python pllavarun.py --video path_to_your_video
```

Example captions produced by the model are shown below:

| Video ID | Sample Frame | Caption |
|----------|----------------------------------|------------------------------------------------------------------------------------------------|
| -0J1SbgYLaw_1.mp4 | ![Sample Frame 1](example/1.png) | The image shows a person who appears to be a woman with a serious expression. She is wearing a dark top and has a necklace around her neck. There is a blurred background that suggests she might be in an indoor setting, possibly a room with a door or a window. The image is not high resolution, and there are no clear indications of what the video content might be. |
| -0og5HrzhpY_0.mp4 | ![Sample Image 2](example/2.png) | The image shows a collection of cake pans inside an oven. Each pan appears to be filled with a different color of batter, suggesting that they are being used to bake cakes with various flavors or decorative effects. The oven is likely preheating, as indicated by the light on the inside of the oven door. This scene is typical of a bakery or home kitchen where cakes are being prepared for baking. |
| -0UwLhziocc_1.mp4 | ![Sample Image 3](example/3.png) | The image shows two individuals, likely soldiers, engaged in a training exercise. The person on the left is holding a sign, which could be a training aid or a symbol of a specific task or objective. The person on the right is wearing a helmet and appears to be operating a piece of equipment, possibly a vehicle or a piece of machinery. The setting looks like a training ground or a military facility, and the focus seems to be on communication or a specific skill being demonstrated. |


## Benchmark

### Inference

To benchmark inference, use the video `-0J1SbgYLaw_1.mp4` under `./examples` and run:
```bash
python pllavarun.py --video ./examples/-0J1SbgYLaw_1.mp4 --benchmark
```

| Model | Context | Batch Size | Throughput (tokens/second) |
|-----------------------|---------------|------------|----------------------------|
| pllava-7b | D910*x1-MS2.3 | 1 | 9.66 |

> Context: {Ascend chip}-{number of NPUs}-{MindSpore version}.\
> Throughput (tokens/second): number of generated tokens per second.\
> We use the second round of inference as the benchmark result.
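
For reference, the throughput metric can be reproduced conceptually as the number of generated tokens divided by wall-clock generation time, measured on the second round (the first round serves as warm-up). The sketch below is illustrative only; `generate_caption` is a hypothetical helper, not part of `pllavarun.py`:

```python
# Illustrative sketch of the throughput metric: generated tokens per second,
# measured on the second inference round (the first round is treated as warm-up).
# `generate_caption` is a hypothetical callable returning generated token ids.
import time


def measure_throughput(generate_caption, video_path: str) -> float:
    generate_caption(video_path)  # round 1: warm-up, not measured

    start = time.perf_counter()
    token_ids = generate_caption(video_path)  # round 2: measured
    elapsed = time.perf_counter() - start
    return len(token_ids) / elapsed
```
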
Binary file not shown.
Binary file not shown.
Binary file not shown.
Empty file.
22 changes: 22 additions & 0 deletions examples/opensora_hpcai/tools/PLLaVA/models/pllava/__init__.py
@@ -0,0 +1,22 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from .modeling_pllava import (
    PLLAVA_PRETRAINED_MODEL_ARCHIVE_LIST,
    PllavaForConditionalGeneration,
    PllavaPreTrainedModel,
)
from .processing_pllava import PllavaProcessor
from .configuration_pllava import PllavaConfig
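
A hedged end-to-end sketch of how these exports are typically wired together for captioning. It assumes the working directory is `examples/opensora_hpcai/tools/PLLaVA`, a local checkpoint under `./models` as in the README, and mindnlp's Transformers-style `from_pretrained`/`generate` interface; the prompt template and frame sampling used by `pllavarun.py` are simplified here.

```python
# Hypothetical captioning sketch: checkpoint path, prompt, and frame loading are
# placeholders, not the actual pllavarun.py pipeline.
import numpy as np

from models.pllava import PllavaForConditionalGeneration, PllavaProcessor

ckpt = "./models/llava-v1.6-vicuna-7b-hf"  # assumed local checkpoint path
processor = PllavaProcessor.from_pretrained(ckpt)
model = PllavaForConditionalGeneration.from_pretrained(ckpt)

frames = [np.zeros((336, 336, 3), dtype=np.uint8)] * 16  # stand-in for sampled video frames
prompt = "USER: <image>\nDescribe the video in detail. ASSISTANT:"  # assumed prompt format
inputs = processor(text=prompt, images=frames, return_tensors="ms")  # "ms": MindSpore tensors (assumed)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```
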
@@ -0,0 +1,150 @@
# coding=utf-8
# Copyright 2023 Microsoft Research & University of Wisconsin-Madison and the HuggingFace Inc. team. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Llava model configuration"""

from mindnlp.transformers.configuration_utils import PretrainedConfig
from mindnlp.transformers import logging
from mindnlp.transformers.models.auto import CONFIG_MAPPING


logger = logging.get_logger(__name__)

PLLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"llava-hf/llava-v1.5-7b": "https://huggingface.co/llava-hf/llava-v1.5-7b/resolve/main/config.json",
}


class PllavaConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`PllavaForConditionalGeneration`]. It is used to
    instantiate a PLLaVA model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a configuration similar to that of the LLaVA-v1.5-7b backbone.

    e.g. [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf)

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vision_config (`LlavaVisionConfig`, *optional*):
            Custom vision config or dict.
        text_config (`Union[AutoConfig, dict]`, *optional*):
            The config object of the text backbone. Can be any of `LlamaConfig` or `MistralConfig`.
        ignore_index (`int`, *optional*, defaults to -100):
            The ignore index for the loss function.
        image_token_index (`int`, *optional*, defaults to 32000):
            The image token index to encode the image prompt.
        projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
            The activation function used by the multimodal projector.
        vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
            The feature selection strategy used to select the vision feature from the CLIP backbone.
        vision_feature_layer (`int`, *optional*, defaults to -2):
            The index of the layer to select the vision feature.
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`~PllavaForConditionalGeneration`]
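        pooling_method (`str`, *optional*, defaults to `"avg"`):
            Pooling operator applied to the visual features, `"avg"` or `"max"`.
        pooling_shape (`tuple`, *optional*, defaults to `(8, 16, 16)`):
            Target shape of the pooled video features.
        frame_shape (`tuple`, *optional*, defaults to `(24, 24)`):
            Patch grid of a single frame from the pretrained LLaVA-1.5 vision tower (336 / 14 = 24).
        num_frames (`int`, *optional*, defaults to 1):
            Number of video frames fed to the model.
        use_pooling (`bool`, *optional*, defaults to `True`):
            Whether to apply pooling to the visual features.
        gradient_checkpointing (`bool`, *optional*, defaults to `False`):
            Whether to enable gradient checkpointing; propagated to the text backbone config.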

    Example:

    ```python
    >>> from mindnlp.transformers import LlavaForConditionalGeneration, LlavaConfig, CLIPVisionConfig, LlamaConfig

    >>> # Initializing a CLIP-vision config
    >>> vision_config = CLIPVisionConfig()

    >>> # Initializing a Llama config
    >>> text_config = LlamaConfig()

    >>> # Initializing a Llava llava-1.5-7b style configuration
    >>> configuration = LlavaConfig(vision_config, text_config)

    >>> # Initializing a model from the llava-1.5-7b style configuration
    >>> model = LlavaForConditionalGeneration(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "llava"
    is_composition = False

    def __init__(
        self,
        vision_config=None,
        text_config=None,
        ignore_index=-100,
        image_token_index=32000,
        projector_hidden_act="gelu",
        vision_feature_select_strategy="default",
        vision_feature_layer=-2,
        vocab_size=32000,
        pooling_method="avg",
        pooling_shape=(8, 16, 16),
        frame_shape=(24, 24),  # patch grid of the LLaVA-1.5 pretrained vision tower (336 / 14 = 24)
        num_frames=1,  # LLaVA-1.5 is pretrained on single images
        use_pooling=True,
        gradient_checkpointing=False,
        **kwargs,
    ):
        self.ignore_index = ignore_index
        self.image_token_index = image_token_index
        self.projector_hidden_act = projector_hidden_act
        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.vision_feature_layer = vision_feature_layer
        self.vocab_size = vocab_size
        self.use_pooling = use_pooling
        self.gradient_checkpointing = gradient_checkpointing

        self.vision_config = vision_config

        self.pooling_method = pooling_method  # one of 'avg', 'max'
        self.pooling_shape = pooling_shape
        self.frame_shape = frame_shape
        self.num_frames = num_frames
        if isinstance(self.vision_config, dict):
            vision_config["model_type"] = (
                vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
            )
            self.vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
        elif vision_config is None:
            self.vision_config = CONFIG_MAPPING["clip_vision_model"](
                intermediate_size=4096,
                hidden_size=1024,
                patch_size=14,
                image_size=336,
                num_hidden_layers=24,
                num_attention_heads=16,
                vocab_size=32000,
                projection_dim=768,
            )
        self.vocab_size = self.vocab_size

        self.text_config = text_config

        if isinstance(self.text_config, dict):
            text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
            self.text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
            self.vocab_size = self.text_config.vocab_size
            self.text_config.gradient_checkpointing = self.gradient_checkpointing

        elif text_config is None:
            # TODO: delete flash_attention?
            tmp_config = {
                "_attn_implementation": "flash_attention_2",
                "gradient_checkpointing": self.gradient_checkpointing,
            }
            self.text_config = CONFIG_MAPPING["llama"](**tmp_config)
            self.text_config.gradient_checkpointing = self.gradient_checkpointing
            # self.text_config["_attn_implementation"] = "flash_attention_2"  # xl: temporary hard code

        super().__init__(**kwargs)
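
For reference, a minimal sketch of constructing a `PllavaConfig` with the pooling options above (the import path assumes the working directory is `examples/opensora_hpcai/tools/PLLaVA`; the values shown match the constructor defaults except `num_frames`, which is raised purely for illustration):

```python
# Illustrative sketch only: build a PllavaConfig with explicit pooling settings.
from models.pllava import PllavaConfig

config = PllavaConfig(
    pooling_method="avg",       # "avg" or "max"
    pooling_shape=(8, 16, 16),  # target shape of the pooled video features
    frame_shape=(24, 24),       # 336-px frames / 14-px patches = 24 x 24 grid
    num_frames=16,              # default is 1; 16 shown here for illustration
)
print(config.vision_config.image_size)  # 336 with the default CLIP ViT-L/14-336 settings
```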