[OpenSora-HPCAI] PLLaVA Captioner #673

Open · wants to merge 9 commits into `master`
66 changes: 66 additions & 0 deletions examples/opensora_hpcai/tools/PLLaVA/.gitignore
@@ -0,0 +1,66 @@
# local #
tmp*/
cache/*
*/cache*/
tmp*.py
tmp*
*pickle
data/

# Zip Files/Packages #
*.7z
*.dmg
*.gz
*.iso
*.jar
*.rar
*.tar
*.zip

# Logs and databases #
*.log
*.sql
*.sqlite
.ipynb_checkpoints/
*.swp
*.vscode/
*.idea/
*.pyc
__pycache__
slurm*out

# OS files #
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db


.vim-arsync
scratch.norg
sync_to_red.sh

anno/
wandb/
logs/
accelerate_config/
*.pth
hf_*

# local folders
MODELS
DATAS
SAVED
EXPERIMENTS
REMOTE_HF
TEST

test_results
test_training
test_hdfs.py
magic_video_outputs/llava*
magic_video_outputs
pllava_video_outputs/
59 changes: 59 additions & 0 deletions examples/opensora_hpcai/tools/PLLaVA/README.md
@@ -0,0 +1,59 @@
# PLLaVA based on MindSpore

MindSpore implementation of
[PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning](https://arxiv.org/abs/2404.16994).

## Dependencies

- CANN: 8.0.RC2 or later
- Python: 3.9 or later
- MindSpore: 2.3.1

## Getting Started
### Downloading Pretrained Checkpoints

Please download the pretrained model from [llava-hf/llava-v1.6-vicuna-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf).
Place it under `./models` (the default location) or in a directory of your choice.
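
If you prefer to script the download, below is a minimal sketch using `huggingface_hub` (assuming it is installed; the target directory simply mirrors the default `./models` layout above):

```python
# Minimal download sketch; assumes `huggingface_hub` is installed.
# The local_dir below mirrors the default ./models layout described above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="llava-hf/llava-v1.6-vicuna-7b-hf",
    local_dir="./models/llava-v1.6-vicuna-7b-hf",
)
```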

### Requirements

Run the following command to install the required packages:
```bash
pip install -r requirements.txt
```

## Inference

To run inference, use `pllavarun.py`:

```bash
python pllavarun.py --video path_to_your_video
```

Example captions produced by the model are shown below:

| Video ID | Sample Frame | Caption |
|----------|----------------------------------|------------------------------------------------------------------------------------------------|
| -0J1SbgYLaw_1.mp4 | ![Sample Frame 1](example/1.png) | The image shows a person who appears to be a woman with a serious expression. She is wearing a dark top and has a necklace around her neck. There is a blurred background that suggests she might be in an indoor setting, possibly a room with a door or a window. The image is not high resolution, and there are no clear indications of what the video content might be. |
| -0og5HrzhpY_0.mp4 | ![Sample Image 2](example/2.png) | The image shows a collection of cake pans inside an oven. Each pan appears to be filled with a different color of batter, suggesting that they are being used to bake cakes with various flavors or decorative effects. The oven is likely preheating, as indicated by the light on the inside of the oven door. This scene is typical of a bakery or home kitchen where cakes are being prepared for baking. |
| -0UwLhziocc_1.mp4 | ![Sample Image 3](example/3.png) | The image shows two individuals, likely soldiers, engaged in a training exercise. The person on the left is holding a sign, which could be a training aid or a symbol of a specific task or objective. The person on the right is wearing a helmet and appears to be operating a piece of equipment, possibly a vehicle or a piece of machinery. The setting looks like a training ground or a military facility, and the focus seems to be on communication or a specific skill being demonstrated. |


## Benchmark

### Inference

To benchmark inference, use the video `-0J1SbgYLaw_1.mp4` under `./examples` and run:
```bash
python pllavarun.py --video ./examples/-0J1SbgYLaw_1.mp4 --benchmark
```

| Model | Context | Batch Size | Throughput (tokens/second) |
|-----------------------|---------------|------------|----------------------------|
| pllava-7b | D910*x1-MS2.3 | 1 | 9.66 |

> Context: {Ascend chip}-{number of NPUs}-{MindSpore version}.\
> Throughput (tokens/second): number of generated tokens per second.\
> We use the second round of inference as the benchmark result.
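
For reference, the throughput metric can be reproduced conceptually as the number of generated tokens divided by wall-clock generation time, measured on the second round (the first round serves as warm-up). The sketch below is illustrative only; `generate_caption` is a hypothetical helper, not part of `pllavarun.py`:

```python
# Illustrative sketch of the throughput metric: generated tokens per second,
# measured on the second inference round (the first round is treated as warm-up).
# `generate_caption` is a hypothetical callable returning generated token ids.
import time


def measure_throughput(generate_caption, video_path: str) -> float:
    generate_caption(video_path)  # round 1: warm-up, not measured

    start = time.perf_counter()
    token_ids = generate_caption(video_path)  # round 2: measured
    elapsed = time.perf_counter() - start
    return len(token_ids) / elapsed
```
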
Binary file not shown.
Binary file not shown.
Binary file not shown.
Empty file.
22 changes: 22 additions & 0 deletions examples/opensora_hpcai/tools/PLLaVA/models/pllava/__init__.py
@@ -0,0 +1,22 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from .modeling_pllava import (
    PLLAVA_PRETRAINED_MODEL_ARCHIVE_LIST,
    PllavaForConditionalGeneration,
    PllavaPreTrainedModel,
)
from .processing_pllava import PllavaProcessor
from .configuration_pllava import PllavaConfig
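
A hedged end-to-end sketch of how these exports are typically wired together for captioning. It assumes the working directory is `examples/opensora_hpcai/tools/PLLaVA`, a local checkpoint under `./models` as in the README, and mindnlp's Transformers-style `from_pretrained`/`generate` interface; the prompt template and frame sampling used by `pllavarun.py` are simplified here.

```python
# Hypothetical captioning sketch: checkpoint path, prompt, and frame loading are
# placeholders, not the actual pllavarun.py pipeline.
import numpy as np

from models.pllava import PllavaForConditionalGeneration, PllavaProcessor

ckpt = "./models/llava-v1.6-vicuna-7b-hf"  # assumed local checkpoint path
processor = PllavaProcessor.from_pretrained(ckpt)
model = PllavaForConditionalGeneration.from_pretrained(ckpt)

frames = [np.zeros((336, 336, 3), dtype=np.uint8)] * 16  # stand-in for sampled video frames
prompt = "USER: <image>\nDescribe the video in detail. ASSISTANT:"  # assumed prompt format
inputs = processor(text=prompt, images=frames, return_tensors="ms")  # "ms": MindSpore tensors (assumed)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```
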
@@ -0,0 +1,150 @@
# coding=utf-8
# Copyright 2023 Microsoft Research & University of Wisconsin-Madison and the HuggingFace Inc. team. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Llava model configuration"""

from mindnlp.transformers.configuration_utils import PretrainedConfig
from mindnlp.transformers import logging
from mindnlp.transformers.models.auto import CONFIG_MAPPING


logger = logging.get_logger(__name__)

PLLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"llava-hf/llava-v1.5-7b": "https://huggingface.co/llava-hf/llava-v1.5-7b/resolve/main/config.json",
}


class PllavaConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`PllavaForConditionalGeneration`]. It is used to
    instantiate a PLLaVA model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a configuration similar to that of the LLaVA-v1.5-7b backbone.

    e.g. [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf)

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vision_config (`LlavaVisionConfig`, *optional*):
            Custom vision config or dict.
        text_config (`Union[AutoConfig, dict]`, *optional*):
            The config object of the text backbone. Can be any of `LlamaConfig` or `MistralConfig`.
        ignore_index (`int`, *optional*, defaults to -100):
            The ignore index for the loss function.
        image_token_index (`int`, *optional*, defaults to 32000):
            The image token index to encode the image prompt.
        projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
            The activation function used by the multimodal projector.
        vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
            The feature selection strategy used to select the vision feature from the CLIP backbone.
        vision_feature_layer (`int`, *optional*, defaults to -2):
            The index of the layer to select the vision feature.
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`~PllavaForConditionalGeneration`]
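        pooling_method (`str`, *optional*, defaults to `"avg"`):
            Pooling operator applied to the visual features, `"avg"` or `"max"`.
        pooling_shape (`tuple`, *optional*, defaults to `(8, 16, 16)`):
            Target shape of the pooled video features.
        frame_shape (`tuple`, *optional*, defaults to `(24, 24)`):
            Patch grid of a single frame from the pretrained LLaVA-1.5 vision tower (336 / 14 = 24).
        num_frames (`int`, *optional*, defaults to 1):
            Number of video frames fed to the model.
        use_pooling (`bool`, *optional*, defaults to `True`):
            Whether to apply pooling to the visual features.
        gradient_checkpointing (`bool`, *optional*, defaults to `False`):
            Whether to enable gradient checkpointing; propagated to the text backbone config.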

    Example:

    ```python
    >>> from mindnlp.transformers import LlavaForConditionalGeneration, LlavaConfig, CLIPVisionConfig, LlamaConfig

    >>> # Initializing a CLIP-vision config
    >>> vision_config = CLIPVisionConfig()

    >>> # Initializing a Llama config
    >>> text_config = LlamaConfig()

    >>> # Initializing a Llava llava-1.5-7b style configuration
    >>> configuration = LlavaConfig(vision_config, text_config)

    >>> # Initializing a model from the llava-1.5-7b style configuration
    >>> model = LlavaForConditionalGeneration(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "llava"
    is_composition = False

    def __init__(
        self,
        vision_config=None,
        text_config=None,
        ignore_index=-100,
        image_token_index=32000,
        projector_hidden_act="gelu",
        vision_feature_select_strategy="default",
        vision_feature_layer=-2,
        vocab_size=32000,
        pooling_method="avg",
        pooling_shape=(8, 16, 16),
        frame_shape=(24, 24),  # patch grid of the LLaVA-1.5 pretrained vision tower (336 / 14 = 24)
        num_frames=1,  # LLaVA-1.5 is pretrained on single images
        use_pooling=True,
        gradient_checkpointing=False,
        **kwargs,
    ):
        self.ignore_index = ignore_index
        self.image_token_index = image_token_index
        self.projector_hidden_act = projector_hidden_act
        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.vision_feature_layer = vision_feature_layer
        self.vocab_size = vocab_size
        self.use_pooling = use_pooling
        self.gradient_checkpointing = gradient_checkpointing

        self.vision_config = vision_config

        self.pooling_method = pooling_method  # one of 'avg', 'max'
        self.pooling_shape = pooling_shape
        self.frame_shape = frame_shape
        self.num_frames = num_frames
        if isinstance(self.vision_config, dict):
            vision_config["model_type"] = (
                vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
            )
            self.vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
        elif vision_config is None:
            self.vision_config = CONFIG_MAPPING["clip_vision_model"](
                intermediate_size=4096,
                hidden_size=1024,
                patch_size=14,
                image_size=336,
                num_hidden_layers=24,
                num_attention_heads=16,
                vocab_size=32000,
                projection_dim=768,
            )
        self.vocab_size = self.vocab_size

        self.text_config = text_config

        if isinstance(self.text_config, dict):
            text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
            self.text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
            self.vocab_size = self.text_config.vocab_size
            self.text_config.gradient_checkpointing = self.gradient_checkpointing

        elif text_config is None:
            # TODO: delete flash_attention?
            tmp_config = {
                "_attn_implementation": "flash_attention_2",
                "gradient_checkpointing": self.gradient_checkpointing,
            }
            self.text_config = CONFIG_MAPPING["llama"](**tmp_config)
            self.text_config.gradient_checkpointing = self.gradient_checkpointing
            # self.text_config["_attn_implementation"] = "flash_attention_2"  # xl: temporary hard code

        super().__init__(**kwargs)
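
For reference, a minimal sketch of constructing a `PllavaConfig` with the pooling options above (the import path assumes the working directory is `examples/opensora_hpcai/tools/PLLaVA`; the values shown match the constructor defaults except `num_frames`, which is raised purely for illustration):

```python
# Illustrative sketch only: build a PllavaConfig with explicit pooling settings.
from models.pllava import PllavaConfig

config = PllavaConfig(
    pooling_method="avg",       # "avg" or "max"
    pooling_shape=(8, 16, 16),  # target shape of the pooled video features
    frame_shape=(24, 24),       # 336-px frames / 14-px patches = 24 x 24 grid
    num_frames=16,              # default is 1; 16 shown here for illustration
)
print(config.vision_config.image_size)  # 336 with the default CLIP ViT-L/14-336 settings
```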