TGI: export model if configuration is cached (#445)

* feat(cache): use one registry per optimum version
* feat(registry): use model_type as primary key

  This allows to identify cached configurations that can be applied to
  models that differ only by their weights, like
  meta-llama/Llama-2-7b-hf and meta-llama/Llama-2-7b-chat-hf.
  This also allows to lookup cached configurations for local model
  folders containing a model config.

* doc(cache): fix image link
* doc(cache): add cache lookup
* refactor(decoder): add get_export_config helper
* feat(tgi): export model if cached
* review: addressing code comments
* wip
* review: address doc comments
huggingface · Jan 30, 2024 · c114fc8
1 parent 0f7bf4a

Showing 11 changed files with 386 additions and 115 deletions.
diff --git a/docs/source/benchmarks/inferentia-llama2.mdx b/docs/source/benchmarks/inferentia-llama2.mdx
@@ -48,7 +48,7 @@ while 768 is more typical of a Retrieval Augmented Generation (RAG) use-case.
 
 Encoding time is expressed in **seconds**.
 
-![Llama2 inferentia2 encoding-time](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama2/encoding-times.png "Encoding time")
+![Llama2 inferentia2 encoding-time](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/benchmarks/inferentia-llama2/encoding_times.png "Encoding time")
 
 We can see that all deployed models exhibit excellent response times, even for long contexts.
 

diff --git a/docs/source/guides/cache_system.mdx b/docs/source/guides/cache_system.mdx
@@ -13,35 +13,111 @@ specific language governing permissions and limitations under the License.
 # Neuron Model Cache
 
 The Neuron Model Cache is a remote cache for compiled Neuron models in the `neff` format.
-It is integrated into the [`NeuronTrainer` and `NeuronModelForCausalLM`] classes to enable loading pretrained models from the cache instead of compiling them locally.
+It is integrated into the `NeuronTrainer` and `NeuronModelForCausalLM` classes to enable loading pretrained models from the cache instead of compiling them locally.
+
+**Note: it is not available for models exported with any other `NeuronModelXX` classes, which use a different export mechanism.**
 
 The Neuron Model Cache is hosted on the [Hugging Face Hub](https://huggingface.co/aws-neuron/optimum-neuron-cache) and includes compiled files for all popular and supported `optimum-neuron` pre-trained models.
 
-When loading a Transformers or Diffusion model, it needs to be compiled to neuron format with [`torch-neuronx`](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx),
-in order to run on Neuron platforms.
-The compilation produces several compilation files stored in a local directory, usually `/var/tmp/neuron-compile-cache`.
-This means that every time you train or export a model on a new host, you need to recompile it, which takes a lot of time.
+Before training a Transformers or Diffusion model or loading a NeuronModelForCausalLM on Neuron platforms, it needs to be exported to neuron format
+with [`torch-neuronx`](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx).
+
+When exporting a model, [`torch-neuronx`](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx) will:
+
+- convert it to a set of [XLA](https://github.com/pytorch/xla/) subgraphs,
+- compile each subgraph with the neuronx compiler into a Neuron Executable File Format (NEFF) binary file.
+
+The first step is relatively fast, but the compilation takes a lot of time.
+To avoid recompiling all NEFF files every time a model is loaded on a NeuronX host, [`torch-neuronx`](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx)
+ stores NEFF files in a local directory, usually `/var/tmp/neuron-compile-cache`.
+
+However, this local cache is not shared between platforms, which means that every time you train or export a model on a new host, you need to recompile it.
+
+We created the Neuron Model Cache to solve this limitation by providing a public repository of precompiled model graphs.
+
+Note: we also support the creation of a private, secured, remote model cache.
 
-We created the Neuron Model Cache to solve this limitation by providing a public cache of precompiled available models and a private cache to create your private, secured, remote model cache.
+## How to use the Neuron model cache
 
-## How the caching system works
+The public model cache will be used when you use the `NeuronTrainer` or `NeuronModelForCausalLM` classes. There are no additional changes needed.
 
-### Hash computation
+When exporting a model to neuron format, `optimum-neuron` will simply look for cached NEFF files in the hub repository during the compilation of the
+model subgraphs.
 
-Many factors can trigger compilation among which:
+If the NEFF files are cached, they will be fetched from the hub and directly loaded instead of being recompiled.
 
-- The input shapes,
-- The precision of the model, full-precision or bf16,
+## How caching works
+
+The Optimum Neuron Cache is built on top of the [NeuronX compiler cache](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html).
+
+It is important to understand that the cache operates on NEFF binaries, and not on the model itself.
+
+As explained previously, each model exported to Neuron using the `NeuronTrainer` or `NeuronModelForCausalLM` is composed of [XLA](https://github.com/pytorch/xla/) subgraphs.
+
+Each subgraph is unique, and results from the combination of:
+- the `transformers` or `transformers_neuronx` python modeling code,
+- the `transformers` model config,
+- the `input_shapes` selected during the export,
+- The precision of the model, full-precision, fp16 or bf16.
+
+When compiling a subgraph to a NEFF file, other parameters influence the result:
 - The version of the Neuron X compiler,
-- The number of Neuron cores used.
+- The number of Neuron cores used,
+- The compilation parameters (such as the optimization level).
+
+All these parameters are combined together to create a unique hash that identifies a NEFF file.
 
-These parameters are used to compute a hash that uniquely identifies each compilation file.
+This has two very important consequences:
+- it is only when actually exporting a model that the associated NEFF files can be identified,
+- even a small change in the model configuration will lead to a different set of NEFF files.
 
-**It is important to keep in mind that even a small change in the model configuration will trigger a recompilation.**
+It is therefore very difficult to know in advance if the NEFFs associated with a specific model configuration are cached.
+
+## Neuron model cache lookup (inferentia only)
+
+The neuron cache lookup is a feature allowing users to look for compatible cached model configurations before exporting
+a model for inference.
+
+It is based on a dedicated registry composed of stored cached configurations.
+
+Cached model configurations are stored as entries under a specific subfolder in the Neuron Model Cache:
+
+```
+neuronxcc-2.12.54.0+f631c2365
+├── 0_REGISTRY
+│   └── 0.0.18
+│       └── llama
+│           └── meta-llama
+│               └── Llama-2-7b-chat-hf
+│                   └── 54c1f6689cd88f246fce.json
+```
+
+Each entry corresponds to the combination of a model configuration and its export parameters: this is as close as we can get to
+uniquely identifying the exported model.
+
+You can use the `optimum-cli` to look up compatible cached entries by passing it a hub model_id or the path to a file
+containing a model `config.json`.
+
+```shell
+$ optimum-cli neuron cache lookup meta-llama/Llama-2-7b-chat-hf
+
+*** 1 entrie(s) found in cache for meta-llama/Llama-2-7b-chat-hf ***
+
+task: text-generation
+batch_size: 1
+num_cores: 24
+auto_cast_type: fp16
+sequence_length: 2048
+compiler_type: neuronx-cc
+compiler_version: 2.12.54.0+f631c2365
+checkpoint_id: meta-llama/Llama-2-7b-chat-hf
+checkpoint_revision: c1b0db933684edbfe29a06fa47eb19cc48025e93
+```
 
-### How to use the Neuron model cache
+**Note that even if compatible cached entries exist, this does not always guarantee that the model will not be recompiled during export
+if you modified the compilation parameters or updated the neuronx packages.**
 
-The public model cache will be used when you use the [`NeuronTrainer` or `NeuronModelForCausalLM`] classes. There are no additional changes needed.
+## Advanced usage (trainium only)
 
 ### How to use a private Neuron model cache (trainium only)
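
The registry layout and hashing behaviour described in the `cache_system.mdx` changes above can be sketched in a few lines of Python. This is a minimal illustration, not the `optimum-neuron` implementation: `entry_hash` and `registry_path` are hypothetical helpers, and the hashing scheme (SHA-256 over sorted JSON, truncated to 20 hex characters) is an assumption chosen only to show why any change in the model configuration or export parameters produces a different registry entry.

```python
import hashlib
import json
from pathlib import PurePosixPath


def entry_hash(model_config: dict, export_params: dict) -> str:
    """Hash a model config together with its export parameters.

    Hypothetical scheme: the real registry may hash different fields
    or use a different algorithm.
    """
    payload = json.dumps(
        {"config": model_config, "neuron": export_params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:20]


def registry_path(
    compiler_version: str,
    optimum_version: str,
    model_type: str,
    model_id: str,
    digest: str,
) -> PurePosixPath:
    """Build a path following the folder layout shown in the documentation."""
    return PurePosixPath(
        f"neuronxcc-{compiler_version}",
        "0_REGISTRY",
        optimum_version,
        model_type,  # primary key: shared by models differing only by weights
        model_id,
        f"{digest}.json",
    )


if __name__ == "__main__":
    # Illustrative values only; the config is a tiny stand-in for a real one.
    config = {"model_type": "llama", "hidden_size": 4096}
    params = {
        "batch_size": 1,
        "num_cores": 24,
        "auto_cast_type": "fp16",
        "sequence_length": 2048,
    }
    digest = entry_hash(config, params)
    print(
        registry_path(
            "2.12.54.0+f631c2365",
            "0.0.18",
            config["model_type"],
            "meta-llama/Llama-2-7b-chat-hf",
            digest,
        )
    )
```

Changing a single export parameter (for instance `batch_size`) changes the digest, which mirrors the documentation's point that even a small configuration change leads to a different set of cached artifacts.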
 

diff --git a/optimum/neuron/modeling_decoder.py b/optimum/neuron/modeling_decoder.py
@@ -14,6 +14,7 @@
 # limitations under the License.
 """Base class for text-generation model architectures on neuron devices."""
 
+import copy
 import logging
 import os
 import shutil
@@ -28,7 +29,7 @@
 from ..exporters.neuron.model_configs import *  # noqa: F403
 from ..exporters.tasks import TasksManager
 from ..modeling_base import OptimizedModel
-from .utils import CacheEntry, hub_neuronx_cache, is_transformers_neuronx_available
+from .utils import ModelCacheEntry, hub_neuronx_cache, is_transformers_neuronx_available
 from .utils.require_utils import requires_transformers_neuronx
 from .utils.version_utils import check_compiler_compatibility, get_neuronxcc_version
 
@@ -126,7 +127,7 @@ def __init__(
         os.environ["NEURON_CC_FLAGS"] = neuron_cc_flags + " --model-type=transformer"
         checkpoint_id = neuron_config.get("checkpoint_id", None)
         # Only create a cache entry if the model comes from the hub
-        cache_entry = None if checkpoint_id is None else CacheEntry(neuron_config["checkpoint_id"], neuron_config)
+        cache_entry = None if checkpoint_id is None else ModelCacheEntry(checkpoint_id, config)
         with hub_neuronx_cache(entry=cache_entry):
             neuronx_model.to_neuron()
         os.environ["NEURON_CC_FLAGS"] = neuron_cc_flags
@@ -170,14 +171,7 @@ def _create_checkpoint(
         return checkpoint_dir
 
     @classmethod
-    @requires_transformers_neuronx
-    def _from_transformers(cls, *args, **kwargs):
-        # Deprecate it when optimum uses `_export` as from_pretrained_method in a stable release.
-        return cls._export(*args, **kwargs)
-
-    @classmethod
-    @requires_transformers_neuronx
-    def _export(
+    def get_export_config(
         cls,
         model_id: str,
         config: "PretrainedConfig",
@@ -187,23 +181,11 @@ def _export(
         batch_size: Optional[int] = None,
         sequence_length: Optional[int] = None,
         num_cores: Optional[int] = None,
-        auto_cast_type: Optional[str] = "fp32",
-        **kwargs,
-    ) -> "NeuronDecoderModel":
-        if not os.path.isdir("/sys/class/neuron_device/"):
-            raise SystemError("Decoder models can only be exported on a neuron platform.")
-
+        auto_cast_type: Optional[str] = None,
+    ) -> "PretrainedConfig":
         if task is None:
             task = TasksManager.infer_task_from_model(cls.auto_model_class)
 
-        # Instantiate the transformers model checkpoint
-        checkpoint_dir = cls._create_checkpoint(
-            model_id,
-            task=task,
-            revision=revision,
-            **kwargs,
-        )
-
         if os.path.isdir(model_id):
             checkpoint_id = None
             checkpoint_revision = None
@@ -223,9 +205,15 @@ def _export(
         if num_cores is None:
             # Use all available cores
             num_cores = len(os.listdir("/sys/class/neuron_device/")) * 2
-
-        # Update the config
-        config.neuron = {
+        if auto_cast_type is None:
+            auto_cast_type = "fp32"
+            if config.torch_dtype == "float16":
+                auto_cast_type = "fp16"
+            elif config.torch_dtype == "bfloat16":
+                auto_cast_type = "bf16"
+
+        new_config = copy.deepcopy(config)
+        new_config.neuron = {
             "task": task,
             "batch_size": batch_size,
             "num_cores": num_cores,
@@ -236,6 +224,52 @@ def _export(
             "checkpoint_id": checkpoint_id,
             "checkpoint_revision": checkpoint_revision,
         }
+        return new_config
+
+    @classmethod
+    @requires_transformers_neuronx
+    def _from_transformers(cls, *args, **kwargs):
+        # Deprecate it when optimum uses `_export` as from_pretrained_method in a stable release.
+        return cls._export(*args, **kwargs)
+
+    @classmethod
+    @requires_transformers_neuronx
+    def _export(
+        cls,
+        model_id: str,
+        config: "PretrainedConfig",
+        use_auth_token: Optional[str] = None,
+        revision: Optional[str] = None,
+        task: Optional[str] = None,
+        batch_size: Optional[int] = None,
+        sequence_length: Optional[int] = None,
+        num_cores: Optional[int] = None,
+        auto_cast_type: Optional[str] = "fp32",
+        **kwargs,
+    ) -> "NeuronDecoderModel":
+        if not os.path.isdir("/sys/class/neuron_device/"):
+            raise SystemError("Decoder models can only be exported on a neuron platform.")
+
+        # Update the config
+        new_config = cls.get_export_config(
+            model_id,
+            config,
+            use_auth_token=use_auth_token,
+            revision=revision,
+            task=task,
+            batch_size=batch_size,
+            sequence_length=sequence_length,
+            num_cores=num_cores,
+            auto_cast_type=auto_cast_type,
+        )
+
+        # Instantiate the transformers model checkpoint
+        checkpoint_dir = cls._create_checkpoint(
+            model_id,
+            task=new_config.neuron["task"],
+            revision=revision,
+            **kwargs,
+        )
 
         # Try to reload the generation config (if any)
         generation_config = None
@@ -244,7 +278,7 @@ def _export(
         except OSError:
             pass
 
-        return cls(config, checkpoint_dir, generation_config=generation_config)
+        return cls(new_config, checkpoint_dir, generation_config=generation_config)
 
     @classmethod
     def _get_neuron_dirs(cls, model_path: Union[str, Path]) -> Tuple[str, str]:
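
The two behaviours introduced by `get_export_config` in the hunks above (inferring `auto_cast_type` from `config.torch_dtype` when it is `None`, and returning a deep copy instead of mutating the caller's config) can be sketched in isolation. This is a simplified stand-in under stated assumptions: `resolve_auto_cast_type` and `build_export_config` are hypothetical names, the config is a plain `SimpleNamespace` rather than a `PretrainedConfig`, and `torch_dtype` is assumed to be a string, as in the comparisons in the diff.

```python
import copy
from types import SimpleNamespace
from typing import Optional

# Mapping taken from the diff: config.torch_dtype -> auto_cast_type.
_DTYPE_TO_AUTO_CAST = {"float16": "fp16", "bfloat16": "bf16"}


def resolve_auto_cast_type(torch_dtype, auto_cast_type: Optional[str] = None) -> str:
    """Return the explicit value if given, else infer it from torch_dtype."""
    if auto_cast_type is not None:
        return auto_cast_type
    # Anything other than float16/bfloat16 falls back to full precision.
    return _DTYPE_TO_AUTO_CAST.get(torch_dtype, "fp32")


def build_export_config(
    config,
    batch_size: int,
    sequence_length: int,
    num_cores: int,
    auto_cast_type: Optional[str] = None,
):
    """Return a copy of config with a `neuron` dict; the input is left untouched."""
    new_config = copy.deepcopy(config)
    new_config.neuron = {
        "batch_size": batch_size,
        "num_cores": num_cores,
        "auto_cast_type": resolve_auto_cast_type(
            getattr(config, "torch_dtype", None), auto_cast_type
        ),
        "sequence_length": sequence_length,
    }
    return new_config


if __name__ == "__main__":
    original = SimpleNamespace(model_type="llama", torch_dtype="float16")
    exported = build_export_config(
        original, batch_size=1, sequence_length=2048, num_cores=2
    )
    print(exported.neuron)
    print(hasattr(original, "neuron"))
```

Returning a copy lets a caller compute an export configuration for a cache lookup without side effects on the config it passed in. Because an explicit value always wins in this sketch, a caller that forwards a non-`None` default (as `_export` appears to do with `"fp32"` in this diff) would skip the `torch_dtype` inference.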

diff --git a/optimum/neuron/utils/__init__.py b/optimum/neuron/utils/__init__.py
@@ -24,7 +24,7 @@
     ENCODER_NAME,
     NEURON_FILE_NAME,
 )
-from .hub_neuronx_cache import CacheEntry, get_hub_cached_entries, hub_neuronx_cache, synchronize_hub_cache
+from .hub_neuronx_cache import ModelCacheEntry, get_hub_cached_entries, hub_neuronx_cache, synchronize_hub_cache
 from .import_utils import (
     is_accelerate_available,
     is_neuron_available,