Add Neuronx compile cache proxy and use it for LLM decoder models (#410)

* feat: add HF Hub neuronx cache proxy

* feat(decoders): always use hub neuronx cache

* feat(cli): add cache synchronize command

* doc: add reference to new cache for NeuronModelForCausalLM

* ci: add neuronx cache tests

* feat(cache): only except when synchronizing

* feat(cache): catch more errors

* feat(cache): add warning on cache miss

* review: address comments

* fix(cli): avoid undefined symbol

* fix(utils): avoid possible circular import

* review: use existing require helper

* review: address comment
dacorvo authored Jan 17, 2024
1 parent 66c42d7 commit f81c365
Showing 8 changed files with 444 additions and 32 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/test_inf2.yml
@@ -35,6 +35,10 @@ jobs:
          python -m pip install -U pip
          python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
          python -m pip install .[neuronx,tests]
      - name: Run cache tests
        run: |
          source aws_neuron_venv_pytorch/bin/activate
          HF_TOKEN=${{ secrets.HF_TOKEN_OPTIMUM_NEURON_CI }} pytest -m is_inferentia_test tests/cache
      - name: Run CLI tests
        run: |
          source aws_neuron_venv_pytorch/bin/activate
59 changes: 30 additions & 29 deletions docs/source/guides/cache_system.mdx
@@ -12,52 +12,52 @@ specific language governing permissions and limitations under the License.

# Neuron Model Cache

The Neuron Model Cache is a remote cache for compiled Neuron models in the `neff` format.
It is integrated into the [`NeuronTrainer`] and [`NeuronModelForCausalLM`] classes to enable loading pretrained models from the cache instead of compiling them locally.

The Neuron Model Cache is hosted on the [Hugging Face Hub](https://huggingface.co/aws-neuron/optimum-neuron-cache) and includes compiled files for all popular and supported `optimum-neuron` pre-trained models.

When loading a Transformers or Diffusion model, it needs to be compiled to the Neuron format with [`torch-neuronx`](https://github.com/aws-neuron/aws-neuron-samples/tree/master/torch-neuronx) in order to run on Neuron platforms.
The compilation produces several compilation files stored in a local directory, usually `/var/tmp/neuron-compile-cache`.
This means that every time you train or export a model on a new host, you need to recompile it, which takes a lot of time.

We created the Neuron Model Cache to solve this limitation by providing a public cache of precompiled models, along with the ability to create your own private, secured, remote model cache.

The Neuron Model Cache plugs into the local cache directory of the Hugging Face Hub. During training, the [`NeuronTrainer`] will check if compilation files are available on the Hub and download them if they are found, allowing you to save both time and cost by skipping the compilation phase.
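
For decoder models, the cache is equally transparent: exporting through [`NeuronModelForCausalLM`] will download matching precompiled artifacts instead of recompiling. A minimal sketch (model id, shapes and dtype are illustrative):

```python
from optimum.neuron import NeuronModelForCausalLM

# Export triggers a neuron compilation; with the hub cache enabled, matching
# precompiled artifacts are fetched from the cache repo instead of being rebuilt.
model = NeuronModelForCausalLM.from_pretrained(
    "gpt2",              # illustrative model id
    export=True,
    batch_size=1,
    sequence_length=512,
    num_cores=2,
    auto_cast_type="bf16",
)
```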

## How the caching system works

### Hash computation

Many factors can trigger compilation, among which:

- The model weights
- The input shapes
- The precision of the model, full-precision or bf16
- The version of the Neuron X compiler
- The number of Neuron cores used

These parameters are used to compute a hash that uniquely identifies each compilation file.

**It is important to keep in mind that even a small change in the model configuration will trigger a recompilation.**
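
As a toy illustration of the idea (this is not the actual `NeuronHash` implementation), the hash can be thought of as a digest over all compilation-relevant parameters:

```python
import hashlib
import json

# Toy sketch only: any change to one of these parameters yields a new hash,
# hence a new cache entry and a recompilation on cache miss.
def compilation_hash(model_id: str, input_shapes: dict, precision: str,
                     compiler_version: str, num_cores: int) -> str:
    payload = json.dumps(
        {
            "model": model_id,
            "shapes": input_shapes,
            "precision": precision,
            "compiler": compiler_version,
            "cores": num_cores,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

print(compilation_hash("gpt2", {"batch_size": 1, "sequence_length": 512},
                       "bf16", "2.12.68.0", 2))  # illustrative values
```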

### How to use the Neuron model cache

The public model cache will be used when you use the [`NeuronTrainer`] or [`NeuronModelForCausalLM`] classes. There are no additional changes needed.

### How to use a private Neuron model cache (trainium only)

The repository for the public cache is `aws-neuron/optimum-neuron-cache`. This repository includes all precompiled files for commonly used models and is publicly available and free for everyone to use. But there are two limitations:

1. You will not be able to push your own compiled files to this repo
2. It is public and you might want to use a private repo for private models

To alleviate that, you can create your own private cache repository using the `optimum-cli`, or set the environment variable `CUSTOM_CACHE_REPO`.

#### Using the Optimum CLI

The Optimum CLI offers 2 subcommands for cache creation and setting:

- `create`: To create a new cache repository that you can use as a private Neuron Model cache.
- `set`: To set the name of the Neuron cache repository locally; the repository needs to exist
and will be used by default by `optimum-neuron`.

Create a new Neuron cache repository:
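
The example invocation is collapsed in this diff; it should look like the following sketch (the `-n` option for the repo name is an assumption, check `optimum-cli neuron cache create -h`):

```bash
# Create a cache repo on the Hugging Face Hub (name is illustrative)
optimum-cli neuron cache create -n michaelbenayoun/my_custom_cache_repo
```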
@@ -115,7 +115,7 @@ The `optimum-cli neuron cache set` command is useful when working on a new instance

Using the CLI is not always feasible, and not very practical for quick tests. In this case, you can simply set the environment variable `CUSTOM_CACHE_REPO`.

For example, if your cache repo is called `michaelbenayoun/my_custom_cache_repo`, you just need to do:

```bash
CUSTOM_CACHE_REPO="michaelbenayoun/my_custom_cache_repo" torchrun ...
```
@@ -139,11 +139,11 @@ You have to be [logged into the Hugging Face Hub](https://huggingface.co/docs/hu


At the beginning of each training step, the [`NeuronTrainer`] computes a `NeuronHash` and checks the cache repo(s) (official and custom) on the Hugging Face Hub to see if there are compiled files associated with this hash.
If that is the case, the files are downloaded directly to the local cache directory and no compilation is needed. Otherwise, compilation is performed.


Just as for downloading compiled files, the [`NeuronTrainer`] will keep track of the newly created compilation files at each training step, and upload them to the Hugging Face Hub at save time or when training ends. This assumes that you have write access to the cache repo, otherwise nothing will be pushed.


## Optimum CLI
@@ -156,15 +156,16 @@
```
usage: optimum-cli neuron cache [-h] {create,set,add,list} ...

positional arguments:
  {create,set,add,list}
    create       Create a model repo on the Hugging Face Hub to store Neuron X compilation files.
    set          Set the name of the Neuron cache repo to use locally (trainium only).
    add          Add a model to the cache of your choice (trainium only).
    list         List models in a cache repo (trainium only).
    synchronize  Synchronize local compiler cache with the hub cache (inferentia only).

optional arguments:
  -h, --help   show this help message and exit
```
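
On inferentia, a typical flow is to compile once and then push the resulting local compiler cache entries to your repo with the new `synchronize` subcommand (the `--repo_id` option comes from this commit; the repo name is illustrative):

```bash
# Push the local neuronx compiler cache to a hub cache repository
optimum-cli neuron cache synchronize --repo_id michaelbenayoun/my_custom_cache_repo
```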

### Add a model to the cache (trainium only)

It is possible to add a model's compilation files to a cache repo via the `optimum-cli neuron cache add` command:

@@ -178,7 +179,7 @@ usage: optimum-cli neuron cache add [-h] -m MODEL --task TASK --train_batch_size
When running this command, a small training session will be run and the resulting compilation files will be pushed.

<Tip warning={true}>
Make sure that the Neuron cache repo to use is set up locally; this can be done by running the `optimum-cli neuron cache set` command.
You also need to make sure that you are logged in to the Hugging Face Hub and that you have write access to the specified cache repo;
this can be done via the `huggingface-cli login` command.

15 changes: 15 additions & 0 deletions optimum/commands/neuron/cache.py
@@ -16,6 +16,7 @@

from typing import TYPE_CHECKING

from ...neuron.utils import synchronize_hub_cache
from ...neuron.utils.cache_utils import (
    CACHE_REPO_NAME,
    HF_HOME_CACHE_REPO_FILE,
@@ -208,6 +209,15 @@ def run(self):
        print(f"\n*** Repo id: {self.args.name} ***\n\n{result}")


class SynchronizeRepoCommand(BaseOptimumCLICommand):
    @staticmethod
    def parse_args(parser: "ArgumentParser"):
        parser.add_argument("--repo_id", type=str, default=None, help="The name of the repo to use as remote cache.")

    def run(self):
        synchronize_hub_cache(self.args.repo_id)


class CustomCacheRepoCommand(BaseOptimumCLICommand):
    SUBCOMMANDS = (
        CommandInfo(
@@ -230,4 +240,9 @@ class CustomCacheRepoCommand(BaseOptimumCLICommand):
help="List models in a cache repo.",
subcommand_class=ListRepoCommand,
),
CommandInfo(
name="synchronize",
help="Synchronize the neuronx compiler cache with a hub cache repo.",
subcommand_class=SynchronizeRepoCommand,
),
)
5 changes: 3 additions & 2 deletions optimum/neuron/modeling_decoder.py
@@ -29,7 +29,7 @@
from ..exporters.neuron.model_configs import * # noqa: F403
from ..exporters.tasks import TasksManager
from ..modeling_base import OptimizedModel
from .utils import hub_neuronx_cache, is_transformers_neuronx_available
from .utils.version_utils import check_compiler_compatibility, get_neuronxcc_version


@@ -223,7 +223,7 @@ def _from_pretrained(
        # Compile the Neuron model (if present compiled artifacts will be reloaded instead of compiled)
        neuron_cc_flags = os.environ.get("NEURON_CC_FLAGS", "")
        os.environ["NEURON_CC_FLAGS"] = neuron_cc_flags + " --model-type=transformer"
        with hub_neuronx_cache():
            neuronx_model.to_neuron()
        os.environ["NEURON_CC_FLAGS"] = neuron_cc_flags

        # Try to reload the generation config (if any)
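
For reference, the two helpers introduced here can also be used directly: `hub_neuronx_cache` routes neuronx compiler cache lookups through the hub for the duration of the context, and `synchronize_hub_cache` pushes local entries afterwards. A hedged sketch of direct usage (internal API, subject to change; `neuronx_model` is assumed to be a `transformers-neuronx` model instance):

```python
from optimum.neuron.utils import hub_neuronx_cache, synchronize_hub_cache

# Compilations inside the context consult the hub cache proxy first,
# falling back to local compilation on a cache miss.
with hub_neuronx_cache():
    neuronx_model.to_neuron()  # assumed transformers-neuronx model

# Push newly created local cache entries to a hub repo (name is illustrative).
synchronize_hub_cache("michaelbenayoun/my_custom_cache_repo")
```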
1 change: 1 addition & 0 deletions optimum/neuron/utils/__init__.py
@@ -24,6 +24,7 @@
    ENCODER_NAME,
    NEURON_FILE_NAME,
)
from .hub_neuronx_cache import hub_neuronx_cache, synchronize_hub_cache
from .import_utils import (
    is_accelerate_available,
    is_neuron_available,