Merge pull request #259 from argonne-lcf/feature/Cerebras_updates_2
Updates for new release of Cerebras software 1.9.1
wcarnold1010 committed Aug 11, 2023
2 parents 2baa840 + f79516a commit 98ce520
Showing 6 changed files with 81 additions and 110 deletions.
32 changes: 20 additions & 12 deletions docs/ai-testbed/cerebras/customizing-environment.md
@@ -5,43 +5,51 @@
#### To make a PyTorch virtual environment for Cerebras

```console
mkdir ~/R_1.8.0
cd ~/R_1.8.0
#Make your home directory navigable
chmod a+xr ~/
mkdir ~/R_1.9.1
chmod a+x ~/R_1.9.1/
cd ~/R_1.9.1
# Note: "deactivate" does not actually work in scripts.
deactivate
rm -r venv_pt
/software/cerebras/python3.7/bin/python3.7 -m venv venv_pt
/software/cerebras/python3.8/bin/python3.8 -m venv venv_pt
source venv_pt/bin/activate
pip3 install --disable-pip-version-check /opt/cerebras/wheels/cerebras_pytorch-1.8.0+de49801ca3-py3-none-any.whl --find-links=/opt/cerebras/wheels/
pip3 install /opt/cerebras/wheels/cerebras_pytorch-1.9.1+1cf4d0632b-cp38-cp38-linux_x86_64.whl --find-links=/opt/cerebras/wheels
pip install numpy==1.23.4
pip install datasets transformers
```
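
As a quick, optional sanity check, you can confirm that the new venv uses Python 3.8 and that the Cerebras wheel is visible to pip:

```console
source ~/R_1.9.1/venv_pt/bin/activate
python --version                                   # expect Python 3.8.x
pip list --disable-pip-version-check | grep -i cerebras
```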

#### To make a TensorFlow virtual environment for Cerebras

```console
mkdir ~/R_1.8.0
cd ~/R_1.8.0
chmod a+xr ~/
mkdir ~/R_1.9.1
chmod a+x ~/R_1.9.1/
cd ~/R_1.9.1
# Note: "deactivate" does not actually work in scripts.
deactivate
rm -r venv_tf
/software/cerebras/python3.7/bin/python3.7 -m venv venv_tf
/software/cerebras/python3.8/bin/python3.8 -m venv venv_tf
source venv_tf/bin/activate
pip install tensorflow_datasets
pip install spacy
pip3 install --disable-pip-version-check /opt/cerebras/wheels/cerebras_tensorflow-1.8.0+de49801ca3-py3-none-any.whl --find-links=/opt/cerebras/wheels/
#pip install tensorflow_datasets
#pip install spacy
pip3 install /opt/cerebras/wheels/cerebras_tensorflow-1.9.1+1cf4d0632b-cp38-cp38-linux_x86_64.whl --find-links=/opt/cerebras/wheels/
pip install numpy==1.23.4
```

#### Activation and deactivation

To activate one of these virtual environments,

```console
source ~/R_1.8.0/venv_pt/bin/activate
source ~/R_1.9.1/venv_pt/bin/activate
```

or

```console
source ~/R_1.8.0/venv_tf/bin/activate
source ~/R_1.9.1/venv_tf/bin/activate
```

To deactivate a virtual environment,
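
For a standard Python venv, this is simply:

```console
deactivate
```
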
55 changes: 25 additions & 30 deletions docs/ai-testbed/cerebras/example-programs.md
@@ -4,12 +4,12 @@
Make a working directory and local copies of the Cerebras **modelzoo** and **anl_shared** repositories, if not previously done, as follows.

```bash
mkdir ~/R_1.8.0
cd ~/R_1.8.0
mkdir ~/R_1.9.1
cd ~/R_1.9.1
git clone https://github.com/Cerebras/modelzoo.git
```
<!---
cp -r /software/cerebras/model_zoo/anl_shared/ ~/R_1.8.0/anl_shared
cp -r /software/cerebras/model_zoo/anl_shared/ ~/R_1.9.1/anl_shared
--->

## UNet
@@ -19,17 +19,17 @@ To run Unet with the <a href="https://www.kaggle.com/c/severstal-steel-defect-de
First, source a Cerebras PyTorch virtual environment.

```console
source ~/R_1.8.0/venv_pt/bin/activate
source ~/R_1.9.1/venv_pt/bin/activate
```

Then

```console
cd ~/R_1.8.0/modelzoo/modelzoo/vision/pytorch/unet
cd ~/R_1.9.1/modelzoo/modelzoo/vision/pytorch/unet
cp /software/cerebras/dataset/severstal-steel-defect-detection/params_severstal_binary_rawds.yaml configs/params_severstal_binary_rawds.yaml
export MODEL_DIR=model_dir_unet
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX weight_streaming --params configs/params_severstal_binary_rawds.yaml --model_dir $MODEL_DIR --mode train --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_1.8.0/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
python run.py CSX --job_labels name=unet_pt --params configs/params_severstal_binary_rawds.yaml --model_dir $MODEL_DIR --mode train --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_1.9.1/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
```

<!--- Appears to not have been ported to 1.7.1
@@ -43,7 +43,7 @@ The BraggNN model has two versions:<br>
```console
TODO
cd ~/R_1.8.0/anl_shared/braggnn/tf
cd ~/R_1.9.1/anl_shared/braggnn/tf
# This yaml has a correct path to a BraggNN dataset
cp /software/cerebras/dataset/BraggN/params_bragg_nonlocal_sampleds.yaml configs/params_bragg_nonlocal_sampleds.yaml
export MODEL_DIR=model_dir_braggnn
@@ -63,17 +63,17 @@ source /software/cerebras/venvs/venv_pt/bin/activate
# or your personal venv
--->
```console
source ~/R_1.8.0/venv_pt/bin/activate
source ~/R_1.9.1/venv_pt/bin/activate
```

Then

```console
cd ~/R_1.8.0/modelzoo/modelzoo/transformers/pytorch/bert
cd ~/R_1.9.1/modelzoo/modelzoo/transformers/pytorch/bert
cp /software/cerebras/dataset/bert_large/bert_large_MSL128_sampleds.yaml configs/bert_large_MSL128_sampleds.yaml
export MODEL_DIR=model_dir_bert_large_pytorch
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX pipeline --job_labels name=bert_pt --params configs/bert_large_MSL128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_1.8.0/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
python run.py CSX --job_labels name=bert_pt --params configs/bert_large_MSL128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_1.9.1/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
```

The last parts of the output should resemble the following, with messages about cuda that should be ignored and are not shown.
@@ -97,27 +97,24 @@ The last parts of the output should resemble the following, with messages about
2023-05-17 18:18:49,293 INFO: Monitoring returned
```

<!--- No longer part of the modelzoo
## BERT - TensorFlow
The modelzoo/modelzoo/transformers/tf/bert directory is a TensorFlow implementation of [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)<br>
This BERT-large msl128 example uses a single sample dataset for both training and evaluation. See the README.md in the source directory for details on how to build a dataset from text input.
First, source a Cerebras TensorFlow virtual environment.
<!---
source /software/cerebras/venvs/venv_tf/bin/activate
# or your personal venv
--->
```console
source ~/R_1.8.0/venv_tf/bin/activate
source ~/R_1.9.1/venv_tf/bin/activate
```
Then
```console
cd ~/R_1.8.0/modelzoo/modelzoo/transformers/tf/bert
cd ~/R_1.9.1/modelzoo/modelzoo/transformers/tf/bert
cp /software/cerebras/dataset/bert_large/params_bert_large_msl128_sampleds.yaml configs/params_bert_large_msl128_sampleds.yaml
export MODEL_DIR=mytest
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX pipeline --job_labels name=bert_tf --max_steps 1000 --params configs/params_bert_large_msl128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_1.8.0/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
python run.py CSX --job_labels name=bert_tf --max_steps 1000 --params configs/params_bert_large_msl128_sampleds.yaml --num_workers_per_csx=1 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_1.9.1/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
```
The last parts of the output should resemble the following, with messages about cuda that should be ignored and are not shown.
@@ -140,6 +137,7 @@ INFO:root:Taking final checkpoint at step: 1000
INFO:tensorflow:Saved checkpoint for global step 1000 in 67.17758774757385 seconds: mytest/model.ckpt-1000
INFO:root:Monitoring returned
```
--->

## GPT-J PyTorch

@@ -148,22 +146,18 @@ This PyTorch GPT-J 6B parameter pretraining sample uses 2 CS2s.

First, source a Cerebras PyTorch virtual environment.

<!---
source /software/cerebras/venvs/venv_pt/bin/activate
# or your personal venv
--->
```console
source ~/R_1.8.0/venv_pt/bin/activate
source ~/R_1.9.1/venv_pt/bin/activate
```

Then

```console
cd ~/R_1.8.0/modelzoo/modelzoo/transformers/pytorch/gptj
cd ~/R_1.9.1/modelzoo/modelzoo/transformers/pytorch/gptj
cp /software/cerebras/dataset/gptj/params_gptj_6B_sampleds.yaml configs/params_gptj_6B_sampleds.yaml
export MODEL_DIR=model_dir_gptj
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX weight_streaming --job_labels name=gptj_pt --params configs/params_gptj_6B_sampleds.yaml --num_csx=2 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_1.8.0/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
python run.py CSX --job_labels name=gptj_pt --params configs/params_gptj_6B_sampleds.yaml --num_csx=2 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_1.9.1/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
```

The last parts of the output should resemble the following:
@@ -180,30 +174,30 @@ The last parts of the output should resemble the following:
2023-05-17 19:27:13,435 INFO: Saved checkpoint at global step: 500
2023-05-17 19:27:13,436 INFO: Training Complete. Completed 65000 sample(s) in 2554.1804394721985 seconds.
```

<!---
## GPT-J TensorFlow
GPT-J [[github]](https://github.com/kingoflolz/mesh-transformer-jax) is an auto-regressive language model created by [EleutherAI](https://www.eleuther.ai/).
This TensorFlow GPT-J 6B parameter pretraining sample uses 2 CS2s.
First, source a Cerebras TensorFlow virtual environment.
<!---
source /software/cerebras/venvs/venv_tf/bin/activate
# or your personal venv
--->
```console
source ~/R_1.8.0/venv_tf/bin/activate
source ~/R_1.9.1/venv_tf/bin/activate
```
Then
```console
cd ~/R_1.8.0/modelzoo/modelzoo/transformers/tf/gptj
cd ~/R_1.9.1/modelzoo/modelzoo/transformers/tf/gptj
cp /software/cerebras/dataset/gptj/params_gptj_6B_tf_sampleds.yaml configs/params_gptj_6B_sampleds.yaml
export MODEL_DIR=model_dir_gptj_tf
if [ -d "$MODEL_DIR" ]; then rm -Rf $MODEL_DIR; fi
python run.py CSX weight_streaming --job_labels name=gptj_tf --max_steps 500 --params configs/params_gptj_6B_sampleds.yaml --num_csx=2 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_1.8.0/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
python run.py CSX --job_labels name=gptj_tf --max_steps 500 --params configs/params_gptj_6B_sampleds.yaml --num_csx=2 --mode train --model_dir $MODEL_DIR --mount_dirs /home/ /software/ --python_paths /home/$(whoami)/R_1.9.1/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
```
The last parts of the output should resemble the following:
@@ -217,3 +211,4 @@ INFO:root:Taking final checkpoint at step: 500
INFO:tensorflow:Saved checkpoint for global step 500 in 304.37238907814026 seconds: model_dir_gptj_tf/model.ckpt-500
INFO:root:Monitoring is over without any issue
```
--->
37 changes: 17 additions & 20 deletions docs/ai-testbed/cerebras/job-queuing-and-submission.md
@@ -5,18 +5,13 @@ The CS-2 cluster has its own **Kubernetes-based** system for job submission and
Jobs are started automatically through the **Python** frameworks in modelzoo.common.pytorch.run_utils and modelzoo.common.tf.run_utils.
Continuous job status is output to stdout/stderr; redirect the output to a file, or run inside a persistent session started with **screen** or **tmux**, or both.
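
For example, one way to keep a long run alive across disconnects is to launch it inside a named `screen` session and tee the output to a log file. This is a sketch, not an official recipe; it reuses the UNet run command from example-programs.md and an arbitrary session name, `unet_run`:

```console
screen -S unet_run        # start a named session (later: screen -r unet_run to re-attach)
source ~/R_1.9.1/venv_pt/bin/activate
cd ~/R_1.9.1/modelzoo/modelzoo/vision/pytorch/unet
export MODEL_DIR=model_dir_unet
python run.py CSX --job_labels name=unet_pt --params configs/params_severstal_binary_rawds.yaml --model_dir $MODEL_DIR --mode train --mount_dirs /home/ /software --python_paths /home/$(whoami)/R_1.9.1/modelzoo/ --compile_dir $(whoami) |& tee mytest.log
# Detach with Ctrl-a d; the job keeps writing status to mytest.log.
```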

In order to run the Cerebras csctl utility you will need to copy a config file to your home directory. Future versions of Cerebras software will reference a system wide file.
```console
mkdir ~/.cs; cp /opt/cerebras/config ~/.cs/config
```

Jobs that have not yet completed can be listed as shown. Note: this command can take over a minute to complete.

```console
(venv_tf) $ csctl get jobs | grep -v "SUCCEEDED\|FAILED\|CANCELLED"
NAME AGE PHASE SYSTEMS USER LABELS
wsjob-eyjapwgnycahq9tus4w7id 88s RUNNING cer-cs2-01 username name=pt_smoketest,user=username
(venv_tf) $
(venv_pt) $ csctl get jobs
NAME AGE DURATION PHASE SYSTEMS USER LABELS DASHBOARD
wsjob-thjj8zticwsylhppkbmjqe 13s 1s RUNNING cer-cs2-01 username name=unet_pt https://grafana.cerebras1.lab.alcf.anl.gov/d/WebHNShVz/wsjob-dashboard?orgId=1&var-wsjob=wsjob-thjj8zticwsylhppkbmjqe&from=1691705374000&to=now
(venv_pt) $
```

Jobs can be canceled as shown:
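
For instance, a minimal sketch, assuming the cancel syntax is `csctl cancel job <job-name>` (here using the job name from the listing above):

```console
csctl cancel job wsjob-thjj8zticwsylhppkbmjqe
```
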
@@ -46,7 +41,8 @@ wsjob-ez6dyfronnsg2rz7f7fqw4 19m SUCCEEDED cer-cs2-02 username testlabel=test,
(venv_pt) $
```

See `csctl -h` for more options
See `csctl -h` for more options.<br>
Add `-h` to a command for help for that command, e.g. `csctl get -h` or `csctl cancel -h`.

```console
$ csctl -h
@@ -56,18 +52,19 @@ Usage:
csctl [command]

Available Commands:
cancel Cancel job
config Modify csctl config files
get Get resources
label Label resources
log-export Gather and download logs.
types Display resource types
cancel Cancel job
clear-worker-cache Clear the worker cache
config View csctl config files
get Get resources
label Label resources
log-export Gather and download logs.
types Display resource types

Flags:
--csconfig string config file (default is $HOME/.cs/config) (default "$HOME/.cs/config")
-d, --debug int higher debug values will display more fields in output objects
-h, --help help for csctl
-d, --debug int higher debug values will display more fields in output objects
-h, --help help for csctl
--namespace string configure csctl to talk to different user namespaces
-v, --version version for csctl

Use "csctl [command] --help" for more information about a command.

```