Commit
Merge pull request #542 from xerbalind/getting_started
Add getting started chapter
Showing 5 changed files with 302 additions and 0 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
7 changes: 7 additions & 0 deletions
mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
TensorFlow example copied from https://github.com/EESSI/eessi-demo/tree/main/TensorFlow | ||
|
||
Loads MNIST datasets and trains a neural network to recognize hand-written digits. | ||
|
||
Runtime: ~1 min. on 8 cores (Intel Skylake) | ||
|
||
See https://www.tensorflow.org/tutorials/quickstart/beginner |
5 changes: 5 additions & 0 deletions
mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist/run.sh
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
#!/bin/bash | ||
|
||
module load TensorFlow/2.11.0-foss-2022a | ||
|
||
python tensorflow_mnist.py |
21 changes: 21 additions & 0 deletions
mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist/tensorflow_mnist.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
# based on https://www.tensorflow.org/tutorials/quickstart/beginner | ||
|
||
import tensorflow as tf | ||
|
||
mnist = tf.keras.datasets.mnist | ||
|
||
(x_train, y_train), (x_test, y_test) = mnist.load_data() | ||
x_train, x_test = x_train / 255.0, x_test / 255.0 | ||
|
||
model = tf.keras.models.Sequential([ | ||
tf.keras.layers.Flatten(input_shape=(28, 28)), | ||
tf.keras.layers.Dense(128, activation='relu'), | ||
tf.keras.layers.Dropout(0.2), | ||
tf.keras.layers.Dense(10, activation='softmax'), | ||
]) | ||
|
||
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy']) | ||
|
||
model.fit(x_train, y_train, epochs=5) | ||
|
||
model.evaluate(x_test, y_test, verbose=2) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,268 @@ | ||
{% set exampleloc="mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist" %} | ||
# Getting Started | ||
|
||
Welcome to the "Getting Started" guide. This chapter will lead you through the initial steps of logging into the {{hpcinfra}} and submitting your very first job. We'll also walk you through the process step by step using a practical example. | ||
|
||
In addition to this chapter, you might find the [recording of the *Introduction to HPC-UGent* training session](https://www.ugent.be/hpc/en/training/introhpcugent-recording) to be a useful resource. | ||
|
||
Before proceeding, read [the introduction to HPC](introduction.md) to gain an understanding of the {{ hpcinfra }} and related terminology. | ||
|
||
### Getting Access | ||
|
||
To get access to the {{hpcinfra}}, visit [Getting an HPC Account](account.md). | ||
|
||
If you have not used Linux before, | ||
{%- if site == 'Gent' %} | ||
now would be a good time to follow our [Linux Tutorial](./only/gent/linux-tutorial/index.md). | ||
{%- else %} | ||
please learn some basics before continuing (see [Appendix C - Useful Linux Commands](useful_linux_commands.md)). | ||
{%- endif %} | ||
|
||
#### A typical workflow looks like this: | ||
|
||
1. Connect to the login nodes | ||
2. Transfer your files to the {{hpcinfra}} | ||
3. Optional: compile your code and test it | ||
4. Create a job script and submit your job | ||
5. Wait for the job to be executed | ||
6. Study the results generated by your jobs, either on the cluster or | ||
after downloading them locally. | ||
|
||
We will walk through an illustrative workload to get you started. In this example, our objective is to train a deep learning model for recognizing hand-written digits (MNIST dataset) using [TensorFlow](https://www.tensorflow.org/); | ||
see the [example scripts](https://github.com/hpcugent/vsc_user_docs/tree/main/{{exampleloc}}). | ||
|
||
### Getting Connected | ||
|
||
There are two options to connect: | ||
|
||
- Using a terminal to connect via SSH (for power users) (see [First Time connection to the {{ hpcinfra}}](connecting.md#first-time-connection-to-the-hpc-infrastructure)) | ||
- [Using the web portal](web_portal.md) | ||
|
||
Considering your operating system is **{{OS}}**, | ||
|
||
{%- if OS == linux %} | ||
it is recommended to make use of the `ssh` command in a terminal to get the most flexibility. | ||
|
||
Assuming you have already generated SSH keys in the previous step ([Getting Access](#getting-access)), and that they are in a default location, you should now be able to log in by running the following command: | ||
|
||
<pre><code>ssh {{userid}}@{{loginnode}}</code></pre> | ||
|
||
!!! Warning "Use your own VSC account id" | ||
|
||
Replace <b>{{userid}}</b> with your VSC account id (see <https://account.vscentrum.be>) | ||
|
||
!!! Tip | ||
|
||
You can also still use the web portal (see [shell access on web portal](web_portal.md#shell-access)) | ||
|
||
{%- else %} | ||
{%- if OS == windows %} it is recommended to use the web portal. | ||
{%- else %} it should be easy to make use of the `ssh` command in a terminal, but the web portal will work too. {%- endif %} | ||
|
||
The [web portal](web_portal.md) offers a convenient way to upload files and gain shell access to the {{hpcinfra}} from a standard web browser (no software installation or configuration required). | ||
|
||
See [shell access](web_portal.md#shell-access) when using the web portal, or | ||
[connection to the {{hpcinfra}}](connecting.md#first-time-connection-to-the-hpc-infrastructure) when using a terminal. | ||
|
||
Make sure you can get shell access to the {{hpcinfra}} before proceeding with the next steps. | ||
|
||
{%- endif %} | ||
|
||
!!! Info | ||
|
||
If you have problems connecting, see the [connection issues section on the troubleshooting page](troubleshooting.md#sec:connecting-issues). | ||
|
||
|
||
### Transfer your files | ||
|
||
Now that you can log in, it is time to transfer files from your local computer to your **home directory** on the {{hpcinfra}}. | ||
|
||
Download [tensorflow_mnist.py](https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/tensorflow_mnist.py) | ||
and [run.sh](https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/run.sh) example scripts to your computer (from [here](https://github.com/hpcugent/vsc_user_docs/tree/main/{{exampleloc}})). | ||
|
||
{%- if OS == windows %} | ||
|
||
The [HPC-UGent web portal](https://login.hpc.ugent.be) provides a file browser that allows uploading files. | ||
For more information see the [file browser section](web_portal.md#file-browser). | ||
|
||
Upload both files (`run.sh` and `tensorflow_mnist.py`) to your **home directory** and go back to your shell. | ||
|
||
!!! Info | ||
|
||
As an alternative, you can use WinSCP (see [our section](connecting.md#winscp)) | ||
|
||
{%- else %} | ||
|
||
On your local machine you can run: | ||
<pre><code>curl -OL https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/tensorflow_mnist.py | ||
curl -OL https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/run.sh | ||
</code></pre> | ||
|
||
Using the `scp` command, the files can be copied from your local host to your *home directory* (`~`) on the remote host (HPC). | ||
<pre><code>scp tensorflow_mnist.py run.sh {{userid}}@{{ loginnode }}:~ </code></pre> | ||
<pre><code>ssh {{userid}}@{{ loginnode }} </code></pre> | ||
|
||
!!! Warning "Use your own VSC account id" | ||
|
||
Replace <b>{{userid}}</b> with your VSC account id (see <https://account.vscentrum.be>) | ||
|
||
!!! Info | ||
|
||
For more information about transferring files or `scp`, see [transfer files to/from the HPC](connecting.md#transfer-files-tofrom-the-hpc). | ||
|
||
{%- endif %} | ||
|
||
When running `ls` in your session on the {{hpcinfra}}, you should see the two files listed in your home directory (`~`): | ||
|
||
```shell | ||
$ ls ~ | ||
run.sh tensorflow_mnist.py | ||
``` | ||
|
||
If you do not see these files, make sure you uploaded them to your **home directory**. | ||
|
||
### Submitting a job | ||
|
||
Jobs are submitted and executed using job scripts. In our case **run.sh** can be used as a (very minimal) job script. | ||
|
||
A job script is a shell script, a text file that specifies the resources, | ||
the software that is used (via `module load` statements), | ||
and the steps that should be executed to run the calculation. | ||
|
||
Our job script looks like this: | ||
|
||
<center>-- run.sh --</center> | ||
|
||
```bash | ||
#!/bin/bash | ||
|
||
module load TensorFlow/2.11.0-foss-2022a | ||
|
||
python tensorflow_mnist.py | ||
|
||
``` | ||
<sub>As you can see, this job script will run the Python script named **tensorflow_mnist.py**.</sub> | ||
|
||
|
||
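To make the structure of a job script concrete, here is a minimal Python sketch that generates the same text as `run.sh`. The helper name `render_job_script` is hypothetical. The optional `cores` and `walltime` arguments emit PBS-style `#PBS -l` resource directives; treat their exact form as an assumption and check the [running batch jobs](running_batch_jobs.md) chapter before relying on them.

```python
def render_job_script(module, command, cores=None, walltime=None):
    """Return the text of a minimal job script.

    `cores` and `walltime` are optional. They are written as PBS-style
    directives, which is an assumption about the scheduler front end.
    """
    lines = ["#!/bin/bash"]
    if cores is not None:
        lines.append(f"#PBS -l nodes=1:ppn={cores}")
    if walltime is not None:
        lines.append(f"#PBS -l walltime={walltime}")
    # Software environment first, then the actual calculation.
    lines += ["", f"module load {module}", "", command, ""]
    return "\n".join(lines)


script = render_job_script(
    "TensorFlow/2.11.0-foss-2022a",
    "python tensorflow_mnist.py",
)
print(script)
```

Without the optional arguments, the generated text matches the `run.sh` shown above: a shebang line, a `module load` statement, and the command to run.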
The jobs you submit are executed on **cluster/{{defaultcluster}}** by default. You can swap to another cluster by issuing the following command: | ||
|
||
```shell | ||
module swap cluster/{{othercluster}} | ||
``` | ||
|
||
!!! Tip | ||
|
||
When submitting jobs that need only a limited amount of resources, it is recommended to use the [debug/interactive cluster](interactive_debug.md#interactive-and-debug-cluster): `donphan`. | ||
|
||
{%- if site == 'Gent' %} | ||
|
||
To get a list of all clusters and their hardware, see <https://www.ugent.be/hpc/en/infrastructure>. | ||
|
||
{%- endif %} | ||
|
||
This job script can now be submitted to the cluster's job system for execution, using the qsub (**q**ueue **sub**mit) command: | ||
|
||
```shell | ||
$ qsub run.sh | ||
{{jobid}} | ||
``` | ||
|
||
This command returns a job identifier (*{{jobid}}*) on the HPC cluster. This is a unique identifier for the job which can be used to monitor and manage your job. | ||
|
||
!!! Warning "Make sure you understand what the `module` command does" | ||
|
||
Note that the module commands only modify environment variables. For instance, running `module swap cluster/{{othercluster}}` will update your shell environment so that `qsub` submits a job to the `{{othercluster}}` cluster, | ||
but our active shell session is still running on the login node. | ||
|
||
It is important to understand that while `module` commands affect your session environment, they do ***not*** change where the commands you are running are being executed: they will still be run on the login node you are on. | ||
|
||
When you submit a job script however, the commands ***in*** the job script will be run on a workernode of the cluster the job was submitted to (like `{{othercluster}}`). | ||
|
||
For detailed information about `module` commands, read the [running batch jobs](running_batch_jobs.md) chapter. | ||
|
||
### Wait for job to be executed | ||
|
||
Your job is put into a queue before being executed, so it may take a while before it actually starts | ||
(see [when will my job start?](running_batch_jobs.md#when-will-my-job-start) for the scheduling policy). | ||
|
||
You can get an overview of the active jobs using the `qstat` command: | ||
<pre><code>$ qstat | ||
Job ID Name User Time Use S Queue | ||
---------- ---------------- --------------- -------- - ------- | ||
{{jobid}} run.sh {{userid}} 0:00:00 <b style="color:orange">Q</b> {{othercluster}} | ||
</code></pre> | ||
|
||
Eventually, after entering `qstat` again you should see that your job has started running: | ||
<pre><code>$ qstat | ||
Job ID Name User Time Use S Queue | ||
---------- ---------------- --------------- -------- - ------- | ||
{{jobid}} run.sh {{userid}} 0:00:01 <b style="color:green">R</b> {{othercluster}} | ||
</code></pre> | ||
|
||
If you don't see your job in the output of the `qstat` command anymore, your job has likely completed. | ||
|
||
Read [this section](running_batch_jobs.md#monitoring-and-managing-your-jobs) on how to interpret the output. | ||
|
||
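If you want to check the job state from a script rather than by eye, the table printed by `qstat` can be parsed. The snippet below is a minimal sketch that assumes the six-column layout shown above and job names without spaces; `parse_qstat` is a hypothetical helper, and the job ID `123456` and user `vsc40000` in the sample are made up for illustration.

```python
def parse_qstat(text):
    """Map job ID -> state letter from qstat-style tabular output.

    Assumes the six-column layout shown in this chapter
    (Job ID, Name, User, Time Use, S, Queue).
    """
    states = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header (more than six words) and the dashed separator.
        if len(fields) != 6 or set(fields[0]) == {"-"}:
            continue
        job_id, _name, _user, _time_used, state, _queue = fields
        states[job_id] = state
    return states


# Sample text with made-up job ID and user name.
sample = """\
Job ID     Name             User            Time Use S Queue
---------- ---------------- --------------- -------- - -------
123456     run.sh           vsc40000        0:00:01  R donphan
"""

print(parse_qstat(sample))
```

A job that is missing from the returned dictionary has most likely completed, in line with the remark above.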
### Inspect your results | ||
|
||
When your job finishes it generates 2 output files: | ||
|
||
- One for normal output messages (*stdout* output channel). | ||
- One for warning and error messages (*stderr* output channel). | ||
|
||
By default, these files are located in the directory where you issued `qsub`. | ||
|
||
{%- if site == 'Gent' %} | ||
|
||
!!! Info | ||
|
||
For more information about the stdout and stderr output channels, see this [section](./only/gent/linux-tutorial/beyond_the_basics.md#inputoutput). | ||
|
||
{%- endif %} | ||
|
||
In our example when running <code>ls</code> in the current directory you should see 2 new files: | ||
|
||
- **run.sh.o{{jobid}}**, containing *normal output messages* produced by job {{jobid}}; | ||
- **run.sh.e{{jobid}}**, containing *errors and warnings* produced by job {{jobid}}. | ||
|
||
!!! Info | ||
|
||
run.sh.e{{jobid}} should be empty (no errors or warnings). | ||
|
||
!!! Warning "Use your own job ID" | ||
|
||
Replace <b>{{jobid}}</b> with the jobid you got from the `qstat` command (see above) or simply look for added files in your current directory by running `ls`. | ||
|
||
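The naming scheme described above (`<script>.o<jobid>` and `<script>.e<jobid>`) can be checked from Python. This is a small sketch, not part of the example scripts: `check_job_output` is a hypothetical helper, and the job ID passed to it must be your own.

```python
from pathlib import Path


def check_job_output(directory, script_name, job_id):
    """Return (stdout_path, stderr_is_empty) for a finished job.

    Raises FileNotFoundError if either output file is missing,
    for instance because the job has not finished yet.
    """
    directory = Path(directory)
    stdout_path = directory / f"{script_name}.o{job_id}"
    stderr_path = directory / f"{script_name}.e{job_id}"
    if not stdout_path.exists() or not stderr_path.exists():
        raise FileNotFoundError(
            f"output files for job {job_id} not found in {directory}"
        )
    return stdout_path, stderr_path.stat().st_size == 0


# Example call (replace 123456 with your own job ID):
# stdout_path, stderr_is_empty = check_job_output(".", "run.sh", "123456")
```

If `stderr_is_empty` is `False`, inspect the `.e` file before trusting the results.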
When examining the contents of ``run.sh.o{{jobid}}`` you will see something like this: | ||
``` | ||
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz | ||
11493376/11490434 [==============================] - 1s 0us/step | ||
Epoch 1/5 | ||
1875/1875 [==============================] - 2s 823us/step - loss: 0.2960 - accuracy: 0.9133 | ||
Epoch 2/5 | ||
1875/1875 [==============================] - 1s 771us/step - loss: 0.1427 - accuracy: 0.9571 | ||
Epoch 3/5 | ||
1875/1875 [==============================] - 1s 767us/step - loss: 0.1070 - accuracy: 0.9675 | ||
Epoch 4/5 | ||
1875/1875 [==============================] - 1s 764us/step - loss: 0.0881 - accuracy: 0.9727 | ||
Epoch 5/5 | ||
1875/1875 [==============================] - 1s 764us/step - loss: 0.0741 - accuracy: 0.9768 | ||
313/313 - 0s - loss: 0.0782 - accuracy: 0.9764 | ||
``` | ||
|
||
Hurray 🎉, we trained a deep learning model and achieved 97.64 percent accuracy on the test set. | ||
|
||
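The accuracy quoted above comes from the last line of the output, which is printed by `model.evaluate(x_test, y_test, verbose=2)`. If you want to extract it programmatically, a sketch along these lines may help. The helper `final_test_accuracy` is hypothetical and assumes the line format shown in the sample output above, which can differ between TensorFlow versions.

```python
import re


def final_test_accuracy(log_text):
    """Return the accuracy from the last evaluation line, or None.

    Evaluation lines look like
    "313/313 - 0s - loss: 0.0782 - accuracy: 0.9764";
    per-epoch training lines contain a progress bar and are not matched.
    """
    matches = re.findall(
        r"^\d+/\d+ - .*accuracy: ([0-9.]+)\s*$",
        log_text,
        flags=re.MULTILINE,
    )
    return float(matches[-1]) if matches else None


# Two lines copied from the sample output above.
sample_log = """\
1875/1875 [==============================] - 1s 764us/step - loss: 0.0741 - accuracy: 0.9768
313/313 - 0s - loss: 0.0782 - accuracy: 0.9764
"""

accuracy = final_test_accuracy(sample_log)
print(f"{accuracy:.2%}")
```

In practice you would pass in the contents of your own `run.sh.o{{jobid}}` file instead of `sample_log`.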
!!! Warning | ||
|
||
When using TensorFlow specifically, you should actually submit jobs to a GPU cluster for better performance; see [GPU clusters](gpu.md). | ||
|
||
For the purpose of this example, we are running a very small TensorFlow workload on a CPU-only cluster. | ||
|
||
### Next steps | ||
|
||
- [Running interactive jobs](running_interactive_jobs.md) | ||
- [Running jobs with input/output data](running_jobs_with_input_output_data.md) | ||
- [Multi core jobs/Parallel Computing](multi_core_jobs.md) | ||
- [Interactive and debug cluster](interactive_debug.md#interactive-and-debug-cluster) | ||
|
||
For more examples, see [Program examples](program_examples.md) and [Job script examples](jobscript_examples.md). |