Merge pull request #542 from xerbalind/getting_started
Add getting started chapter
boegel authored Aug 16, 2023
2 parents 55c8b55 + 5b419e9 commit b3d7d99
Showing 5 changed files with 302 additions and 0 deletions.
1 change: 1 addition & 0 deletions config/templates/hpc.template
@@ -7,6 +7,7 @@ docs_dir: docs/HPC
nav:
- Welcome: index.md
- Introduction to HPC: introduction.md
- Getting Started: getting_started.md
- Getting an HPC Account: account.md
- Connecting to the HPC infrastructure: connecting.md
- Running batch jobs: running_batch_jobs.md
@@ -0,0 +1,7 @@
TensorFlow example copied from https://github.com/EESSI/eessi-demo/tree/main/TensorFlow

Loads MNIST datasets and trains a neural network to recognize hand-written digits.

Runtime: ~1 min. on 8 cores (Intel Skylake)

See https://www.tensorflow.org/tutorials/quickstart/beginner
@@ -0,0 +1,5 @@
#!/bin/bash

module load TensorFlow/2.11.0-foss-2022a

python tensorflow_mnist.py
@@ -0,0 +1,21 @@
# based on https://www.tensorflow.org/tutorials/quickstart/beginner

import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax'),
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)

model.evaluate(x_test, y_test, verbose=2)
268 changes: 268 additions & 0 deletions mkdocs/docs/HPC/getting_started.md
@@ -0,0 +1,268 @@
{% set exampleloc="mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist" %}
# Getting Started

Welcome to the "Getting Started" guide. This chapter will lead you through the initial steps of logging into the {{hpcinfra}} and submitting your very first job. We'll also walk you through the process step by step using a practical example.

In addition to this chapter, you might find the [recording of the *Introduction to HPC-UGent* training session](https://www.ugent.be/hpc/en/training/introhpcugent-recording) to be a useful resource.

Before proceeding, read [the introduction to HPC](introduction.md) to gain an understanding of the {{ hpcinfra }} and related terminology.

### Getting Access

To get access to the {{hpcinfra}}, visit [Getting an HPC Account](account.md).

If you have not used Linux before,
{%- if site == 'Gent' %}
now would be a good time to follow our [Linux Tutorial](./only/gent/linux-tutorial/index.md).
{%- else %}
please learn some basics before continuing (see [Appendix C - Useful Linux Commands](useful_linux_commands.md)).
{%- endif %}

#### A typical workflow looks like this:

1. Connect to the login nodes
2. Transfer your files to the {{hpcinfra}}
3. Optional: compile your code and test it
4. Create a job script and submit your job
5. Wait for job to be executed
6. Study the results generated by your jobs, either on the cluster or
after downloading them locally.

We will walk through an illustrative workload to get you started. In this example, our objective is to train a deep learning model for recognizing hand-written digits (MNIST dataset) using [TensorFlow](https://www.tensorflow.org/);
see the [example scripts](https://github.com/hpcugent/vsc_user_docs/tree/main/{{exampleloc}}).

### Getting Connected

There are two options to connect:

- Using a terminal to connect via SSH (for power users) (see [First Time connection to the {{ hpcinfra}}](connecting.md#first-time-connection-to-the-hpc-infrastructure))
- [Using the web portal](web_portal.md)

Considering your operating system is **{{OS}}**,

{%- if OS == linux %}
it is recommended to make use of the `ssh` command in a terminal to get the most flexibility.

Assuming you have already generated SSH keys in the previous step ([Getting Access](#getting-access)), and that they are in a default location, you should now be able to log in by running the following command:

<pre><code>ssh {{userid}}@{{loginnode}}</code></pre>

!!! Warning "Use your own VSC account id"

    Replace <b>{{userid}}</b> with your VSC account id (see <https://account.vscentrum.be>).

!!! Tip

    You can also still use the web portal (see [shell access on web portal](web_portal.md#shell-access)).

{%- else %}
{%- if OS == windows %} it is recommended to use the web portal.
{%- else %} it should be easy to make use of the `ssh` command in a terminal, but the web portal will work too. {%- endif %}

The [web portal](web_portal.md) offers a convenient way to upload files and gain shell access to the {{hpcinfra}} from a standard web browser (no software installation or configuration required).

See [shell access](web_portal.md#shell-access) when using the web portal, or
[connection to the {{hpcinfra}}](connecting.md#first-time-connection-to-the-hpc-infrastructure) when using a terminal.

Make sure you can get shell access to the {{hpcinfra}} before proceeding with the next steps.

{%- endif %}

!!! Info

    If you have problems, see the [connection issues section on the troubleshooting page](troubleshooting.md#sec:connecting-issues).


### Transfer your files

Now that you can log in, it is time to transfer files from your local computer to your **home directory** on the {{hpcinfra}}.

Download [tensorflow_mnist.py](https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/tensorflow_mnist.py)
and [run.sh](https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/run.sh) example scripts to your computer (from [here](https://github.com/hpcugent/vsc_user_docs/tree/main/{{exampleloc}})).

{%- if OS == windows %}

The [HPC-UGent web portal](https://login.hpc.ugent.be) provides a file browser that allows uploading files.
For more information see the [file browser section](web_portal.md#file-browser).

Upload both files (`run.sh` and `tensorflow_mnist.py`) to your **home directory** and go back to your shell.

!!! Info

    As an alternative, you can use WinSCP (see [our section](connecting.md#winscp)).

{%- else %}

On your local machine you can run:
<pre><code>curl -OL https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/tensorflow_mnist.py
curl -OL https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/run.sh
</code></pre>

Using the `scp` command, the files can be copied from your local host to your *home directory* (`~`) on the remote host (HPC).
<pre><code>scp tensorflow_mnist.py run.sh {{userid}}@{{ loginnode }}:~ </code></pre>
<pre><code>ssh {{userid}}@{{ loginnode }} </code></pre>

!!! Warning "Use your own VSC account id"

    Replace <b>{{userid}}</b> with your VSC account id (see <https://account.vscentrum.be>).

!!! Info

    For more information about transferring files or `scp`, see [transfer files to/from the HPC](connecting.md#transfer-files-tofrom-the-hpc).

{%- endif %}

When running `ls` in your session on the {{hpcinfra}}, you should see the two files listed in your home directory (`~`):

```shell
$ ls ~
run.sh tensorflow_mnist.py
```

If you do not see these files, make sure you uploaded them to your **home directory**.

### Submitting a job

Jobs are submitted and executed using job scripts. In our case **run.sh** can be used as a (very minimal) job script.

A job script is a shell script, a text file that specifies the resources,
the software that is used (via `module load` statements),
and the steps that should be executed to run the calculation.

Our job script looks like this:

<center>-- run.sh --</center>

```bash
#!/bin/bash

module load TensorFlow/2.11.0-foss-2022a

python tensorflow_mnist.py

```
<sub>As you can see this job script will run the Python script named **tensorflow_mnist.py**.</sub>


By default, the jobs you submit are executed on **cluster/{{defaultcluster}}**. You can swap to another cluster by issuing the following command:

```shell
module swap cluster/{{othercluster}}
```

!!! Tip

    When submitting jobs with a limited amount of resources, it is recommended to use the [debug/interactive cluster](interactive_debug.md#interactive-and-debug-cluster): `donphan`.

{%- if site == 'Gent' %}

To get a list of all clusters and their hardware, see <https://www.ugent.be/hpc/en/infrastructure>.

{%- endif %}

This job script can now be submitted to the cluster's job system for execution, using the qsub (**q**ueue **sub**mit) command:

```shell
$ qsub run.sh
{{jobid}}
```

This command returns a job identifier (*{{jobid}}*) on the HPC cluster. This is a unique identifier for the job which can be used to monitor and manage your job.
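If you drive submissions from a script rather than typing `qsub` by hand, the identifier it prints can be captured for later use. The following is only a minimal sketch: the helper name `submit_job` and its `runner` parameter are hypothetical, and it assumes that `qsub` prints nothing but the job identifier on standard output, as in the example above.

```python
import subprocess


def submit_job(script, runner=subprocess.run):
    """Submit a job script with qsub and return the job identifier it prints.

    `runner` is a hypothetical hook (it defaults to subprocess.run) so the
    function can be tried out without a real scheduler.
    """
    result = runner(["qsub", script], capture_output=True, text=True, check=True)
    # Assumption: qsub prints only the job identifier, followed by a newline.
    return result.stdout.strip()
```

On a login node, calling `submit_job("run.sh")` would then return the identifier as a string, which you can pass on to other tools.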

!!! Warning "Make sure you understand what the `module` command does"

    Note that the `module` commands only modify environment variables. For instance, running `module swap cluster/{{othercluster}}` will update your shell environment so that `qsub` submits a job to the `{{othercluster}}` cluster,
    but your active shell session is still running on the login node.

    It is important to understand that while `module` commands affect your session environment, they do ***not*** change where the commands you are running are being executed: they will still be run on the login node you are on.

    When you submit a job script however, the commands ***in*** the job script will be run on a worker node of the cluster the job was submitted to (like `{{othercluster}}`).

    For detailed information about `module` commands, read the [running batch jobs](running_batch_jobs.md) chapter.

### Wait for job to be executed

Your job is put into a queue before being executed, so it may take a while before it actually starts
(see [when will my job start?](running_batch_jobs.md#when-will-my-job-start) for the scheduling policy).

You can get an overview of the active jobs using the `qstat` command:
<pre><code>$ qstat
Job ID Name User Time Use S Queue
---------- ---------------- --------------- -------- - -------
{{jobid}} run.sh {{userid}} 0:00:00 <b style="color:orange">Q</b> {{othercluster}}
</code></pre>

Eventually, after entering `qstat` again you should see that your job has started running:
<pre><code>$ qstat
Job ID Name User Time Use S Queue
---------- ---------------- --------------- -------- - -------
{{jobid}} run.sh {{userid}} 0:00:01 <b style="color:green">R</b> {{othercluster}}
</code></pre>

If you don't see your job in the output of the `qstat` command anymore, your job has likely completed.

Read [this section](running_batch_jobs.md#monitoring-and-managing-your-jobs) on how to interpret the output.
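To check the state of a job from a script, the text printed by `qstat` can be parsed. The sketch below assumes the plain six-column layout shown above (Job ID, Name, User, Time Use, S, Queue) and a job name without spaces; the helper name `job_state` and the sample values are hypothetical, and the output on your cluster may differ in detail.

```python
def job_state(qstat_output, job_id):
    """Return the state letter (e.g. 'Q' or 'R') for job_id, or None if absent.

    Assumption: whitespace-separated columns in the order
    Job ID, Name, User, Time Use, S, Queue.
    """
    for line in qstat_output.splitlines():
        fields = line.split()
        # Data rows have six fields; the header row splits into eight words
        # ("Job ID" and "Time Use" are two words each) and is skipped.
        if len(fields) == 6 and fields[0] == job_id:
            return fields[4]
    return None


# Hypothetical sample output (made-up job ID, user and cluster name).
sample = """\
Job ID     Name             User            Time Use S Queue
---------- ---------------- --------------- -------- - -------
123456     run.sh           vsc40000        0:00:01  R somecluster
"""

print(job_state(sample, "123456"))  # R
print(job_state(sample, "999999"))  # None
```

A result of `None` corresponds to the job no longer being listed, which usually means it has completed.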

### Inspect your results

When your job finishes it generates 2 output files:

- One for normal output messages (*stdout* output channel).
- One for warning and error messages (*stderr* output channel).

By default, both files are located in the directory where you issued `qsub`.

{%- if site == 'Gent' %}

!!! Info

    For more information about the stdout and stderr output channels, see this [section](./only/gent/linux-tutorial/beyond_the_basics.md#inputoutput).

{%- endif %}

In our example when running <code>ls</code> in the current directory you should see 2 new files:

- **run.sh.o{{jobid}}**, containing *normal output messages* produced by job {{jobid}};
- **run.sh.e{{jobid}}**, containing *errors and warnings* produced by job {{jobid}}.

!!! Info

    `run.sh.e{{jobid}}` should be empty (no errors or warnings).
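The naming pattern of the two output files (job name, then `.o` or `.e`, then the job ID) makes them easy to locate from a script. The following sketch builds both paths and checks whether the error file contains anything; the helper names `output_files` and `has_errors` and the job ID `123456` are hypothetical.

```python
from pathlib import Path


def output_files(job_name, job_id, directory="."):
    """Return the (stdout, stderr) paths following <name>.o<id> and <name>.e<id>."""
    base = Path(directory)
    return base / f"{job_name}.o{job_id}", base / f"{job_name}.e{job_id}"


def has_errors(stderr_path):
    """Return True if the stderr file exists and is not empty."""
    path = Path(stderr_path)
    return path.exists() and path.stat().st_size > 0


out_path, err_path = output_files("run.sh", "123456")  # hypothetical job ID
print(out_path.name, err_path.name)  # run.sh.o123456 run.sh.e123456
```

Calling `has_errors(err_path)` after the job has finished tells you whether anything was written to the error file.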

!!! Warning "Use your own job ID"

    Replace <b>{{jobid}}</b> with the job ID returned by the `qsub` command (also shown by `qstat`, see above), or simply look for added files in your current directory by running `ls`.

When examining the contents of ``run.sh.o{{jobid}}`` you will see something like this:
```
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step
Epoch 1/5
1875/1875 [==============================] - 2s 823us/step - loss: 0.2960 - accuracy: 0.9133
Epoch 2/5
1875/1875 [==============================] - 1s 771us/step - loss: 0.1427 - accuracy: 0.9571
Epoch 3/5
1875/1875 [==============================] - 1s 767us/step - loss: 0.1070 - accuracy: 0.9675
Epoch 4/5
1875/1875 [==============================] - 1s 764us/step - loss: 0.0881 - accuracy: 0.9727
Epoch 5/5
1875/1875 [==============================] - 1s 764us/step - loss: 0.0741 - accuracy: 0.9768
313/313 - 0s - loss: 0.0782 - accuracy: 0.9764
```
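The step counts in this output (`1875` per epoch and `313` for the evaluation) follow from the dataset size and the batch size. Here is a small worked check, assuming the standard MNIST split of 60,000 training and 10,000 test images and the Keras default batch size of 32 (the example script does not pass `batch_size`):

```python
import math

train_images = 60_000  # size of the MNIST training set
test_images = 10_000   # size of the MNIST test set
batch_size = 32        # Keras default when batch_size is not passed

# The last batch may be only partially filled, hence the rounding up.
print(math.ceil(train_images / batch_size))  # 1875
print(math.ceil(test_images / batch_size))   # 313
```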

Hurray 🎉, we trained a deep learning model and achieved 97.64 percent accuracy.
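If you want to pick up the final accuracy from the output file automatically, a regular expression is enough. This is a sketch that assumes the log format shown above; the helper name `final_accuracy` is hypothetical, and the sample text is copied from the last two lines of that output.

```python
import re


def final_accuracy(log_text):
    """Return the last 'accuracy: <value>' found in the log as a float, or None."""
    matches = re.findall(r"accuracy: ([0-9.]+)", log_text)
    return float(matches[-1]) if matches else None


sample_log = """\
1875/1875 [==============================] - 1s 764us/step - loss: 0.0741 - accuracy: 0.9768
313/313 - 0s - loss: 0.0782 - accuracy: 0.9764
"""

print(f"{final_accuracy(sample_log):.2%}")  # 97.64%
```

In practice you would read the text from the job's `.o` output file instead of using `sample_log`.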

!!! Warning

    When using TensorFlow specifically, you should actually submit jobs to a GPU cluster for better performance; see [GPU clusters](gpu.md).

    For the purpose of this example, we are running a very small TensorFlow workload on a CPU-only cluster.

### Next steps

- [Running interactive jobs](running_interactive_jobs.md)
- [Running jobs with input/output data](running_jobs_with_input_output_data.md)
- [Multi core jobs/Parallel Computing](multi_core_jobs.md)
- [Interactive and debug cluster](interactive_debug.md#interactive-and-debug-cluster)

For more examples, see [Program examples](program_examples.md) and [Job script examples](jobscript_examples.md).
