Merge pull request #542 from xerbalind/getting_started

Add getting started chapter
hpcugent · Aug 16, 2023 · b3d7d99
2 parents 55c8b55 + 5b419e9
commit b3d7d99
Showing 5 changed files with 302 additions and 0 deletions.
diff --git a/config/templates/hpc.template b/config/templates/hpc.template
@@ -7,6 +7,7 @@ docs_dir: docs/HPC
 nav:
   - Welcome: index.md
   - Introduction to HPC: introduction.md
+  - Getting Started: getting_started.md
   - Getting an HPC Account: account.md
   - Connecting to the HPC infrastructure: connecting.md
   - Running batch jobs: running_batch_jobs.md

diff --git a/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist/README.md b/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist/README.md
@@ -0,0 +1,7 @@
+TensorFlow example copied from https://github.com/EESSI/eessi-demo/tree/main/TensorFlow
+
+Loads MNIST datasets and trains a neural network to recognize hand-written digits.
+
+Runtime: ~1 min. on 8 cores (Intel Skylake)
+
+See https://www.tensorflow.org/tutorials/quickstart/beginner
diff --git a/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist/run.sh b/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist/run.sh
@@ -0,0 +1,5 @@
+#!/bin/bash
+
+module load TensorFlow/2.11.0-foss-2022a
+
+python tensorflow_mnist.py
diff --git a/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist/tensorflow_mnist.py b/mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist/tensorflow_mnist.py
@@ -0,0 +1,21 @@
+# based on https://www.tensorflow.org/tutorials/quickstart/beginner
+
+import tensorflow as tf
+
+mnist = tf.keras.datasets.mnist
+
+(x_train, y_train), (x_test, y_test) = mnist.load_data()
+x_train, x_test = x_train / 255.0, x_test / 255.0
+
+model = tf.keras.models.Sequential([
+    tf.keras.layers.Flatten(input_shape=(28, 28)),
+    tf.keras.layers.Dense(128, activation='relu'),
+    tf.keras.layers.Dropout(0.2),
+    tf.keras.layers.Dense(10, activation='softmax'),
+])
+
+model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
+
+model.fit(x_train, y_train, epochs=5)
+
+model.evaluate(x_test, y_test, verbose=2)
diff --git a/mkdocs/docs/HPC/getting_started.md b/mkdocs/docs/HPC/getting_started.md
@@ -0,0 +1,268 @@
+{% set exampleloc="mkdocs/docs/HPC/examples/Getting_Started/tensorflow_mnist" %}
+# Getting Started
+
+Welcome to the "Getting Started" guide. This chapter will lead you through the initial steps of logging into the {{hpcinfra}} and submitting your very first job. We'll also walk you through the process step by step using a practical example.
+
+In addition to this chapter, you might find the [recording of the *Introduction to HPC-UGent* training session](https://www.ugent.be/hpc/en/training/introhpcugent-recording) to be a useful resource.
+
+Before proceeding, read [the introduction to HPC](introduction.md) to gain an understanding of the {{ hpcinfra }} and related terminology.
+
+### Getting Access
+
+To get access to the {{hpcinfra}}, visit [Getting an HPC Account](account.md).
+
+If you have not used Linux before, 
+{%- if site == 'Gent' %}
+now would be a good time to follow our [Linux Tutorial](./only/gent/linux-tutorial/index.md).
+{%- else %}
+please learn some basics before continuing (see [Appendix C - Useful Linux Commands](useful_linux_commands.md)).
+{%- endif %}
+
+#### A typical workflow looks like this:
+
+1.  Connect to the login nodes 
+2.  Transfer your files to the {{hpcinfra}}
+3.  Optional: compile your code and test it 
+4.  Create a job script and submit your job
+5.  Wait for job to be executed
+6.  Study the results generated by your jobs, either on the cluster or
+    after downloading them locally.
+
+We will walk through an illustrative workload to get you started. In this example, our objective is to train a deep learning model for recognizing hand-written digits (MNIST dataset) using [TensorFlow](https://www.tensorflow.org/);
+see the [example scripts](https://github.com/hpcugent/vsc_user_docs/tree/main/{{exampleloc}}).
+
+### Getting Connected
+
+There are two options to connect:
+
+- Using a terminal to connect via SSH (for power users) (see [First Time connection to the {{ hpcinfra}}](connecting.md#first-time-connection-to-the-hpc-infrastructure))
+- [Using the web portal](web_portal.md)
+
+Considering your operating system is **{{OS}}**, 
+
+{%- if OS == linux %}
+it is recommended to make use of the `ssh` command in a terminal to get the most flexibility. 
+
+Assuming you have already generated SSH keys in the previous step ([Getting Access](#getting-access)), and that they are in a default location, you should now be able to log in by running the following command:
+
+<pre><code>ssh {{userid}}@{{loginnode}}</code></pre>
+
+!!! Warning "Use your own VSC account id"
+
+    Replace <b>{{userid}}</b> with your VSC account id (see <https://account.vscentrum.be>)
+
+!!! Tip
+
+    You can also still use the web portal (see [shell access on web portal](web_portal.md#shell-access))
+
+{%- else %}
+{%- if OS == windows %} it is recommended to use the web portal.
+{%- else %} it should be easy to make use of the `ssh` command in a terminal, but the web portal will work too. {%- endif %}
+
+The [web portal](web_portal.md) offers a convenient way to upload files and gain shell access to the {{hpcinfra}} from a standard web browser (no software installation or configuration required).
+
+See [shell access](web_portal.md#shell-access) when using the web portal, or
+[connection to the {{hpcinfra}}](connecting.md#first-time-connection-to-the-hpc-infrastructure) when using a terminal.
+
+Make sure you can get shell access to the {{hpcinfra}} before proceeding with the next steps.
+
+{%- endif %}
+
+!!! Info
+
+    If you have problems connecting, see the [connection issues section on the troubleshooting page](troubleshooting.md#sec:connecting-issues).
+
+
+### Transfer your files
+
+Now that you can log in, it is time to transfer files from your local computer to your **home directory** on the {{hpcinfra}}.
+
+Download [tensorflow_mnist.py](https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/tensorflow_mnist.py) 
+and [run.sh](https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/run.sh) example scripts to your computer (from [here](https://github.com/hpcugent/vsc_user_docs/tree/main/{{exampleloc}})).
+
+{%- if OS == windows %}
+
+The [HPC-UGent web portal](https://login.hpc.ugent.be) provides a file browser that allows uploading files.
+For more information see the [file browser section](web_portal.md#file-browser).
+
+Upload both files (`run.sh` and `tensorflow_mnist.py`) to your **home directory** and go back to your shell.
+
+!!! Info
+
+    As an alternative, you can use WinSCP (see [the WinSCP section](connecting.md#winscp)).
+
+{%- else %}
+
+On your local machine you can run:
+<pre><code>curl -OL https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/tensorflow_mnist.py
+curl -OL https://raw.githubusercontent.com/hpcugent/vsc_user_docs/main/{{exampleloc}}/run.sh
+</code></pre>
+
+Using the `scp` command, the files can be copied from your local host to your *home directory* (`~`) on the remote host (HPC).
+<pre><code>scp tensorflow_mnist.py run.sh {{userid}}@{{ loginnode }}:~ </code></pre>
+<pre><code>ssh  {{userid}}@{{ loginnode }} </code></pre>
+
+!!! Warning "Use your own VSC account id"
+
+    Replace <b>{{userid}}</b> with your VSC account id (see <https://account.vscentrum.be>)
+
+!!! Info
+
+    For more information about transferring files or `scp`, see [transfer files to/from the HPC](connecting.md#transfer-files-tofrom-the-hpc).
+
+{%- endif %}
+
+When running `ls` in your session on the {{hpcinfra}}, you should see the two files listed in your home directory (`~`):
+
+```shell
+$ ls ~
+run.sh tensorflow_mnist.py
+```
+
+If you do not see these files, make sure you uploaded them to your **home directory**.
+
+### Submitting a job
+
+Jobs are submitted and executed using job scripts. In our case **run.sh** can be used as a (very minimal) job script.
+
+A job script is a shell script, a text file that specifies the resources, 
+the software that is used (via `module load` statements), 
+and the steps that should be executed to run the calculation.
+
+Our job script looks like this:
+
+<center>-- run.sh --</center>
+
+```bash
+#!/bin/bash
+
+module load TensorFlow/2.11.0-foss-2022a
+
+python tensorflow_mnist.py
+
+```
+<sub>As you can see this job script will run the Python script named **tensorflow_mnist.py**.</sub>
+
+
+By default, the jobs you submit are executed on **cluster/{{defaultcluster}}**. You can swap to another cluster by issuing the following command:
+
+```shell
+module swap cluster/{{othercluster}}
+```
+
+!!! Tip
+
+    When submitting jobs that need only a limited amount of resources, it is recommended to use the [debug/interactive cluster](interactive_debug.md#interactive-and-debug-cluster): `donphan`.
+
+{%- if site == 'Gent' %}
+
+    To get a list of all clusters and their hardware, see <https://www.ugent.be/hpc/en/infrastructure>.
+
+{%- endif %}
+
+This job script can now be submitted to the cluster's job system for execution, using the qsub (**q**ueue **sub**mit) command:
+
+```shell
+$ qsub run.sh
+{{jobid}}
+```
+
+This command returns a job identifier (*{{jobid}}*) on the HPC cluster. This is a unique identifier for the job which can be used to monitor and manage your job.
+
+!!! Warning "Make sure you understand what the `module` command does"
+
+    Note that the module commands only modify environment variables. For instance, running `module swap cluster/{{othercluster}}` will update your shell environment so that `qsub` submits a job to the `{{othercluster}}` cluster, 
+    but our active shell session is still running on the login node.
+
+    It is important to understand that while `module` commands affect your session environment, they do ***not*** change where the commands you are running are being executed: they will still be run on the login node you are on.
+
+    When you submit a job script however, the commands ***in*** the job script will be run on a worker node of the cluster the job was submitted to (like `{{othercluster}}`).
+
+For detailed information about `module` commands, read the [running batch jobs](running_batch_jobs.md) chapter.
+
+### Wait for job to be executed
+
+Your job is put into a queue before being executed, so it may take a while before it actually starts
+(see [when will my job start?](running_batch_jobs.md#when-will-my-job-start) for the scheduling policy).
+
+You can get an overview of the active jobs using the `qstat` command:
+<pre><code>$ qstat
+Job ID     Name             User            Time Use S Queue
+---------- ---------------- --------------- -------- - -------
+{{jobid}}     run.sh           {{userid}}        0:00:00  <b style="color:orange">Q</b> {{othercluster}}
+</code></pre> 
+
+Eventually, after entering `qstat` again you should see that your job has started running:
+<pre><code>$ qstat
+Job ID     Name             User            Time Use S Queue
+---------- ---------------- --------------- -------- - -------
+{{jobid}}     run.sh           {{userid}}        0:00:01  <b style="color:green">R</b> {{othercluster}}
+</code></pre> 
+
+If you don't see your job in the output of the `qstat` command anymore, your job has likely completed.
+
+Read [this section](running_batch_jobs.md#monitoring-and-managing-your-jobs) on how to interpret the output.
+
+### Inspect your results
+
+When your job finishes it generates 2 output files:
+
+- One for normal output messages (*stdout* output channel).
+- One for warning and error messages (*stderr* output channel).
+
+By default, both files are located in the directory where you issued `qsub`.
+
+{%- if site == 'Gent' %}
+
+!!! Info
+
+    For more information about the stdout and stderr output channels, see this [section](./only/gent/linux-tutorial/beyond_the_basics.md#inputoutput).
+
+{%- endif %}
+
+In our example when running <code>ls</code> in the current directory you should see 2 new files:
+
+- **run.sh.o{{jobid}}**, containing *normal output messages* produced by job {{jobid}};
+- **run.sh.e{{jobid}}**, containing *errors and warnings* produced by job {{jobid}}.
+
+!!! Info
+
+    run.sh.e{{jobid}} should be empty (no errors or warnings).
+
+!!! Warning "Use your own job ID"
+
+    Replace <b>{{jobid}}</b> with the job ID reported by the `qsub` or `qstat` command (see above), or simply look for added files in your current directory by running `ls`.
+
+When examining the contents of ``run.sh.o{{jobid}}`` you will see something like this:
+```
+Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
+11493376/11490434 [==============================] - 1s 0us/step
+Epoch 1/5
+1875/1875 [==============================] - 2s 823us/step - loss: 0.2960 - accuracy: 0.9133
+Epoch 2/5
+1875/1875 [==============================] - 1s 771us/step - loss: 0.1427 - accuracy: 0.9571
+Epoch 3/5
+1875/1875 [==============================] - 1s 767us/step - loss: 0.1070 - accuracy: 0.9675
+Epoch 4/5
+1875/1875 [==============================] - 1s 764us/step - loss: 0.0881 - accuracy: 0.9727
+Epoch 5/5
+1875/1875 [==============================] - 1s 764us/step - loss: 0.0741 - accuracy: 0.9768
+313/313 - 0s - loss: 0.0782 - accuracy: 0.9764
+```
+
+Hurray 🎉, we trained a deep learning model and achieved 97.64 percent accuracy.
+
+!!! Warning
+
+    When using TensorFlow specifically, you should actually submit jobs to a GPU cluster for better performance, see [GPU clusters](gpu.md).
+
+    For the purpose of this example, we are running a very small TensorFlow workload on a CPU-only cluster.
+
+### Next steps
+
+- [Running interactive jobs](running_interactive_jobs.md)
+- [Running jobs with input/output data](running_jobs_with_input_output_data.md)
+- [Multi core jobs/Parallel Computing](multi_core_jobs.md)
+- [Interactive and debug cluster](interactive_debug.md#interactive-and-debug-cluster)
+
+For more examples see [Program examples](program_examples.md) and [Job script examples](jobscript_examples.md).
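
A note on the "Inspect your results" step of the chapter added above: the output file naming (`run.sh.o<jobid>` and `run.sh.e<jobid>`) and the accuracy line in the sample output lend themselves to a small post-processing helper. The following is only a minimal sketch and is not part of this pull request. The function names `job_output_files` and `final_accuracy` and the job ID `123456` are hypothetical, and the regular expression assumes the Keras progress format shown in the sample output (`accuracy: 0.9764`).

```python
import re
from typing import Optional, Tuple


def job_output_files(script_name: str, job_id: str) -> Tuple[str, str]:
    """Return the (stdout, stderr) file names for a job.

    Follows the naming described in the chapter:
    <script>.o<jobid> for stdout and <script>.e<jobid> for stderr.
    """
    return f"{script_name}.o{job_id}", f"{script_name}.e{job_id}"


# Assumption: accuracy is printed as "accuracy: <number>", as in the
# sample output shown in the chapter.
ACCURACY_PATTERN = re.compile(r"accuracy: ([0-9]*\.?[0-9]+)")


def final_accuracy(stdout_text: str) -> Optional[float]:
    """Return the last accuracy value found in the text, or None if absent."""
    matches = ACCURACY_PATTERN.findall(stdout_text)
    return float(matches[-1]) if matches else None


if __name__ == "__main__":
    # "123456" is a made-up job ID; use the one reported for your own job.
    stdout_name, stderr_name = job_output_files("run.sh", "123456")
    print(stdout_name, stderr_name)

    # The last two lines of the sample output from the chapter.
    sample = (
        "1875/1875 [==============================] - 1s 764us/step"
        " - loss: 0.0741 - accuracy: 0.9768\n"
        "313/313 - 0s - loss: 0.0782 - accuracy: 0.9764\n"
    )
    accuracy = final_accuracy(sample)
    if accuracy is not None:
        print(f"{accuracy:.2%}")
```

The last match is taken because the evaluation line (`313/313 ...`) comes after the per-epoch training lines, so it reflects the accuracy on the test set rather than on the training data.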