
Tutorial: Upload a dataset to VALERIA


The University provides access to cutting-edge data storage and transfer solutions through the VALERIA service. Most notably, professors have access to 6 TB of S3 storage through VALERIA. This storage is ideal for managing datasets. In this tutorial, we cover how to get access to this storage, upload data, and eventually share datasets with the community.

This guide was written by Dominic Baril, 2022.

Updated by Maxime Vaidis, September 2022.

Getting access to the storage

  • You will first need to create a VALERIA account and ask your supervisor to give you access to the platform.

  • Once your account is created, you need to ask for permission to access the storage. For example, François has a /norlab bucket.

  • You then need to configure your access to the storage. If you are on Ubuntu, the easiest client to configure is rclone.

  • Install rclone using sudo apt install rclone (to ensure the version is stable).

  • Follow the steps shown on this page starting at step 2.1 to configure it on your local computer. You will need to enter your access keys, which can be found on your VALERIA dashboard. Once rclone is configured with access to VALERIA's S3 storage, you should see the following output when running the rclone config command:

Current remotes:

Name                 Type
====                 ====
VALERIAS3            s3
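
To verify that the remote works, you can list the buckets you have access to (a quick sanity check; this assumes the remote is named VALERIAS3 as shown above):

rclone lsd VALERIAS3: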

Uploading / Downloading datasets

You can then use rclone commands to interact with the S3 storage. Refer to the rclone documentation for details on all possible commands. Here are some simple useful ones:
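The commands below are a sketch: the VALERIAS3 remote name comes from the configuration above, norlab is the example bucket, and the local paths are placeholders to adapt to your own dataset.

# List the files stored in the bucket
rclone ls VALERIAS3:norlab

# Upload a local dataset folder to the bucket (copies only new or changed files)
rclone copy ~/datasets/my_dataset VALERIAS3:norlab/my_dataset --progress

# Download a dataset from the bucket to a local folder
rclone copy VALERIAS3:norlab/my_dataset ~/datasets/my_dataset --progress

# Mirror a local folder to the bucket, deleting remote files that no longer exist locally
rclone sync ~/datasets/my_dataset VALERIAS3:norlab/my_dataset --progress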

Creating a link for anyone to download your dataset

  • Once your files are uploaded to the S3 storage, you might want to create a link for quick remote access to share your dataset with the community. To do so, you will need to use the s3cmd client instead. You can access it through a terminal via the JupyterHub server on VALERIA.

  • Once on JupyterHub, you will first need to configure the s3cmd command-line tool. Open a terminal and create a .s3cfg file in your home directory:

touch .s3cfg
  • Then, using your favorite command-line text editor, paste the following lines in the .s3cfg file and adjust the values of the access_key and secret_key parameters:
[default]
# YOUR IDUL
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY

access_token =
add_encoding_exts =
add_headers =
bucket_location = US
ca_certs_file =
cache_file =
check_ssl_certificate = True
check_ssl_hostname = True
cloudfront_host = cloudfront.amazonaws.com
content_disposition =
content_type =
default_mime_type = binary/octet-stream
delay_updates = False
delete_after = False
delete_after_fetch = False
delete_removed = False
dry_run = False
enable_multipart = True
encoding = UTF-8
encrypt = False
expiry_date =
expiry_days =
expiry_prefix =
follow_symlinks = False
force = False
get_continue = False
gpg_command = /usr/bin/gpg
gpg_decrypt = %(gpg_command)s -d --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_encrypt = %(gpg_command)s -c --verbose --no-use-agent --batch --yes --passphrase-fd %(passphrase_fd)s -o %(output_file)s %(input_file)s
gpg_passphrase =
guess_mime_type = True
host_base = s3.valeria.science
host_bucket = %(bucket)s.s3.valeria.science
human_readable_sizes = False
invalidate_default_index_on_cf = False
invalidate_default_index_root_on_cf = True
invalidate_on_cf = False
kms_key =
limit = -1
limitrate = 0
list_md5 = False
log_target_prefix =
long_listing = False
max_delete = -1
mime_type =
multipart_chunk_size_mb = 15
multipart_max_chunks = 10000
preserve_attrs = True
progress_meter = True
proxy_host =
proxy_port = 0
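
Once the file is saved, you can run a quick sanity check by listing the contents of your bucket (the norlab bucket here is just the example from earlier):

s3cmd ls s3://norlab/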
  • Finally, generate the URL to the dataset:
s3cmd signurl s3://norlab/fr2021_dataset/winter_dataset.zip/winter.zip $(echo "`date +%s` + 3600 * 24 * 7 * 1000" | bc)

This command returns a URL to download the dataset, valid for 1000 weeks: the second argument is a Unix timestamp computed as the current time (date +%s) plus 3600 × 24 × 7 × 1000 seconds. It is up to you to define the expiry date of your URL, but beware that an expired URL will prevent other people in the scientific community from accessing your dataset.
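
If you prefer a shorter-lived link, s3cmd also accepts a relative expiry of the form +<seconds>. The command below is a sketch reusing the example path above and producing a URL valid for 7 days:

s3cmd signurl s3://norlab/fr2021_dataset/winter_dataset.zip/winter.zip +604800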
