First go to https://huggingface.co/bigscience/ and, via your username (upper right corner), create a "new Model" while choosing bigscience as the org.
Say you created https://huggingface.co/bigscience/misc-test-data/
Now on the JZ side:
module load git-lfs
git lfs install
git clone https://huggingface.co/bigscience/misc-test-data/
cd misc-test-data/
Now you can add files which are less than 10M, commit and push.
If a file is larger than 10M, make sure its extension is tracked by git LFS. E.g. if you're adding foo.tar.gz, make sure *gz is in .gitattributes like so:
*.gz filter=lfs diff=lfs merge=lfs -text
If it isn't, add it:
git lfs track "*.gz"
git commit -m "compressed files" .gitattributes
git push
Only now add your large file foo.tar.gz:
cp /some/place/foo.tar.gz .
git add foo.tar.gz
git commit -m "foo.tar.gz" foo.tar.gz
git push
Now you can tell the contributor on the other side where they can download the files you have just uploaded by sending them to the corresponding hub repo.
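Before adding a large file, it can help to confirm that git will actually route it through LFS. The following is a minimal sketch, not part of the repo's tooling: the function name is_lfs_tracked is hypothetical, and it relies only on git check-attr, which reads .gitattributes and works even if the file has not been copied in yet.

```shell
# Sketch: succeed only if .gitattributes assigns the LFS filter to the given path.
# Must be run from inside the cloned repo. The function name is an assumption.
is_lfs_tracked() {
    # For a matching pattern, git prints e.g. "foo.tar.gz: filter: lfs".
    case "$(git check-attr filter -- "$1")" in
        *": filter: lfs") return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, running `is_lfs_tracked foo.tar.gz || git lfs track "*.gz"` from the clone would add the pattern only when it is missing; the .gitattributes change still has to be committed and pushed as described above.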
Once a repo has been cloned and is used as a destination for checkpoints and log files, the following process will automatically push any new files into it.
- Auth.
Typically you can skip directly to stage 2 (running the sync script), as stage 1 (auth) should already work.
We use a shared auth file located at $six_ALL_CCFRWORK/auth/.hub_info.json for all processes syncing to the hub. We use a special account, the bigscience-bot user, so that the process doesn't depend on personal accounts.
If for some reason you need to override this shared file with different auth data for a specific project, simply run:
tools/hub-auth.py
Enter the login, password and email at the prompt. This will create tools/.hub_info.json locally, containing the username, email and auth token.
- Now for each tracking repo, run the script with the desired pattern, e.g.:
module load git-lfs
DATA_OUTPUT_PATH=$six_ALL_CCFRSCRATCH/checkpoints/tr1-13B
CHECKPOINT_PATH=$DATA_OUTPUT_PATH/checkpoints
TENSORBOARD_PATH=$DATA_OUTPUT_PATH/tensorboard
CODECARBON_PATH=$DATA_OUTPUT_PATH/codecarbon
BIG_SCIENCE_REPO_PATH=$six_ALL_CCFRWORK/code/bigscience
$BIG_SCIENCE_REPO_PATH/tools/hub-sync.py --repo-path $TENSORBOARD_PATH --patterns '*tfevents*'
$BIG_SCIENCE_REPO_PATH/tools/hub-sync.py --repo-path $CODECARBON_PATH --patterns '*csv'
$BIG_SCIENCE_REPO_PATH/tools/hub-sync.py --repo-path $CHECKPOINT_PATH --patterns '*pt'
Of course this needs to be automated, so we will create slurm jobs to perform all of these. They must be run on the prepost partition, since it has limited Internet access.
$ cat tr1-13B-hub-sync-tensorboard.slurm
#!/bin/bash
#SBATCH --job-name=tr1-13B-hub-sync-tensorboard # job name
#SBATCH --ntasks=1 # number of MP tasks
#SBATCH --nodes=1 # number of nodes
#SBATCH --cpus-per-task=1 # number of cores per task
#SBATCH --hint=nomultithread # we get physical cores not logical
#SBATCH --time=20:00:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out # output file name
#SBATCH --partition=prepost
echo "START TIME: $(date)"
module load git-lfs
DATA_OUTPUT_PATH=$six_ALL_CCFRSCRATCH/checkpoints/tr1-13B
TENSORBOARD_PATH=$DATA_OUTPUT_PATH/tensorboard
BIG_SCIENCE_REPO_PATH=$six_ALL_CCFRWORK/code/bigscience
$BIG_SCIENCE_REPO_PATH/tools/hub-sync.py --repo-path $TENSORBOARD_PATH --patterns '*tfevents*' -d
echo "END TIME: $(date)"
XXX: create a slurm script for codecarbon when it starts operating
XXX: create a slurm script for checkpoints once we figure out how to share those
XXX: concern: if this is run from cron.hourly, what if the first git push is still uploading when the next round is pushed?
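One possible way to address the overlap concern is to wrap each sync in a non-blocking lock, so that a new round is skipped while the previous push is still running. This is only a sketch under assumptions: it relies on flock(1) from util-linux being available, the function name run_exclusive and the exit code 75 are arbitrary choices, and whether file locks behave reliably on the cluster's shared filesystems has not been verified here.

```shell
# Sketch: run a command only if no earlier invocation still holds the lock.
# Usage: run_exclusive LOCK_FILE COMMAND [ARGS...]
run_exclusive() {
    lock_file=$1
    shift
    (
        # -n: fail immediately instead of waiting for the lock to be released.
        flock -n 9 || { echo "previous sync still running, skipping" >&2; exit 75; }
        # The lock on fd 9 stays held for as long as the command runs.
        "$@"
    ) 9>"$lock_file"
}
```

A cron entry could then call, for instance, `run_exclusive /some/place/tensorboard-sync.lock $BIG_SCIENCE_REPO_PATH/tools/hub-sync.py --repo-path $TENSORBOARD_PATH --patterns '*tfevents*'`, where the lock file path is a placeholder.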
Normally *txt files aren't LFS-tracked, so if your log file gets synced to the hub and it has grown over 10M, the next push will fail with:
* Pushing 1 files
remote: -------------------------------------------------------------------------
remote: Your push was rejected because it contains files larger than 10M.
remote: Please use https://git-lfs.github.com/ to store larger files.
remote: -------------------------------------------------------------------------
remote: Offending files:
remote: - logs/main_log.txt (ref: refs/heads/main)
To https://huggingface.co/bigscience/tr3n-1B3-pile-fancy-logs
! [remote rejected] main -> main (pre-receive hook declined)
error: failed to push some refs to 'https://bigscience-bot:<redacted-token>@huggingface.co/bigscience/tr3n-1B3-pile-fancy-logs'
So you need to do the following from the cloned repo dir in question:
- Unstage the commits that weren't pushed:
git reset --soft origin/HEAD
- Add *txt to LFS-tracking:
git lfs track "**.txt"
git commit -m "text" .gitattributes
This will automatically switch to LFS on the next commit.
- Commit/push normally:
git commit -m "update file" logs/main_log.txt
git push
To avoid this issue in the first place, it's best to run the following when you first set up the repo clone:
git lfs track "**.txt"
git commit -m "text" .gitattributes
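To catch a growing log file before a push is rejected, a check along these lines could list files that exceed the size limit but are not routed through LFS. It is a sketch, not part of the existing tools: the function name find_oversized_untracked is hypothetical, and the default threshold assumes that "10M" means 10 MiB (10485760 bytes), which may not be the exact limit the hub enforces.

```shell
# Sketch: print files in a cloned repo that are larger than a byte limit
# and that .gitattributes does NOT assign to the LFS filter.
# Usage: find_oversized_untracked REPO_DIR [LIMIT_BYTES]
find_oversized_untracked() {
    repo=$1
    limit_bytes=${2:-10485760}   # assumption: 10M == 10 MiB
    (
        cd "$repo" || exit 1
        # Skip git's own metadata; -size +Nc selects files larger than N bytes.
        find . -path ./.git -prune -o -type f -size +"${limit_bytes}c" -print |
        while IFS= read -r f; do
            rel=${f#./}
            case "$(git check-attr filter -- "$rel")" in
                *": filter: lfs") ;;           # already LFS-tracked, fine
                *) printf '%s\n' "$rel" ;;     # would be rejected on push
            esac
        done
    )
}
```

Running `find_oversized_untracked $DATA_OUTPUT_PATH/logs` (the path is illustrative) before a sync would print nothing when every large file is covered by an LFS pattern.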