Merge pull request #10 from smaht-dac/rclone-support-fresh-20240423
Mostly rclone related work
dmichaels-harvard authored Jun 11, 2024
2 parents b094888 + 678b9a5 commit 1d6c889
Showing 120 changed files with 258,226 additions and 4,034 deletions.
56 changes: 56 additions & 0 deletions .github/workflows/main-integration-tests.yml
@@ -0,0 +1,56 @@
# Build for submitr

name: INTEGRATION TESTS

# Controls when the action will run.
on:
  # Triggers the workflow on push or pull request events but only for the master branch
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  # This workflow contains a single job called "build"
  build:
    name: TEST INTEGRATION WITH PYTHON ${{ matrix.python_version }}

    # The type of runner that the job will run on
    runs-on: ubuntu-22.04
    strategy:
      matrix:
        python_version: [3.11]

    # Steps represent a sequence of tasks that will be executed as part of the job
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v3
        with:
          python-version: ${{ matrix.python_version }}

      - name: BUILD
        run: |
          make build
      # The integration tests actually talk to AWS S3 and Google Cloud Storage (GCS);
      # both directly (via Python boto3 and google.cloud.storage) and via rclone.
      # The access credentials are defined by the environment variables described below.
      - name: INTEGRATION TESTS
        env:
          # These are set up in GitHub as "secrets". The AWS access key values are currently,
          # May 2024, for the special user test-integration-user in the smaht-wolf account;
          # the access key was created on 2024-05-15. The Google value is the JSON from the
          # service account file exported from the HMS Google account for the smaht-dac project;
          # the service account email is ga4-service-account@smaht-dac.iam.gserviceaccount.com;
          # its key ID is b488dd9cfde6b59b1aa347aabd9add86c7ff9057; it was created on 2024-04-28.
          AWS_DEFAULT_REGION: ${{ secrets.AWS_DEFAULT_REGION }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          GOOGLE_CLOUD_SERVICE_ACCOUNT_JSON: ${{ secrets.GOOGLE_CLOUD_SERVICE_ACCOUNT_JSON }}
        run: |
          make test-integration
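
The integration job only makes sense when all four credentials are present in the environment. A minimal sketch of a pre-flight check follows; the variable names come from the workflow's env block, while the helper `missing_credentials` is hypothetical and is not shown anywhere in this commit.

```python
import os

# Variable names taken from the env block of main-integration-tests.yml.
REQUIRED_ENV_VARS = (
    "AWS_DEFAULT_REGION",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "GOOGLE_CLOUD_SERVICE_ACCOUNT_JSON",
)


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Integration tests cannot run; missing:", ", ".join(missing))
    else:
        print("All integration-test credentials are present.")
```

A test suite could call such a check once and skip the integration tests when the list is non-empty, for example on a fork where the GitHub secrets are not available.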
6 changes: 3 additions & 3 deletions .github/workflows/main-publish.yml
@@ -1,6 +1,6 @@
 # PyPi publish for submitr

-name: publish
+name: PUBLISH

 # Controls when the action will run.
 on:
@@ -29,11 +29,11 @@ jobs:
           # For publication, we're not testing anything, so use latest allowable version of Python
           python-version: 3.9

-      - name: Build
+      - name: BUILD
         run: |
           make build
-      - name: Publish
+      - name: PUBLISH
         env:
           PYPI_USER: ${{ secrets.PYPI_USER }}
           PYPI_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
12 changes: 6 additions & 6 deletions .github/workflows/main.yml
@@ -17,13 +17,13 @@ on:
 jobs:
   # This workflow contains a single job called "build"
   build:
-    name: Test submitr with Python ${{ matrix.python_version }}
+    name: TEST WITH PYTHON ${{ matrix.python_version }}

     # The type of runner that the job will run on
     runs-on: ubuntu-22.04
     strategy:
       matrix:
-        python_version: [3.8, 3.9, 3.11]
+        python_version: ['3.9', '3.10', '3.11']

     # Steps represent a sequence of tasks that will be executed as part of the job
     steps:
@@ -33,20 +33,20 @@ jobs:
         with:
           python-version: ${{ matrix.python_version }}

-      - name: Install Deps
+      - name: INSTALL DEPENDENCIES
         run: |
           make configure
-      - name: Build
+      - name: BUILD
         run: |
           poetry install
           poetry run flake8 submitr
       - name: CI
         run: |
-          poetry run coverage run --source submitr -m pytest -vv
+          poetry run coverage run --source submitr -m pytest -vv -m "not integration"
-      - name: Coveralls
+      - name: COVERALLS
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
           # For testing, use the oldest allowable version of Python to be sure we're not using features added later
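
The commit does not say why the matrix versions in main.yml were changed from bare numbers to quoted strings. One likely reason (an assumption) is that an unquoted 3.10 in YAML is read as a number, which is the same value as 3.1. The sketch below uses Python's `float()` as a stand-in for that numeric handling; it does not run a YAML parser, and the function name is made up for illustration.

```python
def as_number_then_text(scalar: str) -> str:
    """Mimic an unquoted scalar being read as a float and written back out.

    float() stands in for a YAML parser's numeric handling (assumption).
    """
    return str(float(scalar))


for version in ("3.9", "3.10", "3.11"):
    print(version, "->", as_number_then_text(version))
```

Only the 3.10 entry changes, to 3.1, which is why quoting matters once Python 3.10 joins the matrix.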
4 changes: 4 additions & 0 deletions .gitignore
@@ -145,3 +145,7 @@ docs/build

 # Vim
 *.swp
+
+# Junk Python files.
+?.py
+.tmp/
38 changes: 27 additions & 11 deletions CHANGELOG.rst
@@ -6,19 +6,12 @@ smaht-submitr
Change Log
----------

0.8.2
=====

* 2024-05-08/dmichaels/PR-8
Pass validate_only flag to ingester on --validate-remote-skip to
skip server-side validation on submit; previously this flag merely
served to skip kicking off server-side validation from submitr.
ONLY allowed (on server-side) for admin users.


0.8.1
0.8.3
=====

* 2024-05-14/dmichaels/PR-10
* Added rclone support; most relevant code in submitr/rclone directory.
A lot of refactoring of file upload related code for this (see files_for_upload.py)
* Added metadata_template.py module with goal of checking the user's metadata
file with the latest HMS DBMI metadata template and giving a warning if the
version appears to be out of date. Also new convenience command to export and
@@ -33,8 +26,31 @@ Change Log
* Improved messaging for check-submission.
* Fix for usage of --keys (was not being used for server validation/submission).
* Minor fix for --validate-local-skip option (undefined structured_data variable).
* Fix for --validate-remote-skip option to pass validate_skip to ingester to
skip the validation on submission which happens by default before the loadxl.
* Added --files for use with --info to submit-metadata-bundle.
* For file uploads, after asking the same yes/no question and getting the same response many
times in a row, ask if all subsequent such questions should automatically get the same answer.
* Removed ref_lookup_strategy references for structured_data; refactored/internalized in dcicutils.
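
The changelog entry about repeated yes/no questions during file uploads describes a small piece of state tracking. A minimal sketch of that idea follows; the class name, the threshold of 3, and the prompt wording are all assumptions, since the changelog only says "many times in a row" and the real code is not shown here.

```python
class RepeatedAnswerTracker:
    """Offer to reuse an answer after it has been given several times in a row."""

    def __init__(self, threshold=3, ask=input):
        self.threshold = threshold  # assumed value; the changelog says only "many times"
        self.ask = ask              # injectable so the class can run without a terminal
        self.last_answer = None
        self.streak = 0
        self.sticky_answer = None

    def yes_or_no(self, question):
        # Once the user has opted in, stop asking and reuse the stored answer.
        if self.sticky_answer is not None:
            return self.sticky_answer
        answer = self.ask(f"{question} [yes/no] ").strip().lower() in ("y", "yes")
        self.streak = self.streak + 1 if answer == self.last_answer else 1
        self.last_answer = answer
        if self.streak >= self.threshold:
            word = "yes" if answer else "no"
            reuse = self.ask(f"Answer '{word}' to all remaining questions? [yes/no] ")
            if reuse.strip().lower() in ("y", "yes"):
                self.sticky_answer = answer
            else:
                self.streak = 0  # count again before offering a second time
        return answer


# Scripted answers instead of a terminal: three "yes" replies, then "yes" to the offer.
scripted = iter(["yes", "yes", "yes", "yes"])
tracker = RepeatedAnswerTracker(ask=lambda prompt: next(scripted))
print([tracker.yes_or_no(f"Upload file {n}?") for n in range(5)])
```

After the third identical answer the user is asked once whether to apply it to everything that follows; the last two questions in the scripted run are answered without prompting.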


0.8.2
=====

* 2024-05-08/dmichaels/PR-8
Pass validate_only flag to ingester on --validate-remote-skip to
skip server-side validation on submit; previously this flag merely
served to skip kicking off server-side validation from submitr.
ONLY allowed (on server-side) for admin users.


0.8.2
=====

* 2024-05-08/dmichaels/PR-8
Pass validate_only flag to ingester on --validate-remote-skip to
skip server-side validation on submit; previously this flag merely
served to skip kicking off server-side validation from submitr.

0.8.0
=====
6 changes: 5 additions & 1 deletion Makefile
@@ -15,7 +15,11 @@ build: # builds
 	poetry install

 test:
-	pytest -vv
+	pytest -m "not integration"
+
+test-integration:
+	# pytest -vv submitr/tests/test_rclone_support.py
+	pytest -m integration

 retest: # runs only failed tests from the last test run. (if no failures, it seems to run all?? -kmp 17-Dec-2020)
 	pytest -vv --last-failed
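
The two Makefile targets split the suite by pytest marker: `-m integration` selects tests carrying the marker and `-m "not integration"` selects the rest. The toy model below imitates that selection with the standard library only. It is not pytest's implementation, the test names are hypothetical, and in the real suite the marker would be applied with `@pytest.mark.integration`.

```python
def mark(*names):
    """Attach marker names to a test function (toy stand-in for pytest.mark.<name>)."""
    def decorate(func):
        func.marks = set(names)
        return func
    return decorate


@mark("integration")
def test_rclone_copy_to_s3():  # hypothetical test name
    pass


def test_parse_metadata():  # hypothetical test name; carries no marker
    pass


def select(tests, expression):
    """Return the names of tests matching one of the two Makefile expressions."""
    if expression == "integration":
        wanted = True
    elif expression == "not integration":
        wanted = False
    else:
        raise ValueError(f"unsupported expression: {expression!r}")
    return [t.__name__ for t in tests
            if ("integration" in getattr(t, "marks", set())) == wanted]


ALL_TESTS = [test_rclone_copy_to_s3, test_parse_metadata]
print("make test-integration ->", select(ALL_TESTS, "integration"))
print("make test             ->", select(ALL_TESTS, "not integration"))
```

The same "not integration" expression is what the CI step in main.yml passes to pytest, so the regular workflow never needs the cloud credentials.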
94 changes: 94 additions & 0 deletions demo/annual_2024/README
@@ -0,0 +1,94 @@
Notes on demo file (bcm_formatted_hapmapmix.xlsx) for Annual Meeting in St. Louis, June 2024

- Some commands:

- Submit metadata:
> submit-metadata-bundle --env smaht-local --submit --directory files bcm_formatted_hapmapmix.xlsx

- Validate metadata:
> submit-metadata-bundle --env smaht-local --validate --directory files bcm_formatted_hapmapmix.xlsx

- Submit metadata with rclone support to upload (transfer) to S3 from Google (GCS) if a file for upload is there:
> submit-metadata-bundle --env smaht-local --submit --directory files \
--rclone-google-source smaht-submitr-rclone-testing/demo \
--rclone-google-credentials ~/.config/google-cloud/smaht-dac-617e0480d8e2.json bcm_formatted_hapmapmix.xlsx
> Note these files are currently (2024-05-31) in GCS:
- gs://smaht-submitr-rclone-testing/demo/222TWJLT4-1-IDUDI0056v2_S2_L001_R2_001.fastq.gz
- gs://smaht-submitr-rclone-testing/demo/222TWJLT4-1-IDUDI0055v2_S1_L001_R2_001.fastq.gz

- Resume upload:
> resume-uploads --env smaht-local --directory files <submission-uuid-or-upload-file-uuid-or-accession>

- Get info (only - no submit or validate) related to metadata file:
> submit-metadata-bundle --env smaht-local --info --refs --files --directory files bcm_formatted_hapmapmix.xlsx

- Dump metadata (only - no submit or validate) as JSON:
> submit-metadata-bundle --env smaht-local --json-only bcm_formatted_hapmapmix.xlsx

- View known submission-centers/consortia:
> submit-metadata-bundle --env smaht-local --submission-centers --consortia

- List recent submissions (add --mine to see only yours):
> list-submissions --env smaht-local

- Get info on submission - with optional continue on to submission if the submission-uuid is for a validation:
> check-submission --env smaht-local <submission-uuid>

- Download latest HMS metadata template:
> get-metadata-template <file-name-with-dot-xlsx-suffix>

- View arbitrary portal object (for troubleshooting)
> view-portal-object --env smaht-local <uuid-or-object-path>

- Use rclone to copy smaht-local file to Google (for testing/troubleshooting):
> rcloner copy <your-file> gs://smaht-submitr-rclone-testing/demo -gcs ~/.config/google-cloud/smaht-dac-617e0480d8e2.json

- Use rclone to copy file from Google to local current directory (for testing/troubleshooting):
> rcloner copy gs://smaht-submitr-rclone-testing/demo/<your-file> . -gcs ~/.config/google-cloud/smaht-dac-617e0480d8e2.json

- Use rclone to get info about a file in Google (for testing/troubleshooting):
> rcloner info gs://smaht-submitr-rclone-testing/demo/<some-file> -gcs ~/.config/google-cloud/smaht-dac-617e0480d8e2.json
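
For scripting the demo, the submit and validate commands above can be assembled programmatically. The sketch below only builds and prints the argument list; it does not run anything. The flags are the ones shown in this README, while the function name and its defaults are assumptions made for illustration.

```python
import shlex


def submit_command(metadata_file, env="smaht-local", directory="files",
                   action="--submit", google_source=None, google_credentials=None):
    """Build the argv list for submit-metadata-bundle (hypothetical helper)."""
    if action not in ("--submit", "--validate"):
        raise ValueError(f"unsupported action: {action!r}")
    argv = ["submit-metadata-bundle", "--env", env, action, "--directory", directory]
    if google_source:
        argv += ["--rclone-google-source", google_source]
    if google_credentials:
        argv += ["--rclone-google-credentials", google_credentials]
    argv.append(metadata_file)
    return argv


print(shlex.join(submit_command("bcm_formatted_hapmapmix.xlsx")))
print(shlex.join(submit_command(
    "bcm_formatted_hapmapmix.xlsx",
    google_source="smaht-submitr-rclone-testing/demo",
    google_credentials="~/.config/google-cloud/smaht-dac-617e0480d8e2.json",
)))
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without shell quoting concerns.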

- File bcm_formatted_hapmapmix.xlsx from William Feng on 2024-05-21
https://docs.google.com/spreadsheets/d/1qCm0bY-vG4a9uiaOvmKHZ12MvhmMKKRfEpgAm-7Hsh4/edit#gid=1645623888
https://hms-dbmi.slack.com/archives/D05LSGRQYV7/p1716239277185859

- Made some minor corrections to this spreadsheet locally
- Removed blank row #3 in Sequencing sheet
- Change values of target_read_length in Sequencing tab from '25-30 kb' and '15-20 kb' to 27500 and 17500
- Changed all submission-center prefixes in submitted_id values to be DAC (previously mixture of BCM, MAYO, WASHU, USWC)
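
The target_read_length replacements above (27500 and 17500) are the midpoints of the original ranges expressed in bases. Whether the midpoint was the intended rule is an assumption; the arithmetic itself can be sketched as follows, with a hypothetical helper name.

```python
import re


def range_midpoint_in_bases(text):
    """Convert a range such as '25-30 kb' to its integer midpoint in bases."""
    match = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*kb\s*", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a 'low-high kb' range: {text!r}")
    low, high = (int(group) for group in match.groups())
    # (low + high) / 2 kilobases, written in bases and kept as an integer.
    return (low + high) * 1000 // 2


for value in ("25-30 kb", "15-20 kb"):
    print(value, "->", range_midpoint_in_bases(value))
```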

- Dependencies for this spreadsheet; in smaht-portal/src/encoded/tests/data/demo_inserts;
also in dependencies directory here; manually load/upsert these with create-dependencies.sh.
/Assay/bulk_rna_seq
/Assay/bulk_wgs_pcr_free
/FileFormat/bam
/FileFormat/bam_bai
/FileFormat/bam_pbi
/FileFormat/fastq_gz
/Sequencer/illumina_novaseq_6000
/Sequencer/illumina_novaseq_x
/Sequencer/ont_promethion_24
/Sequencer/pacbio_revio_hifi
/Sequencing/BCM_SEQUENCING_NOVASEQX-400X
/Sequencing/BCM_SEQUENCING_ONT-100X
/Sequencing/BCM_SEQUENCING_PACBIO-100X
/Software/BCM_SOFTWARE_BCL2FASTQ2
/Software/BCM_SOFTWARE_DORADO
/Software/BCM_SOFTWARE_LIMA
/Software/BCM_SOFTWARE_MINKNOW
/Software/BCM_SOFTWARE_REVIO-ICS
/Software/BCM_SOFTWARE_SAMTOOLS
/Software/BCM_SOFTWARE_SMRTLINK
/SubmissionCenter/bcm_gcc
/SubmissionCenter/mayo_tdd
/SubmissionCenter/washu_gcc

- Referenced files to upload in the files directory here (these have dummy/random content):
222TWJLT4-1-IDUDI0055V2_S1_L001_R1_001.FASTQ.GZ
222TWJLT4-1-IDUDI0056V2_S2_L001_R1_001.FASTQ.GZ
222TWJLT4-1-IDUDI0057_S3_L001_R1_001.FASTQ.GZ
222TWJLT4-1-IDUDI0055V2_S1_L001_R2_001.FASTQ.GZ
222TWJLT4-1-IDUDI0056V2_S2_L001_R2_001.FASTQ.GZ
> THIS ONE IS CURRENTLY (2024-05-29) in GCS:
> gs://smaht-submitr-rclone-testing/demo/222TWJLT4-1-IDUDI0056v2_S2_L001_R2_001.fastq.gz
Binary file added demo/annual_2024/bcm_formatted_hapmapmix.xlsx
Binary file not shown.
Binary file not shown.
1 change: 1 addition & 0 deletions demo/annual_2024/create-dependencies.sh
@@ -0,0 +1 @@
update-portal-object --env smaht-local --upsert dependencies
30 changes: 30 additions & 0 deletions demo/annual_2024/dependencies/assay.json
@@ -0,0 +1,30 @@
[
    {
        "code": "101",
        "title": "RNA-Seq",
        "status": "released",
        "accession": "SMAASDZA8VHK",
        "consortia": [
            "358aed10-9b9d-4e26-ab84-4bd162da182b"
        ],
        "identifier": "bulk_rna_seq",
        "submission_centers": [
            "9626d82e-8110-4213-ac75-0a50adf890ff"
        ],
        "uuid": "beb12f96-624b-4fb8-afd5-8c637f5c0b97"
    },
    {
        "code": "001",
        "title": "WGS, PCR-free",
        "status": "released",
        "accession": "SMAASOMSCCDC",
        "consortia": [
            "358aed10-9b9d-4e26-ab84-4bd162da182b"
        ],
        "identifier": "bulk_wgs_pcr_free",
        "submission_centers": [
            "9626d82e-8110-4213-ac75-0a50adf890ff"
        ],
        "uuid": "87fbe483-31c6-4ff7-8abd-043d185150af"
    }
]
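
The identifiers in assay.json line up with the /Assay/... entries in the README's dependency list. A small sketch of that mapping follows, using a trimmed inline copy of the two inserts. The helper name and the assumption that the path is always `/<ItemType>/<identifier>` are illustrative and are not taken from the portal code.

```python
import json
import uuid

# Trimmed copy of demo/annual_2024/dependencies/assay.json (only the fields used here).
ASSAY_INSERTS = json.loads("""
[
  {"identifier": "bulk_rna_seq", "uuid": "beb12f96-624b-4fb8-afd5-8c637f5c0b97"},
  {"identifier": "bulk_wgs_pcr_free", "uuid": "87fbe483-31c6-4ff7-8abd-043d185150af"}
]
""")


def portal_paths(item_type, inserts):
    """Return '/<item_type>/<identifier>' for each insert (assumed path convention)."""
    paths = []
    for item in inserts:
        uuid.UUID(item["uuid"])  # raises ValueError if the uuid is malformed
        paths.append(f"/{item_type}/{item['identifier']}")
    return paths


print(portal_paths("Assay", ASSAY_INSERTS))
```

Running a check like this over every file in the dependencies directory before the upsert would catch a malformed uuid or a missing identifier early.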