♻️ Store ACL in authz field with new rules #22

Open
wants to merge 15 commits into
base: master
Changes from 9 commits
1 change: 1 addition & 0 deletions .github/workflows/ci.yml
@@ -39,6 +39,7 @@ jobs:

- name: 🐳 Start Dataservice docker-compose
run: |
docker network create kf-data-stack
./bin/setup_dataservice.sh

- name: 🐍 Setup Python
40 changes: 18 additions & 22 deletions README.md
@@ -28,12 +28,10 @@ The `--match_aliquot` flag will match dbGaP `submitted_sample_id` to `external_a

## ACL Definitions

* study_kfid: (e.g. "SD_12345678")
* study_phs: (e.g. "phs001138")
* root_phs_acl: f"{study_phs}.c999" (This gives root access to the study)
* consent_acl: f"{study_phs}.c{code}" (Not c999 which is a reserved admin code)
* default_acl: [study_kfid, root_phs_acl]
* open_acl: ["*"]
* consent_acl: f"/programs/{study_phs}.c{consent_code}" (consent_code for the specimen)
* default_acl: set([{consent_acl} from visible biospecimens which contribute to the genomic file])
* open_acl: ["/open"]

## ACL Rules

@@ -58,26 +56,24 @@ The `--match_aliquot` flag will match dbGaP `submitted_sample_id` to `external_a
* If a biospecimen is hidden in the dataservice, its descendants (genomic
files, read groups, etc) should also be hidden.

* All non-hidden (aka visible) genomic files in the dataservice with their
`controlled_access` field set to **False** should get `{open_acl}`.

* All non-hidden (aka visible) genomic files in the dataservice with their
* All visible genomic files in the dataservice with their
controlled_access field set to **null** should **return or display a QC
failure alert**.

* All hidden genomic files in the dataservice with their controlled_access
field set to **null** should get `{empty_acl}`.
* All visible genomic files in the dataservice with their `controlled_access`
field set to **False** should get `{open_acl}`.

* All visible genomic files in the dataservice with their `controlled_access`
field set to **True** should get the `{default_acl}`.

* The `default_acl` is the unique set of the `consent_acl` from the visible
specimens in the study which contribute to the genomic_file.

* The `consent_acl` is composed of the study phs ID and the sample's
reported consent code, prepended with the dbGaP prefix "/programs"
(e.g. "/programs/phs001138.c1")

* All other genomic files in the dataservice should get `{default_acl}`.
* All other genomic files in the dataservice should get `{empty_acl}`
indicating no access.

* Each reported sample consent code should be added to each
`controlled_access=True` genomic file that has contribution from any
biospecimen(s) in the study with the reported sample external ID by adding
the `{consent_acl}` in addition to the default **IF AND ONLY IF** the genomic
file and its contributing biospecimen(s) are all visible in the dataservice,
**with the following exception:**

* Until indexd supports "and" composition rules, if a genomic file has
multiple contributing specimens with non-identical access control codes,
that genomic file should get `{default_acl}`. **Return or display an alert
for each such case.**
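
To illustrate the new `consent_acl` and `default_acl` definitions above, here is a minimal sketch. The function names `consent_acl` and `default_acl_for` and the dict layout of each specimen are hypothetical and not taken from this PR. It also assumes consent codes are bare values such as "1"; if the dataservice already stores the full "phs001138.c1" form, the f-string would need to change.

```python
from typing import Iterable, List


def consent_acl(study_phs: str, consent_code: str) -> str:
    """Build one authz entry, e.g. "/programs/phs001138.c1"."""
    return f"/programs/{study_phs}.c{consent_code}"


def default_acl_for(study_phs: str, specimens: Iterable[dict]) -> List[str]:
    """Unique, sorted consent ACLs from the visible contributing specimens.

    Assumption: each specimen is a dict with a boolean "visible" key and a
    "consent_code" key that may be None (both names are hypothetical).
    """
    acls = {
        consent_acl(study_phs, s["consent_code"])
        for s in specimens
        if s.get("visible") and s.get("consent_code") is not None
    }
    return sorted(acls)


specimens = [
    {"visible": True, "consent_code": "1"},
    {"visible": True, "consent_code": "2"},
    {"visible": True, "consent_code": "1"},   # duplicate code collapses
    {"visible": False, "consent_code": "3"},  # hidden specimen is ignored
]
print(default_acl_for("phs001138", specimens))
```

Skipping `None` codes here is a design choice of the sketch, made so that an unregistered sample cannot produce an entry like "/programs/phs001138.cNone".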
3 changes: 1 addition & 2 deletions bin/setup_dataservice.sh
@@ -7,12 +7,11 @@ set -e
if [ -d "./kf-api-dataservice" ];
then
cd kf-api-dataservice
git pull
git pull -f
cd ..
else
git clone --depth 1 https://github.com/kids-first/kf-api-dataservice.git
fi
cp kf-api-dataservice/.env.sample kf-api-dataservice/.env
cp docker-compose.yml kf-api-dataservice/
docker-compose -f kf-api-dataservice/docker-compose.yml up -d --build
./bin/health-check.sh
36 changes: 0 additions & 36 deletions docker-compose.yml

This file was deleted.

171 changes: 80 additions & 91 deletions kf_update_dbgap_consent/sample_status.py
@@ -1,13 +1,10 @@
"""
## ACL Definitions

* study_kfid: (e.g. "SD_12345678")
* study_phs: (e.g. "phs001138")
* root_phs_acl: f"{study_phs}.c999" (This gives root access to the study)
* consent_acl: f"{study_phs}.c{code}" (Not c999 which is a reserved admin code)
* default_acl: [study_kfid, root_phs_acl]
* open_acl: ["*"]
* empty_acl: []
* consent_acl: f"/programs/{study_phs}.c{consent_code}" (consent_code for the specimen)
* default_acl: unique([{consent_acl} from visible biospecimens which contribute to the genomic file])
* open_acl: ["/open"]

## ACL Rules

@@ -32,29 +29,29 @@
* If a biospecimen is hidden in the dataservice, its descendants (genomic
files, read groups, etc) should also be hidden.

* All non-hidden (aka visible) genomic files in the dataservice with their
`controlled_access` field set to **False** should get `{open_acl}`.

* All non-hidden (aka visible) genomic files in the dataservice with their
* All visible genomic files in the dataservice with their
controlled_access field set to **null** should **return or display a QC
failure alert**.

* All hidden genomic files in the dataservice with their controlled_access
field set to **null** should get `{empty_acl}`.
* All visible genomic files in the dataservice with their `controlled_access`
field set to **False** should get `{open_acl}`.

* All visible genomic files in the dataservice with their `controlled_access`
field set to **True** should get the `{default_acl}`. If the genomic file
previously had an ACL containing the study KF ID, this will be replaced with
the `{default_acl}` containing the PHS ID.

* The `default_acl` is the unique set of the `consent_acl` from the visible
specimens in the study which contribute to the genomic_file.

* All other genomic files in the dataservice should get `{default_acl}`.
* The `consent_acl` is composed of the study phs ID and the sample's
reported consent code, prepended with the dbGaP prefix "/programs"
(e.g. "/programs/phs001138.c1")

* All other genomic files in the dataservice should get `{empty_acl}`
indicating no access.

* Each reported sample consent code should be added to each
`controlled_access=True` genomic file that has contribution from any
biospecimen(s) in the study with the reported sample external ID by adding
the `{consent_acl}` in addition to the default **IF AND ONLY IF** the genomic
file and its contributing biospecimen(s) are all visible in the dataservice,
**with the following exception:**

* Until indexd supports "and" composition rules, if a genomic file has
multiple contributing specimens with non-identical access control codes,
that genomic file should get `{default_acl}`. **Return or display an alert
for each such case.**
"""
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed
@@ -97,9 +94,9 @@ def get_patches_for_study(
print("Looking up dbGaP accession ID")
study_phs, study_version = self.get_accession(study_id)
print(f"Found accession ID: {study_phs}")
default_acl = {study_id, f"{study_phs}.c999"}
open_acl = {"*"}
open_acl = {"/open"}
empty_acl = set()
default_acl = empty_acl
alerts = []
patches = defaultdict(lambda: defaultdict(dict))

@@ -210,6 +207,8 @@ def entities_dict(endpoint, filt):
"consent_type": None,
"dbgap_consent_code": None,
"visible": False,
"visibility_reason": "Consent Hold",
"visibility_comment": "Sample is not registered in dbGaP",
}
hidden_specimens[kfid] = bs

@@ -230,6 +229,12 @@ def entities_dict(endpoint, filt):
for k, e in entities.items():
storage[endpoint][k] = e
patches[endpoint][k]["visible"] = False
patches[endpoint][k]["visibility_reason"] = "Consent Hold"
patches[endpoint][k][
"visibility_comment"
] = "Sample is not registered in dbGaP"
if endpoint == "genomic-files":
hidden_genomic_files.add(k)

print()

@@ -238,87 +243,71 @@ def entities_dict(endpoint, filt):
all_biospecimens_visible = all(
[k not in hidden_specimens for k in bsids]
)
biospecimen_codes = set(
patches["biospecimens"][k].get("dbgap_consent_code")
for k in bsids
)
if (gfid not in hidden_genomic_files) and all_biospecimens_visible:
if storage["genomic-files"][gfid]["controlled_access"] is False:
gf_visible = gfid not in hidden_genomic_files
controlled_access = storage["genomic-files"][gfid][
"controlled_access"
]
# GenomicFile visible = True and
# all contributing Biospecimen visible = True
if gf_visible and all_biospecimens_visible:
if controlled_access is None:
"""
Rule: All non-hidden (aka visible) genomic files in the dataservice
with their `controlled_access` field set to **False** should get
`{open_acl}`.
"""
patches["genomic-files"][gfid].update(
{"acl": sorted(open_acl)}
)
elif (
storage["genomic-files"][gfid]["controlled_access"] is None
):
"""
Rule: All non-hidden (aka visible) genomic files in the dataservice
with their controlled_access field set to **null** should **return or
display a QC failure alert**.
Rule: All visible genomic files in the dataservice with
their controlled_access field set to **null** should
**return or display a QC failure alert**.
"""
alerts.append(
f"ALERT: GF {gfid} is visible but has controlled_access"
" set to null instead of True/False."
)
print(alerts[-1])
else:
"""
Rule: All other genomic files in the dataservice should get
{default_acl}.
elif controlled_access is False:
"""
all_biospecimens_same_code = len(biospecimen_codes) == 1
if all_biospecimens_same_code:
"""
Rule: Each reported sample consent code should be added to
each `controlled_access=True` genomic file that has
contribution from any biospecimen(s) in the study with the
reported sample external ID by adding the `{consent_acl}`
in addition to the default **IF AND ONLY IF** the genomic
file and its contributing biospecimen(s) are all visible in
the dataservice...
"""
patches["genomic-files"][gfid].update(
{"acl": sorted(default_acl | biospecimen_codes)}
)
else:
"""
...with the following exception:

Until indexd supports "and" composition rules, if a genomic
file has multiple contributing specimens with non-identical
access control codes, that genomic file should get
{default_acl}. Return or display an alert for each such
case.
"""
alerts.append(
f"ALERT: GF {gfid} has inconsistent sample access"
f" codes {biospecimen_codes}"
)
print(alerts[-1])
patches["genomic-files"][gfid].update(
{"acl": sorted(default_acl)}
)
else:
if storage["genomic-files"][gfid]["controlled_access"] is None:
"""
Rule: All hidden genomic files in the dataservice with their
controlled_access field set to **null** should get `{empty_acl}`.
Rule: All visible genomic files in the dataservice with
their `controlled_access` field set to **False** should get
`{open_acl}`.
"""
patches["genomic-files"][gfid].update(
{"acl": sorted(empty_acl)}
{"authz": sorted(open_acl)}
)
else:
elif controlled_access is True:
"""
Rule: All other genomic files in the dataservice should get
{default_acl}.
Rule: All visible genomic files in the dataservice with
their `controlled_access` field set to **True** should get
the `{default_acl}`.

* The `default_acl` is the unique set of the `consent_acl`
from the visible specimens in the study which contribute to
the genomic_file.

* The `consent_acl` is composed of the study phs ID and the
reported consent code of the sample, prepended with the
dbgap prefix "/programs" (e.g. "/programs/phs001138.c1")
"""
biospecimen_codes = set(
patches["biospecimens"][k].get("dbgap_consent_code")
for k in bsids
)
patches["genomic-files"][gfid].update(
{"acl": sorted(default_acl)}
{
"authz": sorted(
[
f"/programs/{code}"
for code in biospecimen_codes
]
)
}
)
# GenomicFile visible = False OR one of contributing Biospecimen
# visible=False
else:
"""
Rule: All other genomic files in the dataservice should get
`{empty_acl}` indicating no access.
"""
patches["genomic-files"][gfid].update(
{"authz": sorted(empty_acl)}
)

# remove known unneeded patches

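
The branching in this hunk can be summarized as a small pure function. This is a sketch for readers of the diff, not code from the PR: the name `authz_for_genomic_file`, its parameters, and the `(authz, alert)` return shape are assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple

OPEN_ACL = ["/open"]


def authz_for_genomic_file(
    gf_visible: bool,
    all_specimens_visible: bool,
    controlled_access: Optional[bool],
    consent_acls: Iterable[str],
) -> Tuple[Optional[List[str]], Optional[str]]:
    """Return (authz, alert); authz is None when no patch should be written."""
    # Hidden file, or any hidden contributing specimen: no access.
    if not (gf_visible and all_specimens_visible):
        return [], None
    # Visible but controlled_access is null: QC failure alert, no patch.
    if controlled_access is None:
        return None, "visible but controlled_access is null"
    # Visible and explicitly open.
    if controlled_access is False:
        return list(OPEN_ACL), None
    # Visible and controlled: unique set of consent ACLs.
    return sorted(set(consent_acls)), None


print(authz_for_genomic_file(True, True, True,
                             ["/programs/phs001138.c1", "/programs/phs001138.c1"]))
```

Using identity checks (`is None`, `is False`) keeps a falsy value such as `0` from being treated as an explicit `False`.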