Update README

vistalab-technion · Apr 8, 2024 · ad1bedd
1 parent bbf2766
commit ad1bedd
Showing 1 changed file with 51 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -4,19 +4,24 @@ This repo contains an implementation of a toolkit for analysis of protein backbo
 structure, specifically for: (i) estimating the distribution of dihedral angles and
 quantifying the differences between such distributions; (ii) finding matched pairs
 of proteins with regions of identical sequence and contacts but different backbone
-structure.
+structure; (iii) collecting protein datasets from the PDB that contain codon and
+altloc information.
 
 It contains the code required to collect the data and reproduce the results of
 these papers:
 
+    Aviv A. Rosenberg, Ailie Marx, Alex Bronstein.
+    "A catalogue of alternately located segments in protein crystal structures."
+    Unpublished (2024).
+
     Aviv A. Rosenberg, Alex M. Bronstein, Ailie Marx.
     "Does one sequence always translate to one structure?"
     Unpublished (2023).
 
     Aviv A. Rosenberg, Nitsan Yehishalom, Ailie Marx, Alex Bronstein.
     "An amino domino model described by a cross peptide bond Ramachandran plot
     defines amino acid pairs as local structural units"
-    Unpublished (2023).
+    PNAS (2023).
 
     Aviv A. Rosenberg, Ailie Marx and Alex M. Bronstein.
     "Codon-specific Ramachandran plots show amino acid backbone conformation depends on
@@ -87,6 +92,50 @@ pp5 pgroup --ref-pdb-id 2WUR:A --match-len 2 --context-len 1 --compare-contacts
 ```
 This will generate output CSV files in the `out/prgroup` directory.
 
+## Re-collecting the altloc dataset
+
+To re-collect the dataset described in our paper "A catalogue of alternately located
+segments in protein crystal structures", use the following bash script.
+Note that due to updates on the PDB servers over time, re-collecting the data will not
+produce exactly the same dataset as was analyzed in the paper.
+
+```shell
+#!/bin/bash
+set -eux
+
+# Clear prec CSV output dir and global pp5 cache
+rm -rf out/prec
+rm -rf /tmp/pp5_data
+
+PROCESSES=84
+ASYNC_TIMEOUT="3600"
+ASYNC_RETRY_DELTA="5"
+RESOLUTION="3.5"
+RFREE="0.33"
+SIMILARITY="1.0"
+MAX_CHAINS="20"
+TAG="r${RESOLUTION}-${PDB_SOURCE:-rc}"  # PDB_SOURCE is never set above; default to "rc" so 'set -u' does not abort
+pp5 \
+  -p="$PROCESSES" collect-prec \
+  --async-timeout="$ASYNC_TIMEOUT" \
+  --async-retry-delta="$ASYNC_RETRY_DELTA" \
+  --expr-sys="" \
+  --source-taxid="" \
+  --resolution="$RESOLUTION" \
+  --r-free="$RFREE" \
+  --query-max-chains="$MAX_CHAINS" \
+  --seq-similarity-thresh="$SIMILARITY" \
+  --pdb-source="rc" \
+  --out-tag="altlocs-$TAG" \
+  --with-altlocs \
+  --with-backbone \
+  --with-contacts \
+  --write-zip
+```
+
+The data will be collected into a subfolder of `out/` whose name contains the value
+of `--out-tag`. The `out/` folder is created in the current working directory.
+
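As a rough illustration of where the collected data ends up, the sketch below rebuilds the tag from the same values used in the script and lists any matching subfolders of `out/`. It is not part of `pp5`: the helper name `find_outputs` is hypothetical, `PDB_SOURCE` is set to `rc` here only to mirror the `--pdb-source` value, and the glob match assumes nothing about the folder name beyond the statement that it contains the out-tag.

```shell
#!/bin/bash
set -eu

# Same values as in the collection script; PDB_SOURCE mirrors --pdb-source="rc".
RESOLUTION="3.5"
PDB_SOURCE="rc"
OUT_TAG="altlocs-r${RESOLUTION}-${PDB_SOURCE}"

# Hypothetical helper: list direct subfolders of an output dir whose name
# contains the given tag. Assumes only that the tag appears somewhere in
# the folder name, so it matches with a glob instead of an exact name.
find_outputs() {
  local out_dir="$1" tag="$2"
  # Nothing to list if the output dir does not exist yet.
  [ -d "$out_dir" ] || return 0
  find "$out_dir" -mindepth 1 -maxdepth 1 -type d -name "*${tag}*" | sort
}

echo "Looking for output folders tagged '${OUT_TAG}'"
find_outputs "out" "$OUT_TAG"
```

Run it from the same directory in which the collection script was launched, since `out/` is created relative to the working directory.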
 ## Reproducing "Does one sequence always translate to one structure?"
 
 The data collection and structure pair matching can be performed by running `pp5