Not all databanks are indexed (correctly) #40

Open
drlemmus opened this issue Nov 25, 2016 · 3 comments

@drlemmus

A number of databanks do not have any statistics (e.g. HSSP and PDB_REDO). The counts for other databanks (e.g. DSSP and STRUCTURFACTORS) are incorrect: missing entries are not listed.

@jonblack
Contributor

Coos is looking into this for #38.

@jonblack
Contributor

#38 is now closed. The stats page is now correct based on what's in the database. The issue is with the crawler and annotator.

jonblack self-assigned this Nov 30, 2016

@jonblack
Contributor

jonblack commented Dec 1, 2016

The crawling/annotation method is a bit backwards. We start the process without any expectations, so if we scan only 100 PDB files, we expect at most 100 entries in the other databanks that depend on the PDB.

In reality, we have a pretty good idea of the ideal scenario before we start. We can download a list of all valid and obsolete PDB IDs from pdb.org and use that as a base. When we crawl, we are then no longer indexing what we have but checking what's missing. The missing entries can then be passed through the annotator.

I'm going to update this process to use the IDs downloaded from pdb.org as the source.
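
A rough sketch of the "check what's missing" idea, not the actual crawler code. The function name `missing_entries` and the lowercase normalisation of IDs are assumptions made for illustration; how the list from pdb.org is downloaded and parsed is left out.

```python
from typing import Iterable, List


def normalise(pdb_id: str) -> str:
    """Strip whitespace and lowercase an ID so that '1ABC ' and '1abc' compare equal."""
    return pdb_id.strip().lower()


def missing_entries(expected_ids: Iterable[str], indexed_ids: Iterable[str]) -> List[str]:
    """Return the expected IDs that the crawler did not find, sorted.

    expected_ids: the base list (valid plus obsolete PDB IDs), however it was obtained.
    indexed_ids:  the IDs actually found for one databank during a crawl.
    """
    expected = {normalise(i) for i in expected_ids}
    indexed = {normalise(i) for i in indexed_ids}
    # Set difference: everything we expected but did not index.
    return sorted(expected - indexed)


if __name__ == "__main__":
    # Hypothetical IDs, for illustration only.
    expected = ["1abc", "2xyz", "3def"]
    indexed = ["1ABC", "3def "]
    # Candidates to hand to the annotator.
    print(missing_entries(expected, indexed))
```

With the base list as the source, the result no longer depends on how many PDB files happened to be scanned: anything in the base list that was not indexed shows up as missing.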
