Add extraction of Connectivity Search PathCount table #4
This PR adds an extraction of the Connectivity Search PathCount table. I attempted to use the SQL credentials under https://github.com/greenelab/connectivity-search-backend?tab=readme-ov-file#database, but my query hung indefinitely when attempting to access the `dj_hetmech_app_pathcount` table (this may have been user error; I'm unsure). As a result, my efforts here centered on the SQL statement backups. I focused on extracting only the `dj_hetmech_app_pathcount` table and avoided loading the data into a full PostgreSQL database, in order to keep things lightweight and avoid resource consumption or cost (mostly storage) that could exceed what the system has available.

To avoid a full extraction of the database, I filtered the SQL backup statements to create and then populate a single table within a DuckDB database (`pg_restore` offers a single-table ingest, but only if the archive file is non-text/SQL, so again I avoided using PostgreSQL directly). Then, to simplify data access further, I extract the table from the database as a Parquet file. I plan to share the results directly via email, but this code can also generate the results where needed.

CC @NegarJanani @cgreene
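
For readers who want a feel for the approach described above, here is a minimal sketch and not the code in this PR. It assumes the backup is a plain-text `pg_dump` file in which the table's rows sit in a `COPY ... FROM stdin;` block terminated by a `\.` line. The function names, file paths, and the sample column names are hypothetical, and the DuckDB step relies on `read_csv` type inference, which has not been checked against the real dump (quoting and backslash escapes in the data may need extra handling).

```python
import re
from typing import Iterable, List, TextIO, Tuple


def extract_copy_rows(
    dump_lines: Iterable[str], table: str, out: TextIO
) -> Tuple[List[str], int]:
    """Stream the rows of one table's COPY block from a plain-text dump.

    Writes the tab-separated rows to `out` and returns (column names, row count).
    Lines belonging to every other table are skipped without being stored.
    """
    header = re.compile(
        r"^COPY (?:\w+\.)?" + re.escape(table) + r" \((.*)\) FROM stdin;\s*$"
    )
    columns: List[str] = []
    in_block = False
    n_rows = 0
    for line in dump_lines:
        if not in_block:
            match = header.match(line)
            if match:
                # Column list looks like: id, path_count, ...
                columns = [c.strip().strip('"') for c in match.group(1).split(",")]
                in_block = True
            continue
        if line.rstrip("\r\n") == "\\.":
            break  # end-of-data marker for this COPY block
        out.write(line if line.endswith("\n") else line + "\n")
        n_rows += 1
    return columns, n_rows


def tsv_to_parquet(
    tsv_path: str, columns: List[str], db_path: str, parquet_path: str, table: str
) -> None:
    """Load the extracted rows into a DuckDB table, then export it as Parquet.

    Assumes simple file paths without single quotes. Requires the third-party
    `duckdb` package, imported lazily so the extraction step works without it.
    """
    import duckdb

    names = ", ".join("'" + c.replace("'", "''") + "'" for c in columns)
    con = duckdb.connect(db_path)
    try:
        # PostgreSQL COPY text format marks NULL as \N and separates fields by tabs.
        con.execute(
            f'CREATE OR REPLACE TABLE "{table}" AS SELECT * FROM read_csv('
            f"'{tsv_path}', delim='\t', header=false, nullstr='\\N', "
            f"names=[{names}])"
        )
        con.execute(f"COPY \"{table}\" TO '{parquet_path}' (FORMAT PARQUET)")
    finally:
        con.close()
```

A possible way to chain the two steps, with hypothetical file names: open the dump with `open("backup.sql")`, pass the file object to `extract_copy_rows` together with `"dj_hetmech_app_pathcount"` and an output file opened for writing, then hand the returned column list to `tsv_to_parquet`. Streaming line by line keeps memory use flat regardless of dump size, which matches the lightweight goal stated above.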