How to load HPDS data from CSV/DBMS, and the format issue #49
Jenkins is configured to use a local directory on the host. The job "Project Load HPDS Data From CSV" uses the folder /usr/local/docker-config/hpds_csv/ on the host, which is mounted into the container at /opt/local/hpds/.
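A minimal sketch of that host-to-container mapping (the variable names here are just for illustration; the actual invocation is whatever the Jenkins job runs):

```shell
# Sketch of the bind mount described above. The -v flag pairs
# host_path:container_path, so a file placed at
# /usr/local/docker-config/hpds_csv/allConcepts.csv on the host shows up
# as /opt/local/hpds/allConcepts.csv inside the container.
HOST_DIR=/usr/local/docker-config/hpds_csv
CONTAINER_DIR=/opt/local/hpds
echo "docker run -v ${HOST_DIR}:${CONTAINER_DIR} ..."
```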
We have examples of how to map and load your data, including how the NHANES data was loaded, in this repo: https://github.com/hms-dbmi/pic-sure-hpds-phenotype-load-example Please let us know if you need additional assistance.
Thank you!
Hi, thanks a lot!
Hi, we want to make sure we understand the question. When you say "multiple databases/projects", does that mean that you want them displayed with different root paths? Can you provide a more detailed example? Thanks!
Dear dmpillion: Yes, projects have different root paths; I'm not sure of the exact meaning you mentioned. For example, one project will have one set of SUBJECT_ID values as its primary key, and another project will have a different set of SUBJECT_ID values, so these two CSV files cannot be combined into a single allConcepts.csv file. There are also several other questions:
```
Feb 21, 2023 4:36:23 AM com.google.common.cache.LocalCache processPendingNotifications
```
Thanks a lot!
Thank you both. I'm the PI on one of the pilot AIM-AHEAD projects (I'm a physician scientist, not a data scientist), and Dr. Paul Avillach advised our group to try installing PIC-SURE linked to AWS SWB (Xiangjun has been working on this for several weeks). The concept mapping is interesting, but it is unclear how feasible it is. I have extracted clinical data on 20,000 patients with likely millions of unique longitudinal lab values and several million unique ICD/CPT/HCPCS codes. Would each one require its own concept mapping for PIC-SURE to function properly? We also have semi-structured and unstructured long clinical notes. I read from the example that the core of PIC-SURE is i2b2, which is a data aggregation/search platform that our institution already has. Personally, I'm trying to understand the benefit of using the PIC-SURE HPDS platform over a standard SQL platform... or just leaving the data as CSV files that we can easily import into any statistical software for merging and analyses. Is PIC-SURE more like i2b2 or SlicerDicer, or does it have any built-in NLP capability similar to the EPIC search engine?
Dear dmpillion: We get an error when importing the CSV into PIC-SURE:

```
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
```

The heap size was assigned the default:

```shell
docker run --name=hpds-etl \
  -v /usr/local/docker-config/hpds_temp:/opt/local/hpds \
  -v /usr/local/docker-config/hpds_csv/allConcepts.csv:/opt/local/hpds/allConcepts.csv \
  -e HEAPSIZE=4096 -e LOADER_NAME=CSVLoader \
  --name hpds_data_load_csv hms-dbmi/pic-sure-hpds-etl:LATEST
```

How can users set the memory size? Thanks
Assuming you are using this job (Load HPDS Data From CSV) to load the data.
Hi,

```shell
docker run --name=hpds-etl \
  -v /usr/local/docker-config/hpds_temp:/opt/local/hpds \
  -v /usr/local/docker-config/hpds_csv/allConcepts.csv:/opt/local/hpds/allConcepts.csv \
  -e HEAPSIZE=100000 -e LOADER_NAME=CSVLoader \
  --name hpds_data_load_csv hms-dbmi/pic-sure-hpds-etl:LATEST
```

```
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fd4aa000000, 6325010432, 0) failed; error='Not enough space' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 6325010432 bytes for committing reserved memory.
# An error report file with more information is saved as:
# //hs_err_pid7.log
Build step 'Execute shell' marked build as failure
Finished: FAILURE
```

Thanks
1. Can you confirm the available RAM on your machine?
2. Can you explain the use case for wanting to convert the UNIX timestamps back to date/time?
Hi, dmpillion:
Thanks again!
To further clarify Xiao's comment, our dataset has longitudinal date/time stamps. For example, we need to load every complete blood count result from 1/1/2011 to 1/1/2023. Based on the NHANES tutorial, all date/time stamps must first be converted to UNIX timestamps, since they would otherwise be treated as string characters. However, after they are converted to UNIX timestamps, we can't seem to convert them back into a date/time presentation in PIC-SURE. Thank you.
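Outside PIC-SURE, the conversion round-trips cleanly; a sketch with GNU date (illustration only, not a PIC-SURE feature, and the syntax is GNU-specific):

```shell
# Round-trip a date/time through a UNIX timestamp with GNU date.
TS=$(date -u -d '2011-01-01 00:00:00' +%s)
echo "$TS"                                   # 1293840000
date -u -d "@$TS" '+%Y-%m-%d %H:%M:%S'       # 2011-01-01 00:00:00
```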
The machine has 32 GB of RAM, but the job provisioned HEAPSIZE=100000 (in MB; 100000/1024 ≈ 97 GB).
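The arithmetic behind the failure, as a quick sanity check (the 75% headroom factor below is a rough rule of thumb I'm assuming, not an official recommendation):

```shell
# HEAPSIZE is in megabytes; the failing job asked for far more heap than
# the machine had.
RAM_MB=32768            # 32 GB machine
HEAPSIZE=100000         # value from the failing job
echo "requested heap: $((HEAPSIZE / 1024)) GB"       # 97 GB
echo "suggested HEAPSIZE: $((RAM_MB * 3 / 4)) MB"    # 24576 MB
```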
Thank you, I will try it when our system admin comes back.
@anilk2hms Hi, we upgraded our system to 64 GB of RAM and there is no error message now; please see the attached log file. Thanks
Hi,
I am a new user. I tried to follow the instructions in the "Project Load HPDS Data From CSV" part; however, the variable definitions are not clear to me:
"PATIENT_NUM","CONCEPT_PATH","NVAL_NUM","TVAL_CHAR","TIMESTAMP". Could you give me a real example file, especially for "CONCEPT_PATH"?
You mentioned that "This job requires datafile in csv format in location - /usr/local/docker-config/hpds_csv/allConcepts.csv"; what if I want to upload my own CSV files? After "Run Jenkins job - Start PIC-SURE" finishes, does that mean I will see new samples posted on the PIC-SURE website?
Thanks a lot!
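For readers following along, a hypothetical allConcepts.csv fragment consistent with the column list above (the concept paths, patients, and values are invented for illustration; see the NHANES load-example repo linked in this thread for real ones):

```csv
"PATIENT_NUM","CONCEPT_PATH","NVAL_NUM","TVAL_CHAR","TIMESTAMP"
"1","\demographics\SEX\","","male","0"
"1","\laboratory\Hemoglobin (g/dL)\","14.2","","1293840000"
"2","\demographics\SEX\","","female","0"
```

Numeric observations go in NVAL_NUM, categorical ones in TVAL_CHAR, and TIMESTAMP holds a UNIX timestamp (0 when no date applies).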