Extracting a subset of data from raw nanopore signal data #63

Open
hasindu2008 opened this issue Jul 19, 2022 · 5 comments

Comments

@hasindu2008

I was looking for an ONT raw signal dataset at very high coverage (a few 100X), and the nanopore dataset in this repository seems ideal. I only need the raw data for a few genomic regions. Is there a way to selectively download a set of read IDs from the raw dataset, without having to download and extract all the terabytes of tar.gz (which I estimate would take weeks to months)?

@skoren
Member

skoren commented Aug 1, 2022

Unfortunately, we don't have the data organized by chromosome, so your only option would be to download and extract the full set. If you have the IDs of the reads you're interested in and post them here, I can try to look up which partitions they are in, and you can download just those.

@hasindu2008
Author

As the reads seemed to be distributed throughout the partitions (and I would have had to iteratively try different subsets), I ended up downloading the whole thing, and after about 2 weeks it has fully downloaded! I am now extracting everything, and hopefully the file system can handle such a large number of files. I will let you know how it goes. This is an exciting dataset.

@gringer

gringer commented Aug 21, 2022

It'd be really useful to have fast5 files sorted by chromosome/position. That'd be a lot of effort to set up, though.

@hasindu2008
Author

@gringer Yes, while the data is in FAST5, every manipulation task is hard.
I recently converted all the partitions to BLOW5 successfully, and any type of sorting is now a few bash commands. I can provide such a sorted version if you are interested.
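
As a rough illustration of the chromosome/position selection discussed above, here is a minimal Python sketch that picks out the read IDs overlapping a region and orders them by position. It assumes the reads have already been basecalled and aligned, and that the alignments are available as PAF lines; the function name `reads_in_region` is hypothetical and not part of any tool mentioned in this thread. The resulting ID list could then be handed to whatever extraction tool is used on the BLOW5 files.

```python
def reads_in_region(paf_lines, chrom, start, end):
    """Return read IDs whose alignment overlaps [start, end) on chrom,
    ordered by leftmost target start (ties broken by read ID).

    Assumes standard PAF columns: 0 = query name, 5 = target name,
    7 = target start, 8 = target end.
    """
    leftmost = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        read_id, target = fields[0], fields[5]
        t_start, t_end = int(fields[7]), int(fields[8])
        # Half-open interval overlap test against the requested region.
        if target == chrom and t_start < end and t_end > start:
            if read_id not in leftmost or t_start < leftmost[read_id]:
                leftmost[read_id] = t_start
    return sorted(leftmost, key=lambda rid: (leftmost[rid], rid))
```

Writing the returned IDs one per line gives a plain read ID list, which keeps the region selection independent of the signal file format.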

@skoren Do you have the total number of reads in the dataset?
After conversion to BLOW5, the total size dropped to 3.4TB, down from 5.2TB as compressed FAST5 tar.gz archives. I am asking so I can double-check that all the reads are present in the converted version.
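
A total read count catches a shortfall but not which reads are missing. If read ID lists are available for both the original archives and the converted files, a comparison along these lines would show the difference. This is only a sketch: how the two ID lists are obtained is left open, and `compare_read_ids` is a hypothetical helper name.

```python
def compare_read_ids(original_ids, converted_ids):
    """Compare two iterables of read IDs and summarise the differences."""
    original = set(original_ids)
    converted = set(converted_ids)
    return {
        "original": len(original),    # unique IDs in the source data
        "converted": len(converted),  # unique IDs after conversion
        "missing": sorted(original - converted),     # lost in conversion
        "unexpected": sorted(converted - original),  # not in the source
    }
```

At the scale of this dataset the ID sets may be large, so in practice the comparison might need to be done per partition rather than in one pass.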

@hasindu2008 hasindu2008 mentioned this issue Sep 1, 2022
@Marynotmartha

Marynotmartha commented Oct 11, 2022 via email

4 participants