Extracting a subset of data from raw nanopore signal data #63

hasindu2008 · 2022-07-19T05:50:05Z

I was looking for an ONT raw signal dataset at very high coverage (a few hundred X), and the nanopore dataset in this repository seems ideal. I only need the raw data for a few genomic regions. Is there a way to selectively download a set of read IDs from the raw dataset, without having to download and extract all the terabytes of tar.gz (which I estimate would take weeks to months)?

skoren · 2022-08-01T20:08:44Z

Unfortunately, we don't have the data organized by chromosome so your only option would be to download and extract the full set. If you have IDs of the reads you're interested in and post them here, I can try to look up which partitions they are in and you can download just those.
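
For anyone in the same situation, the lookup described above (read IDs to partitions) can be sketched in a few lines of Python. This is only an illustration: it assumes a hypothetical tab-separated index of `read_id<TAB>partition`, which the repository is not known to provide, and the partition names in the example are made up.

```python
import csv
import io
from collections import defaultdict


def partitions_for_reads(index_rows, wanted_ids):
    """Group the wanted read IDs by the partition that contains them.

    index_rows yields (read_id, partition) pairs from a hypothetical index.
    """
    wanted = set(wanted_ids)
    hits = defaultdict(list)
    for read_id, partition in index_rows:
        if read_id in wanted:
            hits[partition].append(read_id)
    return dict(hits)


# Hypothetical index content: read ID <TAB> partition archive name.
example_index = io.StringIO(
    "read-a\tpartition_01.tar.gz\n"
    "read-b\tpartition_02.tar.gz\n"
    "read-c\tpartition_01.tar.gz\n"
    "read-d\tpartition_03.tar.gz\n"
)
rows = csv.reader(example_index, delimiter="\t")
needed = partitions_for_reads(rows, ["read-a", "read-c", "read-d"])

# Only the partitions listed here would need to be downloaded.
for partition in sorted(needed):
    print(partition, len(needed[partition]))
```

If the wanted reads turn out to be spread over nearly every partition, this kind of lookup saves little, and downloading the full set is the practical route.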

hasindu2008 · 2022-08-11T03:43:08Z

As the reads seemed to be distributed across all the partitions (and I would have had to try different subsets iteratively), I ended up downloading the whole thing, and after about 2 weeks the download is complete! I am now extracting everything and hoping the file system can handle such a large number of files. I'll let you know how it goes. This is an exciting dataset.

gringer · 2022-08-21T23:47:15Z

It'd be really useful to have fast5 files sorted by chromosome/position. That'd be a lot of effort to set up, though.

hasindu2008 · 2022-08-25T04:13:12Z

@gringer Yes, when the data is in FAST5, every manipulation task is hard.
I recently converted all the partitions to BLOW5, and now any type of sorting takes just a few bash commands. I can provide such a sorted version if you are interested.
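
As a rough illustration of what sorting by chromosome/position involves, here is a minimal Python sketch that derives a read ID order from alignments in PAF format (column 1 is the read name, column 6 the target name, column 8 the target start). It is not the bash workflow mentioned above; the function name and the example records are hypothetical, and it naively uses the first alignment seen for each read.

```python
def read_order_from_paf(paf_lines):
    """Return read IDs ordered by (target name, target start).

    Only the first alignment seen per read is used; a real workflow would
    more likely pick the primary or best-scoring alignment.
    """
    first_hit = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        read_id, target, start = fields[0], fields[5], int(fields[7])
        first_hit.setdefault(read_id, (target, start))
    # Target names sort lexicographically here, so "chr10" precedes "chr2".
    return sorted(first_hit, key=lambda rid: first_hit[rid])


# Made-up PAF records for three reads.
example_paf = [
    "read-a\t1000\t0\t1000\t+\tchr2\t5000\t100\t1100\t900\t1000\t60",
    "read-b\t800\t0\t800\t-\tchr1\t9000\t4000\t4800\t700\t800\t60",
    "read-c\t1200\t0\t1200\t+\tchr1\t9000\t50\t1250\t1100\t1200\t60",
]
ordered_ids = read_order_from_paf(example_paf)
print(ordered_ids)
```

The resulting list of read IDs is the piece that a signal-file tool would then need in order to write the raw reads out in that order.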

@skoren Do you have the total number of reads in the dataset?
After conversion to BLOW5, the total size dropped to 3.4TB, down from 5.2TB of compressed FAST5 tar.gz archives. I want to double-check that all the reads are present in the converted version.
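
A total read count catches dropped reads only if nothing was duplicated along the way, so comparing the read IDs themselves is a stricter check. Below is a minimal Python sketch of such a comparison. It assumes the read IDs from the original and converted data have already been exported as plain lists (how they are exported is left out), and the function name and example IDs are hypothetical.

```python
from collections import Counter


def compare_read_ids(original_ids, converted_ids):
    """Summarise the differences between two collections of read IDs."""
    original = Counter(original_ids)
    converted = Counter(converted_ids)
    return {
        "original_total": sum(original.values()),
        "converted_total": sum(converted.values()),
        "missing": sorted(set(original) - set(converted)),
        "unexpected": sorted(set(converted) - set(original)),
        "duplicated": sorted(r for r, n in converted.items() if n > 1),
    }


# Made-up example: the totals match, yet one read is missing and one is duplicated.
report = compare_read_ids(
    ["read-a", "read-b", "read-c"],
    ["read-a", "read-c", "read-c"],
)
print(report)
```

In this example both totals are 3, which shows why matching counts alone do not prove that every read survived the conversion.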

Marynotmartha · 2022-10-11T07:42:09Z

Ask me for my raw DNA.


hasindu2008 mentioned this issue Sep 1, 2022

guppy v6 #64

Closed
