Extracting a subset of data from raw nanopore signal data #63
Comments
Unfortunately, we don't have the data organized by chromosome so your only option would be to download and extract the full set. If you have IDs of the reads you're interested in and post them here, I can try to look up which partitions they are in and you can download just those. |
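The lookup offered above could be sketched roughly as follows. This assumes a hypothetical index mapping each read ID to the partition archive that contains it; no such index is published in this thread, and the function name `partitions_for_reads`, the read IDs, and the partition file names are all illustrative.

```python
from collections import defaultdict


def partitions_for_reads(index, wanted_ids):
    """Group the wanted read IDs by the partition that holds them.

    index: dict mapping read ID -> partition name (hypothetical format).
    Returns a dict of partition -> sorted read IDs, plus a sorted list of
    wanted IDs that were not found in the index.
    """
    by_partition = defaultdict(list)
    missing = []
    for read_id in sorted(set(wanted_ids)):
        partition = index.get(read_id)
        if partition is None:
            missing.append(read_id)
        else:
            by_partition[partition].append(read_id)
    return dict(by_partition), missing


# Illustrative data only; real read IDs and partition names will differ.
index = {
    "read-a": "partition_01.tar.gz",
    "read-b": "partition_07.tar.gz",
    "read-c": "partition_01.tar.gz",
}
hits, missing = partitions_for_reads(index, ["read-a", "read-c", "read-x"])
print(hits)     # partitions worth downloading, with the reads in each
print(missing)  # wanted reads that the index does not know about
```

Only the partitions that appear as keys in `hits` would need to be downloaded. If the wanted reads turn out to be spread across most partitions, this saves little over downloading everything.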
As the reads seemed to be distributed throughout the partitions (and I would have had to iteratively try different subsets), I ended up downloading the whole thing, and after about 2 weeks it has fully downloaded! I am now extracting everything, and hopefully the file system can handle such a large number of files. I'll let you know how it goes. This is an exciting dataset. |
It'd be really useful to have fast5 files sorted by chromosome/position. That'd be a lot of effort to set up, though. |
@gringer When it is in FAST5 - yes every manipulation task is hard. @skoren Do you have the total number of reads in the dataset? |
Ask me for my raw DNA.
…On Thu, Aug 25, 2022 at 12:13 AM Hasindu Gamaarachchi < ***@***.***> wrote:
@gringer <https://github.com/gringer> When it is in FAST5 - yes every
manipulation task is hard.
I have recently converted all the partitions into BLOW5 successfully, and
any type of sorting is now a few bash commands. I would be able to
provide such sorting if you are interested.
@skoren <https://github.com/skoren> Do you have the total number of reads
in the dataset?
After conversion to BLOW5, the total size was reduced to 3.4TB from the
original 5.2TB of compressed FAST5 tar.gz archives. I am asking so that I
can double-check that all the reads are present in the converted version.
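A minimal sketch of the check being asked about, assuming the read IDs (or at least the read counts) can be listed for both the original FAST5 archives and the converted BLOW5 files. How those lists are obtained is not covered here, and the helper names and sample IDs are hypothetical.

```python
def size_reduction(original_tb, converted_tb):
    """Fraction of the original size saved by the conversion."""
    return 1.0 - converted_tb / original_tb


def compare_read_ids(original_ids, converted_ids):
    """Return (IDs lost in conversion, IDs only in the converted set)."""
    original, converted = set(original_ids), set(converted_ids)
    return sorted(original - converted), sorted(converted - original)


# Sizes quoted in the thread: 5.2TB of FAST5 tar.gz, 3.4TB of BLOW5.
reduction = size_reduction(5.2, 3.4)
print(f"size reduction: {reduction:.1%}")

# Illustrative IDs only; a real check would use the full ID lists.
lost, extra = compare_read_ids(["r1", "r2", "r3"], ["r1", "r3"])
print("missing after conversion:", lost)
print("unexpected in converted set:", extra)
```

Comparing ID sets is stricter than comparing totals: equal counts alone would not reveal a read that was dropped while another was duplicated.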
|
I was looking for an ONT raw signal dataset at very high coverage (a few hundred X), and the nanopore dataset in this repository seems to be ideal. I only need the raw data for a few genomic regions. Is there a way to selectively download a set of read IDs from the raw dataset, without having to download and extract all the terabytes of tar.gz (which I estimate would take weeks to months)?