-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Features to add #8
Comments
Hi, Thank you for your work around here! Since this issue dates back a bit, I wanted to know about the plans for the package, I would be particularly interested in write/filtering functions for BGEN files. Something in the spirit of https://openmendel.github.io/SnpArrays.jl/latest/. Is that something you would have time to do? |
@olivierlabayle The write/filtering features will be much welcomed. @kose-y You may give an update of the status and plan for these two features? |
These features are to be implemented when the need arises internally or externally. Among writing/filtering features, filtering variants is the simplest because the file is variant-by-variant compression. Filtering samples is trickier and may need stats to be computed online. It cannot be as efficient as what we do in I will prioritize filtering variants, and start implementing it this week. @olivierlabayle If you need to filter out samples, what would be the criteria? Directly providing a binary mask vector or using genotyping success rate is the simplest, and I can implement it. Please let me know if you need some other criteria. Also, please let me know if you need other writing features. |
@kose-y Thank you for your quick reply! Eventually I would like to filter variants and samples and then write to disk. I think filtering by providing a mask or list is a nice feature because in theory I you can do anything with it with some user defined logic. |
@olivierlabayle I think that kind of feature can be completed in about 2-3 weeks. |
@olivierlabayle filter function is added with v0.1.9. |
@kose-y amazing thank you, will try that soon! |
Hi, I have thus generated some filtered bgen file using the new feature and next tried to use plink2 (v2.00a3.7LM 64-bit Intel (24 Oct 2022)) on those files. It seems the compression is not supported by the tool, is that linked to the fact that BGEN.jl uses ZSTD? What would be the way forward?
|
I'm not sure at this point. Plink2 sometimes does nonstandard things with compression BGEN format. Could you please provide a working example with a public small dataset? I think one possibility is adding Gzip compression and seeing if the file is read correctly with plink2. I will add the option this week. |
Sorry, I realize my previous comment doesn't help much. So I've tried with some examples provided in this repo and they seem to work fine. I don't know if the issue is related to plink2, I have also tried with qctoolv2 for the conversion to bed and I get a core dumped. Do you know if there is anything related to uk-biobank files that could corrupt the filtering process? Happy to try with another public dataset if you can give me other pointers. |
Do you mean that the filtered files with the example data in this repo work OK with plink2 and qctoolv2? |
Yes, this is right, I've done it with plink2 after filtering (example of the 10 first variants in the doc) and the program completes without error. I've done in for a variety of example files: 8, 16, 20, 32, 24, 32 bits. |
I have finally suceeded in producing a minimal example using a file provided in this repo, you will need plink2 (version above) and Julia 1.8. using BGEN
using Random
rng = MersenneTwister(0)
bgen = Bgen(BGEN.datadir("example.8bits.bgen"))
### Rand example
vidx = BitVector(rand(rng, Bool, bgen.header.n_variants))
sidx = BitVector(rand(rng, Bool, bgen.header.n_samples))
BGEN.filter("test.bgen", bgen, vidx, sidx)
run(`plink2 --bgen test.bgen ref-unknown --sample test.sample --make-bed`)
rm("test.bgen")
rm("test.sample")
### Even simpler example
vidx = zeros(Bool, bgen.header.n_variants)
vidx[[10, 20, 30]] .= 1
sidx = zeros(Bool, bgen.header.n_samples)
sidx[[10, 20, 30, 40, 50, 100]] .= 1
BGEN.filter("test.bgen", bgen, BitVector(vidx), BitVector(sidx))
run(`plink2 --bgen test.bgen ref-unknown --sample test.sample --make-bed`)
### Reading the filtered output with BGEN.jl seems to work fine
filtered_bgen = Bgen("test.bgen")
for v in iterator(filtered_bgen)
p = probabilities!(filtered_bgen, v)
println(size(p))
end
rm("test.bgen")
rm("test.sample") So it seems the problem may occur if the masked SNPs/samples are not contiguous? Note: that using the filtering examples provided in the documentation works fine. |
@olivierlabayle Thanks for the reproducible example! I will look into it. |
Did you manage to make progress on that by any chance? |
Sorry for the delay. I will look into it this weekend. |
@olivierlabayle I was able to reproduce the issue, but I think it will take a while to fix it. I will need to dig into Plink 2's source code to figure out where the current compression procedure and Plink's expectation for the Bgen format deviate. |
I cannot find what I did wrong just based on the BGEN file format documentation. Plink might have some other constraints not listed in the format documentation. |
@olivierlabayle The bug in the filter function is now fixed. Plink2 had a more rigid check on decompressed data length. |
Thank you that's great, I will try again soon! |
The following are the items raised so far.
select a variant by rsid or index within the region selected
getindex function
major/minor alleles for haplotypes
ref allele dosage (avoid confusion with minor allele)
minor allele frequency
HWE test
comparison: parse timing vs. VCF format
import values in a matrix (Feature request: import genotype/haplotypes into numeric matrix #7)
write functions
expose interface for non-allocating probability reading functions
documentation: define RSID
documentation/code: pos -> position
allow integer arguments for chromosome selection
show()
function forBgen
,Variant
,Genotypes
documentation: slash (
/
) for unphased, bar (|
) for phased genotypes and haplotypesdocumentation: clarify the explanation of
probabilities!(b, v)[i, j]
documentation: biallelic diploid examples then polyploid exceptions
same name for similar functionalities to
SnpArrays.jl
The text was updated successfully, but these errors were encountered: