Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Features to add #8

Open
3 of 18 tasks
kose-y opened this issue Mar 4, 2021 · 20 comments
Open
3 of 18 tasks

Features to add #8

kose-y opened this issue Mar 4, 2021 · 20 comments
Labels
documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed

Comments

@kose-y
Copy link
Member

kose-y commented Mar 4, 2021

The following are the items raised so far.

  • select a variant by rsid or index within the region selected

  • getindex function

  • major/minor alleles for haplotypes

  • ref allele dosage (avoid confusion with minor allele)

  • minor allele frequency

  • HWE test

  • comparison: parse timing vs. VCF format

  • import values in a matrix (Feature request: import genotype/haplotypes into numeric matrix #7)

  • write functions

  • expose interface for non-allocating probability reading functions

  • documentation: define RSID

  • documentation/code: pos -> position

  • allow integer arguments for chromosome selection

  • show() function for Bgen, Variant, Genotypes

  • documentation: slash (/) for unphased, bar (|) for phased genotypes and haplotypes

  • documentation: clarify the explanation of probabilities!(b, v)[i, j]

  • documentation: biallelic diploid examples then polyploid exceptions

  • same name for similar functionalities to SnpArrays.jl

@kose-y kose-y added documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed labels Mar 4, 2021
@olivierlabayle
Copy link

Hi,

Thank you for your work around here!

Since this issue dates back a bit, I wanted to know about the plans for the package, I would be particularly interested in write/filtering functions for BGEN files. Something in the spirit of https://openmendel.github.io/SnpArrays.jl/latest/. Is that something you would have time to do?

@Hua-Zhou
Copy link
Member

@olivierlabayle The write/filtering features will be much welcomed.

@kose-y You may give an update of the status and plan for these two features?

@kose-y
Copy link
Member Author

kose-y commented Jul 26, 2022

These features are to be implemented when the need arises internally or externally. Among writing/filtering features, filtering variants is the simplest because the file is variant-by-variant compression. Filtering samples is trickier and may need stats to be computed online. It cannot be as efficient as what we do in SnpArrays.jl and needs more consideration into it.

I will prioritize filtering variants, and start implementing it this week.

@olivierlabayle If you need to filter out samples, what would be the criteria? Directly providing a binary mask vector or using genotyping success rate is the simplest, and I can implement it. Please let me know if you need some other criteria. Also, please let me know if you need other writing features.

@olivierlabayle
Copy link

@kose-y Thank you for your quick reply!

Eventually I would like to filter variants and samples and then write to disk. I think filtering by providing a mask or list is a nice feature because in theory I you can do anything with it with some user defined logic.

@kose-y
Copy link
Member Author

kose-y commented Jul 31, 2022

@olivierlabayle I think that kind of feature can be completed in about 2-3 weeks.

@kose-y
Copy link
Member Author

kose-y commented Sep 15, 2022

@olivierlabayle filter function is added with v0.1.9.

@olivierlabayle
Copy link

@kose-y amazing thank you, will try that soon!

@olivierlabayle
Copy link

olivierlabayle commented Nov 2, 2022

Hi, I have thus generated some filtered bgen file using the new feature and next tried to use plink2 (v2.00a3.7LM 64-bit Intel (24 Oct 2022)) on those files. It seems the compression is not supported by the tool, is that linked to the fact that BGEN.jl uses ZSTD? What would be the way forward?

plink2 --bgen filtered.ukb_53116_chr12.bgen ref-unknown --sample filtered.ukb_53116_chr12.sample --indep-pairwise 1000 50 0.05 --exclude range exclusion_regions_hg19.txt
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --bgen filtered.ukb_53116_chr12.bgen ref-first
  --exclude range exclusion_regions_hg19.txt
  --indep-pairwise 1000 50 0.05
  --sample filtered.ukb_53116_chr12.sample

Start time: Wed Nov  2 15:35:04 2022
385228 MiB RAM detected; reserving 192614 MiB for main workspace.
Using up to 32 threads (change this with --threads).
--bgen: 28312 variants detected, format v1.3.
487207 samples imported from .sample file to plink2-temporary.psam .

Error: Invalid compressed SNP block in .bgen file.

@kose-y
Copy link
Member Author

kose-y commented Nov 2, 2022

I'm not sure at this point. Plink2 sometimes does nonstandard things with compression BGEN format. Could you please provide a working example with a public small dataset? I think one possibility is adding Gzip compression and seeing if the file is read correctly with plink2. I will add the option this week.

@olivierlabayle
Copy link

Sorry, I realize my previous comment doesn't help much. So I've tried with some examples provided in this repo and they seem to work fine. I don't know if the issue is related to plink2, I have also tried with qctoolv2 for the conversion to bed and I get a core dumped. Do you know if there is anything related to uk-biobank files that could corrupt the filtering process? Happy to try with another public dataset if you can give me other pointers.

@kose-y
Copy link
Member Author

kose-y commented Nov 2, 2022

Do you mean that the filtered files with the example data in this repo work OK with plink2 and qctoolv2?

@olivierlabayle
Copy link

Yes, this is right, I've done it with plink2 after filtering (example of the 10 first variants in the doc) and the program completes without error. I've done in for a variety of example files: 8, 16, 20, 32, 24, 32 bits.

@olivierlabayle
Copy link

olivierlabayle commented Nov 4, 2022

I have finally suceeded in producing a minimal example using a file provided in this repo, you will need plink2 (version above) and Julia 1.8.

using BGEN
using Random

rng = MersenneTwister(0)
bgen = Bgen(BGEN.datadir("example.8bits.bgen"))

### Rand example

vidx = BitVector(rand(rng, Bool, bgen.header.n_variants))
sidx = BitVector(rand(rng, Bool, bgen.header.n_samples))
BGEN.filter("test.bgen", bgen, vidx, sidx)
run(`plink2 --bgen test.bgen ref-unknown --sample test.sample --make-bed`)

rm("test.bgen")
rm("test.sample")

### Even simpler example

vidx = zeros(Bool, bgen.header.n_variants)
vidx[[10, 20, 30]] .= 1

sidx = zeros(Bool, bgen.header.n_samples)
sidx[[10, 20, 30, 40, 50, 100]] .= 1

BGEN.filter("test.bgen", bgen, BitVector(vidx), BitVector(sidx))
run(`plink2 --bgen test.bgen ref-unknown --sample test.sample --make-bed`)

### Reading the filtered output with BGEN.jl seems to work fine

filtered_bgen = Bgen("test.bgen")

for v in iterator(filtered_bgen)
    p = probabilities!(filtered_bgen, v)
    println(size(p))
end

rm("test.bgen")
rm("test.sample")

So it seems the problem may occur if the masked SNPs/samples are not contiguous?

Note: that using the filtering examples provided in the documentation works fine.

@kose-y
Copy link
Member Author

kose-y commented Nov 4, 2022

@olivierlabayle Thanks for the reproducible example! I will look into it.

@olivierlabayle
Copy link

Did you manage to make progress on that by any chance?

@kose-y
Copy link
Member Author

kose-y commented Nov 17, 2022

Sorry for the delay. I will look into it this weekend.

@kose-y
Copy link
Member Author

kose-y commented Nov 20, 2022

@olivierlabayle I was able to reproduce the issue, but I think it will take a while to fix it. I will need to dig into Plink 2's source code to figure out where the current compression procedure and Plink's expectation for the Bgen format deviate.

@kose-y
Copy link
Member Author

kose-y commented Nov 20, 2022

I cannot find what I did wrong just based on the BGEN file format documentation. Plink might have some other constraints not listed in the format documentation.

@kose-y
Copy link
Member Author

kose-y commented Jan 22, 2023

@olivierlabayle The bug in the filter function is now fixed. Plink2 had a more rigid check on decompressed data length.

@olivierlabayle
Copy link

Thank you that's great, I will try again soon!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants