Features to add #8

kose-y · 2021-03-04T20:24:04Z

olivierlabayle · 2022-07-26T14:27:31Z

Hi,

Thank you for your work around here!

Since this issue dates back a bit, I wanted to know about the plans for the package, I would be particularly interested in write/filtering functions for BGEN files. Something in the spirit of https://openmendel.github.io/SnpArrays.jl/latest/. Is that something you would have time to do?

Hua-Zhou · 2022-07-26T19:23:30Z

@olivierlabayle The write/filtering features will be much welcomed.

@kose-y You may give an update of the status and plan for these two features?

kose-y · 2022-07-26T20:57:54Z

These features are to be implemented when the need arises internally or externally. Among writing/filtering features, filtering variants is the simplest because the file is variant-by-variant compression. Filtering samples is trickier and may need stats to be computed online. It cannot be as efficient as what we do in SnpArrays.jl and needs more consideration into it.

I will prioritize filtering variants, and start implementing it this week.

@olivierlabayle If you need to filter out samples, what would be the criteria? Directly providing a binary mask vector or using genotyping success rate is the simplest, and I can implement it. Please let me know if you need some other criteria. Also, please let me know if you need other writing features.

olivierlabayle · 2022-07-27T07:56:13Z

@kose-y Thank you for your quick reply!

Eventually I would like to filter variants and samples and then write to disk. I think filtering by providing a mask or list is a nice feature because in theory I you can do anything with it with some user defined logic.

kose-y · 2022-07-31T04:07:07Z

@olivierlabayle I think that kind of feature can be completed in about 2-3 weeks.

kose-y · 2022-09-15T19:27:30Z

@olivierlabayle filter function is added with v0.1.9.

olivierlabayle · 2022-09-20T10:13:38Z

@kose-y amazing thank you, will try that soon!

olivierlabayle · 2022-11-02T15:41:48Z

Hi, I have thus generated some filtered bgen file using the new feature and next tried to use plink2 (v2.00a3.7LM 64-bit Intel (24 Oct 2022)) on those files. It seems the compression is not supported by the tool, is that linked to the fact that BGEN.jl uses ZSTD? What would be the way forward?

plink2 --bgen filtered.ukb_53116_chr12.bgen ref-unknown --sample filtered.ukb_53116_chr12.sample --indep-pairwise 1000 50 0.05 --exclude range exclusion_regions_hg19.txt
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --bgen filtered.ukb_53116_chr12.bgen ref-first
  --exclude range exclusion_regions_hg19.txt
  --indep-pairwise 1000 50 0.05
  --sample filtered.ukb_53116_chr12.sample

Start time: Wed Nov  2 15:35:04 2022
385228 MiB RAM detected; reserving 192614 MiB for main workspace.
Using up to 32 threads (change this with --threads).
--bgen: 28312 variants detected, format v1.3.
487207 samples imported from .sample file to plink2-temporary.psam .

Error: Invalid compressed SNP block in .bgen file.

kose-y · 2022-11-02T16:56:22Z

I'm not sure at this point. Plink2 sometimes does nonstandard things with compression BGEN format. Could you please provide a working example with a public small dataset? I think one possibility is adding Gzip compression and seeing if the file is read correctly with plink2. I will add the option this week.

olivierlabayle · 2022-11-02T17:53:12Z

Sorry, I realize my previous comment doesn't help much. So I've tried with some examples provided in this repo and they seem to work fine. I don't know if the issue is related to plink2, I have also tried with qctoolv2 for the conversion to bed and I get a core dumped. Do you know if there is anything related to uk-biobank files that could corrupt the filtering process? Happy to try with another public dataset if you can give me other pointers.

kose-y · 2022-11-02T21:30:11Z

Do you mean that the filtered files with the example data in this repo work OK with plink2 and qctoolv2?

olivierlabayle · 2022-11-03T09:07:33Z

Yes, this is right, I've done it with plink2 after filtering (example of the 10 first variants in the doc) and the program completes without error. I've done in for a variety of example files: 8, 16, 20, 32, 24, 32 bits.

olivierlabayle · 2022-11-04T11:00:19Z

I have finally suceeded in producing a minimal example using a file provided in this repo, you will need plink2 (version above) and Julia 1.8.

using BGEN
using Random

rng = MersenneTwister(0)
bgen = Bgen(BGEN.datadir("example.8bits.bgen"))

### Rand example

vidx = BitVector(rand(rng, Bool, bgen.header.n_variants))
sidx = BitVector(rand(rng, Bool, bgen.header.n_samples))
BGEN.filter("test.bgen", bgen, vidx, sidx)
run(`plink2 --bgen test.bgen ref-unknown --sample test.sample --make-bed`)

rm("test.bgen")
rm("test.sample")

### Even simpler example

vidx = zeros(Bool, bgen.header.n_variants)
vidx[[10, 20, 30]] .= 1

sidx = zeros(Bool, bgen.header.n_samples)
sidx[[10, 20, 30, 40, 50, 100]] .= 1

BGEN.filter("test.bgen", bgen, BitVector(vidx), BitVector(sidx))
run(`plink2 --bgen test.bgen ref-unknown --sample test.sample --make-bed`)

### Reading the filtered output with BGEN.jl seems to work fine

filtered_bgen = Bgen("test.bgen")

for v in iterator(filtered_bgen)
    p = probabilities!(filtered_bgen, v)
    println(size(p))
end

rm("test.bgen")
rm("test.sample")

So it seems the problem may occur if the masked SNPs/samples are not contiguous?

Note: that using the filtering examples provided in the documentation works fine.

kose-y · 2022-11-04T18:30:53Z

@olivierlabayle Thanks for the reproducible example! I will look into it.

olivierlabayle · 2022-11-16T18:42:10Z

Did you manage to make progress on that by any chance?

kose-y · 2022-11-17T04:50:07Z

Sorry for the delay. I will look into it this weekend.

kose-y · 2022-11-20T23:16:51Z

@olivierlabayle I was able to reproduce the issue, but I think it will take a while to fix it. I will need to dig into Plink 2's source code to figure out where the current compression procedure and Plink's expectation for the Bgen format deviate.

kose-y · 2022-11-20T23:18:30Z

I cannot find what I did wrong just based on the BGEN file format documentation. Plink might have some other constraints not listed in the format documentation.

kose-y · 2023-01-22T03:54:37Z

@olivierlabayle The bug in the filter function is now fixed. Plink2 had a more rigid check on decompressed data length.

olivierlabayle · 2023-01-25T10:49:03Z

Thank you that's great, I will try again soon!

kose-y added documentation Improvements or additions to documentation enhancement New feature or request help wanted Extra attention is needed labels Mar 4, 2021

olivierlabayle mentioned this issue Jul 27, 2022

End to End Test of the pipeline TARGENE/targene-pipeline#37

Closed

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Features to add #8

Features to add #8

kose-y commented Mar 4, 2021 •

edited

Loading

olivierlabayle commented Jul 26, 2022

Hua-Zhou commented Jul 26, 2022

kose-y commented Jul 26, 2022

olivierlabayle commented Jul 27, 2022

kose-y commented Jul 31, 2022

kose-y commented Sep 15, 2022

olivierlabayle commented Sep 20, 2022

olivierlabayle commented Nov 2, 2022 •

edited

Loading

kose-y commented Nov 2, 2022

olivierlabayle commented Nov 2, 2022

kose-y commented Nov 2, 2022

olivierlabayle commented Nov 3, 2022

olivierlabayle commented Nov 4, 2022 •

edited

Loading

kose-y commented Nov 4, 2022

olivierlabayle commented Nov 16, 2022

kose-y commented Nov 17, 2022

kose-y commented Nov 20, 2022

kose-y commented Nov 20, 2022

kose-y commented Jan 22, 2023

olivierlabayle commented Jan 25, 2023

Features to add #8

Features to add #8

Comments

kose-y commented Mar 4, 2021 • edited Loading

olivierlabayle commented Jul 26, 2022

Hua-Zhou commented Jul 26, 2022

kose-y commented Jul 26, 2022

olivierlabayle commented Jul 27, 2022

kose-y commented Jul 31, 2022

kose-y commented Sep 15, 2022

olivierlabayle commented Sep 20, 2022

olivierlabayle commented Nov 2, 2022 • edited Loading

kose-y commented Nov 2, 2022

olivierlabayle commented Nov 2, 2022

kose-y commented Nov 2, 2022

olivierlabayle commented Nov 3, 2022

olivierlabayle commented Nov 4, 2022 • edited Loading

kose-y commented Nov 4, 2022

olivierlabayle commented Nov 16, 2022

kose-y commented Nov 17, 2022

kose-y commented Nov 20, 2022

kose-y commented Nov 20, 2022

kose-y commented Jan 22, 2023

olivierlabayle commented Jan 25, 2023

kose-y commented Mar 4, 2021 •

edited

Loading

olivierlabayle commented Nov 2, 2022 •

edited

Loading

olivierlabayle commented Nov 4, 2022 •

edited

Loading