feat: a validate subcommand to check whether a .hap file is valid
#47
we could also use this subcommand to validate other kinds of files like In that case, maybe the best way to structure this code would be to create a
Update: After some discussion, we decided not to use this to validate other kinds of files, after all.
probably the best way to start on this PR would be to create a Inside that function, you can iterate over the
haptools/haptools/data/haplotypes.py
Lines 894 to 915 in d74633a
also note: that way, we can account for situations where the user might have a text editor that automatically inserts spaces when the tab key is pressed
update:
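The concern about editors that insert spaces when the tab key is pressed could be caught with a small check. The following is only a minimal sketch, assuming that the first field of a .hap line (for example "H", "#H", or "#") should be followed by a tab; the function name `find_space_delimited_lines` is hypothetical and not part of haptools.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: the first field of a line should be followed by a tab.
# A run of spaces right after it suggests an editor expanded the tab.
_LEADING_SPACES = re.compile(r"^(\S+)( +)")


def find_space_delimited_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, offending_prefix) for each line whose first
    field is followed by spaces instead of a tab. Line numbers are 1-based."""
    problems = []
    for number, line in enumerate(lines, start=1):
        match = _LEADING_SPACES.match(line.rstrip("\n"))
        if match:
            # Quote the first field plus the spaces that follow it.
            problems.append((number, match.group(0)))
    return problems
```

A caller could then either report these lines as violations or fall back to splitting them on arbitrary whitespace; which of the two is preferable is left open here.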
This base is still missing some features
Still not implemented as a cli switch
Refer to #47
Important: Please familiarize yourself with the .hap file specification before reading this issue!

Originating from item 2 in the "Future Work" section of PR #43:
At first, this command should reject unsorted .hap files, but at some point we should also add a --no-sort parameter to support unsorted files, since those are also technically valid input.

For each violation of the standard, it would be nice if the validate subcommand reported the exact line that contains the issue, and ideally, it would quote the problematic part of the line, as well.

We should probably also add an optional argument that specifies the subcommand that this .hap file will be used as input for. That way, we can import its custom Haplotype class and acquire the expected extra field types from there.

Here are some rules it should check the .hap file follows:
--no-sort flag is set)
pysam.tabix_iterator ?)
--genotypes parameter specifying the path to the genotypes file is present.
pysam.tabix_iterator works for this, then we should consider adding read_variants() and read_samples() methods to the GenotypesVCF class.

And here are some rules for the header of the .hap file:
Haplotypes.check_version() )
#, followed by a tab, followed by a recognized metadata name: currently, "version", "orderH", "orderV", and "orderR"
--no-sort is specified, do the metadata lines appear before the extra field declarations?
# followed immediately by a symbol other than "H", "R", or "V"?
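
The header rules, together with the request to report the exact line and quote the offending part, could be prototyped roughly as follows. This is only a sketch under stated assumptions: `check_header_lines` and `Violation` are hypothetical names, fields are assumed to be tab-separated, and a "#" followed immediately by a symbol other than "H", "R", or "V" is treated as a violation. It covers only the two header rules whose wording is fully visible above and does not call Haplotypes.check_version().

```python
from typing import Iterable, List, NamedTuple

# Metadata names listed in this issue as currently recognized.
KNOWN_METADATA = {"version", "orderH", "orderV", "orderR"}
# Line-type symbols that may directly follow "#".
LINE_TYPES = {"H", "R", "V"}


class Violation(NamedTuple):  # hypothetical container for one reported problem
    line_number: int  # 1-based line number in the file
    excerpt: str      # the problematic part of the line
    message: str


def check_header_lines(lines: Iterable[str]) -> List[Violation]:
    """Check header lines (those starting with '#') against two rules."""
    violations = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.startswith("#"):
            continue  # data lines are out of scope for this sketch
        fields = line.split("\t")
        if fields[0] == "#":
            # Metadata line: "#", a tab, then a recognized metadata name.
            name = fields[1] if len(fields) > 1 else ""
            if name not in KNOWN_METADATA:
                violations.append(
                    Violation(number, name, "unrecognized metadata name")
                )
        elif fields[0][1:] not in LINE_TYPES:
            # "#" followed immediately by something other than H, R, or V.
            violations.append(
                Violation(number, fields[0], "unexpected symbol after '#'")
            )
    return violations
```

Each returned `Violation` carries the line number and the quoted excerpt, so a command-line wrapper could print one message per problem and exit with a nonzero status if the list is not empty.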