Sequence file input validator does not accept gzipped files #77

Open
sarahet opened this issue Feb 13, 2020 · 9 comments

@sarahet
Member

sarahet commented Feb 13, 2020

When defining an input file with a validator for sequence input files the following way:

parser.add_option(options.queryFile, 'q', "query", "Query sequences.",
                  seqan3::option_spec::REQUIRED,
                  seqan3::input_file_validator<seqan3::sequence_file_input<>>{});

it does not accept .gz files but only [embl,fasta,fa,fna,ffn,faa,frn,fastq,fq,genbank,gb,gbk,sam]

Shouldn't this be possible as most sequence input files are actually compressed?

@sarahet sarahet added the bug Something isn't working label Feb 13, 2020
@rrahn

rrahn commented Mar 24, 2020

Hey @sarahet thanks for the report.

This is indeed not supported at the moment and would be a new feature. It needs some thought to get right, though. We will discuss this and see how we can fit it into our roadmap.

@sarahet sarahet added feature/proposal and removed bug Something isn't working labels Mar 25, 2020
@sarahet
Member Author

sarahet commented Mar 25, 2020

Great, thanks. A related issue is that input validators generally cannot accept file endings that include multiple dots, like .fastq.gz. Could that be included as well, or is that not planned? Related: seqan/lambda#161
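
To illustrate the multi-dot problem: if a validator compares only the last extension component, which is what `std::filesystem::path::extension()` returns, then `reads.fastq.gz` yields `gz` and a list entry such as `fastq.gz` can never match. Whether the validator actually works this way is an assumption here. The sketch below contrasts that with a suffix-based check; `last_extension` and `has_suffix_extension` are hypothetical helper names, not part of SeqAn.

```cpp
#include <filesystem>
#include <string>
#include <string_view>

// Hypothetical helper: the last extension only, without the leading dot.
// For "reads.fastq.gz" this is "gz", so a list entry "fastq.gz" never matches.
std::string last_extension(std::filesystem::path const & file)
{
    std::string ext = file.extension().string();
    return ext.empty() ? ext : ext.substr(1);
}

// Hypothetical helper: true if the file name ends with '.' + extension.
// This also covers entries that contain dots themselves, e.g. "fastq.gz".
bool has_suffix_extension(std::filesystem::path const & file, std::string_view extension)
{
    std::string const name = file.filename().string();
    std::string const suffix = "." + std::string{extension};
    return name.size() > suffix.size()
        && name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

With a suffix check, a combined entry like `fastq.gz` matches `reads.fastq.gz`, while `reads.txt.gz` is still rejected.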

@rrahn

rrahn commented Mar 25, 2020

Yes, I guess that is somewhat related. Is the multiple-dot version restricted to two dots, assuming one valid format extension and one compression extension? Or are more complex cases possible?

@sarahet
Member Author

sarahet commented Apr 7, 2020

So far I can only think of one format extension plus one compression extension, but of course there could be other use cases that I currently don't have in mind.

@smehringer
Member

Is it a use case to compress your file multiple times, e.g. fasta.gz.bz2? (I don't even know if this makes sense, just wondering.)

@rrahn

rrahn commented Apr 8, 2020

I think .fasta.tar.bz etc. could occur. Not that we would support that at the moment, but it may be interesting for the future.

@joergi-w
Member

Would it be a solution to have a constant list of "compression file extensions" similar to the sequence file extensions?
For a given filename, the validator algorithm would be as follows:

  1. search for a compression extension
  2. if found: possibly flag the file as compressed or assign the compression stream already, then remove the extension
  3. repeat 1 and 2 until "not found" (alternatively allow e.g. .tar.bz being one element and list all common ones)
  4. check the validity of the remaining filename

I suggest not to add the compression extensions to the help page, as the product of [file ext.] x [compression ext.] is too large and not helpful to repeat for each input file parameter. We can document somewhere that files with certain extensions get implicitly extracted.
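
The four steps above could be sketched roughly as follows, using only the standard library. The extension list, the `stripped_filename` struct, and the function names `strip_compression` and `has_valid_extension` are all assumptions made for illustration; none of them exist in SeqAn, and a real list would depend on which compression libraries are available at build time.

```cpp
#include <algorithm>
#include <array>
#include <filesystem>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Assumed constant list of compression extensions (illustrative only).
inline constexpr std::array<std::string_view, 3> compression_extensions{"gz", "bz2", "bgzf"};

struct stripped_filename
{
    std::filesystem::path remaining; // file name with compression extensions removed
    bool is_compressed{false};       // true if at least one extension was removed
};

// Steps 1-3: remove trailing compression extensions until none is found.
stripped_filename strip_compression(std::filesystem::path file)
{
    stripped_filename result{std::move(file), false};
    while (true)
    {
        std::string ext = result.remaining.extension().string(); // e.g. ".gz"
        if (ext.empty())
            break;
        ext.erase(0, 1); // drop the leading dot
        if (std::find(compression_extensions.begin(), compression_extensions.end(), ext)
            == compression_extensions.end())
            break;
        result.remaining.replace_extension(); // strip the compression extension
        result.is_compressed = true;
    }
    return result;
}

// Step 4: check the remaining extension against the format's extension list.
bool has_valid_extension(std::filesystem::path const & file, std::vector<std::string> const & valid)
{
    std::string const ext = strip_compression(file).remaining.extension().string();
    return !ext.empty() && std::find(valid.begin(), valid.end(), ext.substr(1)) != valid.end();
}
```

Because the loop repeats until no compression extension is left, a name like `reads.fastq.gz.bz2` reduces to `reads.fastq` before the format check. Treating `.tar.bz` as a single element, the alternative mentioned in step 3, would need a suffix match instead of `extension()`.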

@smehringer smehringer transferred this issue from seqan/seqan3 Mar 21, 2022
@smehringer
Member

Core Meeting 2022-03-22

We will defer this feature until the seqan3 I/O design is fixed.

In the meantime, @eseiler will post a workaround here:

@eseiler
Member

eseiler commented Mar 22, 2022

My workaround
#include <string>
#include <vector>

#include <seqan3/argument_parser/all.hpp>
#include <seqan3/io/sequence_file/input.hpp>

class my_validator : public seqan3::input_file_validator<void> // No template param in sharg
{
public:

    my_validator() : my_validator{combined_extensions} {}
    my_validator(my_validator const &) = default;
    my_validator & operator=(my_validator const &) = default;
    my_validator(my_validator &&) = default;
    my_validator & operator=(my_validator &&) = default;
    ~my_validator() = default;

    explicit my_validator(std::vector<std::string> const & extensions)
    {
        // my_validator::extensions_str = sharg::detail::to_string(extensions); // Sharg only
        my_validator::extensions = extensions; // plain copy: std::move on a const reference would copy anyway
    }

    // Optional for readable help page:
    std::string get_help_page_message() const
    {
        return seqan3::detail::to_string("The input file must exist and read permissions must be granted. Valid file extensions are: ",
                                         sequence_extensions,
                                         #if defined(SEQAN3_HAS_BZIP2) || defined(SEQAN3_HAS_ZLIB)
                                         ", possibly followed by ", compression_extensions,
                                         #endif
                                         '.');
    }

private:
    std::vector<std::string> sequence_extensions{seqan3::detail::valid_file_extensions<typename seqan3::sequence_file_input<>::valid_formats>()};
    std::vector<std::string> compression_extensions{[&] ()
                             {
                                 std::vector<std::string> result;
                                 #ifdef SEQAN3_HAS_BZIP2
                                     result.push_back("bz2");
                                 #endif
                                 #ifdef SEQAN3_HAS_ZLIB
                                     result.push_back("gz");
                                     result.push_back("bgzf");
                                 #endif
                                 return result;
                             }()};
    std::vector<std::string> combined_extensions{[&] ()
                             {
                                 if (compression_extensions.empty())
                                    return sequence_extensions;
                                 std::vector<std::string> result;
                                 for (auto && sequence_extension : sequence_extensions)
                                 {
                                     result.push_back(sequence_extension);
                                     for (auto && compression_extension : compression_extensions)
                                         result.push_back(sequence_extension + std::string{'.'} + compression_extension);
                                 }
                                return result;
                             }()};
};

int main()
{
    std::string some_path{};
    const char * argv[] = {"./test", "-h"};
    seqan3::argument_parser parser{"test_parser", 2, argv, seqan3::update_notifications::off};
    parser.add_option(some_path, 'i', "input", "Fancy descprition,", seqan3::option_spec::required, my_validator{});

    parser.parse();
}
Possible output
test_parser
===========

OPTIONS

  Basic options:
    -h, --help
          Prints the help page.
    -hh, --advanced-help
          Prints the help page including advanced options.
    --version
          Prints the version information.
    --copyright
          Prints the copyright/license information.
    --export-help (std::string)
          Export the help page information. Value must be one of [html, man].
    -i, --input (std::string)
          Fancy descprition, The input file must exist and read permissions
          must be granted. Valid file extensions are:
          [embl,fasta,fa,fna,ffn,faa,frn,fas,fastq,fq,genbank,gb,gbk,sam],
          possibly followed by [bz2,gz,bgzf].

VERSION
    Last update:
    test_parser version:
    SeqAn version: 3.2.0-rc.1

This works for both sharg and seqan3; I added two comments where the code for the two differs. Also, with sharg you would need to use the SHARG macros instead of the SEQAN3 ones.

The output for a failed validation is quite noisy for seqan3 (it will print all combinations of extensions X compression_extensions). With sharg, you could set the extensions_str to the same as what is returned in the help page message and get a nicer error message.
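
A compact string of that kind could be built from the two lists instead of their product, for instance as sketched below. `bracketed` and `describe_extensions` are hypothetical helper names, and assigning the result to sharg's `extensions_str` is only what the comment above suggests, not something verified here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: formats a list as "[a,b,c]".
std::string bracketed(std::vector<std::string> const & items)
{
    std::string result{"["};
    for (std::size_t i = 0; i < items.size(); ++i)
    {
        if (i != 0)
            result += ',';
        result += items[i];
    }
    return result + ']';
}

// Hypothetical helper: describes the accepted endings without listing
// every sequence extension x compression extension combination.
std::string describe_extensions(std::vector<std::string> const & sequence_extensions,
                                std::vector<std::string> const & compression_extensions)
{
    std::string result = bracketed(sequence_extensions);
    if (!compression_extensions.empty())
        result += ", possibly followed by " + bracketed(compression_extensions);
    return result;
}
```

The message length then grows with the sum of the two list sizes rather than their product.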
