-
Hi Gao, Following up on our discussion yesterday re: a standard format for summary statistics: after some more thinking, I believe it is important to use one of the standards that have already been proposed (GWAS-VCF or the GWAS Catalog one) rather than come up with a modified standard, which defeats the purpose. I understand that a modified standard may be more convenient for some specific purpose (e.g., ease of querying), but it seems to me trivial to transform something like GWAS-VCF from wide to long format (in memory or as an intermediate file) during processing, while still keeping GWAS-VCF (or whatever standard format we decide to adopt) as the format for sharing with GCAD and across groups. Let me know your thoughts! Dado @marcora
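The wide-to-long transform mentioned above can be sketched roughly as follows. Note this is a minimal illustration in plain Python: the column names, trait identifiers, and packed per-trait statistics are invented for the example, not taken from the actual GWAS-VCF spec.

```python
# Sketch: unpivot a wide, GWAS-VCF-style table (one column per trait)
# into a long table with one row per variant-trait pair.
# All field names and values below are hypothetical.

def wide_to_long(rows, trait_columns):
    """Produce one record per (variant, trait) pair from wide records."""
    long_rows = []
    for row in rows:
        site = {k: row[k] for k in ("CHROM", "POS", "REF", "ALT")}
        for trait in trait_columns:
            long_rows.append({**site, "trait": trait, "stats": row[trait]})
    return long_rows

wide = [
    {"CHROM": "chr19", "POS": 44908684, "REF": "T", "ALT": "C",
     "ENSG00000130203": "0.8:0.1:1e-12", "ENSG00000234745": ".:.:."},
]
long_rows = wide_to_long(wide, ["ENSG00000130203", "ENSG00000234745"])
# one output row per variant-trait pair
```

The reverse (long to wide) is a pivot on the same keys, which is why the choice of on-disk layout need not constrain the in-memory layout used during processing.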
-
Hi Dado, Yes, I meant to reach out to you too but was buried in back-to-back meetings today ... Suppose we call the long format for xQTLs the "xQTL-VCF" format. Two questions for xQTL-VCF: 1) does it violate the GWAS-VCF standard? 2) does it violate the VCF standard? I think these questions are separate. xQTL-VCF clearly violates 1), and as a result we should either stick to that standard or come up with a modified standard for xQTL studies. The latter is difficult given the timeline of the xQTL project, although it might still be a relevant discussion down the road for a short GWAS-VCF 2.0 or xQTL-VCF paper.

Now, between tabix-indexed TSV and standard GWAS-VCF for GCAD distribution, I personally vote for the former, because the wide GWAS-VCF format -- which disrupts queries, and makes little sense when multiple tissues are saved in one file -- seems to lose its major advantages. The only remaining advantage is the meta-information, which is replaceable by a README file ...

xQTL-VCF does not seem to violate 2) -- I conclude that after reading the VCF 4.2 specification earlier. The only "non-standard" thing we do is duplicated variants. This is not banned in the VCF specification, and indeed we found duplicates in the ROSMAP VCF file after Phil's group preprocessed it via standard VCF software (#150 (comment)). The implication is that it should be safe to use bcftools to process the data as we are doing (we'll evaluate with small tests to make sure); we have found no issue so far. I am therefore inclined to continue using it for internal data communication in the integration pipeline, because it helps harmonize alleles from different studies, supports queries on INFO and sample fields, and is compact in terms of storage. The only other format I would otherwise adopt is sqlite3, but that file-based database is not compact at all and is prone to corruption. I don't want to use server-based databases such as MySQL, and I don't think I am motivated to invest the time in other database solutions ...

What do you think? Kindest regards,
-
Hi Gao, Thanks for your reply. These are my comments: VCF (and by extension GWAS-VCF) is a glorified TSV with comments on top (personally, I don't like splitting data and meta-data into separate files, for traceability/integrity reasons, and it is easy enough, whether using TSV or any other text-based format, to include the meta-data at the top, with each meta-data line identified by a "comment" prefix, and then just have the tools skip those lines when reading the data; most tools support this option). So the question is not so much VCF vs TSV but how we want to format the information contained in those files. In this regard I strongly believe we should adopt one of the existing standards rather than come up with a new one (since proliferation of formats is what standards were developed to prevent in the first place).

According to the VCF standard, the INFO field is for position/site-level information (information that does not change across samples); sample-level information goes into the sample columns. The major change we made in the GWAS-VCF spec is to use the sample columns to contain trait-level information (specifically, sumstats of variant-trait associations from GWAS). As a consequence, in GWAS-VCF the header of each SAMPLE/TRAIT column contains, instead of a unique sample identifier, a unique trait identifier (e.g., ENSGENEID), which can be further annotated with meta-data in the meta-information lines of the VCF file. In this context, the xQTL-VCF format (by including trait-level information like unique trait identifiers in the INFO field) violates the VCF (and by extension the GWAS-VCF) format specification/philosophy of keeping the concerns of the INFO and SAMPLE fields well separated (the former being about a site/position in the genome, the latter about a sample/trait).
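The "glorified TSV with comments on top" reading described above can be sketched in plain Python: skip the `##` meta-information lines and keep the `#CHROM` line as the header. The file content below is a toy stand-in, not a real (GWAS-)VCF.

```python
import csv

# Toy VCF-like text: '##' meta-data lines on top, then a '#CHROM'
# header line, then tab-separated data lines.
vcf_text = """\
##fileformat=VCFv4.2
##note=toy example, not real data
#CHROM\tPOS\tID\tREF\tALT
chr19\t44908684\trs429358\tT\tC
"""

def read_vcf_body(text):
    """Read a VCF-style text as a TSV, skipping '##' meta-data lines."""
    lines = [ln for ln in text.splitlines() if not ln.startswith("##")]
    lines[0] = lines[0].lstrip("#")  # '#CHROM' header -> 'CHROM'
    return list(csv.DictReader(lines, delimiter="\t"))

records = read_vcf_body(vcf_text)
```

This is essentially what "make the tools skip those lines when reading the data" amounts to; many TSV readers expose it directly as a comment-prefix option.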
The other issue you raised is whether having the same variant (i.e., the same CHROM, POS, REF, ALT values) in multiple rows of a VCF file violates the VCF spec (https://samtools.github.io/hts-specs/VCFv4.3.pdf). Indeed, I wish the VCF spec were a bit more explicit on this point (and indeed we made it explicit in our GWAS-VCF spec). The VCF spec says that "VCF is a text file format [that contains] data lines EACH containing information about a POSITION in the genome". The key word here is POSITION, which is not the same as VARIANT. Indeed, multi-allelic positions, according to the VCF spec, should preferably be expressed as a single row with a "comma-separated list of alternate non-reference alleles" in the ALT field. The VCF spec also says that the ID field should contain a "semicolon-separated list of unique identifiers. No identifier should be present in more than one record". But it also says that "within each reference sequence CHROM it is permitted to have multiple records with the same POS". This was added to allow multi-allelic positions to be "unpacked" into multiple rows, one for each non-reference allele. To be clear and also keep things simple, in the GWAS-VCF spec we made it mandatory to use multiple rows for multi-allelic positions. From all of the above, and regardless of whether a "one-line-per-position" or a "one-line-per-variant" style is followed in a VCF file, it is clear to me that (in parallel with the explicit requirement for the position or variant ID value to be unique within a VCF file) there is an implicit requirement for the combination of CHROM, POS, REF and ALT values to also be unique within a VCF file. Indeed, if you run a VCF file containing duplicate CHROM:POS:REF:ALT combined values through the vcf-validator tool from EBI (the authors of the VCF spec), it will tell you that "According to the VCF specification, the input file is not valid! Error: Duplicated variant found".
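The uniqueness requirement deduced above amounts to a simple check on the CHROM:POS:REF:ALT key. A minimal sketch, with illustrative records rather than real data, of what a validator such as EBI's vcf-validator effectively tests:

```python
from collections import Counter

# Sketch: flag CHROM:POS:REF:ALT combinations that occur in more than
# one record, which is what makes a file "long" by variant-trait pair
# and what a strict VCF validator rejects. Records are toy examples.

def duplicated_variants(records):
    keys = Counter(
        (r["CHROM"], r["POS"], r["REF"], r["ALT"]) for r in records
    )
    return [key for key, n in keys.items() if n > 1]

records = [
    {"CHROM": "chr19", "POS": 44908684, "REF": "T", "ALT": "C"},  # trait 1
    {"CHROM": "chr19", "POS": 44908684, "REF": "T", "ALT": "C"},  # trait 2, same site
]
dups = duplicated_variants(records)
# a non-empty result is what triggers "Duplicated variant found"
```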
But the real question is: why do we need the xQTL-VCF format at all (granted that we should not use the word VCF, for the reasons I outlined above)? Long vs wide seems to me just a stylistic issue that does not justify the creation of a new format. The VCF spec forced us to use the wide instead of the long format. I too prefer the long format, but for that I just read the GWAS-VCF file and transform it in memory or in intermediate files. The other issue you mentioned is the ease of querying using existing VCF query tools. Could you provide me with examples of queries you found difficult to perform using the GWAS-VCF format? In general, we prioritized VCF format compatibility and good software engineering principles over ease of use with specific tools, because there are a lot of different tools that can be used and it is difficult to please everyone. Should we post this discussion on GitHub? Another thing we should do is see how xQTL data would be formatted using the GWAS Catalog standard. Best, Dado
-
Hi Dado, Thank you for your comprehensive message. Regarding whether xQTL-VCF violates VCF standards, I agree with all of what you said, except that I think the INFO field is very flexible and some of its contents do connect with the SAMPLE fields -- for example, when you put the total read depth or the allele frequency of a sample or subsample there. In that regard it does change across samples, right? However, I take your point that each row in a VCF should be a position (which does not have to be unique) and that this position should be independent of the samples. In xQTL-VCF each row is a trait-variant pair, so it definitely has to do with samples, and the ID identifier will thus be unique ("gene:chr:pos:ref:alt"). My previous argument was based on technical compatibility (in order to use a format and its tools) and not so much on the principles on which the format was designed -- although, in my defense, you do see that this notion is not explicit in the VCF specs. Thanks for pointing out that vcf-validator reports a "Duplicated variant" error. That is the only explicit evidence that xQTL-VCF may not meet the VCF specs; again, you cannot tell that from the Google Doc you sent me, so they really should make this more explicit ...

I agree with you on the "real question". I view our discussions above as more of an intellectual exchange about the ideas behind the VCF design (which I'm happy to continue). Now for the real question: those who assert must prove. Since I asserted that the long format has those advantages over the wide format, I'll have to prove that it can achieve essential tasks which are otherwise hard in the wide format. I can imagine the wide format being harder to work with for problems such as "for 8 tissues, select variants near APOE that have p-value < 1E-8 in all brain tissues but not in blood". This would be a couple of lines of bcftools commands for the long format.

However, to be truthful, those advantages are so far just my imagined use cases relevant to the data integration pipelines we developed, and they have not withstood the field test of various xQTLs. Would you mind giving me more time on this? As mentioned in my previous email, I see this as a database problem when it comes to data integration, and from an engineering perspective bcftools seems to have implemented a set of database-language syntax that helps with it. What I aim to justify is that this is not a trivial database management problem to satisfy our proposed integrative analysis, and to list all the cases in our pipeline where we took advantage of the long format as a database and bcftools as the query language. At that point we'll evaluate alternatives. Perhaps we'll find that, after all, it was not that complicated and xQTL-VCF (or whatever we call it) is not worth it. At this point it's a bit too soon for me to tell. Thank you again for the discussions -- I'm all in favor of posting to GitHub. Can I post the original emails as-is, without editing?

This is an important question. We should involve Fanny in a separate discussion ... I'll initiate that. Kindest regards,
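The kind of query described above ("in all brain tissues but not in blood") is straightforward over a long-format table with one row per variant-tissue pair. A minimal sketch in plain Python, with hypothetical tissue names, variants, p-values, and threshold; a real run would use bcftools expressions over the indexed file rather than an in-memory loop:

```python
# Sketch: select variants whose p-value passes a threshold in every
# brain tissue but in no blood tissue, from long-format rows.
# Tissue labels, variants, and p-values are invented for illustration.

BRAIN = {"DLPFC", "AC"}
BLOOD = {"monocyte"}

def hits_in_all_brain_not_blood(rows, threshold=1e-8):
    by_variant = {}
    for r in rows:
        by_variant.setdefault(r["variant"], {})[r["tissue"]] = r["pval"]
    selected = []
    for variant, pvals in by_variant.items():
        brain_ok = all(pvals.get(t, 1.0) < threshold for t in BRAIN)
        blood_ok = all(pvals.get(t, 1.0) >= threshold for t in BLOOD)
        if brain_ok and blood_ok:
            selected.append(variant)
    return selected

rows = [
    {"variant": "chr19:44908684:T:C", "tissue": "DLPFC", "pval": 1e-12},
    {"variant": "chr19:44908684:T:C", "tissue": "AC", "pval": 1e-10},
    {"variant": "chr19:44908684:T:C", "tissue": "monocyte", "pval": 0.3},
]
hits = hits_in_all_brain_not_blood(rows)
```

In the wide layout the same query requires addressing one column per tissue, which is where the claimed convenience of the long format comes from.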
-
Thanks Gao, You should definitely post this thread to the GitHub Discussions (in whichever way you prefer; I am fine with proceeding without editing). Re: the INFO field, of course it contains variant-level information that depends on the samples analyzed (e.g., AF), but it should not contain sample-level information, according to the VCF spec. The GWAS-VCF spec was developed to produce valid VCF files, with only a semantic change from SAMPLE (genotype) to TRAIT (association) for the sample columns. Re: the duplicate CHROM:POS:REF:ALT requirement, I also wish it were made 100% clear and explicit in the VCF spec but, at least in my opinion, it can be deduced from the spec itself, and it is confirmed by the fact that the VCF validator tool developed by the organization that came up with the VCF spec says duplicate variants/sites are not allowed. This issue is not solved by making sure the variant ID is unique for each row. The VCF (and GWAS-VCF) format is centered around variants/sites (the name says it all, and indeed each record/line is a variant/site). The xQTL-VCF format is centered around (multi-tissue) xQTL effects and, in doing so, fundamentally breaks the VCF format (which is not so much a philosophical worry of mine as a worry that it may break/corrupt data processing through standard VCF tools without warning). On the more practical side, as you noted, you are trying to solve a complex data storage and query problem (with multiple layers of one-to-many and many-to-many relations) using a rather simple text file format (VCF) that wasn't really designed for that (it was designed to express a single one-to-many relation between one variant/site and multiple samples).

In my opinion, to solve that problem it would be trivial to computationally read all the source GWAS-VCF files (one for each tissue, for example) into an in-memory or file-based data structure (e.g., a fully flattened long-format data frame) that can be efficiently queried, summarized, visualized, etc. This is similar to an issue I faced when I was a member of the BioMart development team at the EBI, which we solved, to facilitate queries across multiple datasets, by flattening the SQL databases (which remain the official data sources) into a single data table. Hope this helps, Dado
-
Thanks Dado, I agree with everything you said in that message, particularly how, from an engineering perspective, improper usage of tools can be misleading. We too have to be careful not to abuse existing tools in ways that result in silent errors we don't notice! I'm aware of the potential limitations of xQTL-VCF for our problem. It's a choice out of desperation, and it seems to be fine for what I want to do for now, although it has yet to withstand other tests. As I mentioned in my previous emails, the ideal solution is a more formal database, but I don't think that is trivial. The database tables can follow directly from the GWAS Catalog format, or whatever plain long format eases queries -- that part is trivial. Writing the program to build that database is less trivial: importing into sqlite3 at our scale of data is nearly impossible given my experience with sqlite3 (it was one of my main programming tools during 2011 to 2016, although things may have changed now), and I don't like server-based solutions such as MySQL. Even if we had time to research and pick a good database engine, I don't think our team has the time to learn it and rewrite the infrastructure, including I/O for the various other software used for data integration, given the timeline we are trying to keep. Database corruption is also a huge concern, and plain text is more robust. In contrast, working out a set of unit tests for the operations we use in our pipeline seems easier and should address the concern about abusing tools. Still, if you know of a specific database engine suited to the task that works for large consortium projects, please let me know. I'll post these on GitHub and draft Fanny an email by the end of the day. Kindest regards,
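A unit test of the kind mentioned above might, for example, check that a pipeline operation neither drops nor deduplicates the repeated variant rows the long format relies on. A minimal sketch, with a toy coordinate sort standing in for a real bcftools step and invented records:

```python
# Sketch: guard against silent loss of duplicated variant rows when
# a long-format file passes through a processing step. The sort below
# is a stand-in for whatever tool the pipeline actually invokes.

def coordinate_sort(records):
    """Toy stand-in for a coordinate-sorting step."""
    return sorted(records, key=lambda r: (r["CHROM"], r["POS"]))

records = [
    {"CHROM": "chr19", "POS": 44908684, "trait": "geneA"},
    {"CHROM": "chr1", "POS": 100, "trait": "geneB"},
    {"CHROM": "chr19", "POS": 44908684, "trait": "geneC"},  # duplicate site
]
out = coordinate_sort(records)

# The properties a unit test would assert:
assert len(out) == len(records)      # no rows silently dropped
assert out[0]["CHROM"] == "chr1"     # output is coordinate-ordered
```

Running such checks on small fixtures after each tool in the pipeline is a cheap way to detect the "silent error" failure mode discussed above.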
-
@marcora shall we continue the relevant discussions here? For example, if you have time, could you elaborate on what BioMart used?
-
Background: