Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparison between MungeSumstat and our sumstat_merger #212

Closed
hsun3163 opened this issue Apr 6, 2022 · 7 comments
Closed

Comparison between MungeSumstat and our sumstat_merger #212

hsun3163 opened this issue Apr 6, 2022 · 7 comments

Comments

@hsun3163
Copy link
Collaborator

hsun3163 commented Apr 6, 2022

Following review is based purely on the MungeSumstat documentation as MungeSumstat require R4.1 which is not on our cluster. Fields, where sumstat_merger is advantageous, was in bold


Summary:

MungeSumstat is a more holistic toolset that has more utilities and potentially better generalizability (Ours is more like in-house unification).

However, the drawback of MungeSumstat is its dependency on the Bioconductor reference genome, which each requires quite a long time to download, and its usage of R, which may have poorer performance.

It is unrealistic for our package to achieve MungeSumstat level's functionality, however, as a lot of function from MungeSumstat is not needed. if the drawback mentioned above is resolvable, then perhaps rewriting the sumstat merger with munge sumstat R code is a better idea and should prove relatively simple as MungeSumstat is somehow off the shelf.


Scope:
MungeSumstat aim to reformat 1 sumstat format as well as perform a variety of qc and quality of life changes
sumstat_merger: aim to reformat 1 set of sumstat file

Input:
MungeSumstat: 1 sumstat of both table format and vcf format
sumstat_merger: a list of sumstat in table format

Headers
MungeSumstat: Infer header automatically based on an internal map file, allow user to input customized mapping files.
sumstat_merger: Allowed user to input customizable yml to specify the matrix

Reference:
MungeSumstat: Require download of independent GRCh37/GRCh38 using Bioconductor
sumstat_merger: 1 of the sumstat of the input or 5-columns tables.

Environments:
MungeSumstat require R 4.1
sumstat_merger require python 3.7


Utilities: Following utilities are provided by MungeSumstat but not in sumstat_merger:

  • convert_small_p Binary, should p-values < 5e-324 be converted to 0? Small p-values pass the R limit and can cause errors with LDSC/MAGMA and should be converted. Default is TRUE.

  • convert_n_int Binary, if N (the number of samples) is not an integer, should this be rounded? Default is TRUE. analysis_trait If multiple traits were studied, name of the trait for analysis from the GWAS. Default is NULL

  • INFO_filter 0-1 The minimum value permissible of the imputation information score (if present in sumstatsfile). Default 0.9

  • FRQ_filter 0-1 The minimum value permissible of the frequency(FRQ) of the SNP (i.e. Allele Frequency (AF)) (if present in sumstats file). By default no filtering is done, i.e. value of 0.

  • pos_se Binary Should the standard Error (SE) column be checked to ensure it is greater than 0? Those that are, are removed (if present in sumstats file). Default TRUE.

  • effect_columns_nonzero Binary should the effect columns in the data BETA,OR (odds ratio),LOG_ODDS,SIGNED_SUMSTAT be checked to ensure no SNP=0. Those that do are removed(if present in sumstats file). Default TRUE.

  • N_std Numeric, the number of standard deviations above the mean a SNP’s N is needed to be removed. Default is 5. N_dropNA controls whether the SNPs with a missing N value are dropped or not (Default is TRUE).

  • rmv_chr vector or character The chromosomes on which the SNPs should be removed. Use NULL if no filtering necessary. Default is X, Y and mitochondrial.rmv_chrPrefix controls whether “chr”/“CHR” is removed from chromosome names (Default is TRUE).

  • on_ref_genome Binary, should a check take place that all SNPs are on the reference genome by SNP ID. Any SNPs not on the reference genome, will be corrected from the reference genome (if possible) using the chromosome and base pair position data. Default is TRUE

  • snp_ids_are_rs_ids Binary, should the SNP IDs inputted be inferred as RS IDs or some arbitrary ID. Default is TRUE.

  • frq_is_maf Binary, conventionally the FRQ column is intended to show the minor/effect allele frequency (MAF) but sometimes the major allele frequency can be inferred as the FRQ column. This logical variable indicates that the FRQ column should be renamed to MAJOR_ALLELE_FRQ if the frequency values appear to relate to the major allele i.e. >0.5. By default mapping won’t occur i.e. is TRUE.

  • sort_coordinates Whether to sort by coordinates of resulting sumstats.

  • ldsc_format Ensure that output format meets all requirements to be passed directly into LDSC without the need for additional munging.

  • imputation_ind Binary Should a column be added for each imputation step to show what SNPs have imputed values for differing fields. This includes a field denoting SNP allele flipping (flipped). On the flipped value, this denoted whether the alelles where switched based on MungeSumstats initial choice of A1, A2 from the input column headers and thus may not align with what the creator intended.Note these columns will be in the formatted summary statistics returned.

  • force_new If a formatted file of the same names as exists, formatting will be skipped and this file will be imported instead (default). Set to override this.


Exclusion:
MungeSumstat automatically rejects SNPs with following problems, while the sumstat_merger didn't

  • snp_multi_rs_one_row - Where the SNP (RS ID) contained more than one RS ID.

  • snp_missing_rs - Where the SNP (RS ID) was missing the rs prefix. Note that these are only removed when other snps have an rs prefix.

  • snp_multi_colon - Where the SNP ID has mutliple colons (“:”) in one SNP.

  • snp_not_found_from_bp_chr - Where the RS ID was attempted to be imputed from the CHR and BP (Base-Pair) information, using the reference genome, but wasn’t successful.

  • chr_bp_not_found_from_snp - Where the CHR and BP (Base-Pair) was attempted to be imputed from the SNP (RS ID), using the reference genome, but wasn’t successful.

  • alleles_not_found_from_snp - Where the alleles (A1 and/or A2) was attempted to be imputed from the SNP (RS ID), using the reference genome, but wasn’t successful.

  • alleles_dont_match_ref_gen - Where the alleles (A1 and/or A2) don’t match what’s on the reference genome.

  • missing_data - Where there is data missing across the inputted columns.

  • dup_snp_id - Where the SNP ID is duplicated in the input.

  • dup_base_pair_position - Where the base-pair position is duplicated in the input.

  • info_filter - SNP INFO value below the specified threshold.

  • se_neg - SNPs SE (Standard Error) value is 0 or negative.

  • effect_col_zero - SNPs effect column(s) value is zero e.g. BETA=0.

  • n_large - SNPs N is N standard deviations greater than the mean.

  • n_null - SNPs N is null.

  • chr_excl - SNP is on a chromosome specified to be excluded.

@gaow
Copy link
Contributor

gaow commented Apr 7, 2022

Thanks @hsun3163 . I do agree we should not reinvent wheels if possible. Howeer I think the focus should be what features that we have and they don't, and for those features, can we easily (re)implement using mungesumstat package.

#153

From my recollection, these might be features we have and thy dont?

  1. support merging with respect to a customized reference target, whereas they require merging with reference genome. That might not be a bit issue if we also QC our LD reference panel to a reference genome
  2. Merging multiple data-sets -- but in mungesumstat if we can allow for merge without taking intersections (keep variants unique to data-sets as is Summary statistics merger allow unique variants #54 ), then we can loop through data and merge in pair-wise fashion
  3. Handling of indels

3 might be tricky. @hsun3163 could you look into that? Finally as @hsun3163 pointed out, it's hard to tell how efficient it works .

@hsun3163
Copy link
Collaborator Author

hsun3163 commented Apr 7, 2022

For 1, it is not just we qc our LD to a reference genome, but the reference genome to their choosing, which are

BiocManager::[install](https://rdrr.io/pkg/BiocManager/man/install.html)("SNPlocs.Hsapiens.dbSNP144.GRCh38")
BiocManager::[install](https://rdrr.io/pkg/BiocManager/man/install.html)("BSgenome.Hsapiens.NCBI.GRCh38")
BiocManager::[install](https://rdrr.io/pkg/BiocManager/man/install.html)("SNPlocs.Hsapiens.dbSNP144.GRCh37")
BiocManager::[install](https://rdrr.io/pkg/BiocManager/man/install.html)("BSgenome.Hsapiens.NCBI.GRCh37")

For 2, they dont really consider the merging concept, but will produce a vcf if desired. And since each vcf is standarized to their default format. bcftools can merged them directly (In other word, MungeSumstat replace the unify_sumstat and sumstat_to_vcf step of our sumstat merger, but not the actual merging one.)

For 3, they dont handle indels because indel is not included in their reference.

indels Binary does your Sumstats file contain Indels? These don’t exist in our reference file so they will be excluded from checks if this value is TRUE. Default is TRUE.

@gaow
Copy link
Contributor

gaow commented Apr 7, 2022

1 and 3 sound bad, 2 sounds good. But perhaps we can do SNPs with 2 and still use our routines to somewhat handle 3? The problem now is just 1, whether or not we can use a customized reference for it.

@hsun3163
Copy link
Collaborator Author

hsun3163 commented Apr 7, 2022

Another draw back I found is that, I dont think they can handle geneXsnp pair as we did.

Potentially, in their VCF output, if GENE info is stored in the info field we can still merge them. But as I am still trying to get it to work so need to verify whether this work.


For 1, it seems they acquire ref information based on function provided by the BSgenome (Infrastructure shared by all the Biostrings-based genome data packages.). So best case scenario is that we have our ref as a Biostrings-based genome data packages, which again dont have indel.

For 3, our routine in nature require multiple sumstat table. So one way of doing it is use our routine to scan through the output of MungeSS. However if we are to use our routine on the output of MungeSumstat. We may as well simply use ours and then use MungeSS only if needs arise.

@marcora
Copy link
Collaborator

marcora commented Apr 7, 2022

According to the GWAS-VCF format, the trait (in the case of eQTL datasets the gene id) should go into the SAMPLE column, not in the INFO field.

@gaow
Copy link
Contributor

gaow commented Apr 7, 2022

@marcora thank you and indeed this makes sense to trans-QTL, as I understand it. But I'm still not sure what's the best format for cis-QTL. I have initialized a disucssion MRCIEU/gwasvcf#18 and just bumped it with my concerns. May I suggest that we continue the discussion there so more people on the GWAS-VCF team can see it?

@hsun3163
Copy link
Collaborator Author

hsun3163 commented Apr 8, 2022

As illustrated in MRCIEU/gwasvcf#18, the only way for bcftools merge to merge geneXsnp pair correctly is to add gene information as a prefix in the ID, which added another reason for not to use MungeSumstat for it cant automatically generate such ID field.

I implement the gene as prefix of ID change to our ss_2_vcf function. The output was as followed:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20220408
##FORMAT=<ID=STAT,Number=1,Type=Float,Description="Effect size estimate relative to the alternative allele">
##FORMAT=<ID=SE,Number=1,Type=Float,Description="Standard error of effect size estimate">
##FORMAT=<ID=P,Number=1,Type=Float,Description="The Pvalue corresponding to ES">
##INFO=<ID=GENE,Number=1,Type=String,Description="The name of genes">
##FORMAT=<ID=TSS_D,Number=1,Type=Integer,Description="Customized Field TSS_D">
##FORMAT=<ID=AF,Number=1,Type=Float,Description="Customized Field AF">
##FORMAT=<ID=MA_SAMPLES,Number=1,Type=Integer,Description="Customized Field MA_SAMPLES">
##FORMAT=<ID=MA_COUNT,Number=1,Type=Integer,Description="Customized Field MA_COUNT">
##FORMAT=<ID=GENE.1,Number=1,Type=Str,Description="Customized Field GENE.1">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ALL
22      15281327        DUXAP8:chr22:15281327_C_T       C       T       .       PASS    GENE=DUXAP8     STAT:SE:P:TSS_D:AF:MA_SAMPLES:MA_COUNT:GENE.1                   0.15173951:0.08768614:0.0844871543707946:-503632:0.16746987:125:139:DUXAP8
22      15281373        DUXAP8:chr22:15281373_G_A       G       A       .       PASS    GENE=DUXAP8     STAT:SE:P:TSS_D:AF:MA_SAMPLES:MA_COUNT:GENE.1                   0.15173951:0.08768614:0.0844871543707946:-503586:0.16746987:125:139:DUXAP8
22      15282081        DUXAP8:chr22:15282081_C_A       C       A       .       PASS    GENE=DUXAP8     STAT:SE:P:TSS_D:AF:MA_SAMPLES:MA_COUNT:GENE.1                   -0.002963132:0.09356474:0.9747550803873822:-502878:0.14337349:109:119:DUXAP8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants