Fraction Column is lost and reevaluated by MSStats #174

Open
tillenglert opened this issue Nov 23, 2021 · 19 comments

@tillenglert

I'm currently adding MSFragger as a search engine for ProteomicsLFQ. When running the minimal test profile, I ran into an issue with MSstats: the tool could not figure out the fractionation of the samples and stopped execution with the following message:

"** It is hard to find the same fractionation across sample, due to lots of overlapped features between fractionations.
	                 Please add Fraction column in input."

Searching for the cause, I looked into the MSstats source code, specifically the function OpenMStoMSstatsFormat, which preprocesses the data before dataProcess is called.
This function keeps only the required columns of the ProteomicsLFQ out.csv, which are the following:

requiredinput.general <- c("ProteinName", "PeptideSequence", "PrecursorCharge", 
                                "FragmentIon", "ProductCharge", "IsotopeLabelType",
                                "Condition", "BioReplicate", "Run", "Intensity")

source: https://rdrr.io/bioc/MSstats/src/R/OpenMStoMSstatsFormat.R (MSstats 3.22)

This drops the Fraction column. It did not lead to an error with the Comet or MSGF+ search engines, because MSstats analyses the features and can detect whether runs are technical replicates or fractionated samples if the features are clear enough. My guess is that with MSFragger there were too many overlapping features and, at the same time, too many duplicated features across fractions and samples.

When I tested the newest version of MSstats (4.2), it assigned the fractions correctly. That version depends on MSstatsConvert, which contains the converters for the different MS tools. So upgrading might make the ProteomicsLFQ pipeline more robust, especially since the fraction information is otherwise lost.
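
To make the column loss concrete, here is a minimal Python sketch of what happens when a converter keeps only a fixed list of required columns. It is an illustration, not MSstats code: the column list is copied from the snippet above, while the two sample rows and the helper name `keep_required` are made up.

```python
import csv
import io

# Required columns, copied from the OpenMStoMSstatsFormat snippet quoted above.
REQUIRED = ["ProteinName", "PeptideSequence", "PrecursorCharge",
            "FragmentIon", "ProductCharge", "IsotopeLabelType",
            "Condition", "BioReplicate", "Run", "Intensity"]

# Hypothetical two-row export that also carries a Fraction column.
SAMPLE_CSV = (
    "ProteinName,PeptideSequence,PrecursorCharge,FragmentIon,ProductCharge,"
    "IsotopeLabelType,Condition,BioReplicate,Run,Fraction,Intensity\n"
    "P1,PEPTIDEK,2,NA,0,L,A,1,1,1,1000.0\n"
    "P1,PEPTIDEK,2,NA,0,L,A,1,2,2,1200.0\n"
)


def keep_required(rows, required=REQUIRED):
    """Project each row onto the required columns, dropping everything else."""
    return [{col: row[col] for col in required} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
projected = keep_required(rows)

print("Fraction" in rows[0])       # True: present in the export
print("Fraction" in projected[0])  # False: gone after the projection
```

After the projection, the only way left to recover the fractionation is to infer it from the features, which is the step that failed here.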

@jpfeuffer
Collaborator

I think it would be better if OpenMS just exported a Fraction column correctly, instead of hoping for a correct guess by MSstats.

@jpfeuffer
Collaborator

@timosachsenberg I have no idea why this is not the case. I thought we export everything.

@jpfeuffer
Collaborator

I also did a PR to MSstats once to address this issue. Maybe it did not make it into 3.22? Did you check 3.22.1 or whatever else came before 4?
I never got 4 to work with newer OpenMS versions, because OpenMS does not build on bioconda anymore and is, I think, incompatible with some dependencies.

@jpfeuffer
Collaborator

@tillenglert
Author

https://github.com/Vitek-Lab/MSstats/blob/3a3acbbd37f3cdebbb8db7bf165c96306f732e2d/R/converters.R#L234

Seems not to be in the code anymore, after they changed their code structure!

@tillenglert
Author

I tested with 3.22.1, which should be the latest version before v4.

And yes, v4 cannot currently be used in the nf-core/proteomicslfq Docker image. For testing (v4.2.0) I had to build another container.

@timosachsenberg

> @timosachsenberg I have no idea why this is not the case. I thought we export everything.

Yeah, we checked. We export it, and it seems that the issue is on the MSstats side (see Till's comments).
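
For anyone who wants to repeat that check on their own export, the following Python sketch shows one way to do it. It is a hypothetical helper, not part of the pipeline, and it assumes the export is a comma-separated file with `Run` and `Fraction` columns; the inline data is made up.

```python
import csv
import io
from collections import defaultdict


def fractions_per_run(handle):
    """Return {run: sorted fraction labels}; raise if there is no Fraction column."""
    reader = csv.DictReader(handle)
    if "Fraction" not in (reader.fieldnames or []):
        raise ValueError("no Fraction column in export")
    seen = defaultdict(set)
    for row in reader:
        seen[row["Run"]].add(row["Fraction"])
    return {run: sorted(fracs) for run, fracs in seen.items()}


# Made-up example: two runs, each belonging to exactly one fraction.
example = io.StringIO("Run,Fraction,Intensity\n1,1,10\n1,1,12\n2,2,9\n")
print(fractions_per_run(example))  # {'1': ['1'], '2': ['2']}
```

On a real file the same function can be called as `fractions_per_run(open("out.csv", newline=""))`. A run that maps to more than one fraction label would point to a problem in the export rather than in MSstats.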

@jpfeuffer
Collaborator

jpfeuffer commented Dec 2, 2021

Can you find out why it is not compatible? In theory the openms::openms2.7.0pre package should be built with the latest conda packages. 2.6.0 from bioconda is of course outdated.
It could be that some thirdparties clash in the openms-thirdparty package. I already removed some of them (maybe some of them can be fixed by conda rebuilds/updates). In the worst case we use openms and only add the ones we need separately.

I think this would be the way forward.
Otherwise we need to monkey patch the function in our R code. I remember having done such a thing before in my own scripts.

@tillenglert
Author

tillenglert commented Dec 3, 2021

proteomicslfq_docker_build.log

Attached is the log of the dockerfile build of nf-core/proteomicslfq with the following environment.yml:

name: nf-core-proteomicslfq-1.0.0
channels:
  - openms
  - conda-forge
  - bioconda
dependencies:
  - openms::openms
  - openms::openms-thirdparty
  - bioconda::bioconductor-msstats=4 # will include R
  - bioconda::sdrf-pipelines=0.0.9 # for SDRF conversion
  - conda-forge::r-ptxqc=1.0.5 # for QC reports
  - conda-forge::xorg-libxt=1.2.0 # until this R fix is merged: Update run dependencies for capabilities() and grSoftVersion() conda-forge/r-base-feedstock#128
  - conda-forge::fonts-conda-ecosystem=1 # for the fonts in QC reports
  - conda-forge::python=3.8.5
  - conda-forge::markdown=3.2.2
  - conda-forge::pymdown-extensions=8.0.1
  - conda-forge::pygments=2.7.1

So there are conflicts but conda can't figure out where.

@jpfeuffer
Collaborator

I would try "mamba" to find the conflicts. Conda is basically useless for this, and in this case it even seems to be bugged.
I think you can just install mamba instead of conda and use the same commands.

@tillenglert
Author

After some testing I finally managed to include MSstats v4.2, but for this I needed to change the version of python (to v3.9) and ptxqc (to v1.0.12). Unfortunately, this leads to an error in ptxqc when running the test profile. The current environment is:

name: nf-core-proteomicslfq-1.0.0
channels:
  - openms
  - conda-forge
  - bioconda
dependencies:
  - openms::openms=2.7.0pre
  - openms::openms-thirdparty=2.7.0pre
  - bioconda::bioconductor-msstats=4.2 # will include R
  - bioconda::sdrf-pipelines=0.0.9 # for SDRF conversion
  - conda-forge::r-ptxqc=1.0.12 # for QC reports
  - conda-forge::xorg-libxt=1.2.0 # until this R fix is merged: Update run dependencies for capabilities() and grSoftVersion() conda-forge/r-base-feedstock#128
  - conda-forge::fonts-conda-ecosystem=1 # for the fonts in QC reports
  - conda-forge::python=3.9
  - conda-forge::markdown=3.2.2
  - conda-forge::pymdown-extensions=8.0.1
  - conda-forge::pygments=2.7.1

The error of ptxqc is the following:

Loading required package: PTXQC
Loading package PTXQC (version 1.0.12)
Error in file.exists(pattern = mqpar_filename) : invalid 'file' argument
Calls: createReport -> getMetaFilenames -> getMQPARValue -> file.exists
In addition: Warning messages:
1: In (function (parents, id = names(parents), name = id, obsolete = setNames(nm = id, :
Some parent terms not found: MS:1001456
2: In (function (parents, id = names(parents), name = id, obsolete = setNames(nm = id, :
Some parent terms not found: UO:0000000
Execution halted

@timosachsenberg

I will ask @cbielow if he knows what the issue is here

@cbielow

cbielow commented Dec 6, 2021

I cannot find anything obviously wrong with the code in PTXQC.
There should be a warning() (not an error) on the console which provides further details if mqpar.xml cannot be found, but your output has none... this is a bit strange.
Can someone point me to the script and the data that you are actually running?!

@jpfeuffer
Collaborator

Why does it want an mqpar.xml at all? We input mztab.

@cbielow

cbielow commented Dec 7, 2021

It's quite an unusual combination indeed, but the mqpar.xml is used to find some threshold parameters, if available.

@tillenglert
Author

tillenglert commented Dec 7, 2021

The script I'm using is this nextflow script:

https://github.com/tillenglert/proteomicslfq/blob/master/main.nf#L1304

with this config (testfiles):
https://github.com/tillenglert/proteomicslfq/blob/master/conf/test.config#L20

As I'm still working on msfragger I tested the ptxqc process with comet. The logs and inputfiles are attached to this comment:
ptxqc_logs.zip

@cbielow

cbielow commented Dec 7, 2021

The error is fixed in the current development version of PTXQC.
It will be some time before the new version is published.

Since this is a regression, the last working version should be PTXQC v1.0.10 (May 2021).
If you can use that version for the time being, the bug should be resolved.

@tillenglert
Author

Ah perfect! I hadn't tried this version, but it's working and compatible with the remaining packages.

This is the current environment I'm using, which works for MSstats and PTXQC:

name: nf-core-proteomicslfq-1.0.0
channels:
  - openms
  - conda-forge
  - bioconda
dependencies:
  - openms::openms=2.7.0pre
  - openms::openms-thirdparty=2.7.0pre
  - bioconda::bioconductor-msstats=4.2 # will include R
  - bioconda::sdrf-pipelines=0.0.9 # for SDRF conversion
  - conda-forge::r-ptxqc=1.0.10 # for QC reports
  - conda-forge::xorg-libxt=1.2.0 # until this R fix is merged: Update run dependencies for capabilities() and grSoftVersion() conda-forge/r-base-feedstock#128
  - conda-forge::fonts-conda-ecosystem=1 # for the fonts in QC reports
  - conda-forge::python=3.9
  - conda-forge::markdown=3.2.2
  - conda-forge::pymdown-extensions=8.0.1
  - conda-forge::pygments=2.7.1

@jpfeuffer
Collaborator

Feel free to open a PR with the environment update
