Error in sEV_recognizer - output path file creation issue w/ multiple samples? #13

Open
123chrisc opened this issue Feb 21, 2024 · 6 comments

Comments

@123chrisc

123chrisc commented Feb 21, 2024

Hello! I appear to be running into issues with initializing the correct file paths/sample file. I am a relatively new Python coder, so my apologies if I am providing insufficient information. Please let me know and I will do my best to correct it.

For context, my directory is formatted as such:

raw_data
--sample1
----raw_feature_bc_matrix
------barcodes.tsv.gz
------features.tsv.gz
------matrix.tsv.gz
--sample2
----raw_feature_bc_matrix
------barcodes.tsv.gz
------features.tsv.gz
------matrix.tsv.gz
...
--sample6

Within the raw_data folder, I have the sample_file.txt file (attached), containing the relative paths to my files. I initially tried entering the absolute file paths (as recommended by the documentation: 'Here, first parameter was the abosulte path of each sample row by row.'), but received the same error as below.

When I run the following code:
SEVtras.sEV_recognizer(input_path='./', sample_file='./raw_data/sample_file.txt', out_path='./sev_results', species='Homo', dir_origin=False, predefine_threads=30)

I receive the following output:

0 1
1 1

FileNotFoundError Traceback (most recent call last)
/tmp/ipykernel_90051/889495166.py in <module>
----> 1 SEVtras.sEV_recognizer(input_path='./',sample_file='./raw_data/sample_file.txt', out_path='./sev_results', species='Homo',dir_origin=False,predefine_threads=30)

~/anaconda3/envs/SEVtras_env/lib/python3.7/site-packages/SEVtras/main.py in sEV_recognizer(sample_file, out_path, input_path, species, predefine_threads, get_only, score_t, search_UMI, alpha, dir_origin)
155 pass
156 else:
--> 157 os.mkdir(str(out_path) + '/tmp_out/' + sample)
158
159 adata.write(str(out_path) + '/tmp_out/' + sample + '/raw_' + sample + '.h5ad')

FileNotFoundError: [Errno 2] No such file or directory: './sev_results/tmp_out/raw_data/sample1/raw_feature_bc_matrix'

I am hoping for some guidance on how to tackle, or some increased clarity on the correct file naming/path procedures. Thank you!

@RuiqiaoHe
Member

You have the right understanding of how to input files for SEVtras. However, this is a bug in SEVtras: "os.mkdir" cannot create a directory if the parent directory does not exist. So, if you use dir_origin=False with the current version of SEVtras, sample_file.txt can only list sample names, not paths. I will change "os.mkdir" to "os.makedirs" to solve this problem.
Here is the solution for the current SEVtras v0.2.8:
SEVtras.sEV_recognizer(input_path='./raw_data/', sample_file='./raw_data/sample_file.txt', out_path='./sev_results', species='Homo', dir_origin=False, predefine_threads=30)
List in sample_file.txt:
sample1
sample2
......
sample6
The directory should be formatted as follows:
raw_data
--sample1
----barcodes.tsv.gz
----features.tsv.gz
----matrix.tsv.gz
--sample2
----barcodes.tsv.gz
----features.tsv.gz
----matrix.tsv.gz
...
--sample6
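
A minimal sketch of how the sample list above could be generated, assuming every subdirectory of raw_data is a sample folder laid out as shown. The helper name write_sample_file is hypothetical and not part of SEVtras; it writes sample names only (no paths), matching the workaround described for v0.2.8.

```python
from pathlib import Path


def write_sample_file(raw_dir, sample_file_name="sample_file.txt"):
    """Write one sample name per line for each subdirectory of raw_dir.

    Hypothetical helper, not part of SEVtras. Assumes every
    subdirectory of raw_dir is a sample folder.
    """
    raw_dir = Path(raw_dir)
    # Sort so the sample order is stable between runs.
    samples = sorted(p.name for p in raw_dir.iterdir() if p.is_dir())
    # Names only, no leading path, so later path joins stay one level deep.
    (raw_dir / sample_file_name).write_text("\n".join(samples))
    return samples
```

Calling write_sample_file('./raw_data') would then produce the file passed as sample_file in the call above.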

By the way, the output "1 1" means that SEVtras found only one representative gene for sEV identification, so I recommend lowering the alpha argument, for example alpha=0.09.
Thanks for your testing.
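
The directory-creation failure described above can be reproduced outside SEVtras. The sketch below is not the SEVtras patch itself; it only mimics the path built in main.py (out_path + '/tmp_out/' + sample) with a temporary directory standing in for out_path, and shows that os.mkdir fails when parent directories are missing while os.makedirs creates them.

```python
import os
import tempfile

# A temporary directory stands in for out_path (assumption for illustration).
out_path = tempfile.mkdtemp()
# A sample entry that contains nested directories, as in the original report.
sample = "raw_data/sample1/raw_feature_bc_matrix"
target = os.path.join(out_path, "tmp_out", sample)

try:
    # os.mkdir creates only the last path component, so it fails here
    # because 'tmp_out/raw_data/sample1' does not exist yet.
    os.mkdir(target)
    mkdir_failed = False
except FileNotFoundError:
    mkdir_failed = True

# os.makedirs creates every missing parent; exist_ok=True makes reruns safe.
os.makedirs(target, exist_ok=True)
```

With plain sample names such as sample1, only one new directory level is needed under tmp_out, which is consistent with the workaround of listing names rather than paths.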

@123chrisc
Author

Dear Dr. He,

Thank you so much for the quick response! I will try lowering the alpha to 0.09.

I had 3 other questions that came up in the meantime:

  1. Processing time

In my previous runs, it took over 1 hour to complete one of the six samples with 32 CPUs (predefine_threads set to 30) and 128 GB of memory. After looking at some of your other threads, I see that this time might be unusually long (#10). However, I am running SEVtras on a Mac (via a Python script), the multiprocessing appears to be working, and I am using 10x Genomics scRNA-seq data. Any ideas on how to speed up processing?

  2. Running SEVtras on different conditions (e.g., healthy vs. diseased states)

I was wondering if SEVtras should be run on all samples, or if the samples should be split and run by condition. For example, I am working with a dataset that contains 3 healthy and 3 diseased samples: should I run SEVtras once on all 6, or do 2 runs of 3 samples?

  3. Downstream analysis

Finally, I was hoping to inquire about some of the downstream steps after sEV_recognizer is done running.

  • I am expecting to receive raw_SEVtras.h5ad and sEVs_SEVtras.h5ad as output (sEVs_SEVtras.h5ad will be used in the ESAI_calculator step).
  • I should make a copy of the raw adata_cell (adata_cell.raw). Is it correct that adata_cell.raw is just the object created by sc.read_10x_mtx() from the raw count matrix? (e.g., https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html, line 4, but using raw data)
  • When annotating adata_cell, should I use the regular filtered matrix (from the 10x processing steps)?
  • I was planning on using Seurat for the basic preprocessing and cell-type annotation workflow (as I have already completed these steps for my dataset). However, my preprocessing includes filtering and batch correction. It is unclear to me how SEVtras_combined.h5ad will factor in batch correction (for the sEV 'cells'). Or does this even matter?
  • Finally, can I perform downstream analysis treating the labelled sEVs as cells (e.g., cell-cell communication analysis)?

Thank you again for your time and all your package support!

@RuiqiaoHe
Member

Thank you for your meaningful questions about SEVtras.

  1. The runtime of your example seems to be in the normal range. You can try to speed it up by increasing the number of CPUs, provided there is enough memory.
  2. I suggest that you run all samples at once. SEVtras is a data-driven algorithm, so more samples mean more robust identification results. It also ensures that the two conditions are compared by the same standard.
    3_1. The two files are created in the output directory.
    3_2. It depends on the "Xraw" argument of SEVtras.ESAI_calculator. Here, raw means the raw count matrix, unprocessed and unfiltered. Yes, the raw 10X output with only cell type information added is the correct input.
    3_3. I suggest using the filtered matrix from the conventional 10X procedure.
    3_4. SEVtras can tolerate batch effects to a certain extent. I recommend that you run SEVtras on samples from the same dataset, which may have only a slight batch effect. If you want to apply it to multiple datasets, you can give SEVtras a raw dataset with the batch effect corrected.
    3_5. It is worth trying, and it would extend the analysis of SEVtras.

@123chrisc
Author

Dear Dr. He,

Thank you for your detailed responses! Your support is greatly appreciated.

I wanted to clarify 3_2: how would we go about adding cell type information to raw and unfiltered data? Most standard workflows seem to apply some sort of filtering (e.g., by number of genes) or regression (e.g., of mitochondrial genes) prior to the cell type annotation steps. Are you suggesting that we bypass this standard workflow when working with the adata_cell.raw object? Or are you instead suggesting that we run the standard preprocessing pipeline, but on the raw_matrix (rather than the filtered_matrix)?


On a different note, I proceeded with the analysis, setting 'Xraw = False', and using adata_cell (regular 10x filtered matrix, regular Seurat preprocessing steps), similar to the recommendations you made prior.

I was able to successfully run ESAI_calculator, but noticed that my ESAIumap plots look slightly different. For some reason, my ESAIumap visualization appears to have a high alpha for the yellow sEV cluster; but in the example plot you provided, the sEV cluster is removed (to better visualize the sEV cell type contributions).
ESAIumap.pdf

@RuiqiaoHe
Member

If 3_2 refers to how to add cell type information, my suggestion is similar to the answer to 3_3. You can obtain the cell type information with your own processing procedure, including filtering or regressing. The input for SEVtras then depends on your choice of the Xraw argument. For example, you can input the raw adata with the cell type information generated from the filtered one. I suggested this previously because I am worried that sEV-characteristic genes might be filtered out in the regular preprocessing and filtering steps of the cell matrix; using the raw adata as input prevents this.
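
One way to picture "raw adata with the cell type information generated from the filtered one" is a barcode lookup. The sketch below uses plain Python containers rather than AnnData; the function name transfer_cell_types, the example barcodes, and the placeholder label for barcodes that were never annotated are all assumptions for illustration. This thread does not state how SEVtras expects unannotated barcodes to be labelled.

```python
def transfer_cell_types(raw_barcodes, annotated_cell_types,
                        missing_label="unannotated"):
    """Return one cell-type label per raw barcode, in raw_barcodes order.

    annotated_cell_types maps barcode -> cell type and comes from the
    filtered, annotated object. Barcodes that were filtered out before
    annotation receive missing_label (an assumed placeholder).
    """
    return [annotated_cell_types.get(bc, missing_label) for bc in raw_barcodes]


# Illustrative barcodes only: the second one was dropped by filtering.
raw_barcodes = ["AAAC-1", "AAAG-1", "AACT-1"]
annotated = {"AAAC-1": "T cell", "AACT-1": "B cell"}
labels = transfer_cell_types(raw_barcodes, annotated)
```

In an AnnData workflow the resulting list would typically be stored as a column of the raw object's obs table, keeping the count matrix itself unfiltered.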
As for the slightly different figure, is there any sEV information in adata_cell? Or is the sEV key in your adata_ev different from the default OBSev='sEV'? I may need more information to fix the bug. All the information for plotting this figure is saved in "SEVtras_combined.h5ad", so you can use that file to reproduce the plot.

@kingwzun

kingwzun commented Aug 2, 2024

Hi 123chrisc, can I get your email? I would like to get the code that generates adata_cell for ESAI_calculator. (My email is kingwzun@gmail.com)
