Using Spectre on very very large datasets #170

Alby86 · 2023-08-04T07:55:39Z

Hello,

Thank you so much for this wonderful package!

I have been using the function 'Spectre::read.files' to read a large number of .fcs files, but it cannot cope: either R crashes or the function gives up. I am running it on a cluster, so I am not sure whether the issue is just space. Any ideas?
To work around this problem, I sub-sampled the data to 1% of the total (around 52 million cells), and then everything worked fine: I identified clusters that make sense and performed the downstream statistics. Now, though, I would like to label the remaining data (I don't expect to see new clusters) and perform statistics over the whole dataset. How can I use the same FlowSOM model to predict the clusters? I know I can do that with FlowSOM, but I can't find any documentation for it in Spectre. Is there such an option?

Thank you in advance!

SamGG · 2023-08-04T08:51:22Z

Hi,

I think you should look at how classification is described in the Spectre article. The relevant sections are "3.6 Automated cellular classification and label transfer between aligned datasets" and "2.11 Classification". The mapping is based on a k-nearest neighbours (kNN) algorithm, which is different from FlowSOM. There is also an interesting supplementary figure about selecting k.

I quote:

Classification provides the means to transfer the annotation of a dataset to another dataset.
Spectre facilitates classification via the run.knn.classifier function.
To determine a suitable k value, Spectre provides the train.knn.classifier function to assist in evaluation.
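
To make the idea of kNN label transfer concrete, here is a minimal sketch in plain Python. It is not Spectre code and does not reproduce the behaviour of run.knn.classifier; the function name `knn_classify` and the toy data are made up for illustration. Each new cell simply receives the majority label of its k nearest annotated cells.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Return the majority label among the k annotated cells nearest to `query`.

    Hypothetical helper for illustration only, not part of Spectre.
    """
    # Rank annotated cells by Euclidean distance to the query cell.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Majority vote over the k closest annotated cells.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy annotated subset: two marker values per cell, with labels from an earlier clustering.
train_points = [(0.1, 0.2), (0.0, 0.3), (0.2, 0.1),
                (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

# Unlabelled cells that should inherit a label from the annotated subset.
new_cells = [(0.15, 0.25), (4.9, 5.2)]
predicted = [knn_classify(train_points, train_labels, cell, k=3) for cell in new_cells]
print(predicted)
```

With real data the choice of k matters, which is why the article points to a separate function for evaluating it.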

The authors/maintainers will correct me if I am wrong.
Best.

tomashhurst (Maintainer) · 2023-08-04T10:44:10Z

Hi @Alby86, thanks for reaching out! In theory, the read.files function itself should handle a dataset of any size, but available memory may be the limiting factor. We have some solutions for this, which will be part of our v2 release. You can indeed map to a FlowSOM grid, or use the kNN classifier as @SamGG alluded to. At the moment we discard the actual FlowSOM model in Spectre and retain only the outputs, but the v2 structure keeps both, so you can do post-hoc mapping of extra data. Effectively, you would run the analysis on a subset, then read in the data a chunk at a time, map each chunk to the FlowSOM grid, summarise, and save to disk.
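
The chunked workflow described above can be sketched roughly as follows. This is plain Python for illustration, not Spectre or FlowSOM code: the names `nearest_node`, `iter_chunks` and `map_in_chunks`, the toy codebook, and the output file name are all assumptions, and an in-memory list stands in for reading files a chunk at a time. It assumes the grid (codebook vectors plus a node-to-cluster lookup) was already trained on the subset, and that new cells are transformed and scaled the same way as the training data.

```python
import csv
import math
import os
import tempfile
from collections import Counter

def nearest_node(cell, codebook):
    """Index of the codebook vector closest (Euclidean distance) to this cell."""
    return min(range(len(codebook)), key=lambda i: math.dist(cell, codebook[i]))

def iter_chunks(cells, chunk_size):
    """Yield successive slices so that only one chunk is processed at a time."""
    for start in range(0, len(cells), chunk_size):
        yield cells[start:start + chunk_size]

def map_in_chunks(cells, codebook, node_to_cluster, chunk_size, out_path):
    """Map each chunk to the fixed grid, write per-chunk counts to CSV, return totals."""
    totals = Counter()
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["chunk", "cluster", "n_cells"])
        for chunk_id, chunk in enumerate(iter_chunks(cells, chunk_size)):
            # Assign every cell to its nearest node, then to that node's cluster.
            counts = Counter(node_to_cluster[nearest_node(c, codebook)] for c in chunk)
            # Only the small summary is kept; the chunk itself can be discarded.
            for cluster, n in sorted(counts.items()):
                writer.writerow([chunk_id, cluster, n])
            totals.update(counts)
    return totals

# Toy grid trained on the subset: three nodes, merged into two clusters.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
node_to_cluster = [1, 1, 2]

# Stand-in for the full dataset.
cells = [(0.1, 0.0), (0.9, 1.2), (5.1, 4.8), (4.7, 5.3), (0.2, 0.1)]

out_path = os.path.join(tempfile.mkdtemp(), "cluster_counts.csv")
totals = map_in_chunks(cells, codebook, node_to_cluster, chunk_size=2, out_path=out_path)
print(dict(totals))
```

Because only the per-chunk summaries are written out, peak memory depends on the chunk size rather than on the size of the whole dataset.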

I can send you some notes on how to do this next week.

1 reply

Alby86 Aug 17, 2023
Author

Hi Thomas. Thanks a lot for your reply. To be honest, I am not sure how to map to a FlowSOM grid, and it would be important to keep a FlowSOM framework rather than move to the kNN classifier, to remain consistent with the rest of the study. Could you help me out with the notes you mentioned? Or is v2 coming out soon?

SamGG · 2023-08-06T17:19:52Z

Hi. I would like to take the opportunity of your question to ask for opinions about the sampling ratio.

I don't know your experiment, but I feel that picking 1% out of 52M cells is too low, although you pointed out that you got the cell populations you expected. Maybe you are at the first level of the analysis and only picked the markers that separate the major populations. In that case, I can understand 1% as a sampling ratio.

If this is not the case, I feel that 1% is too low. In my opinion, the concatenation should contain at least the equivalent of one complete sample from each group. This would make it possible to exploit the depth (richness) of a typical sample. So let's say that each sample is about 1M cells, so there are about 50 samples. Let's say there are at least 10 samples per group, so there are about 4 groups. The concatenation should then be around 4M cells, with 1M cells sampled from each group. Put another way, a sampling ratio of 4/50 = 8%.
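
The arithmetic above, written out as a short Python sketch. The numbers are the rough round figures from this post (about 1M cells per sample, about 50 samples, about 4 groups), not measurements from the actual experiment, and the variable names are only for illustration.

```python
# Rough figures assumed in the post above, not taken from the real dataset.
cells_per_sample = 1_000_000
n_samples = 50
n_groups = 4

total_cells = cells_per_sample * n_samples   # about 50M cells in total
target_cells = n_groups * cells_per_sample   # one full sample's worth per group
sampling_ratio = target_cells / total_cells  # 4/50

# For comparison: what a 1% sub-sample contains overall and per group.
cells_at_one_percent = total_cells // 100
cells_per_group_at_one_percent = cells_at_one_percent // n_groups

print(f"target: {target_cells:,} cells ({sampling_ratio:.0%} of the total)")
print(f"1% sub-sample: {cells_at_one_percent:,} cells, "
      f"{cells_per_group_at_one_percent:,} per group")
```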

What's your opinion about the "1 complete representative sample per concatenation" design rule?

Alby86 (Author) · 2023-08-07T05:06:22Z

Hi Thomas! Thank you for your reply. I also think it is probably a memory issue, but I am already using a whole cluster node, so I don't have another option. If you could send me some notes, that would be amazing. When is version 2 coming out? Thanks a lot!
