Batch adjusting and scaling the dataset #948

rbutleriii · 2024-04-26T19:38:52Z

Hello, linking this in with #945 and #862. I am looking at merged sections across multiple slides. For instance, the wild type vehicle group has sections on slides A, B, and C, and I want to merge the three and process them together as the wild type vehicle group. The question is how to properly account for slide differences.

Potentially, it could simply be:

gobj %>%
  filterGiotto(
    feat_type = 'rna',
    feat_det_in_min_cells = 5,
    min_det_feats_per_cell = 5
  ) %>%
  normalizeGiotto(feat_type = 'rna') %>%
  normalizeGiotto(
    feat_type = 'rna',
    scalefactor = 5000,
    norm_methods = 'pearson_resid',
    update_slot = 'pearson'
  ) %>%
  addStatistics(feat_type='rna') %>%
  adjustGiottoMatrix(
    batch_columns = 'list_ID', 
    update_slot = 'normalized'
  ) -> gobj

where list_ID is a column with slide numbers. However, I noticed this doesn't update the scaled slot, and there isn't a function that scales the data independently of normalizeGiotto. Looking through a few tutorials (CosMx, seqfish mini, seqfish Cortex), it appears that most downstream function calls reference the normalized slot, while the heatmaps seem to re-scale the data on the fly (which is what I assume "rescaled" means in plotHeatmap).

Is there a way to regenerate the scaled slot? And is there a function that will use it if you do?
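
To make the question concrete, here is a minimal base-R sketch of what regenerating a scaled matrix by hand could look like. It assumes the scaled slot is a z-score computed per feature and then per cell, which is an assumption about normalizeGiotto's default and worth checking against its documentation. The matrix `expr` and the helper `rescale_matrix` are hypothetical stand-ins, not Giotto objects or functions.

```r
# Hypothetical stand-in for a batch-adjusted normalized matrix
# (features in rows, cells in columns); not pulled from a Giotto object.
set.seed(1)
expr <- matrix(rexp(20, rate = 1), nrow = 4,
               dimnames = list(paste0("feat", 1:4), paste0("cell", 1:5)))

# z-score each feature across cells, then each cell across features.
# base::scale() standardizes columns, so transpose to work on rows.
rescale_matrix <- function(m) {
  by_feat <- t(scale(t(m)))   # rows (features): mean 0, sd 1
  by_cell <- scale(by_feat)   # columns (cells): mean 0, sd 1
  # drop the attributes scale() attaches so only the matrix remains
  attr(by_cell, "scaled:center") <- NULL
  attr(by_cell, "scaled:scale") <- NULL
  by_cell
}

rescaled <- rescale_matrix(expr)
round(colMeans(rescaled), 10)  # column means after the second pass
```

Writing the result back into the object, and whether any downstream function would read it, is exactly the open part of the question.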

As an aside, adjustGiottoMatrix appears in the tutorials a couple of times, regressing out nr_feats and total_expr. I imagine this would remove most of any batch effect that impacts the whole slide, though it would in effect turn feature expression into feature expression residuals. Is there a rationale for doing that? In Seurat at least, I have had...not great success doing DE testing on residuals.

rbutleriii · 2024-05-03T17:09:58Z

So, following up on this: when I do normalize by slide, the primary problem I encounter is that the distribution now includes negative numbers:

# Before batch adjustment (normalized slot as produced by normalizeGiotto)
summary(as.vector(gobj@expression$cell$rna$normalized@exprMat))
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#  0.0000  0.0000  0.0000  0.8786  0.0000 11.1888
gobj %>%
  adjustGiottoMatrix(
    batch_columns = 'list_ID',
    update_slot = 'normalized'
  ) -> gobj
# normalized already exists and will be replaced with new values
summary(as.vector(gobj@expression$cell$rna$normalized@exprMat))
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
# -0.955590 -0.081362  0.005135  0.847343  0.540294 12.286635
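
For intuition on where the negative values come from, here is a toy base-R sketch. It assumes the adjustment behaves like removing a per-slide mean offset, which is a simplified stand-in and not what adjustGiottoMatrix necessarily does internally. All values and names are made up.

```r
# Hypothetical toy data: one feature, two slides, many zero counts.
expr  <- c(0, 0, 0, 2, 0, 0, 4, 6)                   # normalized expression
slide <- factor(c("A", "A", "A", "A", "B", "B", "B", "B"))

# Estimate a per-slide mean and subtract each slide's offset from the
# grand mean. Simplified stand-in for a linear batch adjustment.
slide_means <- tapply(expr, slide, mean)             # A = 0.5, B = 2.5
offset      <- as.numeric(slide_means[as.character(slide)]) - mean(expr)
adjusted    <- expr - offset

adjusted         # zeros on slide A move up to 1, zeros on slide B move down to -1
range(adjusted)  # the minimum is now below zero
```

Any cell with zero expression on a slide with an above-average mean ends up negative, so a sparse matrix will pick up negative values after this kind of adjustment.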

This doesn't cause problems with most of the downstream functions until I get to ligand-receptor cell-cell communication, where I will get a warning for exprCellCellcom and spatCellCellcom:

Warning message:
In eval(jsub, SDenv, parent.frame()) : NaNs produced

The NaNs are in the log2fc and PI columns, since the log of a negative number is undefined. The easy solution is to set log2FC_addendum = 1 instead of 0.1, but I wanted to check whether I should instead shift my entire distribution up by one immediately after normalization. Will the negative numbers adversely impact spatial networks, spatial genes, and ICFs even though they don't raise an error?
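
A small base-R sketch of the failure mode, assuming the fold change has the form log2((x + addendum) / (y + addendum)). That form is my assumption about how log2FC_addendum enters the calculation; the helper `log2fc` and the input values are hypothetical.

```r
# Hypothetical mean expression values in the selected cells versus the
# comparison set; names and numbers are made up for illustration.
log2fc <- function(selected, background, addendum) {
  log2((selected + addendum) / (background + addendum))
}

selected   <- c(-0.4, 0.3, 2.0)
background <- c(0.2, 0.15, 1.0)

# With a small addendum the first numerator is negative, so log2() returns
# NaN (R also emits the "NaNs produced" warning, suppressed here).
fc_small <- suppressWarnings(log2fc(selected, background, addendum = 0.1))

# With an addendum larger than the most negative mean, every ratio is positive.
fc_large <- log2fc(selected, background, addendum = 1)

is.nan(fc_small)
is.nan(fc_large)
```

Under this assumption, the addendum only needs to exceed the magnitude of the most negative mean being compared, not the most negative single value in the matrix.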

2 replies

rbutleriii May 7, 2024
Author

Hmm, the variance on this between the four experimental groups I am comparing is higher than expected, and it doesn't seem realistic to set log2FC_addendum = 2 for all of them.

[1] "After normalization, the expression matrix has the following distribution:"
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.41408 -0.08250  0.09673  1.07518  2.27879 12.25981 

[1] "After normalization, the expression matrix has the following distribution:"
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-1.81828 -0.09395  0.05637  1.08330  2.41265 12.09038 

[1] "After normalization, the expression matrix has the following distribution:"
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.57007 -0.01931  0.01594  0.92636  0.25105 11.60394 

[1] "After normalization, the expression matrix has the following distribution:"
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.974910 -0.081921  0.007865  0.820172  0.535465 12.947419

Also, considering the median is near zero, I'm not sure any kind of shift would be advisable. I also tried a higher filtering threshold (feat_det_in_min_cells = 100), to no effect, and quantile normalization, which partially reduced the negative Min values, but they still remain.

If I were to apply the log2FC_addendum, then the question becomes: do I apply the same value to all four groups, or to each distribution its own Min value? Presumably, since spat/exprCellCellcom compare the expr value to permutations within the same distribution, the log2fc values are relative and hence comparable between groups, though different log2FC_addendum values would impact lower expression values quite a bit more, as:

# low expression 2-fold increase
log2((0.3 + 2)/(0.15 + 2))
[1] 0.0972972
log2((0.3 + 0.1)/(0.15 + 0.1))
[1] 0.6780719

# high expression 2-fold increase
log2((6 + 2)/(3 + 2))
[1] 0.6780719
log2((6 + 0.1)/(3 + 0.1))
[1] 0.976541

I could also set log2FC_addendum = 1 and then hope the mean expr values would be >0.

rbutleriii May 29, 2024
Author

Interestingly, when building tensors in cell2cell, they simply set negative counts to 0. In this case I think that would also work, as the majority of the distribution would be unaffected, and the groups are otherwise fairly consistent above the lower quartile (similar medians, means, quartiles). Wondering what y'all think...
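
For reference, clamping at zero is a one-liner in base R. The sketch below uses simulated values as a hypothetical stand-in for a batch-adjusted distribution and compares the summaries before and after.

```r
# Hypothetical batch-adjusted values with a small negative tail;
# simulated, not taken from the dataset discussed here.
set.seed(42)
adjusted <- c(rnorm(500, mean = 0, sd = 0.2), rexp(500, rate = 0.5))

# Set negative values to 0 and leave everything else untouched.
clamped <- pmax(adjusted, 0)

# Compare summaries before and after clamping.
rbind(adjusted = summary(adjusted), clamped = summary(clamped))

# Fraction of entries that were changed by the clamp
mean(adjusted < 0)
```

For a dense matrix the equivalent would be `m[m < 0] <- 0`. Only the negative tail changes, so the upper quartile and maximum stay the same, while the mean shifts up slightly.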
