Next steps for feature extraction pipeline #43

ngreenwald · 2024-04-09T05:29:32Z

I wanted to consolidate all of the new ideas for feature extraction here, separate from the supplementary plots issues, so it's easier to keep track of what's for the paper and what is next.

Identifying which features to keep
- The current pipeline uses some relatively simple heuristics to filter out features prior to analysis. For example, if a given compartment-specific feature is highly correlated with the image-wide feature, then that feature isn't included in the final dataset. However, we don't compare the compartment-specific features to each other, and we don't compare features of different types to each other, as a way of identifying additional potentially correlated features. Would it make sense to add in additional checks to identify potentially duplicative features?
- Alternatively, should we think about different or complementary approaches to identify which features to prune? For example, what about a feature that isn't correlated with anything, but is just random noise? Are there ways we could look at the data and infer that a given feature does not contain any useful information? Could we use a metric other than correlation to determine if a feature should be kept? For example, statistical testing to determine if means are different across compartments?
- Should we revisit the thresholds for number of non-zero/non-missing values for a feature to be included? Should this threshold be adaptive, based on what tier of feature we're looking at?
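
To make the first and third questions concrete, here is a minimal sketch of what an all-against-all check could look like. It is not the current pipeline code: the feature names, tier labels, per-tier minimums, and the 0.7 correlation cutoff are all made-up placeholders, and it assumes missing values have already been filled with zeros. It is written in plain Python to stay self-contained. Features are first filtered by a per-tier non-zero count, then a greedy pass keeps a feature only if it is not highly correlated with any feature already kept, regardless of compartment or feature type.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def prune_features(features, tiers, min_nonzero_by_tier, corr_cutoff=0.7):
    """Return the names of features to keep, in input order.

    features: {name: [value per image]}
    tiers: {name: tier label}
    min_nonzero_by_tier: {tier label: minimum number of non-zero images}
    corr_cutoff: hypothetical placeholder, not a value from the pipeline
    """
    # Step 1: adaptive sparsity filter, with a different minimum per tier.
    dense = [
        name for name, values in features.items()
        if sum(v != 0 for v in values) >= min_nonzero_by_tier[tiers[name]]
    ]
    # Step 2: greedy all-against-all correlation filter. Earlier features win,
    # so image-wide features should be listed before compartment-specific ones.
    kept = []
    for name in dense:
        if all(abs(pearson(features[name], features[k])) < corr_cutoff for k in kept):
            kept.append(name)
    return kept


# Toy data: five images, hypothetical feature names.
features = {
    "cd8t_density": [1, 2, 3, 4, 5],
    "cd8t_density__stroma": [2, 4, 6, 8, 10.5],  # tracks the image-wide value
    "cd8t_density__border": [5, 1, 4, 2, 3],     # mostly independent
    "rare_marker": [0, 0, 0, 0, 7],              # non-zero in a single image
}
tiers = {
    "cd8t_density": "primary",
    "cd8t_density__stroma": "primary",
    "cd8t_density__border": "primary",
    "rare_marker": "secondary",
}
kept = prune_features(features, tiers, {"primary": 3, "secondary": 2})
print(kept)  # ['cd8t_density', 'cd8t_density__border']
```

The greedy pass is order-dependent, which is a design choice worth discussing: it makes the "image-wide feature wins" rule explicit, but a clustering-based approach would treat all features symmetrically.
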
Additional features
- Functional marker expression per compartment. Given the interesting differences in cell type abundance by compartment, can we do the same thing for functional markers? Once again, doing it across all functional markers would likely cause too much feature pollution, but maybe there's a more targeted way to introduce this? Just for a subset of functional marker/cell type combinations, perhaps? Just the ones associated with survival? Maybe a more stringent criterion for determining which features are uncorrelated with the global level, or a requirement that more images be positive?
- The cell ratio features have been quite informative. Can we take a similar approach to functional marker features and look at ratios? For example, at the cluster lineage resolution, for a given functional marker, look at the ratios between all the cell types that are positive? This would likely be too many features. Maybe we do it for a subset? Or a manually curated list? Would it make more sense to look at ratios of proportions, i.e. 30% CD4T, 40% CD8T, ratio of 0.75? Or ratios of counts, i.e. 200 CD4T positive, 100 CD8T positive, ratio of 2? Can we do this per compartment?
- Are there other ratio-based features that it would make sense to include? Which metrics would lend themselves to being calculated in this way without a lot of manual guidance?
- Is there a more principled way to decide which features to compute per compartment and which to leave image-wide only?
- Are we excluding the double positive functional marker combinations in the right way?
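
To illustrate the proportion-versus-count question from the ratio bullet, here is a small sketch that computes both for every pair of cell types positive for a given marker. Everything in it is hypothetical: the function name, the input layout, the `PD1` marker, and the cell counts, which were chosen only so that they reproduce both example ratios above (0.75 and 2). It also assumes that "30% CD4T" means 30% of CD4T cells are positive for the marker, which is my reading rather than something stated above.

```python
from itertools import combinations


def marker_ratios(counts, marker, min_positive=1):
    """Pairwise ratio features for one functional marker.

    counts: {cell_type: {"total": int, "positive": {marker: int}}}
    Returns {(type_a, type_b): {"count_ratio": float, "proportion_ratio": float}}
    for every pair of cell types with at least `min_positive` positive cells.
    """
    if min_positive < 1:
        raise ValueError("min_positive must be >= 1 to avoid dividing by zero")
    # Only cell types with enough positive cells take part in any ratio.
    eligible = [
        cell_type for cell_type, c in counts.items()
        if c["positive"].get(marker, 0) >= min_positive
    ]
    ratios = {}
    for a, b in combinations(eligible, 2):
        pos_a = counts[a]["positive"][marker]
        pos_b = counts[b]["positive"][marker]
        prop_a = pos_a / counts[a]["total"]
        prop_b = pos_b / counts[b]["total"]
        ratios[(a, b)] = {
            "count_ratio": pos_a / pos_b,          # e.g. 600 / 300
            "proportion_ratio": prop_a / prop_b,   # e.g. 30% / 40%
        }
    return ratios


# Made-up counts for a single image (or a single compartment of an image).
counts = {
    "CD4T": {"total": 2000, "positive": {"PD1": 600}},  # 30% positive
    "CD8T": {"total": 750, "positive": {"PD1": 300}},   # 40% positive
    "Treg": {"total": 50, "positive": {"PD1": 0}},      # no positive cells, skipped
}
ratios = marker_ratios(counts, "PD1")
pair = ratios[("CD4T", "CD8T")]
print(pair["count_ratio"], round(pair["proportion_ratio"], 2))
```

The same image gives a count ratio of 2 and a proportion ratio of about 0.75, so the two definitions can point in opposite directions; the count ratio also reflects how abundant each cell type is, which the existing cell ratio features already capture. Running this per compartment would just mean calling it once per compartment-level count table, and `min_positive` is one place a stricter inclusion rule could go.
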

ngreenwald added the enhancement New feature or request label Apr 9, 2024

ngreenwald assigned camisowers Jun 4, 2024
