Next steps for feature extraction pipeline #43

Open
2 tasks
ngreenwald opened this issue Apr 9, 2024 · 0 comments
Labels
enhancement New feature or request

ngreenwald commented Apr 9, 2024

I wanted to consolidate all of the new ideas for feature extraction here, separate from the supplementary plots issues, so it's easier to keep track of what's for the paper and what is next.

  • Identifying which features to keep
    • The current pipeline uses some relatively simple heuristics to filter out features prior to analysis. For example, if a given compartment-specific feature is highly correlated with the image-wide feature, then that feature isn't included in the final dataset. However, we don't compare the compartment-specific features to each other, and we don't compare features of different types to each other, either of which could identify additional correlated features. Would it make sense to add checks to identify potentially duplicative features?
    • Alternatively, should we think about different or complementary approaches to identify which features to prune? For example, what about a feature that isn't correlated, but is just random noise? Are there ways we could look at the data and infer that a given feature does not contain any useful information? Could we use a metric other than correlation to determine if a feature should be kept? For example, statistical testing to determine if means are different across compartments?
    • Should we revisit the thresholds for number of non-zero/non-missing values for a feature to be included? Should this threshold be adaptive, based on what tier of feature we're looking at?
  • Additional features
    • Functional marker expression per compartment. Given the interesting differences in cell type abundance by compartment, can we do the same thing for functional markers? Once again, doing it across all functional markers would likely cause too much feature pollution, but maybe there's a more targeted way to introduce this? Just for a subset of functional marker/cell type combinations, perhaps? Just the ones associated with survival? Maybe a more stringent criterion for determining which features are uncorrelated with the global level, or a requirement that more images be positive?
    • The cell ratio features have been quite informative. Can we take a similar approach to functional marker features and look at ratios? For example, at the cluster lineage resolution, for a given functional marker, look at the ratios between all the cell types that are positive? This would likely be too many features. Maybe we do it for a subset? Or a manually curated list? Would it make more sense to look at ratios of proportions, i.e. 30% CD4T and 40% CD8T, for a ratio of 0.75? Or ratios of counts, i.e. 200 CD4T positive and 100 CD8T positive, for a ratio of 2? Can we do this per compartment?
    • Are there other ratio-based features that it would make sense to include? Which metrics would lend themselves to being calculated in this way without a lot of manual guidance?
    • Is there a more principled way we can decide which features to compute per compartment, and which features not?
    • Are we excluding the double positive functional marker combinations in the right way?
ngreenwald added the enhancement label Apr 9, 2024

2 participants