Regression of galaxy spin with morphology, stellar mass and (local + large-scale) environment. Random forests, and code to generate thesis plots.

Chris-Duckworth/spin_bias

tree

Galaxy spin and environment

This repo contains code to understand how a galaxy's spin depends on properties such as stellar mass, halo mass, group membership and large-scale environment.

A galaxy's spin appears to be strongly correlated with morphology, with secondary dependences on stellar mass and inclination (i.e. λR is a biased observational estimate). Local and large-scale environment (filamentary structure) are also informative, but to a lesser degree. This is detailed in chapter 4 of my thesis and is most quickly summarised here, based on results from ./scripts/random_forest/ and ./plots/random_forest/.

Data

Data is taken from various sources, and a basic summary of each catalogue is given here. For more detail, see here and the references therein.

  • Integral field unit observations are from the MaNGA galaxy survey and are processed by the internal Data Analysis Pipeline (DAP). These are used to compute λR, a flux-weighted measure of the coherent rotation of a galaxy (see here).

  • Additional information is taken from the NASA-Sloan Atlas targeting catalogue, which provides stellar mass and galaxy inclination. Morphological classifications for all galaxies in MaNGA are taken from the citizen science project GalaxyZoo.

  • These catalogues are cross-matched with group catalogues built from galaxies in the SDSS-DR7 spectroscopic sample, which provide halo mass and a central/satellite classification.

  • Cosmic web catalogues are also cross-matched to this data to provide distances to features of the cosmic web, such as filaments and nodes.
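As a rough illustration of what a flux-weighted measure of coherent rotation means, the sketch below evaluates the commonly used definition λR = Σ F R |V| / Σ F R √(V² + σ²), summed over spaxels. It is not the repo's own implementation: the function name `lambda_r`, the toy arrays, and the absence of any aperture cut or inclination correction are all assumptions made for the example.

```python
import numpy as np


def lambda_r(flux, radius, velocity, sigma):
    """Flux-weighted spin proxy summed over spaxels (hypothetical helper)."""
    flux, radius, velocity, sigma = map(np.asarray, (flux, radius, velocity, sigma))
    # Ordered (rotational) motion, weighted by flux and radius.
    numerator = np.sum(flux * radius * np.abs(velocity))
    # Total (ordered + random) motion, with the same weighting.
    denominator = np.sum(flux * radius * np.sqrt(velocity**2 + sigma**2))
    return numerator / denominator


# Toy spaxel values, invented for illustration only.
flux = np.array([1.0, 2.0, 1.5])
radius = np.array([1.0, 2.0, 3.0])
velocity = np.array([30.0, -40.0, 60.0])
sigma = np.array([40.0, 30.0, 80.0])

lam = lambda_r(flux, radius, velocity, sigma)
print(f"lambda_R = {lam:.3f}")
```

Values close to 1 indicate rotation-dominated kinematics, while values close to 0 indicate dispersion-dominated kinematics.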

The number of MaNGA galaxies (for MPL-9) remaining after cross-matching with each of these catalogues individually (top row) and cumulatively (bottom row) is given here:

| | MaNGA (w GalaxyZoo) | Group membership | Cosmic-web |
| --- | --- | --- | --- |
| Cross-matched | 7398 | 6343 | 6378 |
| Cumulative cross-matched | 7398 | 6343 | 6117 |
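The difference between the individual and cumulative counts can be sketched with pandas inner joins. This is a minimal toy example, not the repo's cross-matching code: the key column `galaxy_id`, the other column names, and all values are hypothetical.

```python
import pandas as pd

# Toy catalogues sharing a hypothetical identifier column.
manga = pd.DataFrame({"galaxy_id": [1, 2, 3, 4, 5],
                      "lambda_r": [0.2, 0.5, 0.7, 0.4, 0.1]})
groups = pd.DataFrame({"galaxy_id": [1, 2, 3, 4],
                       "log_halo_mass": [12.1, 13.0, 11.8, 14.2]})
cosmic_web = pd.DataFrame({"galaxy_id": [2, 3, 4, 5],
                           "d_node": [1.2, 0.4, 3.3, 2.0]})

# Individual cross-matches: each catalogue joined to MaNGA on its own.
n_groups = len(manga.merge(groups, on="galaxy_id", how="inner"))
n_web = len(manga.merge(cosmic_web, on="galaxy_id", how="inner"))

# Cumulative cross-match: only galaxies present in every catalogue survive.
cumulative = (manga.merge(groups, on="galaxy_id", how="inner")
                   .merge(cosmic_web, on="galaxy_id", how="inner"))

print(n_groups, n_web, len(cumulative))
```

The cumulative count can never exceed the smallest individual count, which is the pattern seen in the table above.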

Catalog class object

Data catalogues (and various versions of MaNGA data releases) are brought together by the catalog class object found here, which performs the cross-matching. Catalog class objects store cross-matched information in the form of a pandas.DataFrame object (stored as property catalog.df).

The catalog class objects also contain methods for selecting galaxy sub-samples based on these properties (./lib/catalog_init.py), for data processing (./lib/catalog_process.py) and for plotting (./lib/catalog_plot.py). These methods are tied together in ./catalog.py but are mainly used by the ./scripts/thesis_plots directory.

MaNGA data used by the catalog object here is currently proprietary and hence not included in the repo.

Random Forest

To evaluate the importance of various galaxy properties (including local and large-scale environment) in predicting a galaxy's spin, we generate a random forest to predict λR.

Input features

  • (nsersic) sersic index
  • (b/a) inclination
  • (Mstel) stellar mass
  • (Mhalo) halo mass
  • (Dnode) distance to nearest node (of the cosmic web)
  • (Dskel) distance to nearest filament segment (of the cosmic web)
  • (cen/sat) group membership (i.e. if the galaxy is the most massive (central) in the group or not (satellite))

We test various parameterisations of morphology, including one-hot encoding of GalaxyZoo classifications (i.e. ETGs, LTGs etc.) and an empirically estimated morphological T-type (using vote fractions in GZ). Of these, sersic index appears to be the most informative for predicting galaxy spin (see ./plots/random_forest/morphology_importances.pdf).
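For readers unfamiliar with one-hot encoding, the following sketch shows how a categorical morphology label could be expanded into binary columns with `pandas.get_dummies`. The column names, the class labels and the values are placeholders rather than the repo's actual encoding.

```python
import pandas as pd

# Toy sample: one continuous and one categorical morphology descriptor.
df = pd.DataFrame({
    "nsersic": [4.1, 1.2, 2.5, 0.9],
    "morphology": ["ETG", "LTG", "S0", "LTG"],  # hypothetical class labels
})

# One binary column per class; exactly one of them is 1 for each galaxy.
one_hot = pd.get_dummies(df["morphology"], prefix="morph", dtype=int)

# Feature table that could be passed to a random forest.
features = pd.concat([df[["nsersic"]], one_hot], axis=1)
print(features)
```

Each encoded class becomes its own feature, so its importance is reported separately from the other classes.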

Hyperparameter tuning

The best performing random forest is found using sklearn's cross-validated hyperparameter search (GridSearchCV). In contrast to a full grid search, not all parameter values are tried; instead, a fixed number of parameter settings is sampled from the specified distributions (within sklearn, this sampling behaviour is what RandomizedSearchCV provides). We find no significant drop-off in performance between the hyperparameter combinations tested, but our best random forest set-up is found to be:

{'n_estimators': 1000,
 'min_samples_split': 5,
 'min_samples_leaf': 4,
 'max_features': 'sqrt',
 'max_depth': 10}
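A search that samples a fixed number of settings, as described above, can be sketched with sklearn's `RandomizedSearchCV`. This is an illustration under stated assumptions, not the repo's tuning script: the data are synthetic, and the search space, `n_iter`, `cv` and the scoring metric are all chosen only to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the seven input features and the lambda_R target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hypothetical search space (small values so the example runs fast).
param_distributions = {
    "n_estimators": [20, 50],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", 1.0],
    "max_depth": [5, 10, None],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                           # only 5 sampled combinations are tried
    cv=3,                               # 3-fold cross-validation per combination
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
```

The `cv_results_` attribute holds the score of every sampled combination, which allows a check of how strongly performance varies across settings.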

Model fitting

Using the input features and hyperparameters above, a random forest is generated from the cumulatively cross-matched data in the catalog class (6117 galaxies), where 80% is used for training and 20% for testing. We find that the model predictions have a mean absolute error of 0.12538. To assess the performance of the model (and the importance of secondary parameters), we also generate a random forest (with the same hyperparameters) using only nsersic and b/a. We find these model predictions to have a mean absolute error of 0.12937. The predicted λR distributions for the two random forests (using all parameters and using only the two most important, nsersic and b/a) and the actual values are shown here:

hist
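The fit-and-compare procedure described above can be sketched as follows. The data are synthetic (the MaNGA catalogue is not public), so the errors printed bear no relation to the values quoted in this README, and `n_estimators` is reduced from 1000 to 100 purely to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic features: columns 0 and 1 play the role of nsersic and b/a.
rng = np.random.default_rng(1)
n = 500
X = rng.uniform(size=(n, 7))
y = np.clip(0.6 - 0.4 * X[:, 0] + 0.2 * X[:, 1] + 0.05 * X[:, 2]
            + rng.normal(scale=0.05, size=n), 0.0, 1.0)

# 80% training, 20% testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Hyperparameters from the tuning section, except for the smaller n_estimators.
params = dict(n_estimators=100, min_samples_split=5, min_samples_leaf=4,
              max_features="sqrt", max_depth=10, random_state=0)

# Forest trained on all seven features.
full = RandomForestRegressor(**params).fit(X_train, y_train)
mae_full = mean_absolute_error(y_test, full.predict(X_test))

# Forest trained on the first two features only.
reduced = RandomForestRegressor(**params).fit(X_train[:, :2], y_train)
mae_reduced = mean_absolute_error(y_test, reduced.predict(X_test[:, :2]))

print(f"MAE (all features): {mae_full:.4f}")
print(f"MAE (two features): {mae_reduced:.4f}")
```

Using the same split and the same hyperparameters for both forests means that any difference in error reflects the extra features rather than the set-up.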

To evaluate the performance of the predictions, we also plot them against the actual λR values for the test data (with Pearson correlation coefficients). We find that the random forests make reasonably informed predictions of λR but struggle to predict low or high values. The correlation plot is shown here:

corner
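A small numerical sketch of this diagnostic: the Pearson coefficient measures how tightly predictions track the true values, while the slope of a straight-line fit shows whether the predictions are compressed towards the mean. The arrays below are invented to mimic that compression and are not taken from the model.

```python
import numpy as np

# Invented true and predicted values; predictions are pulled towards the middle.
y_true = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y_pred = np.array([0.3, 0.35, 0.5, 0.6, 0.7])

# Pearson correlation coefficient between truth and prediction.
r = np.corrcoef(y_true, y_pred)[0, 1]

# Least-squares line through (truth, prediction); a slope below 1 means
# low values are over-predicted and high values are under-predicted.
slope, intercept = np.polyfit(y_true, y_pred, 1)

print(f"Pearson r = {r:.3f}, slope = {slope:.3f}")
```

A high correlation can therefore coexist with poor predictions at the extremes, which is why the slope is worth checking alongside the coefficient.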

Feature importances

A natural output of random forests is the relative importance of each input feature (in sklearn, this is based on the mean decrease in impurity from splits on a given feature across all of the decision trees in the forest). The relative importances (i.e. normalised so they sum to one) are given here:

imp

We find that sersic index and inclination are the most informative for predicting λR. However, there are reasonable contributions from stellar and halo mass, along with small but significant contributions from cosmic-web environment. Group membership appears to be insignificant (although it is encoded as a binary discrete parameter, which may affect its importance ranking relative to the other features, which are all continuous).
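The importances can be read from a fitted sklearn forest through its `feature_importances_` attribute, as in the sketch below. The data are synthetic and built so that the first feature dominates, so the ranking printed here is an illustration and not a reproduction of the result above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

feature_names = ["nsersic", "b/a", "Mstel", "Mhalo", "Dnode", "Dskel", "cen/sat"]

# Synthetic data: only the first two columns influence the target.
rng = np.random.default_rng(2)
n = 400
X = rng.uniform(size=(n, 7))
X[:, 6] = rng.integers(0, 2, size=n)  # binary stand-in for group membership
y = 0.6 - 0.4 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.05, size=n)

forest = RandomForestRegressor(n_estimators=100, max_depth=10,
                               random_state=0).fit(X, y)

# Impurity-based importances, normalised by sklearn to sum to one.
importances = forest.feature_importances_
ranking = sorted(zip(feature_names, importances),
                 key=lambda pair: pair[1], reverse=True)

for name, value in ranking:
    print(f"{name:8s} {value:.3f}")
```

Impurity-based importances are known to favour features with many distinct values over binary ones, which is consistent with the caveat about group membership. `sklearn.inspection.permutation_importance` is one alternative that could be used as a cross-check.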
