This repo contains code to understand how a galaxy's spin depends on properties such as stellar mass, halo mass, group membership, and large-scale environment.
A galaxy's spin appears to be strongly correlated with morphology, with secondary dependences on stellar mass and inclination (i.e. λR is a biased observational estimate). Local and large-scale environment (filamentary structure) are also informative, though to a lesser degree. This is detailed in chapter 4 of my thesis and is summarised briefly here, based on results from `./scripts/random_forest/` and `./plots/random_forest/`.
Data are taken from various sources. A basic summary of each catalogue is given here; for more detail, see here and the references therein.
- Integral field unit observations are from the MaNGA galaxy survey and are processed by the internal Data Analysis Pipeline (DAP). The DAP output is used to compute λR, a flux-weighted measure of the coherent rotation of a galaxy (see here).
- Additional information is taken from the NASA-Sloan Atlas targeting catalogue, which provides stellar mass and galaxy inclination. Morphological classifications from the citizen science project GalaxyZoo are found for all galaxies in MaNGA.
- These catalogues are cross-matched with group catalogues constructed from galaxies in the SDSS-DR7 spectroscopic sample, which provide halo mass and the central/satellite classification.
- Cosmic web catalogues are also cross-matched to these data to provide distances to morphological features of the cosmic web, such as filaments and nodes.
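As an illustration of the flux-weighted rotation measure mentioned above, here is a minimal sketch of the commonly used definition λR = Σ F·R·|V| / Σ F·R·√(V² + σ²), summed over spaxels. The function name, argument names and toy values are assumptions for illustration only; the calculation in this repo may differ in detail (for example in the aperture used or in any corrections applied).

```python
import numpy as np


def lambda_r(flux, radius, velocity, sigma):
    """Flux-weighted spin proxy lambda_R summed over spaxels.

    flux, radius, velocity and sigma are per-spaxel arrays of equal length.
    """
    flux, radius, velocity, sigma = map(np.asarray, (flux, radius, velocity, sigma))
    numerator = np.sum(flux * radius * np.abs(velocity))
    denominator = np.sum(flux * radius * np.sqrt(velocity**2 + sigma**2))
    return numerator / denominator


# Toy spaxels with made-up values (not MaNGA data)
flux = np.array([1.0, 2.0, 1.0])
radius = np.array([1.0, 2.0, 3.0])

# Rotation-dominated case: large |V|, small sigma
lam_rot = lambda_r(flux, radius, [100.0, -120.0, 150.0], [10.0, 10.0, 10.0])

# Dispersion-dominated case: small |V|, large sigma
lam_disp = lambda_r(flux, radius, [10.0, -10.0, 10.0], [150.0, 150.0, 150.0])
```

In this toy case the rotation-dominated galaxy has λR close to one and the dispersion-dominated galaxy has λR close to zero, which is the behaviour the statistic is designed to capture.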
The number of MaNGA galaxies (for MPL-9) remaining after cross-matching with each of these catalogues individually (top row) and cumulatively (bottom row) is given here:
| | MaNGA (w GalaxyZoo) | Group membership | Cosmic-web |
|---|---|---|---|
| Cross-matched | 7398 | 6343 | 6378 |
| Cumulative cross-matched | 7398 | 6343 | 6117 |
Data catalogues (and various versions of MaNGA data releases) are brought together by the catalog class object found here, which performs the cross-matching. Catalog class objects store the cross-matched information as a `pandas.DataFrame` (available as the property `catalog.df`).
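The difference between the individual and cumulative cross-match counts can be sketched with inner joins in pandas. This is only an illustration of the idea, not the catalog class itself: the column names (`galaxy_id`, `lambda_r`, `log_mhalo`, `d_node`) and all values are hypothetical.

```python
import pandas as pd

# Toy catalogues with made-up IDs and values; column names are hypothetical
manga = pd.DataFrame({"galaxy_id": [1, 2, 3, 4], "lambda_r": [0.6, 0.2, 0.5, 0.4]})
groups = pd.DataFrame({"galaxy_id": [1, 2, 4], "log_mhalo": [12.1, 13.5, 11.8]})
cosmic_web = pd.DataFrame({"galaxy_id": [1, 3, 4], "d_node": [2.3, 0.7, 5.1]})

# Individual cross-matches (top row of the table): each catalogue against MaNGA
n_groups = len(manga.merge(groups, on="galaxy_id", how="inner"))
n_web = len(manga.merge(cosmic_web, on="galaxy_id", how="inner"))

# Cumulative cross-match (bottom row): keep only galaxies present in every catalogue
df = manga.merge(groups, on="galaxy_id", how="inner").merge(
    cosmic_web, on="galaxy_id", how="inner"
)
n_cumulative = len(df)
```

The cumulative count can only be equal to or smaller than each individual count, which is why the final cosmic-web entry in the table drops from 6378 to 6117.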
The catalog class objects also contain various methods for selecting galaxy sub-samples based on these properties (`./lib/catalog_init.py`), for data processing (`./lib/catalog_process.py`) and for plotting (`./lib/catalog_plot.py`). These methods are tied together in `./catalog.py`, but are mainly used by the `./scripts/thesis_plots` directory.
MaNGA data used by the catalog object here is currently proprietary and hence not included in the repo.
To evaluate the importance of various galaxy properties (including local and large-scale environment) in predicting a galaxy's spin, we generate a random forest to predict λR from the following input features:
- (nsersic) Sérsic index
- (b/a) inclination
- (Mstel) stellar mass
- (Mhalo) halo mass
- (Dnode) distance to nearest node (of the cosmic web)
- (Dskel) distance to nearest filament segment (of the cosmic web)
- (cen/sat) group membership (i.e. if the galaxy is the most massive (central) in the group or not (satellite))
We test various parameterisations of morphology, including one-hot encoding of GalaxyZoo classifications (i.e. ETGs, LTGs etc.) and an empirically estimated morphological T-type (using vote fractions in GZ). However, Sérsic index appears to be the most informative for predicting galaxy spin (see `./plots/random_forest/morphology_importances.pdf`).
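For reference, one-hot encoding of a categorical morphology label can be sketched with `pandas.get_dummies`. The column names, class labels and values below are hypothetical and are not taken from the repo's catalogues.

```python
import pandas as pd

# Hypothetical morphology labels and Sersic indices (illustrative values only)
df = pd.DataFrame(
    {
        "morphology": ["ETG", "LTG", "LTG", "ETG", "LTG"],
        "n_sersic": [4.1, 1.2, 0.9, 3.5, 1.6],
    }
)

# One indicator column per morphological class
one_hot = pd.get_dummies(df["morphology"], prefix="morph")

# Feature table combining a continuous feature with the encoded classes
features = pd.concat([df[["n_sersic"]], one_hot], axis=1)
```

Each galaxy has exactly one indicator column set, so a single categorical feature becomes several binary inputs to the random forest.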
The best-performing random forest is found using a cross-validated hyperparameter search from `sklearn`. In contrast to a full grid search (`GridSearchCV`), not all parameter values are tried; instead, a fixed number of setting combinations is sampled from the specified distributions, which is the behaviour of `RandomizedSearchCV`. We find no significant drop-off in performance between the various hyperparameter combinations, but our best random forest set-up is found to be:
{'n_estimators': 1000,
'min_samples_split': 5,
'min_samples_leaf': 4,
'max_features': 'sqrt',
'max_depth': 10}
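The sampling-based search described above corresponds to sklearn's `RandomizedSearchCV`. The following is a minimal sketch on synthetic data: the candidate values, `n_iter`, `cv` and the scoring choice are assumptions made for illustration and are not the settings used in this repo.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the 7 input features and the lambda_R target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Candidate values are illustrative; small forests keep the sketch fast
param_distributions = {
    "n_estimators": [10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", 1.0],
    "max_depth": [5, 10, None],
}

# Sample n_iter combinations rather than trying every combination
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

`best_params` is a dictionary with the same keys as the set-up listed above, and `search.cv_results_` holds the cross-validated score of every sampled combination, which is one way to check how sensitive the performance is to the hyperparameters.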
Using the above features and hyperparameters, a random forest is generated from the cross-matched data in the catalog class (6117 galaxies), where 80% is used for training and 20% for testing. We find that the model predictions have a mean absolute error of 0.12538. To compare the performance of the model (and the importance of secondary parameters), we also generate a random forest (with the same hyperparameters) using only nsersic and b/a. We find that the predictions of this model have a mean absolute error of 0.12937. The predicted λR distributions for the two random forests (using all features, and using only the two most important, nsersic and b/a) and the actual values are shown here:
To evaluate the performance of the predictions, we also plot their correlation (with Pearson correlation coefficients) with the actual λR values for the test data. We find that the random forests make reasonably informed predictions of λR, but struggle to predict low or high values. The correlation plot is shown here:
A natural output of random forests is the relative importance of each input feature (for sklearn's `feature_importances_`, this is based on the total reduction in the split criterion contributed by a given feature across all of the decision trees in the forest). The relative importances (i.e. normalised so that they sum to one) are given here:
We find that Sérsic index and inclination are the most informative for predicting λR. However, there are reasonable contributions from stellar and halo mass, along with small but significant contributions from cosmic-web environment. Group membership appears to be insignificant (although it is encoded as a binary discrete parameter, which may affect its importance ranking relative to the other features, which are all continuous).
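The training and evaluation steps described above (80/20 split, mean absolute error, Pearson correlation on the test set, and relative feature importances) can be sketched as follows. The data here are synthetic and the target relation is invented for illustration, so the numbers will not match those quoted above. `n_estimators` is reduced from 1000 to 100 to keep the sketch fast, and the helper `fit_and_score` is hypothetical rather than a function from this repo.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

feature_names = ["nsersic", "b/a", "Mstel", "Mhalo", "Dnode", "Dskel", "cen/sat"]

# Synthetic features; cen/sat is binary, the rest are continuous
rng = np.random.default_rng(42)
n = 500
X = rng.uniform(size=(n, len(feature_names)))
X[:, 6] = rng.integers(0, 2, size=n)

# Invented target that depends mainly on the first two features
y = 0.6 - 0.4 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.05, size=n)

# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameters as listed above, except n_estimators (reduced for speed)
params = {
    "n_estimators": 100,
    "min_samples_split": 5,
    "min_samples_leaf": 4,
    "max_features": "sqrt",
    "max_depth": 10,
}


def fit_and_score(columns):
    """Fit a forest on the given feature columns; return model, MAE, Pearson r."""
    model = RandomForestRegressor(random_state=42, **params)
    model.fit(X_train[:, columns], y_train)
    pred = model.predict(X_test[:, columns])
    mae = mean_absolute_error(y_test, pred)
    r = np.corrcoef(y_test, pred)[0, 1]
    return model, mae, r


# All features versus only nsersic and b/a
full_model, mae_full, r_full = fit_and_score(list(range(len(feature_names))))
_, mae_two, r_two = fit_and_score([0, 1])

# Impurity-based importances from sklearn, normalised to sum to one
importances = dict(zip(feature_names, full_model.feature_importances_))
```

Comparing `mae_full` with `mae_two` mirrors the comparison between the full model and the two-feature model, and `importances` gives the normalised ranking of the input features.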