The primary user interaction is through a parameter file, a Python file in which several functions must be defined. These functions, described below, are used to build the ingredients for a fit (the data, the model, and the noise model). During execution, any supplied command line options are parsed, including any user-defined custom arguments, and the resulting set of arguments is passed to each of these functions before fitting begins.
The command line invocation calls the parameter file directly. For single-threaded execution the syntax is:
python parameter_file.py --nested_sampler dynesty --nested_target_n_effective 512
Additional command line options can be given (see below), e.g.
python parameter_file.py --emcee --nwalkers=128
will cause a fit to be run using emcee with 128 walkers.
A description of the available command line options can be obtained with
python parameter_file.py --help
All of this command line syntax requires that the parameter file end with something like the following code block, which reads the command line arguments, runs all of the :py:func:`build_*` methods (see below), conducts the fit, and writes the output.
if __name__ == "__main__":

    import sys
    import time

    from prospect.fitting import fit_model
    from prospect.io import write_results as writer
    from prospect import prospect_args

    # --- Get the default argument parser ---
    parser = prospect_args.get_parser()
    # Add custom arguments that control the build methods
    parser.add_argument("--custom_argument_1", ...)

    # --- Configure ---
    args = parser.parse_args()
    config = vars(args)
    # allows the parameter file text to be stored for model regeneration
    config["param_file"] = __file__

    # --- Get fitting ingredients ---
    obs, model, sps = build_all(**config)
    config["sps_libraries"] = sps.ssp.libraries
    print(model)

    if args.debug:
        sys.exit()

    # --- Set up output ---
    ts = time.strftime("%y%b%d-%H.%M", time.localtime())
    hfile = f"{args.outfile}_{ts}_result.h5"

    # --- Run the actual fit ---
    output = fit_model(obs, model, sps, **config)

    # --- Write output ---
    print(f"writing to {hfile}")
    writer.write_hdf5(hfile,
                      config,
                      model,
                      obs,
                      output["sampling"],
                      output["optimization"],
                      sps=sps)

    try:
        hfile.close()
    except AttributeError:
        pass
A number of default command line options are included with prospector. These options can control the output filenames and format, as well as some details of how the model is built and run. However, most of the default arguments control the fitting backends.
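For example, the following invocation (with an illustrative output file prefix) selects dynesty as the sampler, sets the name used for the output file, and, because of the --debug flag, exits after the fitting ingredients are built without running the fit:

python parameter_file.py --nested_sampler dynesty --outfile squiggle --debug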
You can inspect the default set of arguments and their default values as follows:
from prospect import prospect_args
parser = prospect_args.get_parser()
parser.print_help()
In the typical parameter file the arguments are converted to a dictionary and passed as keyword arguments to all of the :py:func:`build_*` methods described below.
A user can add custom arguments that further control the behavior of the model and data building methods. This is done by adding arguments to the parser in the executable part of the parameter file; see the argparse documentation for details of the add_argument syntax.
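For example, a hypothetical --object_redshift argument (the name is illustrative, not a built-in option) could be added to the parser and then picked up by any of the build methods, since all of them receive the full argument dictionary as keyword arguments:

from prospect import prospect_args

parser = prospect_args.get_parser()
# custom, illustrative argument; any name and type allowed by argparse will work
parser.add_argument("--object_redshift", type=float, default=0.0,
                    help="Redshift at which the model is fixed.")
args = parser.parse_args()
config = vars(args)

# config["object_redshift"] is now passed to every build_* method, which can
# declare it explicitly in its signature, e.g.
# def build_model(object_redshift=0.0, **kwargs): ...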
The methods defined in a parameter file to build the data and model are the following (a sketch showing how they are typically combined into a single build_all function is given after the list):
- :py:meth:`build_obs`: This function takes the command line argument dictionary as keyword arguments and returns a list of Observation instances (see :doc:`dataformat`).
- :py:meth:`build_model`: This function takes the command line argument dictionary as keyword arguments and returns an instance of a :py:class:`ProspectorParams` subclass, containing information about the parameters of the model (see :doc:`models`).
- :py:meth:`build_sps`: This function takes the command line argument dictionary as keyword arguments and returns an sps object, which must have the method :py:meth:`get_spectrum` defined. This object generally includes all the spectral libraries and isochrones necessary to build a model, as well as much of the model building code, and as such has a large memory footprint.
- :py:meth:`build_noise`: This function, if present, should add a :py:class:`NoiseModel` object to the spectroscopy and/or photometry. If it is not present, the likelihood will not include covariant noise or jitter and is equivalent to basic :math:`\chi^2`.
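These functions are all user-defined, and the build_all function used in the code blocks on this page is simply a convenience wrapper around them. A minimal sketch, assuming the three-element return used in the first code block above and a build_noise signature that takes the observations and returns them with noise models attached (that signature is an assumption; adjust it to match your own definition):

def build_all(**kwargs):
    # Call each user-defined build method with the full set of command line arguments
    obs = build_obs(**kwargs)
    model = build_model(**kwargs)
    sps = build_sps(**kwargs)
    # Optionally attach noise models to the observations; this signature is assumed
    obs = build_noise(obs, **kwargs)
    return obs, model, sps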
For large galaxy samples we recommend conducting the fit for each object entirely independently, on individual CPU cores. However, for a small number of objects, or during testing, it can be helpful to decrease the elapsed wall time of a single fit. The Message Passing Interface (MPI) can be used to parallelize the fit for a single object over many CPU cores. This reduces the wall time required for a single fit, but does not reduce the total CPU time (and when using dynesty it may actually increase the total CPU usage).
To use MPI a "pool" of cores must be made available; each core will instantiate the fitting ingredients separately, and a single core in the pool will then conduct the fit, distributing likelihood requests to the other cores in the pool. This requires changes to the final code block that instantiates and runs the fit:
if __name__ == "__main__":

    import sys
    import time

    import numpy as np

    from prospect.fitting import fit_model
    from prospect.io import write_results as writer
    from prospect import prospect_args

    # Get the default argument parser
    parser = prospect_args.get_parser()
    # Add custom arguments that control the build methods
    parser.add_argument("--custom_argument_1", ...)

    # Parse the supplied arguments, convert to a dictionary, and add this file for logging purposes
    args = parser.parse_args()
    run_params = vars(args)
    run_params["param_file"] = __file__

    # Build the fit ingredients on each process
    obs, model, sps, noise = build_all(**run_params)
    run_params["sps_libraries"] = sps.ssp.libraries

    # Set up MPI communication
    try:
        import mpi4py
        from mpi4py import MPI
        from schwimmbad import MPIPool

        mpi4py.rc.threads = False
        mpi4py.rc.recv_mprobe = False

        comm = MPI.COMM_WORLD
        size = comm.Get_size()
        withmpi = size > 1
    except ImportError:
        print("Failed to start MPI; are mpi4py and schwimmbad installed? Proceeding without MPI.")
        withmpi = False

    # Evaluate the SPS object over a grid of logzsol in order to get the necessary data
    # into cache/memory on each MPI process.  Otherwise there can be a lag between the
    # MPI tasks caching SSPs, which slows down the parallelization.
    if withmpi and ("logzsol" in model.free_params):
        dummy_obs = dict(filters=None, wavelength=None)

        logzsol_prior = model.config_dict["logzsol"]["prior"]
        lo, hi = logzsol_prior.range
        logzsol_grid = np.around(np.arange(lo, hi, step=0.1), decimals=2)

        sps.update(**model.params)  # make sure we are caching the correct IMF / SFH / etc
        for logzsol in logzsol_grid:
            model.params["logzsol"] = np.array([logzsol])
            _ = model.predict(model.theta, obs=dummy_obs, sps=sps)

    # Fix the SPS object as an argument to the likelihood function so that each
    # process runs its own version of FSPS, with no cross-over memory usage.
    from functools import partial
    from prospect.fitting import lnprobfn
    lnprobfn_fixed = partial(lnprobfn, sps=sps)

    if withmpi:
        run_params["using_mpi"] = True
        with MPIPool() as pool:

            # The dependent (worker) processes will run up to this point in the code
            if not pool.is_master():
                pool.wait()
                sys.exit(0)
            nprocs = pool.size

            # The parent process oversees the fitting
            output = fit_model(obs, model, sps, noise, pool=pool, queue_size=nprocs,
                               lnprobfn=lnprobfn_fixed, **run_params)
    else:
        # Without MPI we don't pass a pool
        output = fit_model(obs, model, sps, noise, lnprobfn=lnprobfn_fixed, **run_params)

    # Set up an output file and write (only the parent process reaches this point)
    ts = time.strftime("%y%b%d-%H.%M", time.localtime())
    hfile = f"{args.outfile}_{ts}_mcmc.h5"
    writer.write_hdf5(hfile,
                      run_params,
                      model,
                      obs,
                      output["sampling"],
                      output["optimization"],
                      sps=sps)

    try:
        hfile.close()
    except AttributeError:
        pass
Then, to run this file using MPI, it can be called from the command line with something like:
mpirun -np <number of processors> python parameter_file.py --emcee
# or
mpirun -np <number of processors> python parameter_file.py --nested_sampler dynesty
Note that only model evaluation is parallelizable with dynesty, and many operations (e.g. new point proposal) are still done in serial. This means that single-core fits will always be more efficient in terms of total CPU usage per fit. A large ratio of live points to processes helps the efficiency; the expected speedup scales as :math:`K \ln(1 + M/K)`, where :math:`M` is the number of processes and :math:`K` is the number of live points.
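For example, with :math:`K = 400` live points, :math:`M = 16` processes give an expected speedup of about :math:`400 \ln(1 + 16/400) \approx 16`, i.e. nearly ideal, whereas :math:`M = 400` processes give a speedup of only :math:`400 \ln 2 \approx 277`.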
For emcee, efficiency is maximized when :math:`K/(M-1)` is an integer greater than or equal to 2, where :math:`M` is the number of processes and :math:`K` is the number of walkers; the wall-time speedup is then approximately :math:`M - 1`.
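For example, a fit with 64 walkers spread over 17 MPI processes (one parent and 16 workers) has :math:`K/(M-1) = 4` and should run roughly 16 times faster than the same fit on a single core.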