The PyMHSS
package provides a set of Python scripts to reproduce the results of simulation experiments and real-world applications presented in Prado, E.B., Nemeth, C. & Sherlock, C. Metropolis-Hastings with Scalable Subsampling. arxiv
(2024).
To install PyMHSS, use pip
or python3 -m pip
:
- Option 1:
pip install PyMHSS
- Option 2:
python3 -m pip install PyMHSS
The repository is organised into three main folders:
- Simulation Experiments: Provides scripts to reproduce figures and tables from both the main paper and supplementary material. For added convenience, script names follow the same sequence as figure and table references in the original text.
- Real-world Applications: Houses datasets and scripts to reproduce results presented in the paper.
- PyMHSS: A Python package that enables running all results mentioned in the paper.
Note
In each folder, the scripts are named following the order of the figures/tables of the MH-SS paper on arxiv.
-
01_Figure_1.py
: acceptance rates and ESS per second for SMH-1, Tuna and RWM. The results are based on synthetic datasets generated from a logistic regression model with$n = 31,622$ observations. -
02_Figure_2.py
: acceptance rates of the proposed method with first-order control variates. For each combination of$\gamma$ ,$n$ and$d$ ,$10$ synthetic datasets are generated from a logistic regression model. -
03_aux_generate_results.py
: generates the simulation results for Figures 3 and 4. -
03_Figures_3_and_4.py
: optimal scaling for the MH-SS algorithms and acceptance rates, where$\alpha = \alpha_1 \times \alpha_2$ , based on a logistic regression target in dimension$d=100$ with$n=30,000$ observations, and with covariates and true coefficients simulated as described in Section 5 of the paper. The efficiency metric (MSJD/E(B)) is plotted as a function of the scaling parameter ($\lambda$ ) and the empirical acceptance rate. -
04_aux_generate_results.py
: generates the simulation results for Figures 5, 6, 7, 8 and 9. -
04_Figures_5_6_7_8_9.py
: ESS per second, ESS divided by E(B), average batch size for MH-SS, SMH and RWM for the logistic regression model. For RWM, the average batch size is$n$ . In all figures, both axes are presented in the logarithm base 10. -
05_Table_1.py
: acceptance rate, average batch size, ESS per second and ESS/E(B) for the Poisson regression model applied to synthetic data in$d=30$ . -
06_Figure_10.py
: Acceptance rates and mean squared jumping distance (MSJD) over E(B) based on a simulation experiment with a logistic regression model with dimension$d = 100$ and$n=100,000$ .
Ideally, we would have liked to have all datasets analysed in the paper within this repository. However, some of them are too big to be stored here. The US census dataset is the only one provided in this repository. The datasets detection of gas mixtures and high-energy particle physics can be found in the UCI machine learning repository; see the links below.
-
usa_pop_survey_data.zip
: The 2018 United States Current Population Survey is a monthly household survey carried out by the U.S. Census Bureau and the U.S. Bureau of Labor Statistics that gathers information on the labour force for the population of the U.S. The data contain variables such as income, education, occupation, participation in welfare programs and health insurance. -
usa_pop_survey.py
: We used a sample of$n = 500,000$ survey participants to model whether or not an individual has a total pre-tax personal income above$25,000$ U.S. dollars for the previous year, based on$10$ predictors.
-
gas_sensor.py
: For a sample of size$n = 250,000$ , we predict whether the concentration of Ethylene in the air, measured in parts per million, is above zero. Though we consider$d = 7$ continuous predictors only (including an intercept), each of which corresponds to a different model sensor, some of them are highly correlated. -
data
: The dataset used in the analysis can be downloaded in this link: detection of gas mixtures.
-
hepmass.py
: For a sample of$n = 1, 000, 000$ observations from the training set, we analyse whether a new particle of unknown mass is observed. The dataset has originally$26$ continuous predictors and is split into a training set of$7$ million observations and a test set of$3.5$ million. -
data
: The dataset contains information about signatures of exotic particles obtained from a high-energy physics experiment. The binary response variable indicates whether a new particle of unknown mass is observed. The dataset can be found in this link: high-energy particle physics.
-
road_casualties_UK.py
: We analyse the consolidated UK road casualties data from$2020$ to$2022$ . We aim to model at the individual level of$n = 298,290$ accidents the number of casualties based on 8 predictors, of which 2 are continuous. In total, the linear predictor in our Poisson model with mean$\log(1 + e^{\eta})$ has$d=28$ parameters (including the intercept). -
data
: The UK Department for Transport publishes annual road safety statistics as part of the Statistics and Registration Service Act 2007. The data include the accident's geographical coordinates, severity, speed limit of the road where the accident took place, details about the vehicles involved, weather conditions, road conditions, as well as time and date. The road casualties dataset can be downloaded from the R packagestats19
asinstall.packages('stats19') library(stats19) if(curl::has_internet()) { dl_stats19(year = 2020, type = "casualty") dl_stats19(year = 2021, type = "casualty") dl_stats19(year = 2021, type = "casualty") casualties_2020 = read_casualties(year = 2020) casualties_2021 = read_casualties(year = 2021) casualties_2022 = read_casualties(year = 2022) }
The PyMHSS
package implements several Markov Chain Monte Carlo (MCMC) algorithms:
- Metropolis-Hastings with Scalable Subsampling (MH-SS)
- Random-walk Metropolis (RWM)
- Scalable Metropolis-Hastings (SMH)
- TunaMH
The main functions of PyMHSS
can be found within the main > PyMHSS > algorithms.py
file.
We implemented all algorithms using two primary strategies:
- For-loop operations: This approach is generally slower but allows for a fair comparison among algorithms. It was used for results presented in the main paper.
- Vectorised operations: A faster strategy that can be more efficient, especially for large-scale simulations.
As detailed in the appendices of our paper, we compared the performance of both strategies and opted to use for-loop operations to maintain fairness in comparing algorithm runtimes. This decision was made due to the design of the SMH algorithm, which rejects proposal based on point-wise evaluations of the likelihood.