Issue/232/remove delight (separate packages for dependency-heavy/slow Estimators) (#233)

* move to namespace-type packaging

* fix versioning

* remove bpz and flex from setup

* updated evaluation demo nb

* remove __main__ as that is now provided by ceci, added pragma to rail/estimation/algos/knnpz.py

* remove unneeded data files

* remove stuff for pipelines that should go in a second PR

* fix readthedocs config

* added syntactic sugar to rail.core.stage

* fix up rail/core/stage

* added makedirs call to DataHandle.write

* revert name

* full coverage, and delinting

* Doing float comparison using np.allclose

* added stuff to docs, and fix up docs

* put test data back

* remove unnecessary files

* fix docs

* change package distribution name

* fix setup.py for twine

* update workflows

* added minisom dep

* move qp to core_extras until it is on pypi

* switch to temp versions of qp and hyperbolic

* Added dummy file to examples/estimation/data/AB to make sure directory gets created

* fix typo

* change to qp-prob

* update docs to get the right qp

* oops

* Address Jeremy and Max's comments

Co-authored-by: John Franklin Crenshaw <41785729+jfcrenshaw@users.noreply.github.com>
Co-authored-by: sschmidt23 <sschmidt@physics.ucdavis.edu>
3 people authored Sep 8, 2022
1 parent ec3d79c commit 3921993
Showing 78 changed files with 814 additions and 1,910 deletions.
4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8] # flake8 is our last non-pyproject.toml holdout...
max-line-length = 110
max-doc-length = 79
extend-ignore = E203
10 changes: 1 addition & 9 deletions .github/workflows/main.yml
@@ -28,18 +28,10 @@ jobs:
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install Delight with Cython setup
      run: |
        python -m pip install --upgrade pip
        python -m pip install cython numpy pytest pylint scipy matplotlib coveralls astropy pycodestyle sphinx
        python -m pip install git+https://github.com/LSSTDESC/Delight.git
        #python setup.py build_ext --inplace
        #python setup.py install
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install .[Full]
        pip install git+https://github.com/LSSTDESC/qp.git
        pip install .[all]
        pip install flake8 pytest pytest-cov mockmpi pytest-timeout
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Test with pytest
24 changes: 24 additions & 0 deletions .github/workflows/pypi.yaml
@@ -0,0 +1,24 @@
name: Upload Python Package

on:
  release:
    types: [created]

jobs:
  deploy:
    runs-on: ubuntu-20.04

    steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v2
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel twine
    - name: Build and publish
      env:
        TWINE_USERNAME: __token__
        TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
      run: |
        python setup.py sdist bdist_wheel
        twine upload dist/*
2 changes: 1 addition & 1 deletion .gitignore
@@ -11,7 +11,7 @@ __pycache__/

# Distribution / packaging
.Python
rail/_version.py
rail/core/_version.py
build/
develop-eggs/
dist/
8 changes: 6 additions & 2 deletions clean_nb_cruft.sh
@@ -5,15 +5,19 @@
\rm -rf examples/core/*.out
\rm -rf examples/core/*.pq
\rm -rf examples/core/*.hdf5
\rm -rf examples/core/knnpz.pkl
\rm -rf examples/core/pipe_saved*.yml
\rm -rf examples/creation/output_*
\rm -rf examples/creation/output*
\rm -rf examples/estimation/TEMPZFILE.out
\rm -rf examples/estimation/*.hdf5
\rm -rf examples/estimation/demo_knn.pkl
\rm -rf examples/estimation/tmp
\rm -rf examples/estimation/output*.fits
\rm -rf examples/estimation/output*
\rm -rf examples/evaluation/output*
\rm -rf examples/goldenspike/output_*
\rm -rf examples/goldenspike/*.out
\rm -rf examples/goldenspike/*.pkl
\rm -rf examples/goldenspike/single_*.hdf5
\rm -rf examples/goldenspike/tmp_*.yml
\rm -rf model.tmp model_train_z.tmp output_*.fits
\rm -rf parametersTest.cfg
2 changes: 1 addition & 1 deletion do_cover.sh
@@ -1,2 +1,2 @@
\rm -rf rail/estimation/data/AB/E*.AB rail/estimation/data/AB/I*.AB rail/estimation/data/AB/S*.AB rail/estimation/data/AB/s*.AB
python -m pytest --cov=./rail --cov-report=html tests
python -m pytest --cov-branch --cov=./rail --cov-report=html tests
4 changes: 2 additions & 2 deletions docs/conf.py
@@ -35,7 +35,7 @@
author = 'LSST DESC RAIL Contributors'

# The short X.Y version
from rail import _version
from rail.core import _version
version = "%i.%i" % (_version.version_tuple[0], _version.version_tuple[1])
# The full version, including alpha/beta/rc tags
release = _version.version
@@ -200,7 +200,7 @@ def run_apidoc(_):
    cur_dir = os.path.normpath(os.path.dirname(__file__))
    output_path = os.path.join(cur_dir, 'api')
    modules = os.path.normpath(os.path.join(cur_dir, "../rail"))
    paramlist = ['--separate', '--no-toc', '-f', '-M', '-o', output_path, modules]
    paramlist = ['--separate', '--implicit-namespaces', '--no-toc', '-f', '-M', '-o', output_path, modules]
    apidoc_main(paramlist)

def setup(app):
16 changes: 14 additions & 2 deletions docs/doc-config.ini
@@ -3,8 +3,20 @@ core
creation
estimation
evaluation
summarization

DEMO
../examples/creation/basic-creation-demo.ipynb
../examples/core/FileIO_DataStore.ipynb
../examples/core/Pipe_Example.ipynb
../examples/core/Run_Pipe.ipynb
../examples/core/hyperbolic_magnitude_test.ipynb
../examples/core/iterator_test.ipynb
../examples/creation/degradation-demo.ipynb
../examples/creation/example_GridSelection_for_HSC.ipynb
../examples/creation/example_SpecSelection_for_zCOSMOS.ipynb
../examples/creation/posterior-demo.ipynb
../examples/estimation/NZDir.ipynb
#../examples/estimation/RAIL_estimation_demo.ipynb
../examples/estimation/test_sampled_summarizers.ipynb
#../examples/evaluation/demo.ipynb
../examples/goldenspike/goldenspike.ipynb

7 changes: 4 additions & 3 deletions docs/source/citing.rst
@@ -37,12 +37,13 @@ Code references:
| `Leistedt & Hogg (2017) <https://ui.adsabs.harvard.edu/abs/2017ApJ...838....5L/abstract>`_
| FlexZBoost:
| `Izbicki & Lee (2017) <https://projecteuclid.org/journals/electronic-journal-of-statistics/volume-11/issue-2/Converting-high-dimensional-regression-to-high-dimensional-conditional-density-estimation/10.1214/17-EJS1302.full>`_
`Dalmasso et al (2020) <https://ui.adsabs.harvard.edu/abs/2020A%26C....3000362D/abstract>`_
| `Izbicki & Lee (2017)
<https://projecteuclid.org/journals/electronic-journal-of-statistics/volume-11/issue-2/Converting-high-dimensional-regression-to-high-dimensional-conditional-density-estimation/10.1214/17-EJS1302.full>`_
| `Dalmasso et al (2020) <https://ui.adsabs.harvard.edu/abs/2020A%26C....3000362D/abstract>`_
| PZFlowPDF:
| J. F. Crenshaw (in prep)
| `Zenodo link <https://zenodo.org/record/6369625#.Ylcpjy-cYW8>`_
| trainZ:
| `Schmidt, Malz et al (2020) <https://ui.adsabs.harvard.edu/abs/2020MNRAS.499.1587S/abstract>`_
213 changes: 212 additions & 1 deletion docs/source/contributing.rst
@@ -2,7 +2,21 @@
Contributing
************

The RAIL repository uses an issue-branch-review workflow.
Where to contribute: RAIL packages
==================================

As with installation, where you contribute depends on what you want to do: you will be contributing to one or more of the RAIL packages. Given the package structure, we imagine three main use cases for contributions:

1. If you are contributing to the core code base, or developing an algorithm that has minimal dependencies, you will probably only be contributing to RAIL, and only need to install the source code for RAIL.
2. If you are contributing a new algorithm that depends on a number of other packages beyond numpy, scipy and sklearn, you will probably be making a new rail_<algorithm> package (a packaging sketch follows this list), and eventually adding to the dependencies in rail_hub.
3. If you are using existing algorithms to do studies and build analysis pipelines for those studies, you will probably only be contributing to rail_pipelines.
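
For case 2, the new package would plug into the shared `rail` namespace. The following is a minimal sketch under stated assumptions: the distribution names and module layout below are hypothetical illustrations, not the actual RAIL packaging.

.. code-block:: python

    # Hypothetical setup.py for a new rail_<algorithm> package.
    # All names here are illustrative assumptions, not RAIL's real packaging.
    from setuptools import setup, find_namespace_packages

    setup(
        name='pz-rail-myalgo',  # hypothetical distribution name
        # Pick up rail.estimation.algos.myalgo and friends without
        # claiming ownership of the shared 'rail' namespace itself.
        packages=find_namespace_packages(include=['rail.*']),
        install_requires=['pz-rail'],  # assumed name of the core RAIL distribution
    )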



Contribution workflow
---------------------

The RAIL and rail_<xxx> repositories use an issue-branch-review workflow.
When you identify something that should be done, `make an issue <https://github.com/LSSTDESC/RAIL/issues/new>`_
for it.
We ask that if applicable and you are comfortable doing so, you add labels to the issue to
@@ -31,3 +45,200 @@ When you're ready to merge your branch into the `main` branch,
Once the changes have been approved, you can merge and squash the pull request as well as close its corresponding issue by putting `closes #[#]` in the comment closing the pull request.

To review a pull request, it's a good idea to start by pulling the changes and running the unit tests (see above). If there are no problems with that, you can make suggestions for optional improvements (e.g. adding a one-line comment before a clever block of code or including a demonstration of new functionality in the example notebooks) or request necessary changes (e.g. including an exception for an edge case that will break the code or separating out code that's repeated in multiple places).



Adding a new Rail Stage
=======================

To make it easier to eventually run RAIL algorithms at scale, all of the various algorithms are implemented as `RailStage` python classes. A `RailStage` is intended to take a particular set of inputs and configuration parameters, run a single bit of analysis, and produce one or more output files. The inputs, outputs
and configuration parameters are all defined in particular ways to allow `RailStage` objects to be integrated into larger data analysis pipelines.

Here is an example of a very simple `RailStage`.


.. code-block:: python

    class ColumnMapper(RailStage):
        """Utility stage that remaps the names of columns.

        Notes
        -----
        1. This operates on pandas dataframes in parquet files.

        2. In short, this does:
           `output_data = input_data.rename(columns=self.config.columns, inplace=self.config.inplace)`
        """
        name = 'ColumnMapper'
        config_options = RailStage.config_options.copy()
        config_options.update(chunk_size=100_000, columns=dict, inplace=False)
        inputs = [('input', PqHandle)]
        outputs = [('output', PqHandle)]

        def __init__(self, args, comm=None):
            RailStage.__init__(self, args, comm=comm)

        def run(self):
            data = self.get_data('input', allow_missing=True)
            out_data = data.rename(columns=self.config.columns, inplace=self.config.inplace)
            if self.config.inplace:  #pragma: no cover
                out_data = data
            self.add_data('output', out_data)

        def __call__(self, data: pd.DataFrame) -> pd.DataFrame:
            """Return a table with the column names changed

            Parameters
            ----------
            data : pd.DataFrame
                The data to be renamed

            Returns
            -------
            pd.DataFrame
                The table with remapped column names
            """
            self.set_data('input', data)
            self.run()
            return self.get_handle('output')
This particular example has all of the required pieces and almost nothing else. The required pieces, in the order in which they appear, are:

1. The line `class ColumnMapper(RailStage):` defines a class called `ColumnMapper` and specifies that it inherits from `RailStage`.

2. The `name = 'ColumnMapper'` line is required, and the name should match the class name.

3. The `config_options` lines define the configuration parameters for this class, as well as their default values. Note that here we are copying the configuration parameters from the `RailStage` as well as defining some new ones.

4. The `inputs = [('input', PqHandle)]` and `outputs = [('output', PqHandle)]` define the inputs and outputs, and the expected data types for those, in this case Parquet files.

5. The `__init__` method does any class-specific initialization. In this case there isn't any, and the method is superfluous.

6. The `run()` method does the actual work. Note that it doesn't take any arguments, that it uses the methods `self.get_data()` and `self.add_data()` to access the input data and set the output data, and that it uses `self.config` to access the configuration parameters.

7. The `__call__()` method provides an interface for interactive use. It provides a way to pass in data (and, in other cases, configuration parameters) to the class so that they can be used in the `run()` method; a short usage sketch follows this list.
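
A stage like this can then be used interactively. The following is a minimal sketch (assuming the `DataStore` has been set up as in the RAIL demo notebooks; the table and column names are invented for illustration):

.. code-block:: python

    import pandas as pd

    # Illustrative input table; the column names are made up.
    data = pd.DataFrame({'mag_u_lsst': [24.1, 25.3],
                         'mag_g_lsst': [23.8, 24.9]})

    # make_stage() (provided via ceci) builds a configured instance.
    mapper = ColumnMapper.make_stage(name='mapper',
                                     columns={'mag_u_lsst': 'u',
                                              'mag_g_lsst': 'g'})

    # __call__ stores the input, runs the stage, and returns a
    # DataHandle wrapping the renamed table.
    renamed = mapper(data)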


Here is an example of a slightly more complicated `RailStage`.


.. code-block:: python

    class NaiveStack(PZSummarizer):
        """Summarizer which simply stacks the individual p(z) distributions
        """
        name = 'NaiveStack'
        config_options = PZSummarizer.config_options.copy()
        config_options.update(zmin=Param(float, 0.0, msg="The minimum redshift of the z grid"),
                              zmax=Param(float, 3.0, msg="The maximum redshift of the z grid"),
                              nzbins=Param(int, 301, msg="The number of gridpoints in the z grid"),
                              seed=Param(int, 87, msg="random seed"),
                              nsamples=Param(int, 1000, msg="Number of sample distributions to create"))
        outputs = [('output', QPHandle),
                   ('single_NZ', QPHandle)]

        def __init__(self, args, comm=None):
            PZSummarizer.__init__(self, args, comm=comm)
            self.zgrid = None

        def run(self):
            rng = np.random.default_rng(seed=self.config.seed)
            test_data = self.get_data('input')
            self.zgrid = np.linspace(self.config.zmin, self.config.zmax, self.config.nzbins + 1)
            pdf_vals = test_data.pdf(self.zgrid)
            yvals = np.expand_dims(np.sum(np.where(np.isfinite(pdf_vals), pdf_vals, 0.), axis=0), 0)
            qp_d = qp.Ensemble(qp.interp, data=dict(xvals=self.zgrid, yvals=yvals))
            bvals = np.empty((self.config.nsamples, len(self.zgrid)))
            for i in range(self.config.nsamples):
                bootstrap_draws = rng.integers(low=0, high=test_data.npdf, size=test_data.npdf)
                bvals[i] = np.sum(pdf_vals[bootstrap_draws], axis=0)
            sample_ens = qp.Ensemble(qp.interp, data=dict(xvals=self.zgrid, yvals=bvals))
            self.add_data('output', sample_ens)
            self.add_data('single_NZ', qp_d)
The main difference in this new class is that it inherits from `PZSummarizer`, a `RailStage` sub-class. A `PZSummarizer` takes an ensemble of p(z) distributions for many objects and summarizes them into a single `n(z)` distribution for that ensemble.

A few things to note:

1. We copy the configuration parameters for `PZSummarizer` and then add additional ones.

2. The `run()` method is implemented here, but the function for interactive use, `summarize()`, is actually defined in `PZSummarizer` (see the sketch after this list).

3. While we define the `outputs` here, we just use the `inputs` as defined in `PZSummarizer`.
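
As a sketch of interactive use (hedged: this assumes the input is a `qp` Ensemble of per-object p(z), as in the RAIL estimation demos, and the toy data here are invented):

.. code-block:: python

    import numpy as np
    import qp

    # Toy ensemble of Gaussian p(z) for 100 objects; purely illustrative.
    rng = np.random.default_rng(42)
    ens = qp.Ensemble(qp.stats.norm,
                      data=dict(loc=rng.uniform(0.2, 1.5, size=(100, 1)),
                                scale=np.full((100, 1), 0.05)))

    stacker = NaiveStack.make_stage(name='naive_stack',
                                    zmin=0.0, zmax=3.0, nzbins=301)

    # summarize() is inherited from PZSummarizer: it sets the input,
    # calls run(), and returns a handle to the bootstrap-sampled n(z).
    nz_handle = stacker.summarize(ens)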



Adding a new Rail Pipeline
==========================

Here is an example of the first part of the `goldenspike` pipeline definition.



.. code-block:: python

    class GoldenspikePipeline(RailPipeline):

        def __init__(self):
            RailPipeline.__init__(self)

            DS = RailStage.data_store
            DS.__class__.allow_overwrite = True
            bands = ['u','g','r','i','z','y']
            band_dict = {band:f'mag_{band}_lsst' for band in bands}
            rename_dict = {f'mag_{band}_lsst_err':f'mag_err_{band}_lsst' for band in bands}

            self.flow_engine_train = FlowEngine.build(
                flow=flow_file,
                n_samples=50,
                seed=1235,
                output=os.path.join(namer.get_data_dir(DataType.catalog, CatalogType.created), "output_flow_engine_train.pq"),
            )

            self.lsst_error_model_train = LSSTErrorModel.build(
                connections=dict(input=self.flow_engine_train.io.output),
                bandNames=band_dict, seed=29,
                output=os.path.join(namer.get_data_dir(DataType.catalog, CatalogType.degraded), "output_lsst_error_model_train.pq"),
            )

            self.inv_redshift = InvRedshiftIncompleteness.build(
                connections=dict(input=self.lsst_error_model_train.io.output),
                pivot_redshift=1.0,
                output=os.path.join(namer.get_data_dir(DataType.catalog, CatalogType.degraded), "output_inv_redshift.pq"),
            )

            self.line_confusion = LineConfusion.build(
                connections=dict(input=self.inv_redshift.io.output),
                true_wavelen=5007., wrong_wavelen=3727., frac_wrong=0.05,
                output=os.path.join(namer.get_data_dir(DataType.catalog, CatalogType.degraded), "output_line_confusion.pq"),
            )
What this code is doing:

1. Defining a class `GoldenspikePipeline` to encapsulate the pipeline and setting up that pipeline.

2. Setting up the RAIL `DataStore` for interactive use, allowing you to overwrite output files (say, if you re-run the pipeline in a notebook cell).

3. Defining some common parameters, e.g., `bands` and `band_dict`, for the pipeline.

4. Defining four stages and adding them to the pipeline. Note that for each stage the syntax is more or less the same. We have to define:

   1. The name of the stage, i.e., `self.flow_engine_train` will make a stage called `flow_engine_train` through some Python cleverness.

   2. The class of the stage, which is specified by which type of stage we ask to build; `FlowEngine.build` will make a `FlowEngine` stage.

   3. Any configuration parameters, which are specified as keyword arguments, e.g., `n_samples=50`.

   4. Any input connections from other stages, e.g., `connections=dict(input=self.flow_engine_train.io.output)` in the `self.lsst_error_model_train` block will connect the `output` of `self.flow_engine_train` to the `input` of `self.lsst_error_model_train`. Later in that example we can see how to connect multiple inputs, e.g., one named `input` and another named `model`, as required for an estimator stage.

5. We use the `namer` class and enumerations to ensure that the data files end up following our location conventions.
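
Once defined, a pipeline like this can be instantiated, saved to a ceci YAML file, and run. The following is a hedged sketch modeled on the goldenspike example (the `initialize` arguments and file names here are assumptions, not a fixed API):

.. code-block:: python

    import ceci

    pipe = GoldenspikePipeline()

    # initialize() takes the overall pipeline inputs, a run configuration,
    # and an optional per-stage configuration file.
    pipe.initialize(dict(flow=flow_file),
                    dict(output_dir='.', log_dir='.', resume=False),
                    None)

    # Write the pipeline out as YAML, then read it back and run it.
    pipe.save('tmp_goldenspike.yml')
    pr = ceci.Pipeline.read('tmp_goldenspike.yml')
    pr.run()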
