Issue/232/remove delight (separate packages for dependency-heavy/slow Estimators) (#233)

* move to namespace-type packaging

* fix versioning

* remove bpz and flex from setup

* updated evaluation demo nb

* remove __main__ as that is now provided by ceci, added pragma to rail/estimation/algos/knnpz.py

* remove unneeded data files

* remove stuff for pipelines that should go in a second PR

* fix readthedocs config

* added syntactic sugar to rail.core.stage

* fix up rail/core/stage

* added makedirs call to DataHandle.write

* revert name

* full coverage, and delinting

* Doing float comparison using np.allclose

* added stuff to docs, and fix up docs

* put test data back

* remove unnecessary files

* fix docs

* change package distribution name

* fix setup.py for twine

* update workflows

* added minisom dep

* move qp to core_extras until it is on pypi

* switch to temp versions of qp and hyperbolic

* Added dummy file to examples/estimation/data/AB to make sure directory gets created

* fix typo

* change to qp-prob

* update docs to get the right qp

* oops

* Address Jeremy and Max's comments

Co-authored-by: John Franklin Crenshaw <41785729+jfcrenshaw@users.noreply.github.com>
Co-authored-by: sschmidt23 <sschmidt@physics.ucdavis.edu>
3 people authored Sep 8, 2022
1 parent ec3d79c commit 3921993
Showing 78 changed files with 814 additions and 1,910 deletions.
4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8] # flake8 is our last non-pyproject.toml holdout...
max-line-length = 110
max-doc-length = 79
extend-ignore = E203
10 changes: 1 addition & 9 deletions .github/workflows/main.yml
@@ -28,18 +28,10 @@ jobs:
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install Delight with Cython setup
      run: |
        python -m pip install --upgrade pip
        python -m pip install cython numpy pytest pylint scipy matplotlib coveralls astropy pycodestyle sphinx
        python -m pip install git+https://github.com/LSSTDESC/Delight.git
        #python setup.py build_ext --inplace
        #python setup.py install
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install .[Full]
        pip install git+https://github.com/LSSTDESC/qp.git
        pip install .[all]
        pip install flake8 pytest pytest-cov mockmpi pytest-timeout
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Test with pytest
24 changes: 24 additions & 0 deletions .github/workflows/pypi.yaml
@@ -0,0 +1,24 @@
name: Upload Python Package

on:
  release:
    types: [created]

jobs:
  deploy:
    runs-on: ubuntu-20.04

    steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v2
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel twine
    - name: Build and publish
      env:
        TWINE_USERNAME: __token__
        TWINE_PASSWORD: ${{ secrets.PYPI_API_TOKEN }}
      run: |
        python setup.py sdist bdist_wheel
        twine upload dist/*
2 changes: 1 addition & 1 deletion .gitignore
@@ -11,7 +11,7 @@ __pycache__/

# Distribution / packaging
.Python
rail/_version.py
rail/core/_version.py
build/
develop-eggs/
dist/
8 changes: 6 additions & 2 deletions clean_nb_cruft.sh
@@ -5,15 +5,19 @@
\rm -rf examples/core/*.out
\rm -rf examples/core/*.pq
\rm -rf examples/core/*.hdf5
\rm -rf examples/core/knnpz.pkl
\rm -rf examples/core/pipe_saved*.yml
\rm -rf examples/creation/output_*
\rm -rf examples/creation/output*
\rm -rf examples/estimation/TEMPZFILE.out
\rm -rf examples/estimation/*.hdf5
\rm -rf examples/estimation/demo_knn.pkl
\rm -rf examples/estimation/tmp
\rm -rf examples/estimation/output*.fits
\rm -rf examples/estimation/output*
\rm -rf examples/evaluation/output*
\rm -rf examples/goldenspike/output_*
\rm -rf examples/goldenspike/*.out
\rm -rf examples/goldenspike/*.pkl
\rm -rf examples/goldenspike/single_*.hdf5
\rm -rf examples/goldenspike/tmp_*.yml
\rm -rf model.tmp model_train_z.tmp output_*.fits
\rm -rf parametersTest.cfg
2 changes: 1 addition & 1 deletion do_cover.sh
@@ -1,2 +1,2 @@
\rm -rf rail/estimation/data/AB/E*.AB rail/estimation/data/AB/I*.AB rail/estimation/data/AB/S*.AB rail/estimation/data/AB/s*.AB
python -m pytest --cov=./rail --cov-report=html tests
python -m pytest --cov-branch --cov=./rail --cov-report=html tests
4 changes: 2 additions & 2 deletions docs/conf.py
@@ -35,7 +35,7 @@
author = 'LSST DESC RAIL Contributors'

# The short X.Y version
from rail import _version
from rail.core import _version
version = "%i.%i" % (_version.version_tuple[0], _version.version_tuple[1])
# The full version, including alpha/beta/rc tags
release = _version.version
@@ -200,7 +200,7 @@ def run_apidoc(_):
    cur_dir = os.path.normpath(os.path.dirname(__file__))
    output_path = os.path.join(cur_dir, 'api')
    modules = os.path.normpath(os.path.join(cur_dir, "../rail"))
    paramlist = ['--separate', '--no-toc', '-f', '-M', '-o', output_path, modules]
    paramlist = ['--separate', '--implicit-namespaces', '--no-toc', '-f', '-M', '-o', output_path, modules]
    apidoc_main(paramlist)

def setup(app):
16 changes: 14 additions & 2 deletions docs/doc-config.ini
@@ -3,8 +3,20 @@ core
creation
estimation
evaluation
summarization

DEMO
../examples/creation/basic-creation-demo.ipynb
../examples/core/FileIO_DataStore.ipynb
../examples/core/Pipe_Example.ipynb
../examples/core/Run_Pipe.ipynb
../examples/core/hyperbolic_magnitude_test.ipynb
../examples/core/iterator_test.ipynb
../examples/creation/degradation-demo.ipynb
../examples/creation/example_GridSelection_for_HSC.ipynb
../examples/creation/example_SpecSelection_for_zCOSMOS.ipynb
../examples/creation/posterior-demo.ipynb
../examples/estimation/NZDir.ipynb
#../examples/estimation/RAIL_estimation_demo.ipynb
../examples/estimation/test_sampled_summarizers.ipynb
#../examples/evaluation/demo.ipynb
../examples/goldenspike/goldenspike.ipynb

7 changes: 4 additions & 3 deletions docs/source/citing.rst
@@ -37,12 +37,13 @@ Code references:
| `Leistedt & Hogg (2017) <https://ui.adsabs.harvard.edu/abs/2017ApJ...838....5L/abstract>`_
| FlexZBoost:
| `Izbicki & Lee (2017) <https://projecteuclid.org/journals/electronic-journal-of-statistics/volume-11/issue-2/Converting-high-dimensional-regression-to-high-dimensional-conditional-density-estimation/10.1214/17-EJS1302.full>`_
`Dalmasso et al (2020) <https://ui.adsabs.harvard.edu/abs/2020A%26C....3000362D/abstract>`_
| `Izbicki & Lee (2017)
<https://projecteuclid.org/journals/electronic-journal-of-statistics/volume-11/issue-2/Converting-high-dimensional-regression-to-high-dimensional-conditional-density-estimation/10.1214/17-EJS1302.full>`_
| `Dalmasso et al (2020) <https://ui.adsabs.harvard.edu/abs/2020A%26C....3000362D/abstract>`_
| PZFlowPDF:
| J. F. Crenshaw (in prep)
| `Zenodo link <https://zenodo.org/record/6369625#.Ylcpjy-cYW8>`_
| trainZ:
| `Schmidt, Malz et al (2020) <https://ui.adsabs.harvard.edu/abs/2020MNRAS.499.1587S/abstract>`_
213 changes: 212 additions & 1 deletion docs/source/contributing.rst
@@ -2,7 +2,21 @@
Contributing
************

The RAIL repository uses an issue-branch-review workflow.
Where to contribute: RAIL packages
==================================

As with installation, where you contribute depends on what you want to do: you will be contributing to one or more of the RAIL packages. Given the package structure, we imagine three main use cases for contributions:

1. If you are contributing to the core code base, or developing an algorithm that has minimal dependencies, you will probably only be contributing to RAIL, and only need to install the source code for RAIL.
2. If you are contributing a new algorithm that depends on a number of other packages beyond numpy, scipy and sklearn, you will probably be making a new rail_<algorithm> package (a packaging sketch follows this list), and eventually adding to the dependencies in rail_hub.
3. If you are using existing algorithms to do studies and build analysis pipelines for those studies, you will probably only be contributing to rail_pipelines.
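
For case 2, the new package would plug into the shared `rail` namespace. The following is a minimal sketch under stated assumptions: the distribution names and module layout below are hypothetical illustrations, not the actual RAIL packaging.

.. code-block:: python

    # Hypothetical setup.py for a new rail_<algorithm> package.
    # All names here are illustrative assumptions, not RAIL's real packaging.
    from setuptools import setup, find_namespace_packages

    setup(
        name='pz-rail-myalgo',  # hypothetical distribution name
        # Pick up rail.estimation.algos.myalgo and friends without
        # claiming ownership of the shared 'rail' namespace itself.
        packages=find_namespace_packages(include=['rail.*']),
        install_requires=['pz-rail'],  # assumed name of the core RAIL distribution
    )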



Contribution workflow
---------------------

The RAIL and rail_<xxx> repositories use an issue-branch-review workflow.
When you identify something that should be done, `make an issue <https://github.com/LSSTDESC/RAIL/issues/new>`_
for it.
We ask that if applicable and you are comfortable doing so, you add labels to the issue to
@@ -31,3 +45,200 @@ When you're ready to merge your branch into the `main` branch,
Once the changes have been approved, you can merge and squash the pull request as well as close its corresponding issue by putting `closes #[#]` in the comment closing the pull request.

To review a pull request, it's a good idea to start by pulling the changes and running the unit tests (see above). If there are no problems with that, you can make suggestions for optional improvements (e.g. adding a one-line comment before a clever block of code or including a demonstration of new functionality in the example notebooks) or request necessary changes (e.g. including an exception for an edge case that will break the code or separating out code that's repeated in multiple places).



Adding a new Rail Stage
=======================

To make it easier to eventually run RAIL algorithms at scale, all of the various algorithms are implemented as `RailStage` python classes. A `RailStage` is intended to take a particular set of inputs and configuration parameters, run a single bit of analysis, and produce one or more output files. The inputs, outputs
and configuration parameters are all defined in particular ways to allow `RailStage` objects to be integrated into larger data analysis pipelines.

Here is an example of a very simple `RailStage`.


.. code-block:: python

    class ColumnMapper(RailStage):
        """Utility stage that remaps the names of columns.

        Notes
        -----
        1. This operates on pandas dataframes in parquet files.

        2. In short, this does:
           `output_data = input_data.rename(columns=self.config.columns, inplace=self.config.inplace)`
        """
        name = 'ColumnMapper'
        config_options = RailStage.config_options.copy()
        config_options.update(chunk_size=100_000, columns=dict, inplace=False)
        inputs = [('input', PqHandle)]
        outputs = [('output', PqHandle)]

        def __init__(self, args, comm=None):
            RailStage.__init__(self, args, comm=comm)

        def run(self):
            data = self.get_data('input', allow_missing=True)
            out_data = data.rename(columns=self.config.columns, inplace=self.config.inplace)
            if self.config.inplace:  #pragma: no cover
                out_data = data
            self.add_data('output', out_data)

        def __call__(self, data: pd.DataFrame) -> pd.DataFrame:
            """Return a table with the column names changed

            Parameters
            ----------
            data : pd.DataFrame
                The data to be renamed

            Returns
            -------
            pd.DataFrame
                The table with remapped column names
            """
            self.set_data('input', data)
            self.run()
            return self.get_handle('output')
This particular example has all of the required pieces and almost nothing else. The required pieces, in the order in which they appear, are:

1. The line `class ColumnMapper(RailStage):` defines a class called `ColumnMapper` and specifies that it inherits from `RailStage`.

2. The `name = 'ColumnMapper'` line is required, and the name should match the class name.

3. The `config_options` lines define the configuration parameters for this class, as well as their default values. Note that here we are copying the configuration parameters from the `RailStage` as well as defining some new ones.

4. The `inputs = [('input', PqHandle)]` and `outputs = [('output', PqHandle)]` define the inputs and outputs, and the expected data types for those, in this case Parquet files.

5. The `__init__` method does any class-specific initialization. In this case there isn't any, and the method is superfluous.

6. The `run()` method does the actual work. Note that it doesn't take any arguments, that it uses the methods `self.get_data()` and `self.add_data()` to access the input data and set the output data, and that it uses `self.config` to access the configuration parameters.

7. The `__call__()` method provides an interface for interactive use. It provides a way to pass in data (and, in other cases, configuration parameters) to the class so that they can be used in the `run()` method; a short usage sketch follows this list.
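
A stage like this can then be used interactively. The following is a minimal sketch (assuming the `DataStore` has been set up as in the RAIL demo notebooks; the table and column names are invented for illustration):

.. code-block:: python

    import pandas as pd

    # Illustrative input table; the column names are made up.
    data = pd.DataFrame({'mag_u_lsst': [24.1, 25.3],
                         'mag_g_lsst': [23.8, 24.9]})

    # make_stage() (provided via ceci) builds a configured instance.
    mapper = ColumnMapper.make_stage(name='mapper',
                                     columns={'mag_u_lsst': 'u',
                                              'mag_g_lsst': 'g'})

    # __call__ stores the input, runs the stage, and returns a
    # DataHandle wrapping the renamed table.
    renamed = mapper(data)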


Here is an example of a slightly more complicated `RailStage`.


.. code-block:: python

    class NaiveStack(PZSummarizer):
        """Summarizer which simply stacks the individual p(z) distributions
        """
        name = 'NaiveStack'
        config_options = PZSummarizer.config_options.copy()
        config_options.update(zmin=Param(float, 0.0, msg="The minimum redshift of the z grid"),
                              zmax=Param(float, 3.0, msg="The maximum redshift of the z grid"),
                              nzbins=Param(int, 301, msg="The number of gridpoints in the z grid"),
                              seed=Param(int, 87, msg="random seed"),
                              nsamples=Param(int, 1000, msg="Number of sample distributions to create"))
        outputs = [('output', QPHandle),
                   ('single_NZ', QPHandle)]

        def __init__(self, args, comm=None):
            PZSummarizer.__init__(self, args, comm=comm)
            self.zgrid = None

        def run(self):
            rng = np.random.default_rng(seed=self.config.seed)
            test_data = self.get_data('input')
            self.zgrid = np.linspace(self.config.zmin, self.config.zmax, self.config.nzbins + 1)
            pdf_vals = test_data.pdf(self.zgrid)
            yvals = np.expand_dims(np.sum(np.where(np.isfinite(pdf_vals), pdf_vals, 0.), axis=0), 0)
            qp_d = qp.Ensemble(qp.interp, data=dict(xvals=self.zgrid, yvals=yvals))
            bvals = np.empty((self.config.nsamples, len(self.zgrid)))
            for i in range(self.config.nsamples):
                bootstrap_draws = rng.integers(low=0, high=test_data.npdf, size=test_data.npdf)
                bvals[i] = np.sum(pdf_vals[bootstrap_draws], axis=0)
            sample_ens = qp.Ensemble(qp.interp, data=dict(xvals=self.zgrid, yvals=bvals))
            self.add_data('output', sample_ens)
            self.add_data('single_NZ', qp_d)
The main difference in this new class is that it inherits from `PZSummarizer`, a `RailStage` sub-class. A `PZSummarizer` takes an ensemble of p(z) distributions for many objects and summarizes them into a single `n(z)` distribution for that ensemble.

A few things to note:

1. We copy the configuration parameters for `PZSummarizer` and then add additional ones.

2. The `run()` method is implemented here, but the function for interactive use, `summarize()`, is actually defined in `PZSummarizer` (see the sketch after this list).

3. While we define the `outputs` here, we just use the `inputs` as defined in `PZSummarizer`.
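
As a sketch of interactive use (hedged: this assumes the input is a `qp` Ensemble of per-object p(z), as in the RAIL estimation demos, and the toy data here are invented):

.. code-block:: python

    import numpy as np
    import qp

    # Toy ensemble of Gaussian p(z) for 100 objects; purely illustrative.
    rng = np.random.default_rng(42)
    ens = qp.Ensemble(qp.stats.norm,
                      data=dict(loc=rng.uniform(0.2, 1.5, size=(100, 1)),
                                scale=np.full((100, 1), 0.05)))

    stacker = NaiveStack.make_stage(name='naive_stack',
                                    zmin=0.0, zmax=3.0, nzbins=301)

    # summarize() is inherited from PZSummarizer: it sets the input,
    # calls run(), and returns a handle to the bootstrap-sampled n(z).
    nz_handle = stacker.summarize(ens)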



Adding a new Rail Pipeline
==========================

Here is an example of the first part of the `goldenspike` pipeline definition.



.. code-block:: python

    class GoldenspikePipeline(RailPipeline):

        def __init__(self):
            RailPipeline.__init__(self)

            DS = RailStage.data_store
            DS.__class__.allow_overwrite = True
            bands = ['u','g','r','i','z','y']
            band_dict = {band:f'mag_{band}_lsst' for band in bands}
            rename_dict = {f'mag_{band}_lsst_err':f'mag_err_{band}_lsst' for band in bands}

            self.flow_engine_train = FlowEngine.build(
                flow=flow_file,
                n_samples=50,
                seed=1235,
                output=os.path.join(namer.get_data_dir(DataType.catalog, CatalogType.created), "output_flow_engine_train.pq"),
            )

            self.lsst_error_model_train = LSSTErrorModel.build(
                connections=dict(input=self.flow_engine_train.io.output),
                bandNames=band_dict, seed=29,
                output=os.path.join(namer.get_data_dir(DataType.catalog, CatalogType.degraded), "output_lsst_error_model_train.pq"),
            )

            self.inv_redshift = InvRedshiftIncompleteness.build(
                connections=dict(input=self.lsst_error_model_train.io.output),
                pivot_redshift=1.0,
                output=os.path.join(namer.get_data_dir(DataType.catalog, CatalogType.degraded), "output_inv_redshift.pq"),
            )

            self.line_confusion = LineConfusion.build(
                connections=dict(input=self.inv_redshift.io.output),
                true_wavelen=5007., wrong_wavelen=3727., frac_wrong=0.05,
                output=os.path.join(namer.get_data_dir(DataType.catalog, CatalogType.degraded), "output_line_confusion.pq"),
            )
What this code is doing:

1. Defining a class `GoldenspikePipeline` to encapsulate the pipeline and setting up that pipeline.

2. Setting up the RAIL `DataStore` for interactive use, allowing you to overwrite output files (say, if you re-run the pipeline in a notebook cell).

3. Defining some common parameters, e.g., `bands` and `band_dict`, for the pipeline.

4. Defining four stages and adding them to the pipeline. Note that for each stage the syntax is more or less the same. We have to define:

   1. The name of the stage, i.e., `self.flow_engine_train` will make a stage called `flow_engine_train` through some Python cleverness.

   2. The class of the stage, which is specified by which type of stage we ask to build; `FlowEngine.build` will make a `FlowEngine` stage.

   3. Any configuration parameters, which are specified as keyword arguments, e.g., `n_samples=50`.

   4. Any input connections from other stages, e.g., `connections=dict(input=self.flow_engine_train.io.output)` in the `self.lsst_error_model_train` block will connect the `output` of `self.flow_engine_train` to the `input` of `self.lsst_error_model_train`. Later in that example we can see how to connect multiple inputs, e.g., one named `input` and another named `model`, as required for an estimator stage.

5. We use the `namer` class and enumerations to ensure that the data files end up following our location conventions.
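
Once defined, a pipeline like this can be instantiated, saved to a ceci YAML file, and run. The following is a hedged sketch modeled on the goldenspike example (the `initialize` arguments and file names here are assumptions, not a fixed API):

.. code-block:: python

    import ceci

    pipe = GoldenspikePipeline()

    # initialize() takes the overall pipeline inputs, a run configuration,
    # and an optional per-stage configuration file.
    pipe.initialize(dict(flow=flow_file),
                    dict(output_dir='.', log_dir='.', resume=False),
                    None)

    # Write the pipeline out as YAML, then read it back and run it.
    pipe.save('tmp_goldenspike.yml')
    pr = ceci.Pipeline.read('tmp_goldenspike.yml')
    pr.run()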
