ci(code): upload initial version (#1)
Alputer committed Sep 18, 2024
1 parent 19d69d5 commit 29dd98c
Showing 3 changed files with 209 additions and 0 deletions.
112 changes: 112 additions & 0 deletions README.md
@@ -4,3 +4,115 @@
[![image](https://img.shields.io/badge/discourse-forum-blue.svg)](https://forum.reana.io)
[![image](https://img.shields.io/github/license/reanahub/reana-demo-dask-coffea.svg)](https://github.com/reanahub/reana-demo-dask-coffea/blob/master/LICENSE)
[![image](https://www.reana.io/static/img/badges/launch-on-reana-at-cern.svg)](https://reana.cern.ch/launch?url=https%3A%2F%2Fgithub.com%2Freanahub%2Freana-demo-dask-coffea&specification=reana.yaml&name=reana-demo-dask-coffea)

## About

This [REANA](https://www.reana.io) reproducible analysis example demonstrates how to run a Coffea-based physics analysis whose heavy computations are offloaded to a Dask cluster.

## Analysis structure

Making a research data analysis reproducible essentially means providing "runnable recipes" that address (1) where the input data is, (2) what software was used to analyse the data, (3) which computing environments were used to run the software, and (4) which computational workflow steps were taken to run the analysis. This permits the analysis to be instantiated on a computational cloud and run to obtain (5) the output results.


### 1. Input data

In this example, we use a single input file hosted on CERN's EOS public storage, accessible over the XRootD protocol at the following URL (a short inspection sketch follows):
- `root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root`
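
For a quick look at the input before running the full workflow, the file can be opened directly. The following is a minimal sketch, not part of the analysis itself, assuming `uproot` with XRootD support is installed locally:

```python
import uproot

FILE_URL = "root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root"

with uproot.open(FILE_URL) as f:
    events = f["Events"]  # the TTree processed by the analysis
    print(events.num_entries)  # total number of events in the file
    print(events.keys(filter_name="MET_*"))  # MET-related branches
```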

### 2. Analysis code

The analysis code consists of a single Python file, [`codes/analysis.py`](codes/analysis.py), which connects to a Dask cluster and then runs the analysis.

### 3. Compute environment

In order to be able to rerun the analysis even several years in the future, we need to "encapsulate the current compute environment". We achieve this by running all analysis steps inside a [Docker](https://www.docker.com/) container image.

This example builds on the Coffea framework; the specific container image used here, `coffeateam/coffea-dask-cc7`, can be found [on Docker Hub](https://hub.docker.com/r/coffeateam/coffea-dask-cc7).
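
To examine this environment locally, one can pull the exact tag that the workflow uses; a usage sketch, assuming Docker is installed and that `python` is on the image's default path:

```console
$ docker pull docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
$ docker run --rm docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049 python --version
```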

### 4. Analysis workflow

The analysis workflow is simple and consists of a single step: running `python codes/analysis.py`. Note, however, that the actual analysis is relatively heavy and is parallelised by Dask behind the scenes; as users, we never see the task graphs or the individual parallel steps. (A minimal illustration of this implicit parallelism follows.)
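
To make the hidden parallelism concrete, here is a tiny self-contained sketch of the pattern Dask uses underneath: work is submitted as tasks to a scheduler and partial results are gathered back transparently. The `process_chunk` function is a hypothetical stand-in, not part of the actual analysis.

```python
from dask.distributed import Client

client = Client("tcp://127.0.0.1:8080")  # address of a running Dask scheduler

def process_chunk(chunk_id):
    # Hypothetical stand-in for processing one chunk of events.
    return chunk_id * chunk_id

futures = client.map(process_chunk, range(8))  # eight tasks scheduled in parallel
print(sum(client.gather(futures)))  # partial results merged back into one value
```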

### 5. Output results

The example produces the following MET histogram as its output (`histogram.png`):
![](https://github.com/user-attachments/assets/e52c2391-626d-4556-90ca-75248516cc95)


## Running the example on REANA cloud

There are two ways to execute this analysis example on REANA.

If you would like to simply launch this analysis example on the REANA instance at CERN
and inspect its results using the web interface, please click on the following
badge:

[![Launch with Serial on REANA@CERN badge](https://www.reana.io/static/img/badges/launch-with-serial-on-reana-at-cern.svg)](https://reana.cern.ch/launch?url=https://github.com/reanahub/reana-demo-dask-coffea&specification=reana.yaml&name=reana-demo-dask-coffea)

If you would like a step-by-step guide on how to use the REANA command-line client to
launch this analysis example, please read on.

We start by creating a [reana.yaml](reana.yaml) file describing the above analysis
structure with its inputs, code, runtime environment, computational workflow steps and
expected outputs:

```yaml
version: 0.9.3
inputs:
  files:
    - codes/analysis.py
workflow:
  type: serial
  resources:
    dask:
      image: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
      cores: 2
      memory: "4Gi"
      single_worker_cores: 0.5
      single_worker_memory: "1Gi"
  specification:
    steps:
      - name: mystep
        environment: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
        commands:
          - python codes/analysis.py
outputs:
  files:
    - histogram.png
```

In this example we are using a simple Serial workflow engine to represent our sequential computational workflow steps.
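
Before submitting anything, it is worth checking that the specification parses correctly. A quick sketch using the client's `validate` command (assumes `reana-client` is already installed, as shown below):

```console
$ reana-client validate -f reana.yaml
```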
We can now install the REANA command-line client, run the analysis and download the
resulting plots:
```console
$ # create new virtual environment
$ virtualenv ~/.virtualenvs/reana
$ source ~/.virtualenvs/reana/bin/activate
$ # install REANA client
$ pip install reana-client
$ # connect to some REANA cloud instance
$ export REANA_SERVER_URL=https://reana.cern.ch/
$ export REANA_ACCESS_TOKEN=XXXXXXX
$ # create new workflow
$ reana-client create -n myanalysis
$ export REANA_WORKON=myanalysis
$ # upload input code, data and workflow to the workspace
$ reana-client upload
$ # start computational workflow
$ reana-client start
$ # ... should be finished in about 5 minutes
$ reana-client status
$ # list workspace files
$ reana-client ls
$ # download output results
$ reana-client download
```
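
If the run fails, or you want to inspect what the Dask step printed (for example the cutflow counts), the workflow and job logs can be retrieved with the same client:

```console
$ reana-client logs -w myanalysis
```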

Please see the [REANA-Client](https://reana-client.readthedocs.io/) documentation for a more detailed explanation of typical `reana-client` usage scenarios.
76 changes: 76 additions & 0 deletions codes/analysis.py
@@ -0,0 +1,76 @@
import hist
import coffea.processor as processor
import awkward as ak
from coffea.nanoevents import schemas
import matplotlib.pyplot as plt
from dask.distributed import Client
import os


# This program plots an event-level variable (in this case MET, but switching
# it is as easy as a dict-key change). It also demonstrates a simple use of
# the book-keeping cutflow tool to keep track of the number of events processed.


# The processor class bundles our data analysis together while giving us some
# helpful tools. It also leaves looping and chunking to the framework instead of us.
class Processor(processor.ProcessorABC):
    def __init__(self):
        # Bins and categories for the histogram are defined here, using the
        # `hist` library (https://hist.readthedocs.io/).
        dataset_axis = hist.axis.StrCategory(
            name="dataset", label="", categories=[], growth=True
        )
        MET_axis = hist.axis.Regular(
            name="MET", label="MET [GeV]", bins=50, start=0, stop=100
        )

        # The accumulator keeps our data chunks together for histogramming.
        # It also gives us the cutflow, which can be used to keep track of data.
        self.output = processor.dict_accumulator(
            {
                "MET": hist.Hist(dataset_axis, MET_axis),
                "cutflow": processor.defaultdict_accumulator(int),
            }
        )

    def process(self, events):
        # This is where we do the actual analysis. The dataset has columns
        # similar to a TTree's; `events.fields` lists them, and
        # `events.<object>.fields` goes one level deeper.
        dataset = events.metadata["dataset"]
        MET = events.MET.pt

        # We can define new cutflow keys (here 'all events') and accumulate
        # values into them. We need += because process() runs once per chunk.
        self.output["cutflow"]["all events"] += ak.size(MET)
        self.output["cutflow"]["number of chunks"] += 1

        # Fill the histogram with the collected data. The 'MET' keyword
        # matches the axis name defined in __init__.
        self.output["MET"].fill(dataset=dataset, MET=MET)
        return self.output

    def postprocess(self, accumulator):
        pass


# Connect to the Dask cluster; REANA injects DASK_SCHEDULER_URI at runtime.
DASK_SCHEDULER_URI = os.getenv("DASK_SCHEDULER_URI", "tcp://127.0.0.1:8080")
client = Client(DASK_SCHEDULER_URI)

fileset = {
    "SingleMu": [
        "root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root"
    ]
}

executor = processor.DaskExecutor(client=client)

run = processor.Runner(
    executor=executor, schema=schemas.NanoAODSchema, savemetrics=True
)

output, metrics = run(fileset, "Events", processor_instance=Processor())

# Generate a 1D histogram from the data accumulated under the 'MET' key
# (drawn as a line by default).
output["MET"].plot1d()

# Easy way to print all cutflow dict values; for a single one, use
# print(output["cutflow"]["KEY_NAME"]).
for key, value in output["cutflow"].items():
    print(key, value)

# Save the histogram plot to a file.
plt.savefig("histogram.png")
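
The script falls back to `tcp://127.0.0.1:8080` when `DASK_SCHEDULER_URI` is not set, so it can also be exercised outside REANA against a locally started scheduler. A hedged sketch, assuming the `dask` and `distributed` packages plus the analysis dependencies are installed locally:

```console
$ # in separate terminals: start a local Dask scheduler and one worker
$ dask-scheduler --port 8080
$ dask-worker tcp://127.0.0.1:8080
$ # then run the analysis against it
$ python codes/analysis.py
```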
21 changes: 21 additions & 0 deletions reana.yaml
@@ -0,0 +1,21 @@
inputs:
  files:
    - codes/analysis.py
workflow:
  type: serial
  resources:
    dask:
      image: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
      cores: 2
      memory: "4Gi"
      single_worker_cores: 0.5
      single_worker_memory: "1Gi"
  specification:
    steps:
      - name: mystep
        environment: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
        commands:
          - python codes/analysis.py
outputs:
  files:
    - histogram.png
