feat(analysis): add initial version (#1) #1
reana.yaml
Outdated
version: 0.9.3
inputs:
  files:
    - codes/analysis.py
The directory should be named just `code`. But considering that Dask itself also provides the workflow, so to speak, and that we don't have any other input files or data files, we could simply host the sole `analysis.py` file in the root directory.
reana.yaml
Outdated
@@ -0,0 +1,18 @@
version: 0.9.3
You can remove the `version` clause; it does not really serve any purpose, and some people were confused by its meaning. (We might remove it everywhere later.)
reana.yaml
Outdated
image: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
specification:
  steps:
    - name: mystep
You can name the step "process".
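The three suggestions so far (no `codes/` directory, no `version` clause, a step named "process") can be checked mechanically. The following is only a minimal sketch: `review_findings` is a hypothetical helper, not part of REANA, and the two dictionaries are hand-written stand-ins for the parsed `reana.yaml` before and after the review, assuming the usual `workflow` / `specification` / `steps` nesting.

```python
def review_findings(spec):
    """Return the review comments that still apply to a parsed reana.yaml dict.

    Hypothetical helper for illustration only; it is not a REANA API.
    """
    findings = []
    # Comment: the version clause can be removed.
    if "version" in spec:
        findings.append("remove the top-level 'version' clause")
    # Comment: host analysis.py in the root directory instead of 'codes/'.
    for path in spec.get("inputs", {}).get("files", []):
        if path.split("/")[0] == "codes":
            findings.append(f"move '{path}' to the root directory")
    # Comment: name the step "process".
    steps = spec.get("workflow", {}).get("specification", {}).get("steps", [])
    for step in steps:
        if step.get("name") != "process":
            findings.append(f"rename step '{step.get('name')}' to 'process'")
    return findings


IMAGE = "docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049"

# Stand-in for the specification as first submitted (values taken from the hunks).
original_spec = {
    "version": "0.9.3",
    "inputs": {"files": ["codes/analysis.py"]},
    "workflow": {
        "type": "serial",
        "specification": {
            "steps": [
                {
                    "name": "mystep",
                    "environment": IMAGE,
                    "commands": ["python codes/analysis.py"],
                }
            ]
        },
    },
    "outputs": {"files": ["histogram.png"]},
}

# Stand-in for the specification after applying the suggestions.
revised_spec = {
    "inputs": {"files": ["analysis.py"]},
    "workflow": {
        "type": "serial",
        "specification": {
            "steps": [
                {
                    "name": "process",
                    "environment": IMAGE,
                    "commands": ["python analysis.py"],
                }
            ]
        },
    },
    "outputs": {"files": ["histogram.png"]},
}

if __name__ == "__main__":
    for finding in review_findings(original_spec):
        print("TODO:", finding)
```

Running the helper on `revised_spec` returns an empty list, which matches the shape of the specification shown in the suggested README patch.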
  environment: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
  commands:
    - python codes/analysis.py
outputs:
Please also introduce some behavioural tests that check for the presence of the histogram output file and for log messages such as:
all events 53446198
number of chunks 534
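As a rough illustration of what such tests need to assert, here is a minimal Python sketch. It is not the REANA test format (the repository lists feature files under `tests/`); `check_outputs` is a hypothetical helper that takes a downloaded workspace directory and the job log text, and looks for the `histogram.png` output and the two log messages quoted above.

```python
from pathlib import Path


def check_outputs(
    workspace,
    log_text,
    expected_files=("histogram.png",),
    expected_messages=("all events 53446198", "number of chunks 534"),
):
    """Return a list of failures; an empty list means all checks passed.

    Hypothetical helper for illustration; the file name and messages are the
    ones mentioned in this review thread.
    """
    failures = []
    # Every expected output file must exist in the workspace directory.
    for name in expected_files:
        if not (Path(workspace) / name).is_file():
            failures.append(f"missing output file: {name}")
    # Every expected message must appear somewhere in the log text.
    for message in expected_messages:
        if message not in log_text:
            failures.append(f"missing log message: {message}")
    return failures
```

A call such as `check_outputs("workspace", logs)` would be made after downloading the results, with `logs` holding the text of the job logs; the directory name here is an assumption.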
README.md
Outdated
expected outputs:

```yaml
version: 0.9.3
After you modify `reana.yaml`, please update this section accordingly.
@@ -4,3 +4,114 @@
[![image](https://img.shields.io/badge/discourse-forum-blue.svg)](https://forum.reana.io)
[![image](https://img.shields.io/github/license/reanahub/reana-demo-dask-coffea.svg)](https://github.com/reanahub/reana-demo-dask-coffea/blob/master/LICENSE)
[![image](https://www.reana.io/static/img/badges/launch-on-reana-at-cern.svg)](https://reana.cern.ch/launch?url=https%3A%2F%2Fgithub.com%2Freanahub%2Freana-demo-dask-coffea&specification=reana.yaml&name=reana-demo-dask-coffea)
Here are suggestions for the README file:
diff --git a/README.md b/README.md
index 9ac7e7a..4b329fc 100644
--- a/README.md
+++ b/README.md
@@ -7,57 +7,73 @@
## About
-## Analysis structure
+This [REANA](http://www.reana.io/) reproducible analysis example provides a
+simple example how to run Dask workflows using Coffea. The example was adapted
+from
+[Coffea Casa tutorials](https://github.com/CoffeaTeam/coffea-casa-tutorials/blob/master/examples/example1.ipynb)
+repository.
-Making a research data analysis reproducible basically means to provide "runnable
-recipes" addressing (1) where is the input data, (2) what software was used to analyse
-the data, (3) which computing environments were used to run the software and (4) which
-computational workflow steps were taken to run the analysis. This will permit to
-instantiate the analysis on the computational cloud and run the analysis to obtain (5)
-output results.
+## Analysis structure
+Making a research data analysis reproducible basically means to provide
+"runnable recipes" addressing (1) where is the input data, (2) what software was
+used to analyse the data, (3) which computing environments were used to run the
+software and (4) which computational workflow steps were taken to run the
+analysis. This will permit to instantiate the analysis on the computational
+cloud and run the analysis to obtain (5) output results.
### 1. Input data
-In this example, we are using the file whose url is given below which is hosted at eospublic.
-- `root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root`
+In this example, we are using a single CMS open data set file
+`Run2012B_SingleMu.root` which is hosted at EOSPUBLIC XRootD server.
### 2. Analysis code
-The analysis code consists of a single python file called `analysis.py` which connects to a dask cluster and then conducts the analysis.
+The analysis code consists of a single Python file called `analysis.py` which
+connects to a Dask cluster and then conducts the analysis and prints MET
+histogram.
### 3. Compute environment
-In order to be able to rerun the analysis even several years in the future, we need to "encapsulate the current compute environment". We shall achieve this by preparing a [Docker](https://www.docker.com/) container image for our analysis steps.
+In order to be able to rerun the analysis even several years in the future, we
+need to "encapsulate the current compute environment". We shall achieve this by
+preparing a [Docker](https://www.docker.com/) container image for our analysis
+steps.
-This example makes use of the coffea platform and the specific image for the platform we are using in this example can be found [here](https://hub.docker.com/r/coffeateam/coffea-dask-cc7).
+This example makes use of the Coffea platform image with the specific version
+0.7.22. The container image can be found on Docker Hub at
+[docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049](https://hub.docker.com/r/coffeateam/coffea-dask-cc7).
### 4. Analysis workflow
-The analysis workflow is simple and consists of a single step. We simply run the script `python analysis.py` to run the example. However, realize that the actual analysis is relatively heavy and parallelized by dask behind the scenes. As a user, the task graphs and the parallel steps are hidden to us.
+The analysis workflow is simple and consists of a single command. We simply run
+the script `python analysis.py` to run the example. The command will then use
+the Dask behind the scenes to possibly launch parallel computations. As a user,
+we do not have to specify the computational graph ourselves; the Dask library
+will take care of dispatching computations.
### 5. Output results
-The example produces the given histogram as an output.
-![](https://github.com/user-attachments/assets/e52c2391-626d-4556-90ca-75248516cc95)
+The example produces the following MET event-level histogram as an output.
+![](https://github.com/user-attachments/assets/e52c2391-626d-4556-90ca-75248516cc95)
## Running the example on REANA cloud
There are two ways to execute this analysis example on REANA.
-If you would like to simply launch this analysis example on the REANA instance at CERN
-and inspect its results using the web interface, please click on the following
-badge:
+If you would like to simply launch this analysis example on the REANA instance
+at CERN and inspect its results using the web interface, please click on the
+following badge:
-[![Launch with Serial on REANA@CERN badge](https://www.reana.io/static/img/badges/launch-with-serial-on-reana-at-cern.svg)](https://reana.cern.ch/launch?url=https://github.com/reanahub/reana-demo-dask-coffea&specification=reana.yaml&name=reana-demo-dask-coffea)
+[![Launch on REANA@CERN badge](https://www.reana.io/static/img/badges/launch-on-reana-at-cern.svg)](https://reana.cern.ch/launch?url=https://github.com/reanahub/reana-demo-dask-coffea&specification=reana.yaml&name=reana-demo-dask-coffea)
-If you would like a step-by-step guide on how to use the REANA command-line client to
-launch this analysis example, please read on.
+If you would like a step-by-step guide on how to use the REANA command-line
+client to launch this analysis example, please read on.
-We start by creating a [reana.yaml](reana.yaml) file describing the above analysis
-structure with its inputs, code, runtime environment, computational workflow steps and
-expected outputs:
+We start by creating a [reana.yaml](reana.yaml) file describing the above
+analysis structure with its inputs, code, runtime environment, computational
+workflow steps and expected outputs:
```yaml
inputs:
@@ -73,7 +89,7 @@ workflow:
- name: process
environment: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
commands:
- - python analysis.py
+ - python analysis.py
outputs:
files:
- histogram.png
@@ -83,11 +99,11 @@ tests:
- tests/workspace-files.feature
-In this example we are using a simple Serial workflow engine to represent our sequential
-computational workflow steps.
+In this example we are using a simple Serial workflow engine to launch our
+Dask-based computations.
-We can now install the REANA command-line client, run the analysis and download the
-resulting plots:
+We can now install the REANA command-line client, run the analysis and download
+the resulting plots:
$ # create new virtual environment
@@ -113,5 +129,6 @@ $ # download output results
$ reana-client download
-Please see the REANA-Client documentation for
-more detailed explanation of typical `reana-client` usage scenarios.
\ No newline at end of file
+Please see the REANA-Client
+documentation for more detailed explanation of typical `reana-client` usage
+scenarios.
This pull request implements the initial version of the Dask demo example.