
Merge pull request #36 from COBREXA/mk-distributed-proj
add notes about forwarding the project configuration
exaexa authored May 16, 2024
2 parents b2ce5d9 + f55302f commit a8180fd
Showing 10 changed files with 96 additions and 75 deletions.
31 changes: 16 additions & 15 deletions docs/src/distributed/1_functions.md
@@ -9,17 +9,17 @@ run the large parallelizable analyses on multiple CPU cores and multiple
computers connected through the network. Ultimately, the approach scales to
thousands of computing nodes in large HPC facilities.

- You may run your analyses in parallel to gain speed-ups. The usual workflow in
+ Users may run the analyses in parallel to gain speed-ups. The usual workflow in
`COBREXA.jl` is quite straightforward:

1. Import the `Distributed` package and add worker processes, e.g. using
`addprocs`.
2. Pick an analysis function that can be parallelized (such as `screen`
-    or `flux_variability_analysis`) and prepare it to work on your data.
+    or `flux_variability_analysis`) and prepare it to work on the data.
3. Pass the desired set of worker IDs to the function using `workers=` argument,
in the simplest form using e.g. `screen(..., workers=workers())`.
- 4. Worker communication will be managed automatically, and you will get results
-    "as usual", just appropriately faster.
+ 4. Worker communication will be managed automatically, and the results will be
+    computed "as usual", just appropriately faster.

Specific documentation is available about [running parallel analysis
locally](2_parallel.md) and [running distributed analysis in HPC clusters](3_slurm.md).
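For illustration, the workflow above might look roughly like the following minimal sketch (the exact keyword arguments of the analysis functions are assumptions based on the surrounding text):

```julia
using Distributed, COBREXA
import JSONFBCModels, GLPK

addprocs(4)                        # step 1: add 4 local worker processes
@everywhere using COBREXA, GLPK    # load the packages on all workers

model = load_model("e_coli_core.json")

# steps 2+3: pick a parallelizable analysis and pass it the worker IDs;
# step 4: the worker communication is orchestrated automatically
result = flux_variability_analysis(
    model,
    optimizer = GLPK.Optimizer,
    workers = workers(),
)
```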
@@ -46,17 +46,18 @@ range of use-cases that can thus be parallelized very easily:
## Mitigating parallel inefficiencies

Ideally, the speedup gained by parallel processing should be proportional to
- the amount of hardware you add as the workers. You should be aware of factors
- that reduce the parallel efficiency, which can be summarized as follows:
+ the amount of hardware one adds as the workers. To reach that, it is beneficial
+ to be aware of factors that reduce the parallel efficiency, which can be
+ summarized as follows:

- Parallelization within single runs of the linear solver is typically not
supported (and if it is, it may be inefficient for common problem sizes).
-   Normally, you want to parallelize the analyzes that comprise multiple
+   Normally, we want to parallelize the analyses that comprise multiple
independent runs of the solvers.
- Some analysis function, such as [`flux_variability_analysis`](@ref), have
-   serial parts that can not be parallelized by default. Usually, you may avoid
-   the inefficiency by precomputing the serial analysis parts without involving
-   the cluster of the workers.
+   serial parts that can not be parallelized by default. Usually, pipelines may
+   avoid the inefficiency by precomputing the serial analysis parts without
+   involving the cluster of the workers.
- Frequent worker communication may vastly reduce the efficiency of parallel
processing; typically this happens if the time required for individual
analysis steps is smaller than the network round-trip-time to the worker
@@ -68,10 +69,10 @@ that reduce the parallel efficiency, which can be summarized as follows:

!!! note "Cost of the distribution and parallelization overhead"
Before allocating extra resources into the distributed execution, always
-     check that your tasks are properly parallelizable and sufficiently large
-     to saturate your computation resources, so that the invested energy is not
+     check that the tasks are properly parallelizable and sufficiently large
+     to saturate the computation resources, so that the invested energy is not
wasted.
[Amdahl's](https://en.wikipedia.org/wiki/Amdahl's_law) and
-     [Gustafson's](https://en.wikipedia.org/wiki/Gustafson%27s_law) laws can
-     give you a better overview of the sources and consequences of the
-     parallelization inefficiencies and the costs of the resulting overhead.
+     [Gustafson's](https://en.wikipedia.org/wiki/Gustafson%27s_law) laws give a
+     better overview of the sources and consequences of the parallelization
+     inefficiencies, and the costs of the resulting overhead.
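As a small illustration of Amdahl's law, the ideal speedup with `n` workers when only a fraction `p` of the work parallelizes can be computed directly (the numbers below are made up):

```julia
# Amdahl's law: the serial fraction (1 - p) limits the attainable speedup.
amdahl_speedup(p, n) = 1 / ((1 - p) + p / n)

amdahl_speedup(0.95, 10)    # ≈ 6.9 with 10 workers
amdahl_speedup(0.95, 100)   # ≈ 16.8 -- the 5% serial part quickly dominates
```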
27 changes: 18 additions & 9 deletions docs/src/distributed/2_parallel.md
@@ -1,8 +1,8 @@

# Local parallel processing

- To run an analysis in parallel, you first need to load the `Distributed`
- package and add a few worker processes. For example, you may start 5 local
+ To run an analysis in parallel, we first need to load the `Distributed`
+ package and add a few worker processes. For example, we may start 5 local
processes (that may utilize 5 CPUs) as follows

```julia
@@ -12,10 +12,10 @@ addprocs(5)

!!! note "`Distributed.jl` installation"
`Distributed.jl` usually comes pre-installed with Julia distribution, but
-     you may still need to "enable" it by typing `] add Distributed`.
+     one may still need to "enable" it by typing `] add Distributed`.

- You may check that the workers are really there, using `workers()`. In this
- case, it should give you a vector of _worker IDs_, very likely equal to
+ To check that the workers are really there, use `workers()`. In this
+ case, it should return a vector of _worker IDs_, very likely equal to
`[2,3,4,5,6]`.

Each of the processes contains a self-sufficient image of Julia that can act
@@ -24,15 +24,24 @@ process with loaded `COBREXA.jl` and a simple solver such as GLPK may consume
around 500MB of RAM, which should be taken into account when planning the
analysis scale.

- Packages (COBREXA and your selected solver) must be loaded at all processes,
- which you can ensure using the "everywhere" macro (from `Distributed` package):
+ !!! warning "Using Julia environments with Distributed"
+     In certain conditions, the Distributed package does not properly forward
+     the project configuration to the workers, resulting in package version
+     mismatches and other problems. For pipelines that run in custom project
+     folders, use the following form of `addprocs` instead:
+     ```julia
+     addprocs(5, exeflags=`--project=$(Base.active_project())`)
+     ```
+
+ Packages (COBREXA and the selected solver) must be loaded at all processes,
+ which may be ensured using the "everywhere" macro (from `Distributed` package):
```julia
@everywhere using COBREXA, GLPK
```

- Utilizing the prepared worker processes is then straightforward: You pass the
+ Utilizing the prepared worker processes is then straightforward: We pass the
list of workers to the selected analysis function using the `workers` keyword
- argument, and the parallel processing is automatically orchestrated for you:
+ argument, and the parallel processing is orchestrated automatically:

```julia
model = load_model("e_coli_core.xml")
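The rest of this example is not shown above; a sketch of how the workers are typically passed on (the actual code in the file may differ) could be:

```julia
# a rough sketch, not the verbatim example from the file
model = load_model("e_coli_core.xml")

flux_variability_analysis(
    model,
    optimizer = GLPK.Optimizer,
    workers = workers(),   # distribute the independent solver runs
)
```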
60 changes: 35 additions & 25 deletions docs/src/distributed/3_slurm.md
@@ -15,22 +15,22 @@ relatively complex tasks:

Fortunately, the package
[`ClusterManagers.jl`](https://github.com/JuliaParallel/ClusterManagers.jl)
- does that for us. For simplicily, here we assume that your HPC is scheduled by
+ does that for us. For simplicity, here we assume that the HPC is scheduled by
[Slurm](https://slurm.schedmd.com/), but other scheduling environments are
supported in a very similar way.

## Interacting with Slurm

- Adding of the Slurm-provided is done as follows:
- - you import the `ClusterManagers` package
- - you find how many processes to spawn from the environment from `SLURM_NTASKS`
-   environment variable
- - you use the function `addprocs_slurm` to precisely connect to your allocated
+ Utilization of the Slurm-provided resources is enabled as follows:
+ - first, import the `ClusterManagers` package
+ - find how many processes to spawn from the environment, typically from
+   `SLURM_NTASKS` environment variable
+ - use the function `addprocs_slurm` to precisely connect to the allocated
computational resources

- After adding the Slurm workers, you may continue as if the workers were added
- using normal `addprocs` --- typically you load the model and (for example) run
- the `flux_variability_analysis` as if you would use the [local
+ After adding the Slurm workers, one may continue as if the workers were added
+ using normal `addprocs` --- typically, we can load the model and (for example) run
+ the `flux_variability_analysis` as if we would use the [local
workers](2_parallel.md).

The Julia script that does a parallel analysis in a Slurm cluster may look as
@@ -57,41 +57,51 @@ results = flux_variability_analysis(..., workers=workers())
[package documentation](https://github.com/JuliaParallel/ClusterManagers.jl/blob/master/README.md)
for details.
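The script itself is not shown in full above; a minimal sketch of what it may contain (assuming Slurm sets `SLURM_NTASKS`, and using the keyword arguments mentioned elsewhere on this page) is:

```julia
using Distributed, ClusterManagers, COBREXA
import JSONFBCModels, GLPK

# spawn one worker per task allocated by Slurm
available_workers = parse(Int, ENV["SLURM_NTASKS"])
addprocs_slurm(available_workers)

@everywhere using COBREXA, GLPK

model = load_model("e_coli_core.json")
results = flux_variability_analysis(model, optimizer = GLPK.Optimizer, workers = workers())
```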

- ## Wrapping your script in a Slurm batch job
+ !!! warning "Using Julia environments with Distributed"
+     Sometimes the project configuration is not forwarded to the workers
+     automatically, resulting in package version mismatches and other problems.
+     When utilizing custom project folders (by running Julia with `julia
+     --project=...`), use the following form of `addprocs_slurm` instead:
+     ```julia
+     addprocs_slurm(available_workers, exeflags=`--project=$(Base.active_project())`)
+     ```

- To be able to submit your script for later processing using the [`sbatch` Slurm
- command](https://slurm.schedmd.com/sbatch.html), you need to wrap it in a small
+ ## Wrapping a pipeline script in a Slurm batch job

+ To be able to submit a script for later processing using the [`sbatch` Slurm
+ command](https://slurm.schedmd.com/sbatch.html), we need to wrap it in a small
"batch" script that tells Slurm how many resources the process needs.

- Assuming you have a Julia computation script written down in `myJob.jl` and
- saved on your HPC cluster's access node, the corresponding Slurm batch script
+ Assuming we have a Julia computation script written down in `myJob.jl` and
+ saved on the HPC cluster's access node, the corresponding Slurm batch script
(let's call it `myJob.sbatch`) may look as follows:

```sh
#!/bin/bash -l
- #SBATCH -n 100 # the job will require 100 individual workers
+ #SBATCH -n 100 # the job will use 100 individual worker processes
#SBATCH -c 1 # each worker will sit on a single CPU
#SBATCH -t 30 # the whole job will take less than 30 minutes
- #SBATCH -J myJob # the name of the job
+ #SBATCH -J myJob # the name of the job (for own reference)

- module load lang/Julia # add Julia to the environment (this may differ on different clusters and installations)
+ module load lang/Julia # add Julia to the environment (this may differ on different clusters and installations!)

julia myJob.jl
```

To run the computation, run `sbatch myJob.sbatch` on the cluster access node.
- The job will be scheduled and eventually executed. You may watch the output of
- commands `sacct` and `squeue` in the meantime, to see the progress.
+ The job will be scheduled and eventually executed. It is possible to watch the
+ output of commands `sacct` and `squeue` in the meantime, to see the progress.

- Remember that you need to explicitly save the result of your Julia script
+ Remember that it is necessary to explicitly save the result of the Julia script
computation to files, to be able to retrieve them later. Standard outputs of
- the jobs are often mangled and discarded. If you still want to collect the
- standard output of your Julia script, you may change the last line of the batch
- script to
+ the jobs are often mangled and/or discarded. If we would still want to collect
+ the standard output of the Julia script, we might need to change the last line
+ of the batch script as follows:

```sh
julia myJob.jl > myJob.log
```

- and collect the output from `myJob.log` later. This is convenient especially if
- your script logs various computation details using `@info` and similar macros.
+ ...and collect the output from `myJob.log` later. This is convenient especially
+ if the script prints out various computation details using `@info` and similar
+ macros.
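One simple possibility for saving the results explicitly (purely an illustration; any on-disk format works) is the `Serialization` standard library:

```julia
using Serialization

# at the end of myJob.jl: write the results to shared storage
serialize("myJob-results.jls", results)

# later, e.g. on the access node, load them back for inspection
results = deserialize("myJob-results.jls")
```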
2 changes: 1 addition & 1 deletion docs/src/examples/02-flux-balance-analysis.jl
@@ -62,7 +62,7 @@ solution.objective
solution.fluxes.PFK

#md #!!! tip "Browsing the model structure"
- #md # After typing `solution.` in the julia REPL, one can press [tab] to quickly see what is in the next level of the tree. Unfortunately (due to typesystem limitations) this currently works only for the topmost level of the tree.
+ #md # After typing `solution.` in the Julia REPL, one can press [tab] to quickly see what is in the next level of the tree. Unfortunately (due to type system limitations) this currently works only for the topmost level of the tree.

# ...or make a "table" of all fluxes through all reactions:

14 changes: 7 additions & 7 deletions docs/src/examples/05-enzyme-constrained-models.jl
@@ -132,8 +132,8 @@ ecoli_core_reaction_kcats # units = 1/s
# isozymes that can catalyze a reaction. A turnover number needs to be assigned
# to each isozyme, as shown below. Additionally, some enzymes are composed of
# multiple subunits, which differ in subunit stoichiometry. This also needs to
- # be accounted for. Assuming a stoichiometry of 1 for everything seems to be
- # okay if you do not have more information.
+ # be accounted for. Assuming a stoichiometry of 1 for everything tends to work
+ # just OK if there is no better information available.

reaction_isozymes = Dict{String,Dict{String,Isozyme}}() # a mapping from reaction IDs to isozyme IDs to isozyme structs.
for rid in A.reactions(model)
@@ -151,7 +151,7 @@ for rid in A.reactions(model)
end

#md #!!! tip "Turnover number units"
- #md # Take care with the units of the turnover numbers. In literature they are usually reported in 1/s. However, flux units are typically mmol/gDW/h, suggesting that you should rescale the turnover numbers to 1/h if you want to use the conventional flux units.
+ #md # Take care with the units of the turnover numbers. In literature they are usually reported in 1/s. However, flux units are typically mmol/gDW/h, suggesting to rescale the turnover numbers to 1/h in order to use the conventional flux units.
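For example, rescaling a single turnover number from 1/s to 1/h is a plain multiplication (the value below is made up):

```julia
kcat_per_second = 65.0                    # an illustrative turnover number in 1/s
kcat_per_hour = kcat_per_second * 3600    # 234000.0, compatible with mmol/gDW/h fluxes
```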

# ## Enzyme molar masses

@@ -163,7 +163,7 @@ end
#md # <details><summary><strong>Gene product masses</strong></summary>
#md # ```
# This data is downloaded from Uniprot for E. coli K12, gene mass in kDa. To
- # obtain these data yourself, go to [Uniprot](https://www.uniprot.org/) and
+ # obtain these data manually, go to [Uniprot](https://www.uniprot.org/) and
# search using these terms: `reviewed:yes AND organism:"Escherichia coli
# (strain K12) [83333]"`.
const ecoli_core_gene_product_masses = Dict(
@@ -314,7 +314,7 @@ const ecoli_core_gene_product_masses = Dict(
ecoli_core_gene_product_masses # unit kDa = kg/mol

#md #!!! tip "Molar mass units"
- #md # Just as with the turnover numbers, take extreme care about the units of the molar masses. In literature they are usually reported in Da or kDa (g/mol). However, as noted above, flux units are typically mmol/gDW/h. Since the enzyme kinetic equation is `v = k * e` (where `k` is the turnover number) it suggests that the enzyme variable will have units of mmol/gDW. The molar masses come into play when setting the capacity limitations, e.g. usually a sum over all enzymes weighted by their molar masses as `e * M`. Thus, if your capacity limitation has units of g/gDW, then the molar masses must have units of g/mmol (i.e., kDa).
+ #md # Just as with the turnover numbers, take extreme care about the units of the molar masses. In literature they are usually reported in Da or kDa (g/mol). However, as noted above, flux units are typically mmol/gDW/h. Since the enzyme kinetic equation is `v = k * e` (where `k` is the turnover number) it suggests that the enzyme variable will have units of mmol/gDW. The molar masses come into play when setting the capacity limitations, e.g. usually a sum over all enzymes weighted by their molar masses as `e * M`. Thus, if the capacity limitation has units of g/gDW, then the molar masses must have units of g/mmol (i.e., kDa).

# ## Capacity limitation

@@ -343,8 +343,8 @@ ec_solution.objective

# One can also observe many interesting thing, e.g. the amount of gene product
# material required for the system to run. Importantly, the units of these
- # values depends on the units you used to set the turnover numbers and protein
- # molar masses.
+ # values depend on the units used to set the turnover numbers and protein molar
+ # masses.

ec_solution.gene_product_amounts

2 changes: 1 addition & 1 deletion docs/src/examples/06-mmdf.jl
@@ -49,7 +49,7 @@ model = load_model("e_coli_core.json")
# We will need ΔᵣG⁰ data for each reaction we want to include in the
# thermodynamic model. To generate this data manually, use
# [eQuilibrator](https://equilibrator.weizmann.ac.il/). To generate
- # automatically, you may use the
+ # automatically, it is possible to use the
# [eQuilibrator.jl](https://github.com/stelmo/Equilibrator.jl) package.

reaction_standard_gibbs_free_energies = Dict{String,Float64}( # units of the energies are kJ/mol
8 changes: 4 additions & 4 deletions docs/src/examples/08-community-models.jl
@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and #src
# limitations under the License. #src

- # # Community FBA models
+ # # Community FBA models (TODO)

using COBREXA

@@ -91,9 +91,9 @@ end

# ## Inspecting the interfaces
#
- # Not all interfaces are made equally! Fortunately, it is simple to create your
- # own interface, by just manually assigning reactions to semantic groups using
- # ConstraintTrees.
+ # Not all interfaces are made equally! Fortunately, it is simple to create a
+ # custom interface, by just manually assigning reactions to semantic groups
+ # using ConstraintTrees.

# Some work:
flux_balance_constraints(ecoli1, interface = :sbo).interface
6 changes: 3 additions & 3 deletions docs/src/examples/11-sampling.jl
@@ -14,7 +14,7 @@
# See the License for the specific language governing permissions and #src
# limitations under the License. #src

- # # Flux sampling
+ # # Flux sampling (TODO)

using COBREXA

@@ -28,8 +28,8 @@ import JSONFBCModels, GLPK

model = load_model("e_coli_core.json")

- # note here: this needs the optimizer to generate warmup. If you have warmup,
- # you can do without one.
+ # note here: this needs the optimizer to generate warmup. If we have warmup,
+ # we can do without the optimizer.
s = flux_sample(
model,
optimizer = GLPK.Optimizer,
5 changes: 3 additions & 2 deletions docs/src/examples/12-screening.jl
@@ -15,8 +15,9 @@
# limitations under the License. #src

# # Screening through many model variants
- # [`screen`](@ref) is a function that you can use to run many model/simulation
- # variants (ideally on an HPC) efficiently.
+ #
+ # [`screen`](@ref) is a function that runs many model/simulation variants
+ # (ideally on an HPC) efficiently.

using COBREXA

16 changes: 8 additions & 8 deletions docs/src/structure.md
@@ -31,7 +31,7 @@ With ConstraintTrees, the typical workflow in COBREXA is as follows:
- possibly, multiple types and groups of raw data can be soaked into the
constraint tree
3. Analysis functionality of COBREXA is used to solve the system described by
-    the constraitn tree, and extract useful information from the solutions.
+    the constraint tree, and extract useful information from the solutions.
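A rough sketch of this three-step workflow (using function names that appear elsewhere in this documentation; the exact keyword arguments are assumptions):

```julia
using COBREXA
import JSONFBCModels, GLPK

model = load_model("e_coli_core.json")   # 1. load the raw model data

c = flux_balance_constraints(model)      # 2. build a constraint tree from it

solution = optimized_values(             # 3. solve and extract the values
    c,
    objective = c.objective.value,
    optimizer = GLPK.Optimizer,
)
solution.fluxes   # the optimal fluxes, as a tree
```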

COBREXA mainly provides functionality to make this workflow easy to use for
many various purposes:
@@ -91,13 +91,13 @@ many various purposes:
- [`optimized_values`](@ref)
- [`constraints_variability`](@ref)

- !!! tip "Exploring and customizing the frontend analysis functions"
-     If you want to know which builder function is used to create or modify some
-     kind of constraint tree in COBREXA, use the "link to source code" feature
-     in the frontend function's individual documentation. The source code of
-     front-end functions is written to be as easily re-usable as possible -- you
-     can simply copy-paste it into your program, and immediately start building
-     your own specialized and customized front-end functions.
+ !!! tip "Exploring and customizing the front-end analysis functions"
+     To know which builder function is used to create or modify some kind of
+     constraint tree in COBREXA, use the "link to source code" feature in the
+     front-end function's individual documentation. The source code of front-end
+     functions is written to be as easily re-usable as possible -- one can
+     simply copy-paste it into the program, and immediately start building
+     specialized and customized front-end functions.

Technical description of the constraint tree functionality, together with
examples of basic functionality and many useful utility functions is available
