
Merge pull request #36 from COBREXA/mk-distributed-proj
add notes about forwarding the project configuration
exaexa authored May 16, 2024
2 parents b2ce5d9 + f55302f commit a8180fd
Showing 10 changed files with 96 additions and 75 deletions.
31 changes: 16 additions & 15 deletions docs/src/distributed/1_functions.md
@@ -9,17 +9,17 @@ run the large parallelizable analyses on multiple CPU cores and multiple
computers connected through the network. Ultimately, the approach scales to
thousands of computing nodes in large HPC facilities.

- You may run your analyses in parallel to gain speed-ups. The usual workflow in
+ Users may run the analyses in parallel to gain speed-ups. The usual workflow in
`COBREXA.jl` is quite straightforward:

1. Import the `Distributed` package and add worker processes, e.g. using
`addprocs`.
2. Pick an analysis function that can be parallelized (such as `screen`
-    or `flux_variability_analysis`) and prepare it to work on your data.
+    or `flux_variability_analysis`) and prepare it to work on the data.
3. Pass the desired set of worker IDs to the function using `workers=` argument,
in the simplest form using e.g. `screen(..., workers=workers())`.
- 4. Worker communication will be managed automatically, and you will get results
-    "as usual", just appropriately faster.
+ 4. Worker communication will be managed automatically, and the results will be
+    computed "as usual", just appropriately faster.

Specific documentation is available about [running parallel analysis
locally](2_parallel.md) and [running distributed analysis in HPC clusters](3_slurm.md).
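For illustration, the workflow above might look roughly like the following minimal sketch (the exact keyword arguments of the analysis functions are assumptions based on the surrounding text):

```julia
using Distributed, COBREXA
import JSONFBCModels, GLPK

addprocs(4)                        # step 1: add 4 local worker processes
@everywhere using COBREXA, GLPK    # load the packages on all workers

model = load_model("e_coli_core.json")

# steps 2+3: pick a parallelizable analysis and pass it the worker IDs;
# step 4: the worker communication is orchestrated automatically
result = flux_variability_analysis(
    model,
    optimizer = GLPK.Optimizer,
    workers = workers(),
)
```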
@@ -46,17 +46,18 @@ range of use-cases that can thus be parallelized very easily:
## Mitigating parallel inefficiencies

Ideally, the speedup gained by parallel processing should be proportional to
- the amount of hardware you add as the workers. You should be aware of factors
- that reduce the parallel efficiency, which can be summarized as follows:
+ the amount of hardware one adds as the workers. To reach that, it is beneficial
+ to be aware of factors that reduce the parallel efficiency, which can be
+ summarized as follows:

- Parallelization within single runs of the linear solver is typically not
supported (and if it is, it may be inefficient for common problem sizes).
-   Normally, you want to parallelize the analyzes that comprise multiple
+   Normally, we want to parallelize the analyses that comprise multiple
independent runs of the solvers.
- Some analysis function, such as [`flux_variability_analysis`](@ref), have
-   serial parts that can not be parallelized by default. Usually, you may avoid
-   the inefficiency by precomputing the serial analysis parts without involving
-   the cluster of the workers.
+   serial parts that can not be parallelized by default. Usually, pipelines may
+   avoid the inefficiency by precomputing the serial analysis parts without
+   involving the cluster of the workers.
- Frequent worker communication may vastly reduce the efficiency of parallel
processing; typically this happens if the time required for individual
analysis steps is smaller than the network round-trip-time to the worker
@@ -68,10 +69,10 @@ that reduce the parallel efficiency, which can be summarized as follows:

!!! note "Cost of the distribution and parallelization overhead"
Before allocating extra resources into the distributed execution, always
-     check that your tasks are properly parallelizable and sufficiently large
-     to saturate your computation resources, so that the invested energy is not
+     check that the tasks are properly parallelizable and sufficiently large
+     to saturate the computation resources, so that the invested energy is not
wasted.
[Amdahl's](https://en.wikipedia.org/wiki/Amdahl's_law) and
-     [Gustafson's](https://en.wikipedia.org/wiki/Gustafson%27s_law) laws can
-     give you a better overview of the sources and consequences of the
-     parallelization inefficiencies and the costs of the resulting overhead.
+     [Gustafson's](https://en.wikipedia.org/wiki/Gustafson%27s_law) laws give a
+     better overview of the sources and consequences of the parallelization
+     inefficiencies, and the costs of the resulting overhead.
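As a small illustration of Amdahl's law, the ideal speedup with `n` workers when only a fraction `p` of the work parallelizes can be computed directly (the numbers below are made up):

```julia
# Amdahl's law: the serial fraction (1 - p) limits the attainable speedup.
amdahl_speedup(p, n) = 1 / ((1 - p) + p / n)

amdahl_speedup(0.95, 10)    # ≈ 6.9 with 10 workers
amdahl_speedup(0.95, 100)   # ≈ 16.8 -- the 5% serial part quickly dominates
```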
27 changes: 18 additions & 9 deletions docs/src/distributed/2_parallel.md
@@ -1,8 +1,8 @@

# Local parallel processing

- To run an analysis in parallel, you first need to load the `Distributed`
- package and add a few worker processes. For example, you may start 5 local
+ To run an analysis in parallel, we first need to load the `Distributed`
+ package and add a few worker processes. For example, we may start 5 local
processes (that may utilize 5 CPUs) as follows

```julia
@@ -12,10 +12,10 @@ addprocs(5)

!!! note "`Distributed.jl` installation"
`Distributed.jl` usually comes pre-installed with Julia distribution, but
-     you may still need to "enable" it by typing `] add Distributed`.
+     one may still need to "enable" it by typing `] add Distributed`.

- You may check that the workers are really there, using `workers()`. In this
- case, it should give you a vector of _worker IDs_, very likely equal to
+ To check that the workers are really there, use `workers()`. In this
+ case, it should return a vector of _worker IDs_, very likely equal to
`[2,3,4,5,6]`.

Each of the processes contains a self-sufficient image of Julia that can act
@@ -24,15 +24,24 @@ process with loaded `COBREXA.jl` and a simple solver such as GLPK may consume
around 500MB of RAM, which should be taken into account when planning the
analysis scale.

- Packages (COBREXA and your selected solver) must be loaded at all processes,
- which you can ensure using the "everywhere" macro (from `Distributed` package):
+ !!! warning "Using Julia environments with Distributed"
+     In certain conditions, the Distributed package does not properly forward
+     the project configuration to the workers, resulting in package version
+     mismatches and other problems. For pipelines that run in custom project
+     folders, use the following form of `addprocs` instead:
+     ```julia
+     addprocs(5, exeflags=`--project=$(Base.active_project())`)
+     ```
+
+ Packages (COBREXA and the selected solver) must be loaded at all processes,
+ which may be ensured using the "everywhere" macro (from `Distributed` package):
```julia
@everywhere using COBREXA, GLPK
```

- Utilizing the prepared worker processes is then straightforward: You pass the
+ Utilizing the prepared worker processes is then straightforward: We pass the
list of workers to the selected analysis function using the `workers` keyword
- argument, and the parallel processing is automatically orchestrated for you:
+ argument, and the parallel processing is orchestrated automatically:

```julia
model = load_model("e_coli_core.xml")
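The rest of this example is not shown above; a sketch of how the workers are typically passed on (the actual code in the file may differ) could be:

```julia
# a rough sketch, not the verbatim example from the file
model = load_model("e_coli_core.xml")

flux_variability_analysis(
    model,
    optimizer = GLPK.Optimizer,
    workers = workers(),   # distribute the independent solver runs
)
```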
60 changes: 35 additions & 25 deletions docs/src/distributed/3_slurm.md
@@ -15,22 +15,22 @@ relatively complex tasks:

Fortunately, the package
[`ClusterManagers.jl`](https://github.com/JuliaParallel/ClusterManagers.jl)
- does that for us. For simplicily, here we assume that your HPC is scheduled by
+ does that for us. For simplicity, here we assume that the HPC is scheduled by
[Slurm](https://slurm.schedmd.com/), but other scheduling environments are
supported in a very similar way.

## Interacting with Slurm

- Adding of the Slurm-provided is done as follows:
- - you import the `ClusterManagers` package
- - you find how many processes to spawn from the environment from `SLURM_NTASKS`
-   environment variable
- - you use the function `addprocs_slurm` to precisely connect to your allocated
+ Utilization of the Slurm-provided resources is enabled as follows:
+ - first, import the `ClusterManagers` package
+ - find how many processes to spawn from the environment, typically from
+   `SLURM_NTASKS` environment variable
+ - use the function `addprocs_slurm` to precisely connect to the allocated
computational resources

- After adding the Slurm workers, you may continue as if the workers were added
- using normal `addprocs` --- typically you load the model and (for example) run
- the `flux_variability_analysis` as if you would use the [local
+ After adding the Slurm workers, one may continue as if the workers were added
+ using normal `addprocs` --- typically, we can load the model and (for example) run
+ the `flux_variability_analysis` as if we would use the [local
workers](2_parallel.md).

The Julia script that does a parallel analysis in a Slurm cluster may look as
@@ -57,41 +57,51 @@ results = flux_variability_analysis(..., workers=workers())
[package documentation](https://github.com/JuliaParallel/ClusterManagers.jl/blob/master/README.md)
for details.
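The script itself is not shown in full above; a minimal sketch of what it may contain (assuming Slurm sets `SLURM_NTASKS`, and using the keyword arguments mentioned elsewhere on this page) is:

```julia
using Distributed, ClusterManagers, COBREXA
import JSONFBCModels, GLPK

# spawn one worker per task allocated by Slurm
available_workers = parse(Int, ENV["SLURM_NTASKS"])
addprocs_slurm(available_workers)

@everywhere using COBREXA, GLPK

model = load_model("e_coli_core.json")
results = flux_variability_analysis(model, optimizer = GLPK.Optimizer, workers = workers())
```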

- ## Wrapping your script in a Slurm batch job
+ !!! warning "Using Julia environments with Distributed"
+     Sometimes the project configuration is not forwarded to the workers
+     automatically, resulting in package version mismatches and other problems.
+     When utilizing custom project folders (by running Julia with `julia
+     --project=...`), use the following form of `addprocs_slurm` instead:
+     ```julia
+     addprocs_slurm(available_workers, exeflags=`--project=$(Base.active_project())`)
+     ```

- To be able to submit your script for later processing using the [`sbatch` Slurm
- command](https://slurm.schedmd.com/sbatch.html), you need to wrap it in a small
+ ## Wrapping a pipeline script in a Slurm batch job

+ To be able to submit a script for later processing using the [`sbatch` Slurm
+ command](https://slurm.schedmd.com/sbatch.html), we need to wrap it in a small
"batch" script that tells Slurm how many resources the process needs.

- Assuming you have a Julia computation script written down in `myJob.jl` and
- saved on your HPC cluster's access node, the corresponding Slurm batch script
+ Assuming we have a Julia computation script written down in `myJob.jl` and
+ saved on the HPC cluster's access node, the corresponding Slurm batch script
(let's call it `myJob.sbatch`) may look as follows:

```sh
#!/bin/bash -l
- #SBATCH -n 100 # the job will require 100 individual workers
+ #SBATCH -n 100 # the job will use 100 individual worker processes
#SBATCH -c 1 # each worker will sit on a single CPU
#SBATCH -t 30 # the whole job will take less than 30 minutes
- #SBATCH -J myJob # the name of the job
+ #SBATCH -J myJob # the name of the job (for own reference)

- module load lang/Julia # add Julia to the environment (this may differ on different clusters and installations)
+ module load lang/Julia # add Julia to the environment (this may differ on different clusters and installations!)

julia myJob.jl
```

To run the computation, run `sbatch myJob.sbatch` on the cluster access node.
- The job will be scheduled and eventually executed. You may watch the output of
- commands `sacct` and `squeue` in the meantime, to see the progress.
+ The job will be scheduled and eventually executed. It is possible to watch the
+ output of commands `sacct` and `squeue` in the meantime, to see the progress.

- Remember that you need to explicitly save the result of your Julia script
+ Remember that it is necessary to explicitly save the result of the Julia script
computation to files, to be able to retrieve them later. Standard outputs of
- the jobs are often mangled and discarded. If you still want to collect the
- standard output of your Julia script, you may change the last line of the batch
- script to
+ the jobs are often mangled and/or discarded. If we would still want to collect
+ the standard output of the Julia script, we might need to change the last line
+ of the batch script as follows:

```sh
julia myJob.jl > myJob.log
```

- and collect the output from `myJob.log` later. This is convenient especially if
- your script logs various computation details using `@info` and similar macros.
+ ...and collect the output from `myJob.log` later. This is convenient especially
+ if the script prints out various computation details using `@info` and similar
+ macros.
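One simple possibility for saving the results explicitly (purely an illustration; any on-disk format works) is the `Serialization` standard library:

```julia
using Serialization

# at the end of myJob.jl: write the results to shared storage
serialize("myJob-results.jls", results)

# later, e.g. on the access node, load them back for inspection
results = deserialize("myJob-results.jls")
```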
2 changes: 1 addition & 1 deletion docs/src/examples/02-flux-balance-analysis.jl
@@ -62,7 +62,7 @@ solution.objective
solution.fluxes.PFK

#md #!!! tip "Browsing the model structure"
- #md # After typing `solution.` in the julia REPL, one can press [tab] to quickly see what is in the next level of the tree. Unfortunately (due to typesystem limitations) this currently works only for the topmost level of the tree.
+ #md # After typing `solution.` in the Julia REPL, one can press [tab] to quickly see what is in the next level of the tree. Unfortunately (due to type system limitations) this currently works only for the topmost level of the tree.

# ...or make a "table" of all fluxes through all reactions:

14 changes: 7 additions & 7 deletions docs/src/examples/05-enzyme-constrained-models.jl
@@ -132,8 +132,8 @@ ecoli_core_reaction_kcats # units = 1/s
# isozymes that can catalyze a reaction. A turnover number needs to be assigned
# to each isozyme, as shown below. Additionally, some enzymes are composed of
# multiple subunits, which differ in subunit stoichiometry. This also needs to
- # be accounted for. Assuming a stoichiometry of 1 for everything seems to be
- # okay if you do not have more information.
+ # be accounted for. Assuming a stoichiometry of 1 for everything tends to work
+ # just OK if there is no better information available.

reaction_isozymes = Dict{String,Dict{String,Isozyme}}() # a mapping from reaction IDs to isozyme IDs to isozyme structs.
for rid in A.reactions(model)
@@ -151,7 +151,7 @@ for rid in A.reactions(model)
end

#md #!!! tip "Turnover number units"
- #md # Take care with the units of the turnover numbers. In literature they are usually reported in 1/s. However, flux units are typically mmol/gDW/h, suggesting that you should rescale the turnover numbers to 1/h if you want to use the conventional flux units.
+ #md # Take care with the units of the turnover numbers. In literature they are usually reported in 1/s. However, flux units are typically mmol/gDW/h, suggesting to rescale the turnover numbers to 1/h in order to use the conventional flux units.
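For example, rescaling a single turnover number from 1/s to 1/h is a plain multiplication (the value below is made up):

```julia
kcat_per_second = 65.0                    # an illustrative turnover number in 1/s
kcat_per_hour = kcat_per_second * 3600    # 234000.0, compatible with mmol/gDW/h fluxes
```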

# ## Enzyme molar masses

@@ -163,7 +163,7 @@ end
#md # <details><summary><strong>Gene product masses</strong></summary>
#md # ```
# This data is downloaded from Uniprot for E. coli K12, gene mass in kDa. To
- # obtain these data yourself, go to [Uniprot](https://www.uniprot.org/) and
+ # obtain these data manually, go to [Uniprot](https://www.uniprot.org/) and
# search using these terms: `reviewed:yes AND organism:"Escherichia coli
# (strain K12) [83333]"`.
const ecoli_core_gene_product_masses = Dict(
@@ -314,7 +314,7 @@ const ecoli_core_gene_product_masses = Dict(
ecoli_core_gene_product_masses # unit kDa = kg/mol

#md #!!! tip "Molar mass units"
- #md # Just as with the turnover numbers, take extreme care about the units of the molar masses. In literature they are usually reported in Da or kDa (g/mol). However, as noted above, flux units are typically mmol/gDW/h. Since the enzyme kinetic equation is `v = k * e` (where `k` is the turnover number) it suggests that the enzyme variable will have units of mmol/gDW. The molar masses come into play when setting the capacity limitations, e.g. usually a sum over all enzymes weighted by their molar masses as `e * M`. Thus, if your capacity limitation has units of g/gDW, then the molar masses must have units of g/mmol (i.e., kDa).
+ #md # Just as with the turnover numbers, take extreme care about the units of the molar masses. In literature they are usually reported in Da or kDa (g/mol). However, as noted above, flux units are typically mmol/gDW/h. Since the enzyme kinetic equation is `v = k * e` (where `k` is the turnover number) it suggests that the enzyme variable will have units of mmol/gDW. The molar masses come into play when setting the capacity limitations, e.g. usually a sum over all enzymes weighted by their molar masses as `e * M`. Thus, if the capacity limitation has units of g/gDW, then the molar masses must have units of g/mmol (i.e., kDa).

# ## Capacity limitation

@@ -343,8 +343,8 @@ ec_solution.objective

# One can also observe many interesting thing, e.g. the amount of gene product
# material required for the system to run. Importantly, the units of these
- # values depends on the units you used to set the turnover numbers and protein
- # molar masses.
+ # values depend on the units used to set the turnover numbers and protein molar
+ # masses.

ec_solution.gene_product_amounts

2 changes: 1 addition & 1 deletion docs/src/examples/06-mmdf.jl
@@ -49,7 +49,7 @@ model = load_model("e_coli_core.json")
# We will need ΔᵣG⁰ data for each reaction we want to include in the
# thermodynamic model. To generate this data manually, use
# [eQuilibrator](https://equilibrator.weizmann.ac.il/). To generate
- # automatically, you may use the
+ # automatically, it is possible to use the
# [eQuilibrator.jl](https://github.com/stelmo/Equilibrator.jl) package.

reaction_standard_gibbs_free_energies = Dict{String,Float64}( # units of the energies are kJ/mol
8 changes: 4 additions & 4 deletions docs/src/examples/08-community-models.jl
@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and #src
# limitations under the License. #src

- # # Community FBA models
+ # # Community FBA models (TODO)

using COBREXA

@@ -91,9 +91,9 @@ end

# ## Inspecting the interfaces
#
- # Not all interfaces are made equally! Fortunately, it is simple to create your
- # own interface, by just manually assigning reactions to semantic groups using
- # ConstraintTrees.
+ # Not all interfaces are made equally! Fortunately, it is simple to create a
+ # custom interface, by just manually assigning reactions to semantic groups
+ # using ConstraintTrees.

# Some work:
flux_balance_constraints(ecoli1, interface = :sbo).interface
6 changes: 3 additions & 3 deletions docs/src/examples/11-sampling.jl
@@ -14,7 +14,7 @@
# See the License for the specific language governing permissions and #src
# limitations under the License. #src

- # # Flux sampling
+ # # Flux sampling (TODO)

using COBREXA

@@ -28,8 +28,8 @@ import JSONFBCModels, GLPK

model = load_model("e_coli_core.json")

- # note here: this needs the optimizer to generate warmup. If you have warmup,
- # you can do without one.
+ # note here: this needs the optimizer to generate warmup. If we have warmup,
+ # we can do without the optimizer.
s = flux_sample(
model,
optimizer = GLPK.Optimizer,
5 changes: 3 additions & 2 deletions docs/src/examples/12-screening.jl
@@ -15,8 +15,9 @@
# limitations under the License. #src

# # Screening through many model variants
- # [`screen`](@ref) is a function that you can use to run many model/simulation
- # variants (ideally on an HPC) efficiently.
+ #
+ # [`screen`](@ref) is a function that runs many model/simulation variants
+ # (ideally on an HPC) efficiently.

using COBREXA

16 changes: 8 additions & 8 deletions docs/src/structure.md
@@ -31,7 +31,7 @@ With ConstraintTrees, the typical workflow in COBREXA is as follows:
- possibly, multiple types and groups of raw data can be soaked into the
constraint tree
3. Analysis functionality of COBREXA is used to solve the system described by
-    the constraitn tree, and extract useful information from the solutions.
+    the constraint tree, and extract useful information from the solutions.
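A rough sketch of this three-step workflow (using function names that appear elsewhere in this documentation; the exact keyword arguments are assumptions):

```julia
using COBREXA
import JSONFBCModels, GLPK

model = load_model("e_coli_core.json")   # 1. load the raw model data

c = flux_balance_constraints(model)      # 2. build a constraint tree from it

solution = optimized_values(             # 3. solve and extract the values
    c,
    objective = c.objective.value,
    optimizer = GLPK.Optimizer,
)
solution.fluxes   # the optimal fluxes, as a tree
```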

COBREXA mainly provides functionality to make this workflow easy to use for
many various purposes:
@@ -91,13 +91,13 @@ many various purposes:
- [`optimized_values`](@ref)
- [`constraints_variability`](@ref)

- !!! tip "Exploring and customizing the frontend analysis functions"
-     If you want to know which builder function is used to create or modify some
-     kind of constraint tree in COBREXA, use the "link to source code" feature
-     in the frontend function's individual documentation. The source code of
-     front-end functions is written to be as easily re-usable as possible -- you
-     can simply copy-paste it into your program, and immediately start building
-     your own specialized and customized front-end functions.
+ !!! tip "Exploring and customizing the front-end analysis functions"
+     To know which builder function is used to create or modify some kind of
+     constraint tree in COBREXA, use the "link to source code" feature in the
+     front-end function's individual documentation. The source code of front-end
+     functions is written to be as easily re-usable as possible -- one can
+     simply copy-paste it into the program, and immediately start building
+     specialized and customized front-end functions.

Technical description of the constraint tree functionality, together with
examples of basic functionality and many useful utility functions is available
