diff --git a/annotations/README.md b/annotations/README.md index 0b676dbc3..da9c7d44c 100644 --- a/annotations/README.md +++ b/annotations/README.md @@ -1,5 +1,5 @@ # Parallelizability Study & Annotation Language -Quick Jump: [Parallelizability](#main-parallelizability-classes) | [study](#parallelizability-study-of-commands-in-gnu--posix) | [example 1](#a-simple-example-chmod) | [example 1](#another-example-cut) | [howto](#how-to-annotate-a-command) | [issues](#Issues) +Quick Jump: [Parallelizability](#main-parallelizability-classes) | [study](#parallelizability-study-of-commands-in-gnu--posix) | [Annotation Examples](#annotation-examples) | [howto](#how-to-annotate-a-command) | [adding custom aggregators](#adding-custom-aggregators) | [issues](#issues) PaSh includes (i) a parallelizability study of commands in POSIX and GNU Coreutils, and @@ -47,94 +47,19 @@ Annotations can be thought of as defining a bidirectional correspondence between Since command behaviors (and correspondence) can change based on their arguments, annotations contain a sequence of predicates. Each predicate is accompanied by information that instantiates the correspondence between a command and a dataflow node. -## A Simple Example: `chmod` +## Annotation Examples -As a first example, below we present the annotations for `chmod`. +* `chmod`: As a first example, [see the annotation for `chmod`](./chmod.json). + The annotation for `chmod` is very simple, since it only needs to establish that `chmod` is side-effectful and therefore cannot be translated to a dataflow node. -```json -{ - "command": "chmod", - "cases": [ - { - "predicate": "default", - "class": "side-effectful" - } - ] -} -``` - -The annotation for `chmod` is very simple, since it only needs to establish that `chmod` is side-effectful and therefore cannot be translated to a dataflow node. - -## Another Example: `cut` - -As another example, below we present the annotations for `cut`. 
- -```json -{ - "command": "cut", - "cases": [ - { - "predicate": { - "operator": "or", - "operands": [ - { - "operator": "val_opt_eq", - "operands": [ - "-d", - "\n" - ] - }, - { - "operator": "exists", - "operands": [ - "-z" - ] - } - ] - }, - "class": "pure", - "inputs": [ - "args[:]" - ], - "outputs": [ - "stdout" - ] - }, - { - "predicate": "default", - "class": "stateless", - "inputs": [ - "args[:]" - ], - "outputs": [ - "stdout" - ] - } - ], - "options": [ - "stdin-hyphen", - "empty-args-stdin" - ], - "short-long": [ - { - "short": "-d", - "long": "--delimiter" - }, - { - "short": "-z", - "long": "--zero-terminated" - } - ] -} -``` - -The annotation for `cut` has two cases, each of which consists of a predicate on its arguments, and then an assignment of its parallelizability class, inputs, and outputs. -The first predicate indicates that `cut` is "pure" -- _i.e._, not parallelizable but representable as a dataflow node -- if the value accompanying the `-d` option is `\n` or if it was used with the `-z` flag. -In both of these cases, newlines do not represent data item boundaries, but are rather used internally by the command, making it unsafe to parallelize by splitting on line boundaries. -In all other cases (see the "default" case) the command is stateless. -Inputs are always assigned to the non-option arguments and the output is always stdout. -The option "stdin-hyphen" indicates that a non-option argument that is just a dash `-` represents the stdin, and the option “empty-args-stdin” indicates that if non-option arguments are empty, then the command reads from its stdin. -The list identified by "short-long" contains a correspondence of short and long argument names for this command. +* `cut`: As another example, [see the annotation for `cut`](./cut.json). + It has two cases, each of which consists of a predicate on its arguments, and then an assignment of its parallelizability class, inputs, and outputs. 
+ The first predicate indicates that `cut` is "pure" -- _i.e._, not parallelizable but representable as a dataflow node -- if the value accompanying the `-d` option is `\n` or if it was used with the `-z` flag. + In both of these cases, newlines do not represent data item boundaries, but are rather used internally by the command, making it unsafe to parallelize by splitting on line boundaries. + In all other cases (see the "default" case) the command is stateless. + Inputs are always assigned to the non-option arguments and the output is always stdout. + The option "stdin-hyphen" indicates that a non-option argument that is just a dash `-` represents the stdin, and the option “empty-args-stdin” indicates that if non-option arguments are empty, then the command reads from its stdin. + The list identified by "short-long" contains a correspondence of short and long argument names for this command. ## How to Annotate a Command @@ -178,7 +103,7 @@ For more details, here is an early version of the annotation language: [//]: # (TODO: 1. update language spec; 2. put all annotations in a directory) -## Mini-tutorial: Adding Custom Aggregators +## Adding Custom Aggregators For this tutorial, let's assume you want to parallelize [a simple `ann-agg.sh` script](https://github.com/binpash/pash/blob/main/evaluation/tests/ann-agg.sh). @@ -186,21 +111,12 @@ Let's also assume there are no annotations or aggregators for the commands `test Note that normally these two commands would be annotated as `stateless`, as their aggregator is simply the con`cat`enation function; however, we will now annotate them as `parallelizable_pure` and provide "custom" aggregation commands that simply concatenate their input streams. -*Step 1: Implement aggregators and their annotations*: - -An aggregator is usually either binary or _n_-ary: - it takes as input two or _n_ file names (or paths) and outputs results to the standard out. 
-An aggregator may also take additional flags---for example, flags that configure its operation or flags that were provided to the original command. - -We will implement `test_one`'s aggregator as [a shell script](https://github.com/binpash/pash/blob/main/runtime/agg/opt/concat.sh) that internally uses the Unix `cat` command to concatenate any number of input streams. - -We will implement `test_two`'s aggregator as [a Python script](https://github.com/binpash/pash/blob/main/runtime/agg/py/cat.py) that concatenates any number of inputs streams. +*Step 1: Implement aggregators and their annotations.* +An aggregator is usually either binary or _n_-ary: it takes as input two or _n_ file names (or paths) and writes its results to standard output. An aggregator may also take additional flags---for example, flags that configure its operation or flags that were provided to the original command. We implement `test_one`'s aggregator as [a shell script](https://github.com/binpash/pash/blob/main/runtime/agg/opt/concat.sh) that internally uses the Unix `cat` command to concatenate any number of input streams. We implement `test_two`'s aggregator as [a Python script](https://github.com/binpash/pash/blob/main/runtime/agg/py/cat.py) that concatenates any number of input streams. -For PaSh to be able to hook these aggregators correctly, _i.e._, so that it can instantiate them as command invocations, we also need to add their annotations in [annotations/custom_aggregators](https://github.com/binpash/pash/tree/main/annotations/custom_aggregators). -Below are the two annotation files named [`annotations/custom_aggregators/cat.py.json`](./custom_aggregators/cat.py.json) and [`annotations/custom_aggregators/concat.json`](./custom_aggregators/concat.json). (FIXME: relative path? 
**Until this is fixed, prefix aggregator names with `pagg-` to avoid name clashes!**) -The most important information in these files is (i) the aggregation command's `name`, and (ii) its treatment of inputs (both taking `["args[:]"]`), and outputs (both outputing to `["stdout"]`). +For PaSh to be able to hook these aggregators correctly, _i.e._, so that it can instantiate them as command invocations, we also need to add their annotations in [annotations/custom_aggregators](https://github.com/binpash/pash/tree/main/annotations/custom_aggregators). Below are the two annotation files named [`annotations/custom_aggregators/cat.py.json`](./custom_aggregators/cat.py.json) and [`annotations/custom_aggregators/concat.json`](./custom_aggregators/concat.json). (FIXME: relative path? **Until this is fixed, prefix aggregator names with `pagg-` to avoid name clashes!**) The most important information in these files is (i) the aggregation command's `name`, and (ii) its treatment of inputs (both taking `["args[:]"]`), and outputs (both outputting to `["stdout"]`). -*Step 2: Point commands to their custom aggregators*: +*Step 2: Point commands to their custom aggregators.* Add two new annotation files in `$PASH_TOP/annotations` with names `test_one.json` and `test_two.json`, so that they point to the right aggregator commands. Apart from providing the correct command `name`, the two key properties are the `class` (which should be `parallelizable_pure`) and the `rel_path` (which should point to the aggregator programs we just implemented---ideally, relative to `$PASH_TOP`). @@ -210,7 +126,7 @@ Note that path is relative with respect to `$PASH_TOP` and therefore refers to ` Here is the annotation for [`test_two.json`](./test_two.json), pointing to `runtime/agg/py/cat.py` (i.e., implying `$PASH_TOP/runtime/agg/py/cat.py`). The annotations also specifies that the aggregator should be called with the `-a` flag, in addition to any other flags provided to the original command. 
-**More complex aggregators**: +*More complex aggregators.* Suppose we want to parallelize a new script called [ann-agg-2.sh](https://github.com/binpash/pash/blob/main/evaluation/tests/ann-agg.sh). This script contains two new commands `test_uniq_1` and `test_uniq_2`. Their annotations are in files [annotations/test_uniq_1](./test_uniq_1.json) and [annotations/test_uniq_2.json](./test_uniq_2.json). diff --git a/annotations/chmod.json b/annotations/chmod.json index 5e6cb1603..df7d3ccd7 100644 --- a/annotations/chmod.json +++ b/annotations/chmod.json @@ -1,12 +1,9 @@ { - "command": "chmod", - "cases": - [ - { - "predicate": "default", - "class": "side-effectful", - "inputs": ["stdin"], - "outputs": ["stdout"] - } - ] + "command": "chmod", + "cases": [ + { + "predicate": "default", + "class": "side-effectful" + } + ] } diff --git a/annotations/cut.json b/annotations/cut.json index 364ae91f3..ef3d53ad3 100644 --- a/annotations/cut.json +++ b/annotations/cut.json @@ -1,33 +1,56 @@ { - "command": "cut", - "cases": - [ - { - "predicate": - { - "operator": "and", - "operands": - [ - { - "operator": "val_opt_eq", - "operands": ["-d", "--delimiter", "\n"] - }, - { - "operator": "exists", - "operands": ["-f", "--fields"] - } - ] - }, - "class": "pure", - "inputs": ["stdin"], - "outputs": ["stdout"], - "comments": "Stateless in all cases with exception in case where newline is a delimiter." 
- }, - { - "predicate": "default", - "class": "stateless", - "inputs": ["stdin"], - "outputs": ["stdout"] - } - ] + "command": "cut", + "cases": [ + { + "predicate": { + "operator": "or", + "operands": [ + { + "operator": "val_opt_eq", + "operands": [ + "-d", + "\n" + ] + }, + { + "operator": "exists", + "operands": [ + "-z" + ] + } + ] + }, + "class": "pure", + "inputs": [ + "args[:]" + ], + "outputs": [ + "stdout" + ] + }, + { + "predicate": "default", + "class": "stateless", + "inputs": [ + "args[:]" + ], + "outputs": [ + "stdout" + ] + } + ], + "options": [ + "stdin-hyphen", + "empty-args-stdin" + ], + "short-long": [ + { + "short": "-d", + "long": "--delimiter" + }, + { + "short": "-z", + "long": "--zero-terminated" + } + ] } diff --git a/docs/README.md b/docs/README.md index d426faf60..3a382efad 100644 --- a/docs/README.md +++ b/docs/README.md @@ -6,7 +6,7 @@ Quick Jump: [using pash](#using-pash) | [videos](#videos--video-presentations) | The following resources offer overviews of important PaSh components. 
* Short tutorial: [introduction](./tutorial#introduction), [installation](./tutorial#installation), [execution](./tutorial#running-scripts), and [next steps](./tutorial#what-next) -* Annotations: [parallelizability](../annotations#main-parallelizability-classes), [study](../annotations#parallelizability-study-of-commands-in-gnu--posix), [example 1](../annotations#a-simple-example-chmod), [example 2](../annotations#another-example-cut), [howto](../annotations#how-to-annotate-a-command) +* Annotations: [parallelizability](../annotations#main-parallelizability-classes), [study](../annotations#parallelizability-study-of-commands-in-gnu--posix), [annotation example](../annotations#annotation-examples), [howto](../annotations#how-to-annotate-a-command), [adding custom aggregators](../annotations#adding-custom-aggregators) * Compiler: [intro](../compiler#introduction), [overview](../compiler#compiler-overview), [details](../compiler#zooming-into-fragments), [earlier versions](../compiler#earlier-versions) * Runtime: [split](../runtime#stream-splitting), [eager](../runtime#eager-stream-polling), [cleanup](../runtime#cleanup-logic), [aggregate](../runtime#aggregators) * Scripts: [one-liners](../evaluation/benchmarks/#common-unix-one-liners), [unix50](../evaluation/benchmarks/#unix-50-from-bell-labs), [weather analysis](../evaluation/benchmarks/#noaa-weather-analysis), [web indexing](../evaluation/benchmarks/#wikipedia-web-indexing)
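
Reviewer note: to illustrate how the new `annotations/cut.json` reads, here is a minimal Python sketch of walking its cases in order and returning the class of the first predicate that holds. This is not PaSh's actual annotation evaluator; the function names (`option_value`, `holds`, `classify`) are hypothetical, and the operator semantics are assumed from the README prose (`val_opt_eq` compares the value following an option, `exists` checks that a flag is present). The sketch only handles options passed as separate tokens, not attached forms such as `-d,` or `--delimiter=,`.

```python
# Abbreviated copy of the new annotations/cut.json from this diff.
CUT_ANNOTATION = {
    "command": "cut",
    "cases": [
        {
            "predicate": {
                "operator": "or",
                "operands": [
                    {"operator": "val_opt_eq", "operands": ["-d", "\n"]},
                    {"operator": "exists", "operands": ["-z"]},
                ],
            },
            "class": "pure",
        },
        {"predicate": "default", "class": "stateless"},
    ],
    "short-long": [
        {"short": "-d", "long": "--delimiter"},
        {"short": "-z", "long": "--zero-terminated"},
    ],
}


def option_value(args, opt):
    """Return the token following `opt` in args, or None if absent."""
    for i, arg in enumerate(args):
        if arg == opt and i + 1 < len(args):
            return args[i + 1]
    return None


def holds(predicate, args):
    """Evaluate a predicate (assumed semantics) against an argument list."""
    if predicate == "default":
        return True
    op, operands = predicate["operator"], predicate["operands"]
    if op == "or":
        return any(holds(p, args) for p in operands)
    if op == "and":
        return all(holds(p, args) for p in operands)
    if op == "exists":
        return any(o in args for o in operands)
    if op == "val_opt_eq":
        # Last operand is the expected value; the rest are option names.
        *opts, expected = operands
        return any(option_value(args, o) == expected for o in opts)
    raise ValueError(f"unknown operator: {op}")


def classify(annotation, args):
    """Return the class of the first case whose predicate holds."""
    # Normalize long option names to their short form first.
    long_to_short = {e["long"]: e["short"] for e in annotation.get("short-long", [])}
    args = [long_to_short.get(a, a) for a in args]
    for case in annotation["cases"]:
        if holds(case["predicate"], args):
            return case["class"]
    return None


if __name__ == "__main__":
    print(classify(CUT_ANNOTATION, ["-d", "\n", "-f", "1"]))
    print(classify(CUT_ANNOTATION, ["-d", ",", "-f", "1"]))
```

Because cases are checked in order, the `"default"` case acts as a fallback, which matches the README's description that `cut` is stateless in all cases other than a newline delimiter or `-z`.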