Minor fixes around annotations #305

Open: wants to merge 2 commits into base: main
120 changes: 18 additions & 102 deletions annotations/README.md
@@ -1,5 +1,5 @@
# Parallelizability Study & Annotation Language
Quick Jump: [Parallelizability](#main-parallelizability-classes) | [study](#parallelizability-study-of-commands-in-gnu--posix) | [example 1](#a-simple-example-chmod) | [example 1](#another-example-cut) | [howto](#how-to-annotate-a-command) | [issues](#Issues)
Quick Jump: [Parallelizability](#main-parallelizability-classes) | [study](#parallelizability-study-of-commands-in-gnu--posix) | [Annotation Examples](#annotation-examples) | [howto](#how-to-annotate-a-command) | [adding custom aggregators](#adding-custom-aggregators) | [issues](#issues)

PaSh includes
(i) a parallelizability study of commands in POSIX and GNU Coreutils, and
@@ -47,94 +47,19 @@ Annotations can be thought of as defining a bidirectional correspondence between
Since command behaviors (and correspondence) can change based on their arguments, annotations contain a sequence of predicates.
Each predicate is accompanied by information that instantiates the correspondence between a command and a dataflow node.

## A Simple Example: `chmod`
## Annotation Examples

As a first example, below we present the annotations for `chmod`.
* `chmod`: As a first example, [see the annotation for `chmod`](./chmod.json).
The annotation for `chmod` is very simple, since it only needs to establish that `chmod` is side-effectful and therefore cannot be translated to a dataflow node.

```json
{
"command": "chmod",
"cases": [
{
"predicate": "default",
"class": "side-effectful"
}
]
}
```

The annotation for `chmod` is very simple, since it only needs to establish that `chmod` is side-effectful and therefore cannot be translated to a dataflow node.

## Another Example: `cut`

As another example, below we present the annotations for `cut`.

```json
{
"command": "cut",
"cases": [
{
"predicate": {
"operator": "or",
"operands": [
{
"operator": "val_opt_eq",
"operands": [
"-d",
"\n"
]
},
{
"operator": "exists",
"operands": [
"-z"
]
}
]
},
"class": "pure",
"inputs": [
"args[:]"
],
"outputs": [
"stdout"
]
},
{
"predicate": "default",
"class": "stateless",
"inputs": [
"args[:]"
],
"outputs": [
"stdout"
]
}
],
"options": [
"stdin-hyphen",
"empty-args-stdin"
],
"short-long": [
{
"short": "-d",
"long": "--delimiter"
},
{
"short": "-z",
"long": "--zero-terminated"
}
]
}
```

The annotation for `cut` has two cases, each of which consists of a predicate on its arguments, and then an assignment of its parallelizability class, inputs, and outputs.
The first predicate indicates that `cut` is "pure" -- _i.e._, not parallelizable but representable as a dataflow node -- if the value accompanying the `-d` option is `\n` or if it was used with the `-z` flag.
In both of these cases, newlines do not represent data item boundaries, but are rather used internally by the command, making it unsafe to parallelize by splitting on line boundaries.
In all other cases (see the "default" case) the command is stateless.
Inputs are always assigned to the non-option arguments and the output is always stdout.
The option "stdin-hyphen" indicates that a non-option argument that is just a dash `-` represents the stdin, and the option “empty-args-stdin” indicates that if non-option arguments are empty, then the command reads from its stdin.
The list identified by "short-long" contains a correspondence of short and long argument names for this command.
* `cut`: As another example, [see the annotation for `cut`](./cut.json).
It has two cases, each of which consists of a predicate on its arguments, and then an assignment of its parallelizability class, inputs, and outputs.
The first predicate indicates that `cut` is "pure" -- _i.e._, not parallelizable but representable as a dataflow node -- if the value accompanying the `-d` option is `\n` or if it was used with the `-z` flag.
In both of these cases, newlines do not represent data item boundaries, but are rather used internally by the command, making it unsafe to parallelize by splitting on line boundaries.
In all other cases (see the "default" case) the command is stateless.
Inputs are always assigned to the non-option arguments and the output is always stdout.
The option "stdin-hyphen" indicates that a non-option argument that is just a dash `-` represents the stdin, and the option “empty-args-stdin” indicates that if non-option arguments are empty, then the command reads from its stdin.
The list identified by "short-long" contains a correspondence of short and long argument names for this command.
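
To make the case matching described above concrete, here is a minimal Python sketch of how the predicates of the `cut` annotation could be evaluated against an argument list. It is not PaSh's implementation: `option_value`, `holds`, and `classify` are hypothetical helpers, the annotation is trimmed to its `cases`, and only options passed as separate tokens (e.g., `-d ','`) are recognized.

```python
# Illustrative only: the helper names below are hypothetical, not part of PaSh.
CUT_ANNOTATION = {
    "command": "cut",
    "cases": [
        {
            "predicate": {
                "operator": "or",
                "operands": [
                    {"operator": "val_opt_eq", "operands": ["-d", "\n"]},
                    {"operator": "exists", "operands": ["-z"]},
                ],
            },
            "class": "pure",
        },
        {"predicate": "default", "class": "stateless"},
    ],
}


def option_value(args, opt):
    """Return the token that follows `opt` in `args`, or None if absent."""
    for i, arg in enumerate(args[:-1]):
        if arg == opt:
            return args[i + 1]
    return None


def holds(predicate, args):
    """Evaluate one annotation predicate against a list of arguments."""
    if predicate == "default":
        return True
    op, operands = predicate["operator"], predicate["operands"]
    if op == "or":
        return any(holds(p, args) for p in operands)
    if op == "and":
        return all(holds(p, args) for p in operands)
    if op == "exists":
        return any(opt in args for opt in operands)
    if op == "val_opt_eq":
        # Assumption: the last operand is the value, the others are option names.
        *opts, value = operands
        return any(option_value(args, opt) == value for opt in opts)
    raise ValueError(f"unsupported operator: {op}")


def classify(annotation, args):
    """Return the class of the first case whose predicate holds."""
    for case in annotation["cases"]:
        if holds(case["predicate"], args):
            return case["class"]
    return None
```

Under these assumptions, `classify(CUT_ANNOTATION, ["-z", "-f", "1"])` selects the first case (`"pure"`), while `classify(CUT_ANNOTATION, ["-d", ",", "-f", "1"])` falls through to the default case (`"stateless"`). Cases are tried in order, which is why the "default" predicate comes last.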

## How to Annotate a Command

@@ -178,29 +103,20 @@ For more details, here is an early version of the annotation language:

[//]: # (TODO: 1. update language spec; 2. put all annotations in a directory)

## Mini-tutorial: Adding Custom Aggregators
## Adding Custom Aggregators

For this tutorial, let's assume you want to parallelize [a simple `ann-agg.sh` script](https://github.com/binpash/pash/blob/main/evaluation/tests/ann-agg.sh).

Let's also assume there are no annotations or aggregators for the commands `test_one` and `test_two`.
Note that normally these two commands would be annotated as `stateless`, as their aggregator is simply the con`cat`enation function;
however, we will now annotate them as `parallelizable_pure` and provide "custom" aggregation commands that simply concatenate their input streams.

*Step 1: Implement aggregators and their annotations*:

An aggregator is usually either binary or _n_-ary:
it takes as input two or _n_ file names (or paths) and outputs results to the standard out.
An aggregator may also take additional flags---for example, flags that configure its operation or flags that were provided to the original command.

We will implement `test_one`'s aggregator as [a shell script](https://github.com/binpash/pash/blob/main/runtime/agg/opt/concat.sh) that internally uses the Unix `cat` command to concatenate any number of input streams.

We will implement `test_two`'s aggregator as [a Python script](https://github.com/binpash/pash/blob/main/runtime/agg/py/cat.py) that concatenates any number of inputs streams.
*Step 1: Implement aggregators and their annotations.*
An aggregator is usually either binary or _n_-ary: it takes as input two or _n_ file names (or paths) and outputs results to the standard output. An aggregator may also take additional flags---for example, flags that configure its operation or flags that were provided to the original command. We implement `test_one`'s aggregator as [a shell script](https://github.com/binpash/pash/blob/main/runtime/agg/opt/concat.sh) that internally uses the Unix `cat` command to concatenate any number of input streams. We implement `test_two`'s aggregator as [a Python script](https://github.com/binpash/pash/blob/main/runtime/agg/py/cat.py) that concatenates any number of input streams.

For PaSh to be able to hook these aggregators correctly, _i.e._, so that it can instantiate them as command invocations, we also need to add their annotations in [annotations/custom_aggregators](https://github.com/binpash/pash/tree/main/annotations/custom_aggregators).
Below are the two annotation files named [`annotations/custom_aggregators/cat.py.json`](./custom_aggregators/cat.py.json) and [`annotations/custom_aggregators/concat.json`](./custom_aggregators/concat.json). (FIXME: relative path? **Until this is fixed, prefix aggregator names with `pagg-` to avoid name clashes!**)
The most important information in these files is (i) the aggregation command's `name`, and (ii) its treatment of inputs (both taking `["args[:]"]`), and outputs (both outputing to `["stdout"]`).
For PaSh to be able to hook these aggregators correctly, _i.e._, so that it can instantiate them as command invocations, we also need to add their annotations in [annotations/custom_aggregators](https://github.com/binpash/pash/tree/main/annotations/custom_aggregators). Below are the two annotation files named [`annotations/custom_aggregators/cat.py.json`](./custom_aggregators/cat.py.json) and [`annotations/custom_aggregators/concat.json`](./custom_aggregators/concat.json). (FIXME: relative path? **Until this is fixed, prefix aggregator names with `pagg-` to avoid name clashes!**) The most important information in these files is (i) the aggregation command's `name`, and (ii) its treatment of inputs (both taking `["args[:]"]`) and outputs (both outputting to `["stdout"]`).
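
As a rough illustration of the _n_-ary aggregator shape described in this step, the following Python sketch concatenates any number of input files to standard output. It is not the linked `cat.py`; the `concat` function is a hypothetical name, and the sketch simply skips any argument that starts with `-`, which is an assumption about how flags might be handled.

```python
import shutil
import sys


def concat(paths, out):
    """Copy each input file to `out`, in argument order."""
    for path in paths:
        with open(path, "rb") as src:
            shutil.copyfileobj(src, out)


if __name__ == "__main__":
    # Assumption: tokens starting with '-' are flags and are ignored here.
    inputs = [arg for arg in sys.argv[1:] if not arg.startswith("-")]
    concat(inputs, sys.stdout.buffer)
```

Reading and writing in binary mode keeps the aggregator independent of the input encoding, and taking inputs from the argument list while writing to standard output matches the `["args[:]"]` and `["stdout"]` treatment named in the annotation files.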

*Step 2: Point commands to their custom aggregators*:
*Step 2: Point commands to their custom aggregators.*
Add two new annotation files in `$PASH_TOP/annotations` with names `test_one.json` and `test_two.json`, so that they point to the right aggregator commands.
Apart from providing the correct command `name`, the two key properties are the `class` (which should be `parallelizable_pure`) and the `rel_path` (which should point to the aggregator programs we just implemented---ideally, relative to `$PASH_TOP`).

@@ -210,7 +126,7 @@ Note that path is relative with respect to `$PASH_TOP` and therefore refers to `
Here is the annotation for [`test_two.json`](./test_two.json), pointing to `runtime/agg/py/cat.py` (i.e., implying `$PASH_TOP/runtime/agg/py/cat.py`).
The annotation also specifies that the aggregator should be called with the `-a` flag, in addition to any other flags provided to the original command.

**More complex aggregators**:
*More complex aggregators.*
Suppose we want to parallelize a new script called [ann-agg-2.sh](https://github.com/binpash/pash/blob/main/evaluation/tests/ann-agg.sh).
This script contains two new commands `test_uniq_1` and `test_uniq_2`.
Their annotations are in files [annotations/test_uniq_1](./test_uniq_1.json) and [annotations/test_uniq_2.json](./test_uniq_2.json).
17 changes: 7 additions & 10 deletions annotations/chmod.json
@@ -1,12 +1,9 @@
{
"command": "chmod",
"cases":
[
{
"predicate": "default",
"class": "side-effectful",
"inputs": ["stdin"],
"outputs": ["stdout"]
}
]
"command": "chmod",
"cases": [
{
"predicate": "default",
"class": "side-effectful"
}
]
}
85 changes: 54 additions & 31 deletions annotations/cut.json
@@ -1,33 +1,56 @@
{
"command": "cut",
"cases":
[
{
"predicate":
{
"operator": "and",
"operands":
[
{
"operator": "val_opt_eq",
"operands": ["-d", "--delimiter", "\n"]
},
{
"operator": "exists",
"operands": ["-f", "--fields"]
}
]
},
"class": "pure",
"inputs": ["stdin"],
"outputs": ["stdout"],
"comments": "Stateless in all cases with exception in case where newline is a delimiter."
},
{
"predicate": "default",
"class": "stateless",
"inputs": ["stdin"],
"outputs": ["stdout"]
}
]
"command": "cut",
"cases": [
{
"predicate": {
"operator": "or",
"operands": [
{
"operator": "val_opt_eq",
"operands": [
"-d",
"\n"
]
},
{
"operator": "exists",
"operands": [
"-z"
]
}
]
},
"class": "pure",
"inputs": [
"args[:]"
],
"outputs": [
"stdout"
]
},
{
"predicate": "default",
"class": "stateless",
"inputs": [
"args[:]"
],
"outputs": [
"stdout"
]
}
],
"options": [
"stdin-hyphen",
"empty-args-stdin"
Member

Similarly, this is not implemented at the moment (and is very crucial for all the commands that read from stdin if they have no arguments). In its current form, the annotation leads to test failures because uses of cut that read from stdin are not supported (that is why previously cut's annotation only reads from stdin).

Member

This is probably the reason why tests fail.

Collaborator Author

OK, there are probably multiple reasons why the tests are failing :P

],
"short-long": [
Member

This is not yet supported by PaSh. We can open an issue or add a test with a long version of one of these flags and use that in a PR to implement that.

{
"short": "-d",
"long": "--delimiter"
},
{
"short": "-z",
"long": "--zero-terminated"
}
]
}
2 changes: 1 addition & 1 deletion docs/README.md
@@ -6,7 +6,7 @@ Quick Jump: [using pash](#using-pash) | [videos](#videos--video-presentations) |
The following resources offer overviews of important PaSh components.

* Short tutorial: [introduction](./tutorial#introduction), [installation](./tutorial#installation), [execution](./tutorial#running-scripts), and [next steps](./tutorial#what-next)
* Annotations: [parallelizability](../annotations#main-parallelizability-classes), [study](../annotations#parallelizability-study-of-commands-in-gnu--posix), [example 1](../annotations#a-simple-example-chmod), [example 2](../annotations#another-example-cut), [howto](../annotations#how-to-annotate-a-command)
* Annotations: [parallelizability](../annotations#main-parallelizability-classes), [study](../annotations#parallelizability-study-of-commands-in-gnu--posix), [annotation example](../annotations#annotation-examples), [howto](../annotations#how-to-annotate-a-command), [adding custom aggregators](../annotations#adding-custom-aggregators)
* Compiler: [intro](../compiler#introduction), [overview](../compiler#compiler-overview), [details](../compiler#zooming-into-fragments), [earlier versions](../compiler#earlier-versions)
* Runtime: [split](../runtime#stream-splitting), [eager](../runtime#eager-stream-polling), [cleanup](../runtime#cleanup-logic), [aggregate](../runtime#aggregators)
* Scripts: [one-liners](../evaluation/benchmarks/#common-unix-one-liners), [unix50](../evaluation/benchmarks/#unix-50-from-bell-labs), [weather analysis](../evaluation/benchmarks/#noaa-weather-analysis), [web indexing](../evaluation/benchmarks/#wikipedia-web-indexing)