
Internals: jq Assignment Operators

Nico Williams edited this page Jul 12, 2023 · 10 revisions

Assignments in the compiler and in the block representation

The jq assignment operators =, //=, <op>= (e.g., +=, -=, etc.), and |= are very special. They're not like assignments in most languages -- they are just another kind of jq expression that produces zero, one, or more values. Each value produced is the input to the assignment with the paths denoted by the left-hand side (LHS) changed as directed by the right-hand side (RHS).

The LHS is very special: it is a path expression (TODO: add wiki page about path expressions), which is an expression consisting only of sub-expressions like .a, if/then/else with path expressions as the actions, and/or calls to functions whose bodies are path expressions.
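
A few quick illustrations of what counts as a path expression may help here. This is only a sketch, assuming a jq binary (1.7 or so) on PATH; the function name target is made up for the example.

```sh
# path(...) lists the paths that an LHS expression denotes.
echo '{"a":{"b":0},"c":[1,2]}' | jq -c 'path(.a.b, .c[])'
# prints:
#   ["a","b"]
#   ["c",0]
#   ["c",1]

# if/then/else whose branches are path expressions is itself a path expression.
echo '{"flag":true,"a":0,"b":0}' | jq -c '(if .flag then .a else .b end) = 1'
# prints: {"flag":true,"a":1,"b":0}

# A call to a function whose body is a path expression also works as an LHS.
# (target is a hypothetical name, not a jq builtin.)
echo '{"a":{"b":0}}' | jq -c 'def target: .a.b; target = 1'
# prints: {"a":{"b":1}}
```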

The RHS is some expression. In the case of |=, the RHS receives the current value at the LHS path as its . input. In all the other cases the RHS receives the same . as the whole assignment expression. The latter can be confusing: in .a += .b, the .b is read from the top-level input, not from .a.
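
The difference in what . means on the RHS is easiest to see side by side. A small sketch, assuming jq is on PATH; the input documents are invented for the example.

```sh
# =  : the RHS sees the input of the whole assignment, so .b is the top-level "b".
echo '{"a":{"b":5},"b":10}' | jq -c '.a = .b'
# prints: {"a":10,"b":10}

# |= : the RHS sees the current value at the LHS path, so .b here means .a.b.
echo '{"a":{"b":5},"b":10}' | jq -c '.a |= .b'
# prints: {"a":5,"b":10}

# += : like =, the RHS is evaluated against the top-level input.
echo '{"a":1,"b":10}' | jq -c '.a += .b'
# prints: {"a":11,"b":10}
```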

Inspecting src/parser.y is instructive.

First we have //= and <op>=, shown here as each grammar rule followed by the C function it calls:

Exp "//=" Exp {
  $$ = gen_definedor_assign($1, $3);
} |

static block gen_definedor_assign(block object, block val) {
  block tmp = gen_op_var_fresh(STOREV, "tmp");
  return BLOCK(gen_op_simple(DUP),
               val, tmp,
               gen_call("_modify", BLOCK(gen_lambda(object),
                                         gen_lambda(gen_definedor(gen_noop(),
                                                                  gen_op_bound(LOADV, tmp))))));
}

Exp "+=" Exp {
  $$ = gen_update($1, $3, '+');
} |

static block gen_update(block object, block val, int optype) {
  block tmp = gen_op_var_fresh(STOREV, "tmp");
  return BLOCK(gen_op_simple(DUP),
               val,
               tmp,
               gen_call("_modify", BLOCK(gen_lambda(object),
                                         gen_lambda(gen_binop(gen_noop(),
                                                              gen_op_bound(LOADV, tmp),
                                                              optype)))));
}

Having val before the gen_call("_modify", ...) has three consequences for //= and <op>=: the RHS receives the same . as the whole assignment (the input against which the LHS paths are resolved), the RHS is evaluated unconditionally before any path is visited, and the assignment is done once per value output by the RHS.
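
These consequences can be observed from the command line. A sketch, assuming jq 1.7 or so on PATH (older releases compiled these operators differently, so their output may vary):

```sh
# One complete assignment per value output by the RHS.
echo '{"a":0}' | jq -c '.a += (1, 2)'
# prints:
#   {"a":1}
#   {"a":2}

# The RHS .[0] is evaluated once, against the original input, before any
# element is modified, so every element gets 1 added (not a running value).
echo '[1,2,3]' | jq -c '.[] += .[0]'
# prints: [2,3,4]
```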

Compare to |= which is coded like this:

Exp "|=" Exp {
  $$ = gen_call("_modify", BLOCK(gen_lambda($1), gen_lambda($3)));
} |

Ok, let's translate all of this to English:

  • First |=: gen_call("_modify", BLOCK(gen_lambda($1), gen_lambda($3))); means: "generate a call to _modify with the LHS ($1) as the first argument and the RHS ($3) as the second argument" (note that jq function arguments are lambdas, thus the gen_lambda()s).

  • Now gen_definedor_assign() and gen_update() (which are very similar):

    • the DUP is memory management -- ignore for this analysis
    • val is the RHS, and we will invoke it immediately
    • store the val output(s) (RHS) in tmp (a gensym'ed $binding)
    • call _modify (the heart of modify-assign operators) with the LHS as the first argument and a second argument that amounts to . // $tmp (or . <op> $tmp in gen_update()), where $tmp is the gensym'ed binding mentioned above
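
The translation above can be approximated in jq itself by spelling out the binding by hand and using |= for the _modify call. This is a sketch of the idea rather than the exact bytecode, assuming jq is on PATH; $tmp here is an ordinary user variable standing in for the gensym'ed one.

```sh
# .a += .b behaves like: bind the RHS first, then modify the LHS.
echo '{"a":1,"b":10}' | jq -c '.a += .b'
echo '{"a":1,"b":10}' | jq -c '.b as $tmp | .a |= . + $tmp'
# both print: {"a":11,"b":10}

# .a //= .b likewise. The parentheses matter: // binds more loosely than |=.
echo '{"a":null,"b":10}' | jq -c '.a //= .b'
echo '{"a":null,"b":10}' | jq -c '.b as $tmp | .a |= (. // $tmp)'
# both print: {"a":10,"b":10}
```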

The difference between //= and other op= assignments is that // is block-coded in gen_definedor() while the ops are builtins like _plus. // could have been jq-coded, but it's not.

jq-coded assignment helpers

The jq-coded builtin _assign implements the jq-coded part of the = assignment operator:

def _assign(paths; $value): reduce path(paths) as $p (.; setpath($p; $value));

_assign is pretty self-explanatory. All it does is reduce over the paths setting the given value at each path. It helps to first see the yacc/bison/compiler side of things (see above).
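
To experiment with this without touching the builtin, the same definition can be copied under a different name. A sketch, assuming jq 1.5 or later on PATH; my_assign is a made-up name.

```sh
# my_assign is a hypothetical user-level copy of the _assign definition above.
echo '{"a":1,"b":2,"c":3}' | jq -c '
  def my_assign(paths; $value):
    reduce path(paths) as $p (.; setpath($p; $value));
  my_assign(.a, .b; 0)'
# prints: {"a":0,"b":0,"c":3}

# The = operator gives the same result for the same paths and value.
echo '{"a":1,"b":2,"c":3}' | jq -c '(.a, .b) = 0'
# prints: {"a":0,"b":0,"c":3}
```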

The jq-coded builtin _modify implements the jq-coded part of all the other assignment operators:

def _modify(paths; update):
    reduce path(paths) as $p ([., []];
        . as $dot
      | null
      | label $out
      | ($dot[0] | getpath($p)) as $v
      | (
          (   $$$$v
            | update
            | (., break $out) as $v
            | $$$$dot
            | setpath([0] + $p; $v)
          ),
          (
              $$$$dot
            | setpath([1, (.[1] | length)]; $p)
          )
        )
    ) | . as $dot | $dot[0] | delpaths($dot[1]);

The $$$$v thing is an internal-only hack where evaluating $$$$v produces $v's value, but also sets $v to null so that the next invocation of $$$$v or $v produces null. This is done to avoid holding on to a reference that would cause copy-on-write behavior that would make _modify accidentally quadratic.
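
Since $$$$v is not available to user programs, a simplified model of _modify has to drop it. The sketch below does that under the made-up name my_modify, assuming jq 1.5 or later on PATH. It should produce the same results as the real thing on small inputs, but it makes no attempt to keep a single reference to the state, so it may well be quadratic on large ones.

```sh
# my_modify is a hypothetical, simplified model of _modify (no $$$$ hack).
MY_MODIFY='
def my_modify(paths; update):
  reduce path(paths) as $p ([., []];    # state: [document, paths to delete]
      . as $dot
    | label $out
    | ( ($dot[0] | getpath($p) | update) as $v
        | $dot
        | setpath([0] + $p; $v)         # store the first output of update
        | ., break $out                 # emit the new state, then stop
      ),
      ( $dot                            # reached only when update was empty
        | setpath([1, (.[1] | length)]; $p)
      )
  )
  | . as $dot | $dot[0] | delpaths($dot[1]);
'

# Odd elements are multiplied by 10; even ones produce empty and are deleted.
echo '[1,2,3,4,5]' | jq -c "$MY_MODIFY my_modify(.[]; select(. % 2 == 1) | . * 10)"
# prints: [10,30,50]

# Only the first output of update is used.
echo '{"a":0}' | jq -c "$MY_MODIFY my_modify(.a; (. + 1, . + 2))"
# prints: {"a":1}
```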

In English: we're reducing over the paths, using as the reduction state an array containing . and an initially empty array of paths to delete. For any path for which update produces a value, we take the first value and alter . to set that value at that path. For any path for which update produces no value (empty), we add that path to the array of paths to delete. Once the reduction completes, we delete all the paths queued up for deletion. We delay deletions because we generally traverse array elements from the first to the last: if we deleted any non-last element right away, the indices of the remaining elements would decrement, which in turn would cause subsequent deletions to be off.
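
The index-shifting problem is easy to reproduce with plain delpaths. A sketch, assuming jq is on PATH: the first command deletes each path as soon as it is visited, the second collects the paths and deletes them in one call.

```sh
# Immediate deletion: the paths were computed against the original array, so
# after each deletion the later indices point at the wrong elements.
echo '[1,2,3,4,5]' | jq -c 'reduce path(.[]) as $p (.; delpaths([$p]))'
# prints: [2,4]

# Delayed deletion: collect every path first, then delete them all at once.
echo '[1,2,3,4,5]' | jq -c 'delpaths([path(.[])])'
# prints: []
```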

What's really tricky here is that we need to make sure we have just one reference to the reduction state when we get to setpath([0] + $p; $v) (where update produced a value) or setpath([1, (.[1] | length)]; $p) (where update was empty so we're queuing up a deletion of that path). We also need to have only one reference to the value to be altered.

TBD: Explain more.
