Internals: block representation
The parser (src/parser.y) emits a "block" representation of the parsed program. The compiler mutates this representation and then generates bytecode from it. The block representation of a jq program is very similar to the bytecode that one sees when using the --debug-dump-disasm option to the jq executable, but it has more structure and more information. A work-in-progress may add a --debug-dump-block option to the jq executable that will print a JSON representation of a program's parsed and compiled block form, which might then be useful for such things as writing syntax highlighters.
The block representation of a jq program is a tree structure resembling an AST, but some syntactic information is lost. In general, all places in src/parser.y where gen_noop() is used to generate a block will lose information. For example, the identity filter . becomes a noop block. Also, note that gen_noop() doesn't return a block with a single NOOP instruction; it returns an empty block, which is why parser rule actions that output a gen_noop() will lose information: there's no inst object (see below) in which to preserve any information.
The C type for this is block, a typedef of a very simple C struct:
struct inst;
typedef struct inst inst;
typedef struct block {
  inst* first;
  inst* last;
} block;
A block is a tuple pointing to the first and last instructions of an instruction chain. The struct inst type is opaque, private to src/compile.c, but we can describe it here, with some simplification, as a struct with these fields:
struct inst* next;
struct inst* prev;
opcode op;
struct {
  uint16_t intval;
  struct inst* target;
  jv constant;
  const struct cfunction* cfunc;
} imm;
struct locfile* locfile;
location source;
struct inst* bound_by;
char* symbol;
block subfn; // used by CLOSURE_CREATE (body of function)
block arglist; // used by CLOSURE_CREATE (formals) and CALL_JQ (arguments)
struct bytecode* compiled;
int bytecode_pos; // position just after this insn
The next and prev fields are used to make the instructions a linked list.
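The relationship between a block and its instruction chain can be sketched in a few lines of C. This is a simplified illustration, not the code in src/compile.c: every name prefixed with sk_ is hypothetical, and the instruction struct keeps only the linking fields. It shows why an empty block carries no information (there is no instruction to attach anything to) and how two chains are concatenated by relinking next and prev.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the real types; all sk_ names are hypothetical. */
typedef struct sk_inst {
  struct sk_inst* next;
  struct sk_inst* prev;
  int op;                 /* stand-in for the opcode */
} sk_inst;

typedef struct {
  sk_inst* first;
  sk_inst* last;
} sk_block;

/* An empty block: no instructions, so nothing to hang information on. */
static sk_block sk_noop(void) {
  sk_block b = { NULL, NULL };
  return b;
}

/* Wrap one instruction as a single-element block. */
static sk_block sk_single(sk_inst* i) {
  i->next = i->prev = NULL;
  sk_block b = { i, i };
  return b;
}

/* Concatenate two chains; joining with an empty block returns the other. */
static sk_block sk_join(sk_block a, sk_block b) {
  if (a.first == NULL) return b;
  if (b.first == NULL) return a;
  assert(a.last->next == NULL && b.first->prev == NULL);
  a.last->next = b.first;
  b.first->prev = a.last;
  a.last = b.last;
  return a;
}

/* Count instructions by walking forward from first. */
static int sk_length(sk_block b) {
  int n = 0;
  for (sk_inst* i = b.first; i != NULL; i = i->next) n++;
  return n;
}
```

Because the block stores both ends of the chain, concatenation in this sketch is constant-time: only the boundary pointers change, and no instruction is copied.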
The op field is the instruction opcode.
The imm field is the instruction's immediate operand, if it expects one. Instructions with branches (e.g., JUMP, JUMP_F) point to the target of the branch via imm.target. intval is used for various internal purposes, such as counting how many frames to the left of a $varname reference its binding lives, so that at run time the interpreter can search backwards through that many frames (note: not literally frames on the stack, but frames in a list on the stack). The other fields of imm are self-evident.
The locfile and source fields refer to a source jq program and start/end byte offsets into that program.
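As a rough illustration of how a start/end offset pair maps back to program text, the sketch below copies the referenced bytes out of a program string. The sk_location type and sk_source_text function are hypothetical, and the half-open [start, end) convention is an assumption of this example rather than something stated above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a start/end byte-offset pair into program text. */
typedef struct {
  int start;  /* offset of the first byte */
  int end;    /* offset one past the last byte (assumed half-open) */
} sk_location;

/* Copy bytes [start, end) of the program text into a fresh string.
   Returns NULL if the range is invalid or allocation fails; caller frees. */
static char* sk_source_text(const char* program, sk_location loc) {
  assert(program != NULL);
  size_t len = strlen(program);
  if (loc.start < 0 || loc.end < loc.start || (size_t)loc.end > len)
    return NULL;
  size_t n = (size_t)(loc.end - loc.start);
  char* out = malloc(n + 1);
  if (out == NULL) return NULL;
  memcpy(out, program + loc.start, n);
  out[n] = '\0';
  return out;
}
```

With the program text `.foo | length`, the pair {7, 13} selects `length` under this convention.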
The bound_by field points to an inst that this instruction will refer to. For example, a CALL_JQ will use this to point to the function that should be called.
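One way to picture binding is a pass that walks a chain and points each unbound call at the definition with the same name. The sketch below assumes that model. The opcodes SK_DEF and SK_CALL, the sk_inst layout, and sk_bind are all hypothetical simplifications and do not reproduce the binding logic in src/compile.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum sk_op { SK_DEF, SK_CALL, SK_OTHER };  /* hypothetical opcodes */

typedef struct sk_inst {
  struct sk_inst* next;
  enum sk_op op;
  const char* symbol;        /* name defined or referenced, if any */
  struct sk_inst* bound_by;  /* the definition this instruction refers to */
} sk_inst;

/* Point every still-unbound SK_CALL whose symbol matches `binder` at
   `binder`. Returns how many references were bound. */
static int sk_bind(sk_inst* binder, sk_inst* body) {
  assert(binder->op == SK_DEF && binder->symbol != NULL);
  int nrefs = 0;
  for (sk_inst* i = body; i != NULL; i = i->next) {
    if (i->op == SK_CALL && i->bound_by == NULL && i->symbol != NULL &&
        strcmp(i->symbol, binder->symbol) == 0) {
      i->bound_by = binder;
      nrefs++;
    }
  }
  return nrefs;
}
```

Skipping instructions that are already bound means an inner definition, bound first, keeps its references even if an outer definition with the same name is bound later.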
The symbol field holds the name that this instruction implements, if any.
The subfn field is a block representation of a function's body, which includes its sub-functions, hence the field's name. arglist holds the formals of a function definition (CLOSURE_CREATE) or the argument closure bodies of a function call (CALL_JQ).
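Because an instruction can own nested blocks, the representation is a tree rather than a flat list. The sketch below makes that concrete by counting every instruction reachable from a block, descending into subfn and arglist. The sk_ types are hypothetical and keep only the fields needed for the traversal.

```c
#include <assert.h>
#include <stddef.h>

struct sk_inst;
typedef struct {
  struct sk_inst* first;
  struct sk_inst* last;
} sk_block;

/* Hypothetical, reduced instruction: linking plus the two nested blocks. */
typedef struct sk_inst {
  struct sk_inst* next;
  sk_block subfn;    /* nested function body, if any */
  sk_block arglist;  /* nested formals or arguments, if any */
} sk_inst;

/* Count every instruction reachable from b, including nested blocks. */
static int sk_count(sk_block b) {
  int n = 0;
  for (sk_inst* i = b.first; i != NULL; i = i->next) {
    assert(i->next != NULL || i == b.last);  /* last really ends the chain */
    n += 1 + sk_count(i->subfn) + sk_count(i->arglist);
  }
  return n;
}
```

Any pass that must visit the whole program (for example binding or dumping) would follow the same recursive shape under this model.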
The compiled field points to bytecode once the block has been compiled.
The bytecode_pos field holds the resolved address of the next instruction once this block has been compiled.
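The role of bytecode_pos can be sketched as a two-pass layout: first record the position just after each instruction, then use those positions to resolve branch targets. The opcodes, the instruction lengths, and the convention that a branch resumes just after its target are assumptions of this example, and the sk_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum sk_op { SK_DUP, SK_JUMP };  /* hypothetical opcodes */

typedef struct sk_inst {
  struct sk_inst* next;
  enum sk_op op;
  struct sk_inst* target;  /* plays the role of imm.target */
  int bytecode_pos;        /* position just after this instruction */
} sk_inst;

/* Assumed encoding: one word per opcode, plus one for a branch operand. */
static int sk_length(const sk_inst* i) {
  return i->op == SK_JUMP ? 2 : 1;
}

/* Pass 1: record, for each instruction, the position just after it.
   Returns the total code length. */
static int sk_assign_positions(sk_inst* first) {
  int pos = 0;
  for (sk_inst* i = first; i != NULL; i = i->next) {
    pos += sk_length(i);
    i->bytecode_pos = pos;
  }
  return pos;
}

/* Pass 2 helper: where a jump lands once every position is known.
   In this sketch a branch to `target` resumes just after it. */
static int sk_jump_destination(const sk_inst* jump) {
  assert(jump->op == SK_JUMP && jump->target != NULL);
  return jump->target->bytecode_pos;
}
```

Two passes are needed in this model because a forward branch refers to an instruction whose position is not yet known when the branch itself is laid out.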