Tokenization

Implementations must act as if they used the following state machine to tokenize POSIX-compliant dotenv files.

The state machine must start in the assignment list state.

Most states consume a single character, which may have various side effects, and then do one of three things: switch the state machine to a new state to reconsume the current input character, switch it to a new state to consume the next character, or stay in the same state to consume the next character. Some states have more complicated behavior and can consume several characters before switching to another state.
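
The three kinds of transition described above can be illustrated with a minimal driver loop. This is only a sketch: the `run` function, the handler signature, and the two toy states `letters` and `other` are hypothetical and are not states of this specification.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A handler receives the consumed character (None models EOF, an assumption)
# and returns (next_state, reconsume). A next_state of None stops the machine.
Handler = Callable[[Optional[str], List[str]], Tuple[Optional[str], bool]]


def run(text: str, start: str, handlers: Dict[str, Handler]) -> List[str]:
    out: List[str] = []
    state: Optional[str] = start
    pos = 0
    current: Optional[str] = None
    reconsume = False
    while state is not None:
        if reconsume:
            reconsume = False  # hand the current input character out again
        else:
            current = text[pos] if pos < len(text) else None
            pos += 1
        state, reconsume = handlers[state](current, out)
    return out


# Toy states for illustration only, NOT taken from this specification.
def letters(ch: Optional[str], out: List[str]) -> Tuple[Optional[str], bool]:
    if ch is None:
        return None, False
    if ch.isalpha():
        out.append(ch)
        return "letters", False  # stay in the same state, consume the next character
    return "other", True         # switch state and reconsume the current character


def other(ch: Optional[str], out: List[str]) -> Tuple[Optional[str], bool]:
    if ch is None:
        return None, False
    out.append("_")
    return "letters", False      # switch state and consume the next character


result = run("ab1c", "letters", {"letters": letters, "other": other})
print(result)  # ['a', 'b', '_', 'c']
```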

Definitions

Next input character

The next input character is the first character in the input stream that has not yet been consumed or explicitly ignored. Initially, the next input character is the first character in the input.

Current input character

The current input character is the last character to have been consumed.

Reconsume

When a state says to reconsume a matched character in a specified state, that means to switch to that state, but when it attempts to consume the next input character, provide it with the current input character instead.
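
One possible way to model the next input character, the current input character, and reconsuming is sketched below. The `InputStream` class and its method names are hypothetical, and representing EOF as `None` is an assumption made for the example.

```python
from typing import Optional


class InputStream:
    """Hypothetical helper tracking the next and current input characters."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._pos = 0                        # index of the next input character
        self.current: Optional[str] = None   # last consumed character
        self._reconsume = False

    def consume(self) -> Optional[str]:
        """Consume and return the next input character, or None at EOF."""
        if self._reconsume:
            # A reconsume was requested: provide the current character instead.
            self._reconsume = False
            return self.current
        if self._pos < len(self._text):
            self.current = self._text[self._pos]
            self._pos += 1
        else:
            self.current = None  # EOF modelled as None (an assumption)
        return self.current

    def reconsume(self) -> None:
        """Make the next consume() return the current input character."""
        self._reconsume = True


stream = InputStream("A=1")
first = stream.consume()   # "A"
stream.reconsume()         # a state asked to reconsume in another state
again = stream.consume()   # "A" again, not "="
after = stream.consume()   # "="
```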

Temporary buffer

The temporary buffer is a string of code points that is initially empty.

Flush the temporary buffer

When a state says to flush the temporary buffer:

  • If the temporary buffer is not empty:
    • Create a new token.
    • If the state says to flush the temporary buffer as a <kind> token, set the token's kind to the specified kind. Otherwise, set the token's kind to Characters.
    • Set the token's value to the contents of the temporary buffer.
    • Emit the newly created token.
    • Set the temporary buffer to the empty string.
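
The flush steps above can be sketched as follows. The `Token` and `Tokenizer` classes are hypothetical scaffolding, and the `Name` kind used in the usage lines is an illustrative assumption; only the `Characters` default comes from the text above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    kind: str
    value: str


class Tokenizer:
    """Hypothetical container for the temporary buffer and emitted tokens."""

    def __init__(self) -> None:
        self.temporary_buffer = ""
        self.tokens: List[Token] = []

    def emit(self, token: Token) -> None:
        self.tokens.append(token)

    def flush_temporary_buffer(self, kind: str = "Characters") -> None:
        # Nothing happens when the temporary buffer is empty.
        if not self.temporary_buffer:
            return
        # Create the token, set its kind and value, then emit it.
        self.emit(Token(kind=kind, value=self.temporary_buffer))
        # Reset the temporary buffer to the empty string.
        self.temporary_buffer = ""


tokenizer = Tokenizer()
tokenizer.flush_temporary_buffer()        # empty buffer: no token is emitted
tokenizer.temporary_buffer = "hello"
tokenizer.flush_temporary_buffer()        # emitted as a Characters token
tokenizer.temporary_buffer = "FOO"
tokenizer.flush_temporary_buffer("Name")  # "Name" is a made-up kind for illustration
```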

Stack of return states

The stack of return states is a stack of states, used in some states to return to the state they were invoked from. It is initially empty.

Return state

The return state is the state that is currently on top of the stack of return states.

When a state says to switch to the return state:

  • Pop a state off the stack of return states.
  • Switch to the state returned by the previous step.

When a state says to reconsume in the return state:

  • Pop a state off the stack of return states.
  • Reconsume in the state returned by the previous step.
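
A minimal sketch of the stack of return states follows. The `Machine` class and its `reconsume_pending` flag are hypothetical. The state names are taken from this document, but the point at which a state pushes onto the stack is an assumption made only for the usage lines.

```python
from typing import List, Optional


class Machine:
    """Hypothetical holder for the current state and the stack of return states."""

    def __init__(self, start_state: str) -> None:
        self.state = start_state
        self.return_states: List[str] = []  # initially empty
        self.reconsume_pending = False

    @property
    def return_state(self) -> Optional[str]:
        # The return state is whatever is currently on top of the stack.
        return self.return_states[-1] if self.return_states else None

    def switch_to_return_state(self) -> None:
        # Pop a state, then switch to it.
        self.state = self.return_states.pop()

    def reconsume_in_return_state(self) -> None:
        # Pop a state, then reconsume the current input character in it.
        self.state = self.return_states.pop()
        self.reconsume_pending = True


machine = Machine("Double-quoted state")
# Assumption for illustration: remember where to return before entering a sub-state.
machine.return_states.append("Double-quoted state")
machine.state = "Dollar state"
top_before = machine.return_state
machine.reconsume_in_return_state()
```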

Quoting level

The quoting level is an unsigned integer that is initially zero.

When a state says to increment the quoting level, set the quoting level to its current value plus one.

When a state says to decrement the quoting level, set the quoting level to its current value minus one.
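
As a small sketch, the quoting level can be kept in a counter like the one below. The `QuotingLevel` class is hypothetical, and raising an error when decrementing at zero is an assumption: the text above only says the value is unsigned and does not define that case.

```python
class QuotingLevel:
    """Hypothetical counter for the quoting level."""

    def __init__(self) -> None:
        self.value = 0  # initially zero

    def increment(self) -> None:
        self.value += 1

    def decrement(self) -> None:
        if self.value == 0:
            # Assumption: treat underflow as an error, since the level is unsigned.
            raise ValueError("quoting level is already zero")
        self.value -= 1


level = QuotingLevel()
level.increment()
level.increment()
level.decrement()
```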

ASCII upper alpha

An ASCII upper alpha is a code point in the range U+0041 (A) to U+005A (Z), inclusive.

ASCII lower alpha

An ASCII lower alpha is a code point in the range U+0061 (a) to U+007A (z), inclusive.

ASCII alpha

An ASCII alpha is an ASCII upper alpha or ASCII lower alpha.

ASCII digit

An ASCII digit is a code point in the range U+0030 (0) to U+0039 (9), inclusive.
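
The four character classes above translate directly into range checks. The sketch below uses hypothetical function names. It compares code points explicitly because Python's built-in `str.isalpha` and `str.isdigit` also accept non-ASCII characters.

```python
def is_ascii_upper_alpha(ch: str) -> bool:
    # U+0041 (A) to U+005A (Z), inclusive
    return len(ch) == 1 and 0x41 <= ord(ch) <= 0x5A


def is_ascii_lower_alpha(ch: str) -> bool:
    # U+0061 (a) to U+007A (z), inclusive
    return len(ch) == 1 and 0x61 <= ord(ch) <= 0x7A


def is_ascii_alpha(ch: str) -> bool:
    return is_ascii_upper_alpha(ch) or is_ascii_lower_alpha(ch)


def is_ascii_digit(ch: str) -> bool:
    # U+0030 (0) to U+0039 (9), inclusive
    return len(ch) == 1 and 0x30 <= ord(ch) <= 0x39
```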

State machine

The tokenizer state machine consists of the states defined in the following subsections.

Assignment list state

Consume the next input character.

Comment state

Consume the next input character.

  • EOF:
    • Emit an EOF token.
  • U+000A LINE FEED (LF):
  • anything else:
    • Ignore the character.

Assignment name state

Consume the next input character.

Assignment value state

Consume the next input character.

Assignment value escape state

Consume the next input character.

Single-quoted state

Consume the next input character.

Double-quoted state

Consume the next input character.

Double-quoted escape state

Consume the next input character.

Dollar state

Consume the next input character.

Simple expansion state

Consume the next input character.

Complex expansion start state

Consume the next input character.

Complex expansion state

Consume the next input character.

Expansion operator state

Consume the next input character.

Expansion value state

Consume the next input character.

Expansion value escape state

Consume the next input character.