Currently, integers are scanned then the substring is parsed. This is ideal for parsers which segment into words first, but leads to a lot of logic duplication for scanning parsers. Since a scanner can be applied to a segment, scanning is the more general approach. The following details generalized integer parsing while scanning.
TODO: Scan until the token boundary first, detect the base, then parse it.
TODO: Change it to forbidding digit separators when juxtaposition is enabled and
using the first "_" to resolve ambiguity. This also enables base suffixes with
juxtaposition (e.g., 111b_slide
).
TODO: Handle "0b"/"0B" as digits for "h"/"H" suffix.
Signs
- Initialize negative to false.
- Initialize sign to
None
inenum { None, Neg, Pos, Multiple }
- Initialize sign count to 0.
- Loop:
- If bump '-':
- Set negative to not negative.
- Set sign to: Match sign: Case
None
:Neg
; Otherwise:Multiple
. - Error unless
signs
includesNeg
.
- Else if bump '+':
- Set sign to: Match sign: Case
None
:Pos
; Otherwise:Multiple
. - Error unless
signs
includesPos
.
- Set sign to: Match sign: Case
- Else: Break.
- Increment sign count.
- If strip whitespace: Error unless
space_locations
includesAfterSign
.
- If bump '-':
- If sign count > 1: Error unless
signs
includesMultiple
.
Base prefix
- Initialize base style to
None
. - Initialize marked base to
None
. - If bump "0b" or "0B": Set base style to "0b" or "0B" prefix. Set marked base to 2.
- Else if bump "0o" or "OO": Set base style to "0o" or "OO" prefix. Set marked base to 8.
- Else if bump "0x" or "0X": Set base style to "0x" or "0X" prefix. Set marked base to 16.
Digits
- Initialize digits array to empty.
- Initialize inferred base to 0.
- Initialize last was digit sep to false.
- Initialize digit sep after octal zero to false.
- Initialize first letter pos to
None
. - While not EOF:
- Initialize ch to bump
- Match ch:
- Case '0':
- Case '1': Set inferred base to max of itself and 2.
- Case '2' to '7': Set inferred base to max of itself and 8.
- Case '8' to '9': Set inferred base to max of itself and 10.
- Case 'a' to 'f' or 'A' to 'F':
- If ch is 'b' or 'B'
and base style is
None
and inferred base <= 2 and peek is not any of '0' to '9', 'a' to 'z', 'A' to 'Z', '_', or ''':- Set base style to "b" or "B" suffix. Set marked base to 2.
- Break.
- Else:
- Set inferred base to max of itself and 16.
- If first letter pos is
None
: Set first letter pos to current pos - 1.
- If ch is 'b' or 'B'
and base style is
- Case '_' or ''':
- If ch is '_': Error unless
digit_sep
includesUnderscore
. - If ch is ''':
- If
quote_digit_sep
not in config: Backtrack 1. Break. - Error unless
digit_sep
includesSingleQuote
.
- If
- If digits array is empty:
- If base style is
None
: Error. - Else: Error unless
digit_sep_location
includesAfterBasePrefix
.
- If base style is
- Else if
leading_zero_octal
in config and digits array is [0]:- Set digit sep after octal zero to true.
- If last was digit sep: Error unless
digit_sep_location
includesMultipleAdjacent
. - Continue.
- If ch is '_': Error unless
- Default:
- Backtrack 1. Break.
- Push ch to digits array.
- If last was digit sep and digits is not empty:
- Error unless
digit_sep_location
includesAfterDigits
.
- Error unless
Base suffix
- If base style is
None
and if peek is 'o', 'O', 'h', or 'H' and peek after is not any of '0' to '9', 'a' to 'z', 'A' to 'Z', '_', or ''':- Bump.
- Set base style to "o", "O", "h", or "H" suffix. Set marked base to 8 or 16.
- If base style is "h" or "H" suffix
and first element of digits array is in 'a' to 'f' or 'A' to 'F':
- Error if
suffix_decimal_first
.
- Error if
- Error unless
base_styles
includes base style.
Juxtaposed word
- If
juxtapose_word
in config and base style isNone
and peek is 'a' to 'z' or 'A' to 'Z':- Backtrack to first letter pos. Pop excess from digits array.
- Adjust inferred base to 2, 8, or 10.
Leading zero octal
- If
leading_zero_octal
in config and base style isNone
and first element of digits array is 0:- If digits array is [0]: Set marked base to 10.
- Else:
- Pop front of digits array.
- Set base style to '0' prefix. Set marked base to 8.
- If digit sep after octal zero: Error unless
digit_sep_location
includesAfterOctalLeadingZero
.
Suffix and empty prefix conflict
- If digits array is empty:
- If base style is "0b" prefix and
base_styles
contains "b" suffix or base style is "0B" prefix andbase_styles
contains "B" suffix or base style is "0o" prefix andbase_styles
contains "o" suffix or base style is "0O" prefix andbase_styles
contains "O" suffix:- Set base style to "b", "B", "o", or "O" suffix. Set marked base to 2 or 8.
- Set digits array to [0].
- If base style is "0b" prefix and
Validate base
- If marked base < inferred base:
- Error with out of range digit.
There should also be a simple algorithm, which handles the common case of
[-+]?[0-9]+
. If it fails, switch to the full algorithm. If at least one digit
was read, jump to digit reading in the full algorithm, reusing the digit buffer
and leading zero count.
signs: EnumSet<SignStyle>
withenum SignStyle { Neg, Pos, Multiple }
.base_styles: EnumSet<BaseStyle>
as currently.suffix_decimal_first: bool
.digit_seps: EnumSet<DigitSep>
withenum DigitSep { Underscore, SingleQuote }
.digit_sep_location: EnumSet<DigitSepLocation>
withenum DigitSepLocation { AfterBasePrefix, AfterOctalLeadingZero, AfterDigits, MultipleAdjacent }
.spaces
configures the whitespace characters allowed. Probably an enum set, since everything will use a subset of Unicode whitespace plus NUL.space_locations: EnumSet<SpaceLocation>
withenum SpaceLocation { AfterSign }
.
This algorithm only works if base_styles
includes Decimal
(i.e.,
unprefixed).
Only these three options affect the parsing behavior; the rest above determine which constructs are errors.
leading_zero_octal: bool
determines whether a leading '0' denotes octal.quote_digit_sep: bool
determines whether to parse single quote ''' digit separators.juxtapose_word: bool
, when enabled, prefers the shortest valid decimal integer interpretation if there is trailing text. This is for supporting integers fused to a mnemonic as in wsf.parens: bool
determines whether to parse parens.
An 'r'/'R' suffix would be compatible with all integer syntaxes. It could denote
a "raw" integer, where the leading zeros are significant in the binary
serialization. Add field raw_suffix: bool
to IntegerSyntax
.
Other bases up to 62 would require a prefix and could not be inferred.
A fused mnemonic begins at the first ASCII letter or the first underscore, whichever is first. However, only if what would be the mnemonic has a digit invalid for the base.
Fused mnemonics cannot start with an ASCII digit. A mnemonic that starts with an ASCII letter can only be used with decimal integers.
Suppose the mnemonic xchg
can be fused. Then 0xchg
is unclear: it could be
either hex 0xc
fused with hg
or 0
fused with xchg
. For simplicity, 0x
at the start should always be interpreted as a hex prefix. For the latter
behavior, the user should write 0_xchg
, 00xchg
, or possibly 0 xchg
.
Suppose the mnemonic add
can be fused. It's entirely hex digits, so cannot be
used after a hex integer.
Suppose the mnemonic div
can be fused. It would be partially read as hex
digits.
Haskell read :: String -> Integer
allows wrapping parentheses and whitespace.
Although the result is a single value and the parentheses could be included in
the integer token, it is more extensible to make another token kind like
syn::ExprParen
and to
represent whitespace with separate tokens. (Burghard quotes could be generalized
to variant of the same token.)
Then, the only remaining position whitespace can occur within an integer is between the sign and digits. If that pattern occurs, it would be represented as a unary operator instead; likewise for repeated unary signs. This then makes it more compatible with languages with infix operators.
- "0101b": binary 0b101.
- "0101b123slide": decimal 0101 followed by "b123slide".
- "255add": decimal 255 followed by "add"
- "0xffadd": hex 0xffadd.
- "0xff_add": hex 0xffadd.
- "0xchg": hex 0xc followed by error "hg" (even if fused enabled).
- "0_xchg": decimal 0 followed by "xchg".
- "00xchg": decimal 00 followed by "xchg".
A test system that runs the parser on all possible configurations would be useful. However, there are around 2^29 combinations of config options, so this would not be practical. Instead, it could be fuzzed. For each component of the output and every kind of error, a small routine could be written that would compute only that, to be used for comparison.