Step 3.4: Regular expressions and custom parsers

Regular expressions

To work with regular expressions, the Rust ecosystem offers the regex crate, which is the default choice in most cases.

A Rust library for parsing, compiling, and executing regular expressions. Its syntax is similar to Perl-style regular expressions, but lacks a few features like look around and backreferences. In exchange, all searches execute in linear time with respect to the size of the regular expression and search text. Much of the syntax and implementation is inspired by RE2.

If you need additional features (like look around and backreferences), consider using:

- fancy-regex crate, building additional functionality on top of the regex crate.
- pcre2 crate, providing a safe high-level Rust binding to the PCRE2 library.
- hyperscan crate, wrapping the Hyperscan library.

Compile only once

It is important to know that in Rust a regular expression must be compiled before it can be used, and that compilation is not cheap. So, the following code introduces a performance problem:

use regex::Regex;

fn is_email(email: &str) -> bool {
    let re = Regex::new(".+@.+").unwrap();  // compiles every time the function is called
    re.is_match(email)
}

To avoid this unnecessary performance penalty, compile the regular expression once and reuse the compiled result. This is easily achieved with the once_cell crate, in global and/or local scope:

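use once_cell::sync::Lazy;
use regex::Regex;

// The static must have the `Lazy<Regex>` type, not `Regex`:
// `Lazy` dereferences to the inner `Regex` on access.
static REGEX_EMAIL: Lazy<Regex> = Lazy::new(|| {
    Regex::new(".+@.+").unwrap()
}); // compiles once on the first use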

fn is_email(email: &str) -> bool {
    REGEX_EMAIL.is_match(email)
}

This may feel different from how regular expressions are used in other programming languages, because some of them implicitly cache compilation results and/or do not expose a compilation API at all (like PHP). But if your background is a language like Go or Java, this concept should be familiar to you.
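
The compile-once idea in a local scope can be sketched with the standard library alone. The snippet below is only an illustration: it uses std::sync::OnceLock (available in std since Rust 1.70) instead of once_cell, and CompiledPattern, COMPILATIONS and has_at_sign are hypothetical stand-ins, not types or functions of the regex crate. The counter exists solely to make the number of "compilations" observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the stand-in compilation step has run.
static COMPILATIONS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a compiled pattern (NOT the `regex` crate's type).
struct CompiledPattern {
    needle: char,
}

impl CompiledPattern {
    /// Pretends to be an expensive compilation step.
    fn compile(needle: char) -> Self {
        COMPILATIONS.fetch_add(1, Ordering::SeqCst);
        CompiledPattern { needle }
    }

    fn is_match(&self, text: &str) -> bool {
        text.contains(self.needle)
    }
}

fn has_at_sign(text: &str) -> bool {
    // Function-local static: initialized on the first call, reused afterwards.
    static PATTERN: OnceLock<CompiledPattern> = OnceLock::new();
    PATTERN
        .get_or_init(|| CompiledPattern::compile('@'))
        .is_match(text)
}
```

However many times has_at_sign is called, the closure passed to get_or_init runs at most once. The same shape should carry over to a real Regex stored in a function-local static.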

Custom parsers

If regular expressions are not powerful enough for your parsing problem, you end up writing your own parser. The Rust ecosystem has numerous crates to help with that:

Parser combinators:
- nom crate, among the most performant, and especially good for parsing binary (byte- or bit-oriented) data.
- chumsky crate, focusing on high-quality errors and ergonomics over performance.
- combine crate, inspired by the Parsec library in Haskell.
- pom crate, providing PEG parser combinators created using operator overloading without macros.
- chomp crate, a fast monadic-style parser combinator library.
Parser generators:
- peg crate, a simple yet flexible parser generator that makes it easy to write robust parsers, based on the Parsing Expression Grammar formalism.
- pest crate, with a focus on accessibility, correctness, and performance, using PEG (parsing expression grammar) as an input and deriving parser's code for it.
- lalrpop crate, generating LR(1) parser code from custom grammar files.
- parsel crate, a library for generating parsers directly from syntax tree node types.
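
To make the term "parser combinator" more concrete, here is a minimal std-only sketch. It does not use or reproduce the API of any crate listed above: digits, ch, pair and width_dot_precision are hypothetical names chosen for illustration. The idea is that a parser is a function from input to an optional (value, remaining input) pair, and a combinator builds a bigger parser out of smaller ones.

```rust
/// Primitive parser: one or more ASCII digits, parsed as `usize`.
fn digits(input: &str) -> Option<(usize, &str)> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    // Fails on an empty digit run or on a value that overflows `usize`.
    let value = input[..end].parse::<usize>().ok()?;
    Some((value, &input[end..]))
}

/// Primitive parser factory: matches exactly one expected character.
fn ch<'a>(expected: char) -> impl Fn(&'a str) -> Option<(char, &'a str)> {
    move |input: &'a str| input.strip_prefix(expected).map(|rest| (expected, rest))
}

/// Combinator: runs `first`, then runs `second` on whatever input is left.
fn pair<'a, A, B>(
    first: impl Fn(&'a str) -> Option<(A, &'a str)>,
    second: impl Fn(&'a str) -> Option<(B, &'a str)>,
) -> impl Fn(&'a str) -> Option<((A, B), &'a str)> {
    move |input| {
        let (a, rest) = first(input)?;
        let (b, rest) = second(rest)?;
        Some(((a, b), rest))
    }
}

/// Composite parser built only from the pieces above: `<digits>.<digits>`.
fn width_dot_precision(input: &str) -> Option<((usize, usize), &str)> {
    let parser = pair(digits, pair(ch('.'), digits));
    let ((width, (_dot, precision)), rest) = parser(input)?;
    Some(((width, precision), rest))
}
```

Parser generators take the opposite route: the grammar is written declaratively, and the parsing code is derived from it instead of being composed by hand from functions.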

To better understand the parsing problem and the approaches to it, along with some examples, read through the following articles:

Task

Estimated time: 1 day

Given the following Rust fmt syntax grammar:

format_string := text [ maybe_format text ] *
maybe_format := '{' '{' | '}' '}' | format
format := '{' [ argument ] [ ':' format_spec ] [ ws ] * '}'
argument := integer | identifier

format_spec := [[fill]align][sign]['#']['0'][width]['.' precision]type
fill := character
align := '<' | '^' | '>'
sign := '+' | '-'
width := count
precision := count | '*'
type := '' | '?' | 'x?' | 'X?' | identifier
count := parameter | integer
parameter := argument '$'

In the above grammar,
- text must not contain any '{' or '}' characters,
- ws is any character for which char::is_whitespace returns true, has no semantic meaning and is completely optional,
- integer is a decimal integer that may contain leading zeroes and must fit into a usize and
- identifier is an IDENTIFIER_OR_KEYWORD (not an IDENTIFIER) as defined by the Rust language reference.
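
To see how the format_spec fields from the grammar look in practice, the following sketch feeds a few specs to the standard format! macro. The helper name format_spec_examples is hypothetical and exists only to group the examples; the comments name the grammar rules each spec exercises.

```rust
/// Collects a few formatted strings, each exercising different
/// `format_spec` fields from the grammar.
fn format_spec_examples() -> Vec<String> {
    vec![
        format!("{:*>8}", "ab"),        // fill '*', align '>', width 8
        format!("{:+}", 5),             // sign '+'
        format!("{:#x}", 255),          // '#' flag, type 'x'
        format!("{:08.3}", 2.5),        // '0' flag, width 8, precision 3
        format!("{:1$.2$}", 2.5, 7, 1), // width and precision as `parameter`s (argument '$')
        format!("{:.*}", 2, 1.23456),   // precision '*': taken from the argument list
    ]
}
```

In these specs, sign, width and precision are exactly the three fields the task below asks you to extract.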

Implement a parser to parse sign, width and precision from a given input (assumed to be a format_spec).

Provide implementations in two flavours: regex-based and via building a custom parser.

Prove the correctness of your implementation with tests.

Questions

After completing everything above, you should be able to answer (and understand why) the following questions:

- How does the regex crate achieve linear time complexity? At what price?
- How can regular expression recompilation be avoided in Rust? Why is it important?
- What are the common kinds of libraries for writing custom parsers in Rust? What benefits does each one have?
- What advantages do libraries give for writing a custom parser? Are they mandatory? When does it make sense to avoid using a library for implementing a parser?
