To operate with regular expressions there is the regex
crate in Rust ecosystem, which is kinda a default choice to go with in most cases.
A Rust library for parsing, compiling, and executing regular expressions. Its syntax is similar to Perl-style regular expressions, but lacks a few features like look around and backreferences. In exchange, all searches execute in linear time with respect to the size of the regular expression and search text. Much of the syntax and implementation is inspired by RE2.
If you need additional features (like look around and backreferences), consider using:
fancy-regex
crate, building additional functionality on top of theregex
crate.pcre2
crate, providing a safe high level Rust binding to PCRE2 library.hyperscan
crate, wrapping a Hyperscan library.
Important to know, that in Rust regular expression needs to be compiled before we can use it. The compilation is not cheap. So, the following code introduces a performance problem:
fn is_email(email: &str) -> bool {
let re = Regex::new(".+@.+").unwrap(); // compiles every time the function is called
re.is_match(email)
}
To omit unnecessary performance penalty we should compile regular expression once and reuse its compilation result. This is easily achieved by using the once_cell
crate both in global and/or local scopes:
static REGEX_EMAIL: Regex = once_cell::sync::Lazy::new(|| {
Regex::new(".+@.+").unwrap()
}); // compiles once on the first use
fn is_email(email: &str) -> bool {
REGEX_EMAIL.is_match(email)
}
This may feel different with how regular expressions are used in other programming languages, because some of them implicitly cache compilation results and/or do not expose compilation API at all (like PHP). But if your background is a language like Go or Java, this concept should be familiar to you.
If regular expressions are not powerful enough for your parsing problem, then you are ended up with writing your own parser. Rust ecosystem has numerous crates to help with that:
- Parser combinators:
nom
crate, nearly the most performant among others, and especially good for parsing binary stuff (byte/bit-oriented).chumsky
crate, focusing on high-quality errors and ergonomics over performance.combine
crate, inspired by the Parsec library in Haskell.pom
crate, providing PEG parser combinators created using operator overloading without macros.chomp
crate, a fast monadic-style parser combinator library.
- Parser generators:
peg
crate, a simple yet flexible parser generator that makes it easy to write robust parsers, based on the Parsing Expression Grammar formalism.pest
crate, with a focus on accessibility, correctness, and performance, using PEG (parsing expression grammar) as an input and deriving parser's code for it.lalrpop
crate, generating LR(1) parser code from custom grammar files.parsel
crate, a library for generating parsers directly from syntax tree node types.
For better understanding parsing problem and approaches, along with some examples, read through the following articles:
- Laurence Tratt: Which Parsing Approach?
- Richard L. Apodaca: A Beginner's Guide to Parsing in Rust
- Eshan Singh: Practical Parsing in Rust with nom
- Nazmul Idris: Guide to parsing with nom
- Brian Kung: Building a CEDICT parser in Rust with Nom
- The Nom Guide (Nominomicon)
- Aleksey Kladov: Resilient LL Parsing Tutorial
- Aleksey Kladov: Simple but Powerful Pratt Parsing
- Aleksey Kladov: From Pratt to Dijkstra
Estimated time: 1 day
Given the following Rust fmt
syntax grammar:
format_string := text [ maybe_format text ] * maybe_format := '{' '{' | '}' '}' | format format := '{' [ argument ] [ ':' format_spec ] [ ws ] * '}' argument := integer | identifier format_spec := [[fill]align][sign]['#']['0'][width]['.' precision]type fill := character align := '<' | '^' | '>' sign := '+' | '-' width := count precision := count | '*' type := '' | '?' | 'x?' | 'X?' | identifier count := parameter | integer parameter := argument '$'
In the above grammar,
text
must not contain any'{'
or'}'
characters,ws
is any character for whichchar::is_whitespace
returnstrue
, has no semantic meaning and is completely optional,integer
is a decimal integer that may contain leading zeroes and must fit into anusize
andidentifier
is anIDENTIFIER_OR_KEYWORD
(not anIDENTIFIER
) as defined by the Rust language reference.
Implement a parser to parse sign
, width
and precision
from a given input (assumed to be a format_spec
).
Provide implementations in two flavours: regex
-based and via building a custom parser.
Prove your implementation correctness with tests.
After completing everything above, you should be able to answer (and understand why) the following questions:
- How does
regex
crate achieve linear time complexity? In what price? - How to avoid regular expression recompilation in Rust? Why is it important?
- Which are the common kinds of libraries for writing custom parses in Rust? Which benefits does each one have?
- What advantages does libraries give for writing a custom parser? Are they mandatory? When does it make sense to avoid using a library for implementing a parser?