This is the reference documentation for lexy.
Inputs and Encodings
An Input defines the input that will be parsed by lexy. It has a corresponding Encoding that controls, among other things, its character type and whether certain rules are available. The Input itself is immutable; it produces a Reader, which remembers the current position of the input during parsing.
Encodings
lexy/encoding.hpp
namespace lexy
{
struct default_encoding;
struct ascii_encoding;
struct utf8_encoding;
struct utf16_encoding;
struct utf32_encoding;
struct raw_encoding;
template <typename CharT>
using deduce_encoding = /* see below */;
enum class encoding_endianness;
}
An Encoding is a set of pre-defined policy classes that determine the text encoding of an input.
Each encoding has a primary character type, which is the character type of the input. It can also have a secondary character type, which the input accepts but internally converts to the primary character type. For example, lexy::utf8_encoding's primary character type is char8_t, but it also accepts char.
The encoding also has an integer type, which can store any valid character (code unit, to be precise) as well as a special EOF value, similar to std::char_traits. For some encodings, the integer type can be the same as the character type, as not all values are valid code units. This allows optimizations.
Certain rules require a certain encoding. For example, lexy::dsl::code_point does not work with lexy::default_encoding, and lexy::dsl::encode requires lexy::raw_encoding.
The supported encodings
- lexy::default_encoding: The encoding that is used when no other encoding is specified. Its character type is char and it can work with any 8-bit encoding (ASCII, UTF-8, extended ASCII, etc.). Only use this encoding if you don't know the exact encoding of your input.
- lexy::ascii_encoding: Assumes the input is valid ASCII. Its character type is char.
- lexy::utf8_encoding: Assumes the input is valid UTF-8. Its character type is char8_t, but it also accepts char.
- lexy::utf16_encoding: Assumes the input is valid UTF-16. Its character type is char16_t, but it also accepts wchar_t on Windows.
- lexy::utf32_encoding: Assumes the input is valid UTF-32. Its character type is char32_t, but it also accepts wchar_t on Linux.
- lexy::raw_encoding: Does not assume the input is text. Its character type is unsigned char, but it also accepts char. Use this encoding if you're not parsing text or if you're parsing text consisting of multiple encodings.
If you specify an encoding that does not match the input's actual encoding, e.g. you say it is UTF-8 when in reality it is some Windows code page, the library handles it by generating parse errors. The worst that can happen is an unexpected EOF error, because the input contains the character that is used to signal EOF in the encoding.
Deducing encoding
If you don't specify an encoding for your input, lexy can sometimes deduce it by matching the character type to the primary character type. For example, a string of char8_t will be deduced to be lexy::utf8_encoding. If the character type is char, lexy will deduce lexy::default_encoding (unless that has been overridden by a build option).
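For example, a sketch of what deduction yields under the default build options:
static_assert(std::is_same_v<lexy::deduce_encoding<char8_t>, lexy::utf8_encoding>);
static_assert(std::is_same_v<lexy::deduce_encoding<char>, lexy::default_encoding>);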
Encoding endianness
enum class encoding_endianness
{
little,
big,
bom,
};
In memory, UTF-16 and UTF-32 come in two flavors: big and little endian. Which version is used can be specified with the encoding_endianness enumeration. This is only relevant when e.g. reading data from files.
- little: The encoding is written using little endian. For single-byte encodings, this has no effect.
- big: The encoding is written using big endian. For single-byte encodings, this has no effect.
- bom: The endianness is determined using the byte-order mark (BOM) of the encoding. If no BOM is present, it defaults to big endian, as per the Unicode recommendation. For UTF-8, this skips the optional BOM but has otherwise no effect. For non-Unicode encodings, this has no effect.
The pre-defined Inputs
Null input
lexy/input/null_input.hpp
namespace lexy
{
template <typename Encoding = default_encoding>
class null_input
{
public:
constexpr Reader reader() const& noexcept;
};
template <typename Encoding = default_encoding>
using null_lexeme = lexeme_for<null_input<Encoding>>;
template <typename Tag, typename Encoding = default_encoding>
using null_error = error_for<null_input<Encoding>, Tag>;
template <typename Production, typename Encoding = default_encoding>
using null_error_context = error_context<Production, null_input<Encoding>>;
}
The class lexy::null_input is an input that is always empty.
Range input
lexy/input/range_input.hpp
namespace lexy
{
template <typename Encoding, typename Iterator, typename Sentinel = Iterator>
class range_input
{
public:
using encoding = Encoding;
using char_type = typename encoding::char_type;
using iterator = Iterator;
constexpr range_input() noexcept;
constexpr range_input(Iterator begin, Sentinel end) noexcept;
constexpr iterator begin() const noexcept;
constexpr iterator end() const noexcept;
constexpr Reader reader() const& noexcept;
};
}
The class lexy::range_input is an input that represents the range [begin, end). CTAD can be used to deduce the encoding from the value type of the iterator.
The input is a lightweight view and does not own any data.
Use lexy::string_input instead if the range is contiguous.
Example
Using the range input to parse content from a list.
std::list<char8_t> list = …;
// Create the input, deducing the encoding.
auto input = lexy::range_input(list.begin(), list.end());
String input
lexy/input/string_input.hpp
namespace lexy
{
template <typename Encoding = default_encoding>
class string_input
{
public:
using encoding = Encoding;
using char_type = typename encoding::char_type;
using iterator = const char_type*;
constexpr string_input() noexcept;
template <typename CharT>
constexpr string_input(const CharT* begin, const CharT* end) noexcept;
template <typename CharT>
constexpr string_input(const CharT* data, std::size_t size) noexcept;
template <typename View>
constexpr explicit string_input(const View& view) noexcept;
constexpr iterator begin() const noexcept;
constexpr iterator end() const noexcept;
constexpr Reader reader() const& noexcept;
};
template <typename Encoding, typename CharT>
constexpr auto zstring_input(const CharT* str) noexcept;
template <typename CharT>
constexpr auto zstring_input(const CharT* str) noexcept;
template <typename Encoding = default_encoding>
using string_lexeme = lexeme_for<string_input<Encoding>>;
template <typename Tag, typename Encoding = default_encoding>
using string_error = error_for<string_input<Encoding>, Tag>;
template <typename Production, typename Encoding = default_encoding>
using string_error_context = error_context<Production, string_input<Encoding>>;
} // namespace lexy
The class lexy::string_input is an input that represents the string view defined by the constructors. CTAD can be used to deduce the encoding from the character type.
The input is a lightweight view and does not own any data. Use lexy::buffer if you want an owning version.
Pointer constructor
template <typename CharT>
constexpr string_input(const CharT* begin, const CharT* end) noexcept; (1)
template <typename CharT>
constexpr string_input(const CharT* data, std::size_t size) noexcept; (2)
1 | The input is the contiguous range [begin, end) . |
2 | The input is the contiguous range [data, data + size) . |
CharT must be the primary or secondary character type of the encoding.
View constructor
template <typename View>
constexpr explicit string_input(const View& view) noexcept;
The input is given by the View, which requires .data() and .size() members. The character type of the View must be the primary or secondary character type of the encoding.
Null-terminated string functions
template <typename Encoding, typename CharT>
constexpr auto zstring_input(const CharT* str) noexcept; (1)
template <typename CharT>
constexpr auto zstring_input(const CharT* str) noexcept; (2)
1 | Use the specified encoding. |
2 | Deduce the encoding from the character type. |
The input is given by the range [str, end), where end is a pointer to the first null character of the string. The return type is an appropriate lexy::string_input instantiation.
Example
Using the string input to parse content from a std::string.
std::string str = …;
auto input = lexy::string_input(str);
Using the string input to parse content from a string literal.
auto input = lexy::zstring_input(u"Hello World!");
Buffer Input
lexy/input/buffer.hpp
namespace lexy
{
template <typename Encoding = default_encoding,
typename MemoryResource = /* default resource */>
class buffer
{
public:
using encoding = Encoding;
using char_type = typename encoding::char_type;
class builder;
constexpr buffer() noexcept;
constexpr explicit buffer(MemoryResource* resource) noexcept;
template <typename CharT>
explicit buffer(const CharT* data, std::size_t size,
MemoryResource* resource = /* default resource */);
template <typename CharT>
explicit buffer(const CharT* begin, const CharT* end,
MemoryResource* resource = /* default resource */);
template <typename View>
explicit buffer(const View& view,
MemoryResource* resource = /* default resource */);
buffer(const buffer& other, MemoryResource* resource);
const char_type* begin() const noexcept;
const char_type* end() const noexcept;
const char_type* data() const noexcept;
bool empty() const noexcept;
std::size_t size() const noexcept;
std::size_t length() const noexcept;
Reader reader() const& noexcept;
};
template <typename Encoding, encoding_endianness Endianness>
constexpr auto make_buffer;
template <typename Encoding = default_encoding,
typename MemoryResource = /* default resource */>
using buffer_lexeme = lexeme_for<buffer<Encoding, MemoryResource>>;
template <typename Tag, typename Encoding = default_encoding,
typename MemoryResource = /* default resource */>
using buffer_error = error_for<buffer<Encoding, MemoryResource>, Tag>;
template <typename Production, typename Encoding = default_encoding,
typename MemoryResource = /* default resource */>
using buffer_error_context = error_context<Production, buffer<Encoding, MemoryResource>>;
}
The class lexy::buffer is an immutable, owning variant of lexy::string_input. The memory for the input is allocated using the MemoryResource, which is a class with the same interface as std::pmr::memory_resource. By default, it uses new and delete for the allocation, just like std::pmr::new_delete_resource. Construction of the buffer is just like lexy::string_input, except for the additional MemoryResource parameter. Once a memory resource has been specified, it will not propagate on assignment.
As the buffer owns the input, it can terminate it with the EOF character for encodings where the character type and integer type are the same. This eliminates the "is the reader at EOF?" branch during parsing.
Builder
class builder
{
public:
explicit builder(std::size_t size,
MemoryResource* resource = /* default resource */);
char_type* data() const noexcept;
std::size_t size() const noexcept;
buffer finish() && noexcept;
};
The builder class separates the allocation and copying of the buffer data. This allows, for example, writing into the immutable buffer from a file. The constructor allocates memory for size characters, then data() gives a mutable pointer to that memory.
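A minimal sketch of the builder workflow; read_into() is a hypothetical I/O function standing in for e.g. a read from a file:
// Allocate an immutable buffer of the required size up front.
lexy::buffer<lexy::utf8_encoding>::builder builder(size);
read_into(builder.data(), builder.size()); // hypothetical: fill the allocated memory
auto buffer = std::move(builder).finish(); // transfer ownership into the buffer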
Make buffer
struct /* unspecified */
{
auto operator()(const void* memory, std::size_t size) const;
template <typename MemoryResource>
auto operator()(const void* memory, std::size_t size, MemoryResource* resource) const;
};
template <typename Encoding, encoding_endianness Endianness>
constexpr auto make_buffer = /* unspecified */;
lexy::make_buffer is a function object that constructs a lexy::buffer of the specified encoding from raw memory. If necessary, it takes care of the endianness conversion as instructed by the lexy::encoding_endianness enumeration. Any BOM, if present, will not be part of the input.
Example
Using a buffer to parse content from a std::string using UTF-8. This enables the sentinel optimization.
std::string str = …;
auto input = lexy::buffer<lexy::utf8_encoding>(str);
Using a buffer to parse a memory-mapped file containing little endian UTF-16.
auto ptr = mmap(…);
constexpr auto make_utf16_little = lexy::make_buffer<lexy::utf16_encoding,
lexy::encoding_endianness::little>;
auto input = make_utf16_little(ptr, length);
File Input
lexy/input/file.hpp
namespace lexy
{
enum class file_error
{
os_error,
file_not_found,
permission_denied,
};
template <typename Encoding = default_encoding,
encoding_endianness Endian = encoding_endianness::bom,
typename MemoryResource>
auto read_file(const char* path,
MemoryResource* resource = /* default resource */)
-> result<buffer<Encoding, MemoryResource>, file_error>;
}
The function lexy::read_file() reads the file at the specified path using the specified encoding and endianness. On success, it returns a lexy::result containing a lexy::buffer with the file contents. On failure, it returns a lexy::result containing the error code.
Example
Reading UTF-16 from a file with a BOM.
auto result = lexy::read_file<lexy::utf16_encoding>("input.txt");
if (!result)
throw my_file_read_error_exception(result.error()); (1)
auto input = std::move(result).value(); (2)
1 | Throw an exception giving it the lexy::file_error . |
2 | Move the buffer out of the result and use it as input. |
Shell Input
lexy/input/shell.hpp
namespace lexy
{
template <typename Encoding = default_encoding>
struct default_prompt;
template <typename Prompt = default_prompt<>>
class shell
{
public:
using encoding = typename Prompt::encoding;
using char_type = typename encoding::char_type;
using prompt_type = Prompt;
shell();
explicit shell(Prompt prompt);
bool is_open() const noexcept;
Input prompt_for_input();
class writer;
template <typename... Args>
writer write_message(Args&&... args);
Prompt& get_prompt() noexcept;
const Prompt& get_prompt() const noexcept;
};
template <typename Prompt = default_prompt<>>
using shell_lexeme = /* unspecified */;
template <typename Tag, typename Prompt = default_prompt<>>
using shell_error = /* unspecified */;
template <typename Production, typename Prompt = default_prompt<>>
using shell_error_context = /* unspecified */;
}
The class lexy::shell creates an interactive shell that asks for user input and writes messages out. The exact behavior is controlled by the Prompt. By default, it uses lexy::default_prompt, which reads from stdin and writes to stdout.
The interface of a Prompt is currently experimental. Refer to lexy::default_prompt if you want to write your own.
State
bool is_open() const noexcept;
A shell is initially open and can receive input, but the user can close the shell. For lexy::default_prompt, the shell is closed if the user enters EOF, e.g. by pressing Ctrl+D under Linux. is_open() returns false if the user has closed it, and true otherwise.
Input
Input prompt_for_input();
A shell object is not itself an Input, but it can be used to create one. Calling prompt_for_input() asks the user to enter some input and then returns an unspecified Input type that refers to that input. If parsing reaches the end of the input and the shell is still open, it automatically asks the user for continuation input, which is appended to the current input. Once parsing of the input is done, prompt_for_input() can be called again to request new input from the user.
Calling prompt_for_input() again invalidates all memory used by the previous input.
The lexy::default_prompt asks for input by displaying `> ` and reading an entire line from stdin. If continuation input is requested, it displays `. ` and reads another line.
Output
class writer
{
public:
// non-copyable
template <typename CharT>
writer& operator()(const CharT* str, std::size_t length);
template <typename CharT>
writer& operator()(const CharT* str);
template <typename CharT>
writer& operator()(CharT c);
writer& operator()(lexy::lexeme_for</* input type */> lexeme);
};
template <typename... Args>
writer write_message(Args&&... args);
Calling write_message() prepares the prompt for displaying a message and returns a writer function object that can be used to specify the contents of the message. The arguments of write_message() are forwarded to the prompt and can be used to distinguish between e.g. normal and error messages. The writer can be invoked multiple times to give different parts of the message; the entire message is written out when the writer is destroyed. A writer can only write messages whose character type is the primary or secondary character type of the encoding.
Using lexy::default_prompt does not require any message arguments; it simply writes the message to stdout, appending a newline at the end.
Example
An interactive REPL.
lexy::shell<> shell;
while (shell.is_open())
{
auto input = shell.prompt_for_input(); (1)
auto result = lexy::parse<expression>(input, …); (2)
if (result)
shell.write_message()(result.value()); (3)
}
1 | Ask the user to enter more input. |
2 | Parse the input, requesting continuation input if necessary. |
3 | Write the result. |
For a full example, see examples/shell.cpp.
Command-line argument Input
lexy/input/argv_input.hpp
namespace lexy
{
class argv_sentinel;
class argv_iterator;
constexpr argv_iterator argv_begin(int argc, char* argv[]) noexcept;
constexpr argv_iterator argv_end(int argc, char* argv[]) noexcept;
template <typename Encoding = default_encoding>
class argv_input
{
public:
using encoding = Encoding;
using char_type = typename encoding::char_type;
using iterator = argv_iterator;
constexpr argv_input() = default;
constexpr argv_input(argv_iterator begin, argv_iterator end) noexcept;
constexpr argv_input(int argc, char* argv[]) noexcept;
constexpr Reader reader() const& noexcept;
};
template <typename Encoding = default_encoding>
using argv_lexeme = lexeme_for<argv_input<Encoding>>;
template <typename Tag, typename Encoding = default_encoding>
using argv_error = error_for<argv_input<Encoding>, Tag>;
template <typename Production, typename Encoding = default_encoding>
using argv_error_context = error_context<Production, argv_input<Encoding>>;
}
The class lexy::argv_input is an input that uses the command-line arguments passed to main(). It excludes argv[0], which is the executable name, and includes \0 as a separator between command-line arguments.
The input is a lightweight view and does not own any data.
Command-line iterators
class argv_sentinel;
class argv_iterator;
constexpr argv_iterator argv_begin(int argc, char* argv[]) noexcept;
constexpr argv_iterator argv_end(int argc, char* argv[]) noexcept;
The lexy::argv_iterator is a bidirectional iterator iterating over the command-line arguments, excluding the initial argument, which is the executable name. It can be created using argv_begin() and argv_end().
Example
Use the command line arguments as input.
int main(int argc, char* argv[])
{
auto input = lexy::argv_input(argc, argv);
…
}
If the program is invoked with ./a.out a 123 b, the input will be a\0123\0b.
Lexemes
A lexeme is the part of the input matched by a rule.
Lexeme
lexy/lexeme.hpp
namespace lexy
{
template <typename Reader>
class lexeme
{
public:
using encoding = typename Reader::encoding;
using char_type = typename encoding::char_type;
using iterator = typename Reader::iterator;
constexpr lexeme() noexcept;
constexpr lexeme(iterator begin, iterator end) noexcept;
constexpr explicit lexeme(const Reader& reader, iterator begin) noexcept
: lexeme(begin, reader.cur())
{}
constexpr bool empty() const noexcept;
constexpr iterator begin() const noexcept;
constexpr iterator end() const noexcept;
// Only if the iterator is a pointer.
constexpr const char_type* data() const noexcept;
// Only if the iterator has `operator-`.
constexpr std::size_t size() const noexcept;
// Only if the iterator has `operator[]`.
constexpr char_type operator[](std::size_t idx) const noexcept;
};
template <typename Input>
using lexeme_for = lexeme<input_reader<Input>>;
}
The class lexy::lexeme represents a sub-range of the input. For convenience, most inputs also provide typedefs that can be used instead of lexy::lexeme_for.
Code point
lexy/encoding.hpp
namespace lexy
{
class code_point
{
public:
constexpr code_point() noexcept;
constexpr explicit code_point(char32_t value) noexcept;
constexpr char32_t value() const noexcept;
constexpr bool is_valid() const noexcept;
constexpr bool is_surrogate() const noexcept;
constexpr bool is_scalar() const noexcept;
constexpr bool is_ascii() const noexcept;
constexpr bool is_bmp() const noexcept;
friend constexpr bool operator==(code_point lhs, code_point rhs) noexcept;
friend constexpr bool operator!=(code_point lhs, code_point rhs) noexcept;
};
}
The class lexy::code_point represents a single code point from the input. It is merely a wrapper over a char32_t that contains the numerical code.
Constructors
constexpr code_point() noexcept; (1)
constexpr explicit code_point(char32_t value) noexcept; (2)
1 | Creates an invalid code point. |
2 | Creates the specified code point. The value will be returned from value() unchanged. |
Validity
constexpr bool is_valid() const noexcept; (1)
constexpr bool is_surrogate() const noexcept; (2)
constexpr bool is_scalar() const noexcept; (3)
1 | Returns true if the code point is at most 0x10'FFFF, false otherwise. |
2 | Returns true if the code point is a UTF-16 surrogate, false otherwise. |
3 | Returns true if the code point is valid and not a surrogate, false otherwise. |
Category
constexpr bool is_ascii() const noexcept; (1)
constexpr bool is_bmp() const noexcept; (2)
1 | Returns true if the code point is ASCII (7-bit value), false otherwise. |
2 | Returns true if the code point is in the Unicode BMP (16-bit value), false otherwise. |
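For illustration, a sketch of the classification functions:
constexpr lexy::code_point cp(0x00E4);          // U+00E4, 'ä'
static_assert(cp.is_valid() && cp.is_scalar()); // a valid scalar value
static_assert(cp.is_bmp() && !cp.is_ascii());   // in the BMP, but not ASCII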
Writing custom Inputs
Input
class Input // concept
{
public:
Reader reader() const&;
};
An Input is just a class with a reader() member function that returns a Reader positioned at the beginning of the input. The type alias lexy::input_reader<Input> returns the type of the corresponding reader.
The interface of a Reader is currently experimental. Refer to the comments in lexy/input/base.hpp.
Matching, parsing and validating
Production
struct Production // concept
{
static constexpr auto rule = …;
static constexpr auto whitespace = …; // optional
static constexpr auto value = …; // optional
};
A Production is a type containing a rule and optional callbacks that produce the value. A grammar contains an entry production, where parsing begins, and all productions referenced by it.
It is recommended to put all productions of a grammar into a separate namespace.
By passing the entry production of the grammar to lexy::match(), lexy::parse(), or lexy::validate(), the production is parsed.
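Example
A minimal production; the grammar here is hypothetical (LEXY_LIT is documented under the rule DSL):
namespace grammar
{
    struct greeting
    {
        // Match the literal "hello"; LEXY_LIT avoids the C++20 extended NTTP requirement.
        static constexpr auto rule = LEXY_LIT("hello");
    };
}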
Matching
lexy/match.hpp
namespace lexy
{
template <typename Production, typename Input>
constexpr bool match(const Input& input);
}
The function lexy::match() matches the Production on the given input. If the production accepts the input, it returns true; otherwise, it returns false. It discards any values produced and does not give detailed information about why the production did not accept the input.
A production does not necessarily need to consume the entire input for it to match. Add lexy::dsl::eof to the end if the production should consume the entire input.
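Example
A sketch using the hypothetical grammar::greeting production from above:
auto input = lexy::zstring_input("hello");
bool ok    = lexy::match<grammar::greeting>(input); // true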
Validating
lexy/validate.hpp
namespace lexy
{
template <typename Production, typename Input, typename Callback>
constexpr auto validate(const Input& input, Callback callback)
-> result</* see below */>;
}
The function lexy::validate() validates that the Production matches on the given input. The return value is a lexy::result<void, E>, where E is the return type of the callback. If the production accepts the input, it returns a result containing the value (which is void); otherwise, it invokes the callback with the error information (see Error handling) and returns its result. It discards any values produced.
A production does not necessarily need to consume the entire input for it to match. Add lexy::dsl::eof to the end if the production should consume the entire input.
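Example
A sketch that discards the error details by using lexy::noop (documented below), so the result behaves like a bool:
auto input  = lexy::zstring_input("hello");
auto result = lexy::validate<grammar::greeting>(input, lexy::noop); // lexy::result<void, void>
if (!result)
    report_failure(); // hypothetical error handler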
Parsing
lexy/parse.hpp
namespace lexy
{
template <typename Production, typename Input, typename Callback>
constexpr auto parse(const Input& input, Callback callback)
-> result</* see below */>;
template <typename Production, typename Input, typename State, typename Callback>
constexpr auto parse(const Input& input, State&& state, Callback callback)
-> result</* see below */>;
}
The function lexy::parse() parses the Production on the given input. The return value is a lexy::result<T, E>, where T is the return type of the Production::value or Production::list callback, and E is the return type of the callback. If the production accepts the input, it invokes Production::value (see below) with the produced values and returns their result. Otherwise, it invokes callback with the error information (see Error handling) and returns its result.
The return value on success is determined using Production::value, depending on three cases:
- Production::rule does not contain a list. Then all arguments are forwarded to Production::value as a callback, whose result is returned. The Production::value callback must be present.
- Production::rule contains a list and no other rule produces a value. Then Production::value is used as the sink for the list values. If Production::value is also a callback that accepts the result of the sink as argument, it is invoked with the sink result and the processed result is returned. Otherwise, the result of the sink is the final result.
- Production::rule contains a list and other rules produce values as well. Then Production::value is used as the sink for the list values. The sink result is added to the other values in order and everything is forwarded to Production::value as a callback. The callback result is then returned.
The callback operator>> is useful for case 3 to create a combined callback and sink with the desired behavior.
The second overload of lexy::parse() allows passing an arbitrary state argument. This is made available to the lexy::dsl::parse_state and lexy::dsl::parse_state_member rules, which can forward it to the Production::value callback.
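Example
A sketch of case 1 with a hypothetical production whose rule produces no values, so the callback is invoked without arguments:
struct greeting_prod
{
    static constexpr auto rule  = LEXY_LIT("hello");
    static constexpr auto value = lexy::callback<int>([] { return 42; });
};
auto result = lexy::parse<greeting_prod>(lexy::zstring_input("hello"), lexy::noop);
if (result)
    assert(result.value() == 42); // on success, the result holds the callback's return value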
Result
lexy/result.hpp
namespace lexy
{
struct result_empty_t {};
constexpr auto result_empty = result_empty_t{};
struct result_value_t {};
constexpr auto result_value = result_value_t{};
struct result_error_t {};
constexpr auto result_error = result_error_t{};
template <typename T, typename E>
class result
{
public:
using value_type = /* see below */;
using error_type = /* see below */;
constexpr result(result_empty_t);
template <typename... Args>
constexpr result(result_value_t, Args&&... args);
template <typename... Args>
constexpr result(result_error_t, Args&&... args);
template <typename U>
constexpr explicit result(const result<U, E>& other);
template <typename U>
constexpr explicit result(result<U, E>&& other);
template <typename Arg>
constexpr explicit result(Arg&& arg);
constexpr explicit operator bool() const noexcept;
constexpr bool is_empty() const noexcept;
constexpr bool has_value() const noexcept;
constexpr bool has_error() const noexcept;
static constexpr bool has_void_value() noexcept;
static constexpr bool has_void_error() noexcept;
constexpr value_type& value() & noexcept;
constexpr const value_type& value() const& noexcept;
constexpr value_type&& value() && noexcept;
constexpr const value_type&& value() const&& noexcept;
constexpr error_type& error() & noexcept;
constexpr const error_type& error() const& noexcept;
constexpr error_type&& error() && noexcept;
constexpr const error_type&& error() const&& noexcept;
};
}
The class lexy::result<T, E> stores either a value T or an error E (or nothing) and is used to return the result of parsing. T and E can be void; in that case, it is internally translated to the tag types result_value_t or result_error_t, respectively, which is reflected in the value_type and error_type typedefs as well.
lexy::result<T, void> is like std::optional<T>, and lexy::result<void, void> is like bool.
Once a result is created containing a value or error, it can never change that state.
lexy::result was created for use by the library only. While it can be used as a general-purpose result monad (which we leverage for lexy::read_file()), it is better to use a designated library for that.
Every lexy::result object returned by the library is never empty. The empty state is only used internally and not exposed to the user (unless, of course, the user explicitly creates an empty result).
Creation
constexpr result(result_empty_t); (1)
template <typename... Args>
constexpr result(result_value_t, Args&&... args); (2)
template <typename... Args>
constexpr result(result_error_t, Args&&... args); (3)
1 | Creates a result that is empty. |
2 | Creates a result containing the value constructed by forwarding the arguments. |
3 | Creates a result containing the error constructed by forwarding the arguments. |
Conversion
template <typename U>
constexpr explicit result(const result<U, E>& other); (1)
template <typename U>
constexpr explicit result(result<U, E>&& other); (2)
template <typename Arg>
constexpr explicit result(Arg&& arg); (3)
1 | Converts an errored result<U, E> to a result<T, E> by copying the error. |
2 | Converts an errored result<U, E> to a result<T, E> by moving the error. |
3 | Only available for result<T, void> or result<void, E> . Constructs the value/error by forwarding the argument. |
State
constexpr explicit operator bool() const noexcept; (1)
constexpr bool is_empty() const noexcept; (2)
constexpr bool has_value() const noexcept; (3)
constexpr bool has_error() const noexcept; (4)
static constexpr bool has_void_value() noexcept; (5)
static constexpr bool has_void_error() noexcept; (6)
1 | Returns true if it contains a value, false otherwise. |
2 | Returns true if it is empty (contains neither value nor error), false otherwise. |
3 | Returns true if it contains a value, false otherwise. |
4 | Returns true if it contains an error, false otherwise. |
5 | Returns true if T == void , false otherwise. |
6 | Returns true if E == void , false otherwise. |
Access
constexpr value_type& value() & noexcept;
constexpr const value_type& value() const& noexcept;
constexpr value_type&& value() && noexcept;
constexpr const value_type&& value() const&& noexcept;
constexpr error_type& error() & noexcept;
constexpr const error_type& error() const& noexcept;
constexpr error_type&& error() && noexcept;
constexpr const error_type&& error() const&& noexcept;
Returns the stored value or error, respectively.
Callbacks
Callback
struct Callback // concept
{
using return_type = …;
return_type operator()(Args&&... args) const;
};
struct Sink
{
class _sink // exposition only
{
public:
using return_type = …;
void operator()(Args&&... args);
return_type&& finish() &&;
};
_sink sink() const;
};
A Callback is a function object whose return type is specified by a member typedef. A Sink is a type with a sink() member function that returns a callback. That callback can be invoked multiple times, and the final value is returned by calling .finish().
Callbacks are used by lexy to compute the parse result and handle error values. They can either be written manually, implementing the above concepts, or composed from the pre-defined building blocks.
Callback adapters
lexy/callback.hpp
namespace lexy
{
template <typename ReturnType = void, typename... Fns>
constexpr Callback callback(Fns&&... fns);
}
Creates a callback with the given ReturnType from multiple functions. When the resulting callback is invoked, overload resolution determines the correct function to call. It supports function pointers, lambdas, and member function or data pointers.
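For example, a sketch of a callback whose functions are selected by overload resolution:
constexpr auto to_length = lexy::callback<std::size_t>(
    [](const std::string& str) { return str.size(); },
    [](char) -> std::size_t { return 1; });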
lexy/callback.hpp
namespace lexy
{
template <typename T, typename... Fns>
constexpr Sink sink(Fns&&... fns);
}
Creates a sink constructing the given T using the given functions. The sink value-constructs the T and then calls one of the functions selected by overload resolution, passing it a reference to the resulting object as the first argument. It supports function pointers, lambdas, and member function or data pointers.
Example
Creating a sink that will add all values.
constexpr auto adder = lexy::sink<int>([](int& cur, int arg) { cur += arg; }); (1)
auto s = adder.sink(); (2)
s(1);
s(2);
s(3);
auto result = std::move(s).finish();
assert(result == 1 + 2 + 3);
1 | Define the sink. |
2 | Use it. |
Callback composition
lexy/callback.hpp
namespace lexy
{
template <typename First, typename Second>
constexpr auto operator|(First first, Second second); (1)
template <typename Sink, typename Callback>
constexpr auto operator>>(Sink sink, Callback callback); (2)
}
1 | The result of first | second, where first and second are both callbacks, is another callback that first invokes first and then passes the result to second. The result cannot be used as a sink. |
2 | The result of sink >> callback is both a sink and a callback. As a sink, it behaves just like sink. As a callback, it takes the result of the sink as well as any other arguments and forwards them to callback. |
Example
Build a string, then get its length.
constexpr auto make_string = lexy::callback<std::string>([](const char* str) { return str; });
constexpr auto string_length = lexy::callback<std::size_t>(&std::string::size);
constexpr auto inefficient_strlen = make_string | string_length; (1)
assert(inefficient_strlen("1234") == 4); (2)
1 | Compose the two callbacks. |
2 | Use it. |
The callback operator>> is used for productions whose rule contains both a list and produces other values. The list is constructed using the sink, and then everything is passed to callback.
The no-op callback
lexy/callback.hpp
namespace lexy
{
constexpr auto noop = /* unspecified */;
}
lexy::noop is both a callback and a sink. It ignores all arguments passed to it, and its return type is void.
Example
Parse the production, but do nothing on errors.
auto result = lexy::parse<my_production>(my_input, lexy::noop); (1)
if (!result)
throw my_parse_error(); (2)
auto value = result.value(); (3)
1 | Parse my_production . If an error occurs, just return a result<T, void> in the error state. |
2 | lexy::noop does not make errors disappear, they still need to be handled. |
3 | Do something with the parsed value. |
Constructing objects
lexy/callback.hpp
namespace lexy
{
template <typename T>
constexpr auto forward = /* unspecified */;
template <typename T>
constexpr auto construct = /* unspecified */;
template <typename T, typename PtrT = T*>
constexpr auto new_ = /* unspecified */;
}
The callback lexy::forward<T> can accept either a const T& or a T&& and forwards it. It does not have a sink.
The callback lexy::construct<T> constructs a T by forwarding all arguments to a suitable constructor. If the type does not have a constructor, it forwards all arguments using brace initialization. It does not have a sink.
The callback lexy::new_<T, PtrT> works just like lexy::construct<T>, but it constructs the object on the heap by calling new. The resulting pointer is then converted to the specified PtrT. It does not have a sink.
Example
A callback that creates a std::unique_ptr<std::string>
.
constexpr auto make_unique_str = lexy::new_<std::string, std::unique_ptr<std::string>>; (1)
constexpr auto make_unique_str2 = lexy::new_<std::string> | lexy::construct<std::unique_ptr<std::string>>; (2)
1 | Specify a suitable PtrT . |
2 | Equivalent version that uses composition and lexy::construct instead. |
Constructing lists
lexy/callback.hpp
namespace lexy
{
template <typename T>
constexpr auto as_list = /* unspecified */;
template <typename T>
constexpr auto as_collection = /* unspecified */;
}
lexy::as_list<T> is both a callback and a sink. As a callback, it forwards all arguments to the std::initializer_list constructor of T and returns the result. As a sink, it first default-constructs a T and then repeatedly calls push_back() for single arguments and emplace_back() otherwise.
lexy::as_collection<T> is like lexy::as_list<T>, but instead of calling push_back() and emplace_back(), it calls insert() and emplace().
Example
Create a std::vector<int>
and std::set<int>
.
constexpr auto as_int_vector = lexy::as_list<std::vector<int>>;
constexpr auto as_int_set = lexy::as_collection<std::set<int>>;
Constructing strings
lexy/callback.hpp
namespace lexy
{
template <typename String, typename Encoding = /* see below */>
constexpr auto as_string = /* unspecified */;
}
lexy::as_string<String, Encoding> is both a callback and a sink. It constructs a String object in the given Encoding. If no encoding is specified, it is deduced from the character type of the string.
As a callback, it constructs the string directly from the given argument. It accepts:
- A reference to an existing String object, which is forwarded as the result.
- A const CharT* and a std::size_t, where CharT is a compatible character type. The two arguments are forwarded to a String constructor.
- A lexy::lexeme<Reader> lex, where Reader::iterator is a pointer. The character type of the reader must be compatible with the encoding. It constructs the string using String(lex.data(), lex.size()) (potentially casting the pointer type if necessary).
- A lexy::lexeme<Reader> lex, where Reader::iterator is not a pointer. It constructs the string using String(lex.begin(), lex.end()). The range constructor has to take care of any necessary character conversion.
- A lexy::code_point. It is encoded into a local character array according to the specified Encoding. Then the string is constructed using a two-argument (const CharT*, std::size_t) constructor.
As a sink, it first default-constructs the string and then repeatedly appends the following arguments:
- A single CharT, which is convertible to the string's character type. It is appended by calling .push_back().
- A reference to an existing String object, which is appended by calling .append().
- A const CharT* and a std::size_t, where CharT is a compatible character type. The two arguments are forwarded to .append().
- A lexy::lexeme<Reader> lex, where Reader::iterator is a pointer. The character type of the reader must be compatible with the encoding. It is appended using .append(lex.data(), lex.size()) (potentially casting the pointer type if necessary).
- A lexy::lexeme<Reader> lex, where Reader::iterator is not a pointer. It is appended using .append(lex.begin(), lex.end()). The range append function has to take care of any necessary character conversion.
- A lexy::code_point. It is encoded into a local character array according to the specified Encoding. Then it is appended to the string using a two-argument .append(const CharT*, std::size_t) overload.
Example
constexpr auto as_utf16_string = lexy::as_string<std::u16string>; (1)
constexpr auto as_utf8_string = lexy::as_string<std::string, lexy::utf8_encoding>; (2)
1 | Constructs a std::u16string , deducing the encoding as UTF-16. |
2 | Constructs a std::string , specifying the encoding as UTF-8. |
Rule-specific callbacks
lexy/callback.hpp
namespace lexy
{
template <typename T>
constexpr auto as_aggregate = /* unspecified */;
template <typename T>
constexpr auto as_integer = /* unspecified */;
}
The callback and sink lexy::as_aggregate<T> is only used together with the lexy::dsl::member rule and is documented there.
The callback lexy::as_integer<T> constructs an integer type T and has two overloads:
template <typename Integer>
T operator()(const Integer& value) const; (1)
template <typename Integer>
T operator()(int sign, const Integer& value) const; (2)
1 | Returns T(value) . |
2 | Returns T(sign * value) . |
The second overload is meant to be used together with lexy::dsl::sign and related rules.
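For illustration, a sketch of the two overloads (assuming the callback is usable in constant expressions):
constexpr auto as_int = lexy::as_integer<int>;
static_assert(as_int(42) == 42);      // overload (1)
static_assert(as_int(-1, 42) == -42); // overload (2): T(sign * value)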
Error handling
Parsing errors are reported by constructing a lexy::error object and passing it to the error callback of lexy::parse and lexy::validate together with the lexy::error_context. As such, an error callback looks like this:
class ErrorCallback
{
public:
using return_type = /* … */;
template <typename Production, typename Input, typename Tag>
return_type operator()(const lexy::error_context<Production, Input>& context,
const lexy::error<lexy::input_reader<Input>, Tag>& error) const;
};
Of course, overloading can be used to differentiate between various error types and contexts.
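Example
A minimal sketch of such an error callback; it ignores the error details and returns a placeholder error code:
struct ReportError
{
    using return_type = int;
    template <typename Production, typename Input, typename Tag>
    int operator()(const lexy::error_context<Production, Input>&,
                   const lexy::error<lexy::input_reader<Input>, Tag>&) const
    {
        return -1; // placeholder; a real callback would inspect both arguments
    }
};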
Error types
lexy/error.hpp
namespace lexy
{
template <typename Reader, typename Tag>
class error;
struct expected_literal {};
template <typename Reader>
class error<Reader, expected_literal>;
struct expected_char_class {};
template <typename Reader>
class error<Reader, expected_char_class>;
template <typename Input, typename Tag>
using error_for = error<input_reader<Input>, Tag>;
template <typename Reader, typename Tag, typename ... Args>
constexpr auto make_error(Args&&... args);
}
All errors are represented by instantiations of lexy::error<Reader, Tag>. The Tag is an empty type that specifies the kind of error. There are specializations for two tags to store additional information.
The function lexy::make_error constructs an error object, given the reader and tag, by forwarding all the arguments.
Generic error
template <typename Reader, typename Tag>
class error
{
using iterator = typename Reader::iterator;
public:
constexpr explicit error(iterator pos) noexcept;
constexpr explicit error(iterator begin, iterator end) noexcept;
constexpr iterator position() const noexcept;
constexpr iterator begin() const noexcept;
constexpr iterator end() const noexcept;
constexpr /* see below */ message() const noexcept;
};
The primary class template lexy::error<Reader, Tag> represents a generic error without additional metadata. It can either be constructed giving it a single position, in which case position() == begin() == end(); or a range of the input, in which case position() == begin() <= end().
The message() is determined using the Tag. By default, it returns the type name of Tag after removing the top-level namespace name. This can be overridden by defining either Tag::name() or Tag::name. The result is an unspecified type similar to std::string_view.
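For example, a sketch of a custom tag that overrides the default message:
struct forbidden_keyword
{
    static constexpr auto name = "forbidden keyword"; // returned by message()
};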
Expected literal error
struct expected_literal
{};
template <typename Reader>
class error<Reader, expected_literal>
{
using iterator = typename Reader::iterator;
using string_view = /* see below */;
public:
constexpr explicit error(iterator position,
string_view string, std::size_t index) noexcept;
constexpr iterator position() const noexcept;
constexpr string_view string() const noexcept;
constexpr string_view::char_type character() const noexcept;
constexpr std::size_t index() const noexcept;
};
A specialization of lexy::error is provided if Tag == lexy::expected_literal. It represents the error where a literal string was expected but could not be matched. It is mainly raised by the lexy::dsl::lit rule.
The error happens at a given position() and with a given string(). The index() is the index into the string where matching failed; e.g. 0 if the input starts with a different character, 2 if the first two characters matched, etc. The character() is the string character at that index.
The unspecified string_view type is like std::string_view. Its character type must match the encoding of the Reader.
Character class error
struct expected_char_class
{};
template <typename Reader>
class error<Reader, expected_char_class>
{
using iterator = typename Reader::iterator;
public:
constexpr explicit error(iterator position, const char* name) noexcept;
constexpr iterator position() const noexcept;
constexpr /* see below */ name() const noexcept;
};
A specialization of lexy::error is provided if Tag == lexy::expected_char_class. It represents the error where any character from a given set of characters was expected but could not be matched. It is raised by the lexy::dsl::ascii::* rules or lexy::dsl::newline, among others.
The error happens at the given position(), and a symbolic name of the character class is returned by name(). The return type of name() is an unspecified type similar to std::string_view. By convention, the name format used is <group>.<name> or <name>, where both <group> and <name> consist of characters. Examples include newline, ASCII.alnum, and digit.decimal.
Error context
lexy/error.hpp
namespace lexy
{
template <typename Production, typename Input>
class error_context
{
using iterator = typename input_reader<Input>::iterator;
public:
constexpr explicit error_context(const Input& input, iterator pos) noexcept;
constexpr const Input& input() const noexcept;
static consteval /* see below */ production();
constexpr iterator position() const noexcept;
};
}
The class lexy::error_context<Production, Input> contains information about the context where the error occurred.
The entire input containing the error is returned by input(). The Production whose rule has raised the error is specified as a template parameter, and its name is returned by production(). Like lexy::error<Reader, Tag>::message(), it returns the name of the type without the top-level namespace name. This can be overridden by defining Production::name() or Production::name. The result is an unspecified type similar to std::string_view.
The position() of the error context is the input position where the production started parsing.
Error location
lexy/error_location.hpp
namespace lexy
{
template <typename Reader>
struct error_location
{
std::size_t line, column;
lexeme<Reader> context;
};
template <typename Input>
using error_location_for = error_location<input_reader<Input>>;
template <typename Input, typename TokenCP, typename TokenNL>
constexpr auto make_error_location(const Input& input,
typename input_reader<Input>::iterator pos,
TokenCP code_point_token,
TokenNL newline_token)
-> error_location_for<Input>;
}
The header lexy/error_location.hpp provides the utility function lexy::make_error_location(), which converts an error position, always given via an iterator, into the traditional line/column format.
The function takes the position into the input as well as two tokens. It then determines line and column by repeatedly parsing the two tokens until the error position is reached. Every time the code_point_token matches, the column is increased by one. Every time the newline_token matches, the column is reset to one and the line is increased by one. If neither token matches, the column is increased by one and the next code unit is skipped. The final line and column numbers are returned, together with the context, which is a lexeme containing the entire line where the error occurred.
For ASCII-encoded texts, the code_point_token is lexy::dsl::ascii::character and the newline_token is lexy::dsl::newline. For Unicode-encoded texts, the code_point_token is lexy::dsl::code_point and the newline_token is lexy::dsl::newline.
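Example
A sketch of converting an error position for a Unicode-encoded input; input and pos are assumed to come from the error callback:
auto location = lexy::make_error_location(input, pos,
                                          lexy::dsl::code_point, lexy::dsl::newline);
// location.line and location.column are one-based; location.context is the entire line.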
The rule DSL
The rule of a production is specified using a DSL built on top of C++ operator overloading. Everything in the DSL is defined in the namespace lexy::dsl, and every header is available under lexy/dsl/. The umbrella header lexy/dsl.hpp includes all DSL headers.
A Rule is an object that defines a specific set of input to be parsed. It first tries to match a set of characters from the input by comparing the character at the current reader position to the set of expected characters, temporarily advancing the reader further if necessary. If the matching was successful, a subset of the matched characters is consumed by advancing the reader permanently. The rule can then produce zero or more values, which are eventually forwarded to the value callback of its production. If the matching was not successful, an error is produced instead. A failed rule does not consume any characters.
A Branch is a rule that has an associated condition. The parsing algorithm can efficiently check whether the condition would match at the current reader position. As such, branches are used whenever the algorithm needs to decide between multiple alternatives. Once the branch condition matches, the branch is taken without any additional backtracking.
A Token is a special Rule that is an atomic element of the input. As a rule, it does not produce any value. Every Token is also a Branch that uses itself as the condition.
Whitespace
By default, lexy does not treat whitespace in any special way. You need to instruct it to do so, using either manual or automatic whitespace skipping.
Manual whitespace skipping is done using lexy::dsl::whitespace(rule). It skips zero or more whitespace characters, as defined by rule. Insert it everywhere you want to skip over whitespace. See examples/email.cpp or examples/xml.cpp for an example of manual whitespace skipping.
Automatic whitespace skipping is done by adding a static constexpr auto whitespace member to the root production, i.e. the production passed to one of the parse functions. This member is initialized to a rule that defines a single whitespace character. lexy will then skip zero or more occurrences of ::whitespace after every token of the entire grammar. To temporarily disable whitespace skipping for a production, inherit the production from lexy::token_production. Then whitespace is not skipped for the rule of that production, nor for any production reached from that rule. Likewise, lexy::dsl::no_whitespace() can be used to disable it for a single rule. See examples/tutorial.cpp or examples/json.cpp for an example of automatic whitespace skipping.
"Whitespace" can mean literal whitespace characters, but also comments (or whatever you want it to mean). |
lexy::dsl::whitespace
(explicit)
lexy/dsl/whitespace.hpp
whitespace(rule) : Rule

whitespace(rule_a) | rule_b = whitespace(rule_a | rule_b)
whitespace(rule_a) / rule_b = whitespace(rule_a / rule_b)
The explicit whitespace rule matches rule zero or more times and treats the result as whitespace. This happens regardless of the state of automatic whitespace skipping. If the whitespace rule is used inside a choice or alternative, the entire choice/alternative is treated as whitespace instead.
Requires | rule is a branch. |
Matches | While the branch condition of rule matches, matches and consumes rule. |
Values | None. |
Errors | All errors raised by rule. |
lexy::dsl::whitespace
(implicit)
lexy/dsl/whitespace.hpp
whitespace : Rule = whitespace(automatic_whitespace_rule)
The implicit whitespace rule is equivalent to the explicit whitespace rule with the current whitespace rule; i.e. it matches the current whitespace rule zero or more times. The current whitespace rule is determined as follows:
- If automatic whitespace skipping is disabled, there is no current whitespace rule; lexy::dsl::whitespace does nothing.
- If the current production inherits from lexy::token_production, there is no current whitespace rule; lexy::dsl::whitespace does nothing.
- Otherwise, if the current production defines a static constexpr auto whitespace member, its value is the current whitespace rule.
- Otherwise, if the root production defines a static constexpr auto whitespace member, its value is the current whitespace rule.
Here, the root production is defined as follows:
- If the current production is a token production, the root production is the current production.
- Otherwise, if the current production is the production that was originally passed to the top-level parse function (e.g. lexy::parse()), the root production is the current production.
- Otherwise, the root production is taken from the production that parsed the lexy::dsl::p or lexy::dsl::recurse rule to start parsing the current production.
This rule is automatically parsed after every token, after a production that inherits from lexy::token_production, and after a lexy::dsl::no_whitespace() rule.
Example
struct token_p : lexy::token_production
{
struct child
{
static constexpr auto rule = dsl::whitespace; (4)
};
static constexpr auto rule = dsl::whitespace + dsl::p<child>; (3)
};
struct normal_prod
{
static constexpr auto rule = dsl::whitespace + dsl::p<token_p>; (2)
};
struct root_prod
{
static constexpr auto whitespace = dsl::ascii::space;
static constexpr auto rule = dsl::whitespace + dsl::p<normal_prod>; (1)
};
…
auto result = lexy::parse<root_prod>(…);
1 | Here, the automatic whitespace rule is dsl::ascii::space, as the current production has a whitespace member. |
2 | Here, the automatic whitespace rule is also dsl::ascii::space. The current production doesn't have a whitespace member, but its root production (root_prod) does. |
3 | Here, the current production is a token production, so there is no automatic whitespace. The root production is reset to token_p. |
4 | Here, the root production is token_p, as that is the root of the parent. As such, there is no automatic whitespace. |
lexy::dsl::no_whitespace()
lexy/dsl/whitespace.hpp
no_whitespace(rule) : Rule
no_whitespace(branch) : Branch
The no_whitespace rule parses the given rule but disables automatic whitespace skipping while doing so. It is a branch if given a branch.
Branch Condition | Whatever branch uses as its branch condition. |
Matches | Matches and consumes rule (or branch), without skipping whitespace after its tokens. |
Values | All values produced by rule (or branch). |
Errors | All errors raised by rule (or branch). |
When rule contains a lexy::dsl::p or lexy::dsl::recurse rule, whitespace skipping is re-enabled while that production is parsed.
Primitive Tokens
All tokens, not just the tokens defined here, do implicit whitespace skipping. As such, a token t is really equivalent to t + dsl::whitespace. This has no effect unless a whitespace rule has been specified.
lexy::dsl::any
lexy/dsl/any.hpp
any : Token
The any token matches anything, i.e. all the remaining input.
Matches | All the remaining input. |
Error | n/a (it never fails) |
any is useful in combination with partial inputs such as the minus rule or switch_.
lexy::dsl::lit
lexy/dsl/literal.hpp
lit_c<C> : Token
lit<Str> : Token
LEXY_LIT(Str) : Token
The literal tokens match the specified sequence of characters.
Requires | The character type of the literal is the primary or secondary character type of the encoding. |
Matches | The specified character or string of characters, which are consumed. |
Error | lexy::expected_literal |
lit<Str> requires C++20 support for extended NTTPs. Use the LEXY_LIT(Str) macro if your compiler does not support them.
lexy/dsl/punctuator.hpp
period : Token = lit<".">
comma : Token = lit<",">
colon : Token = lit<":">
semicolon : Token = lit<";">
hyphen : Token = lit<"-">
slash : Token = lit<"/">
backslash : Token = lit<"\\">
apostrophe : Token = lit<"'">
hash_sign : Token = lit<"#">
dollar_sign : Token = lit<"$">
at_sign : Token = lit<"@">
The header lexy/dsl/punctuator.hpp defines common punctuator literals. They are equivalent to a literal matching the specified character.
Character classes
lexy::dsl::eof
lexy/dsl/eof.hpp
eof : Token
The eof token matches EOF.
Matches | Only if the reader is at the end of the input. It does not consume anything (it can't). |
Error | A lexy::expected_char_class error with the name "EOF". |
lexy::dsl::newline
lexy/dsl/newline.hpp
newline : Token
The newline token matches a newline.
Matches | "\r\n" or "\n", which is consumed. |
Error | A lexy::expected_char_class error with the name "newline". |
lexy::dsl::eol
lexy/dsl/newline.hpp
eol : Token
The eol token matches an end-of-line (EOL).
Matches | "\r\n", "\n", or EOF. |
Error | A lexy::expected_char_class error with the name "EOL". |
lexy::dsl::ascii::*
lexy/dsl/ascii.hpp
namespace ascii
{
    control     : Token // 0x00-0x1F, 0x7F
    blank       : Token // ' ' (space character) or '\t'
    newline     : Token // '\n' or '\r'
    other_space : Token // '\f' or '\v'
    space       : Token // blank, newline, or other_space
    lower       : Token // a-z
    upper       : Token // A-Z
    alpha       : Token // lower or upper
    digit       : Token // 0-9
    alnum       : Token // digit or alpha
    punct       : Token // one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    graph       : Token // alnum or punct
    print       : Token // graph or ' ' (space character)
    character   : Token // 0x00-0x7F
}
All tokens defined in lexy::dsl::ascii match one of the categories of ASCII characters.
Matches | Matches and consumes one of the set of ASCII characters indicated in the comments. |
Errors | A lexy::expected_char_class error with the name of the character class (e.g. "ASCII.alnum"). |
Every ASCII character except for the space character is in exactly one of control, lower, upper, digit, or punct.
lexy::dsl::code_point
lexy/dsl/code_point.hpp
code_point : Token
code_point.capture() : Rule
The code_point token matches and consumes a well-formed Unicode code point according to the encoding of the input. If code_point.capture() is used, the consumed code point is produced as a value.
Requires | The encoding of the input is ASCII, UTF-8, UTF-16, or UTF-32. |
Matches | Matches and consumes all code units of the next code point. For ASCII and UTF-32 this is only one, but for UTF-8 and UTF-16 it can be multiple code units. If the code point is too big or a UTF-16 surrogate, it fails. For UTF-8, it also fails for overlong sequences. |
Value | If code_point.capture() is used, the lexy::code_point that was consumed. |
Errors | If it could not match a valid code point, it fails with a lexy::expected_char_class error. |
Example
// Match and capture one arbitrary code point.
dsl::code_point.capture()
If you want to match a specific code point, use a literal rule instead. This rule is useful for matching things like string literals that can contain arbitrary code points.
lexy::dsl::operator-
lexy/dsl/minus.hpp
token - except : Token
The minus rule matches the given token, but only if except
does not match on the input the rule has consumed.
Requires |
|
Matches |
Matches and consumes whatever |
Errors |
Whatever errors are raised if |
Use a minus rule to exclude characters from a character class; e.g. lexy::dsl::code_point - lexy::dsl::ascii::control matches all code points except control characters.
|
Minus rules can be chained. This is equivalent to specifying an alternative for except .
|
except has to match everything the rule has consumed before; partial matches don’t count.
Use token - (except + lexy::dsl::any) if you want to allow a partial match.
|
lexy::dsl::token
lexy/dsl/token.hpp
token(rule) : Token
The token
rule turns an arbitrary rule into a token by parsing it and discarding all values it has produced.
Matches |
Whatever |
Error |
A generic error with tag |
While token() is optimized to avoid the overhead of constructing values that are later discarded,
it should still only be used when required.
|
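For illustration, a sketch that combines several rules into one token:
// Turns a version number such as `1.0` into a single token;
// if it fails, a generic lexy::missing_token error is raised.
dsl::token(dsl::digits<> + dsl::period + dsl::digits<>)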
Values
The following rules produce additional values without matching any input.
lexy::dsl::value_*
lexy/dsl/value.hpp
value_c<Value> : Rule value_f<Fn> : Rule value_t<T> : Rule value_str<Str> : Rule LEXY_VALUE_STR(Str) : Rule
The value_*
rules create a constant value without parsing anything.
Requires |
|
Matches |
Any input, but does not consume anything. |
Value |
|
Error |
n/a (it does not fail) |
Use the value_* rules only to create symmetry between different branches.
Everything they do can also be achieved using callbacks, which is usually a better solution.
|
The function might not be called or the object might not be constructed in all situations. You cannot rely on their side effects. |
value_str<Str> requires C++20 support for extended NTTPs.
Use the LEXY_VALUE_STR(Str) macro if your compiler does not support them.
|
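For illustration, a sketch where both branches produce a value of the same type:
// Each branch produces a bool constant, keeping the branches symmetric.
LEXY_LIT("true") >> dsl::value_c<true>
    | LEXY_LIT("false") >> dsl::value_c<false>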
lexy::dsl::nullopt
lexy/dsl/option.hpp
namespace lexy
{
struct nullopt
{
template <typename T>
constexpr operator T() const;
};
}
The lexy::nullopt
type represents an empty optional.
It is implicitly convertible to any type that has a default constructor (T()
), a dereference operator (*t
), and a contextual conversion to bool
(if (t)
).
Examples are pointers or std::optional
.
The conversion operator returns a default-constructed object, i.e. an empty optional.
lexy/dsl/option.hpp
nullopt : Rule
The nullopt
rule produces a value of type lexy::nullopt
without parsing anything.
Matches |
Any input, but does not consume anything. |
Value |
An object of type |
Error |
n/a (it does not fail) |
It is meant to be used together with the opt() rule for symmetry.
|
lexy::dsl::label
and lexy::dsl::id
lexy/dsl/label.hpp
namespace lexy
{
template <typename Tag>
struct label
{
// only if Tag::value is well-formed
consteval operator auto() const
{
return Tag::value;
}
};
template <auto Id>
using id = label<std::integral_constant<int, Id>>;
}
lexy/dsl/label.hpp
label<Tag> : Rule id<Id> : Rule
The label
and id
rules are used to disambiguate between two branches that otherwise create the same values but should resolve to different callbacks.
They simply produce the empty tag object or the id to differentiate the branches without parsing anything.
Requires |
|
Matches |
Any input, but does not consume anything. |
Value |
|
Error |
n/a (it does not fail) |
lexy/dsl/label.hpp
label<Tag>(rule) : Rule = label<Tag> + rule label<Tag>(branch) : Branch = /* as above, except as branch */ id<Id>(rule) : Rule = id<Id> + rule id<Id>(branch) : Branch = /* as above, except as branch */
For convenience, label
and id
have function call operators.
They produce the label/id and then parse the rule.
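For illustration, a sketch distinguishing two literals that produce no values on their own:
// The callback receives 0 or 1 depending on which branch was taken.
dsl::id<0>(LEXY_LIT("+")) | dsl::id<1>(LEXY_LIT("-"))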
lexy::dsl::capture
lexy/dsl/capture.hpp
capture(rule) : Rule capture(branch) : Branch
The capture()
rule takes an arbitrary rule and parses it, capturing everything it has consumed into a lexy::lexeme
.
It is a branch if given a branch.
Branch Condition |
The branch condition is whatever |
Matches |
Matches and consumes whatever |
Values |
A |
Errors |
All errors raised by |
Example
// Captures the entire input.
dsl::capture(dsl::any)
lexy::dsl::position
lexy/dsl/position.hpp
position : Rule
The position
rule creates as its value an iterator to the current reader position without consuming any input.
Matches |
Any input, but does not consume anything. |
Value |
An iterator to the current position of the reader. |
Error |
n/a (it does not fail) |
Example
// Parses the entire input and returns the final position.
dsl::any + dsl::position
Use position when creating an AST whose nodes are annotated with their original source position.
|
Errors
The following rules are used to customize/improve error messages.
.error<Tag>()
token.error<Tag>() : Token
The error()
function on tokens changes the error that is raised when the token fails.
Matches |
Matches and consumes what |
Error |
A generic error with the specified |
It is useful for tokens such as dsl::token() and operator- , whose default error is a generic tag such as lexy::missing_token or lexy::minus_failure .
|
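For illustration, a sketch; the tag type expected_version is hypothetical:
// Raises expected_version instead of the generic lexy::missing_token.
struct expected_version {};
dsl::token(dsl::digits<> + dsl::period + dsl::digits<>).error<expected_version>()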
lexy::dsl::error
lexy/dsl/error.hpp
error<Tag> : Branch error<Tag>(rule) : Branch
The error
rule always fails and produces an error with the given tag.
For the second version, the rule is matched first to determine the error range.
Branch Condition |
Branch is always taken. |
Matches |
Nothing and always fails. |
Error |
An error object of the specified |
Use it as the final branch of a choice rule to customize the lexy::exhausted_choice error.
|
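For illustration, a sketch; the tag type expected_boolean is hypothetical:
// Raises expected_boolean instead of lexy::exhausted_choice if neither literal matches.
struct expected_boolean {};
LEXY_LIT("true") | LEXY_LIT("false") | dsl::error<expected_boolean>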
lexy::dsl::require
and lexy::dsl::prevent
lexy/dsl/error.hpp
require<Tag>(rule) : Rule prevent<Tag>(rule) : Rule
The require
and prevent
rules perform lookahead: require fails if the rule does not match the input, while prevent fails if it does.
Matches |
Both match the |
Error |
An error object of the specified |
Example
// Parses a sequence of digits but raises an error with tag `forbidden_leading_zero` if a zero is followed by more digits.
// Note: this is already available as `dsl::digits<>.no_leading_zero()`.
dsl::zero >> dsl::prevent<forbidden_leading_zero>(dsl::digits<>)
| dsl::digits<>
Use prevent together with times to prevent the rule from matching more than the specified number of times.
|
Branch conditions
The following rules are designed to be used as the condition of an operator>>
.
They have no effect if not used in a context that requires a branch.
lexy::dsl::else_
lexy/dsl/branch.hpp
else_ : Branch
If else_
is used as a condition, that branch will be taken unconditionally.
It must be used as the last alternative in a choice.
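For illustration:
// Takes the first branch on 'a'; otherwise unconditionally takes the second.
dsl::lit_c<'a'> >> dsl::id<0> | dsl::else_ >> dsl::id<1>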
lexy::dsl::peek
lexy/dsl/peek.hpp
peek(rule) : Branch
The peek
branch is taken if rule
matches, but does not consume it.
Automatic whitespace skipping is disabled while determining whether rule matches.
|
Long lookahead can slow down parsing speed due to backtracking. |
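For illustration, a sketch that uses a digit as lookahead for an integer:
// Takes the branch if a digit follows, without consuming the digit.
dsl::peek(dsl::digit<>) >> dsl::integer<int>(dsl::digits<>)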
lexy::dsl::peek_not
lexy/dsl/peek.hpp
peek_not(rule) : Branch
The peek_not()
branch is taken if rule
does not match, but does not consume it.
Automatic whitespace skipping is disabled while determining whether rule matches.
|
Long lookahead can slow down parsing speed due to backtracking. |
lexy::dsl::lookahead
lexy/dsl/lookahead.hpp
lookahead(needle, end) : Branch
The lookahead
branch is taken if lookahead finds needle
before end
is found; both must be tokens.
No characters are consumed.
Long lookahead can slow down parsing speed due to backtracking. |
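For illustration, a sketch:
// Takes the branch only if a ':' occurs before the next newline.
dsl::lookahead(dsl::colon, dsl::newline) >> dsl::id<0>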
Branches
lexy::dsl::operator+
lexy/dsl/sequence.hpp
rule + rule : Rule
A sequence rule matches multiple rules one after the other.
Matches |
Matches and consumes the first rule, then matches and consumes the second rule, and so on. Only succeeds if all of them succeed. |
Values |
All the values produced by the rules in the same order as they were matched. |
Errors |
Whatever errors are raised by the individual rules. |
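For illustration:
// Matches "Hello", a comma, and "World", in this order.
LEXY_LIT("Hello") + dsl::comma + LEXY_LIT("World")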
lexy::dsl::operator>>
lexy/dsl/branch.hpp
branch >> rule : Branch
The operator>>
is used to turn a rule into a branch by giving it a branch condition, which must be a branch itself.
If the branch is used as a normal rule, it first matches the condition followed by the rule.
If it is used in a context that requires a branch, the branch is checked to determine whether it should be taken.
Branch Condition |
Whatever |
Matches |
Matches and consumes the branch, then matches and consumes the |
Values |
All the values produced by the branch and rule in the same order as they were matched. |
Errors |
Whatever errors are raised by the individual branch and rule. |
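For illustration, a sketch using a prefix as the condition:
// The "0x" prefix is the condition; the digits are only parsed after it has matched.
LEXY_LIT("0x") >> dsl::integer<int>(dsl::digits<dsl::hex>)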
lexy::dsl::if_
lexy/dsl/if.hpp
if_(branch) : Rule
The if_
rule matches a branch only if its condition matches.
Matches |
First matches the branch condition. If that succeeds, consumes it and matches and consumes the rest of the branch. Otherwise, consumes nothing and succeeds anyway. |
Values |
Any values produced by the branch. |
Errors |
Any errors produced by the branch. It will only fail after the condition has been matched. |
Example
// Matches an optional C style comment.
dsl::if_(LEXY_LIT("/*") >> dsl::until(LEXY_LIT("*/")))
lexy::dsl::opt
lexy/dsl/opt.hpp
opt(branch) : Rule = branch | else_ >> nullopt
The opt
rule matches a branch only if its condition matches.
Unlike if_
, if the branch was not taken, it produces a lexy::nullopt
.
Matches |
First matches the branch condition. If that succeeds, consumes it and matches and consumes the rest of the branch. Otherwise, consumes nothing and succeeds anyway. |
Values |
If the branch condition matches, any values produced by the rule.
Otherwise, a single object of type |
Errors |
Any errors produced by the branch. It will only fail after the condition has been matched. |
Example
// Matches an optional list of alpha characters.
// (The id<0> is just there, so the sink will be invoked on each character).
// If no items are present, it will default construct the list type.
dsl::opt(dsl::list(dsl::ascii::alpha >> dsl::id<0>))
lexy::dsl::operator|
lexy/dsl/choice.hpp
branch | branch : Rule
A choice rule matches the first branch, in order, whose condition matches.
Matches |
Tries to match the condition of each branch in the order they were specified. As soon as one branch condition matches, matches and consumes that branch without ever backtracking to try another branch. If no branch condition matched, fails without consuming anything. |
Values |
Any values produced by the selected branch. |
Errors |
Any errors raised by the then of the selected branch.
If no branch condition matched, a generic error with tag |
Example
// A contrived example to illustrate the behavior of choice.
// Note that the branch with id 1 will never be taken, as branch 0 takes everything starting with a and then fails if it isn't followed by bc.
// The correct behavior is illustrated with 2 and 3, there the branch with the longer condition is listed first.
dsl::id<0>(LEXY_LIT("a") >> LEXY_LIT("bc"))
| dsl::id<1>(LEXY_LIT("a") >> LEXY_LIT("b"))
| dsl::id<2>(LEXY_LIT("bc"))
| dsl::id<3>(LEXY_LIT("b"))
The C++ operator precedence is specified in such a way that condition >> a | else_ >> b works.
The compiler might warn that the precedence is not intuitive without parentheses, but in the context of this DSL it is the expected result.
|
Use … | error<Tag> to raise a custom error instead of lexy::exhausted_choice .
|
lexy::dsl::operator/
lexy/dsl/alternative.hpp
token / token : Token
An alternative rule tries to match each token in order, backtracking if necessary.
Matches |
Tries to match each token in the order they were specified. As soon as one token matches, consumes it and succeeds. If no token matched, fails without consuming anything. |
Errors |
A generic error with tag |
If an alternative consists of only literals, a trie is used to efficiently match them without backtracking. |
Use a choice rule with a suitable condition to avoid potentially long backtracking. |
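For illustration:
// Matches one of the comparison operators.
// As all alternatives are literals, a trie is used and no backtracking occurs.
LEXY_LIT("<=") / LEXY_LIT("<") / LEXY_LIT(">=") / LEXY_LIT(">")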
lexy::dsl::switch_
lexy/dsl/switch.hpp
switch_(rule) : Rule switch_(rule).case_(branch) : Rule switch_(rule).default_(rule) : Rule = switch_(rule).case_(else_ >> rule) switch_(rule).error<Tag>() : Rule = switch_(rule).case_(error<Tag>(any))
The switch_
rule matches a rule and then switches over the input the rule has consumed.
Switch cases can be added by calling .case_()
; they are tried in order.
A default case is added using .default_()
; it is taken unconditionally.
Alternatively, an error case can be added using .error<Tag>()
; it produces an error if no previous case has matched.
Matches |
First matches and consumes the switched rule. What the rule has consumed is then taken as the entire input for matching the switch cases. Then it tries to match the branch conditions of each case in order. When a branch condition matches, that case is taken and its then is matched. If no case has matched, it fails. |
Values |
Any values produced by the switched rule followed by any values produced by the selected case. |
Errors |
If the switched rule fails to match, any errors raised by it.
If the branch condition of a case has matched, any errors raised by the then.
If the switch had an error case, a generic error with the specified |
Example
// Parse identifiers (one or more alpha numeric characters) but detect the three reserved keywords.
// We use `+ dsl::eof` in the case condition to ensure that `boolean` is not matched as `bool`.
dsl::switch_(dsl::while_one(dsl::ascii::alnum))
.case_(LEXY_LIT("true") + dsl::eof >> dsl::id<1>)
.case_(LEXY_LIT("false") + dsl::eof >> dsl::id<2>)
.case_(LEXY_LIT("bool") + dsl::eof >> dsl::id<3>)
.default_(dsl::id<0>) // It wasn't a reserved keyword but a normal identifier.
// Note: a more efficient and convenient method for handling keywords is planned.
It does not matter if the then of a case does not consume everything the original rule has consumed. As soon as the then has matched, parsing continues from the reader position after the switched rule has been matched. |
Loops
lexy::dsl::until
lexy/dsl/until.hpp
until(token) : Token until(token).or_eof() : Token = until(token / eof)
The until
token consumes all input until the specified token
matches, then consumes that.
Matches |
If the closing |
Errors |
It can only fail if the reader has reached the end of the input without matching the condition. In that case it raises the same error that matching the token at EOF would raise. |
Example
// Matches a C style comment.
// Note that we don't care what it contains.
LEXY_LIT("/*") >> dsl::until(LEXY_LIT("*/"))
until includes the token .
|
lexy::dsl::loop
lexy/dsl/loop.hpp
loop(rule) : Rule break_ : Rule
The loop
rule matches the given rule repeatedly until it either fails to match or a break_
rule was matched.
Requires |
|
Matches |
While the rule matches, consumes it and repeats.
If a |
Values |
No values are produced. |
Errors |
Any errors raised when the rule fails to match. |
The loop rule is mainly used to implement other rules.
It is unlikely that you are going to need it yourself.
|
If rule contains a branch that will not consume any characters but does not break, loop will loop forever.
|
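For illustration, a sketch:
// Consumes letters; once a semicolon is reached, consumes it and exits the loop.
dsl::loop(dsl::semicolon >> dsl::break_ | dsl::else_ >> dsl::ascii::alpha)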
lexy::dsl::while_
lexy/dsl/while.hpp
while_(branch) : Rule
The while_
rule matches a branch as long as its condition matches.
Requires |
|
Matches |
While the branch condition matches, matches and consumes the then, then repeats. If the branch condition does not match anymore, succeeds without consuming additional input. |
Values |
No values are produced. |
Errors |
The rule can only fail if the then of the branch fails. Then it will raise its error unchanged. |
If the branch does not consume any characters, while_ will loop forever.
|
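For illustration:
// Consumes zero or more decimal digits.
dsl::while_(dsl::digit<>)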
lexy::dsl::while_one()
lexy/dsl/while.hpp
while_one(branch) : Branch = branch >> while_(branch)
The while_one
rule matches a branch one or more times.
lexy::dsl::do_while()
lexy/dsl/while.hpp
do_while(rule, condition_branch) : Rule = rule + while_(condition_branch >> rule) do_while(branch, condition_branch) : Branch = branch >> while_(condition_branch >> branch)
The do_while
rule matches a rule unconditionally at first, and then repeatedly again as long as the condition branch matches.
Example
// Equivalent to `dsl::list(dsl::ascii::alpha, dsl::sep(dsl::comma))` but does not produce a value.
dsl::do_while(dsl::ascii::alpha, dsl::comma)
lexy::dsl::sep
and lexy::dsl::trailing_sep
lexy/dsl/separator.hpp
sep(branch) trailing_sep(branch)
sep
and trailing_sep
are used to specify a separator between repeated items; they are not rules that can be parsed directly.
Use sep(branch)
to indicate that branch
has to be consumed between two items.
If it would match after the last item, it is not consumed by the rule.
Use trailing_sep(branch)
to indicate that branch
has to be consumed between two items and can occur after the final item.
If it matches after the last item, it is consumed as well.
lexy::dsl::times
lexy/dsl/times.hpp
namespace lexy
{
template <std::size_t N, typename T>
using times = T (&)[N];
template <typename T>
using twice = times<2, T>;
}
lexy/dsl/times.hpp
times<N>(rule) : Rule times<N>(rule, sep) : Rule twice(rule) : Rule = times<2>(rule) twice(rule, sep) : Rule = times<2>(rule, sep)
The times
rule repeats the rule N
times with an optional separator in between and collects all produced values into an array.
The twice
rule is a convenience alias for N = 2
.
Requires |
The separator must not produce any values. All values produced by parsing the rule must have a common type. In particular, the rule must only produce one value. |
Matches |
If no separator is specified, matches and consumes |
Values |
Produces a single array containing |
Errors |
All errors raised by matching the rule or separator. |
Example
// Parses an IPv4 address (4 uint8_t's separated by periods).
dsl::times<4>(dsl::integer<std::uint8_t>(dsl::digits<>), dsl::sep(dsl::period))
lexy::dsl::list
lexy/dsl/list.hpp
list(rule) : Rule list(branch) : Branch list(rule, sep) : Rule list(branch, sep) : Branch
The list
rule matches a rule one or more times, optionally separated by a separator.
Values produced by the list items are forwarded to a sink callback.
Branch Condition |
Whatever |
Requires |
The item rule must be a branch unless a non-trailing separator is used (in that case the separator can be used as condition).
A production whose rule contains |
Matches |
Matches and consumes the item rule one or more times. In between items and potentially after the final item, a separator is matched and consumed if provided according to its rules. If the separator is provided and non-trailing, the existence of a separator determines whether or not the rule should be matched again. Otherwise, the branch condition of the branch rule or an added else branch of the choice rule is used to determine that. |
Values |
Only a single value, which is the result of the finished sink. Every time the item rule is parsed, all values it produces are passed to the sink which is invoked once per iteration. If the separator is captured, its lexeme is also passed to the sink, but in a separate invocation. |
Errors |
All errors raised when parsing the item rule or separator. |
Example
// Parses a list of integers separated by (a potentially trailing) comma.
// As the separator is trailing, it cannot be used to determine the end of the list.
// As such we peek whether the input contains a digit in our item condition.
// The sink is invoked with each integer.
dsl::list(dsl::peek(dsl::digit<>) >> dsl::integer<int>(dsl::digits<>),
dsl::trailing_sep(dsl::comma))
Use one of the bracketing rules if your list item does not have an easy condition and the list is surrounded by given tokens anyway. |
lexy::dsl::opt_list
lexy/dsl/list.hpp
opt_list(branch) : Rule opt_list(branch, sep) : Rule
The opt_list
rule matches a rule zero or more times, optionally separated by a separator.
Values produced by the list items are forwarded to a sink callback.
Requires |
The item rule must be a branch.
A production whose rule contains |
Matches |
Checks whether the item rule would match using its branch condition.
If it does, matches and consumes |
Values |
If the list is non-empty, the result of the sink produced by parsing the |
Errors |
If the list is non-empty, all errors raised by parsing the |
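For illustration, a sketch mirroring the list example above:
// Parses zero or more comma-separated integers; a leading digit signals another item.
dsl::opt_list(dsl::peek(dsl::digit<>) >> dsl::integer<int>(dsl::digits<>),
              dsl::sep(dsl::comma))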
lexy::dsl::combination
lexy/dsl/combination.hpp
combination(branch1, branch2, ...) : Rule combination<Tag>(branch1, branch2, ...) : Rule
The combination
rule matches each of the sub-rules exactly once but in any order.
Values produced by the rules are forwarded to a sink.
Requires |
A production whose rule contains |
Matches |
Matches and consumes all rules in an arbitrary order.
This is done by parsing the choice created from the branches exactly |
Values |
Only a single value, which is the result of the finished sink. All values produced by the branches are passed to the sink which is invoked once per iteration. |
Errors |
All errors raised by parsing the branches.
If no branch is matched, but there are still missing branches,
a generic error with tag |
Example
// Matches 'a', 'b', or 'c', in any order.
dsl::combination(dsl::lit_c<'a'>, dsl::lit_c<'b'>, dsl::lit_c<'c'>)
The branches are tried in order. If an earlier branch always takes precedence over a later one, the combination can never be successful. |
lexy::dsl::partial_combination
lexy/dsl/combination.hpp
partial_combination(branch1, branch2, ...) : Rule partial_combination<Tag>(branch1, branch2, ...) : Rule
The partial_combination
rule matches each of the sub-rules at most once but in any order.
Values produced by the rules are forwarded to a sink.
Requires |
A production whose rule contains |
Matches |
Matches and consumes a subset of the rules in an arbitrary order.
This is done by parsing the choice created from the branches exactly |
Values |
Only a single value, which is the result of the finished sink. All values produced by the branches are passed to the sink which is invoked once per iteration. |
Errors |
All errors raised by parsing the branches.
If a rule is matched twice, a generic error is raised.
It has the specified tag or |
Example
// Matches a subset of 'a', 'b', or 'c', in any order.
dsl::partial_combination(dsl::lit_c<'a'>, dsl::lit_c<'b'>, dsl::lit_c<'c'>)
The branches are tried in order. If an earlier branch always takes precedence over a later one, the combination can never be successful. |
Productions
Every rule is owned by a production. The following rules allow interaction with other productions.
lexy::dsl::p
and lexy::dsl::recurse
lexy/dsl/production.hpp
p<Production> : Rule or Branch recurse<Production> : Rule
The p
and recurse
rules parse the rule of another production.
The p
rule is a branch, if the rule of the other production is a branch.
Requires |
For |
Branch Condition |
Whatever the production’s rule uses as a branch condition. |
Matches |
Matches and consumes |
Values |
A single value, which is the result of parsing the production. All values produced by parsing its rule are forwarded to the productions value callback. |
Errors |
If matching fails, |
Example
// Parse a sub production followed by an exclamation mark.
dsl::p<sub_production> + dsl::lit_c<'!'>
While recurse can be used to implement direct recursion (e.g. prefix >> dsl::recurse<current_production> | dsl::else_ >> end to match zero or more prefix followed by end ), it is better to use loops instead.
|
Left recursion will create an infinite loop. |
If a production is parsed while whitespace skipping has been disabled using lexy::dsl::no_whitespace() ,
it is temporarily re-enabled while Production::rule is parsed.
If whitespace skipping has been disabled because the parent production inherits from lexy::token_production ,
whitespace skipping is still disabled while parsing Production::rule .
|
lexy::dsl::return_
lexy/dsl/return.hpp
return_ : Rule
Conceptually, each production has an associated function that parses the specified rule.
The return_
rule will exit that function early, without parsing subsequent rules.
Requires |
It must not be used inside loops. |
Matches |
Any input, but does not consume anything. Subsequent rules are not matched further. |
Values |
It does not produce any values, but all values produced so far are forwarded to the callback. |
Errors |
n/a (it does not fail) |
Example
// Match an opening parenthesis followed by 'a' or 'b'.
// If it is followed by 'b', the closing parenthesis is not matched anymore.
dsl::parenthesized(dsl::lit_c<'a'> | dsl::lit_c<'b'> >> dsl::return_)
When using return_ together with the context sensitive parsing facilities, remember to pop all context objects before the return.
|
Brackets and terminator
Terminator
lexy/dsl/terminator.hpp
terminator(branch) terminator(branch).terminator() : Branch = branch
A terminator can be specified using terminator()
.
The result is not a rule, but a DSL for specifying that a rule is followed by the terminator.
The terminator is defined using a branch; it is returned by calling .terminator()
.
lexy/dsl/terminator.hpp
t(rule) : Rule = rule + t.terminator()
Calling t(rule)
, where t
is the result of a terminator()
call, results in a rule that parses the given rule
followed by the terminator.
lexy/dsl/terminator.hpp
t.while_(rule) : Rule t.while_one(rule) : Rule t.opt(rule) : Rule t.list(rule) : Rule t.list(rule, sep) : Rule t.opt_list(rule) : Rule t.opt_list(rule, sep) : Rule
Using t.while_()
, t.while_one()
, t.opt()
, t.list()
, or t.opt_list()
, where t
is the result of a terminator()
call,
results in a rule that parses while_(rule)
, while_one(rule)
, opt(rule)
, list(rule)
and opt_list(rule)
, respectively, but followed by the terminator.
The rule
does not need to be a branch, as the terminator is used as the branch condition for the while_()
, opt()
and list()
rules.
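For illustration, a sketch; the production item is hypothetical:
// Parses one or more items; the semicolon terminates the list and is consumed.
dsl::terminator(dsl::semicolon).list(dsl::p<item>)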
Brackets
lexy/dsl/brackets.hpp
brackets(open_branch, close_branch) brackets(open_branch, close_branch).open() : Branch = open_branch brackets(open_branch, close_branch).close() : Branch = close_branch
A set of open and close brackets can be specified using brackets()
.
The result is not a rule, but a DSL for specifying that a rule is surrounded by brackets.
The open and close brackets are defined using branches; they are returned by calling .open()
and .close()
.
lexy/dsl/brackets.hpp
b(rule) : Branch = b.open() >> rule + b.close()
Calling b(rule)
, where b
is the result of a brackets()
call, results in a rule that parses the given rule
surrounded by brackets.
The rule is a branch that uses the opening bracket as a branch condition.
lexy/dsl/brackets.hpp
b.while_(rule) : Branch b.while_one(rule) : Branch b.opt(rule) : Branch b.list(rule) : Branch b.list(rule, sep) : Branch b.opt_list(rule) : Branch b.opt_list(rule, sep) : Branch
Using b.while_()
, b.while_one()
, b.opt()
, b.list()
, or b.opt_list()
, where b
is the result of a brackets()
call, results in a branch that parses while_(rule)
, while_one(rule)
, opt(rule)
, list(rule)
and opt_list(rule)
, respectively, but surrounded by the brackets.
The rule
does not need to be a branch, as the closing bracket is used as the branch condition for the while_()
, opt()
and list()
rules.
lexy/dsl/brackets.hpp
round_bracketed = brackets(lit_c<'('>, lit_c<')'>) square_bracketed = brackets(lit_c<'['>, lit_c<']'>) curly_bracketed = brackets(lit_c<'{'>, lit_c<'}'>) angle_bracketed = brackets(lit_c<'<'>, lit_c<'>'>) parenthesized = round_bracketed
Common sets of open and close brackets are pre-defined.
Example
// Parses a list of integers separated by (a potentially trailing) comma surrounded by parentheses.
// The same example without the parentheses was also used for list,
// but we required a list condition that needed to perform lookahead.
// Now, the closing parenthesis is used as the condition and we don't need lookahead.
dsl::parenthesized.list(dsl::integer<int>(dsl::digits<>),
dsl::trailing_sep(dsl::comma))
Numbers
The facilities for parsing integers are split into the digit tokens, which do not produce any values,
and the integer
rule, which matches a digit token and converts it into an integer.
The integer conversion has to be done during parsing and not in a callback, as overflow creates a parse error.
Base
lexy/dsl/digit.hpp
namespace lexy::dsl
{
struct binary;
struct octal;
struct decimal;
struct hex_lower;
struct hex_upper;
struct hex;
}
The set of allowed digits and their values is specified using a Base
, which is a policy class passed to the rules.
binary
-
Matches the base 2 digits
0
and1
. octal
-
Matches the base 8 digits
0-7
. decimal
-
Matches the base 10 digits
0-9
. If no base is specified, this is the default. hex_lower
-
Matches the lower-case base 16 digits
0-9
anda-f
. hex_upper
-
Matches the upper-case base 16 digits
0-9
andA-F
. hex
-
Matches the base 16 digits
0-9
,A-F
, anda-f
.
lexy::integer_traits
lexy/dsl/integer.hpp
namespace lexy { template <typename T> struct integer_traits { using type = T; static constexpr bool is_bounded; template <int Radix> static constexpr std::size_t max_digit_count; template <int Radix> static constexpr void add_digit_unchecked(type& result, unsigned digit); template <int Radix> static constexpr bool add_digit_checked(type& result, unsigned digit) }; template <> struct integer_traits<lexy::code_point>; template <typename T> struct unbounded {}; template <typename T> struct integer_traits<unbounded<T>> { using type = typename integer_traits<T>::type; static constexpr bool is_bounded = false; template <int Radix> static constexpr void add_digit_unchecked(type& result, unsigned digit); }; }
The lexy::integer_traits
are used for parsing an integer.
They control its maximal value and abstract away the required integer operations.
The type
member is the actual type that will be returned by the parse operation. It is usually T
.
The parsing algorithm does not require that type
is an integer type, it only needs to have a constructor that initializes it from an int
.
If is_bounded
is true
, parsing requires overflow checking.
Otherwise, parsing does not require overflow checking and max_digit_count
and add_digit_checked
are not required.
max_digit_count
returns the number of digits necessary to express the bounded integer's maximal value in the given radix.
It must be bigger than 1
.
add_digit_unchecked
and add_digit_checked
add digit
to result by doing the equivalent of result = result * Radix + digit
.
The _checked
version returns true
if that has led to an integer overflow.
The primary template works with any integer type and there is a specialization for lexy::code_point
.
By wrapping your integer type in lexy::unbounded
, you can disable bounds checking during parsing.
Its specialization of lexy::integer_traits
is built on top of the specialization of lexy::integer_traits<T>,
but disables all bounds checking.
You can specialize lexy::integer_traits
for your own integer types.
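For illustration, a sketch that disables the overflow check:
// Parses a decimal integer into an int without overflow checking.
dsl::integer<lexy::unbounded<int>>(dsl::digits<>)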
lexy::dsl::zero
lexy/dsl/digit.hpp
zero : Token
The zero
token matches the zero digit.
Matches |
Matches and consumes the zero digit |
Errors |
Raises a |
lexy::dsl::digit
lexy/dsl/digit.hpp
digit<Base> : Token
The digit
token matches a digit of the specified base or decimal
if no base was specified.
Matches |
Matches and consumes any of the valid digits of the base. |
Errors |
Raises a |
lexy::dsl::digits
lexy/dsl/digit.hpp
digits<Base> : Token digits<Base>.sep(token) : Token digits<Base>.no_leading_zero() : Token
The digits
token matches a non-empty sequence of digits in the specified base or decimal
if no base was specified.
Calling .sep()
allows adding a digit separator token that can be present at any point between two digits, but is not required.
Calling .no_leading_zero()
raises an error if one or more leading zeros are encountered.
The calls to .sep()
and .no_leading_zero()
can be chained.
Matches |
Matches and consumes one or more digits of the specified base.
If a separator was added, it tries to match it after every digit.
It is consumed if it was matched, but it does not fail if no separator was present.
If a separator is matched without a following digit, it fails.
If |
Errors |
All errors raised by |
Example
// Matches upper-case hexadecimal digits separated by ' without leading zeroes.
dsl::digits<dsl::hex_upper>.sep(dsl::digit_sep_tick).no_leading_zero()
The separator can be placed at any point between two digits. There is no validation to ensure it follows a thousands-separator or similar convention. |
lexy/dsl/digit.hpp
digit_sep_underscore : Token = lit<"_"> digit_sep_tick : Token = lit<"'">
For convenience, two common digit separators _
and '
are predefined as digit_sep_underscore
and digit_sep_tick
respectively.
However, the digit separator can be an arbitrarily complex token.
lexy::dsl::n_digits
lexy/dsl/digit.hpp
n_digits<N, Base> : Token n_digits<N, Base>.sep(token) : Token
The n_digits
token matches exactly N
digits in the specified base or decimal
if no base was specified.
Calling .sep()
allows adding a digit separator token that can be present at any point between two digits, but is not required.
Matches |
Matches and consumes exactly |
Errors |
All errors raised by |
Example
// Matches 4 upper-case hexadecimal digits separated by '.
dsl::n_digits<4, dsl::hex_upper>.sep(dsl::digit_sep_tick)
lexy::dsl::integer
lexy/dsl/integer.hpp
integer<T, Base>(token) : Rule
The integer
rule converts the lexeme matched by the token
into an integer of type T
using the given base.
The Base
can be omitted if the token is digits
or n_digits
.
It will then be deduced from the token.
Matches |
Matches and consumes what |
Values |
An integer of type |
Errors |
Any errors raised by matching the token.
If the integer type |
Example
// Matches upper-case hexadecimal digits separated by ' without leading zeroes.
// Converts them into an integer, the base is deduced from the token.
dsl::integer<int>(dsl::digits<dsl::hex_upper>
.sep(dsl::digit_sep_tick).no_leading_zero())
lexy::dsl::code_point_id
lexy/dsl/integer.hpp
code_point_id<N, Base> : Rule = integer<lexy::code_point>(n_digits<N, Base>)
The code_point_id
rule is a convenience rule that parses a code point.
It matches N
digits in the specified base, which defaults to hex
, and converts them into a code point.
Matches |
Matches and consumes exactly |
Values |
The |
Errors |
The same error as |
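For illustration, a sketch of a Unicode escape sequence:
// Matches the `XXXX` of a `\uXXXX` escape and produces the code point.
dsl::lit_c<'u'> >> dsl::code_point_id<4>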
lexy::dsl::plus_sign
, lexy::dsl::minus_sign
, and lexy::dsl::sign
lexy/dsl/sign.hpp
plus_sign : Rule minus_sign : Rule sign : Rule
The plus_sign
, minus_sign
, and sign
rules match an optional sign.
Matches |
|
Errors |
n/a (they don’t fail) |
Example
// Parse a decimal integer with optional minus sign.
dsl::minus_sign + dsl::integer<int>(dsl::digits<>)
The callback lexy::as_integer takes the value produced by the sign rules together with an integer produced by the integer rule and negates it if necessary.
|
Delimited and quoted
lexy/dsl/delimited.hpp
delimited(open_branch, close_branch) delimited(open_branch, close_branch).open() : Branch = open_branch delimited(open_branch, close_branch).close() : Branch = close_branch
A set of open and close delimiters can be specified using delimited()
.
The result is not a rule, but a DSL for specifying a sequence of code points to be matched between the delimiters.
The open and close delimiters are defined using branches; they are returned by calling .open()
and .close()
.
lexy/dsl/delimited.hpp
delimited(branch) = delimited(branch, branch)
There is a convenience overload if the same rule is used for the open and closing delimiters.
lexy/dsl/delimited.hpp
quoted = delimited(lit<"\"">) triple_quoted = delimited(lit<"\"\"\"">) single_quoted = delimited(lit<"'">) backticked = delimited(lit<"`">) double_backticked = delimited(lit<"``">) triple_backticked = delimited(lit<"```">)
Common delimiters are predefined.
The naming of quoted , triple_quoted and single_quoted is not very logical, but reflects common usage.
|
Simple delimited
lexy/dsl/delimited.hpp
d(rule) : Branch d(token) : Branch = d(capture(token))
Calling d(rule)
, where d
is the result of a delimited()
call, results in a rule that matches rule
as often as possible surrounded by the delimiters.
Values produced by the rule
are forwarded to a sink callback.
For convenience, if passing a token, the token is captured. Otherwise, nothing would be passed to the sink.
Requires |
A production whose rule contains a delimited rule must provide a sink. |
Branch Condition |
Whatever the opening delimiter uses as branch condition. |
Matching |
Matches and consumes the opening delimiter, followed by zero or more occurrences of |
Values |
Values produced by the opening delimiter, the finished sink (which might be empty), and values produced by the closing delimiter. Every time the rule is parsed, all values produced by it are passed to the sink. |
Errors |
All errors raised when matching the opening delimiter and the rule.
If EOF is reached without a closing delimiter, a generic error with tag |
Example
// Match a string consisting of code points that aren't control characters.
dsl::quoted(dsl::code_point - dsl::ascii::control)
When rule contains a lexy::dsl::p or lexy::dsl::recurse rule, whitespace skipping is re-enabled while that production is parsed.
|
Delimited with escape sequences
lexy/dsl/delimited.hpp
d(rule, escape_branch) : Branch = d(escape_branch | else_ >> rule) d(token, escape_branch) : Branch = d(escape_branch | else_ >> capture(token)) d(rule, escape_choice) : Branch = d(escape_choice | else_ >> rule) d(token, escape_choice) : Branch = d(escape_choice | else_ >> capture(token))
There is a convenience overload to specify escape sequences in the delimited.
The choice
matches all appropriate escape sequences and produces their values.
Calling d(rule, escape)
, where d
is the result of a delimited()
call, is equivalent to d(escape | else_ >> rule)
, so it results in a rule that matches escape | else_ >> rule
as often as possible surrounded by the delimiters.
Values produced by the rule
or escape
are forwarded to a sink callback.
For convenience, if passing a token, the token is captured. Otherwise, nothing would be passed to the sink.
Example
// Match a string consisting of code points that aren't control characters.
// `\"` can be used to add a `"` to the string.
dsl::quoted(dsl::code_point - dsl::ascii::control,
LEXY_LIT("\\\"") >> dsl::value_c<'"'>)
The closing delimiter is used as termination condition here as well. If the escape sequence starts with a closing delimiter, it will not be matched. |
lexy::dsl::escape()
lexy/dsl/delimited.hpp
escape(token) : Rule
For convenience, the escape
rule can be used to specify the escape token.
An escape rule consists of a leading token that matches the escape character (e.g. \
), and zero or more alternatives for characters that can be escaped.
It is then equivalent to token >> (alt0 | alt1 | alt2 | error<lexy::invalid_escape_sequence>)
.
It will only be considered after the leading token has been matched and then tries to match one of the alternatives.
If no alternative matches, it raises a generic error with tag lexy::invalid_escape_sequence
.
lexy/dsl/delimited.hpp
e.rule(branch) : Rule = escape_token >> ( ... | branch | else_ >> error<lexy::invalid_escape_sequence>)
Calling e.rule(branch)
, where e
is an escape rule, adds branch
to the end of the choice.
lexy/dsl/delimited.hpp
e.capture(token) : Rule = escape_token >> (... | capture(token) | else_ >> error<lexy::invalid_escape_sequence>)
Calling e.capture(token)
, where e
is an escape rule, adds an escape sequence that matches and captures token to the end of the choice.
lexy/dsl/delimited.hpp
e.lit<Str>(rule) : Rule = escape_token >> (... | lit<Str> >> rule | else_ >> error<lexy::invalid_escape_sequence>) e.lit<Str>() : Rule = e.lit<Str>(value_str<Str>) e.lit_c<C>(rule) : Rule = escape_token >> (... | lit_c<C> >> rule | else_ >> error<lexy::invalid_escape_sequence>) e.lit_c<C>() : Rule = e.lit_c<C>(value_c<C>)
Calling e.lit()
or e.lit_c()
, where e
is an escape rule, adds an escape sequence that matches the literal and produces the values of the rule to the end of the choice.
If no rule is specified, it defaults to producing the literal itself.
lexy/dsl/delimited.hpp
backslash_escape = escape(lit_c<'\\'>) dollar_escape = escape(lit_c<'$'>)
Common escape characters are predefined.
Example
// Match a string consisting of code points that aren't control characters.
// `\"` can be used to add a `"` to the string.
// `\uXXXX` can be used to add the code point with the specified value.
dsl::quoted(dsl::code_point - dsl::ascii::control,
dsl::backslash_escape
.lit_c<'"'>()
.rule(dsl::lit_c<'u'> >> dsl::code_point_id<4>))
Aggregates
lexy/dsl/member.hpp
member<MemPtr> = rule : Rule member<MemPtr> = branch : Branch LEXY_MEM(Name) = rule : Rule LEXY_MEM(Name) = branch : Branch
The member
rule together with the lexy::as_aggregate<T>
callback assigns the values produced by the rule given to it via =
to the specified member of the aggregate T
.
Requires |
A production that contains a member rule needs to use |
Matches |
Matches and consumes the |
Values |
Produces two values.
The first value identifies the targeted member.
For |
Errors |
All errors raised during parsing of the assigned rule. |
The lexy::as_aggregate<T>
callback collects all member-value pairs.
It then constructs an object of type T
using value initialization and, for each pair, assigns the value to the specified member.
This works either as a callback or as a sink.
If a member is specified more than once, the value assigned last is kept.
Example
// Parses two integers separated by commas.
// The first integer is assigned to a member called `second`,
// the second integer is assigned to a member called `first`.
(LEXY_MEM(second) = dsl::integer<int>(dsl::digits<>))
+ dsl::comma
+ (LEXY_MEM(first) = dsl::integer<int>(dsl::digits<>))
Context sensitive parsing
To parse context sensitive grammars, lexy
allows the creation of context variables.
They allow saving state between different rules, which can be used to parse context-sensitive elements such as XML with matching opening and closing tag names.
A context variable has a type, which is limited to bool
, int
and lexy::lexeme
, and an identifier, which is given by a type.
Before a variable can be used it needs to be created with .create()
.
It is then available to all rules of the current production; child and parent productions cannot access it.
Variables are not persistent between multiple invocations of a production;
every time a production is parsed it starts out with no variables.
See example/xml.cpp
for an example that uses the context sensitive parsing facilities.
lexy::dsl::context_flag
lexy/dsl/context_flag.hpp
context_flag<Id>
A lexy::dsl::context_flag
controls a boolean that can be true
or false
.
Each object is uniquely identified by the type Id
.
It is not a rule but a DSL for specifying operations which are then rules.
context_flag<Id>.create() : Rule context_flag<Id>.create<Value>() : Rule
The .create()
rule does not interact with the input at all.
When it is parsed, it creates the flag with the given Id
and initializes it to the Value
(defaulting to false
).
context_flag<Id>.set() : Rule context_flag<Id>.reset() : Rule
The .set()
/.reset()
rules do not interact with the input at all.
When they are parsed, they set the flag with the given Id
to true
/false
respectively.
context_flag<Id>.toggle() : Rule
The .toggle()
rule does not interact with the input at all.
When it is parsed, it toggles the value of the flag with the given Id
.
context_flag<Id>.select(rule_true, rule_false) : Rule
The .select()
rule selects on the given rules depending on the value of the flag with the given Id
.
It then parses the selected rule.
Matches |
If the value of the flag is |
Values |
All values produced by parsing the selected rule. |
Errors |
All errors raised by parsing the selected rule. |
context_flag<Id>.require<ErrorTag>() : Rule context_flag<Id>.require<Value, ErrorTag>() : Rule
The .require()
rule does not interact with the input at all.
When it is parsed, it checks that the value of the flag with the given Id
is the given Value
(defaults to true
).
If that is the case, parsing continues.
Otherwise, the rule fails, producing an error with the given ErrorTag
.
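For illustration, a sketch; the identifier type seen_sign is hypothetical:
struct seen_sign {};
// Creates the flag, sets it if a minus sign is present, then selects a rule based on it.
dsl::context_flag<seen_sign>.create()
    + dsl::if_(dsl::lit_c<'-'> >> dsl::context_flag<seen_sign>.set())
    + dsl::context_flag<seen_sign>.select(dsl::id<0>, dsl::id<1>)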
lexy::dsl::context_counter
lexy/dsl/context_counter.hpp
context_counter<Id>
A lexy::dsl::context_counter
controls a C++ int
.
Each object is uniquely identified by the type Id
.
It is not a rule but a DSL for specifying operations which are then rules.
context_counter<Id>.create() : Rule context_counter<Id>.create<Value>() : Rule
The .create()
rule does not interact with the input at all.
When it is parsed, it creates the counter with the given Id
and initializes it to the Value
(defaulting to 0
).
context_counter<Id>.inc() : Rule context_counter<Id>.dec() : Rule
The .inc()
/.dec()
rules do not interact with the input at all.
When they are parsed, they increment/decrement the counter with the given Id
respectively.
context_counter<Id>.push(rule) : Rule context_counter<Id>.pop(rule) : Rule
The .push()
/.pop()
rules parse the given rule
.
The counter with the given Id
is then incremented/decremented by the number of characters (code units) consumed by rule
.
Matches |
Matches and consumes |
Values |
All values produced by parsing |
Errors |
All errors raised by parsing |
context_counter<Id>.compare<Value>(rule_less, rule_eq, rule_greater) : Rule
The .compare()
rule compares the value of the counter with the given Id
to Value
.
It then parses one of the three given rules, depending on the result.
Matches |
If the value of the counter is less than |
Values |
All values produced by parsing the selected rule. |
Errors |
All errors raised by parsing the selected rule. |
context_counter<Id>.require<ErrorTag>() : Rule context_counter<Id>.require<Value, ErrorTag>() : Rule
The .require()
rule does not interact with the input at all.
When it is parsed, it checks that the value of the counter with the given Id
is the given Value
(defaults to 0
).
If that is the case, parsing continues.
Otherwise, the rule fails, producing an error with the given ErrorTag
.
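For illustration, a sketch; the types depth and unbalanced are hypothetical:
struct depth {};
struct unbalanced {};
// Counts 'a's, subtracts 'b's, and requires that the counter ends up at zero again.
dsl::context_counter<depth>.create()
    + dsl::while_(dsl::lit_c<'a'> >> dsl::context_counter<depth>.inc())
    + dsl::while_(dsl::lit_c<'b'> >> dsl::context_counter<depth>.dec())
    + dsl::context_counter<depth>.require<unbalanced>()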
lexy::dsl::context_lexeme
lexy/dsl/context_lexeme.hpp
context_lexeme<Id>
A lexy::dsl::context_lexeme
controls a lexy::lexeme
(i.e. a string view on part of the input).
Each object is uniquely identified by the type Id
.
It is not a rule but a DSL for specifying operations which are then rules.
context_lexeme<Id>.create() : Rule
The .create()
rule does not interact with the input at all.
When it is parsed, it creates the lexeme with the given Id
and initializes it to an empty view.
context_lexeme<Id>.capture(rule) : Rule
The .capture()
rule parses the given rule
.
The lexeme with the given Id
is then set to view everything the rule
has consumed, as if lexy::dsl::capture()
was used.
Matches |
Matches and consumes |
Values |
All values produced by parsing |
Errors |
All errors raised by parsing |
context_lexeme<Id>.require<ErrorTag>(rule) : Rule
The .require()
rule parses the given rule
, capturing it in a temporary lexeme.
The temporary lexeme is then compared with the lexeme given by the Id
.
If the two lexemes are equal, parsing continues.
Otherwise, the rule fails, producing an error with the given ErrorTag
.
Matches |
Matches and consumes |
Values |
Discards values produced by |
Errors |
All errors raised by parsing |
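For illustration, a sketch of matching tag names (see example/xml.cpp for a full version); open_tag and tag_mismatch are hypothetical types, and name and content stand for suitable rules:
struct open_tag {};
struct tag_mismatch {};
// Remembers the opening tag name, then requires the closing tag name to be identical.
dsl::context_lexeme<open_tag>.create()
    + dsl::context_lexeme<open_tag>.capture(name)
    + content
    + dsl::context_lexeme<open_tag>.require<tag_mismatch>(name)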
Raw input
The following facilities are meant for parsing input that uses the lexy::raw_encoding
, that is input consisting of bytes, not text.
lexy::dsl::bom
lexy/dsl/bom.hpp
bom<Encoding, Endianness> : Token
The bom
token matches the byte-order mark (BOM) for the given encoding and lexy::encoding_endianness
.
Requires |
|
Matches |
If the encoding has a BOM, matches and consumes the BOM written in the given endianness. |
Errors |
A |
Example
// Matches the UTF-16 big endian BOM (0xFE, 0xFF).
dsl::bom<lexy::utf16_encoding, lexy::encoding_endianness::big>
There is a UTF-8 BOM, but it is the same regardless of endianness. |
This rule is only necessary when you have a raw encoding that contains a BOM.
For example, lexy::read_file() already handles BOMs for you by default.
|
lexy::dsl::encode
encode<Encoding, Endianness>(rule) : Rule
The encode
rule temporarily changes the encoding of the input.
The specified rule
will be matched using a Reader
whose encoding is Encoding
converted from the raw bytes using the specified endianness.
If no Endianness
is specified, the default is lexy::encoding_endianness::bom
, and a BOM is matched on the input to determine the endianness.
If no BOM is present, big endian is assumed.
Requires |
The input’s encoding is a single-byte encoding (usually |
Matches |
If the endianness is |
Values |
All values produced by the rule. |
Errors |
All errors raised by the rule. The error type uses the original reader, not the encoded reader that does the input translation. |
Example
// Matches a UTF-8 code point, followed by an ASCII code point.
dsl::encode<lexy::utf8_encoding>(dsl::code_point)
+ dsl::encode<lexy::ascii_encoding>(dsl::code_point)
Custom rules
The exact interface for the Rule
, Token
and Branch
concepts is currently still experimental.
Refer to the existing rules if you want to add your own.
Glossary
- Branch
-
A rule that has an associated condition and will only be taken if the condition matches. It is used to make decisions in the parsing algorithm.
- Callback
-
A function object with a
return_type
member typedef. - Encoding
-
Set of pre-defined classes that define the text encoding of the input.
- Error Callback
-
The callback used to report errors.
- Grammar
-
An entry production and all productions referenced by it.
- Input
-
Defines the input that will be parsed.
- Production
-
Building-block of a grammar consisting of a rule and an optional callback that produces the parsed value.
- Rule
-
Matches a specific input and then produces a value or an error.
- Sink
-
A type with a
sink()
method that then returns a function object that can be called multiple times. - Token
-
A rule that is an atomic building block of the input.