L43Peg
A Parsing Expression Grammar library for Ruby
This version (v0.1.x) is alpha quality: many PEG features are still missing, such as recursion and even alternatives.
It is released nevertheless because it already offers convenient parsing of ARGV, as demonstrated by the following
[speculations](https://rubygems.org/gems/speculate_about).
See [README_spec.rb](spec/speculations/README_spec.rb) for the generated code.
Context: arg_parser
Given the following argument specification
include L43Peg::Combinators
let :args_spec do
{
start: "--start=(.*)",
end: "(?:--end|-e)=(.*)",
kwd: "--(alpha|beta|gamma)"
}
end
And the associated parser
let(:parser) { args_parser(args_spec) }
Then we can parse some input
assert_parse_success(parser, %w[--start=42 --beta -e=44], ast: {start: "42", kwd: "beta", end: "44"}, rest: [])
And we can get the rest in a list of tokens
assert_parse_success(parser, %w[--start=42 --beta -e=44 -s=not_an_arg --end=too_late], ast: {start: "42", kwd: "beta", end: "44"}, rest: %w[-s=not_an_arg --end=too_late])
Also note that repeated arguments are collected into an array
input = %w[--end=42 --beta -e=44 --beta --end=not_too_late --gamma]
ast = {end: %w[42 44 not_too_late], kwd: %w[beta beta gamma]}
assert_parse_success(parser, input, ast:, rest: [])
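The behavior speculated about in this context can be approximated in plain Ruby. The following is only a sketch under assumptions, not the implementation of args_parser: sketch_parse_args is a hypothetical helper, patterns are assumed to be anchored at the start of each token, the first matching pattern is assumed to win, and the first unmatched token is assumed to end parsing.

```ruby
# Hypothetical helper, NOT part of L43Peg: a plain Ruby approximation of the
# behavior shown in the speculations of this context.
def sketch_parse_args(spec, tokens)
  # Assumption: each pattern is anchored at the start of a token
  regexen = spec.transform_values { |source| Regexp.new("\\A(?:#{source})") }
  ast = {}
  rest = tokens.dup
  until rest.empty?
    # Assumption: the first pattern that matches the next token wins
    name, rgx = regexen.find { |_, candidate| candidate.match?(rest.first) }
    break unless name # an unmatched token ends argument parsing

    value = rgx.match(rest.shift)[1]
    # A repeated key collects all of its values into an array
    ast[name] = ast.key?(name) ? [*ast[name], value] : value
  end
  [ast, rest]
end

spec = { start: "--start=(.*)", end: "(?:--end|-e)=(.*)", kwd: "--(alpha|beta|gamma)" }
ast, rest = sketch_parse_args(spec, %w[--start=42 --beta -e=44 -s=not_an_arg --end=too_late])
p ast
p rest
```

Wrapping each source in a non-capturing group keeps the anchor valid for patterns with alternatives while leaving the numbering of the capture groups untouched.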
Context: Postprocessing
When we map the parser
let :int_args do
{
start: "--start=(.*)",
end: "--end=(.*)",
inc: "--inc=(.*)"
}
end
let(:int_arg_parser) {args_parser(int_args, name: "int parser", &:to_i)}
Then we can convert the string values
assert_parse_success(int_arg_parser, %w[--start=42 --end=44 --inc=2], ast: {start: 42, end: 44, inc: 2}, rest: [])
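To make the effect of the block more tangible, here is a minimal plain Ruby sketch of such a postprocessing step. It is an illustration under assumptions, not the code of args_parser: sketch_map_ast is a hypothetical helper, and it assumes the block is applied to every captured value, including each element of an array of repeated values.

```ruby
# Hypothetical helper, NOT part of L43Peg: apply a block to every captured
# value of an ast hash, descending into arrays of repeated values.
def sketch_map_ast(ast, &block)
  ast.transform_values do |value|
    value.is_a?(Array) ? value.map(&block) : block.call(value)
  end
end

p sketch_map_ast({ start: "42", end: %w[44 45] }, &:to_i)

# String#to_i is lenient, Kernel#Integer is strict:
p "4x".to_i  # leading digits only
p "abc".to_i # no digits at all
begin
  Integer("4x")
rescue ArgumentError => e
  puts "strict conversion failed: #{e.message}"
end
```

Because String#to_i never raises, a pattern like (.*) combined with &:to_i silently accepts non numeric input. Capturing (\\d+) instead, or converting with Integer, are two ways to be stricter.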
Context: Knowing When To Stop
An argument parser that respects itself provides a means to end argument parsing even if more matches follow.
An example of that is the POSIX argument --.
We can use whatever we want in args_parser; here is a variation:
Given the specification
let :args do
{
width: "w:(\\d+)",
height: "h:(\\d+)",
__stop: "(::)"
}
end
let(:wh_parser) {args_parser(args, stop: :__stop, &:to_i)}
Then parsing the following input
input = %w[h:42 w:73 :: w:74]
ast = {height: 42, width: 73}
assert_parse_success(wh_parser, input, ast:, rest: %w[w:74])
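The underlying convention is easy to show in isolation. The following plain Ruby sketch splits a token list at the first stop token; sketch_split_at_stop is a hypothetical helper that does not use L43Peg at all, and it assumes that the stop token itself is dropped from both halves, as the rest in the speculation above suggests.

```ruby
# Hypothetical helper, NOT part of L43Peg: split tokens at the first stop
# token. The stop token itself is dropped from both halves.
def sketch_split_at_stop(tokens, stop: "--")
  index = tokens.index(stop)
  return [tokens.dup, []] unless index # no stop token: everything is an argument

  [tokens[0...index], tokens[(index + 1)..]]
end

args, rest = sketch_split_at_stop(%w[-v --out=x -- --not-an-option file])
p args
p rest

# The same idea with the custom stop token used above
p sketch_split_at_stop(%w[h:42 w:73 :: w:74], stop: "::")
```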
Context: User Interface
Context: Exposing the args_parser
Above we had to include an internal module to get access to the args_parser.
Client code might not want to use such intrusive includes, and therefore the parsers are also exposed
as module methods.
Given an exposed args_parser
let :parser do
L43Peg::Parsers.args_parser(
{
negative: "(-\\d+)",
positive: "\\+?(\\d+)"
},
&:to_i
)
end
But we are also not interested in the internal representation of parse success and failure that was
used in the speculations above. Nor do we want to transform our input into the internal representation,
as was done above by the helpers. (If you need to see the details, inspect the
file parser_test.rb in spec/support.)
Then we can use the interface of L43Peg
L43Peg.parse_tokens(parser, %w[43 -44 +45]) => :ok, result
expect(result).to eq(positive: [43, 45], negative: -44)
And if we get an error the result is as follows
parser = L43Peg::Parsers.char_parser('a')
L43Peg.parse_string(parser, 'b') => :error, message
expect(message).to eq("char \"b\"")
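The rightward pattern match used above raises NoMatchingPatternError when the result has a different shape, so client code that must handle both outcomes may prefer case/in. The sketch below shows this with a stand-in; sketch_parse_int and describe are hypothetical helpers, and the only assumption about L43Peg is the one visible above, namely that results deconstruct like [:ok, value] or [:error, message].

```ruby
# Hypothetical stand-in for a parse function, NOT part of L43Peg: it returns
# [:ok, value] on success and [:error, message] on failure.
def sketch_parse_int(string)
  [:ok, Integer(string, 10)]
rescue ArgumentError
  [:error, "not an integer: #{string.inspect}"]
end

# Handle both outcomes with one case/in expression
def describe(outcome)
  case outcome
  in [:ok, value] then "parsed #{value}"
  in [:error, message] then "failed: #{message}"
  end
end

puts describe(sketch_parse_int("42"))
puts describe(sketch_parse_int("b"))
```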
Context: Regexp Parser
The basic concept is the rgx_parser
Given a rgx_parser for an identifier
include L43Peg::Parsers
let(:id_parser) { rgx_parser("[[:alpha:]][_[:alnum:]]*") }
Then we can parse strings that start as such
assert_parse_success(id_parser, "l43_peg", ast: "l43_peg")
And we can discard some input from the ast with the aid of captures
sym_parser = rgx_parser(":([[:alpha:]][_[:alnum:]]*)")
assert_parse_success(sym_parser, ":no_colon", ast: "no_colon")
But it can also fail
reason = "input does not match /\\A[[:alpha:]][_[:alnum:]]*/ (in rgx_parser(\"[[:alpha:]][_[:alnum:]]*\"))"
assert_parse_failure(id_parser, "42", reason:)
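The three speculations above can be mimicked with Ruby's Regexp alone. This is a sketch under assumptions and not the source of rgx_parser: sketch_rgx_parse is a hypothetical helper, the result triples are invented for the illustration, and it assumes that the first capture group, when present, replaces the whole match as the ast.

```ruby
# Hypothetical helper, NOT part of L43Peg: match a pattern at the start of
# the input only and return the ast together with the unconsumed rest.
def sketch_rgx_parse(source, input)
  # \A anchors at the start of the input; the non-capturing group keeps the
  # anchor valid for sources that contain alternatives
  rgx = Regexp.new("\\A(?:#{source})")
  match = rgx.match(input)
  return [:error, "input does not match #{rgx.inspect}"] unless match

  # Assumption: the first capture, if any, replaces the whole match
  ast = match.captures.empty? ? match[0] : match[1]
  [:ok, ast, match.post_match]
end

p sketch_rgx_parse("[[:alpha:]][_[:alnum:]]*", "l43_peg")
p sketch_rgx_parse(":([[:alpha:]][_[:alnum:]]*)", ":no_colon and more")
p sketch_rgx_parse("[[:alpha:]][_[:alnum:]]*", "42")
```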
Context: Warnings on empty matches
Oftentimes bugs in PEG parsing are caused by zero-width matches. With the many and opt (or maybe) combinators this is quite obvious
(N.B. opt and maybe are not yet implemented, use many(max: 1) instead),
and the common usage patterns of these combinators are safe.
However, regular expression parsing might hide zero-width matches, and that's why they trigger a warning by default.
Given a rgx_parser that can match the empty string
let(:empty_parser) { rgx_parser("a*") }
Then we get a warning when matching an empty string
expect { assert_parse_success(empty_parser, "", ast: "") }
.to output("Warning, parser rgx_parser(\"a*\") succeeds with empty match\n").to_stderr
However this behavior can also be disabled
And therefore
parser = rgx_parser("a*", warn_on_empty: false)
expect { assert_parse_success(parser, "", ast: "") }
.not_to output.to_stderr
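A zero-width match is dangerous because a parser that succeeds without consuming input never advances when it is repeated. The check itself is small, as the following plain Ruby sketch shows. sketch_match is a hypothetical helper, not the rgx_parser; its warn_on_empty keyword only imitates the option seen above, and the err keyword is an invention of the sketch to make the output capturable.

```ruby
require "stringio"

# Hypothetical helper, NOT part of L43Peg: match at the start of the input
# and warn when the match succeeds without consuming anything.
def sketch_match(source, input, warn_on_empty: true, err: $stderr)
  match = Regexp.new("\\A(?:#{source})").match(input)
  return nil unless match

  if warn_on_empty && match[0].empty?
    err.puts "Warning, /#{source}/ succeeds with empty match"
  end
  match[0]
end

log = StringIO.new
sketch_match("a*", "bbb", err: log)                       # matches "" in front of "bbb"
sketch_match("a*", "bbb", warn_on_empty: false, err: log) # same match, warning disabled
sketch_match("a+", "bbb", err: log)                       # no match at all, hence no warning
puts log.string
```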
Context: Tokenize Strings with Regexen
Now we can use a list of rgx_parsers to tokenize a string (in the same way we can use tokens_parser to
quantify elements of an array, but with dynamic bounds)
Given some regexen
let :regexen do
[
[:verb, "<<", nil, ->(*){ [:verb, "<"] }],
[:verb, "\\$(\\$)"],
[:color_and_style, "<(.+?),(.+?)>", :all],
[:color, "<(.+?)>", 1],
[:reset, "\\$"],
[:verb, "[^<$]+"],
]
end
let(:tokenizer) { L43Peg::Combinators.rgx_tokenize(regexen) }
Then we can tokenize some inputs
input = "<red,bold>HELLO$and<<<green>$$<reset>"
ast = [
[:color_and_style, ["<red,bold>", "red", "bold"]],
[:verb, "HELLO"],
[:reset, "$"],
[:verb, "and"],
[:verb, "<"],
[:color, "green"],
[:verb, "$"],
[:color, "reset"]
]
assert_parse_success(tokenizer, input, ast:)
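The rule format above can be read as an ordered list in which the first matching rule wins. The following plain Ruby sketch reproduces the ast of the speculation under that reading. It is an interpretation, not the code of rgx_tokenize: sketch_rgx_tokenize is a hypothetical helper, and the meaning of the third element (:all for the whole match data, an integer for one capture, nil for the first capture or else the whole match) and of the fourth element (a callable that builds the complete token) is inferred from the expected ast only.

```ruby
# Hypothetical helper, NOT part of L43Peg: tokenize input with an ordered
# list of [name, source, capture, transform] rules; the first match wins.
def sketch_rgx_tokenize(rules, input)
  compiled = rules.map do |name, source, capture, transform|
    [name, Regexp.new("\\A(?:#{source})"), capture, transform]
  end
  tokens = []
  rest = input
  until rest.empty?
    match = nil
    name, _, capture, transform = compiled.find { |_, rgx, *| match = rgx.match(rest) }
    return [:error, "no rule matches #{rest.inspect}"] unless match
    # A zero-width match would never advance, so refuse it
    raise ArgumentError, "empty match for #{name.inspect}" if match[0].empty?

    value =
      case capture
      when :all then match.to_a
      when Integer then match[capture]
      else match.captures.empty? ? match[0] : match[1]
      end
    tokens << (transform ? transform.call(match) : [name, value])
    rest = match.post_match
  end
  [:ok, tokens]
end

rules = [
  [:verb, "<<", nil, ->(*) { [:verb, "<"] }],
  [:verb, "\\$(\\$)"],
  [:color_and_style, "<(.+?),(.+?)>", :all],
  [:color, "<(.+?)>", 1],
  [:reset, "\\$"],
  [:verb, "[^<$]+"]
]
p sketch_rgx_tokenize(rules, "<red,bold>HELLO$and<<<green>$$<reset>")
```

The order of the rules matters: the escapes << and $$ come before the rules for a single < or $, otherwise they could never match.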
Context: Debugging
As parsers are, by design, nested functions, debugging is not always simple.
Enter the debug_parser, a parser that debugs parsers without changing their behavior,
by displaying more or less detailed information.
Given a parser
include L43Peg::Combinators
let :args do
{
lat: "lat:(\\d+)",
long: "long:(\\d+)",
}
end
let(:geo_parser) {args_parser(args, &:to_i)}
Context: Minimum level of information
Given a minimum debug parser
let(:debugger) {debug_parser(geo_parser, level: :min)}
Then we will get some output
expected =
"Tokens<[\"lat:43\", \"long:2\"]>\nSuccess: @1\n"
expect { parsed_success(debugger, ["lat:43", "long:2"]) }
.to output(expected).to_stderr
Context: Default level of information
Given a default debug parser
let(:debugger) {debug_parser(char_parser("a"))}
Then we will get some output on errors
expected = "Input<\"b\"@1:1>\nFailure: char \"b\" @[1, 1]\n"
expect { parsed_failure(debugger, "b") }
.to output(expected).to_stderr
Context: Maximum level of information
Given a maximum level parser
let(:max_debugger) { debug_parser(char_parser("b"), level: :max) }
Then we will get this output on success
expected =
[
"================================================================================",
'Input<col:1 input:"bc" lnb:1 context:{}>',
"================================================================================",
'Success<ast:"b" cache:{} rest:"c">',
"================================================================================",
""
].join("\n")
expect { parsed_success(max_debugger, "bc") }
.to output(expected).to_stderr
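The principle behind such a debugging parser is a wrapper that reports what goes in and what comes out while returning the wrapped result untouched. Here is a minimal sketch of that principle with lambdas as parsers. sketch_debug and the result triples are hypothetical and do not reflect L43Peg's internal representation or the output format of debug_parser.

```ruby
require "stringio"

# Hypothetical helper, NOT part of L43Peg: wrap a callable parser so that
# input and result are reported on err, leaving the result unchanged.
def sketch_debug(parser, err: $stderr)
  lambda do |input|
    err.puts "Input<#{input.inspect}>"
    result = parser.call(input)
    err.puts "Result<#{result.inspect}>"
    result
  end
end

# A toy parser for the single character "b" (an assumption of this sketch)
char_b = lambda do |input|
  input.start_with?("b") ? [:ok, "b", input[1..]] : [:error, "char #{input[0].inspect}"]
end

log = StringIO.new
debugged = sketch_debug(char_b, err: log)
p debugged.call("bc") == char_b.call("bc")
puts log.string
```

Since the wrapper has the same calling convention as the parser it wraps, it can be inserted at any level of a nested parser without touching the surrounding code.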
Author
Copyright © 2024 Robert Dober robert.dober@gmail.com
LICENSE
GNU AFFERO GENERAL PUBLIC LICENSE, Version 3, 19 November 2007. Please refer to LICENSE for details.
<!-- SPDX-License-Identifier: AGPL-3.0-or-later -->