L43Peg

A Parse Expression Grammar library for Ruby

This version (v0.1.x) is alpha quality: many PEG features are missing, such as recursion and even alternatives.

It is released nonetheless because it already offers convenient parsing of ARGV, as demonstrated by the following
[speculations](https://rubygems.org/gems/speculate_about)

See [README_spec.rb](spec/speculations/README_spec.rb) for the generated code.

Context: `arg_parser`

Given the following argument specification

    include L43Peg::Combinators
    let :args_spec do
    {
      start: "--start=(.*)",
      end: "(?:--end|-e)=(.*)",
      kwd: "--(alpha|beta|gamma)"
    }
    end

And the associated parser

    let(:parser) { args_parser(args_spec) }

Then we can parse some input

    assert_parse_success(parser, %w[--start=42 --beta -e=44], ast: {start: "42", kwd: "beta", end: "44"}, rest: [])

And we can get the rest in a list of tokens

    assert_parse_success(parser, %w[--start=42 --beta -e=44 -s=not_an_arg --end=too_late], ast: {start: "42", kwd: "beta", end: "44"}, rest: %w[-s=not_an_arg --end=too_late])

Also note that multiple values are passed into an array

    input = %w[--end=42 --beta -e=44 --beta --end=not_too_late --gamma]
    ast   = {end: %w[42 44 not_too_late], kwd: %w[beta beta gamma]}
    assert_parse_success(parser, input, ast:, rest: [])
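
The behavior shown so far (one capture per key, repeated keys gathered into an array, parsing stopping at the first token that matches nothing) can be approximated in plain Ruby. This is only a minimal sketch, not the L43Peg implementation: `sketch_parse_args` is a hypothetical name, and anchoring each pattern to the whole token is an assumption.

```ruby
# Hypothetical stand-in for args_parser; it does not use L43Peg at all.
# spec maps a key to a regex source, tokens is an ARGV-like array of strings.
# Returns [ast, rest].
def sketch_parse_args(spec, tokens)
  ast = {}
  tokens.each_with_index do |token, idx|
    match = nil
    # Assumption: a pattern must match the whole token
    name, _src = spec.find { |_key, src| match = /\A(?:#{src})\z/.match(token) }
    # The first token that no pattern accepts ends parsing; it starts the rest
    return [ast, tokens[idx..]] unless name

    value = match[1] || match[0]
    # A repeated key turns the stored value into an array of all values
    ast[name] = ast.key?(name) ? [*ast[name], value] : value
  end
  [ast, []]
end
```

Called with the `args_spec` hash from above and `%w[--start=42 --beta -e=44 -s=not_an_arg]`, the sketch yields the same kind of `ast` and `rest` pair that the assertions describe.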

Context: Postprocessing

When we map the parser

    let :int_args do
      {
        start: "--start=(.*)",
        end: "--end=(.*)",
        inc: "--inc=(.*)"
      }
    end
    let(:int_arg_parser) { args_parser(int_args, name: "int parser", &:to_i) }

Then we can convert the string values

    assert_parse_success(int_arg_parser, %w[--start=42 --end=44 --inc=2], ast: {start: 42, end: 44, inc: 2}, rest: [])

Context: Knowing When To Stop

A self-respecting argument parser provides a means to end argument parsing even if more matches follow. An example of this is the POSIX `--` argument.

We can use whatever we want in args_parser, here is a variation:

Given the specification

    let :args do
      {
        width: "w:(\\d+)",
        height: "h:(\\d+)",
        __stop: "(::)"
      }
    end
    let(:wh_parser) { args_parser(args, stop: :__stop, &:to_i) }

Then parsing the following input

    input = %w[h:42 w:73 :: w:74]
    ast   = {height: 42, width: 73}
    assert_parse_success(wh_parser, input, ast:, rest: %w[w:74])
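
How a stop key and a converting block interact can be illustrated with a small standalone function. This is a sketch under assumptions, not the library code: `sketch_parse_until_stop` is a hypothetical name, the stop token is assumed to be consumed without entering the result, and repeated keys simply overwrite each other to keep the example short.

```ruby
# Hypothetical helper; it mimics the stop: option and the mapping block only.
def sketch_parse_until_stop(spec, tokens, stop:, &convert)
  ast = {}
  tokens.each_with_index do |token, idx|
    name, src = spec.find { |_key, s| /\A(?:#{s})\z/.match?(token) }
    # An unknown token ends parsing and stays in the rest
    return [ast, tokens[idx..]] unless name
    # The stop token ends parsing too, but is dropped from the rest (assumption)
    return [ast, tokens[(idx + 1)..]] if name == stop

    raw = token.match(/\A(?:#{src})\z/)[1]
    ast[name] = convert ? convert.call(raw) : raw
  end
  [ast, []]
end
```

With the specification above, `sketch_parse_until_stop(args, %w[h:42 w:73 :: w:74], stop: :__stop, &:to_i)` stops at `::` and leaves `w:74` unparsed.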

Context: User Interface

Context: Exposing the args_parser

Above we had to include an internal module to get access to args_parser. Client code might not want to include such intrusive methods, and therefore the parsers are also exposed as module methods

Given an exposed args_parser

    let :parser do
      L43Peg::Parsers.args_parser(
        {
          negative: "(-\\d+)",
          positive: "\\+?(\\d+)"
        },
        &:to_i
      )
    end

But we are also not interested in the internal representation of success and failure of parsing which was used in the speculations above. Nor do we want to transform our input into the internal representations as was done above by the helpers. (If you need to see the details of this you can inspect the file parser_test.rb in spec/support)

Then we can use the interface of L43Peg

    L43Peg.parse_tokens(parser, %w[43 -44 +45]) => :ok, result
    expect(result).to eq(positive: [43, 45], negative: -44)

And if we get an error the result is as follows

    parser = L43Peg::Parsers.char_parser('a')
    L43Peg.parse_string(parser, 'b') => :error, message
    expect(message).to eq("char \"b\"")
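
The `=> :ok, result` syntax used above is Ruby's rightward pattern matching, which raises `NoMatchingPatternError` when the value does not fit the pattern. The following sketch shows the same idea with `case`/`in` on a status and value pair. It does not call L43Peg; `sketch_parse_int` and `sketch_describe` are hypothetical helpers, and the two-element array shape is an assumption made for the illustration.

```ruby
# Hypothetical function returning [:ok, value] or [:error, message]
def sketch_parse_int(str)
  [:ok, Integer(str, 10)]
rescue ArgumentError
  [:error, "not an integer: #{str.inspect}"]
end

# Branch on the status symbol and bind the payload in one step
def sketch_describe(str)
  case sketch_parse_int(str)
  in [:ok, value] then "parsed #{value}"
  in [:error, message] then "failed: #{message}"
  end
end
```

Using `case`/`in` instead of `=>` lets client code handle both outcomes without rescuing an exception.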

Context: Regexp Parser

The basic concept is the rgx_parser

Given a rgx_parser for an identifier

    include L43Peg::Parsers
    let(:id_parser) { rgx_parser("[[:alpha:]][_[:alnum:]]*") }

Then we can parse strings that start as such

    assert_parse_success(id_parser, "l43_peg", ast: "l43_peg")

And we can discard some input from the ast with the aid of captures

    sym_parser = rgx_parser(":([[:alpha:]][_[:alnum:]]*)")
    assert_parse_success(sym_parser, ":no_colon", ast: "no_colon")

But it can also fail

    reason = "input does not match /\\A[[:alpha:]][_[:alnum:]]*/ (in rgx_parser(\"[[:alpha:]][_[:alnum:]]*\"))"
    assert_parse_failure(id_parser, "42", reason:)
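
The three cases above (whole match, first capture, failure) can be reproduced with a few lines of plain Ruby. This is a sketch of the idea, not the rgx_parser source: `sketch_rgx_parse` is a hypothetical name, and the return shape of `[:ok, ast, rest]` or `[:error, reason]` is an assumption chosen for the illustration.

```ruby
# Hypothetical regex "parser": matches only at the start of the input
def sketch_rgx_parse(source, input)
  rgx = /\A#{source}/
  match = rgx.match(input)
  return [:error, "input does not match #{rgx.inspect}"] unless match

  # The first capture group, if present, replaces the whole match as the ast
  [:ok, match[1] || match[0], match.post_match]
end
```

Because the source is interpolated right after `\A`, a top-level alternation such as `"a|b"` would only anchor its first branch; wrapping the source in `(?:...)` would avoid that in a real implementation.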

Context: Warnings on empty matches

Bugs in PEG parsing are often caused by zero-width matches. The risk is quite obvious with the many, opt, or maybe combinators (N.B. opt and maybe are not yet implemented; use many(max: 1) instead), and the common usage patterns for these combinators are safe.

However, regular expression parsing might hide zero-width matches, and that's why they trigger a warning by default

Given an empty match regex parser

    let(:empty_parser) { rgx_parser("a*") }

Then we get a warning when matching an empty string

    expect { assert_parse_success(empty_parser, "", ast: "") }
      .to output("Warning, parser rgx_parser(\"a*\") succeeds with empty match\n").to_stderr

However this behavior can also be disabled

And therefore

    parser = rgx_parser("a*", warn_on_empty: false)
    expect { assert_parse_success(parser, "", ast: "") }
      .not_to output.to_stderr
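
The warning logic itself is small. The sketch below shows one way it could look; it is not the library code. `sketch_rgx_parse_warning` is a hypothetical name, only the `warn_on_empty:` keyword and the warning text are taken from the examples above, and writing through `$stderr.puts` is an assumption.

```ruby
# Hypothetical regex parser that reports zero-width successes on stderr
def sketch_rgx_parse_warning(source, input, warn_on_empty: true)
  match = /\A#{source}/.match(input)
  return [:error, "no match"] unless match

  if warn_on_empty && match[0].empty?
    $stderr.puts "Warning, parser rgx_parser(#{source.inspect}) succeeds with empty match"
  end
  [:ok, match[0], match.post_match]
end
```

An empty match is still a success here; the warning only makes it visible, which matters once such a parser is repeated by a combinator.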

Context: Tokenize Strings with Regexen

Now we can use a list of rgx_parsers to tokenize a string (in the same way we can use tokens_parser to quantify elements of an array, but with dynamic bounds)

Given some regexen

    let :regexen do
      [
        [:verb, "<<", nil, ->(*){ [:verb, "<"] }],
        [:verb, "\\$(\\$)"],
        [:color_and_style, "<(.+?),(.+?)>", :all],
        [:color, "<(.+?)>", 1],
        [:reset, "\\$"],
        [:verb, "[^<$]+"],
      ]
    end
    let(:tokenizer) { L43Peg::Combinators.rgx_tokenize(regexen) }

Then we can tokenize some inputs

    input = "<red,bold>HELLO$and<<<green>$$<reset>"
    ast = [
      [:color_and_style, ["<red,bold>", "red", "bold"]],
      [:verb, "HELLO"],
      [:reset, "$"],
      [:verb, "and"],
      [:verb, "<"],
      [:color, "green"],
      [:verb, "$"],
      [:color, "reset"]
    ]
    assert_parse_success(tokenizer, input, ast:)
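
The rule format can be read off the example: each rule is `[name, regex source, capture selector, transformer]`, rules are tried in order, and the first one that matches at the current position produces a token. The sketch below implements that reading in plain Ruby. It is an interpretation, not the rgx_tokenize source: `sketch_tokenize` is a hypothetical name, and the meaning assigned to the selector (`:all` for the full match array, an integer for one group, otherwise the first group or the whole match) is inferred from the expected output above.

```ruby
# Hypothetical tokenizer following the rule format inferred from the example
def sketch_tokenize(rules, input)
  tokens = []
  rest = input
  until rest.empty?
    token = nil
    rules.each do |name, source, select, transform|
      match = /\A(?:#{source})/.match(rest)
      # Skip rules that fail or match nothing, so every step consumes input
      next unless match && !match[0].empty?

      value =
        case select
        when :all then match.to_a          # whole match followed by all groups
        when Integer then match[select]    # one specific group
        else match[1] || match[0]          # first group if any, else whole match
        end
      # A transformer replaces the default [name, value] token
      token = transform ? transform.call(value) : [name, value]
      rest = match.post_match
      break
    end
    return [:error, "no rule matches #{rest.inspect}"] unless token

    tokens << token
  end
  [:ok, tokens]
end
```

Rule order matters: `"<<"` has to come before the color rules, and `"\\$(\\$)"` before `"\\$"`, otherwise the shorter pattern would win.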

Context: Debugging

As parsers are by design nested functions, debugging is not always simple. Enter the debug_parser, a parser that debugs other parsers without changing their behavior, displaying more or less detailed information

Given a parser

    include L43Peg::Combinators
    let :args do
      {
        lat: "lat:(\\d+)",
        long: "long:(\\d+)",
      }
    end
    let(:geo_parser) { args_parser(args, &:to_i) }

Context: Minimum level of information

Given a minimum debug parser

    let(:debugger) { debug_parser(geo_parser, level: :min) }

Then we will get some output

    expected =
        "Tokens<[\"lat:43\", \"long:2\"]>\nSuccess: @1\n"
    expect { parsed_success(debugger, ["lat:43", "long:2"]) }
        .to output(expected).to_stderr

Context: Default level of information

Given a default debug parser

    let(:debugger) {debug_parser(char_parser("a"))}

Then we will get some output on errors

    expected ="Input<\"b\"@1:1>\nFailure: char \"b\" @[1, 1]\n"
    expect { parsed_failure(debugger, "b") }
        .to output(expected).to_stderr

Context: Maximum level of information

Given a maximum level parser

    let(:max_debugger) { debug_parser(char_parser("b"), level: :max) }

Then we will get this output on success

    expected =
      [
        "================================================================================",
        'Input<col:1 input:"bc" lnb:1 context:{}>',
        "================================================================================",
        'Success<ast:"b" cache:{} rest:"c">',
        "================================================================================",
        ""
      ].join("\n")

    expect { parsed_success(max_debugger,"bc") }
      .to output(expected).to_stderr
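
The principle behind such a debugging wrapper is that it calls the wrapped parser, reports what went in and what came out, and returns the result untouched. The sketch below shows this with parsers modelled as lambdas. It is not the debug_parser implementation: `sketch_debug`, the lambda-based parser model, the result shapes, and the output format are all assumptions made for the illustration.

```ruby
# Hypothetical model: a parser is a lambda returning
# [:ok, ast, rest] or [:error, message]
def sketch_debug(parser, out: $stderr)
  lambda do |input|
    out.puts "Input<#{input.inspect}>"
    result = parser.call(input)
    case result
    in [:ok, ast, rest]
      out.puts "Success<ast:#{ast.inspect} rest:#{rest.inspect}>"
    in [:error, message]
      out.puts "Failure: #{message}"
    end
    # The wrapped parser's result is passed through unchanged
    result
  end
end

# Hypothetical single character parser used to try the wrapper
SKETCH_CHAR_B = lambda do |input|
  input.start_with?("b") ? [:ok, "b", input[1..]] : [:error, "expected \"b\""]
end
```

Since the wrapper has the same call signature as the parser it wraps, it can be placed at any nesting depth without touching the surrounding combinators.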

Author

LICENSE

GNU AFFERO GENERAL PUBLIC LICENSE, Version 3, 19 November 2007. Please refer to LICENSE for details.