LexerKit

A high-performance lexer toolkit for Ruby. Define tokenizers with a Ruby DSL and run them through a Rust native extension.

Features

  • DSL-based lexer definition
  • Fast stream lexing with minimal allocation
  • On-demand token object creation for diagnostics
  • Compiled lexer serialization
  • Regex-based token patterns compiled to DFA

Installation

Add the gem to your Gemfile:

# Gemfile
gem "lexer_kit"

Then install it:

bundle install

Quick Start

require "lexer_kit"

lexer = LexerKit.build do
  token :NUMBER, /[0-9]+/
  token :PLUS,   "+"
  token :MINUS,  "-"
  token :SPACE,  /[ \t\r\n]+/, skip: true
end.compile

stream = lexer.stream("12 + 34 - 5")
until stream.eof?
  puts "#{stream.token_name}: #{stream.text.inspect}"
  stream.advance
end

Core DSL

token

token :IDENT,  /[a-zA-Z_][a-zA-Z0-9_]*/
token :ARROW,  "->"
token :SPACE,  /[ \t]+/, skip: true
token :DQUOTE, '"', push: :string
token :END_Q,  '"', pop: true

Options:

  • skip: true matches the token but does not emit it
  • push: :mode_name pushes a mode
  • pop: true pops the current mode

keyword / define_keywords

token :IDENT, /[a-z_]+/
keyword :IF, "if"
define_keywords :else, :while, :return
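
One common way for keyword rules to coexist with a broader identifier rule is to match the identifier first and then promote exact matches through a lookup table. The sketch below illustrates that general idea in plain Ruby. It is an assumption about the technique, not a description of how keyword or define_keywords work internally, and KEYWORDS, classify_word, and the upper-case token names are hypothetical.

```ruby
# Hypothetical keyword table: lexeme => token name.
KEYWORDS = {
  "if" => :IF,
  "else" => :ELSE,
  "while" => :WHILE,
  "return" => :RETURN
}.freeze

# Classify a word that an identifier pattern such as /[a-z_]+/ already matched.
# Exact keyword lexemes are promoted; everything else stays an identifier.
def classify_word(word)
  KEYWORDS.fetch(word, :IDENT)
end

%w[if iffy return_value while].each do |word|
  puts "#{word}: #{classify_word(word)}"
end
```

Because the lookup is on the whole matched word, "iffy" and "return_value" stay identifiers even though they begin with a keyword.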

mode

LexerKit.build do
  token :DQUOTE, '"', push: :string
  token :IDENT,  /[a-z]+/

  mode :string do
    token :CONTENT, /[^"\\]+/
    token :ESCAPE,  /\\./
    token :DQUOTE,  '"', pop: true
  end
end
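
To make the push/pop behavior concrete, here is a minimal sketch of a mode stack built on Ruby's standard StringScanner. It does not use LexerKit and does not claim to mirror its internals: RULES and lex_with_modes are hypothetical names, a :SPACE rule is added for the example, and rules are tried in order (first match wins) rather than by longest match.

```ruby
require "strscan"

# Hypothetical rule tables: mode => [[token_name, pattern, action], ...]
RULES = {
  default: [
    [:DQUOTE, /"/, [:push, :string]],
    [:IDENT, /[a-z]+/, nil],
    [:SPACE, /[ \t]+/, :skip]
  ],
  string: [
    [:CONTENT, /[^"\\]+/, nil],
    [:ESCAPE, /\\./, nil],
    [:DQUOTE, /"/, :pop]
  ]
}.freeze

def lex_with_modes(input)
  scanner = StringScanner.new(input)
  modes = [:default] # the top of the stack selects the active rule set
  tokens = []
  until scanner.eos?
    rule = RULES.fetch(modes.last).find { |_name, pattern, _action| scanner.scan(pattern) }
    if rule.nil?
      tokens << [:INVALID, scanner.getch] # consume one character and continue
      next
    end
    name, _pattern, action = rule
    tokens << [name, scanner.matched] unless action == :skip
    case action
    when :pop then modes.pop
    when Array then modes.push(action.last)
    end
  end
  tokens
end

lex_with_modes('say "hi\n"').each { |name, text| puts "#{name}: #{text.inspect}" }
```

The same `"` character produces a push in the default mode and a pop in the string mode, which is why the rule tables are keyed by mode.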

scan_until / delimited

scan_until :BLOCK_COMMENT, open: "/*", close: "*/", skip: true

delimited :TEXT, delimiter: "{{" do
  token :IDENT, /[a-zA-Z_]+/
  token :DOT,   "."
  token :CLOSE, "}}", pop: true
end
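
The scan_until rule above describes a token that runs from an opening delimiter to the next closing delimiter. A rough equivalent can be sketched with the standard library's StringScanner. This is only an illustration of the idea: scan_block_comment is a hypothetical helper, and how LexerKit treats an unterminated comment is not stated here, so the sketch simply returns nil and restores the position.

```ruby
require "strscan"

# Consume a delimited run starting at the scanner's current position.
# Returns the full text including both delimiters, or nil when no run
# starts here or the closing delimiter is missing.
def scan_block_comment(scanner, open: "/*", close: "*/")
  start = scanner.pos
  return nil unless scanner.scan(/#{Regexp.escape(open)}/)

  unless scanner.scan_until(/#{Regexp.escape(close)}/)
    scanner.pos = start # unterminated: leave the scanner where it was
    return nil
  end
  # StringScanner#pos is a byte position, so slice by bytes.
  scanner.string.byteslice(start, scanner.pos - start)
end

scanner = StringScanner.new("/* note */ x")
puts scan_block_comment(scanner).inspect
puts scanner.rest.inspect
```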

utf8_range

token :HIRAGANA, LexerKit.utf8_range(0x3040..0x309F)
token :CJK,      LexerKit.utf8_range(0x4E00..0x9FFF)
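
A code point range matters because a lexer that works on bytes sees each of these characters as a multi-byte UTF-8 sequence. The sketch below uses plain Ruby to show the byte sequences at both ends of the :CJK range. Whether utf8_range expands ranges into byte-level patterns in exactly this way is an assumption, and utf8_hex is a hypothetical helper.

```ruby
# Return the UTF-8 bytes of a single code point as upper-case hex strings.
def utf8_hex(codepoint)
  [codepoint].pack("U").bytes.map { |byte| format("%02X", byte) }
end

puts utf8_hex(0x4E00).join(" ") # first code point of the :CJK range
puts utf8_hex(0x9FFF).join(" ") # last code point of the :CJK range

# A character inside the range (U+4E2D) occupies three bytes.
puts "\u4E2D".bytesize
```

Both endpoints encode to three bytes (E4 B8 80 and E9 BF BF), so a byte-level matcher has to accept a set of three-byte sequences rather than a single character class.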

Regex Notes

  • Most common regex syntax is supported ([], quantifiers, groups, alternation, escapes, /.../i)
  • Backtracking-dependent features are not supported (lookaround, backreference, etc.)
  • Anchors and word-boundary assertions are not used in lexer matching
  • *?, +?, ?? are parsed but behave as longest-match (DFA behavior)
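
The last note is easiest to see by comparing against Ruby's own backtracking Regexp engine, which does honor lazy quantifiers. The sketch below uses only built-in Ruby; first_match is a hypothetical helper, and the comparison with LexerKit is based solely on the longest-match statement above.

```ruby
# Return the text a pattern matches at the first position where it can match.
def first_match(input, pattern)
  input[pattern]
end

# Ruby's engine stops a lazy quantifier as early as possible.
puts first_match("aaa", /a+?/).inspect

# The greedy form consumes the whole run. Under longest-match semantics,
# a lazy pattern would produce this result as well.
puts first_match("aaa", /a+/).inspect
```

A pattern ported from Ruby code that relies on lazy matching to stop early therefore needs to be rewritten, for example with a negated character class.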

Stream API and Error Handling

stream.start and stream.len are measured in bytes, not characters.

stream = lexer.stream(input)
until stream.eof?
  if stream.error?
    token = stream.make_token
    puts token.render_diagnostic("unexpected character")
  end
  stream.advance
end

LexerKit always falls back to :INVALID for unmatched input.
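
Because positions are counted in bytes, slicing the source with character-based indexing gives wrong results as soon as the input contains multi-byte characters. The sketch below shows the difference in plain Ruby, with the start and length values written out by hand rather than read from a stream. token_text and char_index are hypothetical helpers.

```ruby
# Extract a token's text from the source using a byte offset and byte length.
def token_text(source, start, len)
  source.byteslice(start, len)
end

# Convert a byte offset into a character index, e.g. for column reporting.
def char_index(source, byte_offset)
  source.byteslice(0, byte_offset).length
end

source = "\u00E9 + 12" # the first character occupies two bytes in UTF-8

puts token_text(source, 5, 2).inspect # byte-based slice of the number token
puts source[5, 2].inspect             # character-based slice of the same numbers
puts char_index(source, 5)            # character index of the number token
```

The byte-based slice returns the number "12", while the character-based slice with the same numbers returns only "2".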

Serialization

Pre-compile lexers for faster startup:

# builder is the object returned by LexerKit.build
lexer = builder.compile
LexerKit::Format::LKT1.save(lexer, path: "lexer.lkt1")
LexerKit::Format::LKB1.save(lexer, path: "lexer.lkb1")

Or compile from the command line:

lexer_kit compile lexer.rb -o lexer.lkt1

Load later:

lexer = LexerKit.load_lexer(File.expand_path("data/lexer.lkt1", __dir__))

Performance Snapshot

JSON benchmark (600KB input, project benchmark script):

  • LexerKit: 95.2 i/s
  • StringScanner: 4.8 i/s (about 20x slower)

License

MIT License