LexerKit

A high-performance lexer toolkit for Ruby. Define tokenizers with a Ruby DSL and run them through a Rust native extension.

Features

DSL-based lexer definition
Fast stream lexing with minimal allocation
On-demand token object creation for diagnostics
Compiled lexer serialization
Regex-based token patterns compiled to DFA

Installation

# Gemfile
gem "lexer_kit"

Then run:

bundle install

Quick Start

require "lexer_kit"

lexer = LexerKit.build do
  token :NUMBER, /[0-9]+/
  token :PLUS,   "+"
  token :MINUS,  "-"
  token :SPACE,  /[ \t\r\n]+/, skip: true
end.compile

stream = lexer.stream("12 + 34 - 5")
until stream.eof?
  puts "#{stream.token_name}: #{stream.text.inspect}"
  stream.advance
end

Core DSL

`token`

token :IDENT,  /[a-zA-Z_][a-zA-Z0-9_]*/
token :ARROW,  "->"
token :SPACE,  /[ \t]+/, skip: true
token :DQUOTE, '"', push: :string
token :END_Q,  '"', pop: true

Options:

skip: true matches the input but does not emit a token
push: :mode_name pushes the named mode onto the mode stack after the match
pop: true pops the current mode off the mode stack after the match

`keyword` / `define_keywords`

token :IDENT, /[a-z_]+/
keyword :IF, "if"
define_keywords :else, :while, :return

`mode`

LexerKit.build do
  token :DQUOTE, '"', push: :string
  token :IDENT,  /[a-z]+/

  mode :string do
    token :CONTENT, /[^"\\]+/
    token :ESCAPE,  /\\./
    token :DQUOTE,  '"', pop: true
  end
end
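To make the push/pop behavior concrete, here is a rough pure-Ruby sketch of a mode stack built on the standard library's StringScanner. It is not how LexerKit is implemented: the `RULES` table and `tokenize_with_modes` are hypothetical names, and the sketch takes the first matching rule rather than the longest match.

```ruby
require "strscan"

# Hypothetical rule table mirroring the example above: mode => [name, pattern, options].
RULES = {
  default: [
    [:DQUOTE, /"/,      { push: :string }],
    [:IDENT,  /[a-z]+/, {}]
  ],
  string: [
    [:CONTENT, /[^"\\]+/, {}],
    [:ESCAPE,  /\\./,     {}],
    [:DQUOTE,  /"/,       { pop: true }]
  ]
}.freeze

def tokenize_with_modes(input)
  scanner = StringScanner.new(input)
  modes = [:default] # the top of the stack selects the active rule set
  tokens = []
  until scanner.eos?
    # Try the active mode's rules in order; scan advances only on a match.
    name, _pattern, opts = RULES.fetch(modes.last).find { |_, pattern, _| scanner.scan(pattern) }
    raise ArgumentError, "no rule matches at byte #{scanner.pos}" unless name

    tokens << [name, scanner.matched]
    modes.push(opts[:push]) if opts[:push]
    modes.pop if opts[:pop]
  end
  tokens
end
```

Calling `tokenize_with_modes('say"hi\\n"end')` yields IDENT, DQUOTE, CONTENT, ESCAPE, DQUOTE, IDENT: the same `"` character opens the string in the default mode and closes it in the string mode.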

`scan_until` / `delimited`

scan_until :BLOCK_COMMENT, open: "/*", close: "*/", skip: true

delimited :TEXT, delimiter: "{{" do
  token :IDENT, /[a-zA-Z_]+/
  token :DOT,   "."
  token :CLOSE, "}}", pop: true
end

`utf8_range`

token :HIRAGANA, LexerKit.utf8_range("ぁ".."ん")
token :CJK,      LexerKit.utf8_range(0x4E00..0x9FFF)
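Both ranges above cover characters that take three bytes each in UTF-8. The plain-Ruby sketch below prints the byte encoding of the range endpoints; `utf8_hex` is a hypothetical helper, and how `utf8_range` expands a range internally is not described here.

```ruby
# Return the UTF-8 bytes of a single codepoint as uppercase hex strings.
def utf8_hex(codepoint)
  [codepoint].pack("U").bytes.map { |byte| format("%02X", byte) }
end

p utf8_hex("ぁ".ord) # U+3041
p utf8_hex("ん".ord) # U+3093
p utf8_hex(0x4E00)
p utf8_hex(0x9FFF)
```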

Regex Notes

Most common regex syntax is supported ([], quantifiers, groups, alternation, escapes, /.../i)
Backtracking-dependent features are not supported (lookaround, backreference, etc.)
Anchors and word-boundary assertions are not used in lexer matching
Lazy quantifiers (*?, +?, ??) are parsed but behave like their greedy forms, because the DFA always takes the longest match
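The last note matters most for comment-like patterns. The sketch below uses Ruby's own Regexp engine (not LexerKit) to show the two behaviors side by side: the lazy pattern stops at the first closing delimiter, while the greedy pattern returns the longest match, which is what a longest-match lexer would report for either spelling. The constant names are illustrative only.

```ruby
SOURCE = "/* a */ x /* b */"

# Ruby's backtracking engine honors the lazy quantifier: first comment only.
LAZY_MATCH = SOURCE[%r{/\*.*?\*/}]

# The greedy form runs to the last "*/" on the line.
GREEDY_MATCH = SOURCE[%r{/\*.*\*/}]

puts LAZY_MATCH   # /* a */
puts GREEDY_MATCH # /* a */ x /* b */
```

For delimited constructs like this, a form such as `scan_until` with explicit `open:` and `close:` strings is presumably the safer choice than a lazy regex.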

Stream API and Error Handling

stream.start and stream.len are measured in bytes, not characters.
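Because the positions are in bytes, slicing the input with character indexes goes wrong as soon as it contains multi-byte characters. A minimal plain-Ruby sketch of the conversion follows; `token_text` and `char_column` are hypothetical helpers, and the numbers passed in stand for values a stream might report.

```ruby
# Slice a token out of the input using a byte position and byte length.
def token_text(input, start, len)
  input.byteslice(start, len)
end

# Convert a byte position into a zero-based character index,
# e.g. for reporting a column to the user.
def char_column(input, start)
  input.byteslice(0, start).length
end

input = "ぁぁ+1" # each "ぁ" is 3 bytes, so "+" starts at byte 6
puts token_text(input, 6, 1)  # +
puts char_column(input, 6)    # 2
```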

stream = lexer.stream(input)
until stream.eof?
  if stream.error?
    token = stream.make_token
    puts token.render_diagnostic("unexpected character")
  end
  stream.advance
end

LexerKit always falls back to :INVALID for unmatched input.

Serialization

Pre-compile lexers for faster startup:

# builder is the object returned by LexerKit.build do ... end
lexer = builder.compile
LexerKit::Format::LKT1.save(lexer, path: "lexer.lkt1")
LexerKit::Format::LKB1.save(lexer, path: "lexer.lkb1")

Or compile from the command line:

lexer_kit compile lexer.rb -o lexer.lkt1

Load later:

lexer = LexerKit.load_lexer(File.expand_path("data/lexer.lkt1", __dir__))

Performance Snapshot

JSON benchmark (600 KB input, run with the project's benchmark script; i/s is iterations per second):

LexerKit: 95.2 i/s
StringScanner: 4.8 i/s (about 20x slower)
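For readers unfamiliar with the baseline, a hand-written StringScanner lexer looks roughly like the sketch below. It covers the Quick Start grammar, not the JSON grammar used in the benchmark, and it is not the project's benchmark script; `scan_tokens` is a hypothetical name.

```ruby
require "strscan"

# Hand-rolled tokenizer for the Quick Start grammar using the stdlib StringScanner.
def scan_tokens(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    if scanner.skip(/[ \t\r\n]+/)
      next # whitespace is skipped, like `skip: true`
    elsif (text = scanner.scan(/[0-9]+/))
      tokens << [:NUMBER, text]
    elsif scanner.skip(/\+/)
      tokens << [:PLUS, "+"]
    elsif scanner.skip(/-/)
      tokens << [:MINUS, "-"]
    else
      # Consume one character so the loop always makes progress.
      tokens << [:INVALID, scanner.getch]
    end
  end
  tokens
end

p scan_tokens("12 + 34 - 5")
```

Each rule is tried in turn at every position and every token allocates an array and usually a string, which is the kind of per-token overhead the figures above are measured against.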

License

MIT License