LexerKit
A high-performance lexer toolkit for Ruby. Define tokenizers with a Ruby DSL and run them through a Rust native extension.
Features
- DSL-based lexer definition
- Fast stream lexing with minimal allocation
- On-demand token object creation for diagnostics
- Compiled lexer serialization
- Regex-based token patterns compiled to DFA
Installation
# Gemfile
gem "lexer_kit"
bundle install
Quick Start
require "lexer_kit"
lexer = LexerKit.build do
  token :NUMBER, /[0-9]+/
  token :PLUS, "+"
  token :MINUS, "-"
  token :SPACE, /[ \t\r\n]+/, skip: true
end.compile
stream = lexer.stream("12 + 34 - 5")
until stream.eof?
  puts "#{stream.token_name}: #{stream.text.inspect}"
  stream.advance
end
Core DSL
token
token :IDENT, /[a-zA-Z_][a-zA-Z0-9_]*/
token :ARROW, "->"
token :SPACE, /[ \t]+/, skip: true
token :DQUOTE, '"', push: :string
token :END_Q, '"', pop: true
Options:
- skip: true skips emitting the token
- push: :mode_name pushes a mode
- pop: true pops the current mode
keyword / define_keywords
token :IDENT, /[a-z_]+/
keyword :IF, "if"
define_keywords :else, :while, :return
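One common way to combine an identifier rule with keywords is to match the identifier first and then look the text up in a keyword table. The sketch below shows that idea in plain Ruby. It is not LexerKit's implementation: the `KEYWORDS` table and the `classify_word` helper are hypothetical, and it assumes that `define_keywords :else` yields a token named `:ELSE` for the text "else" and that a keyword only wins when it matches the whole identifier.

```ruby
# Hypothetical keyword table; the entries mirror the DSL example above.
# Assumption: define_keywords :else maps the text "else" to :ELSE.
KEYWORDS = {
  "if" => :IF,
  "else" => :ELSE,
  "while" => :WHILE,
  "return" => :RETURN
}.freeze

IDENT_PATTERN = /\A[a-z_]+/

# Returns [token_name, text] for the identifier at the start of +source+,
# or nil when no identifier starts there.
def classify_word(source)
  text = source[IDENT_PATTERN]
  return nil unless text

  # A keyword is reported only when the entire identifier equals it,
  # so "iffy" stays an :IDENT even though it starts with "if".
  [KEYWORDS.fetch(text, :IDENT), text]
end
```

Under these assumptions, `classify_word("if x")` reports the keyword while `classify_word("iffy")` reports a plain identifier.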
mode
LexerKit.build do
  token :DQUOTE, '"', push: :string
  token :IDENT, /[a-z]+/
  mode :string do
    token :CONTENT, /[^"\\]+/
    token :ESCAPE, /\\./
    token :DQUOTE, '"', pop: true
  end
end
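To make the push/pop behavior concrete, here is a minimal plain-Ruby sketch of a mode stack built on the standard library's StringScanner. It is an illustration only, not how LexerKit works internally: `MODE_RULES` and `tokenize_with_modes` are hypothetical names, rules are tried in order (first match wins) rather than by DFA longest match, and unmatched characters are reported as `:INVALID` one character at a time.

```ruby
require "strscan"

# Hypothetical rule tables: [token_name, pattern, options].
# The rules mirror the mode example above.
MODE_RULES = {
  default: [
    [:DQUOTE, /"/, { push: :string }],
    [:IDENT, /[a-z]+/, {}]
  ],
  string: [
    [:CONTENT, /[^"\\]+/, {}],
    [:ESCAPE, /\\./, {}],
    [:DQUOTE, /"/, { pop: true }]
  ]
}.freeze

# Returns an array of [token_name, text] pairs.
def tokenize_with_modes(input)
  scanner = StringScanner.new(input)
  modes = [:default] # the top of the stack selects the active rule set
  tokens = []
  until scanner.eos?
    # First rule of the active mode that matches at the current position.
    rule = MODE_RULES.fetch(modes.last).find { |_name, pattern, _opts| scanner.scan(pattern) }
    if rule.nil?
      tokens << [:INVALID, scanner.getch] # consume one character and go on
      next
    end
    name, _pattern, opts = rule
    tokens << [name, scanner.matched]
    modes.push(opts[:push]) if opts[:push]
    modes.pop if opts[:pop] && modes.size > 1 # never pop the base mode
  end
  tokens
end
```

Calling `tokenize_with_modes('say"hi\n"ok')` switches to the `:string` rules at the opening quote and returns to the default rules at the closing quote.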
scan_until / delimited
scan_until :BLOCK_COMMENT, open: "/*", close: "*/", skip: true
delimited :TEXT, delimiter: "{{" do
  token :IDENT, /[a-zA-Z_]+/
  token :DOT, "."
  token :CLOSE, "}}", pop: true
end
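The idea behind an open/close rule such as `scan_until` can be sketched in plain Ruby as "find the opening delimiter, then search for the first closing delimiter after it". The helper below is hypothetical and only illustrates that reading; in particular, how LexerKit treats an unterminated span is not stated here, so the sketch simply returns nil in that case.

```ruby
# Returns the byte length of the span that starts with +open+ at byte offset
# +pos+ and ends with the first +close+ after it, or nil when +open+ is not
# at +pos+ or no +close+ follows. (Hypothetical helper, not LexerKit API.)
def scan_until_close(input, pos, open:, close:)
  bytes = input.b # binary copy, so every index below is a byte offset
  return nil unless bytes[pos, open.bytesize] == open.b

  # Start searching after the opener so "/*/" is not taken as a full comment.
  close_at = bytes.index(close.b, pos + open.bytesize)
  return nil unless close_at

  close_at + close.bytesize - pos
end
```

For the input `"a /* note */ b"`, calling the helper at byte offset 2 with `open: "/*"` and `close: "*/"` gives the length of the whole comment, which can then be skipped or emitted as one token.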
utf8_range
token :HIRAGANA, LexerKit.utf8_range("ぁ".."ん")
token :CJK, LexerKit.utf8_range(0x4E00..0x9FFF)
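Because stream offsets are measured in bytes, it can help to see which UTF-8 byte sequences bound a code point range. The helper below is a hypothetical illustration in plain Ruby (it does not call LexerKit and says nothing about how `utf8_range` is implemented); it accepts either a character range or an integer code point range, as in the two examples above.

```ruby
# Returns the UTF-8 encodings of both endpoints of +range+ as hex strings.
# Endpoints may be one-character strings or integer code points.
# (Hypothetical helper for illustration only.)
def utf8_bounds(range)
  [range.begin, range.end].map do |endpoint|
    codepoint = endpoint.is_a?(String) ? endpoint.ord : endpoint
    # pack("U") encodes one code point as UTF-8.
    [codepoint].pack("U").bytes.map { |byte| format("%02X", byte) }.join(" ")
  end
end

utf8_bounds("ぁ".."ん")       # => ["E3 81 81", "E3 82 93"]
utf8_bounds(0x4E00..0x9FFF)   # => ["E4 B8 80", "E9 BF BF"]
```

Both ranges lie entirely in the three-byte region of UTF-8, so every character they contain occupies three bytes of input.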
Regex Notes
- Most common regex syntax is supported ([], quantifiers, groups, alternation, escapes, /.../i)
- Backtracking-dependent features are not supported (lookaround, backreference, etc.)
- Anchors and word-boundary assertions are not used in lexer matching
- *?, +?, ?? are parsed but behave as longest-match (DFA behavior)
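The last note matters for patterns that rely on lazy quantifiers to stop early. The plain-Ruby comparison below uses Ruby's own backtracking regex engine, not LexerKit: the greedy pattern stands in for what longest-match semantics would select, and the constant and method names are hypothetical.

```ruby
LAZY_STRING    = /\A".*?"/   # in Ruby: stops at the first closing quote
GREEDY_STRING  = /\A".*"/    # stand-in for longest-match behavior
NEGATED_STRING = /\A"[^"]*"/ # no dependence on lazy vs. greedy

# Returns the text matched at the start of +input+, or nil.
def leading_match(pattern, input)
  input[pattern]
end

source = '"a" + "b"'
leading_match(LAZY_STRING, source)    # => "\"a\""
leading_match(GREEDY_STRING, source)  # => "\"a\" + \"b\""
leading_match(NEGATED_STRING, source) # => "\"a\""
```

Since lazy quantifiers are treated as longest-match, a negated character class such as `[^"]*` is the safer way to express "up to the next quote".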
Stream API and Error Handling
stream.start is a byte offset and stream.len is a byte length; neither counts characters.
stream = lexer.stream(input)
until stream.eof?
  if stream.error?
    token = stream.make_token
    puts token.render_diagnostic("unexpected character")
  end
  stream.advance
end
LexerKit always falls back to :INVALID for unmatched input.
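Since stream.start and stream.len are measured in bytes, slicing the input with character indexes gives wrong results as soon as the input contains multibyte characters. The sketch below shows the difference in plain Ruby; `token_text` and `char_index` are hypothetical helpers, and the start/len values are assumed to be plain integers taken from the stream.

```ruby
# Extracts the token text from +input+ using a byte offset and byte length.
def token_text(input, start, len)
  input.byteslice(start, len)
end

# Converts a byte offset into a character index, e.g. for column reporting.
# Assumes +start+ falls on a character boundary.
def char_index(input, start)
  input.byteslice(0, start).length
end

input = "é+1"            # "é" occupies two bytes in UTF-8
token_text(input, 2, 1)  # => "+"  (byte-based, correct)
input[2, 1]              # => "1"  (character-based, off by one here)
char_index(input, 2)     # => 1
```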
Serialization
Pre-compile lexers for faster startup:
lexer = builder.compile
LexerKit::Format::LKT1.save(lexer, path: "lexer.lkt1")
LexerKit::Format::LKB1.save(lexer, path: "lexer.lkb1")
lexer_kit compile lexer.rb -o lexer.lkt1
Load later:
lexer = LexerKit.load_lexer(File.expand_path("data/lexer.lkt1", __dir__))
Performance Snapshot
JSON benchmark (600KB input, project benchmark script):
- LexerKit: 95.2 i/s
- StringScanner: 4.8 i/s (about 20x slower)
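For readers unfamiliar with the StringScanner side of the comparison, a hand-written tokenizer in that style typically looks like the sketch below. It is not the project's benchmark script (which tokenizes JSON and is not reproduced here); it covers the small Quick Start grammar instead, and `scan_tokens` is a hypothetical name.

```ruby
require "strscan"

# Hand-written tokenizer for the Quick Start grammar.
# Returns an array of [token_name, text] pairs; whitespace is skipped.
def scan_tokens(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    if scanner.skip(/[ \t\r\n]+/)
      next # whitespace produces no token
    elsif (text = scanner.scan(/[0-9]+/))
      tokens << [:NUMBER, text]
    elsif (text = scanner.scan(/\+/))
      tokens << [:PLUS, text]
    elsif (text = scanner.scan(/-/))
      tokens << [:MINUS, text]
    else
      tokens << [:INVALID, scanner.getch] # consume one unmatched character
    end
  end
  tokens
end
```

Each position tries the patterns one after another and allocates a string per token, which is the kind of per-token work a compiled lexer aims to avoid.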
License
MIT License