Class: Rouge::RegexLexer Abstract
Overview
A stateful lexer that uses sets of regular expressions to tokenize a string. Most lexers are subclasses of RegexLexer.
Direct Known Subclasses
Lexers::ABAP, Lexers::Actionscript, Lexers::Ada, Lexers::Apache, Lexers::Apex, Lexers::AppleScript, Lexers::ArmAsm, Lexers::Augeas, Lexers::Awk, Lexers::BBCBASIC, Lexers::BPF, Lexers::Batchfile, Lexers::BibTeX, Lexers::Bicep, Lexers::Brainfuck, Lexers::Brightscript, Lexers::Bsl, Lexers::C, Lexers::CMHG, Lexers::CMake, Lexers::COBOL, Lexers::CSS, Lexers::CSVS, Lexers::CSharp, Lexers::Ceylon, Lexers::Cfscript, Lexers::CiscoIos, Lexers::Clean, Lexers::Clojure, Lexers::Codeowners, Lexers::Coffeescript, Lexers::CommonLisp, Lexers::Conf, Lexers::Crystal, Lexers::Cypher, Lexers::D, Lexers::Dafny, Lexers::Dart, Lexers::Datastudio, Lexers::Diff, Lexers::Docker, Lexers::Dot, Lexers::Dylan, Lexers::ECL, Lexers::Eiffel, Lexers::Elixir, Lexers::Elm, Lexers::Email, Lexers::Erlang, Lexers::FSharp, Lexers::Factor, Lexers::Fluent, Lexers::Fortran, Lexers::GDScript, Lexers::GHCCmm, Lexers::GHCCore, Lexers::GLSL, Lexers::Gherkin, Lexers::Go, Lexers::GraphQL, Lexers::Groovy, Lexers::HTML, Lexers::HTTP, Lexers::Haml, Lexers::Haskell, Lexers::Haxe, Lexers::Hcl, Lexers::HyLang, Lexers::IDLang, Lexers::INI, Lexers::IO, Lexers::ISBL, Lexers::Idris, Lexers::IecST, Lexers::IgorPro, Lexers::Isabelle, Lexers::J, Lexers::JSL, Lexers::JSON, Lexers::Janet, Lexers::Java, Lexers::Javascript, Lexers::Jsonnet, Lexers::Julia, Lexers::KickAssembler, Lexers::Kotlin, Lexers::LLVM, Lexers::Lean, Lexers::Liquid, Lexers::LiterateCoffeescript, Lexers::LiterateHaskell, Lexers::Livescript, Lexers::Lua, Lexers::Lustre, Lexers::M68k, Lexers::MXML, Lexers::Magik, Lexers::Make, Lexers::Markdown, Lexers::Mathematica, Lexers::Matlab, Lexers::Meson, Lexers::MiniZinc, Lexers::Moonscript, Lexers::Mosel, Lexers::MsgTrans, Lexers::Nasm, Lexers::NesAsm, Lexers::Nginx, Lexers::Nial, Lexers::Nim, Lexers::Nix, Lexers::OCL, Lexers::OCamlCommon, Lexers::OpenEdge, Lexers::OpenTypeFeatureFile, Lexers::P4, Lexers::PLSQL, Lexers::Pascal, Lexers::Pdf, Lexers::Perl, Lexers::Plist, Lexers::Pony, Lexers::PostScript, 
Lexers::Powershell, Lexers::Praat, Lexers::Prolog, Lexers::Prometheus, Lexers::Properties, Lexers::Protobuf, Lexers::Puppet, Lexers::Python, Lexers::Q, Lexers::R, Lexers::RML, Lexers::Racket, Lexers::Rego, Lexers::RobotFramework, Lexers::Rocq, Lexers::Ruby, Lexers::Rust, Lexers::SAS, Lexers::SML, Lexers::SPARQL, Lexers::SQF, Lexers::SQL, Lexers::SSH, Lexers::SassCommon, Lexers::Scala, Lexers::Scheme, Lexers::Sed, Lexers::Sed::Regex, Lexers::Sed::Replacement, Lexers::Shell, Lexers::Sieve, Lexers::Slim, Lexers::Smalltalk, Lexers::Stan, Lexers::Stata, Lexers::SuperCollider, Lexers::Swift, Lexers::SystemD, Lexers::Syzlang, Lexers::Syzprog, Lexers::TCL, Lexers::TOML, Lexers::TTCN3, Lexers::Tap, Lexers::TeX, Lexers::Thrift, Lexers::Tulip, Lexers::Turtle, Lexers::VHDL, Lexers::Vala, Lexers::Varnish, Lexers::Verilog, Lexers::Veryl, Lexers::VimL, Lexers::VisualBasic, Lexers::Wollok, Lexers::XML, Lexers::XPath, Lexers::Xojo, Lexers::YAML, Lexers::YANG, Lexers::Zig, TemplateLexer
Defined Under Namespace
Classes: ClosedState, Fallthrough, InvalidRegex, Rule, State, StateDSL
Constant Summary
- MAX_NULL_SCANS = 5
  The number of successive scans permitted without consuming the input stream. If this is exceeded, the match fails.
Constants included from Token::Tokens
Token::Tokens::Num, Token::Tokens::Str
Instance Attribute Summary
Attributes inherited from Lexer
Class Method Summary
- .append(name, &b) ⇒ Object
- .get_state(name) ⇒ Object
- .prepend(name, &b) ⇒ Object
- .replace_state(name, new_defn) ⇒ Object
- .start(&b) ⇒ Object
  Specify an action to be run every fresh lex.
- .start_procs ⇒ Object
  The routines to run at the beginning of a fresh lex.
- .state(name, &b) ⇒ Object
  Define a new state for this lexer with the given name.
- .state_definitions ⇒ Object
- .states ⇒ Object
  The states hash for this lexer.
Instance Method Summary
- #delegate(lexer, text = nil) ⇒ Object
  Delegate the lex to another lexer.
- #fallthrough! ⇒ Object
  Breaks out of the current rule block and continues to match later rules, as if the current regex had not matched.
- #get_state(state_name) ⇒ Object
- #goto(state_name) ⇒ Object
  Replace the head of the stack with the given state.
- #group(tok) ⇒ Object (deprecated)
- #groups(*tokens) ⇒ Object
  Yield tokens corresponding to the matched groups of the current match.
- #in_state?(state_name) ⇒ Boolean
  Check if `state_name` is in the state stack.
- #pop!(times = 1) ⇒ Object
  Pop the state stack.
- #push(state_name = nil, &b) ⇒ Object
  Push a state onto the stack.
- #recurse(text = nil) ⇒ Object
  Re-lexes the given text (or the most recently matched string if none is given) with the current lexer.
- #reset! ⇒ Object
  Reset this lexer to its initial state.
- #reset_stack ⇒ Object
  Reset the stack back to `[:root]`.
- #stack ⇒ Object
  The state stack.
- #state ⇒ Object
  The current state, i.e. the one on top of the state stack.
- #state?(state_name) ⇒ Boolean
  Check if `state_name` is the state on top of the state stack.
- #step(state, stream) ⇒ Object
  Runs one step of the lex.
- #stream_tokens(str, &b) ⇒ Object
  This implements the lexer protocol, by yielding [token, value] pairs.
- #token(tok, val = @current_stream[0]) ⇒ Object
  Yield a token.
Methods inherited from Lexer
aliases, all, #as_bool, #as_lexer, #as_list, #as_string, #as_token, assert_utf8!, #bool_option, continue_lex, #continue_lex, debug_enabled?, demo, demo_file, desc, detect?, detectable?, disable_debug!, eager_load!, #eager_load!, enable_debug!, filenames, find, find_fancy, guess, guess_by_filename, guess_by_mimetype, guess_by_source, guesses, #hash_option, #initialize, lazy, lex, #lex, #lexer_option, #list_option, lookup_fancy, mimetypes, option, option_docs, skip_auto_load?, #string_option, tag, #tag, title, #token_option, #with
Methods included from Token::Tokens
Constructor Details
This class inherits a constructor from Rouge::Lexer
Class Method Details
.append(name, &b) ⇒ Object
    # File 'lib/rouge/regex_lexer.rb', line 268
    def self.append(name, &b)
      name = name.to_sym
      dsl = state_definitions[name] or raise "no such state #{name.inspect}"
      replace_state(name, dsl.appended(&b))
    end
.get_state(name) ⇒ Object
    # File 'lib/rouge/regex_lexer.rb', line 275
    def self.get_state(name)
      return name if name.is_a? State

      states[name.to_sym] ||= begin
        defn = state_definitions[name.to_sym] or raise "unknown state: #{name.inspect}"
        defn.to_state(self)
      end
    end
.prepend(name, &b) ⇒ Object
    # File 'lib/rouge/regex_lexer.rb', line 262
    def self.prepend(name, &b)
      name = name.to_sym
      dsl = state_definitions[name] or raise "no such state #{name.inspect}"
      replace_state(name, dsl.prepended(&b))
    end
.replace_state(name, new_defn) ⇒ Object
    # File 'lib/rouge/regex_lexer.rb', line 235
    def self.replace_state(name, new_defn)
      states[name] = nil
      state_definitions[name] = new_defn
    end
.start(&b) ⇒ Object
Specify an action to be run every fresh lex.
    # File 'lib/rouge/regex_lexer.rb', line 251
    def self.start(&b)
      start_procs << b
    end
.start_procs ⇒ Object
The routines to run at the beginning of a fresh lex.
    # File 'lib/rouge/regex_lexer.rb', line 242
    def self.start_procs
      @start_procs ||= InheritableList.new(superclass.start_procs)
    end
.state(name, &b) ⇒ Object
Define a new state for this lexer with the given name. The block will be evaluated in the context of a StateDSL.
    # File 'lib/rouge/regex_lexer.rb', line 257
    def self.state(name, &b)
      name = name.to_sym
      state_definitions[name] = StateDSL.new(name, &b)
    end
.state_definitions ⇒ Object
    # File 'lib/rouge/regex_lexer.rb', line 230
    def self.state_definitions
      @state_definitions ||= InheritableHash.new(superclass.state_definitions)
    end
.states ⇒ Object
The states hash for this lexer.
    # File 'lib/rouge/regex_lexer.rb', line 226
    def self.states
      @states ||= {}
    end
Instance Method Details
#delegate(lexer, text = nil) ⇒ Object
Delegate the lex to another lexer. We use the `continue_lex` method so that #reset! will not be called. In this way, a single lexer can be repeatedly delegated to while maintaining its own internal state stack.
    # File 'lib/rouge/regex_lexer.rb', line 435
    def delegate(lexer, text=nil)
      puts " delegating to: #{lexer.inspect}" if @debug
      text ||= @current_stream[0]

      lexer.continue_lex(text) do |tok, val|
        puts " delegated token: #{tok.inspect}, #{val.inspect}" if @debug
        yield_token(tok, val)
      end
    end
#fallthrough! ⇒ Object
Breaks out of the current rule block and continues to match later rules, as if the current regex had not matched. Does not affect the stack.
    # File 'lib/rouge/regex_lexer.rb', line 454
    def fallthrough!
      raise Fallthrough
    end
#get_state(state_name) ⇒ Object
    # File 'lib/rouge/regex_lexer.rb', line 285
    def get_state(state_name)
      self.class.get_state(state_name)
    end
#goto(state_name) ⇒ Object
Replace the head of the stack with the given state.
    # File 'lib/rouge/regex_lexer.rb', line 488
    def goto(state_name)
      raise 'empty stack!' if stack.empty?

      puts " going to: state :#{state_name} " if @debug
      stack[-1] = get_state(state_name)
    end
#group(tok) ⇒ Object
Deprecated: use #groups instead. This method formerly yielded a token with the next matched group, with subsequent calls yielding subsequent groups; it now raises an error.
    # File 'lib/rouge/regex_lexer.rb', line 414
    def group(tok)
      raise "RegexLexer#group is deprecated: use #groups instead"
    end
#groups(*tokens) ⇒ Object
Yield tokens corresponding to the matched groups of the current match.
    # File 'lib/rouge/regex_lexer.rb', line 420
    def groups(*tokens)
      tokens.each_with_index do |tok, i|
        yield_token(tok, @current_stream[i+1])
      end
    end
#in_state?(state_name) ⇒ Boolean
Check if `state_name` is in the state stack.
    # File 'lib/rouge/regex_lexer.rb', line 503
    def in_state?(state_name)
      state_name = state_name.to_sym
      stack.any? do |state|
        state.name == state_name.to_sym
      end
    end
#pop!(times = 1) ⇒ Object
Pop the state stack. If a number is passed in, it will be popped that number of times.
    # File 'lib/rouge/regex_lexer.rb', line 477
    def pop!(times=1)
      raise 'empty stack!' if stack.empty?

      puts " popping stack: #{times}" if @debug

      stack.pop(times)

      nil
    end
#push(state_name = nil, &b) ⇒ Object
Push a state onto the stack. If no state name is given and you’ve passed a block, a state will be dynamically created using the StateDSL.
    # File 'lib/rouge/regex_lexer.rb', line 461
    def push(state_name=nil, &b)
      push_state = if state_name
        get_state(state_name)
      elsif block_given?
        StateDSL.new(b.inspect, &b).to_state(self.class)
      else
        # use the top of the stack by default
        self.state
      end

      puts " pushing: :#{push_state.name}" if @debug
      stack.push(push_state)
    end
#recurse(text = nil) ⇒ Object
Re-lexes the given text (or the most recently matched string if none is given) with the current lexer.
    # File 'lib/rouge/regex_lexer.rb', line 447
    def recurse(text=nil)
      delegate(self.class, text)
    end
#reset! ⇒ Object
Reset this lexer to its initial state. This runs all of the start_procs.
    # File 'lib/rouge/regex_lexer.rb', line 306
    def reset!
      @stack = nil
      @current_stream = nil

      puts "start blocks" if @debug && self.class.start_procs.any?
      self.class.start_procs.each do |pr|
        instance_eval(&pr)
      end
    end
#reset_stack ⇒ Object
Reset the stack back to `[:root]`.
    # File 'lib/rouge/regex_lexer.rb', line 496
    def reset_stack
      puts ' resetting stack' if @debug
      stack.clear
      stack.push get_state(:root)
    end
#stack ⇒ Object
The state stack. This is initially the single state `[:root]`. It is an error for this stack to be empty.
    # File 'lib/rouge/regex_lexer.rb', line 292
    def stack
      @stack ||= [get_state(:root)]
    end
#state ⇒ Object
The current state, i.e. the one on top of the state stack.
NB: if the state stack is empty, this will raise an error rather than returning nil.
    # File 'lib/rouge/regex_lexer.rb', line 300
    def state
      stack.last or raise 'empty stack!'
    end
#state?(state_name) ⇒ Boolean
Check if `state_name` is the state on top of the state stack.
    # File 'lib/rouge/regex_lexer.rb', line 511
    def state?(state_name)
      state_name.to_sym == state.name
    end
#step(state, stream) ⇒ Object
Runs one step of the lex. Rules in the current state are tried until one matches, at which point its callback is called.
    # File 'lib/rouge/regex_lexer.rb', line 362
    def step(state, stream)
      state.rules.each do |rule|
        if rule.is_a?(State)
          puts " entering: mixin :#{rule.name}" if @debug
          return true if step(rule, stream)
          puts " exiting: mixin :#{rule.name}" if @debug
        else
          puts " trying: #{rule.inspect}" if @debug

          if (size = stream.skip(rule.re))
            puts " got: #{stream[0].inspect}" if @debug

            if size.zero?
              @null_steps += 1
              if @null_steps > MAX_NULL_SCANS
                puts " warning: too many scans without consuming the string!" if @debug
                return false
              end
            else
              @null_steps = 0
            end

            begin
              instance_exec(stream, &rule.callback)
            rescue Fallthrough
              stream.unscan
              next
            end

            return true
          end
        end
      end

      false
    end
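The null-scan guard in the listing above relies on `StringScanner#skip` returning `0` (not `nil`) when a pattern matches the empty string. The standalone sketch below reproduces only that counting logic using Ruby's standard `strscan` library; `NULL_SCAN_LIMIT` and `stalls?` are hypothetical names, and none of this is Rouge's own code.

```ruby
require 'strscan'

# Same value as MAX_NULL_SCANS, redefined so the sketch is standalone.
NULL_SCAN_LIMIT = 5

# Returns true if `pattern` keeps matching without consuming input
# until the limit is exceeded, and false otherwise.
def stalls?(input, pattern)
  scanner = StringScanner.new(input)
  null_steps = 0

  until scanner.eos?
    size = scanner.skip(pattern)
    return false if size.nil?   # no match at all, so nothing to count

    if size.zero?
      # Matched, but the scan position did not move.
      null_steps += 1
      return true if null_steps > NULL_SCAN_LIMIT
    else
      null_steps = 0            # progress was made; start counting afresh
    end
  end

  false
end
```

Here `stalls?("abc", /x*/)` reports a stall because `/x*/` matches the empty string at position 0 every time, while `stalls?("abc", /[a-z]/)` consumes the input normally.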
#stream_tokens(str, &b) ⇒ Object
This implements the lexer protocol, by yielding [token, value] pairs.
The process for lexing works as follows, until the stream is empty:
1. We look at the state on top of the stack (which by default is `[:root]`).
2. Each rule in that state is tried until one is successful. If one is found, that rule's callback is evaluated, which may yield tokens and manipulate the state stack. Otherwise, one character is consumed with an `Error` token, and we continue at (1).
    # File 'lib/rouge/regex_lexer.rb', line 328
    def stream_tokens(str, &b)
      stream = StringScanner.new(str, fixed_anchor: true)

      @current_stream = stream
      @output_stream = b
      @states = self.class.states
      @null_steps = 0

      until stream.eos?
        if @debug
          puts
          puts "lexer: #{self.class.tag}"
          puts "stack: #{stack.map { |s| s.name.to_sym }.inspect}"
          puts "stream: #{stream.peek(20).inspect}"
        end

        success = step(state, stream)

        if !success
          puts " no match, yielding Error" if @debug
          yield(Token::Tokens::Error, stream.getch)
        end
      end
    end
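The two-step process described above can be sketched without Rouge at all. The toy lexer below uses only Ruby's standard `strscan` library; the `RULES` table, the `toy_lex` method, and the symbol token names are all hypothetical and greatly simplified compared with the real implementation (no mixins, callbacks, or null-scan guard).

```ruby
require 'strscan'

# Hypothetical rule table: state name => [regex, token, stack action].
RULES = {
  root: [
    [/\s+/, :text, nil],
    [/\d+/, :number, nil],
    [/"/, :string, :string],   # a Symbol other than :pop pushes that state
  ],
  string: [
    [/[^"]+/, :string, nil],
    [/"/, :string, :pop],
  ],
}.freeze

def toy_lex(input)
  scanner = StringScanner.new(input)
  stack = [:root]
  tokens = []

  until scanner.eos?
    # 1. Look at the state on top of the stack.
    rules = RULES.fetch(stack.last)

    # 2. Try each rule in order until one matches at the current position.
    rule = rules.find { |re, _tok, _action| scanner.skip(re) }

    if rule
      _re, tok, action = rule
      tokens << [tok, scanner.matched]
      case action
      when :pop then stack.pop
      when Symbol then stack.push(action)
      end
    else
      # No rule matched: consume one character as an error token.
      tokens << [:error, scanner.getch]
    end
  end

  tokens
end
```

For example, `toy_lex('1 "a" ?')` walks into the `:string` state at the first quote, returns to `:root` at the second, and emits an `:error` token for the `?` that no rule matches.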
#token(tok, val = ) ⇒ Object
Yield a token.
    # File 'lib/rouge/regex_lexer.rb', line 406
    def token(tok, val=@current_stream[0])
      yield_token(tok, val)
    end