Class: Rouge::RegexLexer Abstract

Inherits:
Lexer
  • Object
Defined in:
lib/rouge/regex_lexer.rb

Overview

This class is abstract.

A stateful lexer that uses sets of regular expressions to tokenize a string. Most lexers are subclasses of RegexLexer.

Direct Known Subclasses

Lexers::ABAP, Lexers::Actionscript, Lexers::Ada, Lexers::Apache, Lexers::Apex, Lexers::AppleScript, Lexers::ArmAsm, Lexers::Augeas, Lexers::Awk, Lexers::BBCBASIC, Lexers::BPF, Lexers::Batchfile, Lexers::BibTeX, Lexers::Bicep, Lexers::Brainfuck, Lexers::Brightscript, Lexers::Bsl, Lexers::C, Lexers::CMHG, Lexers::CMake, Lexers::COBOL, Lexers::CSS, Lexers::CSVS, Lexers::CSharp, Lexers::Ceylon, Lexers::Cfscript, Lexers::CiscoIos, Lexers::Clean, Lexers::Clojure, Lexers::Codeowners, Lexers::Coffeescript, Lexers::CommonLisp, Lexers::Conf, Lexers::Crystal, Lexers::Cypher, Lexers::D, Lexers::Dafny, Lexers::Dart, Lexers::Datastudio, Lexers::Diff, Lexers::Docker, Lexers::Dot, Lexers::Dylan, Lexers::ECL, Lexers::Eiffel, Lexers::Elixir, Lexers::Elm, Lexers::Email, Lexers::Erlang, Lexers::FSharp, Lexers::Factor, Lexers::Fluent, Lexers::Fortran, Lexers::GDScript, Lexers::GHCCmm, Lexers::GHCCore, Lexers::GLSL, Lexers::Gherkin, Lexers::Go, Lexers::GraphQL, Lexers::Groovy, Lexers::HTML, Lexers::HTTP, Lexers::Haml, Lexers::Haskell, Lexers::Haxe, Lexers::Hcl, Lexers::HyLang, Lexers::IDLang, Lexers::INI, Lexers::IO, Lexers::ISBL, Lexers::Idris, Lexers::IecST, Lexers::IgorPro, Lexers::Isabelle, Lexers::J, Lexers::JSL, Lexers::JSON, Lexers::Janet, Lexers::Java, Lexers::Javascript, Lexers::Jsonnet, Lexers::Julia, Lexers::KickAssembler, Lexers::Kotlin, Lexers::LLVM, Lexers::Lean, Lexers::Liquid, Lexers::LiterateCoffeescript, Lexers::LiterateHaskell, Lexers::Livescript, Lexers::Lua, Lexers::Lustre, Lexers::M68k, Lexers::MXML, Lexers::Magik, Lexers::Make, Lexers::Markdown, Lexers::Mathematica, Lexers::Matlab, Lexers::Meson, Lexers::MiniZinc, Lexers::Moonscript, Lexers::Mosel, Lexers::MsgTrans, Lexers::Nasm, Lexers::NesAsm, Lexers::Nginx, Lexers::Nial, Lexers::Nim, Lexers::Nix, Lexers::OCL, Lexers::OCamlCommon, Lexers::OpenEdge, Lexers::OpenTypeFeatureFile, Lexers::P4, Lexers::PLSQL, Lexers::Pascal, Lexers::Pdf, Lexers::Perl, Lexers::Plist, Lexers::Pony, Lexers::PostScript, 
Lexers::Powershell, Lexers::Praat, Lexers::Prolog, Lexers::Prometheus, Lexers::Properties, Lexers::Protobuf, Lexers::Puppet, Lexers::Python, Lexers::Q, Lexers::R, Lexers::RML, Lexers::Racket, Lexers::Rego, Lexers::RobotFramework, Lexers::Rocq, Lexers::Ruby, Lexers::Rust, Lexers::SAS, Lexers::SML, Lexers::SPARQL, Lexers::SQF, Lexers::SQL, Lexers::SSH, Lexers::SassCommon, Lexers::Scala, Lexers::Scheme, Lexers::Sed, Lexers::Sed::Regex, Lexers::Sed::Replacement, Lexers::Shell, Lexers::Sieve, Lexers::Slim, Lexers::Smalltalk, Lexers::Stan, Lexers::Stata, Lexers::SuperCollider, Lexers::Swift, Lexers::SystemD, Lexers::Syzlang, Lexers::Syzprog, Lexers::TCL, Lexers::TOML, Lexers::TTCN3, Lexers::Tap, Lexers::TeX, Lexers::Thrift, Lexers::Tulip, Lexers::Turtle, Lexers::VHDL, Lexers::Vala, Lexers::Varnish, Lexers::Verilog, Lexers::Veryl, Lexers::VimL, Lexers::VisualBasic, Lexers::Wollok, Lexers::XML, Lexers::XPath, Lexers::Xojo, Lexers::YAML, Lexers::YANG, Lexers::Zig, TemplateLexer

Defined Under Namespace

Classes: ClosedState, Fallthrough, InvalidRegex, Rule, State, StateDSL

Constant Summary

MAX_NULL_SCANS = 5

The number of successive scans permitted without consuming the input stream. If this is exceeded, the match fails.
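A minimal sketch of how such a guard might work, using only Ruby's standard `StringScanner`. The helper `scan_with_guard` and its return symbols are hypothetical and not part of Rouge; only the limit of 5 is taken from the constant above.

```ruby
require 'strscan'

MAX_NULL_SCANS = 5 # same value as the documented constant

# Hypothetical helper: repeatedly applies one regex and gives up once
# too many successive matches consume nothing.
def scan_with_guard(text, regex)
  scanner = StringScanner.new(text)
  null_steps = 0

  until scanner.eos?
    size = scanner.skip(regex)      # length of the match, or nil
    return :no_match if size.nil?

    if size.zero?
      null_steps += 1
      return :too_many_null_scans if null_steps > MAX_NULL_SCANS
    else
      null_steps = 0                # progress was made, reset the counter
    end
  end

  :finished
end

scan_with_guard("aaa", /a/)      # consumes one character per scan
scan_with_guard("aaa", /(?=a)/)  # zero-width lookahead never advances
```

Without a counter of this kind, a rule that matches the empty string could loop forever at the same position.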

Constants included from Token::Tokens

Token::Tokens::Num, Token::Tokens::Str

Instance Attribute Summary

Attributes inherited from Lexer

#options

Class Method Summary

Instance Method Summary

Methods inherited from Lexer

aliases, all, #as_bool, #as_lexer, #as_list, #as_string, #as_token, assert_utf8!, #bool_option, continue_lex, #continue_lex, debug_enabled?, demo, demo_file, desc, detect?, detectable?, disable_debug!, eager_load!, #eager_load!, enable_debug!, filenames, find, find_fancy, guess, guess_by_filename, guess_by_mimetype, guess_by_source, guesses, #hash_option, #initialize, lazy, lex, #lex, #lexer_option, #list_option, lookup_fancy, mimetypes, option, option_docs, skip_auto_load?, #string_option, tag, #tag, title, #token_option, #with

Methods included from Token::Tokens

token

Constructor Details

This class inherits a constructor from Rouge::Lexer

Class Method Details

.append(name, &b) ⇒ Object



# File 'lib/rouge/regex_lexer.rb', line 268

def self.append(name, &b)
  name = name.to_sym
  dsl = state_definitions[name] or raise "no such state #{name.inspect}"
  replace_state(name, dsl.appended(&b))
end

.get_state(name) ⇒ Object



# File 'lib/rouge/regex_lexer.rb', line 275

def self.get_state(name)
  return name if name.is_a? State

  states[name.to_sym] ||= begin
    defn = state_definitions[name.to_sym] or raise "unknown state: #{name.inspect}"
    defn.to_state(self)
  end
end

.prepend(name, &b) ⇒ Object



# File 'lib/rouge/regex_lexer.rb', line 262

def self.prepend(name, &b)
  name = name.to_sym
  dsl = state_definitions[name] or raise "no such state #{name.inspect}"
  replace_state(name, dsl.prepended(&b))
end

.replace_state(name, new_defn) ⇒ Object



# File 'lib/rouge/regex_lexer.rb', line 235

def self.replace_state(name, new_defn)
  states[name] = nil
  state_definitions[name] = new_defn
end
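The `.get_state` and `.replace_state` listings above together suggest a build-once cache that is cleared whenever a definition changes. A small sketch of that pattern follows; `ToyStates`, its `define` method, and the `builds` counter are hypothetical stand-ins, and a frozen array stands in for a compiled state.

```ruby
# Hypothetical stand-in for the class-level state bookkeeping.
class ToyStates
  attr_reader :builds

  def initialize
    @definitions = {} # name => raw definition
    @cache = {}       # name => built state (or nil when stale)
    @builds = 0
  end

  def define(name, rules)
    @definitions[name] = rules
  end

  # Builds the state on first use and memoizes it afterwards.
  def get_state(name)
    @cache[name] ||= begin
      defn = @definitions[name] or raise "unknown state: #{name.inspect}"
      @builds += 1
      defn.dup.freeze
    end
  end

  # Clearing the cache entry forces a rebuild on the next lookup.
  def replace_state(name, new_rules)
    @cache[name] = nil
    @definitions[name] = new_rules
  end
end

states = ToyStates.new
states.define(:root, [:rule_a])
states.get_state(:root)
states.get_state(:root)                 # served from the cache
states.replace_state(:root, [:rule_b])
states.get_state(:root)                 # rebuilt from the new definition
```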

.start(&b) ⇒ Object

Specify an action to be run every fresh lex.

Examples:

start { puts "I'm lexing a new string!" }


# File 'lib/rouge/regex_lexer.rb', line 251

def self.start(&b)
  start_procs << b
end

.start_procs ⇒ Object

The routines to run at the beginning of a fresh lex.

See Also:

  • .start



# File 'lib/rouge/regex_lexer.rb', line 242

def self.start_procs
  @start_procs ||= InheritableList.new(superclass.start_procs)
end

.state(name, &b) ⇒ Object

Define a new state for this lexer with the given name. The block will be evaluated in the context of a StateDSL.



# File 'lib/rouge/regex_lexer.rb', line 257

def self.state(name, &b)
  name = name.to_sym
  state_definitions[name] = StateDSL.new(name, &b)
end

.state_definitions ⇒ Object



# File 'lib/rouge/regex_lexer.rb', line 230

def self.state_definitions
  @state_definitions ||= InheritableHash.new(superclass.state_definitions)
end

.states ⇒ Object

The states hash for this lexer.

See Also:

  • .state



# File 'lib/rouge/regex_lexer.rb', line 226

def self.states
  @states ||= {}
end

Instance Method Details

#delegate(lexer, text = nil) ⇒ Object

Delegate the lex to another lexer. We use the `continue_lex` method so that #reset! will not be called. In this way, a single lexer can be repeatedly delegated to while maintaining its own internal state stack.

Parameters:

  • lexer (#lex)

    The lexer or lexer class to delegate to

  • text (String) (defaults to: nil)

    The text to delegate. This defaults to the last matched string.



# File 'lib/rouge/regex_lexer.rb', line 435

def delegate(lexer, text=nil)
  puts "    delegating to: #{lexer.inspect}" if @debug
  text ||= @current_stream[0]

  lexer.continue_lex(text) do |tok, val|
    puts "    delegated token: #{tok.inspect}, #{val.inspect}" if @debug
    yield_token(tok, val)
  end
end

#fallthrough! ⇒ Object

Breaks out of the current rule block and continues to match later rules, as if the current regex had not matched. Does not affect the stack.

Raises:

  • (Fallthrough)



# File 'lib/rouge/regex_lexer.rb', line 454

def fallthrough!
  raise Fallthrough
end
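A rough, standalone illustration of the fall-through idea: a callback rejects its own match, the scanner is rewound, and the next rule gets a chance. The `Fallthrough` class here is a local stand-in, and `RULES`, `KEYWORDS`, and `first_token` are hypothetical names, not Rouge API.

```ruby
require 'strscan'

class Fallthrough < StandardError; end # local stand-in for the real class

KEYWORDS = %w[if else end].freeze

# Each rule is a [regex, callback] pair; the first rule only accepts keywords.
RULES = [
  [/\w+/, ->(m) { raise Fallthrough unless KEYWORDS.include?(m); [:keyword, m] }],
  [/\w+/, ->(m) { [:name, m] }]
].freeze

def first_token(text)
  scanner = StringScanner.new(text)

  RULES.each do |regex, callback|
    next unless scanner.skip(regex)

    begin
      return callback.call(scanner[0])
    rescue Fallthrough
      scanner.unscan # rewind as if the regex had not matched
      next
    end
  end

  nil
end

first_token("if x") # handled by the keyword rule
first_token("foo")  # keyword rule falls through, name rule handles it
```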

#get_state(state_name) ⇒ Object



# File 'lib/rouge/regex_lexer.rb', line 285

def get_state(state_name)
  self.class.get_state(state_name)
end

#goto(state_name) ⇒ Object

Replace the top of the stack with the given state.



# File 'lib/rouge/regex_lexer.rb', line 488

def goto(state_name)
  raise 'empty stack!' if stack.empty?

  puts "    going to: state :#{state_name} " if @debug
  stack[-1] = get_state(state_name)
end
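To make the difference between replacing and pushing concrete, here is a sketch on a plain array. Modelling states as bare symbols is a simplifying assumption; the real stack holds State objects.

```ruby
stack = [:root, :string]

# Pushing grows the stack, so a later pop would return to :string.
pushed = stack.dup
pushed.push(:interpolation)

# Replacing the top (what goto does via stack[-1] = ...) keeps the depth,
# so a later pop would return to :root instead.
replaced = stack.dup
replaced[-1] = :interpolation

pushed    # three entries
replaced  # still two entries
```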

#group(tok) ⇒ Object

Deprecated. Calling this method now raises an error; use #groups instead.

Yield a token with the next matched group. Subsequent calls to this method will yield subsequent groups.



# File 'lib/rouge/regex_lexer.rb', line 414

def group(tok)
  raise "RegexLexer#group is deprecated: use #groups instead"
end

#groups(*tokens) ⇒ Object

Yield tokens corresponding to the matched groups of the current match.



# File 'lib/rouge/regex_lexer.rb', line 420

def groups(*tokens)
  tokens.each_with_index do |tok, i|
    yield_token(tok, @current_stream[i+1])
  end
end
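The pairing of token types with capture groups can be tried in isolation with `StringScanner`. In this sketch `groups_for` is a hypothetical helper that returns pairs instead of yielding them, and the symbols stand in for real token classes.

```ruby
require 'strscan'

# Pairs the i-th token type with capture group i + 1 of the last match.
def groups_for(scanner, *tokens)
  tokens.each_with_index.map { |tok, i| [tok, scanner[i + 1]] }
end

scanner = StringScanner.new("name = 1")
scanner.skip(/(\w+)(\s*)(=)/)

pairs = groups_for(scanner, :Name, :Text, :Operator)
```

Group 0 is the whole match, which is why the index is shifted by one.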

#in_state?(state_name) ⇒ Boolean

Check if `state_name` is in the state stack.

Returns:

  • (Boolean)


# File 'lib/rouge/regex_lexer.rb', line 503

def in_state?(state_name)
  state_name = state_name.to_sym
  stack.any? do |state|
    state.name == state_name.to_sym
  end
end

#pop!(times = 1) ⇒ Object

Pop the state stack. If a number is passed in, it will be popped that number of times.



# File 'lib/rouge/regex_lexer.rb', line 477

def pop!(times=1)
  raise 'empty stack!' if stack.empty?

  puts "    popping stack: #{times}" if @debug

  stack.pop(times)

  nil
end
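Since the listing relies on `Array#pop` with a count, a short sketch of that call on a plain array may help. States are modelled as symbols here, which is an assumption made for brevity.

```ruby
stack = [:root, :string, :interpolation]

removed = stack.pop(2)
# removed is [:string, :interpolation]; stack is now [:root]

# Asking for more entries than remain simply empties the array.
over = stack.pop(5)
# over is [:root]; stack is now []
```

Note that the guard in the listing only checks for an empty stack before popping, so popping too many times can still leave the stack empty afterwards.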

#push(state_name = nil, &b) ⇒ Object

Push a state onto the stack. If no state name is given and you’ve passed a block, a state will be dynamically created using the StateDSL. If neither is given, the state currently on top of the stack is pushed again.



# File 'lib/rouge/regex_lexer.rb', line 461

def push(state_name=nil, &b)
  push_state = if state_name
    get_state(state_name)
  elsif block_given?
    StateDSL.new(b.inspect, &b).to_state(self.class)
  else
    # use the top of the stack by default
    self.state
  end

  puts "    pushing: :#{push_state.name}" if @debug
  stack.push(push_state)
end

#recurse(text = nil) ⇒ Object

Re-lexes the given text (or the most recently matched string if none is given) with the current lexer.



# File 'lib/rouge/regex_lexer.rb', line 447

def recurse(text=nil)
  delegate(self.class, text)
end

#reset! ⇒ Object

Reset this lexer to its initial state. This runs all of the start_procs.



# File 'lib/rouge/regex_lexer.rb', line 306

def reset!
  @stack = nil
  @current_stream = nil

  puts "start blocks" if @debug && self.class.start_procs.any?
  self.class.start_procs.each do |pr|
    instance_eval(&pr)
  end
end
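The interplay between registered start blocks and a reset can be sketched without Rouge. `ToyLexer` and its `enter` method are hypothetical, and a plain Array replaces InheritableList, so this toy does not model inheritance from a superclass.

```ruby
class ToyLexer
  def self.start_procs
    @start_procs ||= []
  end

  def self.start(&block)
    start_procs << block
  end

  # Blocks registered at class level; each runs in the instance's context.
  start { @depth = 0 }
  start { @seen = [] }

  attr_reader :depth, :seen

  def enter(name)
    @depth += 1
    @seen << name
  end

  def reset!
    self.class.start_procs.each { |pr| instance_eval(&pr) }
  end
end

lexer = ToyLexer.new
lexer.reset!
lexer.enter(:block)
lexer.reset! # per-lex bookkeeping is back to its initial values
```

Because the blocks run through `instance_eval`, instance variables they set belong to the lexer instance, which makes them a convenient place to initialise per-lex bookkeeping.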

#reset_stack ⇒ Object

Reset the stack back to `[:root]`.



# File 'lib/rouge/regex_lexer.rb', line 496

def reset_stack
  puts '    resetting stack' if @debug
  stack.clear
  stack.push get_state(:root)
end

#stack ⇒ Object

The state stack. This is initially the single state `[:root]`. It is an error for this stack to be empty.

See Also:

  • #state



# File 'lib/rouge/regex_lexer.rb', line 292

def stack
  @stack ||= [get_state(:root)]
end

#state ⇒ Object

The current state, i.e. the one on top of the state stack.

NB: if the state stack is empty, this will raise an error rather than returning nil.



# File 'lib/rouge/regex_lexer.rb', line 300

def state
  stack.last or raise 'empty stack!'
end

#state?(state_name) ⇒ Boolean

Check if `state_name` is the state on top of the state stack.

Returns:

  • (Boolean)


# File 'lib/rouge/regex_lexer.rb', line 511

def state?(state_name)
  state_name.to_sym == state.name
end
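The two predicates differ only in how much of the stack they inspect. A sketch with a stand-in `State` struct follows; the struct and the two lambdas are hypothetical and merely mirror the checks in the listings.

```ruby
State = Struct.new(:name) # stand-in with just a name

stack = [State.new(:root), State.new(:string)]

# Mirrors in_state?: true if the name appears anywhere in the stack.
anywhere = ->(name) { stack.any? { |s| s.name == name.to_sym } }

# Mirrors state?: true only if the name is the top entry.
on_top = ->(name) { name.to_sym == stack.last.name }

anywhere.call(:root)   # :root is on the stack, though not on top
on_top.call(:root)     # the top entry is :string
on_top.call("string")  # strings are converted with to_sym
```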

#step(state, stream) ⇒ Object

Runs one step of the lex. Rules in the current state are tried until one matches, at which point its callback is called.

Returns:

  • true if a rule was tried successfully

  • false otherwise.



# File 'lib/rouge/regex_lexer.rb', line 362

def step(state, stream)
  state.rules.each do |rule|
    if rule.is_a?(State)
      puts "  entering: mixin :#{rule.name}" if @debug
      return true if step(rule, stream)
      puts "  exiting: mixin :#{rule.name}" if @debug
    else
      puts "  trying: #{rule.inspect}" if @debug

      if (size = stream.skip(rule.re))
        puts "    got: #{stream[0].inspect}" if @debug

        if size.zero?
          @null_steps += 1
          if @null_steps > MAX_NULL_SCANS
            puts "    warning: too many scans without consuming the string!" if @debug
            return false
          end
        else
          @null_steps = 0
        end

        begin
          instance_exec(stream, &rule.callback)
        rescue Fallthrough
          stream.unscan
          next
        end

        return true
      end
    end
  end

  false
end

#stream_tokens(str, &b) ⇒ Object

This implements the lexer protocol, by yielding [token, value] pairs.

The process for lexing works as follows, until the stream is empty:

  1. We look at the state on top of the stack (which by default is `[:root]`).

  2. Each rule in that state is tried until one is successful. If one is found, that rule’s callback is evaluated, which may yield tokens and manipulate the state stack. Otherwise, one character is consumed with an `Error` token, and we continue at (1.)



# File 'lib/rouge/regex_lexer.rb', line 328

def stream_tokens(str, &b)
  stream = StringScanner.new(str, fixed_anchor: true)

  @current_stream = stream
  @output_stream  = b
  @states         = self.class.states
  @null_steps     = 0

  until stream.eos?
    if @debug
      puts
      puts "lexer: #{self.class.tag}"
      puts "stack: #{stack.map { |s| s.name.to_sym }.inspect}"
      puts "stream: #{stream.peek(20).inspect}"
    end

    success = step(state, stream)

    if !success
      puts "    no match, yielding Error" if @debug
      yield(Token::Tokens::Error, stream.getch)
    end
  end
end
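The loop described above can be imitated in a few lines of plain Ruby. This is a toy model under stated assumptions, not Rouge's implementation: `STATES`, `toy_lex`, and the symbol token names are hypothetical, and each rule is a `[regex, token, action]` triple rather than a Rule object with a callback.

```ruby
require 'strscan'

# Two states: :root, and :string for the inside of a double-quoted string.
STATES = {
  root: [
    [/\s+/, :Text, nil],
    [/\d+/, :Num, nil],
    [/"/, :Str, ->(stack) { stack.push(:string) }]
  ],
  string: [
    [/[^"]+/, :Str, nil],
    [/"/, :Str, ->(stack) { stack.pop }]
  ]
}.freeze

def toy_lex(text)
  scanner = StringScanner.new(text)
  stack = [:root]
  tokens = []

  until scanner.eos?
    # Try each rule of the state on top of the stack, in order.
    rule = STATES.fetch(stack.last).find { |r| scanner.skip(r[0]) }

    if rule
      _regex, token, action = rule
      tokens << [token, scanner[0]]
      action&.call(stack) # the action may manipulate the state stack
    else
      # No rule matched: consume one character as an error token.
      tokens << [:Error, scanner.getch]
    end
  end

  tokens
end

toy_lex('12 "hi" ?')
```

The trailing `?` matches no rule in `:root`, so it is emitted as a single-character `:Error` token and lexing continues.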

#token(tok, val = @current_stream[0]) ⇒ Object

Yield a token.

Parameters:

  • tok

    the token type

  • val (defaults to: @current_stream[0])

    (optional) the string value to yield. If absent, this defaults to the entire last match.



# File 'lib/rouge/regex_lexer.rb', line 406

def token(tok, val=@current_stream[0])
  yield_token(tok, val)
end
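The default value in the signature is an instance-variable expression, which Ruby evaluates on every call rather than once at definition time. A small sketch of that language behaviour follows; the `Demo` class and its `current` attribute are hypothetical.

```ruby
class Demo
  attr_writer :current

  def initialize
    @current = "first"
  end

  # The default is re-evaluated at each call, in the instance's context.
  def token(tok, val = @current)
    [tok, val]
  end
end

demo = Demo.new
a = demo.token(:Name)             # picks up "first"
demo.current = "second"
b = demo.token(:Name)             # picks up "second"
c = demo.token(:Name, "explicit") # an explicit value wins
```

This is why a bare `token SomeType` inside a rule callback can always refer to the most recent match.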