Class: UnicodeScanner

Inherits:
Object
  • Object
Defined in:
lib/unicode_scanner.rb

Overview

UnicodeScanner provides for Unicode-aware lexical scanning operations on a `String`. Here is an example of its usage:

```ruby
s = UnicodeScanner.new('This is an example string')
s.eos?               # -> false

p s.scan(/\w+/)      # -> "This"
p s.scan(/\w+/)      # -> nil
p s.scan(/\s+/)      # -> " "
p s.scan(/\s+/)      # -> nil
p s.scan(/\w+/)      # -> "is"
s.eos?               # -> false

p s.scan(/\s+/)      # -> " "
p s.scan(/\w+/)      # -> "an"
p s.scan(/\s+/)      # -> " "
p s.scan(/\w+/)      # -> "example"
p s.scan(/\s+/)      # -> " "
p s.scan(/\w+/)      # -> "string"
s.eos?               # -> true

p s.scan(/\s+/)      # -> nil
p s.scan(/\w+/)      # -> nil
```

Scanning a string means remembering the position of a _scan pointer_, which is just an index. The point of scanning is to move forward a bit at a time, so matches are sought after the scan pointer; usually immediately after it.

Given the string “test string”, here are the pertinent scan pointer positions:

```
  t e s t   s t r i n g
0 1 2 …               1
                      0
```

When you #scan for a pattern (a regular expression), the match must occur at the character after the scan pointer. If you use #scan_until, then the match can occur anywhere after the scan pointer. In both cases, the scan pointer moves _just beyond_ the last character of the match, ready to scan again from the next character onwards. This is demonstrated by the example above.
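
The difference between anchored and unanchored matching can be sketched as follows. This is an illustration only: it assumes UnicodeScanner follows the interface of Ruby's standard `StringScanner` for ASCII input, and it uses `StringScanner` (from `strscan`) as a stand-in so the snippet runs without this library. The helper name `anchored_vs_unanchored` is hypothetical.

```ruby
require 'strscan'

# Stand-in: StringScanner is assumed to behave like UnicodeScanner for ASCII input.
# Hypothetical helper: tries an anchored scan, then an unanchored scan_until,
# and reports what each returned and where the scan pointer ended up.
def anchored_vs_unanchored(text, pattern)
  s = StringScanner.new(text)
  anchored = s.scan(pattern)          # must match exactly at the scan pointer
  pos_after_scan = s.pos
  unanchored = s.scan_until(pattern)  # may match anywhere after the scan pointer
  { scan: anchored, pos_after_scan: pos_after_scan,
    scan_until: unanchored, pos_after_scan_until: s.pos }
end

p anchored_vs_unanchored('test string', /str/)
```

With this input the anchored attempt fails and leaves the pointer where it was, while the unanchored attempt consumes everything up to and including the match.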

Method Categories


There are other methods besides the plain scanners. You can look ahead in the string without actually scanning. You can access the most recent match. You can modify the string being scanned, reset or terminate the scanner, find out or change the position of the scan pointer, skip ahead, and so on.

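### Advancing the Scan Pointer

  • #getch
  • #scan
  • #scan_until
  • #skip
  • #skip_until

### Looking Ahead

  • #check
  • #check_until
  • #exist?
  • #match?
  • #peek

### Finding Where we Are

  • #beginning_of_line? (#bol?)
  • #eos?
  • #rest_size
  • #pos

### Setting Where we Are

  • #reset
  • #terminate
  • #pos=

### Match Data

  • #matched
  • #matched?
  • #matched_size
  • #[]
  • #pre_match
  • #post_match

### Miscellaneous

  • #<<
  • #concat
  • #string
  • #unscan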

There are aliases to several of the methods.

Constant Summary

INSPECT_LENGTH = 5

Constructor Details

#initialize(string) ⇒ UnicodeScanner

Creates a new UnicodeScanner object to scan over the given `string`.

Parameters:

  • string (String)

    The string to iterate over.



# File 'lib/unicode_scanner.rb', line 111

def initialize(string)
  @string   = string
  @matches  = nil
  @matched  = false
  @current  = 0
  @previous = 0
end

Instance Attribute Details

#string ⇒ String

Returns the string being scanned.

Returns:

  • (String)

    The string being scanned.



# File 'lib/unicode_scanner.rb', line 574

def string
  @string
end

Instance Method Details

#[](n) ⇒ String?

Return the nth subgroup in the most recent match.

Examples:

s = UnicodeScanner.new("Fri Dec 12 1975 14:39")
s.scan(/(\w+) (\w+) (\d+) /)       # -> "Fri Dec 12 "
s[0]                               # -> "Fri Dec 12 "
s[1]                               # -> "Fri"
s[2]                               # -> "Dec"
s[3]                               # -> "12"
s.post_match                       # -> "1975 14:39"
s.pre_match                        # -> ""

Parameters:

  • n (Fixnum)

    The index of the subgroup to return.

Returns:

  • (String, nil)

    The subgroup, if it exists.



# File 'lib/unicode_scanner.rb', line 152

def [](n)
  @matched ? @matches[n] : nil
end

#beginning_of_line? ⇒ true, false Also known as: bol?

Returns `true` iff the scan pointer is at the beginning of the line.

Examples:

s = UnicodeScanner.new("test\ntest\n")
s.bol?           # => true
s.scan(/te/)
s.bol?           # => false
s.scan(/st\n/)
s.bol?           # => true
s.terminate
s.bol?           # => true

Returns:

  • (true, false)

    `true` iff the scan pointer is at the beginning of the line.



# File 'lib/unicode_scanner.rb', line 169

def beginning_of_line?
  return nil if @current > @string.size
  return true if @current.zero?

  return @string[@current - 1] == "\n"
end
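
The check performed by `beginning_of_line?` can be restated as a standalone function. This is a sketch only: `line_start?` is a hypothetical name, and it takes the string and position as arguments instead of reading scanner state.

```ruby
# Hypothetical helper mirroring the logic shown above: a position is at the
# beginning of a line if it is 0 or if the preceding character is a newline.
def line_start?(string, pos)
  return nil if pos > string.size   # position past the end of the string
  return true if pos.zero?

  string[pos - 1] == "\n"
end

text = "test\ntest\n"
p line_start?(text, 0)   # start of the string
p line_start?(text, 2)   # middle of the first line
p line_start?(text, 5)   # just after the first "\n"
```

Note that a position at the very end of a string that ends in a newline also counts as a line start, which is why the documented example reports `true` after `terminate`.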

#check(pattern) ⇒ String?

This returns the value that #scan would return, without advancing the scan pointer. The match register is affected, though.

Mnemonic: it “checks” to see whether a #scan will return a value.

Examples:

s = UnicodeScanner.new("Fri Dec 12 1975 14:39")
s.check /Fri/               # -> "Fri"
s.pos                       # -> 0
s.matched                   # -> "Fri"
s.check /12/                # -> nil
s.matched                   # -> nil

Parameters:

  • pattern (Regexp)

    The pattern to scan for.

Returns:

  • (String, nil)

    The matched segment, if matched.



# File 'lib/unicode_scanner.rb', line 194

def check(pattern)
  do_scan pattern, false, true, true
end
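
One common use of `check` is to decide which kind of token comes next before consuming it. The sketch below assumes UnicodeScanner follows the interface of Ruby's standard `StringScanner`, which is used here as a stand-in so the snippet runs on its own. `next_token` is a hypothetical helper, not part of this library.

```ruby
require 'strscan'

# Stand-in: StringScanner is assumed to match UnicodeScanner's interface here.
# Hypothetical helper: looks ahead with check to classify the upcoming token,
# then consumes it with scan. Returns nil once the input is exhausted.
def next_token(scanner)
  scanner.skip(/\s+/)               # discard leading whitespace, if any
  return nil if scanner.eos?

  if scanner.check(/\d+/)           # look ahead without moving the pointer
    [:number, scanner.scan(/\d+/)]
  else
    [:word, scanner.scan(/\S+/)]
  end
end

s = StringScanner.new('Dec 12 1975')
tokens = []
while (token = next_token(s))
  tokens << token
end
p tokens
```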

#check_until(pattern) ⇒ String?

This returns the value that #scan_until would return, without advancing the scan pointer. The match register is affected, though.

Mnemonic: it “checks” to see whether a #scan_until will return a value.

Examples:

s = UnicodeScanner.new("Fri Dec 12 1975 14:39")
s.check_until /12/          # -> "Fri Dec 12"
s.pos                       # -> 0
s.matched                   # -> "12"

Parameters:

  • pattern (Regexp)

    The pattern to scan until reaching.

Returns:

  • (String, nil)

    The matched segment, if matched.



# File 'lib/unicode_scanner.rb', line 212

def check_until(pattern)
  do_scan pattern, false, true, false
end

#concat(str) ⇒ Object Also known as: <<

Appends `str` to the string being scanned. This method does not affect the scan pointer.

Examples:

s = UnicodeScanner.new("Fri Dec 12 1975 14:39")
s.scan(/Fri /)
s << " +1000 GMT"
s.string            # -> "Fri Dec 12 1975 14:39 +1000 GMT"
s.scan(/Dec/)       # -> "Dec"

Parameters:

  • str (String)

    The string to append.



# File 'lib/unicode_scanner.rb', line 131

def concat(str)
  @string.concat str
end

#eos? ⇒ true, false

Returns `true` if the scan pointer is at the end of the string.

Examples:

s = UnicodeScanner.new('test string')
p s.eos?          # => false
s.scan(/test/)
p s.eos?          # => false
s.terminate
p s.eos?          # => true

Returns:

  • (true, false)

    `true` if the scan pointer is at the end of the string.



# File 'lib/unicode_scanner.rb', line 226

def eos?
  @current >= @string.length
end

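#exist?(pattern) ⇒ Fixnum?

Looks ahead to see if the `pattern` exists anywhere in the string, without advancing the scan pointer. Returns the number of characters from the scan pointer to the end of the match, or `nil` if there is no match. This predicts whether a #scan_until will return a value.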

Examples:

s = UnicodeScanner.new('test string')
s.exist? /s/            # -> 3
s.scan /test/           # -> "test"
s.exist? /s/            # -> 2
s.exist? /e/            # -> nil

Parameters:

  • pattern (Regexp)

    The pattern to search for.

Returns:

  • (Fixnum, nil)

    The distance from the scan pointer to the end of the match, or `nil` if the pattern does not exist ahead.



# File 'lib/unicode_scanner.rb', line 244

def exist?(pattern)
  do_scan pattern, false, false, false
end

#getch ⇒ String?

Scans one character and returns it, or returns `nil` if the scan pointer is at the end of the string.

Examples:

s = UnicodeScanner.new("ab")
s.getch           # => "a"
s.getch           # => "b"
s.getch           # => nil

$KCODE = 'EUC'
s = UnicodeScanner.new("\244\242")
s.getch           # => "\244\242"   # Japanese hira-kana "A" in EUC-JP
s.getch           # => nil

Returns:

  • (String, nil)

    The character, or `nil` at the end of the string.



# File 'lib/unicode_scanner.rb', line 263

def getch
  return nil if eos?

  do_scan(/./u, true, true, true)
end

#inspect ⇒ String

Returns a string that represents the UnicodeScanner object, showing:

  • the current position

  • the size of the string

  • the characters surrounding the scan pointer

Examples:

s = UnicodeScanner.new("Fri Dec 12 1975 14:39")
s.inspect # -> '#<UnicodeScanner 0/21 @ "Fri D...">'
s.scan_until /12/ # -> "Fri Dec 12"
s.inspect # -> '#<UnicodeScanner 10/21 "...ec 12" @ " 1975...">'

Returns:

  • (String)

    A description of this object.



# File 'lib/unicode_scanner.rb', line 283

def inspect
  return "#<#{self.class} (uninitialized)>" if @string.nil?
  return "#<#{self.class} fin>" if eos?

  if @current.zero?
    return format("#<%{class} %<cur>d/%<len>d @ %{after}>",
                  class: self.class.to_s,
                  cur:   @current,
                  len:   @string.length,
                  after: inspect_after.inspect)
  end

  format("#<%{class} %<cur>d/%<len>d %{before} @ %{after}>",
         class:  self.class.to_s,
         cur:    @current,
         len:    @string.length,
         before: inspect_before.inspect,
         after:  inspect_after.inspect)
end
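
The helpers `inspect_before` and `inspect_after` are not shown on this page. The sketch below is a guess at how such windows could be computed, inferred only from the example output above and from `INSPECT_LENGTH = 5`. The names `window_before` and `window_after`, and the handling of short strings, are assumptions.

```ruby
# Hypothetical: the characters just before the scan pointer, at most `len`
# of them, prefixed with "..." when earlier text has been cut off.
def window_before(string, pos, len = 5)
  return string[0, pos] if pos <= len

  "..." + string[pos - len, len]
end

# Hypothetical: the characters just after the scan pointer, at most `len`
# of them, suffixed with "..." when later text has been cut off.
def window_after(string, pos, len = 5)
  rest = string[pos, string.length]
  return rest if rest.length <= len

  rest[0, len] + "..."
end

text = 'Fri Dec 12 1975 14:39'
p window_before(text, 10)
p window_after(text, 10)
```

For position 10 these windows agree with the fragments shown in the `inspect` example, before the surrounding quotes are added.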

#match?(pattern) ⇒ Fixnum?

Tests whether the given `pattern` is matched from the current scan pointer. Returns the length of the match, or `nil`. The scan pointer is not advanced.

Examples:

s = UnicodeScanner.new('test string')
p s.match?(/\w+/)   # -> 4
p s.match?(/\w+/)   # -> 4
p s.match?(/\s+/)   # -> nil

Parameters:

  • pattern (Regexp)

    The pattern to match with.

Returns:

  • (Fixnum, nil)

    The length of the match, or `nil` if the pattern is not matched at the scan pointer.



# File 'lib/unicode_scanner.rb', line 315

def match?(pattern)
  do_scan pattern, false, false, true
end

#matched ⇒ String?

Returns the last matched string.

Examples:

s = UnicodeScanner.new('test string')
s.match?(/\w+/)     # -> 4
s.matched           # -> "test"

Returns:

  • (String, nil)

    The last matched string.



# File 'lib/unicode_scanner.rb', line 325

def matched
  return nil unless @matched

  @matches[0]
end

#matched? ⇒ true, false

Returns `true` iff the last match was successful.

Examples:

s = UnicodeScanner.new('test string')
s.match?(/\w+/)     # => 4
s.matched?          # => true
s.match?(/\d+/)     # => nil
s.matched?          # => false

Returns:

  • (true, false)

    `true` iff the last match was successful.



# File 'lib/unicode_scanner.rb', line 339

def matched?() @matched end

#matched_size ⇒ Fixnum?

Returns the size of the most recent match (see #matched), or `nil` if there was no recent match.

Examples:

s = UnicodeScanner.new('test string')
s.check /\w+/           # -> "test"
s.matched_size          # -> 4
s.check /\d+/           # -> nil
s.matched_size          # -> nil

Returns:

  • (Fixnum, nil)

    The size of the most recent match (see #matched), or `nil` if there was no recent match.



# File 'lib/unicode_scanner.rb', line 350

def matched_size
  return nil unless @matched

  @matches.end(0) - @matches.begin(0)
end

#peek(len) ⇒ String

Extracts a string corresponding to `string[pos, len]`, without advancing the scan pointer.

Examples:

s = UnicodeScanner.new('test string')
s.peek(7)          # => "test st"
s.peek(7)          # => "test st"

Parameters:

  • len (Fixnum)

    The number of characters ahead to peek.

Returns:

  • (String)

    The string after the current position.



# File 'lib/unicode_scanner.rb', line 367

def peek(len)
  return "" if eos?

  @string[@current, len]
end

#pos ⇒ Fixnum Also known as: pointer

Returns the position of the scan pointer, measured in characters. In the ‘reset’ position, this value is zero. In the ‘terminated’ position (i.e. the string is exhausted), this value is the length of the string.

In short, it’s a 0-based index into the string.

Examples:

s = UnicodeScanner.new('test string')
s.pos               # -> 0
s.scan_until /str/  # -> "test str"
s.pos               # -> 8
s.terminate         # -> #<UnicodeScanner fin>
s.pos               # -> 11

Returns:

  • (Fixnum)

    The current scan position.



# File 'lib/unicode_scanner.rb', line 389

def pos() @current end

#pos=(n) ⇒ Object

Sets the position of the scan pointer, measured in characters. A negative value counts back from the end of the string.

Examples:

s = UnicodeScanner.new('test string')
s.pos = 7            # -> 7
s.rest               # -> "ring"

Parameters:

  • n (Fixnum)

    The new position.

Raises:

  • (RangeError)

    If the position falls outside the string.


# File 'lib/unicode_scanner.rb', line 402

def pos=(n)
  n += @string.length if n.negative?
  raise RangeError, "index out of range" if n.negative?
  raise RangeError, "index out of range" if n > @string.length

  @current = n
end
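
The bounds handling in `pos=` can be read in isolation as a small function. This is a sketch under stated assumptions: `normalize_position` is a hypothetical name, and it receives the string length as an argument instead of reading it from the scanner.

```ruby
# Hypothetical standalone version of the checks performed by pos=.
# Negative positions count back from the end; anything that still falls
# outside 0..length is rejected.
def normalize_position(n, length)
  n += length if n.negative?
  raise RangeError, 'index out of range' if n.negative? || n > length

  n
end

p normalize_position(7, 11)    # already in range
p normalize_position(-4, 11)   # counted back from the end
```

A position equal to the length is accepted, since that is the ‘terminated’ position.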

#post_match ⇒ String

Returns the _post-match_ (in the regular expression sense) of the last scan.

Examples:

s = UnicodeScanner.new('test string')
s.scan(/\w+/)           # -> "test"
s.scan(/\s+/)           # -> " "
s.pre_match             # -> "test"
s.post_match            # -> "string"

Returns:

  • (String)

    The _post-match_ (in the regular expression sense) of the last scan.



# File 'lib/unicode_scanner.rb', line 419

def post_match
  return nil unless @matched

  @string[@previous + @matches.end(0), @string.length]
end

#pre_match ⇒ String

Returns the _pre-match_ (in the regular expression sense) of the last scan.

Examples:

s = UnicodeScanner.new('test string')
s.scan(/\w+/)           # -> "test"
s.scan(/\s+/)           # -> " "
s.pre_match             # -> "test"
s.post_match            # -> "string"

Returns:

  • (String)

    The _pre-match_ (in the regular expression sense) of the last scan.



# File 'lib/unicode_scanner.rb', line 434

def pre_match
  return nil unless @matched

  @string[0, @previous + @matches.begin(0)]
end

#reset ⇒ Object

Resets the scan pointer to index 0 and clears matching data.



# File 'lib/unicode_scanner.rb', line 442

def reset
  @current = 0
  @matched = false
end

#rest ⇒ String

Returns the “rest” of the string (i.e. everything after the scan pointer). If there is no more data (`eos?` is `true`), it returns `""`.

Returns:

  • (String)

    The “rest” of the string (i.e. everything after the scan pointer). If there is no more data (`eos?` is `true`), it returns `""`.



# File 'lib/unicode_scanner.rb', line 450

def rest
  return "" if eos?

  return @string[@current, @string.length]
end

#rest_size ⇒ Fixnum

Returns the value returned by `s.rest.size`.

Returns:

  • (Fixnum)

    The value returned by `s.rest.size`.



# File 'lib/unicode_scanner.rb', line 458

def rest_size
  return 0 if eos?

  @string.length - @current
end
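
Neither `rest` nor `rest_size` has an example above, so here is a short sketch. It assumes UnicodeScanner follows the interface of Ruby's standard `StringScanner`, which is used as a stand-in so the snippet runs without this library. `remaining` is a hypothetical helper.

```ruby
require 'strscan'

# Stand-in: StringScanner is assumed to match UnicodeScanner's interface here.
# Hypothetical helper: consumes `pattern` from the start of `text`, then
# reports what is left and how long it is.
def remaining(text, pattern)
  s = StringScanner.new(text)
  s.scan(pattern)
  [s.rest, s.rest_size]
end

p remaining('test string', /\w+/)   # part of the string is left
p remaining('test string', /.*/)    # everything was consumed
```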

#scan(pattern) ⇒ String?

Tries to match with `pattern` at the current position. If there’s a match, the scanner advances the “scan pointer” and returns the matched string. Otherwise, the scanner returns `nil`.

Examples:

s = UnicodeScanner.new('test string')
p s.scan(/\w+/)   # -> "test"
p s.scan(/\w+/)   # -> nil
p s.scan(/\s+/)   # -> " "
p s.scan(/\w+/)   # -> "string"
p s.scan(/./)     # -> nil

Parameters:

  • pattern (Regexp)

    The pattern to match.

Returns:

  • (String, nil)

    The string that was matched, if a match was found.



# File 'lib/unicode_scanner.rb', line 479

def scan(pattern)
  do_scan pattern, true, true, true
end

#scan_full(pattern, advance_pointer, return_string) ⇒ String, ...

Tests whether the given `pattern` is matched from the current scan pointer. Advances the scan pointer if `advance_pointer` is `true`. Returns the matched string if `return_string` is `true`. The match register is affected.

“full” means “scan with full parameters”.

Parameters:

  • pattern (Regexp)

    The pattern to scan.

  • advance_pointer (true, false)

    Whether to advance the scan pointer if a match is found.

  • return_string (true, false)

    Whether to return the matched segment.

Returns:

  • (String, Fixnum, nil)

    The matched segment if `return_string` is `true`, otherwise the number of characters advanced. `nil` if nothing matched.



# File 'lib/unicode_scanner.rb', line 497

def scan_full(pattern, advance_pointer, return_string)
  do_scan pattern, advance_pointer, return_string, true
end
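
The four flag combinations accepted by `scan_full` can be compared side by side. The sketch below assumes UnicodeScanner follows the interface of Ruby's standard `StringScanner`, whose `scan_full` takes the same three arguments; it is used as a stand-in so the snippet runs on its own. `scan_full_outcomes` is a hypothetical helper.

```ruby
require 'strscan'

# Stand-in: StringScanner is assumed to match UnicodeScanner's interface here.
# Hypothetical helper: runs scan_full once per flag combination, each time on
# a fresh scanner, and records [result, position afterwards].
def scan_full_outcomes(text, pattern)
  flag_pairs = [[true, true], [true, false], [false, true], [false, false]]
  flag_pairs.map do |advance_pointer, return_string|
    s = StringScanner.new(text)
    result = s.scan_full(pattern, advance_pointer, return_string)
    [[advance_pointer, return_string], [result, s.pos]]
  end.to_h
end

scan_full_outcomes('test string', /\w+/).each do |flags, outcome|
  puts "#{flags.inspect} => #{outcome.inspect}"
end
```

Judging by the source listings on this page, #scan, #skip, #check and #match? pass the same flag pairs to `do_scan`, in that order.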

#scan_until(pattern) ⇒ String?

Scans the string until the `pattern` is matched. Returns the substring up to and including the end of the match, advancing the scan pointer to that location. If there is no match, `nil` is returned.

Examples:

s = UnicodeScanner.new("Fri Dec 12 1975 14:39")
s.scan_until(/1/)        # -> "Fri Dec 1"
s.pre_match              # -> "Fri Dec "
s.scan_until(/XYZ/)      # -> nil

Parameters:

  • pattern (Regexp)

    The pattern to match.

Returns:

  • (String, nil)

    The segment that matched.



# File 'lib/unicode_scanner.rb', line 514

def scan_until(pattern)
  do_scan pattern, true, true, false
end

#search_full(pattern, advance_pointer, return_string) ⇒ String, ...

Scans the string _until_ the `pattern` is matched. Advances the scan pointer if `advance_pointer` is `true`, otherwise not. Returns the matched string if `return_string` is `true`, otherwise returns the number of characters advanced. This method does affect the match register.

Parameters:

  • pattern (Regexp)

    The pattern to scan.

  • advance_pointer (true, false)

    Whether to advance the scan pointer if a match is found.

  • return_string (true, false)

    Whether to return the matched segment.

Returns:

  • (String, Fixnum, nil)

    The matched segment if `return_string` is `true`, otherwise the number of characters advanced. `nil` if nothing matched.



# File 'lib/unicode_scanner.rb', line 531

def search_full(pattern, advance_pointer, return_string)
  do_scan pattern, advance_pointer, return_string, false
end
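
The relationship between `search_full` and the named unanchored methods can be checked mechanically. This sketch assumes UnicodeScanner follows the interface of Ruby's standard `StringScanner`, which is used as a stand-in so the snippet runs without this library. `search_full_equivalents` is a hypothetical helper, and the flag pairs are taken from the `do_scan` calls in the source listings on this page.

```ruby
require 'strscan'

# Stand-in: StringScanner is assumed to match UnicodeScanner's interface here.
# Hypothetical helper: for each named method, checks that it returns the same
# value and leaves the pointer in the same place as search_full called with
# the corresponding [advance_pointer, return_string] flags.
def search_full_equivalents(text, pattern)
  pairs = {
    scan_until:  [true,  true],
    skip_until:  [true,  false],
    check_until: [false, true],
    exist?:      [false, false]
  }
  pairs.map do |name, (advance_pointer, return_string)|
    named = StringScanner.new(text)
    full  = StringScanner.new(text)
    same_result = named.public_send(name, pattern) ==
                  full.search_full(pattern, advance_pointer, return_string)
    [name, same_result && named.pos == full.pos]
  end.to_h
end

p search_full_equivalents('test string', /str/)
```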

#skip(pattern) ⇒ Fixnum?

Attempts to skip over the given `pattern` beginning with the scan pointer. If it matches, the scan pointer is advanced to the end of the match, and the length of the match is returned. Otherwise, `nil` is returned.

It’s similar to #scan, but without returning the matched string.

Examples:

s = UnicodeScanner.new('test string')
p s.skip(/\w+/)   # -> 4
p s.skip(/\w+/)   # -> nil
p s.skip(/\s+/)   # -> 1
p s.skip(/\w+/)   # -> 6
p s.skip(/./)     # -> nil

Parameters:

  • pattern (Regexp)

    The pattern to match.

Returns:

  • (Fixnum, nil)

    The number of characters advanced, if matched.



# File 'lib/unicode_scanner.rb', line 552

def skip(pattern)
  do_scan pattern, true, false, true
end

#skip_until(pattern) ⇒ Fixnum?

Advances the scan pointer until `pattern` is matched and consumed. Returns the number of characters advanced, or `nil` if no match was found.

It’s similar to #scan_until, but without returning the intervening string.

Parameters:

  • pattern (Regexp)

    The pattern to match.

Returns:

  • (Fixnum, nil)

    The number of characters advanced, if matched.



# File 'lib/unicode_scanner.rb', line 568

def skip_until(pattern)
  do_scan pattern, true, false, false
end
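
As `skip_until` has no example above, here is a minimal sketch of one use: jumping past a delimiter and reading what follows. It assumes UnicodeScanner follows the interface of Ruby's standard `StringScanner`, which is used as a stand-in so the snippet runs on its own. `text_after` is a hypothetical helper.

```ruby
require 'strscan'

# Stand-in: StringScanner is assumed to match UnicodeScanner's interface here.
# Hypothetical helper: returns the text that follows the first occurrence of
# `delimiter`, or nil when the delimiter does not occur.
def text_after(text, delimiter)
  s = StringScanner.new(text)
  advanced = s.skip_until(delimiter)   # number of characters skipped, or nil
  return nil if advanced.nil?

  s.rest
end

p text_after('Fri Dec 12 1975 14:39', /\d{4} /)
p text_after('Fri Dec 12', /\d{4} /)
```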

#terminate ⇒ Object Also known as: clear

Set the scan pointer to the end of the string and clear matching data.



# File 'lib/unicode_scanner.rb', line 589

def terminate
  @current = @string.length
  @matched = false
  self
end

#unscan ⇒ Object

Set the scan pointer to the previous position. Only one previous position is remembered, and it changes with each scanning operation.

Examples:

s = UnicodeScanner.new('test string')
s.scan(/\w+/)        # => "test"
s.unscan
s.scan(/../)         # => "te"
s.scan(/\d/)         # => nil
s.unscan             # ScanError: unscan failed: previous match record not exist

Raises:

  • (ScanError)

    If there is no successful previous match to undo.



# File 'lib/unicode_scanner.rb', line 607

def unscan
  raise ScanError, "unscan failed: previous match record not exist" unless @matched

  @current = @previous
  @matched = false
  self
end
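
A typical use of `unscan` is speculative scanning: consume something, inspect it, and put it back if it is not what was wanted. The sketch below assumes UnicodeScanner follows the interface of Ruby's standard `StringScanner`, which is used as a stand-in so the snippet runs without this library. `scan_keyword` is a hypothetical helper.

```ruby
require 'strscan'

# Stand-in: StringScanner is assumed to match UnicodeScanner's interface here.
# Hypothetical helper: consumes the next word only if it is one of `keywords`;
# otherwise rewinds the scan pointer and returns nil.
def scan_keyword(scanner, keywords)
  word = scanner.scan(/\w+/)
  return nil if word.nil?              # nothing matched, nothing to undo
  return word if keywords.include?(word)

  scanner.unscan                       # not a keyword: restore the old position
  nil
end

s = StringScanner.new('if x')
p scan_keyword(s, %w[if else])
s.skip(/\s+/)
p scan_keyword(s, %w[if else])
p s.pos
```

The early return on a failed scan matters, because calling `unscan` after an unsuccessful match raises an error.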