Class: UnicodeScanner
- Inherits: Object
- Defined in: lib/unicode_scanner.rb
Overview
UnicodeScanner provides Unicode-aware lexical scanning operations on a `String`. Here is an example of its usage:
```ruby
s = UnicodeScanner.new('This is an example string')
s.eos?               # -> false

p s.scan(/\w+/)      # -> "This"
p s.scan(/\w+/)      # -> nil
p s.scan(/\s+/)      # -> " "
p s.scan(/\s+/)      # -> nil
p s.scan(/\w+/)      # -> "is"
s.eos?               # -> false

p s.scan(/\s+/)      # -> " "
p s.scan(/\w+/)      # -> "an"
p s.scan(/\s+/)      # -> " "
p s.scan(/\w+/)      # -> "example"
p s.scan(/\s+/)      # -> " "
p s.scan(/\w+/)      # -> "string"
s.eos?               # -> true

p s.scan(/\s+/)      # -> nil
p s.scan(/\w+/)      # -> nil
```
Scanning a string means remembering the position of a _scan pointer_, which is just an index. The point of scanning is to move forward a bit at a time, so matches are sought after the scan pointer; usually immediately after it.
Given the string “test string”, here are the pertinent scan pointer positions:
```
  t e s t   s t r i n g
0 1 2 ...             1
                      0
```
When you #scan for a pattern (a regular expression), the match must occur at the character after the scan pointer. If you use #scan_until, then the match can occur anywhere after the scan pointer. In both cases, the scan pointer moves _just beyond_ the last character of the match, ready to scan again from the next character onwards. This is demonstrated by the example above.
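The difference between the anchored #scan and the searching #scan_until can be seen in a short sketch. It uses StringScanner from Ruby's standard library as a stand-in, on the assumption that UnicodeScanner exposes the same methods and behaves the same way for ASCII input. The helper name `anchored_vs_search` is made up for illustration.

```ruby
require 'strscan' # stdlib stand-in for UnicodeScanner (assumption)

# Contrast an anchored #scan with a searching #scan_until on the same input.
def anchored_vs_search(text)
  s = StringScanner.new(text)
  anchored = s.scan(/string/)   # must match right at the scan pointer
  pos_after_scan = s.pos        # unchanged, because the scan failed
  found = s.scan_until(/str/)   # may match anywhere ahead of the pointer
  { anchored: anchored, pos_after_scan: pos_after_scan,
    found: found, pos: s.pos, rest: s.rest }
end

p anchored_vs_search('test string')
# anchored is nil, found is "test str", pos is 8, rest is "ing"
```

The failed #scan leaves the pointer at 0, while #scan_until returns everything up to the end of the match and leaves the pointer just beyond it.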
Method Categories
There are other methods besides the plain scanners. You can look ahead in the string without actually scanning. You can access the most recent match. You can modify the string being scanned, reset or terminate the scanner, find out or change the position of the scan pointer, skip ahead, and so on.
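### Advancing the Scan Pointer
- #getch
- #scan
- #scan_until
- #skip
- #skip_until
### Looking Ahead
- #check
- #check_until
- #exist?
- #match?
- #peek
### Finding Where we Are
- #beginning_of_line? (#bol?)
- #eos?
- #rest
- #rest_size
- #pos
### Setting Where we Are
- #reset
- #terminate
- #pos=
### Match Data
- #matched
- #matched?
- #matched_size
- #[]
- #pre_match
- #post_match
### Miscellaneous
- #concat (#<<)
- #string
- #unscan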
There are aliases to several of the methods.
Constant Summary
- INSPECT_LENGTH = 5
Instance Attribute Summary
- #string ⇒ String
  The string being scanned.
Instance Method Summary
- #[](n) ⇒ String?
  Return the nth subgroup in the most recent match.
- #beginning_of_line? ⇒ true, false (also: #bol?)
  `true` iff the scan pointer is at the beginning of the line.
- #check(pattern) ⇒ String?
  This returns the value that #scan would return, without advancing the scan pointer.
- #check_until(pattern) ⇒ String?
  This returns the value that #scan_until would return, without advancing the scan pointer.
- #concat(str) ⇒ Object (also: #<<)
  Appends `str` to the string being scanned.
- #eos? ⇒ true, false
  `true` if the scan pointer is at the end of the string.
- #exist?(pattern) ⇒ true, false
  Looks ahead to see if the `pattern` exists anywhere in the string, without advancing the scan pointer.
- #getch ⇒ String
  Scans one character and returns it.
- #initialize(string) ⇒ UnicodeScanner (constructor)
  Creates a new UnicodeScanner object to scan over the given `string`.
- #inspect ⇒ String
  Returns a string that represents the UnicodeScanner object.
- #match?(pattern) ⇒ true, false
  Tests whether the given `pattern` is matched from the current scan pointer.
- #matched ⇒ String?
  The last matched string.
- #matched? ⇒ true, false
  `true` iff the last match was successful.
- #matched_size ⇒ Fixnum?
  The size of the most recent match (see #matched), or `nil` if there was no recent match.
- #peek(len) ⇒ String
  Extracts a string corresponding to `string[pos, len]`, without advancing the scan pointer.
- #pos ⇒ Fixnum (also: #pointer)
  Returns the position of the scan pointer, counted in characters.
- #pos=(n) ⇒ Object
  Set the position of the scan pointer, counted in characters.
- #post_match ⇒ String
  The _post-match_ (in the regular expression sense) of the last scan.
- #pre_match ⇒ String
  The _pre-match_ (in the regular expression sense) of the last scan.
- #reset ⇒ Object
  Reset the scan pointer (index 0) and clear matching data.
- #rest ⇒ String
  The "rest" of the string (i.e. everything after the scan pointer).
- #rest_size ⇒ Fixnum
  The value returned by `s.rest.size`.
- #scan(pattern) ⇒ String?
  Tries to match with `pattern` at the current position.
- #scan_full(pattern, advance_pointer, return_string) ⇒ String, ...
  Tests whether the given `pattern` is matched from the current scan pointer.
- #scan_until(pattern) ⇒ String?
  Scans the string until the `pattern` is matched.
- #search_full(pattern, advance_pointer, return_string) ⇒ String, ...
  Scans the string _until_ the `pattern` is matched.
- #skip(pattern) ⇒ Fixnum?
  Attempts to skip over the given `pattern` beginning with the scan pointer.
- #skip_until(pattern) ⇒ Fixnum?
  Advances the scan pointer until `pattern` is matched and consumed.
- #terminate ⇒ Object (also: #clear)
  Set the scan pointer to the end of the string and clear matching data.
- #unscan ⇒ Object
  Set the scan pointer to the previous position.
Constructor Details
#initialize(string) ⇒ UnicodeScanner
Creates a new UnicodeScanner object to scan over the given `string`.
```ruby
# File 'lib/unicode_scanner.rb', line 111
def initialize(string)
  @string = string
  @matches = nil
  @matched = false
  @current = 0
  @previous = 0
end
```
Instance Attribute Details
#string ⇒ String
Returns the string being scanned.
```ruby
# File 'lib/unicode_scanner.rb', line 574
def string
  @string
end
```
Instance Method Details
#[](n) ⇒ String?
Return the nth subgroup in the most recent match.
```ruby
# File 'lib/unicode_scanner.rb', line 152
def [](n)
  @matched ? @matches[n] : nil
end
```
#beginning_of_line? ⇒ true, false
Also known as: bol?
Returns `true` iff the scan pointer is at the beginning of the line.
```ruby
# File 'lib/unicode_scanner.rb', line 169
def beginning_of_line?
  return nil if @current > @string.size
  return true if @current.zero?

  return @string[@current - 1] == "\n"
end
```
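As a rough illustration of #beginning_of_line?, the sketch below collects the positions at which a line starts. It relies on Ruby's standard-library StringScanner as a stand-in, assuming UnicodeScanner behaves identically for ASCII input; `line_starts` is a hypothetical helper, not part of this class.

```ruby
require 'strscan' # stdlib stand-in for UnicodeScanner (assumption)

# Collect every scan pointer position that sits at the start of a line.
def line_starts(text)
  s = StringScanner.new(text)
  starts = []
  until s.eos?
    starts << s.pos if s.bol?
    # Jump past the next newline, or to the end if there is none left.
    s.scan_until(/\n/) || s.terminate
  end
  starts
end

p line_starts("ab\ncd\n") # -> [0, 3]
```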
#check(pattern) ⇒ String?
This returns the value that #scan would return, without advancing the scan pointer.
```ruby
# File 'lib/unicode_scanner.rb', line 194
def check(pattern)
  do_scan pattern, false, true, true
end
```
#check_until(pattern) ⇒ String?
This returns the value that #scan_until would return, without advancing the scan pointer. The match register is affected, though.
Mnemonic: it “checks” to see whether a #scan_until will return a value.
```ruby
# File 'lib/unicode_scanner.rb', line 212
def check_until(pattern)
  do_scan pattern, false, true, false
end
```
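A minimal sketch of the look-ahead behavior of #check and #check_until follows. It uses StringScanner from Ruby's standard library in place of UnicodeScanner, assuming the two agree for ASCII input; `lookahead_demo` is an illustrative name only.

```ruby
require 'strscan' # stdlib stand-in for UnicodeScanner (assumption)

# Show that #check and #check_until report matches without moving the pointer.
def lookahead_demo(text)
  s = StringScanner.new(text)
  peeked = s.check(/\w+/)        # what #scan would return
  pos_after_check = s.pos        # pointer has not moved
  upto = s.check_until(/=/)      # what #scan_until would return
  register = s.matched           # the match register is still updated
  pos_after_check_until = s.pos  # pointer has still not moved
  scanned = s.scan(/\w+/)        # now actually consume the word
  { peeked: peeked, pos_after_check: pos_after_check, upto: upto,
    register: register, pos_after_check_until: pos_after_check_until,
    scanned: scanned, pos: s.pos }
end

p lookahead_demo('key = value')
# peeked is "key", upto is "key =", register is "=", final pos is 3
```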
#concat(str) ⇒ Object
Also known as: <<
Appends `str` to the string being scanned. This method does not affect the scan pointer.
```ruby
# File 'lib/unicode_scanner.rb', line 131
def concat(str)
  @string.concat str
end
```
#eos? ⇒ true, false
Returns `true` if the scan pointer is at the end of the string.
```ruby
# File 'lib/unicode_scanner.rb', line 226
def eos?
  @current >= @string.length
end
```
#exist?(pattern) ⇒ true, false
Looks ahead to see if the `pattern` exists anywhere in the string, without advancing the scan pointer. This predicts whether a #scan_until will return a value.
```ruby
# File 'lib/unicode_scanner.rb', line 244
def exist?(pattern)
  do_scan pattern, false, false, false
end
```
#getch ⇒ String
Scans one character and returns it.
```ruby
# File 'lib/unicode_scanner.rb', line 263
def getch
  return nil if eos?

  do_scan(/./u, true, true, true)
end
```
#inspect ⇒ String
Returns a string that represents the UnicodeScanner object, showing:
- the current position
- the size of the string
- the characters surrounding the scan pointer
```ruby
# File 'lib/unicode_scanner.rb', line 283
def inspect
  return "#<#{self.class} (uninitialized)>" if @string.nil?
  return "#<#{self.class} fin>" if eos?

  if @current.zero?
    return format("#<%{class} %<cur>d/%<len>d @ %{after}>",
                  class: self.class.to_s,
                  cur: @current,
                  len: @string.length,
                  after: inspect_after.inspect)
  end

  format("#<%{class} %<cur>d/%<len>d %{before} @ %{after}>",
         class: self.class.to_s,
         cur: @current,
         len: @string.length,
         before: inspect_before.inspect,
         after: inspect_after.inspect)
end
```
#match?(pattern) ⇒ true, false
Tests whether the given `pattern` is matched from the current scan pointer. Returns the length of the match, or `nil`. The scan pointer is not advanced.
```ruby
# File 'lib/unicode_scanner.rb', line 315
def match?(pattern)
  do_scan pattern, false, false, true
end
```
#matched ⇒ String?
Returns the last matched string.
```ruby
# File 'lib/unicode_scanner.rb', line 325
def matched
  return nil unless @matched

  @matches[0]
end
```
#matched? ⇒ true, false
Returns `true` iff the last match was successful.
```ruby
# File 'lib/unicode_scanner.rb', line 339
def matched?() @matched end
```
#matched_size ⇒ Fixnum?
Returns the size of the most recent match (see #matched), or `nil` if there was no recent match.
```ruby
# File 'lib/unicode_scanner.rb', line 350
def matched_size
  return nil unless @matched

  @matches.end(0) - @matches.begin(0)
end
```
#peek(len) ⇒ String
Extracts a string corresponding to `string[pos, len]`, without advancing the scan pointer.
```ruby
# File 'lib/unicode_scanner.rb', line 367
def peek(len)
  return "" if eos?

  @string[@current, len]
end
```
#pos ⇒ Fixnum
Also known as: pointer
Returns the position of the scan pointer, counted in characters. In the 'reset' position, this value is zero. In the 'terminated' position (i.e. the string is exhausted), this value is the length of the string, since #terminate sets the pointer to `@string.length`.
In short, it's a 0-based index into the string.
```ruby
# File 'lib/unicode_scanner.rb', line 389
def pos() @current end
```
#pos=(n) ⇒ Object
Set the position of the scan pointer, counted in characters. A negative `n` counts back from the end of the string; a value outside the string raises `RangeError`.
```ruby
# File 'lib/unicode_scanner.rb', line 402
def pos=(n)
  n += @string.length if n.negative?
  raise RangeError, "index out of range" if n.negative?
  raise RangeError, "index out of range" if n > @string.length

  @current = n
end
```
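The bounds handling in #pos= can be isolated into a standalone function for experimentation. `normalize_position` is a hypothetical helper, not part of UnicodeScanner; it only repeats the checks visible in the source of #pos=.

```ruby
# Hypothetical helper mirroring the checks in #pos=: negative values count
# back from the end, and anything outside 0..length is rejected.
def normalize_position(n, length)
  n += length if n.negative?
  raise RangeError, 'index out of range' if n.negative? || n > length

  n
end

p normalize_position(-1, 11) # -> 10
p normalize_position(11, 11) # -> 11 (the end of the string is a valid position)
```

Note that `length` itself is accepted, which corresponds to the terminated position.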
#post_match ⇒ String
Returns the _post-match_ (in the regular expression sense) of the last scan.
```ruby
# File 'lib/unicode_scanner.rb', line 419
def post_match
  return nil unless @matched

  @string[@previous + @matches.end(0), @string.length]
end
```
#pre_match ⇒ String
Returns the _pre-match_ (in the regular expression sense) of the last scan.
```ruby
# File 'lib/unicode_scanner.rb', line 434
def pre_match
  return nil unless @matched

  @string[0, @previous + @matches.begin(0)]
end
```
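To show how the match-data accessors relate to each other, here is a small sketch. It runs against Ruby's standard-library StringScanner, assuming UnicodeScanner gives the same results for ASCII input; `match_data_demo` is a made-up helper name.

```ruby
require 'strscan' # stdlib stand-in for UnicodeScanner (assumption)

# Inspect the match register after two successful scans and one failed scan.
def match_data_demo(text)
  s = StringScanner.new(text)
  s.scan(/\w+/) # consumes the first word
  s.scan(/\s+/) # consumes the whitespace; this is now the "last match"
  data = { matched: s.matched, size: s.matched_size,
           pre: s.pre_match, post: s.post_match, ok: s.matched? }
  s.scan(/\d+/) # fails, so the match register is cleared
  data.merge(after_failure: s.matched)
end

p match_data_demo('test string')
# matched is " ", size is 1, pre is "test", post is "string",
# ok is true, after_failure is nil
```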
#reset ⇒ Object
Reset the scan pointer (index 0) and clear matching data.
```ruby
# File 'lib/unicode_scanner.rb', line 442
def reset
  @current = 0
  @matched = false
end
```
#rest ⇒ String
Returns the "rest" of the string (i.e. everything after the scan pointer). If there is no more data (`eos?` is `true`), it returns `""`.
```ruby
# File 'lib/unicode_scanner.rb', line 450
def rest
  return "" if eos?

  return @string[@current, @string.length]
end
```
#rest_size ⇒ Fixnum
Returns the value returned by `s.rest.size`.
```ruby
# File 'lib/unicode_scanner.rb', line 458
def rest_size
  return 0 if eos?

  @string.length - @current
end
```
#scan(pattern) ⇒ String?
Tries to match with `pattern` at the current position. If there's a match, the scanner advances the "scan pointer" and returns the matched string. Otherwise, the scanner returns `nil`.
```ruby
# File 'lib/unicode_scanner.rb', line 479
def scan(pattern)
  do_scan pattern, true, true, true
end
```
#scan_full(pattern, advance_pointer, return_string) ⇒ String, ...
Tests whether the given `pattern` is matched from the current scan pointer. Advances the scan pointer if `advance_pointer` is `true`. Returns the matched string if `return_string` is `true`. The match register is affected.
"full" means "scan with full parameters".
```ruby
# File 'lib/unicode_scanner.rb', line 497
def scan_full(pattern, advance_pointer, return_string)
  do_scan pattern, advance_pointer, return_string, true
end
```
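The four flag combinations of #scan_full line up with #scan, #skip, #check and #match?, as the `do_scan` calls in their sources suggest. The sketch below tabulates them using StringScanner from Ruby's standard library, on the assumption that UnicodeScanner matches it for ASCII input; `scan_full_modes` is a hypothetical helper.

```ruby
require 'strscan' # stdlib stand-in for UnicodeScanner (assumption)

# Run #scan_full with every flag combination on a fresh scanner and record
# [advance_pointer, return_string, result, pointer position afterwards].
def scan_full_modes(text, pattern)
  [[true, true], [true, false], [false, true], [false, false]].map do |advance, as_string|
    s = StringScanner.new(text)
    result = s.scan_full(pattern, advance, as_string)
    [advance, as_string, result, s.pos]
  end
end

scan_full_modes('test string', /\w+/).each { |row| p row }
# [true, true, "test", 4]    behaves like #scan
# [true, false, 4, 4]        behaves like #skip
# [false, true, "test", 0]   behaves like #check
# [false, false, 4, 0]       behaves like #match?
```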
#scan_until(pattern) ⇒ String?
Scans the string until the `pattern` is matched. Returns the substring up to and including the end of the match, advancing the scan pointer to that location. If there is no match, `nil` is returned.
```ruby
# File 'lib/unicode_scanner.rb', line 514
def scan_until(pattern)
  do_scan pattern, true, true, false
end
```
#search_full(pattern, advance_pointer, return_string) ⇒ String, ...
Scans the string _until_ the `pattern` is matched. Advances the scan pointer if `advance_pointer` is `true`, otherwise not. Returns the matched string if `return_string` is `true`, otherwise returns the number of characters advanced. This method does affect the match register.
```ruby
# File 'lib/unicode_scanner.rb', line 531
def search_full(pattern, advance_pointer, return_string)
  do_scan pattern, advance_pointer, return_string, false
end
```
#skip(pattern) ⇒ Fixnum?
Attempts to skip over the given `pattern` beginning with the scan pointer. If it matches, the scan pointer is advanced to the end of the match, and the length of the match is returned. Otherwise, `nil` is returned.
It's similar to #scan, but without returning the matched string.
```ruby
# File 'lib/unicode_scanner.rb', line 552
def skip(pattern)
  do_scan pattern, true, false, true
end
```
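A typical use of #skip is discarding separators between tokens. The following is a minimal sketch of that pattern, written against Ruby's standard-library StringScanner as a stand-in (assuming UnicodeScanner behaves the same for ASCII input); the `words` helper is invented for this example.

```ruby
require 'strscan' # stdlib stand-in for UnicodeScanner (assumption)

# Split text into whitespace-separated words using #skip and #scan.
def words(text)
  s = StringScanner.new(text)
  tokens = []
  until s.eos?
    s.skip(/\s+/)          # discard whitespace; returns nil if there is none
    word = s.scan(/\S+/)   # consume one run of non-whitespace characters
    tokens << word if word
  end
  tokens
end

p words('  This is  an example ') # -> ["This", "is", "an", "example"]
```

Each pass through the loop consumes at least one character, so the loop terminates even when the input ends in whitespace.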
#skip_until(pattern) ⇒ Fixnum?
Advances the scan pointer until `pattern` is matched and consumed, leaving the pointer at the end of the match. Returns the number of characters advanced, or `nil` if no match was found.
It's similar to #scan_until, but without returning the intervening string.
```ruby
# File 'lib/unicode_scanner.rb', line 568
def skip_until(pattern)
  do_scan pattern, true, false, false
end
```
#terminate ⇒ Object
Also known as: clear
Set the scan pointer to the end of the string and clear matching data.
```ruby
# File 'lib/unicode_scanner.rb', line 589
def terminate
  @current = @string.length
  @matched = false
  self
end
```
#unscan ⇒ Object
Set the scan pointer to the previous position. Only one previous position is remembered, and it changes with each scanning operation.
```ruby
# File 'lib/unicode_scanner.rb', line 607
def unscan
  raise ScanError, "unscan failed: previous match record not exist" unless @matched

  @current = @previous
  @matched = false
  self
end
```
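The effect of #unscan can be sketched as follows. The example uses StringScanner from Ruby's standard library as a stand-in, assuming UnicodeScanner behaves the same way for ASCII input; `unscan_demo` is an illustrative name, and the error case is left out because the stand-in raises its own error class rather than the `ScanError` shown above.

```ruby
require 'strscan' # stdlib stand-in for UnicodeScanner (assumption)

# Scan a word, rewind with #unscan, and scan the same word again.
def unscan_demo(text)
  s = StringScanner.new(text)
  first = s.scan(/\w+/)
  pos_before = s.pos       # just beyond the first word
  s.unscan                 # rewind to where the last scan started
  pos_after = s.pos
  cleared = !s.matched?    # #unscan also clears the match data
  again = s.scan(/\w+/)    # the same word is matched a second time
  { first: first, pos_before: pos_before, pos_after: pos_after,
    cleared: cleared, again: again }
end

p unscan_demo('test string')
# first is "test", pos_before is 4, pos_after is 0, again is "test"
```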