Class: URIPattern::Tokenizer

Inherits:
Object
  • Object
show all
Defined in:
lib/uri_pattern/tokenizer.rb

Defined Under Namespace

Classes: Token

Constant Summary collapse

IDENTIFIER_RE =

A “:name” identifier follows the spec’s “regexIdentifierStart” / “regexIdentifierPart” (path-to-regex-modified):

start = /[$_\p{ID_Start}]/u,  part = /[$_‌‍\p{ID_Continue}]/u

In Ruby “_”, ZWNJ and ZWJ are already in pID_Continue (and “$” is not), while “_” is not in pID_Start; so the start class adds “$” and “_” and the part class adds only “$”. Matching the spec here (rather than a permissive “[u80-u10FFFF]”) makes e.g. “:$foo” a name and rejects a name starting with a non-ID_Start code point (e.g. “:🚲”), as the reference does.

/\A[$_\p{ID_Start}][$\p{ID_Continue}]*/u

Instance Method Summary collapse

Constructor Details

#initialize(pattern, policy: :lenient) ⇒ Tokenizer

Returns a new instance of Tokenizer.



17
18
19
20
21
22
# File 'lib/uri_pattern/tokenizer.rb', line 17

def initialize(pattern, policy: :lenient)
  @pattern = pattern
  @policy = policy
  @index = 0
  @tokens = []
end

Instance Method Details

#tokenizeObject



24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
# File 'lib/uri_pattern/tokenizer.rb', line 24

def tokenize
  while @index < @pattern.length
    ch = @pattern[@index]

    case ch
    when "\\"
      if @index + 1 < @pattern.length
        emit(:escaped_char, @pattern[@index + 1])
        @index += 2
      else
        handle_invalid("trailing backslash")
      end
    when "{"
      emit(:open, ch)
      @index += 1
    when "}"
      emit(:close, ch)
      @index += 1
    when "("
      # Lex the whole "(...)" group atomically into one :regexp token, as the
      # spec's tokenizer does (validating it during the scan).
      scan_regexp_group
    when ")"
      # A ")" not consumed by a group scan is a literal character (the spec's
      # tokenizer falls through to a CHAR token here).
      emit(:char, ch)
      @index += 1
    when "*"
      prev = @tokens.last
      if prev && %i[close regexp name asterisk].include?(prev.type)
        emit(:other_modifier, ch)
      else
        emit(:asterisk, ch)
      end
      @index += 1
    when "?", "+"
      # "?"/"+" are always modifier tokens. A modifier that does not follow a
      # group/name/regexp/wildcard is a dangling modifier; the compiler rejects
      # it. (A literal "?"/"+" must be escaped, e.g. "\\?".)
      emit(:other_modifier, ch)
      @index += 1
    when ":"
      rest = @pattern[(@index + 1)..]
      if (m = IDENTIFIER_RE.match(rest))
        emit(:name, m[0])
        @index += 1 + m[0].length
      else
        # ":" must be followed by a valid name. When it is not, the spec's
        # tokenizer reports "missing parameter name": strict tokenizing (used
        # when compiling a component) raises, while lenient tokenizing
        # (constructor string parsing) emits an :invalid_char so the ":" is
        # still recognized as a protocol/password/port delimiter by the
        # constructor string parser (which treats :invalid_char as a
        # non-special char, like :char).
        handle_invalid("missing parameter name")
      end
    else
      emit(:char, ch)
      @index += 1
    end
  end
  emit(:end, "")
  @tokens
end