Module: URIPattern::URLParser

Defined in:
lib/uri_pattern/url_parser.rb

Constant Summary

WHATWG_SCHEME =

Indices in the array returned by URI::WhatwgParser#split:

scheme, userinfo, host, port, nil, path, opaque_path, query, fragment
0
WHATWG_USERINFO =
1
WHATWG_HOST =
2
WHATWG_PORT =
3
WHATWG_PATH =
5
WHATWG_OPAQUE_PATH =
6
WHATWG_QUERY =
7
WHATWG_FRAGMENT =
8
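As a rough illustration of how these index constants are meant to be used, the sketch below reads components out of a hand-written array laid out in the order listed above. The `sample` array is an assumption made up for this example; it is not output captured from URI::WhatwgParser#split, and the element types (for instance, whether the port is an Integer) may differ in practice.

```ruby
# Index constants, copied from the summary above.
WHATWG_SCHEME      = 0
WHATWG_USERINFO    = 1
WHATWG_HOST        = 2
WHATWG_PORT        = 3
WHATWG_PATH        = 5
WHATWG_OPAQUE_PATH = 6
WHATWG_QUERY       = 7
WHATWG_FRAGMENT    = 8

# Hypothetical split result for "https://user:pw@example.com:8080/a/b?q=1#top".
# Slot 4 is the unused nil entry from the list above.
sample = ["https", "user:pw", "example.com", 8080, nil, "/a/b", nil, "q=1", "top"]

host = sample[WHATWG_HOST]
# Prefer the hierarchical path and fall back to the opaque path.
path = sample[WHATWG_PATH] || sample[WHATWG_OPAQUE_PATH]
```

Named indices keep call sites readable and make the gap at index 4 explicit instead of relying on positional destructuring.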
DEFAULT_PORTS =
{
  "http"  => 80,
  "https" => 443,
  "ws"    => 80,
  "wss"   => 443,
  "ftp"   => 21
}.freeze
SPECIAL_SCHEMES_SET =
Set.new(%w[http https ws wss ftp file]).freeze
DUMMY_URL =

“Dummy URL” canonicalization of a fixed pattern run

The WHATWG URLPattern spec canonicalizes each fixed-text part of a pattern by running it through a throwaway (“dummy”) URL, so the URL parser applies the exact spec percent-encode set and (for pathname) dot-segment handling. We delegate here instead of maintaining encode-set tables by hand, which both simplifies the code and tracks the spec precisely.

DUMMY_URL is the spec’s “create a dummy URL” input verbatim (urlpattern.spec.whatwg.org/: “Let dummyInput be "https://dummy.invalid/"”).

"https://dummy.invalid/"

Class Method Summary

Class Method Details

.canonicalize_hash_run(run) ⇒ Object



# File 'lib/uri_pattern/url_parser.rb', line 179

def canonicalize_hash_run(run)
  u = dummy_url
  u.fragment = run
  u.fragment.to_s
rescue => e
  raise URIPattern::Error, "Invalid hash #{run.inspect}: #{e.message}"
end

.canonicalize_password_run(run) ⇒ Object



# File 'lib/uri_pattern/url_parser.rb', line 195

def canonicalize_password_run(run)
  u = dummy_url
  u.password = run
  u.password.to_s
rescue => e
  raise URIPattern::Error, "Invalid password #{run.inspect}: #{e.message}"
end

.canonicalize_pathname_run(run, opaque_path: false) ⇒ Object

“canonicalize a pathname” / “canonicalize an opaque pathname”: run the fixed text through a dummy URL via full parsing (so “#”/“?” terminate the path and dot segments collapse, matching the polyfill). A non-opaque run that is not “/”-prefixed gets the spec’s “/-” prefix trick so a leading “../” is preserved rather than collapsed against the root.



# File 'lib/uri_pattern/url_parser.rb', line 208

def canonicalize_pathname_run(run, opaque_path: false)
  return run if run.empty?
  if opaque_path
    parsed = URI::WhatwgParser.new.split("data:#{run}")
    (parsed[WHATWG_OPAQUE_PATH] || parsed[WHATWG_PATH]).to_s
  else
    lead = run.start_with?("/")
    modified = lead ? run : "/-#{run}"
    # Append the run as the dummy URL's path. The run supplies its own leading
    # "/", so drop DUMMY_URL's trailing slash before joining. Parsing the whole
    # URL (rather than resolving the run against DUMMY_URL as a base) keeps a
    # leading "//" a path instead of an authority, and lets "#"/"?" terminate.
    parsed = URI::WhatwgParser.new.split(DUMMY_URL.chomp("/") + modified)
    pathname = parsed[WHATWG_PATH].to_s
    lead ? pathname : pathname.sub(%r{\A/-}, "")
  end
rescue => e
  raise URIPattern::Error, "Invalid pathname #{run.inspect}: #{e.message}"
end
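The “/-” prefix trick can be seen in isolation with a small sketch. It assumes a deliberately simplified stand-in for the URL parser, `collapse_dot_segments`, which is hypothetical and only removes "." and ".." segments; it does no percent-encoding, does not treat "?" or "#" specially, and does not reproduce the parser's trailing-slash behavior. `canonicalize_relative_sketch` is likewise a made-up name that mirrors the non-opaque branch above.

```ruby
# Hypothetical stand-in for the parser's path handling: collapses "." and ".."
# segments of an absolute path and nothing else.
def collapse_dot_segments(path)
  out = []
  path.split("/", -1).drop(1).each do |seg|
    case seg
    when "."  then next      # "." contributes nothing
    when ".." then out.pop   # ".." removes the previous segment, if any
    else out << seg
    end
  end
  "/" + out.join("/")
end

# Mirrors the non-opaque branch: prefix "/-" when the run is not "/"-prefixed,
# canonicalize, then strip the prefix again.
def canonicalize_relative_sketch(run)
  lead = run.start_with?("/")
  modified = lead ? run : "/-#{run}"
  pathname = collapse_dot_segments(modified)
  lead ? pathname : pathname.sub(%r{\A/-}, "")
end

without_trick = collapse_dot_segments("/../foo")        # leading ".." is lost
with_trick    = canonicalize_relative_sketch("../foo")  # leading ".." survives
```

With the prefix, the first segment becomes "-..", which is not a dot segment, so it is left alone and the original "../" reappears once "/-" is stripped.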

.canonicalize_protocol_input(value) ⇒ Object

“canonicalize a protocol” on a match input: a scheme is ASCII, starts with a letter, and contains only letters, digits, “+”, “-” and “.”. A value with any other code point (e.g. “café”) cannot be a protocol, so matching fails.



# File 'lib/uri_pattern/url_parser.rb', line 144

def canonicalize_protocol_input(value)
  return "" if value.empty?
  return nil unless value.match?(/\A[a-zA-Z][a-zA-Z0-9+.\-]*\z/)
  value.downcase
end
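A short usage sketch of the three possible outcomes follows. The method body is copied from the listing above and defined at top level only so the snippet runs on its own; in the library it is a class method of URIPattern::URLParser.

```ruby
# Copied from the listing above so the example is self-contained.
def canonicalize_protocol_input(value)
  return "" if value.empty?
  return nil unless value.match?(/\A[a-zA-Z][a-zA-Z0-9+.\-]*\z/)
  value.downcase
end

empty    = canonicalize_protocol_input("")         # empty input passes through as ""
lowered  = canonicalize_protocol_input("HTTPS")    # valid scheme, lowercased
custom   = canonicalize_protocol_input("web+app")  # "+", "-" and "." are allowed
non_ascii = canonicalize_protocol_input("café")    # non-ASCII code point: nil
digit_first = canonicalize_protocol_input("1http") # must start with a letter: nil
```

Returning nil rather than raising lets the caller treat an impossible protocol as a failed match instead of an error.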

.canonicalize_search_run(run) ⇒ Object

“canonicalize a search” / “…hash” / “…username” / “…password”: the polyfill sets the corresponding URL component and reads it back. The uri-whatwg_parser setters run the basic URL parser with the matching state override and apply the spec encode sets (special-query for search, userinfo for username/password, etc.).



# File 'lib/uri_pattern/url_parser.rb', line 171

def canonicalize_search_run(run)
  u = dummy_url
  u.query = run
  u.query.to_s
rescue => e
  raise URIPattern::Error, "Invalid search #{run.inspect}: #{e.message}"
end

.canonicalize_username_run(run) ⇒ Object



# File 'lib/uri_pattern/url_parser.rb', line 187

def canonicalize_username_run(run)
  u = dummy_url
  u.user = run
  u.user.to_s
rescue => e
  raise URIPattern::Error, "Invalid username #{run.inspect}: #{e.message}"
end

.dummy_url ⇒ Object



# File 'lib/uri_pattern/url_parser.rb', line 162

def dummy_url
  URI::WhatwgParser.new.parse(DUMMY_URL)
end

.normalize_hash_input(hash) ⇒ Object

Normalize each component of a hash input through WHATWG URL rules. Returns nil if the protocol or port fails normalization; keys other than the eight known components are passed through as strings.



# File 'lib/uri_pattern/url_parser.rb', line 103

def normalize_hash_input(hash)
  protocol = hash[:protocol].to_s.downcase
  # Opaque path: non-special scheme, no username/password/hostname/port set
  opaque_path = !protocol.empty? && !SPECIAL_SCHEMES_SET.include?(protocol) &&
                (hash[:hostname].nil? || hash[:hostname].to_s.empty?) &&
                (hash[:username].nil? || hash[:username].to_s.empty?) &&
                (hash[:password].nil? || hash[:password].to_s.empty?) &&
                (hash[:port].nil? || hash[:port].to_s.empty?)
  result = {}
  hash.each do |k, v|
    result[k] = case k
    when :protocol
      norm = canonicalize_protocol_input(v.to_s)
      return nil if norm.nil?
      norm
    when :port
      norm = normalize_port_input(v.to_s, protocol)
      return nil if norm.nil?
      norm
    when :pathname
      canonicalize_pathname_run(v.to_s, opaque_path: opaque_path)
    when :hostname
      normalize_hostname_input(v.to_s)
    when :username
      canonicalize_username_run(v.to_s)
    when :password
      canonicalize_password_run(v.to_s)
    when :query
      canonicalize_search_run(v.to_s)
    when :fragment
      canonicalize_hash_run(v.to_s)
    else
      v.to_s
    end
  end
  result
end
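The opaque-path decision at the top of this method can be read as a standalone predicate. The sketch below restates it under that assumption; `blank_component?` and `opaque_path_input?` are hypothetical names introduced for illustration and are not part of the module.

```ruby
require "set"

SPECIAL_SCHEMES_SET = Set.new(%w[http https ws wss ftp file]).freeze

# Hypothetical helper: true when a component is nil or empty after to_s.
def blank_component?(value)
  value.nil? || value.to_s.empty?
end

# Hypothetical predicate mirroring the opaque_path expression above:
# a non-empty, non-special scheme with no hostname, username, password or port.
def opaque_path_input?(hash)
  protocol = hash[:protocol].to_s.downcase
  !protocol.empty? && !SPECIAL_SCHEMES_SET.include?(protocol) &&
    %i[hostname username password port].all? { |key| blank_component?(hash[key]) }
end

data_input    = opaque_path_input?(protocol: "data", pathname: "text/plain")
https_input   = opaque_path_input?(protocol: "https", pathname: "/a")
with_host     = opaque_path_input?(protocol: "foo", hostname: "example.com")
no_protocol   = opaque_path_input?(pathname: "/a")
```

Only the first input is treated as opaque; a special scheme, any authority component, or a missing protocol each select the hierarchical pathname branch.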

.normalize_hostname_input(hostname) ⇒ Object

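Normalize a hostname: strip CR, LF, and tab characters, then run the result through the WHATWG host parser (which handles IDN). If parsing fails, the stripped hostname is returned unchanged.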



# File 'lib/uri_pattern/url_parser.rb', line 92

def normalize_hostname_input(hostname)
  return "" if hostname.nil? || hostname.empty?
  h = hostname.gsub(/[\r\n\t]/, "")
  return "" if h.empty?
  URI::WhatwgParser.new.split("https://#{h}/")[WHATWG_HOST] || h
rescue
  h
end

.normalize_port_input(port_str, protocol = "") ⇒ Object

Normalize a port string for use as a match input component. Strips tabs and form feeds, takes the leading digits, and suppresses the scheme's default port (returning ""). Returns nil if a non-empty port string has no leading digits, or if the port exceeds 65535.



# File 'lib/uri_pattern/url_parser.rb', line 80

def normalize_port_input(port_str, protocol = "")
  port = port_str.to_s.gsub(/[\t\f]/, "")
  digits = port.match(/\A\d*/)[0]
  return nil if digits.empty? && !port.empty?
  return nil if digits.length > 0 && digits.to_i > 65535
  default = DEFAULT_PORTS[protocol.to_s.downcase]
  default && default.to_s == digits ? "" : digits
end
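The following sketch exercises each return path. The method and DEFAULT_PORTS are copied from this page and defined at top level only so the snippet is self-contained.

```ruby
DEFAULT_PORTS = {
  "http"  => 80,
  "https" => 443,
  "ws"    => 80,
  "wss"   => 443,
  "ftp"   => 21
}.freeze

# Copied from the listing above.
def normalize_port_input(port_str, protocol = "")
  port = port_str.to_s.gsub(/[\t\f]/, "")
  digits = port.match(/\A\d*/)[0]
  return nil if digits.empty? && !port.empty?
  return nil if digits.length > 0 && digits.to_i > 65535
  default = DEFAULT_PORTS[protocol.to_s.downcase]
  default && default.to_s == digits ? "" : digits
end

default_port  = normalize_port_input("443", "https")   # default port is suppressed
explicit_port = normalize_port_input("8080", "https")  # non-default port is kept
trailing_junk = normalize_port_input("80abc", "http")  # only leading digits count
no_digits     = normalize_port_input("abc", "http")    # no leading digits: nil
out_of_range  = normalize_port_input("70000", "http")  # above 65535: nil
empty_port    = normalize_port_input("", "http")       # empty stays empty
```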

.resolve(relative, base_url) ⇒ Object



# File 'lib/uri_pattern/url_parser.rb', line 31

def resolve(relative, base_url)
  URI::WhatwgParser.new.parse(relative, base: base_url).to_s
rescue => e
  raise URIPattern::Error, "Failed to resolve URL: #{e.message}"
end

.split_components(url, base_url: nil) ⇒ Object



# File 'lib/uri_pattern/url_parser.rb', line 10

def split_components(url, base_url: nil)
  url = resolve(url, base_url) if base_url && !url.empty?
  parsed = URI::WhatwgParser.new.split(url)
  userinfo = parsed[WHATWG_USERINFO] || ""
  user, pass = userinfo.include?(":") ? userinfo.split(":", 2) : [userinfo, nil]
  {
    protocol: parsed[WHATWG_SCHEME] || "",
    username: user || "",
    password: pass || "",
    hostname: parsed[WHATWG_HOST] || "",
    port:     parsed[WHATWG_PORT] ? parsed[WHATWG_PORT].to_s : "",
    pathname: parsed[WHATWG_PATH] || parsed[WHATWG_OPAQUE_PATH] || "",
    query:    parsed[WHATWG_QUERY] || "",
    fragment: parsed[WHATWG_FRAGMENT] || ""
  }
rescue URIPattern::Error
  raise
rescue => e
  raise URIPattern::Error, "Failed to parse URL #{url.inspect}: #{e.message}"
end
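The userinfo handling is the one step here that does not depend on the parser, so it can be sketched on its own. `split_userinfo` is a hypothetical helper name; the body restates the two lines from the listing, including the `|| ""` defaults applied when the result hash is built.

```ruby
# Hypothetical helper restating the userinfo split from split_components.
def split_userinfo(userinfo)
  userinfo ||= ""
  # Limit 2 so that any further ":" stays inside the password.
  user, pass = userinfo.include?(":") ? userinfo.split(":", 2) : [userinfo, nil]
  { username: user || "", password: pass || "" }
end

both        = split_userinfo("alice:s3:cret")
user_only   = split_userinfo("alice")
empty_pass  = split_userinfo("alice:")
no_userinfo = split_userinfo(nil)
```

Splitting with a limit of 2 means only the first colon separates username from password, and an absent userinfo yields two empty strings rather than nil.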

.split_pattern(pattern) ⇒ Object

Parse a constructor string into its eight pattern components, following the WHATWG URLPattern “parse a constructor string” algorithm: urlpattern.spec.whatwg.org/#constructor-string-parsing

Returns a hash keyed by the eight component symbols. A component that does not appear in the input is left as nil so that defaults can be applied downstream.



# File 'lib/uri_pattern/url_parser.rb', line 43

def split_pattern(pattern)
  tokens = URIPattern::Tokenizer.new(pattern, policy: :lenient).tokenize
  raw = ConstructorStringParser.new(pattern, tokens).parse
  {
    protocol: raw[:protocol],
    username: raw[:username],
    password: raw[:password],
    hostname: raw[:hostname],
    port:     raw[:port],
    pathname: raw[:pathname],
    query:    raw[:search],
    fragment: raw[:hash]
  }
end
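The final hash literal renames two spec-style keys (search, hash) to this library's names (query, fragment) and leaves absent components as nil. A minimal sketch of that mapping follows; the `raw` hash is a hypothetical stand-in for the ConstructorStringParser result, and `KEY_MAP` is a name introduced only for this example.

```ruby
# Hypothetical parser output: spec-style keys, only some components present.
raw = { protocol: "https", hostname: "example.com", search: "q=*", hash: "top" }

# Spec-style key => key used in the returned hash (illustrative constant).
KEY_MAP = {
  protocol: :protocol,
  username: :username,
  password: :password,
  hostname: :hostname,
  port:     :port,
  pathname: :pathname,
  search:   :query,
  hash:     :fragment
}.freeze

# Missing components come back as nil so defaults can be applied later.
components = KEY_MAP.to_h { |spec_key, key| [key, raw[spec_key]] }
```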