Module: URIPattern::URLParser
- Defined in:
- lib/uri_pattern/url_parser.rb
Constant Summary collapse
- WHATWG_SCHEME =
Indices in the array returned by URI::WhatwgParser#split:
- scheme, userinfo, host, port, nil, path, opaque_path, query, fragment
0- WHATWG_USERINFO =
1- WHATWG_HOST =
2- WHATWG_PORT =
3- WHATWG_PATH =
5- WHATWG_OPAQUE_PATH =
6- WHATWG_QUERY =
7- WHATWG_FRAGMENT =
8- DEFAULT_PORTS =
{ "http" => 80, "https" => 443, "ws" => 80, "wss" => 443, "ftp" => 21 }.freeze
- SPECIAL_SCHEMES_SET =
Set.new(%w[http https ws wss ftp file]).freeze
- DUMMY_URL =
— “dummy URL” canonicalization of a fixed pattern run ——————–
The WHATWG URLPattern spec canonicalizes each fixed-text part of a pattern by running it through a throwaway (“dummy”) URL, so the URL parser applies the exact spec percent-encode set and (for pathname) dot-segment handling. We delegate here instead of maintaining encode-set tables by hand, which both simplifies the code and tracks the spec precisely.
DUMMY_URL is the spec’s “create a dummy URL” input verbatim (urlpattern.spec.whatwg.org/ — “Let dummyInput be ‘dummy.invalid/`”).
"https://dummy.invalid/"
Class Method Summary collapse
- .canonicalize_hash_run(run) ⇒ Object
- .canonicalize_password_run(run) ⇒ Object
-
.canonicalize_pathname_run(run, opaque_path: false) ⇒ Object
“canonicalize a pathname” / “canonicalize an opaque pathname”: run the fixed text through a dummy URL via full parsing (so “#”/“?” terminate the path and dot segments collapse, matching the polyfill).
-
.canonicalize_protocol_input(value) ⇒ Object
“canonicalize a protocol” on a match input: a scheme is ASCII, starts with a letter, and contains only letters, digits, “+”, “-” and “.”.
-
.canonicalize_search_run(run) ⇒ Object
“canonicalize a search” / “…hash” / “…username” / “…password”: the polyfill sets the corresponding URL component and reads it back.
- .canonicalize_username_run(run) ⇒ Object
- .dummy_url ⇒ Object
-
.normalize_hash_input(hash) ⇒ Object
Normalize a hash input through WHATWG URL rules for each component.
-
.normalize_hostname_input(hostname) ⇒ Object
Normalize a hostname: IDN, and strip CR/LF/tab.
-
.normalize_port_input(port_str, protocol = "") ⇒ Object
Normalize a port string for use as a match input component.
- .resolve(relative, base_url) ⇒ Object
- .split_components(url, base_url: nil) ⇒ Object
-
.split_pattern(pattern) ⇒ Object
Parse a constructor string into its eight pattern components, following the WHATWG URLPattern “parse a constructor string” algorithm: urlpattern.spec.whatwg.org/#constructor-string-parsing.
Class Method Details
.canonicalize_hash_run(run) ⇒ Object
179 180 181 182 183 184 185 |
# File 'lib/uri_pattern/url_parser.rb', line 179 def canonicalize_hash_run(run) u = dummy_url u.fragment = run u.fragment.to_s rescue => e raise URIPattern::Error, "Invalid hash #{run.inspect}: #{e.}" end |
.canonicalize_password_run(run) ⇒ Object
195 196 197 198 199 200 201 |
# File 'lib/uri_pattern/url_parser.rb', line 195 def canonicalize_password_run(run) u = dummy_url u.password = run u.password.to_s rescue => e raise URIPattern::Error, "Invalid password #{run.inspect}: #{e.}" end |
.canonicalize_pathname_run(run, opaque_path: false) ⇒ Object
“canonicalize a pathname” / “canonicalize an opaque pathname”: run the fixed text through a dummy URL via full parsing (so “#”/“?” terminate the path and dot segments collapse, matching the polyfill). A non-opaque run that is not “/”-prefixed gets the spec’s “/-” prefix trick so a leading “../” is preserved rather than collapsed against the root.
208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 |
# File 'lib/uri_pattern/url_parser.rb', line 208 def canonicalize_pathname_run(run, opaque_path: false) return run if run.empty? if opaque_path parsed = URI::WhatwgParser.new.split("data:#{run}") (parsed[WHATWG_OPAQUE_PATH] || parsed[WHATWG_PATH]).to_s else lead = run.start_with?("/") modified = lead ? run : "/-#{run}" # Append the run as the dummy URL's path. The run supplies its own leading # "/", so drop DUMMY_URL's trailing slash before joining. Parsing the whole # URL (rather than resolving the run against DUMMY_URL as a base) keeps a # leading "//" a path instead of an authority, and lets "#"/"?" terminate. parsed = URI::WhatwgParser.new.split(DUMMY_URL.chomp("/") + modified) pathname = parsed[WHATWG_PATH].to_s lead ? pathname : pathname.sub(%r{\A/-}, "") end rescue => e raise URIPattern::Error, "Invalid pathname #{run.inspect}: #{e.}" end |
.canonicalize_protocol_input(value) ⇒ Object
“canonicalize a protocol” on a match input: a scheme is ASCII, starts with a letter, and contains only letters, digits, “+”, “-” and “.”. A value with any other code point (e.g. “café”) cannot be a protocol, so matching fails.
144 145 146 147 148 |
# File 'lib/uri_pattern/url_parser.rb', line 144 def canonicalize_protocol_input(value) return "" if value.empty? return nil unless value.match?(/\A[a-zA-Z][a-zA-Z0-9+.\-]*\z/) value.downcase end |
.canonicalize_search_run(run) ⇒ Object
“canonicalize a search” / “…hash” / “…username” / “…password”: the polyfill sets the corresponding URL component and reads it back. The uri-whatwg_parser setters run the basic URL parser with the matching state override and apply the spec encode sets (special-query for search, userinfo for username/password, etc.).
171 172 173 174 175 176 177 |
# File 'lib/uri_pattern/url_parser.rb', line 171 def canonicalize_search_run(run) u = dummy_url u.query = run u.query.to_s rescue => e raise URIPattern::Error, "Invalid search #{run.inspect}: #{e.}" end |
.canonicalize_username_run(run) ⇒ Object
187 188 189 190 191 192 193 |
# File 'lib/uri_pattern/url_parser.rb', line 187 def canonicalize_username_run(run) u = dummy_url u.user = run u.user.to_s rescue => e raise URIPattern::Error, "Invalid username #{run.inspect}: #{e.}" end |
.dummy_url ⇒ Object
162 163 164 |
# File 'lib/uri_pattern/url_parser.rb', line 162 def dummy_url URI::WhatwgParser.new.parse(DUMMY_URL) end |
.normalize_hash_input(hash) ⇒ Object
Normalize a hash input through WHATWG URL rules for each component. Returns nil if a required component fails normalization.
103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 |
# File 'lib/uri_pattern/url_parser.rb', line 103 def normalize_hash_input(hash) protocol = hash[:protocol].to_s.downcase # Opaque path: non-special scheme, no username/password/hostname/port set opaque_path = !protocol.empty? && !SPECIAL_SCHEMES_SET.include?(protocol) && (hash[:hostname].nil? || hash[:hostname].to_s.empty?) && (hash[:username].nil? || hash[:username].to_s.empty?) && (hash[:password].nil? || hash[:password].to_s.empty?) && (hash[:port].nil? || hash[:port].to_s.empty?) result = {} hash.each do |k, v| result[k] = case k when :protocol norm = canonicalize_protocol_input(v.to_s) return nil if norm.nil? norm when :port norm = normalize_port_input(v.to_s, protocol) return nil if norm.nil? norm when :pathname canonicalize_pathname_run(v.to_s, opaque_path: opaque_path) when :hostname normalize_hostname_input(v.to_s) when :username canonicalize_username_run(v.to_s) when :password canonicalize_password_run(v.to_s) when :query canonicalize_search_run(v.to_s) when :fragment canonicalize_hash_run(v.to_s) else v.to_s end end result end |
.normalize_hostname_input(hostname) ⇒ Object
Normalize a hostname: IDN, and strip CR/LF/tab.
92 93 94 95 96 97 98 99 |
# File 'lib/uri_pattern/url_parser.rb', line 92 def normalize_hostname_input(hostname) return "" if hostname.nil? || hostname.empty? h = hostname.gsub(/[\r\n\t]/, "") return "" if h.empty? URI::WhatwgParser.new.split("https://#{h}/")[WHATWG_HOST] || h rescue h end |
.normalize_port_input(port_str, protocol = "") ⇒ Object
Normalize a port string for use as a match input component. Strips tabs, takes leading numeric digits, and suppresses the default port. Returns nil if the port string has no leading digits (parse failure).
80 81 82 83 84 85 86 87 |
# File 'lib/uri_pattern/url_parser.rb', line 80 def normalize_port_input(port_str, protocol = "") port = port_str.to_s.gsub(/[\t\f]/, "") digits = port.match(/\A\d*/)[0] return nil if digits.empty? && !port.empty? return nil if digits.length > 0 && digits.to_i > 65535 default = DEFAULT_PORTS[protocol.to_s.downcase] default && default.to_s == digits ? "" : digits end |
.resolve(relative, base_url) ⇒ Object
31 32 33 34 35 |
# File 'lib/uri_pattern/url_parser.rb', line 31 def resolve(relative, base_url) URI::WhatwgParser.new.parse(relative, base: base_url).to_s rescue => e raise URIPattern::Error, "Failed to resolve URL: #{e.}" end |
.split_components(url, base_url: nil) ⇒ Object
10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
# File 'lib/uri_pattern/url_parser.rb', line 10 def split_components(url, base_url: nil) url = resolve(url, base_url) if base_url && !url.empty? parsed = URI::WhatwgParser.new.split(url) userinfo = parsed[WHATWG_USERINFO] || "" user, pass = userinfo.include?(":") ? userinfo.split(":", 2) : [userinfo, nil] { protocol: parsed[WHATWG_SCHEME] || "", username: user || "", password: pass || "", hostname: parsed[WHATWG_HOST] || "", port: parsed[WHATWG_PORT] ? parsed[WHATWG_PORT].to_s : "", pathname: parsed[WHATWG_PATH] || parsed[WHATWG_OPAQUE_PATH] || "", query: parsed[WHATWG_QUERY] || "", fragment: parsed[WHATWG_FRAGMENT] || "" } rescue URIPattern::Error raise rescue => e raise URIPattern::Error, "Failed to parse URL #{url.inspect}: #{e.}" end |
.split_pattern(pattern) ⇒ Object
Parse a constructor string into its eight pattern components, following the WHATWG URLPattern “parse a constructor string” algorithm: urlpattern.spec.whatwg.org/#constructor-string-parsing
Returns a hash keyed by the eight component symbols. A component that does not appear in the input is left as nil so that defaults can be applied downstream.
43 44 45 46 47 48 49 50 51 52 53 54 55 56 |
# File 'lib/uri_pattern/url_parser.rb', line 43 def split_pattern(pattern) tokens = URIPattern::Tokenizer.new(pattern, policy: :lenient).tokenize raw = ConstructorStringParser.new(pattern, tokens).parse { protocol: raw[:protocol], username: raw[:username], password: raw[:password], hostname: raw[:hostname], port: raw[:port], pathname: raw[:pathname], query: raw[:search], fragment: raw[:hash] } end |