Module: Iriq::Parser

Defined in:
lib/iriq/parser.rb

Overview

Lightweight, Unicode-aware parser for URL/IRI/URN inputs.

Intentionally NOT a full RFC 3986 / 3987 / WHATWG URL implementation. We accept enough of the common shapes (URLs, scheme-less hosts, URNs, raw Unicode hosts and paths) to support normalization and clustering.

Constant Summary

SCHEME_RE =
/\A([a-zA-Z][a-zA-Z0-9+\-.]*):/.freeze
HOSTISH_RE =

Matches a host-ish first token before the first slash. We deliberately allow any non-ASCII character so IRIs work without punycode.

%r{
  \A
  (?<host>[^/?#\s:]+\.[^/?#\s:]+|localhost)  # something.something or localhost
  (?::(?<port>\d+))?
  (?<rest>[/?#].*)?
  \z
}x.freeze
DEFAULT_PORTS =
{
  "http"  => 80,
  "https" => 443,
  "ftp"   => 21,
  "ws"    => 80,
  "wss"   => 443,
}.freeze
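
To make the two patterns concrete, the sketch below copies SCHEME_RE and HOSTISH_RE from the summary above and runs them on a few sample inputs. The sample strings are assumptions chosen for illustration, not fixtures from the library.

```ruby
# Patterns copied from the constant summary above.
SCHEME_RE = /\A([a-zA-Z][a-zA-Z0-9+\-.]*):/.freeze
HOSTISH_RE = %r{
  \A
  (?<host>[^/?#\s:]+\.[^/?#\s:]+|localhost)  # something.something or localhost
  (?::(?<port>\d+))?
  (?<rest>[/?#].*)?
  \z
}x.freeze

# SCHEME_RE captures the scheme name in group 1, without the colon.
puts "HTTPS://example.com"[SCHEME_RE, 1]

# HOSTISH_RE exposes named captures; a raw Unicode host and path match as-is.
m = HOSTISH_RE.match("bücher.example:8080/päth?q=1")
puts m[:host], m[:port], m[:rest]

# A bare word without a dot is not host-like, unless it is "localhost".
puts HOSTISH_RE.match?("intranet")
puts HOSTISH_RE.match?("localhost:3000")
```

Note that the port capture is a String at this stage; the conversion to an Integer happens later, in parse_authority_url.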

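Class Method Summary

.parse(input) ⇒ Object
.parse_authority_url(original, scheme, remainder) ⇒ Object
.parse_query(query) ⇒ Object
.parse_urn(original, rest) ⇒ Object
.path_segments(path) ⇒ Object
.split_path_query_fragment(rest) ⇒ Object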

Class Method Details

.parse(input) ⇒ Object

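Raises:

ParseError: when input is nil, is not a String, is empty after stripping, or has neither a scheme nor a host-like prefix.
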
# File 'lib/iriq/parser.rb', line 30

def parse(input)
  raise ParseError, "input is nil" if input.nil?
  raise ParseError, "input must be a String" unless input.is_a?(String)

  stripped = input.strip
  raise ParseError, "input is empty" if stripped.empty?

  if (m = stripped.match(SCHEME_RE))
    scheme = m[1].downcase
    rest   = stripped[m[0].length..]

    if scheme == "urn"
      parse_urn(input, rest)
    elsif rest.start_with?("//")
      parse_authority_url(input, scheme, rest[2..])
    else
      # opaque scheme like mailto:foo@bar — keep nss, mark as urn-ish so we
      # don't pretend we know its host/path layout.
      Identifier.new(original: input, scheme: scheme, nss: rest, kind: :urn)
    end
  elsif HOSTISH_RE.match?(stripped)
    # No scheme, but it looks like a hostname: assume https.
    parse_authority_url(input, "https", stripped)
  else
    raise ParseError, "cannot parse #{input.inspect}: no scheme and no host-like prefix"
  end
end
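
The branch order in parse can be sketched as a standalone classifier. Here classify is a hypothetical helper written for this illustration and is not part of Iriq::Parser. It reuses the two patterns from the constant summary and returns a symbol naming the branch that parse would take, rather than building an Identifier.

```ruby
SCHEME_RE = /\A([a-zA-Z][a-zA-Z0-9+\-.]*):/.freeze
HOSTISH_RE = %r{
  \A
  (?<host>[^/?#\s:]+\.[^/?#\s:]+|localhost)
  (?::(?<port>\d+))?
  (?<rest>[/?#].*)?
  \z
}x.freeze

# Hypothetical helper: names the branch that the parse method above would take.
def classify(input)
  stripped = input.strip
  if (m = stripped.match(SCHEME_RE))
    rest = stripped[m[0].length..]
    return :urn if m[1].downcase == "urn"
    return :authority_url if rest.start_with?("//")

    :opaque
  elsif HOSTISH_RE.match?(stripped)
    :schemeless_host
  else
    :unparseable # parse raises ParseError in this case
  end
end

[
  "https://example.com/a",
  "urn:isbn:0451450523",
  "mailto:foo@bar",
  "example.com/a",
  "example.com:8080/a",
  "not a url",
].each { |s| puts "#{s.inspect} => #{classify(s)}" }
```

One consequence of checking SCHEME_RE first, as the source reads: a scheme-less host with a port, such as example.com:8080/a, matches the scheme pattern (with "example.com" as the scheme) and therefore lands in the opaque branch rather than the host-like one.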

.parse_authority_url(original, scheme, remainder) ⇒ Object

Raises:

ParseError: when no host can be matched at the start of remainder.

# File 'lib/iriq/parser.rb', line 66

def parse_authority_url(original, scheme, remainder)
  m = remainder.match(HOSTISH_RE) || remainder.match(%r{\A(?<host>[^/?#]+?)(?::(?<port>\d+))?(?<rest>[/?#].*)?\z})
  raise ParseError, "cannot parse authority from #{original.inspect}" unless m

  host = m[:host].downcase
  port = m[:port]&.to_i
  port = nil if port && DEFAULT_PORTS[scheme] == port

  rest = m[:rest] || ""
  path, query, fragment = split_path_query_fragment(rest)
  segments = path_segments(path)

  Identifier.new(
    original:      original,
    scheme:        scheme,
    host:          host,
    port:          port,
    path:          "/" + segments.join("/"),
    path_segments: segments,
    query:         query,
    query_params:  parse_query(query),
    fragment:      fragment,
    kind:          :url,
  )
end
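
A minimal sketch of the authority step, assuming only what the method body above shows: try HOSTISH_RE, fall back to the looser pattern, lowercase the host, and drop a port that equals the scheme's default. The names authority_parts and FALLBACK_RE are hypothetical (the source uses the fallback pattern inline), and the helper returns a plain array instead of an Identifier.

```ruby
HOSTISH_RE = %r{
  \A
  (?<host>[^/?#\s:]+\.[^/?#\s:]+|localhost)
  (?::(?<port>\d+))?
  (?<rest>[/?#].*)?
  \z
}x.freeze
# Hypothetical constant name for the inline fallback pattern in the source.
FALLBACK_RE = %r{\A(?<host>[^/?#]+?)(?::(?<port>\d+))?(?<rest>[/?#].*)?\z}.freeze
DEFAULT_PORTS = { "http" => 80, "https" => 443, "ftp" => 21, "ws" => 80, "wss" => 443 }.freeze

# Hypothetical helper: returns [host, port, rest] for the text after "scheme://".
def authority_parts(scheme, remainder)
  m = remainder.match(HOSTISH_RE) || remainder.match(FALLBACK_RE)
  raise ArgumentError, "no authority in #{remainder.inspect}" unless m

  port = m[:port]&.to_i
  port = nil if port && DEFAULT_PORTS[scheme] == port # default port is dropped
  [m[:host].downcase, port, m[:rest] || ""]
end

p authority_parts("https", "Example.COM:443/a") # => ["example.com", nil, "/a"]
p authority_parts("http", "example.com:8080")   # => ["example.com", 8080, ""]
p authority_parts("https", "intranet:8443/x")   # => ["intranet", 8443, "/x"]
```

The third call shows why the fallback exists: a single-label host such as intranet has no dot, so HOSTISH_RE rejects it and the looser pattern supplies the match.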

.parse_query(query) ⇒ Object



# File 'lib/iriq/parser.rb', line 130

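def parse_query(query)
  return {} if query.nil? || query.empty?

  query.split("&").each_with_object({}) do |pair, acc|
    # Split on the first "=" only, so a value may itself contain "=".
    k, v = pair.split("=", 2)
    # Skip empty pairs ("&&") and pairs with an empty key ("=x").
    next if k.nil? || k.empty?

    # A repeated key overwrites the earlier value; a key without "=" maps to nil.
    acc[k] = v
  end
end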
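
The edge cases of parse_query are easiest to see by running it. The sketch below repeats the method body shown above as a standalone method; the sample query string is an assumption chosen to hit each case.

```ruby
# Standalone copy of the method body from the listing above.
def parse_query(query)
  return {} if query.nil? || query.empty?

  query.split("&").each_with_object({}) do |pair, acc|
    k, v = pair.split("=", 2)
    next if k.nil? || k.empty?

    acc[k] = v
  end
end

result = parse_query("a=1&b&a=2&=x&c=&d=x=y&e=a%20b")
expected = {
  "a" => "2",     # repeated key: the last value wins
  "b" => nil,     # key without "=" maps to nil
  "c" => "",      # key with "=" but no value maps to ""
  "d" => "x=y",   # only the first "=" separates key and value
  "e" => "a%20b", # values are stored as written, with no percent-decoding
}
puts result == expected # prints "true"
```

Because the result is a flat Hash, callers that need every value of a repeated key would have to split the raw query string themselves.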

.parse_urn(original, rest) ⇒ Object

Raises:

ParseError: when rest (the text after "urn:") is nil or empty.

# File 'lib/iriq/parser.rb', line 60

def parse_urn(original, rest)
  raise ParseError, "urn missing namespace" if rest.nil? || rest.empty?

  Identifier.new(original: original, scheme: "urn", nss: rest, kind: :urn)
end

.path_segments(path) ⇒ Object

Apply dot-segment normalization (RFC 3986 §5.2.4, lightweight version) and drop empty segments from leading/trailing/duplicate slashes.



# File 'lib/iriq/parser.rb', line 112

def path_segments(path)
  return [] if path.nil? || path.empty? || path == "/"

  raw = path.sub(%r{\A/}, "").split("/")
  out = []
  raw.each do |seg|
    case seg
    when "", "."
      next
    when ".."
      out.pop
    else
      out << seg
    end
  end
  out
end
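
The normalization rules above can be checked with a standalone copy of the method. The sample paths are assumptions picked to cover dot segments, duplicate slashes, and a ".." that climbs past the root.

```ruby
# Standalone copy of the method body from the listing above.
def path_segments(path)
  return [] if path.nil? || path.empty? || path == "/"

  raw = path.sub(%r{\A/}, "").split("/")
  out = []
  raw.each do |seg|
    case seg
    when "", "."
      next
    when ".."
      out.pop # popping an empty array is a no-op, so ".." cannot escape the root
    else
      out << seg
    end
  end
  out
end

p path_segments("/a/./b/../c//d/") # => ["a", "c", "d"]
p path_segments("/../x")           # => ["x"]
p path_segments("/")               # => []
```

Since parse_authority_url rebuilds the path as "/" followed by the joined segments, a trailing slash in the input does not survive normalization: "/a/" and "/a" both yield ["a"].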

.split_path_query_fragment(rest) ⇒ Object



# File 'lib/iriq/parser.rb', line 92

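def split_path_query_fragment(rest)
  path     = rest
  query    = nil
  fragment = nil

  # Cut the fragment first: everything after the first "#", including any later "?".
  if (idx = path.index("#"))
    fragment = path[(idx + 1)..]
    path     = path[0...idx]
  end

  # Then cut the query from what remains, at the first "?".
  if (idx = path.index("?"))
    query = path[(idx + 1)..]
    path  = path[0...idx]
  end

  [path, query, fragment]
end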
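
The order of the two cuts matters: the fragment is removed before the query, so a "?" inside the fragment stays in the fragment. A standalone copy of the method shows this, using sample inputs that are assumptions for illustration.

```ruby
# Standalone copy of the method body from the listing above.
def split_path_query_fragment(rest)
  path     = rest
  query    = nil
  fragment = nil

  if (idx = path.index("#"))
    fragment = path[(idx + 1)..]
    path     = path[0...idx]
  end

  if (idx = path.index("?"))
    query = path[(idx + 1)..]
    path  = path[0...idx]
  end

  [path, query, fragment]
end

p split_path_query_fragment("/p?q=1#frag?x") # => ["/p", "q=1", "frag?x"]
p split_path_query_fragment("#top")          # => ["", nil, "top"]
p split_path_query_fragment("/p")            # => ["/p", nil, nil]
```

A missing component comes back as nil, while a present but empty one (as in "/p?") comes back as an empty string, so callers can tell the two apart.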