Class: Iriq::Extractor

Inherits:
Object
Defined in:
lib/iriq/extractor.rb

Overview

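Pulls IRIs out of free text. Matching is scheme-anchored: a URL whose scheme appears explicitly (see SCHEMES), or a urn: identifier, is always a candidate. Scheme-less hosts are in general too noisy to disambiguate from prose, so they are considered only when scheme_less is on (the constructor default), and then only with an allow-listed TLD followed by a /path: “foo.com/x” qualifies, plain “foo.com” does not.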

Iriq::Extractor.new.extract("Visit https://foo.com today.")
# => [#<Iriq::Identifier https://foo.com>]

Design draws on twitter-text and GFM autolink rules: scheme anchoring, iterative trailing-punct trim, balanced-paren preservation.

Constant Summary

SCHEMES =
%w[https http ftp wss ws].freeze
SCHEMELESS_TLDS =

Conservative TLD allow-list for scheme-less extraction. Limited to a small set of very common TLDs to keep the false-positive rate low. A scheme-less candidate ALSO requires a `/path` to count, so plain `foo.com` in prose still won’t match; only `foo.com/something` does.

%w[com org net io ai dev co app gov edu].freeze
BOUNDARY =

Boundary chars — a URL ends at any of these (whitespace, angle brackets, quotes, backtick).

%r{[\s<>"'`]}.freeze
NON_ASCII_BOUNDARY =

Non-ASCII Unicode brackets and quotation marks that almost always terminate a URL in source text (e.g. `「URL」`). ASCII brackets are NOT listed here; those stay inside the URL match so the balanced-paren trim step can handle them (Wikipedia URLs like /Foo_(bar) survive).

(
  "」』）】〉》〕〗〙〛⦆｝］＞" +  # CJK closing brackets
  "「『（【〈《〔〖〘〚⦅｛［＜" +  # CJK opening brackets
  "“”‘’„‟‚«»‹›"                 # Unicode quotation marks
).chars.uniq.join.freeze
URL_CHAR_CLASS =
%{[^\\s<>"'`,#{NON_ASCII_BOUNDARY}]+}.freeze
CANDIDATE_RE =
%r{
  (?<![\w/])                                                    # not mid-word, not mid-path
  (?:
    (?i:#{SCHEMES.join("|")})://#{URL_CHAR_CLASS}               # absolute URL
    |
    urn:[a-zA-Z0-9][a-zA-Z0-9\-]{0,30}:#{URL_CHAR_CLASS}        # urn:NID:NSS
  )
}xu.freeze
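
As a rough illustration of how the scheme-anchored pattern behaves, the sketch below rebuilds it outside the library. The DEMO_ constants and scan_explicit are hypothetical names, and the character class is simplified by leaving out NON_ASCII_BOUNDARY, so this is an approximation under those assumptions rather than the library's own scanner.

```ruby
# Stand-alone sketch; names are illustrative and not part of Iriq.
DEMO_SCHEMES = %w[https http ftp wss ws].freeze

# Simplified URL character class: the non-ASCII boundary set is omitted.
DEMO_URL_CHARS = %{[^\\s<>"'`,]+}.freeze

DEMO_CANDIDATE_RE = %r{
  (?<![\w/])                                             # not mid-word, not mid-path
  (?:
    (?i:#{DEMO_SCHEMES.join("|")})://#{DEMO_URL_CHARS}   # absolute URL
    |
    urn:[a-zA-Z0-9][a-zA-Z0-9\-]{0,30}:#{DEMO_URL_CHARS} # urn:NID:NSS
  )
}xu.freeze

# Returns the raw candidate strings, before any trailing-punctuation trim.
def scan_explicit(text)
  text.scan(DEMO_CANDIDATE_RE)
end

scan_explicit("Visit https://foo.com today.") # => ["https://foo.com"]
scan_explicit("see xhttp://a.b")              # => [] (lookbehind rejects mid-word starts)
```

A candidate can still carry sentence punctuation at this stage (for example "https://foo.com." at the end of a sentence), which is why a separate trim step follows the scan.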
SCHEMELESS_ALT =

Scheme-less alternative: same chars allowed as the absolute URL, but requires a host with an allow-listed TLD AND a `/path` to keep prose noise low. The host part allows ASCII labels separated by dots; no Unicode hosts (those are too easily confused with prose).

%{(?:[a-zA-Z0-9](?:[a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+(?i:#{SCHEMELESS_TLDS.join("|")})/#{URL_CHAR_CLASS}}.freeze
COMBINED_RE =

Single-scan combined pattern used when scheme_less is on. One regex over the text is meaningfully cheaper than two.

%r{
  (?<![\w/.@])
  (?:
    (?i:#{SCHEMES.join("|")})://#{URL_CHAR_CLASS}
    |
    urn:[a-zA-Z0-9][a-zA-Z0-9\-]{0,30}:#{URL_CHAR_CLASS}
    |
    #{SCHEMELESS_ALT}
  )
}xu.freeze
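
To make the effect of the wider lookbehind and the `/path` requirement concrete, here is a minimal sketch that combines the explicit-scheme and scheme-less alternatives in one pattern. The COMBO_ constants and scan_combined are hypothetical, the urn: branch is left out for brevity, and the character class again omits NON_ASCII_BOUNDARY.

```ruby
# Stand-alone sketch; names are illustrative and not part of Iriq.
COMBO_TLDS = %w[com org net io ai dev co app gov edu].freeze
COMBO_URL_CHARS = %{[^\\s<>"'`,]+}.freeze

# ASCII host labels, an allow-listed TLD, then a mandatory "/" and path chars.
COMBO_SCHEMELESS = %{(?:[a-zA-Z0-9](?:[a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+(?i:#{COMBO_TLDS.join("|")})/#{COMBO_URL_CHARS}}.freeze

COMBO_RE = %r{
  (?<![\w/.@])                                  # also blocks starts after "." and "@"
  (?:
    (?i:https|http|ftp|wss|ws)://#{COMBO_URL_CHARS}
    |
    #{COMBO_SCHEMELESS}
  )
}xu.freeze

# One pass over the text yields both kinds of candidate, in order.
def scan_combined(text)
  text.scan(COMBO_RE)
end

scan_combined("Docs at example.org/guide and https://foo.com/a ok")
# => ["example.org/guide", "https://foo.com/a"]
scan_combined("foo.com is nice")   # => [] (no /path)
scan_combined("mail me@foo.com/x") # => [] ("@" before the host blocks the match)
```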
TRAILING_PUNCT_RE =

Punctuation that’s almost always sentence punctuation rather than part of a URL when it appears at the trailing edge.

/[.,;:!?'"‘’“”]+\z/u.freeze
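
A small sketch of what stripping with this pattern looks like in isolation. strip_trailing_punct is a hypothetical helper written for this example, not a method of the class; the regex is copied from the constant above.

```ruby
# Stand-alone sketch; the helper name is illustrative and not part of Iriq.
PUNCT_RE = /[.,;:!?'"‘’“”]+\z/u.freeze

# Removes one run of trailing sentence punctuation, if present.
def strip_trailing_punct(candidate)
  candidate.sub(PUNCT_RE, "")
end

strip_trailing_punct("https://foo.com.")    # => "https://foo.com"
strip_trailing_punct("https://foo.com/a?!") # => "https://foo.com/a"
strip_trailing_punct("https://foo.com/a.b") # => "https://foo.com/a.b"
```

Because the pattern is anchored with `\z`, punctuation inside the URL is left alone; only the trailing run is affected.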
BRACKET_PAIRS =

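Maps each closing bracket to its opening counterpart. A trailing closing bracket with no matching opener in the URL is treated as unmatched and should be trimmed.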

{ ")" => "(", "]" => "[", "}" => "{" }.freeze
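
The trim method itself is not shown on this page, so the following is only a sketch of how an iterative trim with balanced-bracket preservation could work, assuming a simple count-based balance check. demo_trim and the TRIM_ constants are hypothetical and may differ from the actual implementation.

```ruby
# Stand-alone sketch; not the library's trim method.
TRIM_BRACKET_PAIRS = { ")" => "(", "]" => "[", "}" => "{" }.freeze
TRIM_PUNCT_RE = /[.,;:!?'"‘’“”]+\z/u.freeze

def demo_trim(candidate)
  url = candidate
  loop do
    before = url.length
    # Step 1: drop a trailing run of sentence punctuation.
    url = url.sub(TRIM_PUNCT_RE, "")
    # Step 2: drop one trailing closer only if it outnumbers its opener.
    opener = TRIM_BRACKET_PAIRS[url[-1]]
    url = url[0...-1] if opener && url.count(url[-1]) > url.count(opener)
    # Repeat until a full pass changes nothing.
    break if url.length == before
  end
  url
end

demo_trim("https://en.wikipedia.org/wiki/Foo_(bar)") # unchanged, parens balance
demo_trim("https://foo.com/a).")                     # => "https://foo.com/a"
demo_trim("https://foo.com/a.)")                     # => "https://foo.com/a"
```

The loop matters for inputs like the last one, where removing the bracket exposes punctuation that a single pass would have missed.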

Instance Method Summary

Constructor Details

#initialize(scheme_less: true) ⇒ Extractor

Returns a new instance of Extractor.



# File 'lib/iriq/extractor.rb', line 71

def initialize(scheme_less: true)
  @scheme_less = scheme_less
end

Instance Method Details

#extract(text) ⇒ Object

Scans text for candidates, trims each one, and returns the parsed identifiers in order of appearance. Returns an empty array for nil or empty input; candidates that are empty after trimming or that raise ParseError are skipped.



# File 'lib/iriq/extractor.rb', line 75

def extract(text)
  return [] if text.nil? || text.empty?

  candidates = scan_candidates(text)
  candidates.filter_map do |candidate|
    trimmed = trim(candidate)
    next nil if trimmed.empty?

    begin
      Parser.parse(trimmed)
    rescue ParseError
      nil
    end
  end
end
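
The filter_map-with-rescue shape used in extract can be tried in isolation. The sketch below substitutes Ruby's standard URI.parse and URI::InvalidURIError for Parser.parse and ParseError, purely as stand-ins; demo_parse_all is a hypothetical name and the result objects are URI instances, not Iriq identifiers.

```ruby
require "uri"

# Stand-alone sketch; URI.parse stands in for Iriq's Parser.parse.
def demo_parse_all(candidates)
  candidates.filter_map do |candidate|
    # Returning nil from the block drops the element from the result.
    next if candidate.empty?

    begin
      URI.parse(candidate)
    rescue URI::InvalidURIError
      nil # unparseable candidates are skipped, not raised
    end
  end
end

parsed = demo_parse_all(["https://foo.com", "", "http://bad host"])
parsed.map(&:host) # => ["foo.com"]
```

A single bad candidate therefore never aborts the whole extraction; it simply does not appear in the output.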

#extract_strings(text) ⇒ Object

Same as extract but returns only canonical strings, deduplicated, preserving first-seen order.



# File 'lib/iriq/extractor.rb', line 93

def extract_strings(text)
  seen = {}
  extract(text).each { |iri| seen[iri.canonical] ||= true }
  seen.keys
end
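
The deduplication above relies on Ruby hashes preserving insertion order. A minimal sketch of the same idea, using a hypothetical DemoIri struct in place of the real identifier objects:

```ruby
# Stand-alone sketch; DemoIri is an illustrative stand-in with only #canonical.
DemoIri = Struct.new(:canonical)

def demo_unique_canonicals(iris)
  seen = {}
  # The first occurrence inserts the key; later duplicates leave it in place.
  iris.each { |iri| seen[iri.canonical] ||= true }
  seen.keys
end

iris = [
  DemoIri.new("https://foo.com/"),
  DemoIri.new("https://bar.org/"),
  DemoIri.new("https://foo.com/")
]
demo_unique_canonicals(iris) # => ["https://foo.com/", "https://bar.org/"]
```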