Class: Iriq::Extractor

Inherits: Object

Defined in: lib/iriq/extractor.rb

Overview

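Pulls IRIs out of free text. Extraction is scheme-anchored: URLs whose scheme appears explicitly are always extracted. Bare hosts like “foo.com” are too noisy to disambiguate from prose, so scheme-less matching (on by default, see the scheme_less option) is limited to hosts with an allow-listed TLD followed by a /path (see SCHEMELESS_TLDS).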

Iriq::Extractor.new.extract("Visit https://foo.com today.")
# => [#<Iriq::Identifier https://foo.com>]

Design draws on twitter-text and GFM autolink rules: scheme anchoring, iterative trailing-punctuation trimming, and balanced-parenthesis preservation.

Constant Summary

SCHEMES =

%w[https http ftp wss ws].freeze

SCHEMELESS_TLDS = Conservative TLD allow-list for scheme-less extraction, limited to a small set of very common TLDs to keep the false-positive rate low. A scheme-less candidate also requires a `/path` to count, so plain `foo.com` in prose still won’t match; only `foo.com/something` does.

%w[com org net io ai dev co app gov edu].freeze

BOUNDARY = Boundary chars — a URL ends at any of these (whitespace, angle brackets, quotes, backtick).

%r{[\s<>"'`]}.freeze

NON_ASCII_BOUNDARY = Non-ASCII Unicode brackets and quotation marks that almost always terminate a URL in source text (e.g. `「URL」`). ASCII brackets are NOT listed here: they stay inside the URL match so the balanced-paren trim step can handle them (Wikipedia URLs like /Foo_(bar) survive).

(
  "」』）】〉》〕〗〙〛｠｝］＞" +  # CJK closing brackets
  "「『（【〈《〔〖〘〚｟｛［＜" +  # CJK opening brackets
  "“”‘’„‟‚«»‹›"                 # Unicode quotation marks
).chars.uniq.join.freeze

URL_CHAR_CLASS =

%{[^\\s<>"'`,#{NON_ASCII_BOUNDARY}]+}.freeze
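The effect of these two constants can be checked in isolation. The snippet below is a minimal sketch: it copies NON_ASCII_BOUNDARY and URL_CHAR_CLASS from the listing and wraps the character class in a hypothetical DEMO_RE (not part of the library) that matches only `https://` URLs.

```ruby
# Constants copied from the listing so the snippet runs standalone.
NON_ASCII_BOUNDARY = (
  "」』）】〉》〕〗〙〛｠｝］＞" +  # CJK closing brackets
  "「『（【〈《〔〖〘〚｟｛［＜" +  # CJK opening brackets
  "“”‘’„‟‚«»‹›"                 # Unicode quotation marks
).chars.uniq.join.freeze

URL_CHAR_CLASS = %{[^\\s<>"'`,#{NON_ASCII_BOUNDARY}]+}.freeze

# Hypothetical demo pattern (an assumption, not a library constant).
DEMO_RE = %r{https://#{URL_CHAR_CLASS}}u

# A CJK closing bracket ends the match.
cjk_match = "詳細は「https://example.com/docs」を参照"[DEMO_RE]

# ASCII parentheses stay inside the match for the later trim step.
paren_match = "see https://en.wikipedia.org/wiki/Foo_(bar) now"[DEMO_RE]

# A comma also ends the match, so comma-separated URLs do not fuse.
comma_match = "https://a.example/x,https://b.example/y"[DEMO_RE]

puts cjk_match, paren_match, comma_match
```

The `.chars.uniq` step presumably guards against a character listed twice, which would otherwise trigger a duplicated-range warning when the class is compiled.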

CANDIDATE_RE =

%r{
  (?<![\w/])                                                    # not mid-word, not mid-path
  (?:
    (?i:#{SCHEMES.join("|")})://#{URL_CHAR_CLASS}               # absolute URL
    |
    urn:[a-zA-Z0-9][a-zA-Z0-9\-]{0,30}:#{URL_CHAR_CLASS}        # urn:NID:NSS
  )
}xu.freeze
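A rough illustration of what this pattern yields with `String#scan` follows. It rebuilds the regex with a hypothetical ASCII_URL_CHARS, an ASCII-only simplification of URL_CHAR_CLASS assumed here for brevity. The candidates are raw, so trailing sentence punctuation is still attached at this stage.

```ruby
SCHEMES = %w[https http ftp wss ws].freeze

# Hypothetical ASCII-only stand-in for URL_CHAR_CLASS (assumption).
ASCII_URL_CHARS = %{[^\\s<>"'`,]+}.freeze

CANDIDATE_RE = %r{
  (?<![\w/])                                             # not mid-word, not mid-path
  (?:
    (?i:#{SCHEMES.join("|")})://#{ASCII_URL_CHARS}       # absolute URL
    |
    urn:[a-zA-Z0-9][a-zA-Z0-9\-]{0,30}:#{ASCII_URL_CHARS} # urn:NID:NSS
  )
}xu.freeze

text = "Docs: HTTPS://Example.com/a, mirror at ftp://files.example.org/pub " \
       "and urn:isbn:0451450523. Not xhttp://mid.word"

candidates = text.scan(CANDIDATE_RE)
puts candidates
# The scheme is matched case-insensitively, "xhttp://..." is rejected by the
# lookbehind, and the URN keeps its trailing "." until the trim step runs.
```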

SCHEMELESS_ALT = Scheme-less alternative. It allows the same characters as the absolute URL but requires a host with an allow-listed TLD AND a `/path` to keep prose noise low. The host part allows ASCII labels separated by dots; no Unicode hosts (those are too easily confused with prose).

%{(?:[a-zA-Z0-9](?:[a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+(?i:#{SCHEMELESS_TLDS.join("|")})/#{URL_CHAR_CLASS}}.freeze

COMBINED_RE = Single-scan combined pattern used when scheme_less is on. One regex over the text is meaningfully cheaper than two.

%r{
  (?<![\w/.@])
  (?:
    (?i:#{SCHEMES.join("|")})://#{URL_CHAR_CLASS}
    |
    urn:[a-zA-Z0-9][a-zA-Z0-9\-]{0,30}:#{URL_CHAR_CLASS}
    |
    #{SCHEMELESS_ALT}
  )
}xu.freeze
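The extra `.` and `@` in the lookbehind matter once scheme-less hosts are allowed. The sketch below rebuilds the combined pattern with a hypothetical ASCII_URL_CHARS (an ASCII-only simplification of URL_CHAR_CLASS, assumed for brevity) and runs it over a sentence containing a bare host, a host with a path, and an email-like token.

```ruby
SCHEMES         = %w[https http ftp wss ws].freeze
SCHEMELESS_TLDS = %w[com org net io ai dev co app gov edu].freeze

# Hypothetical ASCII-only stand-in for URL_CHAR_CLASS (assumption).
ASCII_URL_CHARS = %{[^\\s<>"'`,]+}.freeze

SCHEMELESS_ALT = %{(?:[a-zA-Z0-9](?:[a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+(?i:#{SCHEMELESS_TLDS.join("|")})/#{ASCII_URL_CHARS}}.freeze

COMBINED_RE = %r{
  (?<![\w/.@])
  (?:
    (?i:#{SCHEMES.join("|")})://#{ASCII_URL_CHARS}
    |
    urn:[a-zA-Z0-9][a-zA-Z0-9\-]{0,30}:#{ASCII_URL_CHARS}
    |
    #{SCHEMELESS_ALT}
  )
}xu.freeze

text = "Try foo.com/pricing or plain foo.com, mail bob@corp.io/x, " \
       "or https://bar.dev/docs"

matches = text.scan(COMBINED_RE)
puts matches
# "foo.com" has no /path, and "corp.io/x" follows "@", so neither is matched.
```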

TRAILING_PUNCT_RE = Punctuation that’s almost always sentence punctuation rather than part of a URL when it appears at the trailing edge.

/[.,;:!?'"‘’“”]+\z/u.freeze

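BRACKET_PAIRS = Maps each closing bracket to its opening counterpart. A trailing closing bracket with no matching opener in the candidate is unmatched and should be trimmed.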

{ ")" => "(", "]" => "[", "}" => "{" }.freeze
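The library's own trim step is not shown on this page. As a minimal sketch of how TRAILING_PUNCT_RE and BRACKET_PAIRS could combine into an iterative trim with balanced-paren preservation, consider the hypothetical helper `trim_candidate` below. Counting brackets is an assumption about what "unmatched" means, not a description of the actual implementation.

```ruby
TRAILING_PUNCT_RE = /[.,;:!?'"‘’“”]+\z/u.freeze
BRACKET_PAIRS = { ")" => "(", "]" => "[", "}" => "{" }.freeze

# Hypothetical helper (not the library's method): repeat both trimming
# rules until the candidate stops changing.
def trim_candidate(candidate)
  result = candidate
  loop do
    before = result
    # Rule 1: drop a run of trailing sentence punctuation.
    result = result.sub(TRAILING_PUNCT_RE, "")
    # Rule 2: drop one trailing closing bracket when closers outnumber openers.
    closer = result[-1]
    opener = BRACKET_PAIRS[closer]
    result = result[0...-1] if opener && result.count(closer) > result.count(opener)
    break if result == before
  end
  result
end

puts trim_candidate("https://foo.com.")
puts trim_candidate("https://en.wikipedia.org/wiki/Foo_(bar)).")
puts trim_candidate("https://foo.com/a)!?")
```

The loop is what makes the trim iterative: removing a period can expose an unmatched bracket, and removing that bracket can expose more punctuation.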

Instance Method Summary

#extract(text) ⇒ Object
#extract_strings(text) ⇒ Object

Same as extract but returns only canonical strings, deduplicated, preserving first-seen order.
#initialize(scheme_less: true) ⇒ Extractor constructor

A new instance of Extractor.

Constructor Details

#initialize(scheme_less: true) ⇒ `Extractor`

Returns a new instance of Extractor.



# File 'lib/iriq/extractor.rb', line 71

def initialize(scheme_less: true)
  @scheme_less = scheme_less
end

Instance Method Details

#extract(text) ⇒ `Object`

# File 'lib/iriq/extractor.rb', line 75

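def extract(text)
  return [] if text.nil? || text.empty?

  candidates = scan_candidates(text)
  candidates.filter_map do |candidate|
    # Trim the raw match (see the Overview for the trimming rules).
    trimmed = trim(candidate)
    next nil if trimmed.empty?

    # A candidate that fails to parse yields nil, which filter_map drops.
    begin
      Parser.parse(trimmed)
    rescue ParseError
      nil
    end
  end
end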
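The `filter_map` plus `rescue` shape above silently drops anything that does not parse. The sketch below reproduces that shape with stand-ins from the Ruby standard library: `URI.parse` substitutes for Parser.parse and `URI::InvalidURIError` for ParseError. Both substitutions and the helper name `parse_all` are assumptions made only so the example runs without the gem.

```ruby
require "uri"

# Hypothetical helper mirroring the drop-on-error pattern.
def parse_all(candidates)
  candidates.filter_map do |candidate|
    next if candidate.empty?        # nil result, dropped by filter_map

    begin
      URI.parse(candidate)          # stand-in for Parser.parse
    rescue URI::InvalidURIError     # stand-in for ParseError
      nil                           # dropped by filter_map
    end
  end
end

parsed = parse_all(["https://foo.com", "", "http://bad host"])
puts parsed.map(&:to_s)
```

Only the first input survives: the empty string is skipped and the one containing a space raises inside `URI.parse`.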

#extract_strings(text) ⇒ `Object`

Same as extract but returns only canonical strings, deduplicated, preserving first-seen order.

# File 'lib/iriq/extractor.rb', line 93

def extract_strings(text)
  seen = {}
  extract(text).each { |iri| seen[iri.canonical] ||= true }
  seen.keys
end
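The hash in this method works as an ordered set, since Ruby hashes keep insertion order. A minimal sketch of the same idea follows, using a hypothetical `FakeIri` struct in place of Iriq::Identifier (only `#canonical` is modelled) and a hypothetical helper name.

```ruby
# Hypothetical stand-in for Iriq::Identifier; only #canonical is modelled.
FakeIri = Struct.new(:canonical)

# Same technique as extract_strings: a hash used as an ordered set.
def first_seen_canonicals(iris)
  seen = {}
  iris.each { |iri| seen[iri.canonical] ||= true }
  seen.keys
end

iris = %w[https://b.example/ https://a.example/ https://b.example/]
       .map { |s| FakeIri.new(s) }

result = first_seen_canonicals(iris)
puts result
```

For this input the result equals `iris.map(&:canonical).uniq`, which is a shorter way to get first-seen ordering.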