Class: Iriq::Extractor
- Inherits: Object
  - Object
    - Iriq::Extractor
- Defined in: lib/iriq/extractor.rb
Overview
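Pulls IRIs out of free text. Matching is scheme-anchored: a URL whose scheme appears explicitly (one of SCHEMES, or urn:) is always a candidate. Scheme-less hosts like “foo.com/x” are noisy to disambiguate from prose, so they are matched only when the scheme_less option is on (the constructor default), and then only when the host ends in an allow-listed TLD followed by a /path.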
Iriq::Extractor.new.extract("Visit https://foo.com today.")
# => [#<Iriq::Identifier https://foo.com>]
Design draws on twitter-text and GFM autolink rules: scheme anchoring, iterative trailing-punct trim, balanced-paren preservation.
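The scheme-anchoring idea can be sketched in isolation. The snippet below is a simplified stand-in, not the library's pattern: the names SKETCH_SCHEMES, SKETCH_CANDIDATE_RE and sketch_scan are hypothetical, and the character class covers only the ASCII boundary characters.

```ruby
# Simplified stand-in for a scheme-anchored scan (hypothetical names,
# not the library's CANDIDATE_RE).
SKETCH_SCHEMES = %w[https http ftp wss ws].freeze

SKETCH_CANDIDATE_RE = %r{
  (?<![\w/])                           # not mid-word, not mid-path
  (?i:#{SKETCH_SCHEMES.join("|")})://  # an explicit scheme is required
  [^\s<>"'`,]+                         # run until an ASCII boundary char
}x

# Returns every scheme-anchored candidate found in the text, in order.
def sketch_scan(text)
  text.scan(SKETCH_CANDIDATE_RE)
end

sketch_scan("Visit https://foo.com today, or see foo.com/x.")
# => ["https://foo.com"]
```

Because the scheme is mandatory in this sketch, the bare "foo.com/x" is skipped, and the lookbehind rejects a scheme that starts in the middle of a word.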
Constant Summary
- SCHEMES =
%w[https http ftp wss ws].freeze
- SCHEMELESS_TLDS =
Conservative TLD allow-list for scheme-less extraction. Limited to a small set of very common TLDs to keep the false-positive rate low. A scheme-less candidate ALSO requires a `/path` to count, so plain `foo.com` in prose still won't match; only `foo.com/something` does.
%w[com org net io ai dev co app gov edu].freeze
- BOUNDARY =
Boundary chars — a URL ends at any of these (whitespace, angle brackets, quotes, backtick).
%r{[\s<>"'`]}.freeze
- NON_ASCII_BOUNDARY =
Non-ASCII Unicode brackets and quotation marks that almost always terminate a URL in source text (e.g. `「URL」`). ASCII brackets are NOT listed here: they stay inside the URL match so the balanced-paren trim step can handle them (Wikipedia URLs like /Foo_(bar) survive).
  (
    "」』）】〉》〕〗〙〛⦆｝］＞" + # CJK closing brackets
    "「『（【〈《〔〖〘〚⦅｛［＜" + # CJK opening brackets
    "“”‘’„‟‚«»‹›"                 # Unicode quotation marks
  ).chars.uniq.join.freeze
- URL_CHAR_CLASS =
%{[^\\s<>"'`,#{NON_ASCII_BOUNDARY}]+}.freeze
- CANDIDATE_RE =
  %r{
    (?<![\w/])                                                # not mid-word, not mid-path
    (?:
      (?i:#{SCHEMES.join("|")})://#{URL_CHAR_CLASS}           # absolute URL
      |
      urn:[a-zA-Z0-9][a-zA-Z0-9\-]{0,30}:#{URL_CHAR_CLASS}    # urn:NID:NSS
    )
  }xu.freeze
- SCHEMELESS_ALT =
Scheme-less alternative: same chars allowed as the absolute URL, but requires a host with an allow-listed TLD AND a `/path` to keep prose noise low. The host part allows ASCII labels separated by dots; no Unicode hosts (those are too easily confused with prose).
%{(?:[a-zA-Z0-9](?:[a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+(?i:#{SCHEMELESS_TLDS.join("|")})/#{URL_CHAR_CLASS}}.freeze
- COMBINED_RE =
Single-scan combined pattern used when scheme_less is on. One regex over the text is meaningfully cheaper than two.
  %r{
    (?<![\w/.@])
    (?:
      (?i:#{SCHEMES.join("|")})://#{URL_CHAR_CLASS}
      |
      urn:[a-zA-Z0-9][a-zA-Z0-9\-]{0,30}:#{URL_CHAR_CLASS}
      |
      #{SCHEMELESS_ALT}
    )
  }xu.freeze
- TRAILING_PUNCT_RE =
Punctuation that’s almost always sentence punctuation rather than part of a URL when it appears at the trailing edge.
/[.,;:!?'"‘’“”]+\z/u.freeze
- BRACKET_PAIRS =
Unmatched closing brackets that should be trimmed.
{ ")" => "(", "]" => "[", "}" => "{" }.freeze
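The trim step itself is private and not shown on this page, so the following is only a sketch of how TRAILING_PUNCT_RE and BRACKET_PAIRS might combine into an iterative trim. The helper name sketch_trim and the SKETCH_ constants are hypothetical; the constant values are copied from the summary above.

```ruby
# Hypothetical trim helper; values mirror TRAILING_PUNCT_RE and BRACKET_PAIRS.
SKETCH_TRAILING_PUNCT_RE = /[.,;:!?'"‘’“”]+\z/u
SKETCH_BRACKET_PAIRS = { ")" => "(", "]" => "[", "}" => "{" }.freeze

def sketch_trim(candidate)
  s = candidate.dup
  loop do
    before = s.length
    # Strip a trailing run of sentence punctuation.
    s = s.sub(SKETCH_TRAILING_PUNCT_RE, "")
    opener = SKETCH_BRACKET_PAIRS[s[-1]]
    # Drop a trailing closer only when it outnumbers its opener,
    # i.e. it has no matching opener inside the candidate.
    s = s[0...-1] if opener && s.count(s[-1]) > s.count(opener)
    # Repeat until a full pass changes nothing.
    break if s.length == before
  end
  s
end

sketch_trim("https://en.wikipedia.org/wiki/Foo_(bar).")
# => "https://en.wikipedia.org/wiki/Foo_(bar)"
sketch_trim("https://foo.com/a.),")
# => "https://foo.com/a"
```

The loop matters for inputs like the second one: removing the comma exposes an unmatched `)`, and removing that exposes another period, so a single pass would leave debris behind.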
Instance Method Summary
- #extract(text) ⇒ Object
- #extract_strings(text) ⇒ Object
  Same as extract but returns only canonical strings, deduplicated, preserving first-seen order.
- #initialize(scheme_less: true) ⇒ Extractor (constructor)
  A new instance of Extractor.
Constructor Details
#initialize(scheme_less: true) ⇒ Extractor
Returns a new instance of Extractor.
# File 'lib/iriq/extractor.rb', line 71

def initialize(scheme_less: true)
  @scheme_less = scheme_less
end
Instance Method Details
#extract(text) ⇒ Object
# File 'lib/iriq/extractor.rb', line 75

def extract(text)
  return [] if text.nil? || text.empty?

  candidates = scan_candidates(text)
  candidates.filter_map do |candidate|
    trimmed = trim(candidate)
    next nil if trimmed.empty?

    begin
      Parser.parse(trimmed)
    rescue ParseError
      nil
    end
  end
end
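The shape of this method (trim, skip empties, parse, swallow parse failures) can be tried without the library. In the sketch below, the standard library's URI.parse stands in for Parser.parse and URI::InvalidURIError stands in for ParseError; sketch_extract and its one-line trim are hypothetical simplifications. Enumerable#filter_map requires Ruby 2.7 or later.

```ruby
require "uri"

# Stand-in pipeline: URI.parse plays the role of Parser.parse, and
# URI::InvalidURIError plays the role of ParseError.
def sketch_extract(candidates)
  candidates.filter_map do |candidate|
    # Simplified trim: strip trailing sentence punctuation only.
    trimmed = candidate.sub(/[.,;:!?]+\z/, "")
    next nil if trimmed.empty?

    begin
      URI.parse(trimmed)
    rescue URI::InvalidURIError
      nil # filter_map drops nil, so unparseable candidates vanish
    end
  end
end

sketch_extract(["https://foo.com.", "...", "http://bad host/"]).map(&:to_s)
# => ["https://foo.com"]
```

Returning nil from both the empty-string guard and the rescue branch lets filter_map do the filtering, so no separate compact pass is needed.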
#extract_strings(text) ⇒ Object
Same as extract but returns only canonical strings, deduplicated, preserving first-seen order.
# File 'lib/iriq/extractor.rb', line 93

def extract_strings(text)
  seen = {}
  extract(text).each { |iri| seen[iri.canonical] ||= true }
  seen.keys
end
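The deduplication relies on Ruby hashes preserving insertion order. A minimal standalone sketch, using a hypothetical SketchIri struct in place of Iriq::Identifier:

```ruby
# Hypothetical stand-in for an identifier object exposing #canonical.
SketchIri = Struct.new(:canonical)

def sketch_unique_canonicals(iris)
  seen = {}
  # Hash keys keep insertion order, so the first occurrence wins.
  iris.each { |iri| seen[iri.canonical] ||= true }
  seen.keys
end

iris = %w[https://b.example/ https://a.example/ https://b.example/]
         .map { |s| SketchIri.new(s) }
sketch_unique_canonicals(iris)
# => ["https://b.example/", "https://a.example/"]
```

For plain arrays this gives the same result as `iris.map(&:canonical).uniq`, which also keeps first-seen order.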