Class: Uts58::Extractor
- Inherits: Object
- Defined in: lib/uts58/extractor.rb
Overview
Finds links in arbitrary text per UTS58. The public API mirrors Twitter::TwitterText::Extractor closely enough that twitter-text consumers (notably Mastodon) can easily swap one for the other.
Instances carry only optional configuration (see #max_length=); if you don’t need to set anything, the module-level shortcuts are simpler.
Note that this may often find overlapping link candidates, e.g. “contact example@example.com for details” may find a mailto link and also a link to https://example.com. You’ll almost certainly want to call #remove_overlapping_entities after extracting the kinds of entities you want and merging the lists.
(Bluesky handles vs. web sites are another example of common overlap, Fediverse vs. email a third, Tibetan domains vs. themselves a less common fourth, the list goes on.)
Constant Summary
- PATH_CLOSERS =
[35, 47, 63]
- QUERY_CLOSERS =
[35]
- FRAGMENT_CLOSERS =
[]
- QUERY_SEPARATORS =
= and & begin a new query part
[0x3d, 0x26]
- DIRECTIVE_SEPARATORS =
, = and & within a :~: directive
[0x2c, 0x3d, 0x26]
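The closer and separator tables hold raw codepoints, which are easier to read once decoded. The following is a minimal sketch that copies the values from the summary above into literals, so it runs without the gem; `decode` is a hypothetical helper, not part of the library.

```ruby
# Hypothetical helper: turn a list of codepoints into the characters they stand for.
def decode(codepoints)
  codepoints.map { |cp| [cp].pack("U") }
end

# Values copied by hand from the constant summary (not read from the library).
tables = {
  "PATH_CLOSERS"         => [35, 47, 63],
  "QUERY_CLOSERS"        => [35],
  "FRAGMENT_CLOSERS"     => [],
  "QUERY_SEPARATORS"     => [0x3d, 0x26],
  "DIRECTIVE_SEPARATORS" => [0x2c, 0x3d, 0x26]
}

tables.each do |name, codepoints|
  puts "#{name}: #{decode(codepoints).inspect}"
end
# PATH_CLOSERS decodes to ["#", "/", "?"], QUERY_SEPARATORS to ["=", "&"].
```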
Instance Attribute Summary
- #max_length ⇒ Object
Maximum allowed length of the matched text, in input codepoints.
Instance Method Summary
- #extract_email_addresses(text, options = {}) ⇒ Object
- #extract_email_addresses_with_indices(text, options = {}) ⇒ Object
Returns every email address found in text as a list of hashes.
- #extract_urls(text, options = {}) ⇒ Object
Returns just the URLs found in text, as an array of strings, in the order they occur.
- #extract_urls_with_indices(text, options = {}) ⇒ Object
Returns every URL found in text as a list of hashes.
- #initialize ⇒ Extractor (constructor)
A new instance of Extractor.
- #remove_overlapping_entities(entities) ⇒ Object
Given a list of entities (hashes with an :indices key of the shape [start, end], as produced by #extract_urls_with_indices), drops every entity that overlaps an earlier one and returns the survivors.
Constructor Details
#initialize ⇒ Extractor
Returns a new instance of Extractor.
# File 'lib/uts58/extractor.rb', line 45
def initialize
  @max_length = nil
end
Instance Attribute Details
#max_length ⇒ Object
Maximum allowed length of the matched text, in input codepoints. Matches whose input span exceeds this are dropped from the result of #extract_urls_with_indices and the other extraction methods.
“Matched text” means the substring that came out of text, so its length is 11 for "example.com". The returned :url can be longer or shorter than that, most commonly longer because a missing scheme is filled in ("https://example.com" is 19 codepoints). The limit is measured against the input, not against the returned URL.
Default is nil, meaning no limit.
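To make the distinction between the input span and the returned URL concrete, here is a small sketch. The entity is written by hand in the documented shape rather than produced by the library, and `within_limit?` is a hypothetical helper that only mirrors the comparison described above.

```ruby
# Hypothetical helper: true if the entity's input span fits the limit.
# A nil limit means "no limit", matching the documented default.
def within_limit?(entity, max_length)
  return true if max_length.nil?
  start, stop = entity[:indices]
  (stop - start) <= max_length
end

text = "see example.com now"
# Hand-written entity in the documented shape (an assumption, not library output).
entity = { url: "https://example.com", indices: [4, 15] }

start, stop = entity[:indices]
puts text[start...stop]         # example.com
puts stop - start               # 11, the length the limit is compared against
puts entity[:url].length        # 19, which plays no part in the limit
puts within_limit?(entity, 11)  # true
puts within_limit?(entity, 10)  # false
```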
# File 'lib/uts58/extractor.rb', line 43
def max_length
  @max_length
end
Instance Method Details
#extract_email_addresses(text, options = {}) ⇒ Object
# File 'lib/uts58/extractor.rb', line 210
def extract_email_addresses(text, options = {})
  extract_email_addresses_with_indices(text, options).map { |r| r[:email] }
end
#extract_email_addresses_with_indices(text, options = {}) ⇒ Object
Returns every email address found in text as a list of hashes:
{ email: String, url: String, indices: [start, end] }
email is the bare address ( "info@example.com" ); url is the same thing as a mailto: URL ( "mailto:info@example.com" ), so that the result drops straight into anything that already knows how to render a :url entity. Both carry the IDN-decoded domain (A-labels become U-labels, as in #extract_urls_with_indices). indices are codepoint offsets, end exclusive; they cover a leading mailto: in the input if there was one, per UTS58 5.2.
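The way indices grow to cover a leading mailto: can be sketched in isolation. `absorb_mailto` below is a hypothetical stand-alone helper that mirrors the rule described above; it is not the library's API, and the offsets are worked out by hand for this one sample string.

```ruby
# Hypothetical helper: if "mailto:" (in any letter case) sits directly before
# the address, return the offset of its "m"; otherwise return the offset unchanged.
def absorb_mailto(text, local_start)
  prefix_len = "mailto:".length
  if local_start >= prefix_len &&
     text[(local_start - prefix_len)...local_start].downcase == "mailto:"
    local_start - prefix_len
  else
    local_start
  end
end

text = "write to mailto:info@example.com today"
address_start = text.index("info@")                   # codepoint offset of the local part
span_start = absorb_mailto(text, address_start)
span_end = address_start + "info@example.com".length  # end is exclusive

puts text[span_start...span_end]  # mailto:info@example.com
```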
#extract_urls_with_indices does not match a host that immediately follows an @ — UTS58 has no userinfo, so “info@example.com” never yields a bare example.com link. To get such an address as plain text rather than a mailto: link, extract both kinds, merge with #remove_overlapping_entities, then drop the survivors that have an :email key; that leaves the span unlinked.
Returns an empty array if text contains no addresses. options is accepted for twitter-text compatibility and currently ignored.
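The last step of the plain-text recipe above, dropping the survivors that carry an :email key, might look like the sketch below. The survivors list is hand-written in the documented shapes and stands in for the merged, de-overlapped result; `linkable` is a hypothetical helper name.

```ruby
# Hypothetical helper: keep only the entities that should be rendered as links.
# Email entities are recognisable by their :email key.
def linkable(entities)
  entities.reject { |e| e.key?(:email) }
end

text = "contact info@example.com, or example.org"
# Hand-written stand-in for the merged output after overlap removal.
survivors = [
  { email: "info@example.com", url: "mailto:info@example.com", indices: [8, 24] },
  { url: "https://example.org", indices: [29, 40] }
]

linkable(survivors).each do |e|
  start, stop = e[:indices]
  puts "#{text[start...stop]} -> #{e[:url]}"  # example.org -> https://example.org
end
```

Because the email entity takes part in the overlap removal before it is dropped, nothing else can claim its span, so the address stays unlinked.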
# File 'lib/uts58/extractor.rb', line 173
def extract_email_addresses_with_indices(text, options = {})
  result = []
  text.to_enum(:scan, /@/).map{Regexp.last_match}.each do |match|
    at_pos = match.begin(0)
    pre = text[0...at_pos]
    lp_match = /[\p{XID_Continue}.!#$%&'*+\-\/=?^_`{|}~]+\z/.match(pre)
    next unless lp_match
    local = lp_match[0]
    next if local.start_with?('.') || local.end_with?('.') || local.include?('..')

    s = match.post_match
    prefix = /^([-\p{L}\p{N}\p{M}ßς۽۾་〇]+[\.。]){1,4}[-\p{L}\p{N}\p{M}]+(?![-\p{L}\p{N}\p{M}])/.match(s)
    next unless prefix && prefix[0].length < 254
    host_raw = prefix.match(0).gsub(/。/, ".")
    next unless valid_labels?(host_raw)
    hn = SimpleIDN.to_unicode(host_raw)
    begin
      about = PublicSuffix.parse(hn, ignore_private: true, default_rule: nil)
      next unless about && about.tld != "invalid"
    rescue PublicSuffix::DomainInvalid, PublicSuffix::DomainNotAllowed
      next
    end

    local_start = at_pos - local.length
    end_pos = at_pos + 1 + prefix[0].length
    # UTS58 5.2 step 6: absorb a leading "mailto:" into the span.
    if local_start >= 7 && text[(local_start - 7)...local_start].downcase == "mailto:"
      local_start -= 7
    end
    next if @max_length && (end_pos - local_start) > @max_length

    result << {
      email: "#{local}@#{hn}",
      url: "mailto:#{local}@#{hn}",
      indices: [local_start, end_pos]
    }
  end
  result
end
#extract_urls(text, options = {}) ⇒ Object
Returns just the URLs found in text, as an array of strings, in the order they occur. Use #extract_urls_with_indices instead if you also need the offsets, e.g. for adding HTML markup or for pairing the found links with the form used in the text.
For text such as “a example.com b”, this returns ["https://example.com"].
# File 'lib/uts58/extractor.rb', line 148
def extract_urls(text, options = {})
  extract_urls_with_indices(text, options).map { |r| r[:url] }
end
#extract_urls_with_indices(text, options = {}) ⇒ Object
Returns every URL found in text as a list of hashes:
{ url: String, indices: [start, end] }
url is the cleaned-up form: any A-labels in the hostname are decoded to U-labels, and the scheme is filled in as https:// if the input had none. indices are codepoint offsets into text, with end exclusive, so text[start...end] gives the substring that matched.
Note that the span from start to end may not have the same length as url. One very common example is input like “foo example.com bar”, where the URL will be https://example.com, including the “https://” that the input lacks.
Returns an empty array if text contains no links. options is accepted for twitter-text compatibility and currently ignored.
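Because indices are codepoint offsets with an exclusive end, they can be used directly with Ruby's character-based string slicing to add HTML markup. The sketch below assumes entities in the documented shape (hand-written here, not library output) that do not overlap; `linkify` is a hypothetical helper.

```ruby
require "cgi"

# Hypothetical helper: wrap each entity's input span in an <a> tag that points
# at the cleaned-up :url, escaping everything else. Assumes no overlaps.
def linkify(text, entities)
  html = +""
  cursor = 0
  entities.sort_by { |e| e[:indices].first }.each do |e|
    start, stop = e[:indices]
    html << CGI.escapeHTML(text[cursor...start])
    html << %(<a href="#{CGI.escapeHTML(e[:url])}">#{CGI.escapeHTML(text[start...stop])}</a>)
    cursor = stop
  end
  html << CGI.escapeHTML(text[cursor..])
end

text = "foo example.com bar"
entities = [{ url: "https://example.com", indices: [4, 15] }]  # hand-written
puts linkify(text, entities)
# foo <a href="https://example.com">example.com</a> bar
```

The link text is the form used in the input, while the href is the returned url, which is the pairing the indices exist to support.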
# File 'lib/uts58/extractor.rb', line 65
def extract_urls_with_indices(text, options = {})
  result = []
  # The '@' in the negative lookbehind is the non-obvious part: UTS58
  # has no userinfo, so a host that immediately follows an '@'
  # cannot be a link to https://. An email address (or a
  # bluesky/mastodon one) is possible. The rest of the set just
  # keeps a trigger from firing inside a word or path.
  text.to_enum(:scan,/(?<![-\p{Alnum}\p{M}.\/@])(?=\p{Alnum}[-\p{L}\p{N}\p{M}\u00DF\u03C2\u06FD\u06FE\u0F0B\u3007]*[\.:。])/).map{Regexp.last_match}.each do |match|
    # get rid of a leading protocol. We also tolerate letter/mark/number
    # characters between the trigger and the scheme, so that input
    # like "テストhttp://example.com" attaches the scheme correctly:
    # the trigger fires at offset 0 (the start of "テスト") because
    # nothing precedes it, and the actual link begins three
    # codepoints later.
    s = match.post_match
    scheme_match = /^([\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}\p{Thai}\p{Lao}\p{Khmer}\p{Myanmar}]*?)(https?:\/\/)/i.match(s)
    if scheme_match
      scheme_offset = scheme_match[1].length
      proto = scheme_match[2]
      s = scheme_match.post_match
    else
      scheme_offset = 0
      proto = "https://"
    end
    # look for the prefix that might be a hostname or an IDN.
    # this is a somewhat sloppy match, with a few false positives.
    prefix = /^([-\p{L}\p{N}\p{M}\u00DF\u03C2\u06FD\u06FE\u0F0B\u3007]+[\.。]){1,4}[-\p{L}\p{N}\p{M}]+(?![-\p{L}\p{N}\p{M}])/.match(s)
    if prefix && prefix[0].length < 254
      host_raw = prefix.match(0).gsub(/。/, ".")
      next unless valid_labels?(host_raw)
      hn = SimpleIDN.to_unicode(host_raw)
      begin
        about = PublicSuffix.parse(hn, ignore_private: true, default_rule: nil)
        if about && about.tld != "invalid" then
          # at this point, we do have enough to mark something,
          # the question is how much. there may be a trailing
          # port, then a path, then a query, finally a fragment.
          rest = prefix.post_match
          # "example.com." keeps its trailing dot only when a path, query,
          # or fragment follows; at the end of a sentence it's prose (UTS58).
          if rest[0] == "." && ["/", "?", "#"].include?(rest[1])
            rest = rest[1..]
          end
          # a port number must be 1..65535
          port = /^:(\d+)/.match(rest)
          if port
            n = port[1].to_i
            next if n < 1 || n > 65535
            rest = port.post_match
          end
          # path
          rest = skip_component(rest, PATH_CLOSERS) while rest[0] == "/"
          # query
          rest = skip_component(rest, QUERY_CLOSERS, QUERY_SEPARATORS) if rest[0] == '?'
          # fragment
          rest = skip_component(rest, FRAGMENT_CLOSERS, [], DIRECTIVE_SEPARATORS) if rest[0] == "#"
          rest_length = prefix.post_match.length - rest.length
          match_length = match.post_match.length - rest.length - scheme_offset
          next if @max_length && match_length > @max_length
          start = match.begin(0) + scheme_offset
          result << {
            url: "#{proto}#{hn}#{prefix.post_match[...rest_length]}",
            indices: [start, start + match_length]
          }
        end
      rescue PublicSuffix::DomainInvalid
        # evidently we're not looking at the start of a link
      rescue PublicSuffix::DomainNotAllowed
        # ditto
      end
    end
  end
  # ah! the good feeling of going home after a hard day's work
  result
end
#remove_overlapping_entities(entities) ⇒ Object
Given a list of entities (hashes with an :indices key of the shape [start, end], as produced by #extract_urls_with_indices) drops every entity that overlaps an earlier one and returns the survivors.
Useful when merging the output of several extractors (URLs, mentions, hashtags, …), or when #extract_urls_with_indices itself finds several partly overlapping candidate URLs and you want to keep only one of them. The algorithm prefers entries that start earlier, which is not always the longest candidate; ties are broken by input order.
The input array is not modified.
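The effect of the earlier-start preference is easiest to see on a small list. The sketch below copies the documented algorithm into a free-standing method so it runs without the gem; the entities are hand-written and only their :indices matter here.

```ruby
# Free-standing copy of the documented algorithm (not a call into the library).
def remove_overlapping_entities(entities)
  sorted = entities.sort_by { |e| e[:indices].first }
  prev = nil
  sorted.reject do |e|
    if prev && prev[:indices].last > e[:indices].first
      true   # starts before the previous survivor ends: drop it
    else
      prev = e
      false
    end
  end
end

# Hand-written sample entities, deliberately out of order.
entities = [
  { url: "https://example.com/a", indices: [5, 18] },  # overlaps [0, 11], starts later
  { url: "https://example.com",   indices: [0, 11] },
  { url: "https://example.net",   indices: [11, 15] }, # touches [0, 11], no overlap
  { url: "https://example.org",   indices: [20, 31] }
]

survivors = remove_overlapping_entities(entities)
puts survivors.map { |e| e[:indices].inspect }.join(" ")  # [0, 11] [11, 15] [20, 31]
puts entities.length                                      # 4, the input is untouched
```

Since end offsets are exclusive, an entity that starts exactly where the previous one ends is kept, and the longer [5, 18] candidate loses to the earlier [0, 11].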
# File 'lib/uts58/extractor.rb', line 226
def remove_overlapping_entities(entities)
  sorted = entities.sort_by { |e| e[:indices].first }
  prev = nil
  sorted.reject do |e|
    if prev && prev[:indices].last > e[:indices].first
      true
    else
      prev = e
      false
    end
  end
end