Module: MsgExtractor::Util

Defined in: lib/msg_extractor/util.rb

Constant Summary

ENTITIES = {
  "amp" => "&", "lt" => "<", "gt" => ">", "quot" => '"',
  "apos" => "'", "nbsp" => " "
}.freeze

Class Method Summary

.decode_entities(text) ⇒ Object

Single-pass entity decoder.

.dedupe_path(path) ⇒ Object

“f.txt” -> “f (1).txt” -> “f (2).txt” until the path is free.

.html_to_text(html) ⇒ Object

Crude tag-stripping fallback used only when a message has an HTML body but no plain-text body.

.sanitize_filename(name) ⇒ Object

Class Method Details

.decode_entities(text) ⇒ `Object`

Single-pass entity decoder. Handles named entities, decimal numeric references, and hex numeric references. Hostile codepoints (out-of-range or surrogate) are replaced with the Unicode replacement character (U+FFFD) instead of raising. Avoids double-decoding: “&amp;#65;” → “&#65;”, not “A”.

# File 'lib/msg_extractor/util.rb', line 40

def decode_entities(text)
  text.gsub(/&(?:(amp|lt|gt|quot|apos|nbsp)|#(\d+)|#x(\h+));/) do
    if (name = Regexp.last_match(1))
      ENTITIES[name]
    else
      cp = Regexp.last_match(2)&.to_i || Regexp.last_match(3).to_i(16)
      cp <= 0x10FFFF && !(0xD800..0xDFFF).cover?(cp) ? cp.chr(Encoding::UTF_8) : "\u{FFFD}"
    end
  end
end
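
The behaviours described above can be checked with a small standalone sketch. The module name `EntityDemo` is hypothetical; its body is a copy of the listing and constant shown on this page so the example runs without loading the library.

```ruby
# Standalone copy of the documented decoder (EntityDemo is a hypothetical name).
module EntityDemo
  ENTITIES = {
    "amp" => "&", "lt" => "<", "gt" => ">", "quot" => '"',
    "apos" => "'", "nbsp" => " "
  }.freeze

  def self.decode_entities(text)
    text.gsub(/&(?:(amp|lt|gt|quot|apos|nbsp)|#(\d+)|#x(\h+));/) do
      if (name = Regexp.last_match(1))
        ENTITIES[name]
      else
        cp = Regexp.last_match(2)&.to_i || Regexp.last_match(3).to_i(16)
        cp <= 0x10FFFF && !(0xD800..0xDFFF).cover?(cp) ? cp.chr(Encoding::UTF_8) : "\u{FFFD}"
      end
    end
  end
end

samples = [
  "a &lt; b &amp;&amp; c", # named entities
  "&#65;",                 # decimal reference
  "&#x41;",                # hex reference
  "&amp;#65;",             # decoded once only; the result is not rescanned
  "&#xD800;",              # surrogate codepoint becomes U+FFFD
  "&#1114112;",            # 0x110000, above the Unicode range, becomes U+FFFD
  "&copy;"                 # not in ENTITIES, left untouched
]

samples.each { |s| puts "#{s.inspect} -> #{EntityDemo.decode_entities(s).inspect}" }
```

Because `gsub` scans the input once and never rescans its own output, the `&` produced from `&amp;` cannot combine with the following `#65;` into a second reference.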

.dedupe_path(path) ⇒ `Object`

Returns the first free path in the sequence “f.txt” -> “f (1).txt” -> “f (2).txt”, and so on, until the path is free.

# File 'lib/msg_extractor/util.rb', line 12

def dedupe_path(path)
  return path unless ::File.exist?(path)
  extension = ::File.extname(path)
  base = path.delete_suffix(extension)
  counter = 1
  counter += 1 while ::File.exist?("#{base} (#{counter})#{extension}")
  "#{base} (#{counter})#{extension}"
end
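
A minimal sketch of the naming sequence, run inside a throwaway directory. `DedupeDemo` and `demo_sequence` are hypothetical names introduced for illustration; the `dedupe_path` body is copied from the listing above.

```ruby
require "tmpdir"
require "fileutils"

# Standalone copy of the documented method (DedupeDemo is a hypothetical name).
module DedupeDemo
  def self.dedupe_path(path)
    return path unless ::File.exist?(path)
    extension = ::File.extname(path)
    base = path.delete_suffix(extension)
    counter = 1
    counter += 1 while ::File.exist?("#{base} (#{counter})#{extension}")
    "#{base} (#{counter})#{extension}"
  end

  # Hypothetical helper: asks for the same target three times, creating the
  # returned file each time so the next call has to pick a new name.
  def self.demo_sequence
    Dir.mktmpdir do |dir|
      target = File.join(dir, "f.txt")
      3.times.map do
        free = dedupe_path(target)
        FileUtils.touch(free)
        File.basename(free)
      end
    end
  end
end

puts DedupeDemo.demo_sequence.inspect
# → ["f.txt", "f (1).txt", "f (2).txt"]
```

Note that the method only checks whether a path exists and does not create it, so two callers racing on the same target could be handed the same name.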

.html_to_text(html) ⇒ `Object`

Crude tag-stripping fallback used only when a message has an HTML body but no plain-text body.

# File 'lib/msg_extractor/util.rb', line 23

def html_to_text(html)
  text = strip_blocks(html)
           .gsub(/<br\s*\/?>/i, "\n")
           .gsub(%r{</(p|div|tr|li|h[1-6])>}i, "\n")
           .gsub(/<[^>]+>/, "")
  decode_entities(text).gsub(/[ \t]+\n/, "\n").gsub(/\n{3,}/, "\n\n").strip
end
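
The pipeline above can be traced with a standalone sketch. `HtmlTextDemo` is a hypothetical name, and `strip_blocks` is not shown on this page: the version below, which drops `<script>` and `<style>` elements together with their contents, is an assumption made only so the example runs. The other two methods are copied from the listings.

```ruby
# Standalone sketch (HtmlTextDemo is a hypothetical name).
module HtmlTextDemo
  ENTITIES = {
    "amp" => "&", "lt" => "<", "gt" => ">", "quot" => '"',
    "apos" => "'", "nbsp" => " "
  }.freeze

  def self.decode_entities(text)
    text.gsub(/&(?:(amp|lt|gt|quot|apos|nbsp)|#(\d+)|#x(\h+));/) do
      if (name = Regexp.last_match(1))
        ENTITIES[name]
      else
        cp = Regexp.last_match(2)&.to_i || Regexp.last_match(3).to_i(16)
        cp <= 0x10FFFF && !(0xD800..0xDFFF).cover?(cp) ? cp.chr(Encoding::UTF_8) : "\u{FFFD}"
      end
    end
  end

  # ASSUMPTION: the real strip_blocks is undocumented here. This stand-in
  # removes <script> and <style> elements including their contents.
  def self.strip_blocks(html)
    html.gsub(%r{<(script|style)\b[^>]*>.*?</\1>}im, "")
  end

  def self.html_to_text(html)
    text = strip_blocks(html)
             .gsub(/<br\s*\/?>/i, "\n")                 # <br> becomes a newline
             .gsub(%r{</(p|div|tr|li|h[1-6])>}i, "\n")  # block ends become newlines
             .gsub(/<[^>]+>/, "")                       # every remaining tag is dropped
    # Entities are decoded after tag stripping, so "&lt;" is never mistaken for a tag.
    decode_entities(text).gsub(/[ \t]+\n/, "\n").gsub(/\n{3,}/, "\n\n").strip
  end
end

html = "<div><p>Hello <b>world</b> </p><p>1 &lt; 2<br>done</p><style>p{}</style></div>"
puts HtmlTextDemo.html_to_text(html)
```

With the assumed `strip_blocks`, this prints three lines: `Hello world`, `1 < 2`, and `done`.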

.sanitize_filename(name) ⇒ `Object`

# File 'lib/msg_extractor/util.rb', line 5

def sanitize_filename(name)
  cleaned = name.to_s.gsub(%r{[\x00-\x1F\\/:*?"<>|]}, "_").strip
  cleaned = "unnamed" if cleaned.empty? || cleaned.match?(/\A\.+\z/)
  cleaned
end
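
This method has no description on the page, so the following sketch illustrates what the listing does on a few inputs. `SanitizeDemo` is a hypothetical name; the method body is copied from the listing above.

```ruby
# Standalone copy of the documented method (SanitizeDemo is a hypothetical name).
module SanitizeDemo
  def self.sanitize_filename(name)
    # Control characters and \ / : * ? " < > | each become an underscore.
    cleaned = name.to_s.gsub(%r{[\x00-\x1F\\/:*?"<>|]}, "_").strip
    # Empty names and names made only of dots fall back to a placeholder.
    cleaned = "unnamed" if cleaned.empty? || cleaned.match?(/\A\.+\z/)
    cleaned
  end
end

[
  "report: Q1/Q2?.pdf", # reserved characters replaced
  "../../etc/passwd",   # separators replaced, so no directory traversal
  "  a.txt ",           # surrounding whitespace stripped
  "..",                 # dots only
  nil                   # to_s turns nil into an empty string
].each { |n| puts "#{n.inspect} -> #{SanitizeDemo.sanitize_filename(n).inspect}" }
```

The inputs above map to `"report_ Q1_Q2_.pdf"`, `".._.._etc_passwd"`, `"a.txt"`, `"unnamed"`, and `"unnamed"` respectively.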
