Module: Metaclean::Strategy

Defined in:
lib/metaclean/strategy.rb

Constant Summary

PRIVACY_GROUPS =

Tag GROUPS that almost always carry personally identifying info. Survival of any tag in these groups raises a flag to the user.

%w[GPS MakerNotes XMP-dc XMP-photoshop IPTC ICC-header].freeze
PRIVACY_TAGS =

Specific tag NAMES (regardless of group) we never want to leak. If exiftool reports e.g. “EXIF:Artist” we still flag it because of the tag-name match, not the group.

%w[
  Artist Author Creator Copyright Rights
  By-line By-lineTitle Credit Source Contact OwnerName
  CameraOwnerName SerialNumber InternalSerialNumber LensSerialNumber
  Software HostComputer ProcessingSoftware
  ImageDescription UserComment
  LastModifiedBy LastSavedBy LastAuthor
].freeze
MAT2_PREFERRED =

File extensions where mat2 is meaningfully stricter than ExifTool and should run first. For other formats, ExifTool is the broader expert.

%w[
  pdf docx xlsx pptx odt ods odp odg epub png svg
  mp4 avi mkv mov webm
].freeze

Class Method Summary

Class Method Details

.privacy_residual(meta) ⇒ Object

Looks at metadata read AFTER cleaning and returns the entries that still look privacy-relevant. The runner uses this for the “still present” warning at the end of each file.

Why both group-match and tag-match? The same kind of information shows up under different groups and tag names depending on the format (e.g. “Author” in PDF vs “Artist” in EXIF). Combining the two checks keeps coverage broad without having to enumerate every group/tag pair.



# File 'lib/metaclean/strategy.rb', line 80

def privacy_residual(meta)
  meta.reject { |k, _| k == 'SourceFile' }.select do |k, _|
    # ExifTool keys look like "GPS:GPSLatitude". Split on the first ":".
    group, tag = k.to_s.split(':', 2)
    # Skip System/File/etc. — those aren't user metadata.
    next false if Display::NON_METADATA_GROUPS.include?(group)

    if tag.nil?
      # No "Group:" prefix — the whole key is the tag name.
      PRIVACY_TAGS.include?(group.to_s)
    else
      PRIVACY_GROUPS.include?(group) || PRIVACY_TAGS.include?(tag)
    end
  end
end
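As a rough illustration of the filtering above, the following self-contained sketch reproduces the method with stand-in constants. The module name `StrategySketch`, the abbreviated `PRIVACY_TAGS` list, and the contents of `Display::NON_METADATA_GROUPS` (only `System` and `File` are mentioned in the source comment) are assumptions made for the example, not the library's actual definitions.

```ruby
# Illustrative stand-in, not the real Metaclean::Strategy module.
module StrategySketch
  PRIVACY_GROUPS = %w[GPS MakerNotes XMP-dc XMP-photoshop IPTC ICC-header].freeze
  # Abbreviated for the example; the real list is longer.
  PRIVACY_TAGS = %w[Artist Author Creator Copyright Software].freeze

  module Display
    # Assumed contents; the real constant may list more groups.
    NON_METADATA_GROUPS = %w[System File].freeze
  end

  module_function

  def privacy_residual(meta)
    meta.reject { |k, _| k == 'SourceFile' }.select do |k, _|
      group, tag = k.to_s.split(':', 2)
      next false if Display::NON_METADATA_GROUPS.include?(group)

      if tag.nil?
        # No "Group:" prefix, so the whole key is treated as the tag name.
        PRIVACY_TAGS.include?(group.to_s)
      else
        PRIVACY_GROUPS.include?(group) || PRIVACY_TAGS.include?(tag)
      end
    end
  end
end

after_cleaning = {
  'SourceFile'        => 'photo.jpg', # dropped up front
  'System:FileName'   => 'photo.jpg', # non-metadata group, skipped
  'GPS:GPSLatitude'   => '52.52 N',   # flagged by group
  'EXIF:Artist'       => 'Jane Doe',  # flagged by tag name
  'EXIF:ExposureTime' => '1/250',     # neither list matches
  'Author'            => 'Jane Doe'   # no group prefix, flagged by tag name
}

puts StrategySketch.privacy_residual(after_cleaning).keys.inspect
# => ["GPS:GPSLatitude", "EXIF:Artist", "Author"]
```

The return value is a Hash, so a caller such as the runner can report both the surviving keys and their values.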

.tools_for(path, prefer: {}) ⇒ Object

Returns an ordered list of tool symbols (e.g. `[:mat2, :exiftool, :qpdf]`) to run on `path`. The runner executes them in order; if one fails or is skipped, the next still runs.

`prefer:` is a hash of user opt-outs derived from the CLI flags (`--no-mat2`, `--exiftool-only`, etc.). The check `prefer[:tool] != false` treats both `nil` (not set) and `true` as “use it”; only an explicit `false` disables a tool.
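The opt-out check can be seen in isolation in the short sketch below. The helper `enabled?` is hypothetical and exists only to name the rule; the actual method inlines the comparison.

```ruby
# Hypothetical helper for illustration; Metaclean inlines this comparison.
# A tool stays enabled unless the user explicitly set its key to false.
def enabled?(prefer, tool)
  prefer[tool] != false
end

puts enabled?({}, :mat2)                # key absent (nil)  => true
puts enabled?({ mat2: true }, :mat2)    # explicit opt-in   => true
puts enabled?({ mat2: nil }, :mat2)     # explicit nil      => true
puts enabled?({ mat2: false }, :mat2)   # explicit opt-out  => false
```

A plain truthiness test (`if prefer[:mat2]`) would behave differently: it would treat a missing key as disabled, so every tool would have to be listed to run at all.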



# File 'lib/metaclean/strategy.rb', line 47

def tools_for(path, prefer: {})
  ext = File.extname(path).downcase.delete('.')
  tools = []

  if ext == 'pdf'
    # PDFs benefit from all three, in this order:
    #   mat2 → cleans the high-level metadata + content streams it knows
    #   exiftool → strips the Info dictionary (Author, Title, Producer)
    #   qpdf → rebuilds the file, dropping any unreferenced bits
    tools << :mat2     if prefer[:mat2]     != false && Mat2.available?
    tools << :exiftool if prefer[:exiftool] != false
    tools << :qpdf     if prefer[:qpdf]     != false && Qpdf.available?
  elsif MAT2_PREFERRED.include?(ext) && prefer[:mat2] != false && Mat2.available?
    # Office docs, modern image/video containers — mat2 leads.
    tools << :mat2
    tools << :exiftool if prefer[:exiftool] != false
  else
    # Everything else (JPEG, MP3, RAW, …) — ExifTool is the gold standard.
    tools << :exiftool if prefer[:exiftool] != false
    tools << :mat2     if prefer[:mat2]     != false && Mat2.supports?(path)
  end

  tools
end
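To make the three branches concrete, here is a minimal sketch that runs the same logic against stubbed tool probes. The module name `ToolsSketch` and the stub bodies of `Mat2.available?`, `Mat2.supports?`, and `Qpdf.available?` are assumptions for the example (both tools are pretended to be installed, and the stub `supports?` accepts only a few extensions); the real probes presumably inspect the system.

```ruby
# Illustrative stand-in, not the real Metaclean::Strategy module.
module ToolsSketch
  MAT2_PREFERRED = %w[
    pdf docx xlsx pptx odt ods odp odg epub png svg
    mp4 avi mkv mov webm
  ].freeze

  # Stub probes: pretend both external tools are installed.
  module Mat2
    def self.available?
      true
    end

    # Assumed behaviour: the stub accepts only these extensions.
    def self.supports?(path)
      %w[.jpg .jpeg .mp3].include?(File.extname(path).downcase)
    end
  end

  module Qpdf
    def self.available?
      true
    end
  end

  def self.tools_for(path, prefer: {})
    ext = File.extname(path).downcase.delete('.')
    tools = []

    if ext == 'pdf'
      tools << :mat2     if prefer[:mat2]     != false && Mat2.available?
      tools << :exiftool if prefer[:exiftool] != false
      tools << :qpdf     if prefer[:qpdf]     != false && Qpdf.available?
    elsif MAT2_PREFERRED.include?(ext) && prefer[:mat2] != false && Mat2.available?
      tools << :mat2
      tools << :exiftool if prefer[:exiftool] != false
    else
      tools << :exiftool if prefer[:exiftool] != false
      tools << :mat2     if prefer[:mat2]     != false && Mat2.supports?(path)
    end

    tools
  end
end

p ToolsSketch.tools_for('report.pdf')                            # => [:mat2, :exiftool, :qpdf]
p ToolsSketch.tools_for('report.pdf', prefer: { mat2: false })   # => [:exiftool, :qpdf]
p ToolsSketch.tools_for('slides.pptx')                           # => [:mat2, :exiftool]
p ToolsSketch.tools_for('photo.JPG')                             # => [:exiftool, :mat2]
p ToolsSketch.tools_for('notes.txt', prefer: { exiftool: false }) # => []
```

Note that a mat2-preferred extension falls through to the final branch when mat2 is disabled or unavailable, so ExifTool still gets a chance to run on it.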