Module: Metaclean::Strategy

Defined in:
lib/metaclean/strategy.rb

Constant Summary

PRIVACY_GROUP_PREFIXES =

Group-name PREFIXES treated as privacy-bearing. Matching whole families by prefix keeps the residual check fail-closed, whereas an exact allowlist would silently miss variants:

GPS*       — GPS plus any sub-group
XMP-*      — every XMP namespace (XMP-exif GPS, XMP-mwg-rs face/person
             names, XMP-xmpMM DocumentID, XMP-iptcExt, …)
MakerNotes*, IPTC*
IFD1       — the embedded thumbnail IFD; a surviving thumbnail can carry
             the original's full EXIF+GPS

Over-flagging here is deliberate: for a privacy tool a false “still present” is far cheaper than a false “Cleaned”. (ICC colour-profile groups are intentionally NOT flagged — a colour profile isn’t PII; any genuinely identifying field such as Copyright is still caught by PRIVACY_TAGS below.)

%w[GPS XMP- MakerNotes IPTC IFD1].freeze
MAT2_ESSENTIAL =

Formats ExifTool can’t WRITE: it leaves behind document-internal metadata that only mat2’s rebuild removes, and it can’t re-read that metadata to verify the result. If mat2 won’t run for one of these, the runner warns that coverage is reduced rather than reporting a confident “Cleaned”. (PDF is NOT here: ExifTool writes PDF metadata and qpdf rebuilds the file, so PDF is fully handled and verifiable without mat2.)

%w[docx xlsx pptx odt ods odp odg odf epub].freeze
PRIVACY_TAGS =

Specific tag NAMES (regardless of group) we never want to leak. If exiftool reports e.g. “EXIF:Artist” we still flag it because of the tag-name match, not the group. exiftool’s `-all=` normally strips these, so this list is a fail-closed BACKSTOP: if any survive a strip we’d rather over-warn than report a confident “Cleaned”.

%w[
  Artist Author Creator Copyright Rights
  By-line By-lineTitle Credit Source Contact OwnerName
  CameraOwnerName SerialNumber InternalSerialNumber LensSerialNumber
  Software HostComputer ProcessingSoftware
  ImageDescription UserComment
  LastModifiedBy LastSavedBy LastAuthor
  Make Model LensModel DateTimeOriginal CreateDate
  Title Subject Keywords Description Category Producer Company Manager
  CreationDate ModDate
  XPAuthor XPComment XPSubject XPKeywords XPTitle Comment
].freeze
MAT2_PREFERRED =

File extensions where mat2 is meaningfully stricter than ExifTool and should run first. For other formats, ExifTool is the broader expert. (mkv/webm are NOT here — see FFMPEG_FORMATS; no mat2/ExifTool path writes Matroska.)

%w[
  docx xlsx pptx odt ods odp odg odf epub png svg
  mp4 avi
].freeze
FFMPEG_FORMATS =

Matroska containers. ExifTool is read-only for them and mat2 has no Matroska parser, so neither can strip mkv/webm. ffmpeg is the only tool in the set that can — it remuxes the container dropping all metadata while copying every stream verbatim (lossless, no re-encode).

%w[mkv webm].freeze
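
The remux described above can be sketched as an ffmpeg argument list. This is a minimal illustration, not the command Metaclean actually runs (that invocation is not shown on this page): `ffmpeg_strip_argv` is a hypothetical helper, and the flag set is an assumption built from standard ffmpeg options.

```ruby
# Hypothetical helper (not part of Metaclean): builds an ffmpeg argv that
# remuxes a Matroska file without re-encoding any stream.
def ffmpeg_strip_argv(input, output)
  [
    'ffmpeg',
    '-i', input,
    '-map', '0',            # select every stream of the first input
    '-map_metadata', '-1',  # dummy mapping: do not copy metadata from the input
    '-map_chapters', '-1',  # do not copy chapters (their titles are metadata too)
    '-c', 'copy',           # stream copy: no decode/encode step
    output
  ]
end

puts ffmpeg_strip_argv('in.mkv', 'out.mkv').join(' ')
```

Passing the result to `system(*argv)` as separate arguments avoids shell quoting problems with unusual file names.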
MAT2_DEGRADES =

Raster formats mat2 cannot strip without DAMAGING the file: it rebuilds via Pillow, which recompresses JPEG/WebP (visible quality loss; a clean wallpaper drops ~65% in size with no metadata to remove) and downconverts TIFF (16-bit → 8-bit). ExifTool strips all of these completely and IN PLACE (pixels byte-identical), so ExifTool owns them and mat2 is skipped: cleaning metadata must never silently damage the file.

%w[jpg jpeg webp tif tiff].freeze

Class Method Summary

Class Method Details

.blank_value?(value) ⇒ Boolean

True when a value carries no information: empty, or only zeros plus date/time punctuation and the “Z” (UTC) marker — e.g. “0000:00:00 00:00:00”, or the ASF variant “0000:00:00 00:00:00Z” that mat2 writes into WMV’s mandatory date field. Only the digit 0 is stripped (never 1-9), so a real value like “59.9139”, “Jane Doe”, or a real “2024:…” date keeps other characters and is NOT blank. (GPS is exempt from this check entirely — see privacy_residual.)

Returns:

  • (Boolean)


# File 'lib/metaclean/strategy.rb', line 152

def blank_value?(value)
  s = value.to_s
  s.strip.empty? || s.gsub(/[Z0\s:.+-]/, '').empty?
end
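
A quick way to see which values count as blank is to run the method on a few samples. The sketch below copies the method body from the listing; the sample values are illustrative only.

```ruby
# Method body copied from the listing above; samples are illustrative.
def blank_value?(value)
  s = value.to_s
  s.strip.empty? || s.gsub(/[Z0\s:.+-]/, '').empty?
end

p blank_value?('')                      # => true
p blank_value?(nil)                     # => true  (nil.to_s is "")
p blank_value?('0000:00:00 00:00:00')   # => true
p blank_value?('0000:00:00 00:00:00Z')  # => true
p blank_value?('0.0')                   # => true  (only zeros and a dot)
p blank_value?('59.9139')               # => false (non-zero digits remain)
p blank_value?('Jane Doe')              # => false
p blank_value?('2024:01:02 03:04:05')   # => false
```

The `'0.0'` case shows why GPS entries have to bypass this check: a zero coordinate would otherwise be treated as carrying no information.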

.mat2_essential?(path) ⇒ Boolean

Does this path need mat2 for adequate coverage? (See MAT2_ESSENTIAL.)

Returns:

  • (Boolean)


# File 'lib/metaclean/strategy.rb', line 171

def mat2_essential?(path)
  MAT2_ESSENTIAL.include?(File.extname(path).downcase.delete('.'))
end
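
The lookup normalises the extension before comparing, so case does not matter and only the final extension is considered. A small sketch with the constant and method copied from this page; the file names are made up for illustration.

```ruby
MAT2_ESSENTIAL = %w[docx xlsx pptx odt ods odp odg odf epub].freeze

def mat2_essential?(path)
  MAT2_ESSENTIAL.include?(File.extname(path).downcase.delete('.'))
end

p mat2_essential?('Report.DOCX')      # => true  (extension is lower-cased)
p mat2_essential?('book.backup.epub') # => true  (only the last extension counts)
p mat2_essential?('scan.pdf')         # => false (PDF is handled without mat2)
p mat2_essential?('docx')             # => false (no extension at all)
```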

.privacy_group?(group) ⇒ Boolean

A group is privacy-bearing if it matches one of the family prefixes (GPS, XMP-, MakerNotes, IPTC, IFD1).

Returns:

  • (Boolean)


# File 'lib/metaclean/strategy.rb', line 159

def privacy_group?(group)
  PRIVACY_GROUP_PREFIXES.any? { |p| group.to_s.start_with?(p) }
end
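
The prefix match can be exercised directly. The constant and method below are copied from this page; the group names passed in are illustrative examples rather than an exhaustive list of what ExifTool reports.

```ruby
PRIVACY_GROUP_PREFIXES = %w[GPS XMP- MakerNotes IPTC IFD1].freeze

def privacy_group?(group)
  PRIVACY_GROUP_PREFIXES.any? { |p| group.to_s.start_with?(p) }
end

p privacy_group?('GPS')          # => true
p privacy_group?('XMP-mwg-rs')   # => true  (any XMP namespace)
p privacy_group?('IFD1')         # => true  (thumbnail IFD)
p privacy_group?('IFD0')         # => false
p privacy_group?('ICC_Profile')  # => false (intentionally not flagged)
p privacy_group?('XMP')          # => false (the prefix includes the hyphen)
p privacy_group?(nil)            # => false (nil.to_s is "")
```

Note that the `XMP-` prefix only matches namespace-qualified group names; a bare `XMP` group would rely on the tag-name check instead.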

.privacy_residual(meta) ⇒ Object

Looks at metadata read AFTER cleaning and returns the entries that still look privacy-relevant. The runner uses this for the “still present” warning at the end of each file.

Why both group-match and tag-match? Tag names can appear under different groups depending on the format (e.g. “Author” in PDF vs “Artist” in EXIF). Combining the two keeps coverage broad without having to enumerate every tag pair.



# File 'lib/metaclean/strategy.rb', line 123

def privacy_residual(meta)
  meta.select do |k, v|
    # Skip SourceFile and the System/File/etc. groups — not user metadata.
    next false unless Display.embedded_key?(k)

    # ExifTool keys look like "GPS:GPSLatitude". Split on the first ":";
    # no "Group:" prefix means the whole key is the tag name.
    group, tag = k.to_s.split(':', 2)
    name = tag.nil? ? group.to_s : tag

    # A zeroed/empty value is not a leak for un-removable container atoms like
    # QuickTime:CreateDate (deletable only by zeroing, "0000:00:00 …") — without
    # this every video would fail the gate on an already-zeroed date. GPS is the
    # exception: 0,0 is a REAL location (Null Island) and a coordinate ExifTool
    # reports as 0 (or null) must still be caught, so the blank exemption NEVER
    # applies to GPS-family entries — the whole point of the fail-closed backstop.
    gps = group.to_s.start_with?('GPS') || name.start_with?('GPS')
    next false if !gps && blank_value?(v)

    privacy_group?(group) || privacy_tag?(name)
  end
end
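
To see the filter end to end, the sketch below runs it on a small hand-written metadata hash. Several things here are assumptions: `Display.embedded_key?` is a hypothetical stand-in (the real one is not shown on this page), `StrategySketch` is an illustrative module name, `PRIVACY_TAGS` is abbreviated, and the sample keys and values are invented.

```ruby
# Hypothetical stand-in for the real Display.embedded_key?, which is not shown here.
module Display
  def self.embedded_key?(key)
    k = key.to_s
    k != 'SourceFile' && !k.start_with?('System:', 'File:')
  end
end

module StrategySketch
  PRIVACY_GROUP_PREFIXES = %w[GPS XMP- MakerNotes IPTC IFD1].freeze
  PRIVACY_TAGS = %w[Artist Author CreateDate].freeze # abbreviated subset

  module_function

  def blank_value?(value)
    s = value.to_s
    s.strip.empty? || s.gsub(/[Z0\s:.+-]/, '').empty?
  end

  def privacy_group?(group)
    PRIVACY_GROUP_PREFIXES.any? { |p| group.to_s.start_with?(p) }
  end

  def privacy_tag?(tag)
    t = tag.to_s
    PRIVACY_TAGS.include?(t) || t.start_with?('GPS')
  end

  def privacy_residual(meta)
    meta.select do |k, v|
      next false unless Display.embedded_key?(k)

      group, tag = k.to_s.split(':', 2)
      name = tag.nil? ? group.to_s : tag
      gps = group.to_s.start_with?('GPS') || name.start_with?('GPS')
      next false if !gps && blank_value?(v)

      privacy_group?(group) || privacy_tag?(name)
    end
  end
end

SAMPLE_META = {
  'SourceFile' => 'clip.mp4',                        # skipped: not embedded
  'File:FileSize' => '12 MB',                        # skipped: File group
  'QuickTime:CreateDate' => '0000:00:00 00:00:00',   # zeroed, so not a leak
  'GPS:GPSLatitude' => 0,                            # zero, but GPS is never exempt
  'EXIF:Artist' => 'Jane Doe',                       # caught by tag name
  'XMP-dc:Format' => 'image/jpeg',                   # caught by group prefix
  'ICC_Profile:ProfileDescription' => 'sRGB'         # neither group nor tag matches
}.freeze

p StrategySketch.privacy_residual(SAMPLE_META).keys
# => ["GPS:GPSLatitude", "EXIF:Artist", "XMP-dc:Format"]
```

The `XMP-dc:Format` entry is harmless on its own, which illustrates the deliberate over-flagging: the whole XMP family is reported rather than risking a missed variant.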

.privacy_tag?(tag) ⇒ Boolean

A tag is privacy-bearing if it’s in the exact list OR is any GPS* tag (GPSLatitude/GPSLongitude/GPSPosition/… regardless of group).

Returns:

  • (Boolean)


# File 'lib/metaclean/strategy.rb', line 165

def privacy_tag?(tag)
  t = tag.to_s
  PRIVACY_TAGS.include?(t) || t.start_with?('GPS')
end
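
The two paths (exact list membership and the `GPS` prefix) behave slightly differently, as this sketch shows. The method is copied from the listing; `PRIVACY_TAGS` is abbreviated here to keep the example short.

```ruby
PRIVACY_TAGS = %w[Artist Author Copyright SerialNumber CreateDate].freeze # abbreviated subset

def privacy_tag?(tag)
  t = tag.to_s
  PRIVACY_TAGS.include?(t) || t.start_with?('GPS')
end

p privacy_tag?('Artist')             # => true  (exact list match)
p privacy_tag?('GPSAltitude')        # => true  (GPS prefix, not in the list)
p privacy_tag?('artist')             # => false (the list match is case-sensitive)
p privacy_tag?('ProfileDescription') # => false
```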

.tools_for(path) ⇒ Object

Returns an ordered list of tool symbols (e.g. `[:mat2, :exiftool, :qpdf]`) to run on `path`. The runner executes them in order; if one fails or is skipped, the next still runs. Every applicable tool is used for maximum coverage: there is no per-tool opt-out, and a tool that isn’t installed is simply left out (the `.available?`/`.supports?` checks).



# File 'lib/metaclean/strategy.rb', line 82

def tools_for(path)
  ext = File.extname(path).downcase.delete('.')
  tools = []

  if ext == 'pdf'
    # mat2 cleans PDFs by RASTERIZING every page (text → images): it destroys
    # the text layer and balloons the file (~35×). So PDFs skip mat2 and use:
    #   exiftool → strips the Info dictionary + XMP (Author, Title, Producer…)
    #   qpdf → rebuilds the file, dropping unreferenced objects / old revisions
    # Both are lossless and leave the text intact. (PDF JS/macros are out of
    # scope — see README.)
    tools << :exiftool
    tools << :qpdf if Qpdf.available?
  elsif FFMPEG_FORMATS.include?(ext)
    # Matroska (mkv/webm): ffmpeg is the ONLY tool that can clean these.
    # ExifTool still re-reads the result afterwards, so the residual check
    # (the false-clean backstop) is not blind.
    tools << :ffmpeg if Ffmpeg.available?
  elsif MAT2_PREFERRED.include?(ext) && Mat2.available?
    # Office docs, modern image/video containers — mat2 leads.
    tools << :mat2
    tools << :exiftool
  else
    # Everything else (JPEG, MP3, RAW, …) — ExifTool is the gold standard.
    # mat2 still adds coverage for many, but NOT for rasters it would damage
    # (MAT2_DEGRADES) — there ExifTool's in-place strip is complete and lossless.
    tools << :exiftool
    tools << :mat2 if Mat2.supports?(path) && !MAT2_DEGRADES.include?(ext)
  end

  tools
end
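
The branch order is easiest to follow by running the selection on a few file names. In this sketch the `Qpdf`, `Ffmpeg` and `Mat2` modules are hypothetical stand-ins that report every tool as installed, and the `Mat2.supports?` extension list is an assumption made for illustration; the constants and method body are copied from this page.

```ruby
# Hypothetical stand-ins for the real tool wrappers, which are not shown here.
module Qpdf
  def self.available?
    true
  end
end

module Ffmpeg
  def self.available?
    true
  end
end

module Mat2
  def self.available?
    true
  end

  # Assumption: pretend mat2 supports exactly these extensions.
  def self.supports?(path)
    %w[.jpg .mp3 .png .docx].include?(File.extname(path).downcase)
  end
end

MAT2_PREFERRED = %w[docx xlsx pptx odt ods odp odg odf epub png svg mp4 avi].freeze
FFMPEG_FORMATS = %w[mkv webm].freeze
MAT2_DEGRADES  = %w[jpg jpeg webp tif tiff].freeze

def tools_for(path)
  ext = File.extname(path).downcase.delete('.')
  tools = []

  if ext == 'pdf'
    tools << :exiftool
    tools << :qpdf if Qpdf.available?
  elsif FFMPEG_FORMATS.include?(ext)
    tools << :ffmpeg if Ffmpeg.available?
  elsif MAT2_PREFERRED.include?(ext) && Mat2.available?
    tools << :mat2
    tools << :exiftool
  else
    tools << :exiftool
    tools << :mat2 if Mat2.supports?(path) && !MAT2_DEGRADES.include?(ext)
  end

  tools
end

p tools_for('report.pdf')  # => [:exiftool, :qpdf]
p tools_for('clip.mkv')    # => [:ffmpeg]
p tools_for('deck.pptx')   # => [:mat2, :exiftool]
p tools_for('photo.jpg')   # => [:exiftool]
p tools_for('song.mp3')    # => [:exiftool, :mat2]
```

`photo.jpg` gets ExifTool alone even though the stub says mat2 supports it, because `jpg` is listed in `MAT2_DEGRADES`.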