Module: Metaclean::Strategy
- Defined in:
- lib/metaclean/strategy.rb
Constant Summary
- PRIVACY_GROUP_PREFIXES =
Group-name PREFIXES treated as privacy-bearing. Matching whole families by prefix keeps the residual check fail-closed instead of an exact allowlist that silently misses variants:
  GPS*: GPS plus any sub-group
  XMP-*: every XMP namespace (XMP-exif GPS, XMP-mwg-rs face/person names, XMP-xmpMM DocumentID, XMP-iptcExt, …)
  MakerNotes*, IPTC*
  IFD1: the embedded thumbnail IFD; a surviving thumbnail can carry the original's full EXIF+GPS
Over-flagging here is deliberate: for a privacy tool a false “still present” is far cheaper than a false “Cleaned”. (ICC colour-profile groups are intentionally NOT flagged: a colour profile isn’t PII, and any genuinely identifying field such as Copyright is still caught by PRIVACY_TAGS below.)
%w[GPS XMP- MakerNotes IPTC IFD1].freeze
- MAT2_ESSENTIAL =
Formats where ExifTool alone leaves document-internal metadata that only mat2’s rebuild removes (and which ExifTool also can’t fully re-read to verify). If mat2 won’t run for one of these, the runner warns that coverage is reduced rather than reporting a confident “Cleaned”.
%w[pdf docx xlsx pptx odt ods odp odg odf epub].freeze
- PRIVACY_TAGS =
Specific tag NAMES (regardless of group) we never want to leak. If exiftool reports e.g. “EXIF:Artist” we still flag it because of the tag-name match, not the group. exiftool’s `-all=` normally strips these, so this list is a fail-closed BACKSTOP: if any survive a strip we’d rather over-warn than report a confident “Cleaned”.
%w[ Artist Author Creator Copyright Rights By-line By-lineTitle Credit Source Contact OwnerName CameraOwnerName SerialNumber InternalSerialNumber LensSerialNumber Software HostComputer ProcessingSoftware ImageDescription UserComment LastModifiedBy LastSavedBy LastAuthor Make Model LensModel DateTimeOriginal CreateDate Title Subject Keywords Description Category Producer Company Manager CreationDate ModDate XPAuthor XPComment XPSubject XPKeywords XPTitle Comment ].freeze
- MAT2_PREFERRED =
File extensions where mat2 is meaningfully stricter than ExifTool and should run first. For other formats, ExifTool is the broader expert.
%w[ pdf docx xlsx pptx odt ods odp odg odf epub png svg mp4 avi mkv mov webm ].freeze
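The relationship between the two mat2 lists can be checked with plain array arithmetic. This is a minimal sketch that copies the two arrays from the constants above into a standalone script (top-level constants, not the `Metaclean::Strategy` namespace), so it illustrates the data rather than the library's own code.

```ruby
# Values copied from the constant summary above.
MAT2_ESSENTIAL = %w[pdf docx xlsx pptx odt ods odp odg odf epub].freeze
MAT2_PREFERRED = %w[
  pdf docx xlsx pptx odt ods odp odg odf epub
  png svg mp4 avi mkv mov webm
].freeze

# Essential formats that are NOT also preferred (expected to be none).
essential_only = MAT2_ESSENTIAL - MAT2_PREFERRED

# Formats where mat2 runs first but is not listed as essential.
preferred_only = MAT2_PREFERRED - MAT2_ESSENTIAL

p essential_only
p preferred_only
```

With the values as listed, every essential format is also a preferred one, and the image and video containers are the only formats that are preferred without being essential.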
Class Method Summary
- .mat2_essential?(path) ⇒ Boolean
Does this path need mat2 for adequate coverage? (See MAT2_ESSENTIAL.)
- .privacy_group?(group) ⇒ Boolean
A group is privacy-bearing if it matches one of the family prefixes (GPS, XMP-, MakerNotes, IPTC, IFD1).
- .privacy_residual(meta) ⇒ Object
Looks at metadata read AFTER cleaning and returns the entries that still look privacy-relevant.
- .privacy_tag?(tag) ⇒ Boolean
A tag is privacy-bearing if it’s in the exact list OR is any GPS* tag (GPSLatitude/GPSLongitude/GPSPosition/… regardless of group).
- .tools_for(path) ⇒ Object
Returns an ordered list of tool symbols (e.g. `[:mat2, :exiftool, :qpdf]`) to run on `path`.
Class Method Details
.mat2_essential?(path) ⇒ Boolean
Does this path need mat2 for adequate coverage? (See MAT2_ESSENTIAL.)
    # File 'lib/metaclean/strategy.rb', line 125
    def mat2_essential?(path)
      MAT2_ESSENTIAL.include?(File.extname(path).downcase.delete('.'))
    end
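The extension normalization is easiest to see on a few sample paths. The sketch below copies the method into a hypothetical `StrategySketch` module (a stand-in, not the real `Metaclean::Strategy`) so it can run on its own; the file names are made up for illustration.

```ruby
module StrategySketch # hypothetical stand-in for Metaclean::Strategy
  MAT2_ESSENTIAL = %w[pdf docx xlsx pptx odt ods odp odg odf epub].freeze

  def self.mat2_essential?(path)
    # File.extname returns ".PDF"; downcase and strip the dot before lookup.
    MAT2_ESSENTIAL.include?(File.extname(path).downcase.delete('.'))
  end
end

samples = ['report.PDF', 'backup.tar.docx', 'photo.jpg', 'README']
results = samples.map { |path| [path, StrategySketch.mat2_essential?(path)] }.to_h

results.each { |path, essential| puts "#{path}: #{essential}" }
```

Only the last extension counts, the comparison is case-insensitive, and a path with no extension is never treated as mat2-essential.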
.privacy_group?(group) ⇒ Boolean
A group is privacy-bearing if it matches one of the family prefixes (GPS, XMP-, MakerNotes, IPTC, IFD1).
    # File 'lib/metaclean/strategy.rb', line 113
    def privacy_group?(group)
      PRIVACY_GROUP_PREFIXES.any? { |p| group.to_s.start_with?(p) }
    end
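To illustrate why prefix matching is described as fail-closed, the sketch below compares it against an exact allowlist. `EXACT_GROUPS` is a hypothetical list invented for the contrast; it does not exist in the library, and the group names are sample inputs rather than real exiftool output.

```ruby
PRIVACY_GROUP_PREFIXES = %w[GPS XMP- MakerNotes IPTC IFD1].freeze

def privacy_group?(group)
  PRIVACY_GROUP_PREFIXES.any? { |p| group.to_s.start_with?(p) }
end

# Hypothetical exact allowlist, for contrast only.
EXACT_GROUPS = %w[GPS XMP-exif MakerNotes IPTC IFD1].freeze

groups = %w[XMP-mwg-rs XMP-xmpMM GPS IFD1 ICC_Profile EXIF]

by_prefix = groups.select { |g| privacy_group?(g) }
by_exact  = groups.select { |g| EXACT_GROUPS.include?(g) }

puts "prefix match: #{by_prefix.join(', ')}"
puts "exact match:  #{by_exact.join(', ')}"
```

In this sketch the exact list misses the two XMP namespaces it never enumerated, while the prefix match flags them. Note that the prefix is `XMP-` with the hyphen, so a group named exactly `XMP` would not match under this code.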
.privacy_residual(meta) ⇒ Object
Looks at metadata read AFTER cleaning and returns the entries that still look privacy-relevant. The runner uses this for the “still present” warning at the end of each file.
Why both group-match and tag-match? Tag names can appear under different groups depending on the format (e.g. “Author” in PDF vs “Artist” in EXIF). Combining the two keeps coverage broad without having to enumerate every tag pair.
    # File 'lib/metaclean/strategy.rb', line 98
    def privacy_residual(meta)
      meta.select do |k, _|
        # Skip SourceFile and the System/File/etc. groups; they are not user metadata.
        next false unless Display.(k)
        # ExifTool keys look like "GPS:GPSLatitude". Split on the first ":";
        # no "Group:" prefix means the whole key is the tag name.
        group, tag = k.to_s.split(':', 2)
        name = tag.nil? ? group.to_s : tag
        privacy_group?(group) || privacy_tag?(name)
      end
    end
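A worked example may help show how the group match and the tag match combine. This is a sketch under stated assumptions: the `Display` check is replaced by a hypothetical `user_key` lambda that skips `SourceFile` and the `System:`/`File:` groups, `PRIVACY_TAGS` is abbreviated, and the metadata hash is invented sample data.

```ruby
PRIVACY_GROUP_PREFIXES = %w[GPS XMP- MakerNotes IPTC IFD1].freeze
PRIVACY_TAGS = %w[Artist Author Copyright Software Make Model].freeze # abbreviated copy

# Hypothetical stand-in for the Display check in the original method.
user_key = lambda do |key|
  k = key.to_s
  k != 'SourceFile' && !k.start_with?('System:', 'File:')
end

privacy_group = ->(group) { PRIVACY_GROUP_PREFIXES.any? { |p| group.to_s.start_with?(p) } }
privacy_tag   = ->(tag) { PRIVACY_TAGS.include?(tag.to_s) || tag.to_s.start_with?('GPS') }

# Invented sample of metadata read back after cleaning.
meta = {
  'SourceFile'            => 'photo.jpg',
  'File:FileSize'         => '2.1 MB',
  'EXIF:Artist'           => 'Jane Doe',
  'XMP-xmpMM:DocumentID'  => 'xmp.did:1234',
  'Composite:GPSPosition' => '48.85 N, 2.35 E',
  'EXIF:ImageWidth'       => 4000,
  'Software'              => 'SomeEditor 1.0'
}

residual = meta.select do |k, _|
  next false unless user_key.call(k)

  group, tag = k.to_s.split(':', 2)
  name = tag.nil? ? group.to_s : tag
  privacy_group.call(group) || privacy_tag.call(name)
end

puts residual.keys
```

Here `EXIF:Artist` and the group-less `Software` are caught by tag name, `XMP-xmpMM:DocumentID` by its group prefix, and `Composite:GPSPosition` by the GPS tag-name rule even though its group is not privacy-bearing.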
.privacy_tag?(tag) ⇒ Boolean
A tag is privacy-bearing if it’s in the exact list OR is any GPS* tag (GPSLatitude/GPSLongitude/GPSPosition/… regardless of group).
    # File 'lib/metaclean/strategy.rb', line 119
    def privacy_tag?(tag)
      t = tag.to_s
      PRIVACY_TAGS.include?(t) || t.start_with?('GPS')
    end
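The two halves of this check behave slightly differently, which a few sample inputs make visible. The sketch below uses an abbreviated copy of `PRIVACY_TAGS` and made-up tag names, so it is an illustration rather than the full list.

```ruby
PRIVACY_TAGS = %w[Artist Author Copyright Software Make Model Comment].freeze # abbreviated copy

def privacy_tag?(tag)
  t = tag.to_s
  PRIVACY_TAGS.include?(t) || t.start_with?('GPS')
end

inputs = ['Artist', 'GPSAltitudeRef', 'artist', 'ImageWidth', :Make]
checks = inputs.map { |t| [t, privacy_tag?(t)] }.to_h

checks.each { |tag, flagged| puts "#{tag.inspect}: #{flagged}" }
```

The list lookup is exact and case-sensitive, so `artist` is not flagged here, while any name beginning with `GPS` is flagged without being listed. Symbols work because the method calls `to_s` first.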
.tools_for(path) ⇒ Object
Returns an ordered list of tool symbols (e.g. `[:mat2, :exiftool, :qpdf]`) to run on `path`. The runner executes them in order; if one fails or is skipped, the next still runs. All applicable tools are always used together for maximum coverage (all three for PDFs): there is no per-tool opt-out, and a tool that isn’t installed is simply left out (the `.available?`/`.supports?` checks).
    # File 'lib/metaclean/strategy.rb', line 65
    def tools_for(path)
      ext = File.extname(path).downcase.delete('.')
      tools = []

      if ext == 'pdf'
        # PDFs benefit from all three, in this order:
        #   mat2     → cleans the high-level metadata + content streams it knows
        #   exiftool → strips the Info dictionary (Author, Title, Producer)
        #   qpdf     → rebuilds the file, dropping any unreferenced bits
        tools << :mat2 if Mat2.available?
        tools << :exiftool
        tools << :qpdf if Qpdf.available?
      elsif MAT2_PREFERRED.include?(ext) && Mat2.available?
        # Office docs, modern image/video containers: mat2 leads.
        tools << :mat2
        tools << :exiftool
      else
        # Everything else (JPEG, MP3, RAW, …): ExifTool is the gold standard.
        tools << :exiftool
        tools << :mat2 if Mat2.supports?(path)
      end

      tools
    end
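The ordering rules can be exercised without the real tools installed. The sketch below is a hypothetical `ToolsSketch` module in which keyword flags (`mat2:`, `qpdf:`, `mat2_supports:`) stand in for `Mat2.available?`, `Qpdf.available?` and `Mat2.supports?(path)`; those flags are assumptions made for illustration and are not part of the library's API.

```ruby
module ToolsSketch # hypothetical stand-in for Metaclean::Strategy
  MAT2_PREFERRED = %w[
    pdf docx xlsx pptx odt ods odp odg odf epub
    png svg mp4 avi mkv mov webm
  ].freeze

  # Keyword flags replace the real availability/support checks.
  def self.tools_for(path, mat2: true, qpdf: true, mat2_supports: false)
    ext = File.extname(path).downcase.delete('.')

    if ext == 'pdf'
      # mat2 first, exiftool always, qpdf last; missing tools drop out.
      [(:mat2 if mat2), :exiftool, (:qpdf if qpdf)].compact
    elsif MAT2_PREFERRED.include?(ext) && mat2
      %i[mat2 exiftool]
    else
      # exiftool leads; mat2 follows only if it supports the file.
      [:exiftool, (:mat2 if mat2_supports)].compact
    end
  end
end

pdf_all     = ToolsSketch.tools_for('a.pdf')
pdf_no_mat2 = ToolsSketch.tools_for('a.pdf', mat2: false)
docx        = ToolsSketch.tools_for('a.docx')
docx_bare   = ToolsSketch.tools_for('a.docx', mat2: false)
jpeg        = ToolsSketch.tools_for('a.jpg', mat2_supports: true)

p pdf_all, pdf_no_mat2, docx, docx_bare, jpeg
```

In every branch `:exiftool` is present, and a mat2-preferred format falls through to the exiftool-first branch when mat2 is unavailable, which mirrors the `elsif` condition in the method above.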