Module: Metaclean

Defined in:
lib/metaclean/display.rb,
lib/metaclean.rb,
lib/metaclean/cli.rb,
lib/metaclean/mat2.rb,
lib/metaclean/qpdf.rb,
lib/metaclean/ffmpeg.rb,
lib/metaclean/runner.rb,
lib/metaclean/version.rb,
lib/metaclean/exiftool.rb,
lib/metaclean/strategy.rb

Overview

The “policy” module: which tools to run for which file, and what counts as privacy-relevant if it survives a clean.

Keeping this logic in its own file means the runner doesn’t need to know about formats — it just asks Strategy.tools_for(path) and runs whatever comes back.

Defined Under Namespace

Modules: Display, Exiftool, Ffmpeg, Mat2, Qpdf, Strategy Classes: CLI, Error, Runner, ToolsMissing

Constant Summary collapse

COMMAND_TIMEOUT =

External tools can hang, or run away producing endless output, on a corrupt or hostile file. Every OPERATIONAL shell-out (read/strip/rebuild) goes through this instead of Open3.capture3 so one bad file is bounded on BOTH axes — by wall-clock (COMMAND_TIMEOUT) and by captured bytes (MAX_OUTPUT_BYTES) — rather than hanging or exhausting memory and taking the whole batch with it. The quick availability probes (‘-ver`/`–version`) stay on plain capture3: fixed args, no file input, nothing to hang on.

120
MAX_OUTPUT_BYTES =

Per stream (stdout AND stderr). Far above any legitimate output from the tools’ invocations here (metadata JSON / ‘-q` strips / `-v error` muxes), so tripping it means a runaway, not a real result.

64 * 1024 * 1024
READ_CHUNK =
64 * 1024
TMP_MARKER =

Marker embedded in every staging-temp filename (Runner, Ffmpeg, Qpdf) and matched by Runner#skip?, so a leftover temp from an interrupted run is ignored on a later directory scan. One literal keeps the producers and the matcher from drifting (qpdf previously embedded a divergent “.metaclean.qpdf.tmp.” that didn’t contain this marker).

'.metaclean.tmp.'
CLEAN_SUFFIX =

Suffix of the default “<name>_clean.<ext>” outputs. Runner#build_clean_path writes it; CLEAN_OUTPUT_RE derives the loop-prevention match from it so the producer and Runner#skip? can’t disagree.

'_clean'
CLEAN_OUTPUT_RE =

Matches our own “<name>_clean.<ext>” outputs (with optional “_N” collision counter) so a recursive re-run doesn’t re-clean them. Compiled once here, in the module body that runs after the requires, so CLEAN_SUFFIX exists.

/#{Regexp.escape(CLEAN_SUFFIX)}(_\d+)?\.[^.]+\z/
VERSION =
'4.1.0'

Class Method Summary collapse

Class Method Details

.capture3(*cmd, timeout: COMMAND_TIMEOUT, max_output: MAX_OUTPUT_BYTES) ⇒ Object

Drop-in replacement for Open3.capture3 that returns the same [out, err, status] triple but kills the command (and anything it spawned) if it runs past ‘timeout` OR floods more than `max_output` bytes on either stream.



53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
# File 'lib/metaclean.rb', line 53

def self.capture3(*cmd, timeout: COMMAND_TIMEOUT, max_output: MAX_OUTPUT_BYTES)
  Open3.popen3(*cmd, pgroup: true) do |stdin, stdout, stderr, wait_thr|
    stdin.close
    # Drain both pipes concurrently: a tool that fills one pipe buffer would
    # otherwise block forever before exiting, and `join` below would never see
    # it finish even though it isn't actually hung.
    out_t = read_capped(stdout, max_output, wait_thr)
    err_t = read_capped(stderr, max_output, wait_thr)

    if wait_thr.join(timeout).nil?
      kill_group(wait_thr)
      out_t.join(2)
      err_t.join(2)
      raise Error, "#{cmd.first} timed out after #{timeout}s"
    end

    out, out_over = out_t.value
    err, err_over = err_t.value
    raise Error, "#{cmd.first} exceeded the #{max_output}-byte output limit" if out_over || err_over

    [out, err, wait_thr.value]
  end
end

.ensure_tools!Object

Preflight: all four tools must be installed. We run them together for full coverage and to verify the strip, so a partial toolchain is not “good enough” — bail with one clear message naming what’s missing and how to install everything. Called once by the CLI before any inspect/clean work.

Raises:



139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
# File 'lib/metaclean.rb', line 139

def self.ensure_tools!
  missing = []
  missing << 'exiftool' unless Exiftool.available?
  missing << 'mat2'     unless Mat2.available?
  missing << 'qpdf'     unless Qpdf.available?
  missing << 'ffmpeg'   unless Ffmpeg.available?
  return if missing.empty?

  raise ToolsMissing, <<~MSG
    Missing required tool(s): #{missing.join(', ')}

    metaclean needs ExifTool, mat2, qpdf and ffmpeg together. Install all four:
      macOS:          brew install exiftool mat2 qpdf ffmpeg
      Debian/Ubuntu:  sudo apt install libimage-exiftool-perl mat2 qpdf ffmpeg
      Fedora:         sudo dnf install perl-Image-ExifTool mat2 qpdf ffmpeg
      Arch:           sudo pacman -S perl-image-exiftool mat2 qpdf ffmpeg
      Windows:        use WSL2 (https://learn.microsoft.com/windows/wsl/install) + the Debian/Ubuntu line
  MSG
end

.ext_of(path) ⇒ Object

Lower-cased, dot-stripped extension used for FORMAT ROUTING decisions (Strategy#tools_for, Strategy#mat2_essential?, Mat2.supports?). One definition so every routing path normalizes the extension identically —a future tweak (double extensions, locale-safe downcasing) lands once.



114
115
116
# File 'lib/metaclean.rb', line 114

def self.ext_of(path)
  File.extname(path.to_s).downcase.delete('.')
end

.kill_group(wait_thr) ⇒ Object

SIGTERM the child’s whole process group — pgroup:true made the child the group leader, so any helpers it forked are signalled too — escalating to SIGKILL if it ignores TERM. A negative pid targets the group.



103
104
105
106
107
108
# File 'lib/metaclean.rb', line 103

def self.kill_group(wait_thr)
  Process.kill('-TERM', wait_thr.pid)
  Process.kill('-KILL', wait_thr.pid) unless wait_thr.join(2)
rescue Errno::ESRCH, Errno::EPERM
  nil # already gone, or not permitted to signal it — nothing more to do
end

.read_capped(io, limit, wait_thr) ⇒ Object

Read an IO into a String in a thread, but stop accumulating once it passes ‘limit` bytes — and kill the command then, so a flooding stream is cut off promptly instead of waiting out the full timeout. After the cap is hit it keeps draining (discarding) so the dying child isn’t blocked on a full pipe. Returns [string, overflowed?].



82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
# File 'lib/metaclean.rb', line 82

def self.read_capped(io, limit, wait_thr)
  Thread.new do
    buf = +''
    over = false
    while (chunk = io.read(READ_CHUNK))
      next if over # past the cap: drain & discard so the child can exit

      buf << chunk
      next unless buf.bytesize > limit

      over = true
      buf = buf.byteslice(0, limit)
      kill_group(wait_thr)
    end
    [buf, over]
  end
end

.safe_path(path) ⇒ Object

A path beginning with “-” is misread as an option by the tools we shell out to — e.g. exiftool’s ‘-config FILE` loads and runs arbitrary Perl. Open3 argument arrays bypass the shell, but NOT the invoked tool’s own option parser. Prefixing a leading-dash relative path with “./” makes it unambiguously a filename to every tool. Absolute paths and normal names pass through untouched. Used at every shell-out boundary.



31
32
33
34
# File 'lib/metaclean.rb', line 31

def self.safe_path(path)
  s = path.to_s
  s.start_with?('-') ? File.join('.', s) : s
end