Class: Uniword::FormatDetector

Inherits:
Object
Defined in:
lib/uniword/format_detector.rb

Overview

Detects document format from file signatures and extensions.

Responsibility: Identify the document format from file magic numbers, falling back to extension-based detection when no signature matches. Follows the Single Responsibility Principle: detection logic is kept separate from other concerns.

Detection strategy:

  1. Check file signature (magic number)

  2. Check MIME headers for MHTML

  3. Fall back to file extension
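
The ordering of the three steps can be sketched as follows. This is a minimal illustration, not the library's implementation: `sketch_detect`, `ZIP_MAGIC`, and the extension table are hypothetical names, and treating any ZIP signature as DOCX is an assumption made for the example.

```ruby
# Hypothetical sketch of the three-step detection order.
# ZIP_MAGIC mirrors the documented ZIP signature; everything else is assumed.
ZIP_MAGIC = [0x50, 0x4B, 0x03, 0x04].pack("C*").freeze

# head:      the first bytes of the document, as a String
# extension: the file extension including the dot, e.g. ".docx"
def sketch_detect(head, extension)
  bytes = head.b # compare as raw bytes, independent of source encoding

  # 1. File signature (magic number). Assumption: ZIP container means DOCX.
  return :docx if bytes.start_with?(ZIP_MAGIC)

  # 2. MIME header, which marks an MHTML document.
  return :mhtml if bytes.include?("MIME-Version:")

  # 3. Fall back to the extension. The mapping here is an assumption.
  case extension.downcase
  when ".docx" then :docx
  when ".mhtml", ".mht" then :mhtml
  else raise ArgumentError, "cannot detect format"
  end
end
```

Content wins over the file name in this sketch, so a ZIP archive misnamed `report.mhtml` would still be reported as `:docx`.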

Examples:

Detect format

detector = Uniword::FormatDetector.new
format = detector.detect("document.docx")
# => :docx

Constant Summary

ZIP_SIGNATURE =

ZIP file magic number (PK\x03\x04)

[0x50, 0x4B, 0x03, 0x04].pack("C*").freeze
HTML_MARKERS =

HTML tag markers

["<!DOCTYPE html", "<html", "<HTML"].freeze
MIME_HEADER =

MIME version header for MHTML

"MIME-Version:"
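
As a rough illustration of how a signature constant like this might be used, the sketch below reads only the first four bytes of a file and compares them with the ZIP magic number. The helper name `zip_signature?` is hypothetical and is not part of the documented API.

```ruby
require "tempfile"

# Same byte sequence as the documented ZIP_SIGNATURE constant.
ZIP_SIGNATURE = [0x50, 0x4B, 0x03, 0x04].pack("C*").freeze

# Hypothetical helper: true when the file starts with the ZIP magic number.
# File.binread reads raw bytes; it returns nil for an empty file, which
# simply compares unequal to the signature.
def zip_signature?(path)
  File.binread(path, ZIP_SIGNATURE.bytesize) == ZIP_SIGNATURE
end

# Usage: write a tiny file that begins with "PK\x03\x04" and probe it.
Tempfile.create("probe") do |file|
  file.binmode
  file.write("PK\x03\x04remaining bytes")
  file.flush
  puts zip_signature?(file.path)
end
```

Reading a fixed number of bytes keeps the check cheap even for large documents.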

Instance Method Summary

Instance Method Details

#detect(path) ⇒ Symbol

Detect the format of a file or stream.

Examples:

Detect DOCX

detector = FormatDetector.new
format = detector.detect("document.docx")
# => :docx

Parameters:

  • path (String, IO, StringIO)

    The file path or stream

Returns:

  • (Symbol)

    The detected format (:docx, :mhtml)

Raises:

  • (ArgumentError)

    if path is invalid

  • (ArgumentError)

    if format cannot be detected

# File 'lib/uniword/format_detector.rb', line 40

def detect(path)
  # For streams, detect from content
  return detect_stream_format(path) if path.is_a?(IO) || path.is_a?(StringIO)

  validate_path(path)

  # Try signature-based detection first
  format = detect_by_signature(path)
  return format if format

  # Fallback to extension-based detection
  detect_by_extension(path)
end
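
The stream branch delegates to `detect_stream_format`, whose body is not shown on this page. One plausible approach, sketched here purely as an assumption, is to peek at the leading bytes and then restore the stream position so the caller can still read the document from the start. The name `peek_stream_format` and the 512-byte probe size are hypothetical.

```ruby
require "stringio"

# Hypothetical sketch of content-based detection for a seekable stream.
# It does not reproduce the library's detect_stream_format.
def peek_stream_format(io, probe_size = 512)
  start = io.pos                     # remember where the caller was
  head = io.read(probe_size).to_s.b  # nil at EOF becomes an empty string
  io.seek(start)                     # restore the position for later reads

  return :docx if head.start_with?("PK\x03\x04".b)
  return :mhtml if head.include?("MIME-Version:")

  raise ArgumentError, "cannot detect format from stream"
end

stream = StringIO.new("MIME-Version: 1.0\r\nContent-Type: multipart/related\r\n")
puts peek_stream_format(stream)
```

This sketch assumes a seekable stream such as a `File` or `StringIO`; a pipe or socket would raise on `pos` and `seek` and would need buffering instead.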