Class: Uniword::Infrastructure::ZipExtractor

Inherits:
Object
Defined in:
lib/uniword/infrastructure/zip_extractor.rb

Overview

Extracts content from ZIP archives (e.g., DOCX files).

Responsibility: Handle ZIP file extraction operations. Does NOT handle: Document parsing or deserialization.

DOCX files are ZIP archives containing XML files and media. This class provides low-level ZIP extraction functionality.

Examples:

Extract content from a DOCX file

extractor = Uniword::Infrastructure::ZipExtractor.new
content = extractor.extract("document.docx")
xml = content["word/document.xml"]
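Work with the returned Hash

Because the result is an ordinary Hash keyed by entry path, the usual Hash methods apply. The sketch below is only an illustration: `split_parts` is a hypothetical helper (not part of Uniword), and `sample` is a hand-written stand-in for the value returned by `extract`, with made-up entry contents.

```ruby
# Hypothetical helper (not part of Uniword): split an extracted-archive Hash
# into entries whose name ends in ".xml" and all remaining entries.
# The split is by file extension only.
def split_parts(content)
  xml, other = content.partition { |name, _data| name.end_with?(".xml") }
  [xml.to_h, other.to_h]
end

# Stand-in for the Hash returned by ZipExtractor#extract; contents are made up.
sample = {
  "[Content_Types].xml"   => "<Types/>",
  "word/document.xml"     => "<w:document/>",
  "word/media/image1.png" => "\x89PNG".b
}

xml_parts, other_parts = split_parts(sample)
puts xml_parts.keys.inspect
puts other_parts.keys.inspect
```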

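Instance Method Summary

  • #extract(path) ⇒ Hash<String, String>

    Extract all files from a ZIP archive or stream.

  • #extract_file(path, entry_path) ⇒ String?

    Extract a specific file from a ZIP archive.

  • #extract_from_stream(stream) ⇒ Hash<String, String>

    Extract from an IO or StringIO stream.

  • #list_files(path) ⇒ Array<String>

    List all files in a ZIP archive.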

Instance Method Details

#extract(path) ⇒ Hash<String, String>

Extract all files from a ZIP archive or stream.

Parameters:

  • path (String, IO, StringIO)

    Path to the ZIP file on disk, or an open stream containing ZIP data

Returns:

  • (Hash<String, String>)

    Hash mapping each entry's path within the archive to its content (directory entries are skipped)

Raises:

  • (ArgumentError)

    if path is invalid

  • (Zip::Error)

    if ZIP extraction fails



# File 'lib/uniword/infrastructure/zip_extractor.rb', line 26

def extract(path)
  # Handle streams directly
  return extract_from_stream(path) if path.is_a?(IO) || path.is_a?(StringIO)

  validate_path(path)

  content = {}

  Zip::File.open(path) do |zip_file|
    zip_file.each do |entry|
      next if entry.directory?

      content[entry.name] =
        entry.get_input_stream.read.force_encoding("UTF-8")
    end

    # Explicitly extract theme if present
    theme_entry = zip_file.find_entry("word/theme/theme1.xml")
    if theme_entry && !content.key?("word/theme/theme1.xml")
      content["word/theme/theme1.xml"] =
        theme_entry.get_input_stream.read.force_encoding("UTF-8")
    end
  end

  content
end
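
The first line of the method decides between the stream branch and the path branch with a plain `is_a?` check. The sketch below reproduces only that check in a hypothetical helper, `stream_input?`, to show which objects qualify. One consequence worth knowing: a `Tempfile` instance delegates to a `File` rather than inheriting from `IO`, so it would not take the stream branch unless the underlying file is passed instead. How `validate_path` then treats such an object is not shown in this page.

```ruby
require "stringio"
require "tempfile"

# Hypothetical helper mirroring the type check at the top of #extract.
def stream_input?(source)
  source.is_a?(IO) || source.is_a?(StringIO)
end

puts stream_input?(StringIO.new("PK"))   # in-memory stream
puts stream_input?("document.docx")      # plain path String

# A File is an IO subclass, so an open file handle counts as a stream.
File.open(File::NULL) { |file| puts stream_input?(file) }

# Tempfile wraps a File through delegation; compare the wrapper with the
# underlying File returned by #to_io.
tmp = Tempfile.new("sample")
begin
  puts stream_input?(tmp)
  puts stream_input?(tmp.to_io)
ensure
  tmp.close!
end
```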

#extract_file(path, entry_path) ⇒ String?

Extract a specific file from a ZIP archive.

Parameters:

  • path (String)

    The path to the ZIP file

  • entry_path (String)

    The path of the file within the ZIP

Returns:

  • (String, nil)

    The file content, or nil if not found

Raises:

  • (ArgumentError)

    if path is invalid



# File 'lib/uniword/infrastructure/zip_extractor.rb', line 85

def extract_file(path, entry_path)
  validate_path(path)

  Zip::File.open(path) do |zip_file|
    entry = zip_file.find_entry(entry_path)
    return nil unless entry

    entry.get_input_stream.read.force_encoding("UTF-8")
  end
end
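
Since a missing entry yields `nil` rather than an exception, callers that require the entry may want to convert that `nil` into an explicit error. The following is a minimal sketch under stated assumptions: `fetch_entry!` is a hypothetical helper, and `FakeExtractor` is a stand-in that only imitates the `extract_file` signature so the example runs without a real archive.

```ruby
# Hypothetical helper (not part of Uniword): raise instead of returning nil
# when the requested entry is absent.
def fetch_entry!(extractor, path, entry_path)
  data = extractor.extract_file(path, entry_path)
  raise KeyError, "#{entry_path} not found in #{path}" if data.nil?

  data
end

# Stand-in for ZipExtractor: looks entries up in a Hash and returns nil
# for unknown names, like #extract_file does for missing entries.
class FakeExtractor
  def initialize(entries)
    @entries = entries
  end

  def extract_file(_path, entry_path)
    @entries[entry_path]
  end
end

extractor = FakeExtractor.new({ "word/document.xml" => "<w:document/>" })
puts fetch_entry!(extractor, "document.docx", "word/document.xml")

begin
  fetch_entry!(extractor, "document.docx", "word/numbering.xml")
rescue KeyError => e
  puts e.message
end
```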

#extract_from_stream(stream) ⇒ Hash<String, String>

Extract all files from a ZIP archive held in an IO or StringIO stream.

Parameters:

  • stream (IO, StringIO)

    The stream to extract from

Returns:

  • (Hash<String, String>)

    Hash mapping file paths to contents



# File 'lib/uniword/infrastructure/zip_extractor.rb', line 57

def extract_from_stream(stream)
  content = {}

  Zip::File.open_buffer(stream) do |zip_file|
    zip_file.each do |entry|
      next if entry.directory?

      content[entry.name] =
        entry.get_input_stream.read.force_encoding("UTF-8")
    end

    # Explicitly extract theme if present
    theme_entry = zip_file.find_entry("word/theme/theme1.xml")
    if theme_entry && !content.key?("word/theme/theme1.xml")
      content["word/theme/theme1.xml"] =
        theme_entry.get_input_stream.read.force_encoding("UTF-8")
    end
  end

  content
end
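
Every entry is labelled UTF-8 with `force_encoding`, which changes only the encoding tag and leaves the bytes untouched. For XML parts this is usually what is wanted, but binary entries such as images end up as strings that claim to be UTF-8 without being valid UTF-8. The sketch below demonstrates the effect on a few PNG header bytes and shows one possible way to re-label such entries as binary; `binary_safe` is a hypothetical helper, not part of Uniword.

```ruby
# Bytes as they might be read from an image entry, then labelled UTF-8
# the same way the extraction code does.
png_header = "\x89PNG\r\n\x1A\n".b
labelled   = png_header.dup.force_encoding("UTF-8")

puts labelled.encoding
puts labelled.valid_encoding?
puts labelled.bytes == png_header.bytes

# Hypothetical helper: keep valid UTF-8 strings as they are and re-label
# everything else as binary, without modifying the caller's string.
def binary_safe(data)
  data.valid_encoding? ? data : data.dup.force_encoding(Encoding::BINARY)
end

puts binary_safe(labelled).encoding
```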

#list_files(path) ⇒ Array<String>

List all files in a ZIP archive.

Parameters:

  • path (String)

    The path to the ZIP file

Returns:

  • (Array<String>)

    Array of file paths within the ZIP (directory entries are excluded)

Raises:

  • (ArgumentError)

    if path is invalid



# File 'lib/uniword/infrastructure/zip_extractor.rb', line 101

def list_files(path)
  validate_path(path)

  files = []

  Zip::File.open(path) do |zip_file|
    zip_file.each do |entry|
      files << entry.name unless entry.directory?
    end
  end

  files
end
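
The returned Array holds plain entry names, so it can be organised with standard Enumerable methods. As an illustration, the sketch below groups names by their containing directory; `group_by_directory` is a hypothetical helper and the sample names are a hand-written stand-in for a `list_files` result.

```ruby
# Hypothetical helper (not part of Uniword): group entry names by directory.
# Entries at the archive root are grouped under ".".
def group_by_directory(names)
  names.group_by { |name| File.dirname(name) }
end

# Stand-in for the Array returned by #list_files.
names = [
  "[Content_Types].xml",
  "word/document.xml",
  "word/styles.xml",
  "word/media/image1.png"
]

group_by_directory(names).each do |dir, entries|
  puts "#{dir}: #{entries.length}"
end
```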