Module: Philiprehberger::GzipKit

Defined in:
lib/philiprehberger/gzip_kit.rb,
lib/philiprehberger/gzip_kit/version.rb

Overview

GzipKit provides gzip compression and decompression with streaming support.

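The module exposes string-oriented, IO-oriented, and file-oriented entry points:

  • compress / decompress operate on in-memory strings

  • compress_stream / decompress_stream copy from one IO to another

  • compress_file / decompress_file operate on file paths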

Streaming and file methods read in 64 KB chunks by default. The chunk size can be tuned via the chunk_size: keyword when dealing with very small or very large payloads.

Examples:

Compress and decompress a string

compressed = Philiprehberger::GzipKit.compress('hello')
Philiprehberger::GzipKit.decompress(compressed) # => "hello"

Defined Under Namespace

Classes: Error

Constant Summary

CHUNK_SIZE =
64 * 1024
GZIP_MAGIC =
[0x1f, 0x8b].freeze
VERSION =
'0.4.0'

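Class Method Summary

  • .compress(string, level: Zlib::DEFAULT_COMPRESSION, stats: false) ⇒ String, Hash

    Compress a string to gzip bytes.

  • .compress_file(src, dest, level: Zlib::DEFAULT_COMPRESSION, chunk_size: CHUNK_SIZE) ⇒ void

    Compress a file to a gzip file.

  • .compress_stream(io_in, io_out, level: Zlib::DEFAULT_COMPRESSION, chunk_size: CHUNK_SIZE) ⇒ void

    Streaming compression from one IO to another.

  • .compressed?(data) ⇒ Boolean

    Check if data is gzip-compressed by inspecting magic bytes.

  • .concat(data_a, data_b) ⇒ String

    Concatenate two gzip-compressed strings.

  • .decompress(data, stats: false) ⇒ String, Hash

    Decompress gzip bytes to a string.

  • .decompress_file(src, dest, chunk_size: CHUNK_SIZE) ⇒ void

    Decompress a gzip file to a regular file.

  • .decompress_stream(io_in, io_out, chunk_size: CHUNK_SIZE) ⇒ void

    Streaming decompression from one IO to another.

  • .equivalent?(blob_a, blob_b) ⇒ Boolean

    Check whether two gzip-compressed blobs decompress to equal byte strings.

  • .inspect_header(data) ⇒ Hash?

    Inspect the gzip header without decompressing.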

Class Method Details

.compress(string, level: Zlib::DEFAULT_COMPRESSION, stats: false) ⇒ String, Hash

Compress a string to gzip bytes.

Examples:

Compress a string

Philiprehberger::GzipKit.compress('hello, world!')
# => "\x1F\x8B\b\x00..." (binary gzip bytes)

Compress with stats

Philiprehberger::GzipKit.compress('a' * 10_000, stats: true)
# => { data: "...", ratio: 0.99, original_size: 10000, compressed_size: 41 }

Parameters:

  • string (String)

    the data to compress

  • level (Integer) (defaults to: Zlib::DEFAULT_COMPRESSION)

    compression level (Zlib::DEFAULT_COMPRESSION by default)

  • stats (Boolean) (defaults to: false)

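    when true, return a hash with the keys data, ratio, original_size and compressed_size; ratio is the fraction of space saved (1 - compressed_size / original_size)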

Returns:

  • (String, Hash)

    gzip-compressed bytes, or a stats hash when stats: true



# File 'lib/philiprehberger/gzip_kit.rb', line 44

def self.compress(string, level: Zlib::DEFAULT_COMPRESSION, stats: false)
  io_out = StringIO.new
  io_out.binmode
  gz = Zlib::GzipWriter.new(io_out, level)
  gz.write(string)
  gz.close
  compressed = io_out.string

  if stats
    original_size = string.bytesize
    compressed_size = compressed.bytesize
    ratio = original_size.zero? ? 0.0 : 1.0 - (compressed_size.to_f / original_size)
    {
      data: compressed,
      ratio: ratio,
      original_size: original_size,
      compressed_size: compressed_size
    }
  else
    compressed
  end
end
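
The ratio returned with stats: true is the fraction of space saved, not compressed size over original size. The following is a minimal stdlib-only sketch of that arithmetic, with Zlib.gzip standing in for GzipKit.compress; the helper name space_savings is hypothetical and not part of the gem.

```ruby
require 'zlib'

# Hypothetical helper mirroring the ratio formula in the source above:
# 1.0 - compressed_size / original_size, or 0.0 for empty input.
def space_savings(original, compressed)
  return 0.0 if original.bytesize.zero?

  1.0 - (compressed.bytesize.to_f / original.bytesize)
end

payload = 'a' * 10_000
gzipped = Zlib.gzip(payload)

puts format('saved %.1f%% of %d bytes', space_savings(payload, gzipped) * 100, payload.bytesize)
```

For very short or incompressible inputs the compressed size exceeds the original (a gzip header and trailer alone take at least 18 bytes), so this ratio can be negative.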

.compress_file(src, dest, level: Zlib::DEFAULT_COMPRESSION, chunk_size: CHUNK_SIZE) {|bytes_processed, total_bytes| ... } ⇒ void

This method returns an undefined value.

Compress a file to a gzip file.

Parameters:

  • src (String)

    path to the source file

  • dest (String)

    path to the destination gzip file

  • level (Integer) (defaults to: Zlib::DEFAULT_COMPRESSION)

    compression level (Zlib::DEFAULT_COMPRESSION by default)

  • chunk_size (Integer) (defaults to: CHUNK_SIZE)

    bytes per read chunk (defaults to 64 KB)

Yields:

  • (bytes_processed, total_bytes)

    progress callback, called after each chunk is written to the gzip stream

Yield Parameters:

  • bytes_processed (Integer)

    bytes processed so far

  • total_bytes (Integer)

    total file size

Raises:

  • (ArgumentError)

    if chunk_size is not a positive Integer



# File 'lib/philiprehberger/gzip_kit.rb', line 133

def self.compress_file(src, dest, level: Zlib::DEFAULT_COMPRESSION, chunk_size: CHUNK_SIZE, &block)
  validate_chunk_size!(chunk_size)

  File.open(src, 'rb') do |io_in|
    File.open(dest, 'wb') do |io_out|
      if block
        total_bytes = File.size(src)
        bytes_processed = 0
        gz = Zlib::GzipWriter.new(io_out, level)
        while (chunk = io_in.read(chunk_size))
          gz.write(chunk)
          bytes_processed += chunk.bytesize
          block.call(bytes_processed, total_bytes)
        end
        gz.finish
      else
        compress_stream(io_in, io_out, level: level, chunk_size: chunk_size)
      end
    end
  end
end
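
To illustrate how the progress block behaves, here is a stdlib-only sketch that follows the block branch of the source above. The helper gzip_file_with_progress is a hypothetical stand-in, not the gem's method; it uses Zlib::GzipWriter.open and a temporary file so that it runs on its own.

```ruby
require 'zlib'
require 'tempfile'

# Hypothetical stand-in for the block branch of compress_file, using only
# the standard library: gzip src into dest and yield progress per chunk.
def gzip_file_with_progress(src, dest, chunk_size: 64 * 1024)
  total_bytes = File.size(src)
  bytes_processed = 0

  File.open(src, 'rb') do |io_in|
    Zlib::GzipWriter.open(dest) do |gz|
      while (chunk = io_in.read(chunk_size))
        gz.write(chunk)
        bytes_processed += chunk.bytesize
        yield bytes_processed, total_bytes
      end
    end
  end
end

Tempfile.create('input') do |src|
  src.write('x' * 10_000)
  src.flush
  dest = "#{src.path}.gz"

  gzip_file_with_progress(src.path, dest, chunk_size: 4096) do |done, total|
    puts format('%5.1f%% (%d/%d bytes)', 100.0 * done / total, done, total)
  end
  File.delete(dest)
end
```

The last call reports bytes_processed equal to total_bytes. An empty source file never enters the read loop, so the block is not called at all in that case.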

.compress_stream(io_in, io_out, level: Zlib::DEFAULT_COMPRESSION, chunk_size: CHUNK_SIZE) ⇒ void

This method returns an undefined value.

Streaming compression from one IO to another.

Examples:

Compress from one IO to another

File.open('input.txt', 'rb') do |io_in|
  File.open('output.gz', 'wb') do |io_out|
    Philiprehberger::GzipKit.compress_stream(io_in, io_out)
  end
end

Tune the chunk size for small payloads

Philiprehberger::GzipKit.compress_stream(io_in, io_out, chunk_size: 4 * 1024)

Parameters:

  • io_in (IO)

    readable input stream

  • io_out (IO)

    writable output stream

  • level (Integer) (defaults to: Zlib::DEFAULT_COMPRESSION)

    compression level (Zlib::DEFAULT_COMPRESSION by default)

  • chunk_size (Integer) (defaults to: CHUNK_SIZE)

    bytes per read chunk (defaults to 64 KB)

Raises:

  • (ArgumentError)

    if chunk_size is not a positive Integer



# File 'lib/philiprehberger/gzip_kit.rb', line 263

def self.compress_stream(io_in, io_out, level: Zlib::DEFAULT_COMPRESSION, chunk_size: CHUNK_SIZE)
  validate_chunk_size!(chunk_size)

  gz = Zlib::GzipWriter.new(io_out, level)
  while (chunk = io_in.read(chunk_size))
    gz.write(chunk)
  end
  gz.finish
end
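
The read loop above also works with in-memory IO objects. The sketch below uses StringIO and a hypothetical helper named gzip_stream that follows the same loop; it is an illustration under those assumptions, not the gem's code. It shows that IO#read(n) returning nil ends the loop and that GzipWriter#finish leaves the output IO open.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical stand-in following the loop in the source above.
def gzip_stream(io_in, io_out, chunk_size: 64 * 1024)
  gz = Zlib::GzipWriter.new(io_out)
  # IO#read(n) returns nil at end of input, which ends the loop.
  while (chunk = io_in.read(chunk_size))
    gz.write(chunk)
  end
  # finish writes the gzip trailer but does not close io_out.
  gz.finish
end

payload = 'streaming payload ' * 1_000
output = StringIO.new(String.new(encoding: Encoding::BINARY))
gzip_stream(StringIO.new(payload), output, chunk_size: 4 * 1024)

puts output.closed?                          # false: still usable by the caller
puts Zlib.gunzip(output.string) == payload   # true
```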

.compressed?(data) ⇒ Boolean

Check if data is gzip-compressed by inspecting magic bytes.

Parameters:

  • data (String)

    data to check

Returns:

  • (Boolean)

    true if data starts with gzip magic bytes



# File 'lib/philiprehberger/gzip_kit.rb', line 115

def self.compressed?(data)
  return false if data.nil? || data.bytesize < 2

  bytes = data.bytes
  bytes[0] == GZIP_MAGIC[0] && bytes[1] == GZIP_MAGIC[1]
end
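
A magic-byte check is cheap but only inspects the first two bytes. The sketch below is a hypothetical variant (gzip_magic? is not part of the gem) that uses String#getbyte instead of building the full byte array, and shows that a non-gzip string can still pass the check.

```ruby
require 'zlib'

# Hypothetical variant of the check that reads only the first two bytes.
# 0x1f and 0x8b are the same values as GZIP_MAGIC above.
def gzip_magic?(data)
  return false if data.nil? || data.bytesize < 2

  data.getbyte(0) == 0x1f && data.getbyte(1) == 0x8b
end

puts gzip_magic?(Zlib.gzip('hello')) # true
puts gzip_magic?('hello')            # false
puts gzip_magic?("\x1F\x8Bjunk".b)   # true, although this is not a valid gzip stream
```

Because of the last case, a true result does not guarantee that decompression will succeed.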

.concat(data_a, data_b) ⇒ String

Concatenate two gzip-compressed strings.

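Per the gzip specification (RFC 1952), a gzip file is a series of members, so the byte-wise concatenation of two gzip streams is itself valid gzip and decompresses to the concatenation of the two payloads.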

Parameters:

  • data_a (String)

    first gzip-compressed string

  • data_b (String)

    second gzip-compressed string

Returns:

  • (String)

    concatenated gzip data

Raises:

  • (Error)

    if either input does not start with the gzip magic bytes



# File 'lib/philiprehberger/gzip_kit.rb', line 194

def self.concat(data_a, data_b)
  raise Error, 'first argument is not valid gzip data' unless compressed?(data_a)
  raise Error, 'second argument is not valid gzip data' unless compressed?(data_b)

  result = String.new(data_a, encoding: Encoding::BINARY)
  result << data_b.b
  result
end
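
The effect of concatenation can be checked with the standard library alone. In this sketch, Zlib.gzip stands in for GzipKit.compress, and the hypothetical helper gunzip_all_members reads every member with the same GzipReader#unused loop that decompress uses in this module.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: decompress every gzip member in data.
def gunzip_all_members(data)
  io = StringIO.new(data)
  result = String.new(encoding: Encoding::BINARY)
  until io.eof?
    gz = Zlib::GzipReader.new(io)
    result << gz.read
    unused = gz.unused # bytes buffered past the end of this member, or nil
    gz.finish
    io.pos -= unused.bytesize if unused
  end
  result
end

first  = Zlib.gzip('hello, ')
second = Zlib.gzip('world!')
joined = first.b + second.b # plain byte concatenation, as concat does

puts gunzip_all_members(joined) # prints "hello, world!"
```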

.decompress(data, stats: false) ⇒ String, Hash

Decompress gzip bytes to a string.

Examples:

Decompress gzip bytes

compressed = Philiprehberger::GzipKit.compress('hello')
Philiprehberger::GzipKit.decompress(compressed) # => "hello"

Decompress with stats

compressed = Philiprehberger::GzipKit.compress('a' * 10_000)
Philiprehberger::GzipKit.decompress(compressed, stats: true)
# => { data: "aaaa...", ratio: 0.0041 }

Parameters:

  • data (String)

    gzip-compressed bytes

  • stats (Boolean) (defaults to: false)

    when true, return a hash with the keys data and ratio; ratio is compressed size divided by decompressed size

Returns:

  • (String, Hash)

    decompressed string, or a stats hash when stats: true

Raises:

  • (Zlib::GzipFile::Error)

    if the data is not valid gzip



# File 'lib/philiprehberger/gzip_kit.rb', line 82

def self.decompress(data, stats: false)
  io_in = StringIO.new(data)
  io_in.binmode
  result = String.new(encoding: Encoding::BINARY)

  # Handle concatenated gzip streams per gzip spec
  until io_in.eof?
    gz = Zlib::GzipReader.new(io_in)
    result << gz.read
    # GzipReader may buffer bytes past the end of this member; rewind io_in by that amount
    unused = gz.unused
    gz.finish
    if unused
      io_in.pos -= unused.bytesize
    end
  end

  decompressed = result.force_encoding(Encoding::UTF_8)

  if stats
    decompressed_size = decompressed.bytesize
    compressed_size = data.bytesize
    ratio = decompressed_size.zero? ? 0.0 : compressed_size.to_f / decompressed_size
    { data: decompressed, ratio: ratio }
  else
    decompressed
  end
end
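
Callers that accept untrusted input will usually want to rescue the error listed under Raises. The sketch below shows one way to do that with the standard library's Zlib::GzipReader; the wrapper name gunzip_or_nil is hypothetical.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical wrapper: return the decompressed payload, or nil when the
# input cannot be read as gzip.
def gunzip_or_nil(data)
  gz = Zlib::GzipReader.new(StringIO.new(data))
  gz.read
rescue Zlib::GzipFile::Error
  nil
ensure
  gz&.close
end

p gunzip_or_nil(Zlib.gzip('hello'))      # => "hello"
p gunzip_or_nil('definitely not gzip')   # => nil
```

Zlib::GzipFile::Error is a subclass of Zlib::Error, so rescuing Zlib::Error instead would also cover errors raised by the underlying inflate step, should those occur for corrupted data.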

.decompress_file(src, dest, chunk_size: CHUNK_SIZE) {|bytes_processed, total_bytes| ... } ⇒ void

This method returns an undefined value.

Decompress a gzip file to a regular file.

Parameters:

  • src (String)

    path to the gzip source file

  • dest (String)

    path to the destination file

  • chunk_size (Integer) (defaults to: CHUNK_SIZE)

    bytes per read chunk (defaults to 64 KB)

Yields:

  • (bytes_processed, total_bytes)

    progress callback, called after each decompressed chunk is written to dest

Yield Parameters:

  • bytes_processed (Integer)

    bytes decompressed so far

  • total_bytes (nil)

    always nil (total unknown until decompression completes)

Raises:

  • (ArgumentError)

    if chunk_size is not a positive Integer



# File 'lib/philiprehberger/gzip_kit.rb', line 165

def self.decompress_file(src, dest, chunk_size: CHUNK_SIZE, &block)
  validate_chunk_size!(chunk_size)

  File.open(src, 'rb') do |io_in|
    File.open(dest, 'wb') do |io_out|
      if block
        gz = Zlib::GzipReader.new(io_in)
        bytes_processed = 0
        while (chunk = gz.read(chunk_size))
          io_out.write(chunk)
          bytes_processed += chunk.bytesize
          block.call(bytes_processed, nil)
        end
        gz.close
      else
        decompress_stream(io_in, io_out, chunk_size: chunk_size)
      end
    end
  end
end
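
Since total_bytes is an Integer for compress_file but always nil for decompress_file, a shared progress block has to handle both. A minimal sketch of such a formatter follows; format_progress is a hypothetical helper, not part of the gem.

```ruby
# Hypothetical progress formatter usable from the block of either file method:
# total_bytes is an Integer for compress_file and nil for decompress_file.
def format_progress(bytes_processed, total_bytes)
  if total_bytes.nil? || total_bytes.zero?
    "#{bytes_processed} bytes"
  else
    format('%d/%d bytes (%.1f%%)', bytes_processed, total_bytes, 100.0 * bytes_processed / total_bytes)
  end
end

puts format_progress(65_536, nil)     # prints "65536 bytes"
puts format_progress(65_536, 262_144) # prints "65536/262144 bytes (25.0%)"
```

It could then be passed to either method as { |done, total| warn format_progress(done, total) }.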

.decompress_stream(io_in, io_out, chunk_size: CHUNK_SIZE) ⇒ void

This method returns an undefined value.

Streaming decompression from one IO to another.

Examples:

Decompress from one IO to another

File.open('output.gz', 'rb') do |io_in|
  File.open('restored.txt', 'wb') do |io_out|
    Philiprehberger::GzipKit.decompress_stream(io_in, io_out)
  end
end

Parameters:

  • io_in (IO)

    readable input stream containing gzip data

  • io_out (IO)

    writable output stream

  • chunk_size (Integer) (defaults to: CHUNK_SIZE)

    bytes per read chunk (defaults to 64 KB)

Raises:

  • (ArgumentError)

    if chunk_size is not a positive Integer



# File 'lib/philiprehberger/gzip_kit.rb', line 287

def self.decompress_stream(io_in, io_out, chunk_size: CHUNK_SIZE)
  validate_chunk_size!(chunk_size)

  gz = Zlib::GzipReader.new(io_in)
  while (chunk = gz.read(chunk_size))
    io_out.write(chunk)
  end
ensure
  gz&.close
end
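
One consequence of the ensure clause above is worth knowing: in Ruby's Zlib, GzipFile#close also closes the associated IO object, so io_in is closed once the call returns. The stdlib-only sketch below demonstrates this with StringIO; gunzip_stream is a hypothetical stand-in that follows the source shown.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical stand-in following the source above, including the ensure.
def gunzip_stream(io_in, io_out, chunk_size: 64 * 1024)
  gz = Zlib::GzipReader.new(io_in)
  # GzipReader#read(n) returns nil once the member is exhausted.
  while (chunk = gz.read(chunk_size))
    io_out.write(chunk)
  end
ensure
  gz&.close # Zlib::GzipFile#close also closes the wrapped IO
end

io_in = StringIO.new(Zlib.gzip('restored text'))
io_out = StringIO.new(String.new(encoding: Encoding::BINARY))
gunzip_stream(io_in, io_out)

puts io_out.string   # prints "restored text"
puts io_in.closed?   # true: the reader closed its input
puts io_out.closed?  # false
```

Judging from the source shown, this loop reads a single gzip member; the multi-member handling appears only in decompress.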

.equivalent?(blob_a, blob_b) ⇒ Boolean

Check whether two gzip-compressed blobs decompress to equal byte strings.

Useful for comparing gzip outputs produced at different compression levels or with different metadata; only the decompressed payloads are compared.

Parameters:

  • blob_a (String)

    first gzip-compressed string

  • blob_b (String)

    second gzip-compressed string

Returns:

  • (Boolean)

    true iff both blobs decompress to equal byte strings

Raises:

  • (Error)

    if either input is not valid gzip



# File 'lib/philiprehberger/gzip_kit.rb', line 212

def self.equivalent?(blob_a, blob_b)
  raise Error, 'first argument is not valid gzip data' unless compressed?(blob_a)
  raise Error, 'second argument is not valid gzip data' unless compressed?(blob_b)

  decompress(blob_a).b == decompress(blob_b).b
rescue Zlib::GzipFile::Error => e
  raise Error, "failed to decompress gzip data: #{e.message}"
end
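
The idea behind this comparison can be sketched with the standard library: the same text compressed at two levels yields different gzip bytes but the same payload. Here Zlib.gzip and Zlib.gunzip stand in for the gem's compress and decompress, and same_payload? is a hypothetical helper.

```ruby
require 'zlib'

# Hypothetical helper: compare two gzip blobs by decompressed bytes only.
def same_payload?(blob_a, blob_b)
  Zlib.gunzip(blob_a).b == Zlib.gunzip(blob_b).b
end

text  = 'the quick brown fox ' * 500
fast  = Zlib.gzip(text, level: Zlib::BEST_SPEED)
small = Zlib.gzip(text, level: Zlib::BEST_COMPRESSION)

puts fast.bytesize, small.bytesize # sizes typically differ between levels
puts same_payload?(fast, small)    # true: both decompress to the same text
```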

.inspect_header(data) ⇒ Hash?

Inspect the gzip header without decompressing.

Parameters:

  • data (String)

    gzip-compressed data

Returns:

  • (Hash, nil)

    header info with the keys method, mtime, os, original_name and comment, or nil if the data is not valid gzip



# File 'lib/philiprehberger/gzip_kit.rb', line 225

def self.inspect_header(data)
  return nil unless compressed?(data)

  io = StringIO.new(data)
  io.binmode
  gz = Zlib::GzipReader.new(io)

  {
    method: :deflate,
    mtime: gz.mtime,
    os: gz.os_code,
    original_name: gz.orig_name && gz.orig_name.empty? ? nil : gz.orig_name,
    comment: gz.comment && gz.comment.empty? ? nil : gz.comment
  }
rescue Zlib::GzipFile::Error
  nil
ensure
  gz&.close
end
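
To see where these header fields come from, the sketch below writes a gzip blob with explicit metadata and reads it back using only the standard library. The helper gzip_with_metadata and the sample values are hypothetical; the reader accessors are the same ones used in the source above.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: build a gzip blob with explicit header metadata.
def gzip_with_metadata(payload, name:, comment:, mtime:)
  io = StringIO.new(String.new(encoding: Encoding::BINARY))
  gz = Zlib::GzipWriter.new(io)
  # Header fields must be set before the first write.
  gz.orig_name = name
  gz.comment = comment
  gz.mtime = mtime
  gz.write(payload)
  gz.finish
  io.string
end

blob = gzip_with_metadata('hello', name: 'notes.txt', comment: 'demo', mtime: 1_700_000_000)

# Constructing the reader parses the header; the payload is not decompressed yet.
gz = Zlib::GzipReader.new(StringIO.new(blob))
puts gz.orig_name  # prints "notes.txt"
puts gz.comment    # prints "demo"
puts gz.mtime.to_i # prints 1700000000
gz.close
```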