Class: RZstd::Dictionary

Inherits:
Object
Defined in:
lib/rzstd.rb

Constant Summary

ZDICT_MAGIC =
"\x37\xA4\x30\xEC".b.freeze
USER_DICT_ID_MIN =

Public Dict_ID range per the Zstandard spec. IDs `0..32_767` are reserved for a future registrar, and IDs `>= 2**31` are reserved as well. Only `32_768..(2**31 - 1)` is available for private/auto-generated dicts.

32_768
USER_DICT_ID_MAX =
(2**31) - 1
USER_DICT_ID_SIZE =
USER_DICT_ID_MAX - USER_DICT_ID_MIN + 1
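
As a quick check of the arithmetic, the private range holds `2**31 - 32_768` ids. The sketch below redefines the three constants locally so it runs without the gem; `user_dict_id?` is a hypothetical helper for illustration and is not part of RZstd.

```ruby
# Local copies of the documented constants, so the snippet runs without rzstd.
USER_DICT_ID_MIN  = 32_768
USER_DICT_ID_MAX  = (2**31) - 1
USER_DICT_ID_SIZE = USER_DICT_ID_MAX - USER_DICT_ID_MIN + 1

# Hypothetical helper (not part of RZstd): is +id+ inside the private range?
def user_dict_id?(id)
  id.between?(USER_DICT_ID_MIN, USER_DICT_ID_MAX)
end

puts USER_DICT_ID_SIZE          # 2147450880
puts user_dict_id?(32_767)      # false, reserved low range
puts user_dict_id?(32_768)      # true
puts user_dict_id?(2**31)       # false, reserved high range
```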

Class Method Details

.new(bytes, level: DEFAULT_LEVEL) ⇒ Object

Public constructor. Resolves the Zstd `Dict_ID`:

  • If `bytes` begins with the ZDICT magic (`0xEC30A437`, stored little-endian as the bytes `37 A4 30 EC`), the id is read as a little-endian 32-bit integer from bytes `[4..7]` of the dictionary header. This is the same id zstd writes into every compressed frame header via `ZSTD_c_dictIDFlag` (enabled by default), so on-wire frames and `Dictionary#id` agree.

  • Otherwise the dict is raw content: zstd writes a frame `dictID` of 0, and this wrapper falls back to the first 4 bytes of `sha256(bytes)`, read as a little-endian integer and mapped by modulo into the public range `32_768..(2**31 - 1)`. This id is purely an out-of-band identifier for the Ruby side. Decoding a raw-dict frame with the wrong dictionary is caught by the content checksum the encoder enables.
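
The header layout described in the first bullet can be illustrated with a hand-built 8-byte prefix. This is only a sketch: the id `123_456` is an arbitrary illustrative value, and a real dictionary continues with entropy tables and content after these 8 bytes.

```ruby
# Local copy of the documented constant, so the snippet runs without rzstd.
ZDICT_MAGIC = "\x37\xA4\x30\xEC".b.freeze

# Stand-in header: 4 magic bytes followed by a little-endian 32-bit id.
example_id = 123_456 # arbitrary illustrative value
header = ZDICT_MAGIC + [example_id].pack("V")

# "V" decodes an unsigned 32-bit little-endian integer.
magic_as_int = header.byteslice(0, 4).unpack1("V")
dict_id      = header.byteslice(4, 4).unpack1("V")

puts format("0x%08X", magic_as_int) # 0xEC30A437
puts dict_id                        # 123456
```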



# File 'lib/rzstd.rb', line 54

def self.new(bytes, level: DEFAULT_LEVEL)
  id = if bytes.byteslice(0, 4) == ZDICT_MAGIC
    bytes.byteslice(4, 4).unpack1("V")
  else
    raw = Digest::SHA256.digest(bytes).byteslice(0, 4).unpack1("V")
    USER_DICT_ID_MIN + (raw % USER_DICT_ID_SIZE)
  end

  _native_new(bytes, id, Integer(level))
end
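
To see both branches without loading the native extension, the id resolution can be mirrored in plain Ruby. This is a sketch only: `resolve_dict_id` is a hypothetical helper that copies the branching from the listing above and omits the `_native_new` call, and the sample inputs are made up.

```ruby
require "digest"

# Local copies of the documented constants, so the snippet runs without rzstd.
ZDICT_MAGIC       = "\x37\xA4\x30\xEC".b.freeze
USER_DICT_ID_MIN  = 32_768
USER_DICT_ID_MAX  = (2**31) - 1
USER_DICT_ID_SIZE = USER_DICT_ID_MAX - USER_DICT_ID_MIN + 1

# Hypothetical helper (not part of RZstd): same branching as the listing,
# minus the native constructor call.
def resolve_dict_id(bytes)
  if bytes.byteslice(0, 4) == ZDICT_MAGIC
    # Formatted dictionary: take the id straight from the header.
    bytes.byteslice(4, 4).unpack1("V")
  else
    # Raw content: first 4 bytes of SHA-256, folded into the private range.
    raw = Digest::SHA256.digest(bytes).byteslice(0, 4).unpack1("V")
    USER_DICT_ID_MIN + (raw % USER_DICT_ID_SIZE)
  end
end

formatted = ZDICT_MAGIC + [40_000].pack("V") + "payload".b
raw_dict  = "common prefix shared by many messages".b

puts resolve_dict_id(formatted)                             # 40000
fallback = resolve_dict_id(raw_dict)
puts fallback.between?(USER_DICT_ID_MIN, USER_DICT_ID_MAX)  # true
puts fallback == resolve_dict_id(raw_dict.dup)              # true
```

Two things follow from the listing as shown: a header id is returned as-is with no range check, and the magic comparison uses `String#==`, which treats strings holding non-ASCII bytes under different encodings as unequal. Passing binary strings (for example from `File.binread`) is therefore probably the safest input.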

.train(samples, capacity: 64 * 1024) ⇒ String

Trains a raw-content dictionary from a corpus of sample frames. Wraps `ZDICT_trainFromBuffer`. Returns the trained dictionary as a binary String, ready to feed back into `Dictionary.new`.

ZDICT recommends roughly 100 KiB of total sample data and at least 10 samples; under-provisioned inputs raise an error.

Parameters:

  • samples (Array<String>)

    sample frames (any encoding)

  • capacity (Integer) (defaults to: 64 * 1024)

    upper bound on the produced dict size

Returns:

  • (String)

    trained dictionary bytes (binary)



# File 'lib/rzstd.rb', line 76

def self.train(samples, capacity: 64 * 1024)
  sizes = samples.map { |s| s.bytesize }
  buffer = String.new(capacity: sizes.sum, encoding: Encoding::BINARY)
  samples.each { |s| buffer << s.b }
  _native_train(buffer, sizes, Integer(capacity))
end
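
The Ruby-side preparation in `.train` (one contiguous binary buffer plus a parallel list of byte sizes) can be tried in isolation. The sketch below uses a hypothetical helper, `flatten_samples`, that stops before the native call. The slicing at the end only illustrates how the sizes delimit the buffer; it is not a claim about what the native code does.

```ruby
# Hypothetical helper (not part of RZstd): the buffer/sizes preparation
# from .train, without the _native_train call.
def flatten_samples(samples)
  sizes  = samples.map(&:bytesize)
  buffer = String.new(capacity: sizes.sum, encoding: Encoding::BINARY)
  samples.each { |s| buffer << s.b } # .b avoids encoding clashes on append
  [buffer, sizes]
end

# Mixed encodings: two UTF-8 strings and one binary string.
samples = ["h\u00E9llo", "abc".b, "\u65E5\u672C"]
buffer, sizes = flatten_samples(samples)

p sizes           # [6, 3, 6]
p buffer.bytesize # 15

# The sizes are enough to cut the buffer back into the original samples.
offset = 0
recovered = sizes.map { |n| buffer.byteslice(offset, n).tap { offset += n } }
p recovered == samples.map(&:b) # true
```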

Instance Method Details

#decompress(bytes, max_output_size: nil) ⇒ Object



# File 'lib/rzstd.rb', line 84

def decompress(bytes, max_output_size: nil)
  _native_decompress(bytes, Integer(max_output_size || 0))
end
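
The only Ruby-side work here is coercing `max_output_size` before the native call. The sketch below isolates that coercion in a hypothetical helper, `normalize_limit`. How the native side interprets `0` is not shown on this page, so treating it as "no explicit limit" is an assumption.

```ruby
# Hypothetical helper (not part of RZstd): the argument coercion
# performed by #decompress before calling into native code.
def normalize_limit(max_output_size)
  Integer(max_output_size || 0)
end

p normalize_limit(nil)     # 0 (assumed to mean "no explicit limit")
p normalize_limit(1 << 20) # 1048576
p normalize_limit("4096")  # 4096

begin
  normalize_limit("4 KiB") # Kernel#Integer rejects non-numeric strings
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```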