Class: RZstd::Dictionary

Inherits:

Object

Object
RZstd::Dictionary

Defined in: lib/rzstd.rb

Constant Summary

ZDICT_MAGIC =

"\x37\xA4\x30\xEC".b.freeze

USER_DICT_ID_MIN =

Public Dict_ID range per the Zstandard spec. IDs `0..32_767` are reserved for a future registrar, and IDs `>= 2**31` are also reserved. Only `32_768..(2**31 - 1)` is available for private/auto-generated dicts.

32_768

USER_DICT_ID_MAX =

(2**31) - 1

USER_DICT_ID_SIZE =

USER_DICT_ID_MAX - USER_DICT_ID_MIN + 1
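
As a quick sanity check of the arithmetic above, the following sketch recomputes the size of the private range. The constants are re-declared locally so the snippet runs without the gem, and `user_dict_id?` is a hypothetical helper introduced only for illustration, not part of RZstd.

```ruby
# Local copies of the documented constants, so this runs standalone.
USER_DICT_ID_MIN  = 32_768
USER_DICT_ID_MAX  = (2**31) - 1
USER_DICT_ID_SIZE = USER_DICT_ID_MAX - USER_DICT_ID_MIN + 1

# Hypothetical helper (not part of RZstd): is this id inside the private range?
def user_dict_id?(id)
  id.between?(USER_DICT_ID_MIN, USER_DICT_ID_MAX)
end

puts USER_DICT_ID_SIZE        # 2147450880, i.e. 2**31 - 2**15
puts user_dict_id?(32_767)    # false (reserved low range)
puts user_dict_id?(32_768)    # true
puts user_dict_id?(2**31)     # false (reserved high range)
```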

Class Method Summary

.new(bytes, level: DEFAULT_LEVEL) ⇒ Object

Public constructor.

.train(samples, capacity: 64 * 1024) ⇒ String

Trains a raw-content dictionary from a corpus of sample frames.

Instance Method Summary

#decompress(bytes, max_output_size: nil) ⇒ Object

Class Method Details

.new(bytes, level: DEFAULT_LEVEL) ⇒ `Object`

Public constructor. Resolves the Zstd `Dict_ID`:

If `bytes` begins with the ZDICT magic (`0xEC30A437` LE), the id is read from bytes `[4..7]` of the dictionary header. This is the same id zstd writes into every compressed frame header via `ZSTD_c_dictIDFlag` (enabled by default), so on-wire frames and `Dictionary#id` agree.
Otherwise the dict is raw content: zstd writes a frame `dictID` of 0, and this wrapper falls back to the first four bytes of `sha256(bytes)`, read as a little-endian u32 and reduced modulo the range size into the public range `32_768..(2**31 - 1)`, purely as an out-of-band identifier for the Ruby side. Wrong-dict decoding of raw dicts is caught by the content checksum the encoder enables.

# File 'lib/rzstd.rb', line 54

def self.new(bytes, level: DEFAULT_LEVEL)
  id = if bytes.byteslice(0, 4) == ZDICT_MAGIC
    bytes.byteslice(4, 4).unpack1("V")
  else
    raw = Digest::SHA256.digest(bytes).byteslice(0, 4).unpack1("V")
    USER_DICT_ID_MIN + (raw % USER_DICT_ID_SIZE)
  end

  _native_new(bytes, id, Integer(level))
end
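
The two branches can be exercised without the native extension. The sketch below copies the id-resolution logic into a hypothetical standalone helper, `resolve_dict_id`, and re-declares the constants locally; neither the helper nor the sample inputs come from RZstd. One deliberate difference from the listing: the helper calls `.b` on its input first, an assumption made here so that the magic comparison is done on binary strings regardless of how the caller's string is tagged.

```ruby
require "digest"

# Local copies of the documented constants, so this runs standalone.
ZDICT_MAGIC       = "\x37\xA4\x30\xEC".b.freeze
USER_DICT_ID_MIN  = 32_768
USER_DICT_ID_MAX  = (2**31) - 1
USER_DICT_ID_SIZE = USER_DICT_ID_MAX - USER_DICT_ID_MIN + 1

# Hypothetical helper mirroring the branch in Dictionary.new.
def resolve_dict_id(bytes)
  bytes = bytes.b # binary copy; an addition in this sketch, not in the listing
  if bytes.byteslice(0, 4) == ZDICT_MAGIC
    # Header present: Dict_ID is the little-endian u32 at offset 4.
    bytes.byteslice(4, 4).unpack1("V")
  else
    # Raw content: derive a stable id from the first 4 bytes of SHA-256.
    raw = Digest::SHA256.digest(bytes).byteslice(0, 4).unpack1("V")
    USER_DICT_ID_MIN + (raw % USER_DICT_ID_SIZE)
  end
end

# A made-up header: magic followed by Dict_ID 123_456 as little-endian u32.
with_header = ZDICT_MAGIC + [123_456].pack("V") + "payload".b
raw_content = "just some raw dictionary content".b

p ZDICT_MAGIC.unpack1("V") == 0xEC30A437  # true
p resolve_dict_id(with_header)            # 123456
p resolve_dict_id(raw_content).between?(USER_DICT_ID_MIN, USER_DICT_ID_MAX)  # true
```

Because the fallback id is a pure function of the dictionary bytes, two processes that load the same raw dictionary arrive at the same id without coordinating.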

.train(samples, capacity: 64 * 1024) ⇒ `String`

Trains a raw-content dictionary from a corpus of sample frames. Wraps `ZDICT_trainFromBuffer`. Returns the trained dictionary as a binary String, ready to feed back into `Dictionary.new`.

ZDICT recommends roughly 100 KiB of samples in total and at least 10 samples; under-provisioned inputs raise an error.

Parameters:

samples (Array<String>) —

sample frames (any encoding)
capacity (Integer) (defaults to: 64 * 1024) —

upper bound on the produced dict size

Returns:

(String) —

trained dictionary bytes (binary)

# File 'lib/rzstd.rb', line 76

def self.train(samples, capacity: 64 * 1024)
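  # Byte length of each sample, in order; passed alongside the flat buffer.
  sizes = samples.map(&:bytesize)
  # Concatenate all samples into one preallocated binary buffer.
  buffer = String.new(capacity: sizes.sum, encoding: Encoding::BINARY)
  samples.each { |s| buffer << s.b }
  _native_train(buffer, sizes, Integer(capacity))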
end
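
The buffer-plus-sizes layout that `.train` prepares can be illustrated on its own. The sketch below uses a hypothetical helper, `pack_samples`, that repeats the Ruby-side steps and stops before the native call; the sample strings are invented for the example. It also shows that the original samples can be recovered from the flat buffer using the size list alone.

```ruby
# Hypothetical helper (not part of RZstd): builds the flat buffer and size list.
def pack_samples(samples)
  sizes  = samples.map(&:bytesize)
  buffer = String.new(capacity: sizes.sum, encoding: Encoding::BINARY)
  # `.b` yields a binary copy, so mixed-encoding samples append cleanly.
  samples.each { |s| buffer << s.b }
  [buffer, sizes]
end

# One UTF-8 sample with a two-byte character, one binary sample, one empty sample.
samples = ["h\u00E9llo", "abc".b, ""]
buffer, sizes = pack_samples(samples)

p sizes            # [6, 3, 0]
p buffer.bytesize  # 9

# Slice the samples back out of the buffer using only the sizes.
offset = 0
recovered = sizes.map do |n|
  chunk = buffer.byteslice(offset, n)
  offset += n
  chunk
end
p recovered == samples.map(&:b)  # true
```

Sizes are counted in bytes rather than characters, which is why the first sample contributes 6 and not 5.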

Instance Method Details

#decompress(bytes, max_output_size: nil) ⇒ `Object`



# File 'lib/rzstd.rb', line 84

def decompress(bytes, max_output_size: nil)
  _native_decompress(bytes, Integer(max_output_size || 0))
end
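
The only Ruby-side work in `#decompress` is coercing `max_output_size` before the native call. The sketch below isolates that coercion in a hypothetical helper, `normalize_max_output_size`, to show what values reach the native layer. How the native layer interprets `0` is not shown in this section, so the example makes no claim about it.

```ruby
# Hypothetical helper mirroring the coercion in #decompress.
def normalize_max_output_size(max_output_size)
  # nil falls back to 0; anything else must convert cleanly to an Integer.
  Integer(max_output_size || 0)
end

p normalize_max_output_size(nil)     # 0
p normalize_max_output_size(1024)    # 1024
p normalize_max_output_size("4096")  # 4096

begin
  normalize_max_output_size("lots")
rescue ArgumentError => e
  # Kernel#Integer is strict: non-numeric strings raise instead of becoming 0.
  puts "rejected: #{e.message}"
end
```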