Class: RZstd::Dictionary

Inherits:
Data
  • Object
Defined in:
lib/rzstd/dictionary.rb

Overview

Pure value type for a Zstd dictionary: raw bytes plus a 4-byte id. Built on `Data.define`, so it’s immutable, gets `==` / `#hash` / `#deconstruct` for free, and is shareable across Ractors.
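The value semantics described above can be sketched with a stand-in class. `DictValue` is a hypothetical name used only for illustration; it is not the real `RZstd::Dictionary` and has no custom constructor. The sketch assumes Ruby 3.2 or later, where `Data` is available.

```ruby
# Stand-in value type for illustration only (hypothetical name).
DictValue = Data.define(:bytes, :id)

a = DictValue.new(bytes: "abc".b.freeze, id: 40_000)
b = DictValue.new(bytes: "abc".b.freeze, id: 40_000)

same_value = (a == b)               # structural equality, not identity
same_hash  = (a.hash == b.hash)     # equal values hash alike
frozen     = a.frozen?              # Data instances are frozen on creation
no_setter  = !a.respond_to?(:id=)   # no attribute writers are generated
shareable  = Ractor.shareable?(a)   # frozen, with only frozen members
```

Because the members themselves are frozen (a frozen binary String and an Integer), the whole value is deeply immutable, which is what makes it safe to pass between Ractors without copying.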

The id defaults to:

  • the `Dict_ID` field from the ZDICT header if the bytes begin with the ZDICT magic (`37 A4 30 EC`), which matches the id that zstd writes into every compressed frame header (controlled by `ZSTD_c_dictIDFlag`);

  • `sha256(bytes)[0, 4]` interpreted as a little-endian integer and mapped into the public `32_768..(2**31 - 1)` range, for raw-content dictionaries (which carry a frame `Dict_ID` of 0 and therefore can’t use id-based mismatch detection).

Callers can override via `id:` (e.g. a value coordinated out of band).
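The two default-id rules can be tried in isolation with a minimal sketch. The module `DictIdSketch` and its method `default_id` are hypothetical names introduced here; the logic mirrors the rules listed above rather than calling into the library.

```ruby
require "digest"

# Hypothetical helper module, for illustration only.
module DictIdSketch
  ZDICT_MAGIC       = "\x37\xA4\x30\xEC".b.freeze
  USER_DICT_ID_MIN  = 32_768
  USER_DICT_ID_MAX  = (2**31) - 1
  USER_DICT_ID_SIZE = USER_DICT_ID_MAX - USER_DICT_ID_MIN + 1

  def self.default_id(bytes)
    b = bytes.b
    if b.byteslice(0, 4) == ZDICT_MAGIC
      # ZDICT-format: read the little-endian u32 that follows the magic.
      b.byteslice(4, 4).unpack1("V")
    else
      # Raw content: fold the first 4 SHA-256 bytes into the public range.
      raw = Digest::SHA256.digest(b).byteslice(0, 4).unpack1("V")
      USER_DICT_ID_MIN + (raw % USER_DICT_ID_SIZE)
    end
  end
end

# A fake ZDICT-style header: magic, then an id of 0x12345678, then filler.
zdict_like = DictIdSketch::ZDICT_MAGIC + [0x12345678].pack("V") + "payload".b
zdict_id   = DictIdSketch.default_id(zdict_like)

# Raw content gets a deterministic id inside the public range.
raw_id     = DictIdSketch.default_id("just some raw content")
```

Since the raw-content id depends only on the bytes, two processes that load the same raw dictionary arrive at the same id without any coordination.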

Trained dictionaries are produced by `Dictionary.train(samples, capacity:)` and are ZDICT-format.

Constant Summary

ZDICT_MAGIC =
"\x37\xA4\x30\xEC".b.freeze
USER_DICT_ID_MIN =
32_768
USER_DICT_ID_MAX =
(2**31) - 1
USER_DICT_ID_SIZE =
USER_DICT_ID_MAX - USER_DICT_ID_MIN + 1

Constructor Details

#initialize(bytes:, id: nil) ⇒ Dictionary

Returns a new instance of Dictionary.



# File 'lib/rzstd/dictionary.rb', line 32

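def initialize(bytes:, id: nil)
  b = bytes.b  # binary (ASCII-8BIT) copy, so slicing is byte-exact
  id ||= if b.byteslice(0, 4) == ZDICT_MAGIC
           # ZDICT-format: Dict_ID is the little-endian u32 after the magic
           b.byteslice(4, 4).unpack1("V")
         else
           # Raw content: derive a stable id from the first 4 SHA-256 bytes
           raw = Digest::SHA256.digest(b).byteslice(0, 4).unpack1("V")
           USER_DICT_ID_MIN + (raw % USER_DICT_ID_SIZE)
         end
  super(bytes: b.freeze, id: id)
end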

Instance Attribute Details

#bytes ⇒ Object (readonly)

Returns the raw dictionary bytes, stored as a frozen binary (ASCII-8BIT) String.

Returns:

  • (Object)

    the current value of bytes



# File 'lib/rzstd/dictionary.rb', line 23

def bytes
  @bytes
end

#id ⇒ Object (readonly)

Returns the dictionary id: read from the ZDICT header, derived from a SHA-256 of the bytes, or supplied via `id:`.

Returns:

  • (Object)

    the current value of id



# File 'lib/rzstd/dictionary.rb', line 23

def id
  @id
end

Class Method Details

.train(samples, capacity: 64 * 1024) ⇒ Dictionary

Trains a dictionary from a corpus of sample frames. Wraps `ZDICT_trainFromBuffer`. Returns a fresh Dictionary value (ZDICT-format, with its own dict_id in the header).

ZDICT recommends roughly 100 KiB total samples and at least 10 samples; under-provisioned inputs raise.

Parameters:

  • samples (Array<String>)

    sample frames (any encoding)

  • capacity (Integer) (defaults to: 64 * 1024)

    upper bound on the produced dict size

Returns:

  • (Dictionary)

    trained dictionary, ZDICT-format



# File 'lib/rzstd/dictionary.rb', line 59

def self.train(samples, capacity: 64 * 1024)
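  sizes  = samples.map(&:bytesize)  # per-sample byte lengths, in order
  # Concatenate all samples into one contiguous binary buffer.
  buffer = String.new(capacity: sizes.sum, encoding: Encoding::BINARY)
  samples.each { |s| buffer << s.b }
  # Native call (wraps ZDICT_trainFromBuffer); raises on insufficient input.
  bytes = RZstd._native_train(buffer, sizes, Integer(capacity))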
  new(bytes: bytes)
end
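The buffer layout handed to the native trainer can be illustrated without the extension. The sketch below reproduces only the Ruby-side packing step from the method above and then shows that the `sizes` array is enough to split the flat buffer back into samples. The sample strings and the `recovered` variable are made up for illustration, and no training is performed.

```ruby
# Samples may arrive in any encoding; sizes are measured in bytes.
samples = ["h\u00E9llo", "world".b, "zstd"]

sizes  = samples.map(&:bytesize)
buffer = String.new(capacity: sizes.sum, encoding: Encoding::BINARY)
samples.each { |s| buffer << s.b }

# Walk the flat buffer using the recorded sizes to recover each sample.
offset = 0
recovered = sizes.map do |n|
  chunk = buffer.byteslice(offset, n)
  offset += n
  chunk
end
```

Measuring with `bytesize` rather than `length` matters here: the first sample has five characters but occupies six bytes in UTF-8, and the trainer needs byte offsets.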

Instance Method Details

#size ⇒ Integer

Returns the size of the dictionary in bytes.



# File 'lib/rzstd/dictionary.rb', line 44

def size
  bytes.bytesize
end