Class: Parse::Retrieval::Chunker::Base

Inherits:
Object
  • Object
show all
Defined in:
lib/parse/retrieval/chunker.rb

Overview

Abstract base. Subclasses MUST implement #chunk.

Subclasses get one free behavior from Base: #chunk_with_meta, which wraps #chunk and reports whether the result was capped. Parse::Retrieval.retrieve calls #chunk_with_meta so it can stamp a truncation signal onto each emitted chunk's metadata.

Direct Known Subclasses

FixedSizeOverlap

Instance Method Summary collapse

Instance Method Details

#chunk(text) ⇒ Array<String>

Returns zero or more chunks. MUST return [] for blank/nil input.

Parameters:

  • text (String, nil)

    source document text.

Returns:

  • (Array<String>)

    zero or more chunks. MUST return [] for blank/nil input.

Raises:

  • (NotImplementedError)

    unless overridden.



51
52
53
# File 'lib/parse/retrieval/chunker.rb', line 51

def chunk(text)
  raise NotImplementedError, "#{self.class}#chunk must return Array<String>."
end

#chunk_with_meta(text) ⇒ Hash

Wrap #chunk with truncation metadata. The default implementation here does NOT cap — it reports the chunk list as produced. FixedSizeOverlap overrides this to enforce its max_chunks_per_document cap and report the pre-cap count.

Parameters:

Returns:

  • (Hash)

    { chunks: Array<String>, truncated: Boolean, total_before_truncation: Integer }.



63
64
65
66
# File 'lib/parse/retrieval/chunker.rb', line 63

def chunk_with_meta(text)
  chunks = Array(chunk(text))
  { chunks: chunks, truncated: false, total_before_truncation: chunks.length }
end