Class: HTM::Loaders::MarkdownChunker

Inherits:
Object
Defined in:
lib/htm/loaders/markdown_chunker.rb

Overview

Markdown-aware text chunker using Baran

Wraps Baran::MarkdownSplitter to provide intelligent text chunking that respects markdown structure (headers, code blocks, etc.).

Examples:

Basic usage

chunker = MarkdownChunker.new
chunks = chunker.chunk("# Header\n\nParagraph text.\n\n## Subheader\n\nMore text.")
# => ["# Header\n\nParagraph text.", "## Subheader\n\nMore text."]

With custom chunk size

chunker = MarkdownChunker.new(chunk_size: 512, chunk_overlap: 50)
chunks = chunker.chunk(long_text)

With full metadata (includes cursor positions)

chunker = MarkdownChunker.new
chunks = chunker.chunk_with_metadata(text)
# => [{ text: "...", cursor: 0, metadata: nil }, { text: "...", cursor: 156, metadata: nil }]

Constructor Details

#initialize(chunk_size: nil, chunk_overlap: nil) ⇒ MarkdownChunker

Returns a new instance of MarkdownChunker.

Parameters:

  • chunk_size (Integer) (defaults to: nil)

    Maximum characters per chunk. When nil, HTM.configuration.chunk_size is used (default: from config or 1024)

  • chunk_overlap (Integer) (defaults to: nil)

    Character overlap between chunks. When nil, HTM.configuration.chunk_overlap is used (default: from config or 64)



# File 'lib/htm/loaders/markdown_chunker.rb', line 29

def initialize(chunk_size: nil, chunk_overlap: nil)
  @chunk_size = chunk_size || HTM.configuration.chunk_size
  @chunk_overlap = chunk_overlap || HTM.configuration.chunk_overlap

  @splitter = Baran::MarkdownSplitter.new(
    chunk_size: @chunk_size,
    chunk_overlap: @chunk_overlap
  )
end
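
The `||` fallback used by the constructor can be sketched in isolation. This is an illustration only: `StubConfig`, `STUB_CONFIG`, and `resolve_setting` are hypothetical names standing in for `HTM.configuration` and the inline `||` expressions, and the stub values simply mirror the defaults documented above.

```ruby
# Hypothetical stand-in for HTM.configuration (not part of HTM).
StubConfig = Struct.new(:chunk_size, :chunk_overlap)
STUB_CONFIG = StubConfig.new(1024, 64)

# Mirrors the `explicit || configured` expressions in the constructor.
def resolve_setting(explicit, configured)
  explicit || configured
end

p resolve_setting(nil, STUB_CONFIG.chunk_size)   # nil falls back to the configured value
p resolve_setting(512, STUB_CONFIG.chunk_size)   # an explicit value wins
p resolve_setting(0, STUB_CONFIG.chunk_overlap)  # 0 is truthy in Ruby, so it is kept
```

Because only nil (or false) triggers the fallback, an explicit `chunk_overlap: 0` is passed through to the splitter rather than being replaced by the configured overlap.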

Instance Attribute Details

#chunk_overlap ⇒ Object (readonly)

Returns the value of attribute chunk_overlap.



# File 'lib/htm/loaders/markdown_chunker.rb', line 76

def chunk_overlap
  @chunk_overlap
end

#chunk_size ⇒ Object (readonly)

Returns the value of attribute chunk_size.



# File 'lib/htm/loaders/markdown_chunker.rb', line 76

def chunk_size
  @chunk_size
end

Instance Method Details

#chunk(text) ⇒ Array<String>

Split text into markdown-aware chunks (text only)

Parameters:

  • text (String)

    Text to chunk

Returns:

  • (Array<String>)

    Array of text chunks



# File 'lib/htm/loaders/markdown_chunker.rb', line 44

def chunk(text)
  return [] if text.nil? || text.strip.empty?

  # Normalize line endings
  normalized = text.gsub(/\r\n?/, "\n")

  # Use Baran's MarkdownSplitter
  result = @splitter.chunks(normalized)

  # Extract text from chunk hashes, filter empty
  result.map { |chunk| chunk[:text].strip }.reject(&:empty?)
end
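
The steps around the splitter call (blank-input guard, line-ending normalization, and text extraction) can be tried without Baran installed. The sketch below reuses the same expressions as `#chunk`; the helper names and the sample hashes are hypothetical and only imitate the `{ text:, cursor: }` shape described on this page.

```ruby
# Same guard as #chunk: nil and whitespace-only input count as blank.
def blank_input?(text)
  text.nil? || text.strip.empty?
end

# Same regex as #chunk: converts CRLF and lone CR to LF.
def normalize_line_endings(text)
  text.gsub(/\r\n?/, "\n")
end

# Same post-processing as #chunk: keep stripped text, drop empty chunks.
def extract_texts(chunk_hashes)
  chunk_hashes.map { |chunk| chunk[:text].strip }.reject(&:empty?)
end

# Hand-written sample data, not real splitter output.
SAMPLE_RAW = [
  { text: "# Title\n\n", cursor: 0 },
  { text: "   ", cursor: 9 },
  { text: "Body\n", cursor: 12 }
].freeze

p blank_input?(" \n\t")
p normalize_line_endings("# Title\r\n\r\nBody")
p normalize_line_endings("# Title\r\rBody")
p extract_texts(SAMPLE_RAW)
```

Normalizing before splitting means Windows (`\r\n`) and classic Mac (`\r`) input reaches the splitter with the same `\n` line breaks as Unix input.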

#chunk_with_metadata(text) ⇒ Array<Hash>

Split text and return full chunk data (with cursor positions)

Returns Baran’s full output including:

  • :text [String] The chunk content

  • :cursor [Integer] Character offset where the chunk starts in the text (after line endings are normalized to \n)

Parameters:

  • text (String)

    Text to chunk

Returns:

  • (Array<Hash>)

    Array of chunk hashes with :text and :cursor



# File 'lib/htm/loaders/markdown_chunker.rb', line 66

def chunk_with_metadata(text)
  return [] if text.nil? || text.strip.empty?

  # Normalize line endings
  normalized = text.gsub(/\r\n?/, "\n")

  # Use Baran's MarkdownSplitter - returns [{text:, cursor:}, ...]
  @splitter.chunks(normalized)
end
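
One possible use of the `:cursor` values is mapping a character offset back to the chunk that covers it. The sketch below is a minimal illustration with hand-written sample data. `SAMPLE_CHUNKS` and `chunk_at` are hypothetical names, and the code assumes each cursor is the chunk's start offset, as documented above. Since the splitter receives the normalized text, offsets would be relative to the `\n`-normalized input rather than to an original containing `\r\n`.

```ruby
# Hand-written sample shaped like the documented return value (not real output).
SAMPLE_CHUNKS = [
  { text: "# Header\n\nParagraph text.", cursor: 0 },
  { text: "## Subheader\n\nMore text.", cursor: 27 }
].freeze

# Returns the chunk with the greatest cursor that is <= offset, or nil if none.
def chunk_at(chunks, offset)
  chunks.select { |chunk| chunk[:cursor] <= offset }
        .max_by { |chunk| chunk[:cursor] }
end

found = chunk_at(SAMPLE_CHUNKS, 30)
puts found[:text].lines.first
```

With overlapping chunks more than one chunk may cover the same offset; this sketch simply picks the one that starts last.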