Class: HTM::Loaders::MarkdownLoader

Inherits:
Object
Defined in:
lib/htm/loaders/markdown_loader.rb

Overview

Markdown file loader

Loads markdown files into HTM long-term memory with support for:

  • YAML frontmatter parsing (stored as metadata on first chunk)

  • Paragraph-based chunking

  • Re-sync on file changes (via mtime comparison)

  • Duplicate detection via content_hash

Examples:

Load a single file

loader = MarkdownLoader.new(htm)
result = loader.load_file('/path/to/doc.md')
# => { file_path: '/path/to/doc.md', chunks_created: 5, ... }

Load a directory

results = loader.load_directory('/path/to/docs', pattern: '**/*.md')

Constant Summary

FRONTMATTER_REGEX =
/\A---\s*\n(.*?)\n---\s*\n/m

MAX_FILE_SIZE =
10 * 1024 * 1024

  10 MB maximum file size
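
To make the frontmatter pattern concrete, the sketch below applies the same regular expression to a small document. The helper name `split_frontmatter` and the use of `YAML.safe_load` are assumptions for illustration; the loader's own `extract_frontmatter` method is not shown on this page and may differ.

```ruby
require 'yaml'

# Same pattern as the FRONTMATTER_REGEX constant documented above.
FRONTMATTER_REGEX = /\A---\s*\n(.*?)\n---\s*\n/m

# Hypothetical helper (not part of HTM): returns [frontmatter_hash, body].
def split_frontmatter(content)
  match = FRONTMATTER_REGEX.match(content)
  return [{}, content] unless match

  parsed = YAML.safe_load(match[1])
  frontmatter = parsed.is_a?(Hash) ? parsed : {}
  # Everything after the closing '---' line is the markdown body.
  [frontmatter, match.post_match]
end

doc = <<~MD
  ---
  title: Getting Started
  tags:
    - intro
  ---
  # Welcome

  First paragraph.
MD

frontmatter, body = split_frontmatter(doc)
puts frontmatter.inspect
puts body
```

A document without a leading `---` block falls through the early return, so the whole content is treated as body and the frontmatter is an empty hash.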

Instance Method Summary

Constructor Details

#initialize(htm_instance, chunk_size: nil, chunk_overlap: nil) ⇒ MarkdownLoader

Returns a new instance of MarkdownLoader.

Parameters:

  • htm_instance (HTM)

    The HTM instance to use for storing nodes

  • chunk_size (Integer) (defaults to: nil)

    Maximum characters per chunk (default: from config)

  • chunk_overlap (Integer) (defaults to: nil)

    Character overlap between chunks (default: from config)



# File 'lib/htm/loaders/markdown_loader.rb', line 31

def initialize(htm_instance, chunk_size: nil, chunk_overlap: nil)
  @htm = htm_instance
  @chunker = MarkdownChunker.new(
    chunk_size: chunk_size,
    chunk_overlap: chunk_overlap
  )
end
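
The internals of `MarkdownChunker` are not shown on this page. As a rough illustration of what paragraph-based chunking with `chunk_size` and `chunk_overlap` could look like, the following sketch packs paragraphs into chunks up to a character limit and carries a short tail forward as overlap. The function `chunk_paragraphs` and its default values are hypothetical, and the real chunker may behave differently.

```ruby
# Hypothetical stand-in for MarkdownChunker; illustrates the idea only.
def chunk_paragraphs(text, chunk_size: 200, chunk_overlap: 0)
  # Paragraphs are separated by one or more blank lines.
  paragraphs = text.split(/\n{2,}/).map(&:strip).reject(&:empty?)
  chunks = []
  current = ''

  paragraphs.each do |paragraph|
    candidate = current.empty? ? paragraph : "#{current}\n\n#{paragraph}"
    if candidate.length > chunk_size && !current.empty?
      chunks << current
      # Carry the last chunk_overlap characters into the next chunk.
      keep = [chunk_overlap, current.length].min
      tail = keep.positive? ? current[-keep, keep] : ''
      current = tail.empty? ? paragraph : "#{tail}\n\n#{paragraph}"
    else
      current = candidate
    end
  end

  chunks << current unless current.empty?
  chunks
end

text = "aaaa\n\nbbbb\n\ncccc"
no_overlap   = chunk_paragraphs(text, chunk_size: 10)
with_overlap = chunk_paragraphs(text, chunk_size: 10, chunk_overlap: 2)
puts no_overlap.inspect
puts with_overlap.inspect
```

In this sketch a single paragraph longer than `chunk_size` is kept whole rather than split, and a chunk that starts with an overlap tail can exceed the limit. Both are simplifications.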

Instance Method Details

#load_directory(path, pattern: '**/*.md', force: false) ⇒ Array<Hash>

Load all matching files from a directory

Parameters:

  • path (String)

    Directory path

  • pattern (String) (defaults to: '**/*.md')

    Glob pattern (default: '**/*.md')

  • force (Boolean) (defaults to: false)

    Force re-sync even if unchanged

Returns:

  • (Array<Hash>)

    Results for each file



# File 'lib/htm/loaders/markdown_loader.rb', line 82

def load_directory(path, pattern: '**/*.md', force: false)
  expanded_path = File.expand_path(path)

  unless File.exist?(expanded_path)
    raise ArgumentError, "Directory not found: #{path}"
  end

  unless File.directory?(expanded_path)
    raise ArgumentError, "Not a directory: #{path}"
  end

  files = Dir.glob(File.join(expanded_path, pattern))

  files.map do |file_path|
    load_file(file_path, force: force)
  rescue StandardError => e
    { file_path: file_path, error: e.message, skipped: false }
  end
end
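
Because #load_directory rescues per-file errors and returns them as hashes with an :error key, callers need to separate failures from successes themselves. A minimal sketch of that post-processing follows. The sample hashes use the result shapes documented on this page, but their values, including the error message, are invented for illustration.

```ruby
# Sample data shaped like the documented result hashes; values are made up.
results = [
  { file_path: '/docs/a.md', chunks_created: 3, chunks_updated: 0, chunks_deleted: 0, skipped: false },
  { file_path: '/docs/b.md', chunks_created: 0, chunks_updated: 0, chunks_deleted: 0, skipped: true },
  { file_path: '/docs/c.md', error: 'example failure message', skipped: false }
]

# Failed files carry :error; the remaining files were either skipped or synced.
failed, succeeded = results.partition { |r| r.key?(:error) }
skipped, loaded   = succeeded.partition { |r| r[:skipped] }
created           = loaded.sum { |r| r[:chunks_created] }

puts "loaded=#{loaded.size} skipped=#{skipped.size} failed=#{failed.size} chunks_created=#{created}"
failed.each { |r| warn "#{r[:file_path]}: #{r[:error]}" }
```

Note that error hashes also have `skipped: false`, so checking :skipped alone is not enough to tell a failed file from a synced one.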

#load_file(path, force: false) ⇒ Hash

Load a single markdown file into long-term memory

Parameters:

  • path (String)

    Path to markdown file

  • force (Boolean) (defaults to: false)

    Force re-sync even if mtime unchanged

Returns:

  • (Hash)

    Load result with keys:

    • :file_path [String] Absolute path to file

    • :chunks_created [Integer] Number of new chunks created

    • :chunks_updated [Integer] Number of existing chunks updated

    • :chunks_deleted [Integer] Number of chunks soft-deleted

    • :skipped [Boolean] True if file was unchanged and skipped



# File 'lib/htm/loaders/markdown_loader.rb', line 50

def load_file(path, force: false)
  expanded_path = validate_file_path!(path)
  content       = read_file_content(expanded_path, path)
  stat          = File.stat(expanded_path)
  file_hash     = Digest::SHA256.hexdigest(content)

  source = HTM::Models::FileSource.first(file_path: expanded_path)
  is_new = source.nil?
  source ||= HTM::Models::FileSource.new(file_path: expanded_path)

  unless force || is_new || source.needs_sync?(stat.mtime)
    return { file_path: expanded_path, chunks_created: 0, chunks_updated: 0, chunks_deleted: 0, skipped: true }
  end

  frontmatter, body = extract_frontmatter(content)
  chunks = @chunker.(body)
  prepend_frontmatter_to_chunk(frontmatter, chunks)

  source.save if is_new
  result = sync_chunks(source, chunks)
  source.update(file_hash: file_hash, mtime: stat.mtime, file_size: stat.size,
                frontmatter: frontmatter, last_synced_at: Time.now)
  result.merge(file_path: expanded_path, file_source_id: source.id, skipped: false)
end
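
The skip logic in #load_file can be tried in isolation. The sketch below reproduces only the decision step (force flag, new file, mtime comparison) and the SHA-256 content hash, using a temporary file. `sync_decision` is a hypothetical helper, and comparing mtimes at one-second resolution is an assumption about what `FileSource#needs_sync?` does, which this page does not show.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper mirroring the skip check in #load_file.
# stored_mtime is nil for a file that has never been synced.
def sync_decision(path, stored_mtime, force: false)
  stat = File.stat(path)
  # Assumption: "unchanged" means the stored mtime equals the current one.
  unchanged = !stored_mtime.nil? && stat.mtime.to_i == stored_mtime.to_i
  {
    skipped: !force && unchanged,
    mtime: stat.mtime,
    file_hash: Digest::SHA256.hexdigest(File.read(path))
  }
end

file = Tempfile.new(['doc', '.md'])
file.write("# Title\n\nBody text.\n")
file.flush

first  = sync_decision(file.path, nil)                         # new file: synced
second = sync_decision(file.path, first[:mtime])               # unchanged: skipped
forced = sync_decision(file.path, first[:mtime], force: true)  # force: synced again

puts "first=#{first[:skipped]} second=#{second[:skipped]} forced=#{forced[:skipped]}"

file.close
file.unlink
```

As in the method above, the content hash is computed on every call but does not influence the skip decision. Only `force`, the absence of a stored record, and the mtime check do.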