Module: OllamaChat::Parsing
Overview
A module that provides content parsing functionality for OllamaChat.
The Parsing module encapsulates methods for processing various types of input sources including HTML, XML, CSV, RSS, Atom, PDF, and PostScript documents, as well as metadata embedded in PNG images. It handles content extraction and conversion into standardized text formats suitable for chat interactions. The module supports different document policies for handling imported or embedded content and provides utilities for parsing structured data from multiple source types.
Constant Summary
- DOCUMENT_POLICY_STATES =
An array of valid document policy states that define how document references in user text are handled.
These states control the behavior of the document policy selector:
- ignoring: Document references are ignored.
- embedding: Document references are embedded into the conversation context for RAG.
- importing: Document references are imported into the conversation.
- summarizing: Document references are summarized for reference.
%w[ ignoring embedding importing summarizing ]
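The four states can be illustrated with a small dispatch sketch. The helper `describe_policy` is hypothetical and not part of the module; it only mirrors, in simplified form, the kind of `case` dispatch a policy selector might drive.

```ruby
# Same value as the module constant, redefined here so the sketch runs alone.
DOCUMENT_POLICY_STATES = %w[ ignoring embedding importing summarizing ]

# Hypothetical helper (not part of OllamaChat): returns a short description
# of what happens to a document reference under the given policy state.
def describe_policy(state)
  unless DOCUMENT_POLICY_STATES.include?(state)
    raise ArgumentError, "unknown document policy: #{state.inspect}"
  end
  case state
  when 'ignoring'    then 'reference is left out of the conversation'
  when 'embedding'   then 'reference is embedded for retrieval (RAG)'
  when 'importing'   then 'full text is added to the conversation'
  when 'summarizing' then 'a summary is added to the conversation'
  end
end

DOCUMENT_POLICY_STATES.each do |state|
  puts "#{state}: #{describe_policy(state)}"
end
```

Validating the state against the constant first means a typo in a policy name fails loudly instead of silently falling through the `case`.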
Instance Method Summary
-
#parse_atom(source_io) ⇒ String
The parse_atom method processes an Atom feed from the provided IO source and converts it into a formatted text representation.
-
#parse_content(content, images) ⇒ String
Parses a string for URLs, file refs, and image links, then returns the transformed content.
-
#parse_png(source_io) ⇒ Array<String>?
Extracts embedded metadata from a PNG image, including character profiles, prompts, and workflows.
-
#parse_rss(source_io) ⇒ String
The parse_rss method processes an RSS feed source and converts it into a formatted text representation.
-
#parse_source(source_io) ⇒ String?
The parse_source method processes different types of input sources and converts them into a standardized text representation.
-
#pdf_read(io) ⇒ String
The pdf_read method extracts text content from a PDF file by reading all pages.
-
#personalize_character_profile(char) ⇒ String
Personalizes a character profile by replacing the {{user}} placeholder.
-
#ps_read(io) ⇒ String?
Converts PostScript content to PDF using Ghostscript and returns the extracted text.
-
#reverse_markdown(html) ⇒ String
The reverse_markdown method converts HTML content into Markdown format.
Methods included from Utils::AnalyzeDirectory
Instance Method Details
#parse_atom(source_io) ⇒ String
The parse_atom method processes an Atom feed from the provided IO source and converts it into a formatted text representation. It extracts the feed title and iterates through each item to build a structured output containing titles, links, and update dates.
The content of each item is converted using reverse_markdown for better readability.
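The accumulation pattern described above (a title header followed by one Markdown section per entry) can be sketched without the RSS library. `FeedEntry` and `format_feed` are hypothetical names used only for illustration; the real method reads these values from parsed feed objects and passes the entry content through reverse_markdown first.

```ruby
# Hypothetical stand-in for a parsed Atom entry.
FeedEntry = Struct.new(:title, :href, :updated, :body)

# Builds the text output: a level-1 header with the feed title, then one
# level-2 section per entry, appended to the same string via inject.
def format_feed(feed_title, entries)
  header = +"# #{feed_title}\n\n"
  entries.inject(header) do |text, entry|
    text << <<~EOT
      ## [#{entry.title}](#{entry.href})

      updated on #{entry.updated}

      #{entry.body}

    EOT
  end
end

entries = [
  FeedEntry.new('First post', 'https://example.com/1', '2024-01-01', 'Hello'),
  FeedEntry.new('Second post', 'https://example.com/2', '2024-01-02', 'World')
]
puts format_feed('Example Feed', entries)
```

Because `String#<<` returns the receiver, the block's value is always the growing string, which is what lets `inject` thread it through every entry.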
    # File 'lib/ollama_chat/parsing.rb', line 147

    def parse_atom(source_io)
      feed = RSS::Parser.parse(source_io, false, false)
      title = <<~EOT
        # #{feed.title.content}

      EOT
      feed.items.inject(title) do |text, item|
        text << <<~EOT
          ## [#{item&.title&.content}](#{item&.link&.href})

          updated on #{item&.updated&.content}

          #{reverse_markdown(item&.content&.content)}

        EOT
      end
    end
#parse_content(content, images) ⇒ String
Parses a string for URLs, file refs, and image links, then returns
the transformed content. Detects http(s) URLs, file:// paths,
quoted file paths, and collects any image URLs into the supplied
images array.
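As a rough illustration of the detection step, the sketch below scans text for the three kinds of reference named above. `REFERENCE_PATTERN` and `scan_references` are hypothetical: the module's real `CONTENT_REGEXP` is not shown in this document and almost certainly differs (for example, it also matches unquoted file paths).

```ruby
# Simplified, assumed pattern; NOT the module's CONTENT_REGEXP.
REFERENCE_PATTERN = %r{
  (https?://\S+)                        # group 1: web URL
  | (file://\S+)                        # group 2: file URL
  | "((?:\.\.?/|~/|/)(?:\\"|[^"])*)"    # group 3: quoted path
}x

# Returns [kind, reference] pairs in the order they appear in the text.
def scan_references(text)
  text.scan(REFERENCE_PATTERN).map do |url, file_url, quoted|
    if url
      [:url, url]
    elsif file_url
      [:file_url, file_url]
    else
      # Unescape \" inside the quoted path, as parse_content does.
      [:quoted_file, quoted.gsub('\"', '"')]
    end
  end
end

text = 'Compare https://example.com/a.html with file:///tmp/notes.txt and "./my docs/plan.md"'
scan_references(text).each { |kind, ref| puts "#{kind}: #{ref}" }
```

With one capture group per alternative, `String#scan` yields an array in which exactly one group is non-nil per match, which is why the real method can branch with a bare `case`/`when` over the captured variables.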
    # File 'lib/ollama_chat/parsing.rb', line 257

    def parse_content(content, images)
      images.clear
      contents = [ content ]
      content.scan(CONTENT_REGEXP).each { |url, file_url, quoted_file, file|
        # NOTE: the call between `Pathname.new(file)` and `directory?` is
        # missing from this listing; `expand_path` is an assumption.
        if file && Pathname.new(file).expand_path.directory?
          contents << generate_structure(file).to_json
          next
        end
        check_exist = false
        case
        when url
          source = url
        when file_url
          check_exist = true
          source = file_url
        when quoted_file
          file = quoted_file.gsub('\"', ?")
          file =~ %r(\A[~./]) or file.prepend('./')
          check_exist = true
          source = file
        when file
          file = file.gsub('\ ', ' ')
          file =~ %r(\A[~./]) or file.prepend('./')
          check_exist = true
          source = file
        end
        fetch_source(source, check_exist:) do |source_io|
          case source_io&.content_type&.media_type
          when 'image'
            add_image(images, source_io, source)
            if source_io&.content_type&.sub_type == 'png'
              source_io.rewind
              if results = parse_png(source_io)
                contents.concat results
              end
            end
          when 'text', 'application', nil
            case document_policy.selected
            when 'ignoring'
              nil
            when 'importing'
              contents << import_source(source_io, source)
            when 'embedding'
              # NOTE: the method name is missing from this listing;
              # `embed_source` is assumed by analogy with the other branches.
              embed_source(source_io, source)
            when 'summarizing'
              contents << summarize_source(source_io, source)
            end
          else
            STDERR.puts(
              "Cannot fetch #{source.to_s.inspect} with content type "\
              "#{source_io&.content_type.inspect}"
            )
          end
        end
      }
      contents.select { _1.present? rescue nil }.compact * "\n\n"
    end
#parse_png(source_io) ⇒ Array<String>?
Extracts embedded metadata from a PNG image, including character profiles, prompts, and workflows. Character profiles are automatically personalized to replace placeholders with the current user's name.
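For readers unfamiliar with PNG metadata, the sketch below shows one way textual metadata can be read from a PNG stream using only the standard library. It is an assumption-laden illustration, not the implementation of `PNGMetadataExtractor`: `read_png_text_chunks` and `png_chunk` are hypothetical helpers, and only uncompressed `tEXt` chunks are handled.

```ruby
require 'stringio'
require 'zlib'

PNG_SIGNATURE = "\x89PNG\r\n\x1a\n".b

# Reads all tEXt chunks from a PNG stream and returns a Hash of
# keyword => text, or nil if the stream is not a PNG. Compressed (zTXt)
# and international (iTXt) chunks are ignored in this sketch, and the
# CRC is skipped rather than verified.
def read_png_text_chunks(io)
  return nil unless io.read(8) == PNG_SIGNATURE
  texts = {}
  while (header = io.read(8)) && header.bytesize == 8
    length, type = header.unpack('Na4') # big-endian length, 4-byte type
    data = io.read(length).to_s
    io.read(4) # skip the CRC
    if type == 'tEXt'
      keyword, text = data.split("\0", 2)
      texts[keyword] = text
    end
    break if type == 'IEND'
  end
  texts
end

# Builds one PNG chunk: length, type, data, then a CRC over type + data.
def png_chunk(type, data)
  [data.bytesize].pack('N') + type + data + [Zlib.crc32(type + data)].pack('N')
end

# A minimal in-memory stream with a single text chunk (not a viewable image).
png = PNG_SIGNATURE + png_chunk('tEXt', "prompt\0a cat on a mat") + png_chunk('IEND', '')
p read_png_text_chunks(StringIO.new(png))
```

Keys such as `chara`, `parameters`, `prompt`, and `workflow` would appear as keywords in a Hash like this one, which is what lets the method pull them out one at a time with `delete`.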
    # File 'lib/ollama_chat/parsing.rb', line 74

    def parse_png(source_io)
      # NOTE: the name of the local variable holding the extracted metadata
      # is missing from this listing; `metadata` is an assumption.
      metadata = OllamaChat::Utils::PNGMetadataExtractor.extract_all(source_io) or return
      results = []

      if data = metadata.delete('chara') and
        char = OllamaChat::Utils::PNGMetadataExtractor.decode_character(data)
      then
        results << "Character Profile:\n\n#{personalize_character_profile(char)}"
      end

      if data = metadata.delete('parameters') and
        params = OllamaChat::Utils::PNGMetadataExtractor.parse_a1111_parameters(data)
      then
        results << "Generation Settings:\n\n#{params.to_json}"
      end

      if data = convert_to_utf8(metadata.delete('prompt'))
        results << "Prompt:\n\n#{data}"
      end

      if data = convert_to_utf8(metadata.delete('workflow'))
        results << "Workflow:\n\n#{data}"
      end

      if data = metadata.full? { _1.transform_values { |v| convert_to_utf8(v) } }
        results << "Metadata:\n\n#{data}"
      end

      results.full?
    end
#parse_rss(source_io) ⇒ String
The parse_rss method processes an RSS feed source and converts it into a formatted text representation. It extracts the channel title and iterates through each item in the feed to build a structured output. The method uses the RSS parser to handle the source input and formats the title, link, publication date, and description of each item into a readable text format with markdown-style headers and links.
    # File 'lib/ollama_chat/parsing.rb', line 117

    def parse_rss(source_io)
      feed = RSS::Parser.parse(source_io, false, false)
      title = <<~EOT
        # #{feed&.channel&.title}

      EOT
      feed.items.inject(title) do |text, item|
        text << <<~EOT
          ## [#{item&.title}](#{item&.link})

          updated on #{item&.pubDate}

          #{reverse_markdown(item&.description)}

        EOT
      end
    end
#parse_source(source_io) ⇒ String?
The parse_source method processes different types of input sources and converts them into a standardized text representation.
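The dispatch can be pictured as a lookup from content type to parser, with one extra step for generic XML: the first bytes are sniffed for an `<rss` root element. The sketch below is a simplified illustration under two assumptions: content types are plain strings, and `pick_parser` (a hypothetical name) returns a symbol instead of calling the real parsing methods.

```ruby
require 'stringio'

# Hypothetical helper: names the parser that would handle a source with
# the given content type. Only a subset of the real cases is shown.
def pick_parser(content_type, io)
  case content_type
  when 'text/html'
    :html
  when 'text/xml', 'application/xml'
    # Generic XML may still be an RSS feed: peek at the start, then rewind
    # so the chosen parser sees the stream from the beginning.
    looks_like_rss = io.read(8192).to_s.match?(%r(^\s*<rss\s))
    io.rewind
    looks_like_rss ? :rss : :raw
  when 'application/rss+xml'    then :rss
  when 'application/atom+xml'   then :atom
  when 'application/postscript' then :postscript
  when 'application/pdf'        then :pdf
  when %r(\Atext/), nil         then :raw
  else :unsupported
  end
end

feed = StringIO.new(%(<?xml version="1.0"?>\n<rss version="2.0"></rss>))
puts pick_parser('application/xml', feed)
puts pick_parser('text/plain', StringIO.new('hello'))
```

The `rewind` after sniffing matters: without it, whichever parser runs next would start reading partway into the document.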
    # File 'lib/ollama_chat/parsing.rb', line 35

    def parse_source(source_io)
      case source_io&.content_type
      when 'text/html'
        reverse_markdown(source_io.read)
      when 'text/xml', 'application/xml'
        if source_io.read(8192) =~ %r(^\s*<rss\s)
          source_io.rewind
          return parse_rss(source_io)
        end
        source_io.rewind
        source_io.read
      when 'application/rss+xml'
        parse_rss(source_io)
      when 'application/atom+xml'
        parse_atom(source_io)
      when 'application/postscript'
        ps_read(source_io)
      when 'application/pdf'
        pdf_read(source_io)
      when 'image/png'
        results = parse_png(source_io) and return results.join("\n\n---\n\n")
        STDERR.puts "Could not parse metadata from #{source_io&.content_type} document."
        nil
      when %r(\Aapplication/(json|ld\+json|x-ruby|x-perl|x-gawk|x-python|x-javascript|x-c?sh|x-dosexec|x-shellscript|x-tex|x-latex|x-lyx|x-bibtex)),
        %r(\Atext/), nil
        source_io.read
      else
        STDERR.puts "Cannot parse #{source_io&.content_type} document."
        return
      end
    end
#pdf_read(io) ⇒ String
The pdf_read method extracts text content from a PDF file by reading all pages.
    # File 'lib/ollama_chat/parsing.rb', line 171

    def pdf_read(io)
      reader = PDF::Reader.new(io)
      reader.pages.inject(+'') { |result, page| result << page.text }
    end
#personalize_character_profile(char) ⇒ String
Personalizes a character profile by replacing each {{user}} placeholder with the current user's name, falling back to 'the user' when no name is set.
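A standalone sketch of the substitution follows. The real method obtains the name from `user_name` on the including object; here it is passed as an optional argument to a hypothetical `personalize` helper so the example runs on its own.

```ruby
# Hypothetical standalone variant: the name is an argument rather than
# being read from the chat object.
def personalize(profile, user_name = nil)
  name = user_name || 'the user'
  profile.gsub('{{user}}', name)
end

profile = 'You are a guide. Greet {{user}} and ask what {{user}} needs.'
puts personalize(profile, 'Ada')
puts personalize(profile)
```

Note that the placeholder uses double braces, and that `gsub` replaces every occurrence, not just the first.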
    # File 'lib/ollama_chat/parsing.rb', line 229

    def personalize_character_profile(char)
      name = user_name || 'the user'
      char.gsub('{{user}}', name)
    end
#ps_read(io) ⇒ String?
Converts PostScript content to PDF using Ghostscript and extracts its text.
This method takes an IO object containing PostScript data, pipes it through Ghostscript's pdfwrite device into a temporary PDF file, and returns the text that pdf_read extracts from that file. If Ghostscript (gs) is not available in the system path, it prints an error message to standard error and returns nil.
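The streaming pattern used here (feed an IO to a child process in chunks, let the child write a temporary file, then read that file back) can be tried without Ghostscript. In the sketch below a child Ruby process that copies standard input to a file stands in for `gs`; `COPY_COMMAND` and `stream_through_child` are hypothetical names, and no conversion takes place.

```ruby
require 'rbconfig'
require 'stringio'
require 'tempfile'

# Stand-in for the Ghostscript command line: a child Ruby process that
# copies its standard input into the file named by its first argument.
COPY_COMMAND = [
  RbConfig.ruby, '-e', 'File.binwrite(ARGV[0], STDIN.binmode.read)'
].freeze

# Streams io to the child in chunks and returns what the child wrote.
def stream_through_child(io, chunk_size: 1 << 17)
  Tempfile.create('converted') do |tmp|
    IO.popen([*COPY_COMMAND, tmp.path], 'wb') do |child|
      child.write(io.read(chunk_size)) until io.eof?
    end # leaving the block closes the pipe and waits for the child
    File.binread(tmp.path)
  end
end

input = StringIO.new("%!PS-Adobe-3.0\n" + ('x' * 300_000))
output = stream_through_child(input)
puts output.bytesize
```

Passing the command as an array avoids shell interpretation of the temporary file path, which is a slightly more defensive choice than interpolating it into a command string.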
    # File 'lib/ollama_chat/parsing.rb', line 186

    def ps_read(io)
      gs = `which gs`.chomp
      if gs.present?
        Tempfile.create do |tmp|
          IO.popen("#{gs} -q -sDEVICE=pdfwrite -sOutputFile=#{tmp.path} -", 'wb') do |gs_io|
            until io.eof?
              buffer = io.read(1 << 17)
              IO.select(nil, [ gs_io ], nil)
              gs_io.write buffer
            end
            gs_io.close
            File.open(tmp.path, 'rb') do |pdf|
              pdf_read(pdf)
            end
          end
        end
      else
        STDERR.puts "Cannot convert #{io&.content_type} with ghostscript, gs not in path."
      end
    end
#reverse_markdown(html) ⇒ String
The reverse_markdown method converts HTML content into Markdown format.
This method processes HTML input and transforms it into equivalent Markdown, using specific conversion options to ensure compatibility and formatting.
    # File 'lib/ollama_chat/parsing.rb', line 216

    def reverse_markdown(html)
      ReverseMarkdown.convert(
        html,
        unknown_tags: :bypass,
        github_flavored: true,
        tag_border: ''
      )
    end