Class: Uniword::StreamingParser
Inherits: Object
Defined in: lib/uniword/streaming_parser.rb
Overview
Streaming parser for large DOCX documents.
This parser uses SAX-based parsing to process large documents without loading the entire DOM into memory. It’s designed for scenarios where memory efficiency is critical.
Defined Under Namespace
Classes: DocumentSaxHandler
Constant Summary
- STREAMING_THRESHOLD = 10 * 1024 * 1024
  Threshold for when to use streaming (in bytes). Documents larger than this will use streaming by default.
Class Method Summary
- .should_stream?(file_path) ⇒ Boolean
  Determine if a file should use streaming based on its size.
Instance Method Summary
- #initialize ⇒ StreamingParser (constructor)
  Initialize the streaming parser.
- #parse_metadata_only(zip_content) ⇒ Hash
  Parse document metadata without loading full content.
- #parse_streaming(zip_content) ⇒ Document
  Parse a large DOCX document with streaming.
- #set_limits(paragraphs: nil, tables: nil) ⇒ void
  Set limits for how many elements to parse.
Constructor Details
#initialize ⇒ StreamingParser
Initialize the streaming parser.

  # File 'lib/uniword/streaming_parser.rb', line 21

  def initialize
    @paragraph_limit = nil
    @table_limit = nil
  end
Class Method Details
.should_stream?(file_path) ⇒ Boolean
Determine if a file should use streaming based on its size.

  # File 'lib/uniword/streaming_parser.rb', line 40

  def self.should_stream?(file_path)
    File.size(file_path) > STREAMING_THRESHOLD
  rescue Errno::ENOENT, Errno::EACCES
    false
  end
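The size check can be exercised standalone. This sketch copies the logic above; the `threshold:` keyword argument is an addition for illustration only (not part of Uniword's API), so small files can trigger streaming:

```ruby
require "tempfile"

# Mirrors STREAMING_THRESHOLD (10 MiB)
THRESHOLD = 10 * 1024 * 1024

def should_stream?(file_path, threshold: THRESHOLD)
  File.size(file_path) > threshold
rescue Errno::ENOENT, Errno::EACCES
  # Unreadable or missing files fall back to non-streaming parsing
  false
end

Tempfile.create("doc") do |f|
  f.write("x" * 1024)
  f.flush
  puts should_stream?(f.path)                  # 1 KiB file: false
  puts should_stream?(f.path, threshold: 100)  # lowered threshold: true
end
puts should_stream?("no/such/file.docx")       # missing file: false
```

Swallowing `Errno::ENOENT`/`Errno::EACCES` and returning `false` means the caller can always call this predicate without a prior existence check; the eventual file open will surface the real error.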
Instance Method Details
#parse_metadata_only(zip_content) ⇒ Hash
Parse document metadata without loading full content.

This is useful for getting document stats quickly.

  # File 'lib/uniword/streaming_parser.rb', line 259

  def parse_metadata_only(zip_content)
    document_xml = zip_content["word/document.xml"]
    return {} unless document_xml

    metadata = {
      paragraph_count: 0,
      table_count: 0,
      has_images: false,
    }

    # Use simple regex for fast counting (not parsing full DOM)
    metadata[:paragraph_count] = document_xml.scan(/<w:p[ >]/).size
    metadata[:table_count] = document_xml.scan(/<w:tbl[ >]/).size
    metadata[:has_images] = document_xml.include?("<w:drawing")

    metadata
  end
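The regex counting can be tried on a hand-written WordprocessingML fragment (an illustrative sample, not real Word output):

```ruby
xml = <<~XML
  <w:document><w:body>
    <w:p><w:r><w:t>Hello</w:t></w:r></w:p>
    <w:p><w:r><w:drawing/></w:r></w:p>
    <w:tbl><w:tr><w:tc><w:p/></w:tc></w:tr></w:tbl>
  </w:body></w:document>
XML

# /<w:p[ >]/ matches "<w:p>" and "<w:p ...>" but not "<w:pPr>",
# "</w:p>", or the self-closing "<w:p/>" inside the table cell
paragraph_count = xml.scan(/<w:p[ >]/).size    # => 2
table_count     = xml.scan(/<w:tbl[ >]/).size  # => 1
has_images      = xml.include?("<w:drawing")   # => true
```

The trailing character class is what keeps the pattern from over-counting: requiring a space or `>` after `w:p` excludes sibling element names such as `w:pPr` that share the prefix.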
#parse_streaming(zip_content) ⇒ Document
Parse a large DOCX document with streaming.

  # File 'lib/uniword/streaming_parser.rb', line 50

  def parse_streaming(zip_content)
    document = Document.new
    document_xml = zip_content["word/document.xml"]
    return document unless document_xml

    # Use Nokogiri's SAX parser for streaming
    handler = DocumentSaxHandler.new(document, @paragraph_limit, @table_limit)
    parser = Nokogiri::XML::SAX::Parser.new(handler)

    parser.parse(document_xml)

    document
  end
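Since DocumentSaxHandler's internals are not shown on this page, here is a minimal sketch of the same event-driven idea using the stdlib's REXML stream parser instead of Nokogiri's SAX API. `ParagraphCollector` is a hypothetical stand-in, with element names simplified to plain `p`; the real handler also honors a table limit and builds full Document objects:

```ruby
require "rexml/parsers/streamparser"
require "rexml/streamlistener"

# Event-driven collector: holds only the current paragraph's text in
# memory, never a full DOM tree
class ParagraphCollector
  include REXML::StreamListener
  attr_reader :paragraphs

  def initialize(limit = nil)
    @paragraphs = []
    @limit = limit
    @buffer = nil
  end

  def tag_start(name, _attrs)
    @buffer = +"" if name == "p"
  end

  def text(data)
    @buffer << data if @buffer
  end

  def tag_end(name)
    return unless name == "p" && @buffer
    # Stop collecting once the optional limit is reached
    @paragraphs << @buffer unless @limit && @paragraphs.size >= @limit
    @buffer = nil
  end
end

xml = "<body><p>One</p><p>Two</p><p>Three</p></body>"
collector = ParagraphCollector.new(2)
REXML::Parsers::StreamParser.new(xml, collector).parse
p collector.paragraphs  # => ["One", "Two"]
```

The limit check in `tag_end` mirrors the purpose of `set_limits` below: on a huge document, capping the element count bounds both work and memory even though the parser still streams the remaining bytes.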
#set_limits(paragraphs: nil, tables: nil) ⇒ void
This method returns an undefined value.
Set limits for how many elements to parse.

  # File 'lib/uniword/streaming_parser.rb', line 31

  def set_limits(paragraphs: nil, tables: nil)
    @paragraph_limit = paragraphs
    @table_limit = tables
  end