Class: Uniword::StreamingParser

Inherits: Object
Defined in:
lib/uniword/streaming_parser.rb

Overview

Streaming parser for large DOCX documents.

This parser uses SAX-based parsing to process large documents without loading the entire DOM into memory. It’s designed for scenarios where memory efficiency is critical.

Examples:

Parse a large document

parser = Uniword::StreamingParser.new
document = parser.parse_file('large_document.docx')

Defined Under Namespace

Classes: DocumentSaxHandler

Constant Summary

STREAMING_THRESHOLD = 10 * 1024 * 1024

Threshold for when to use streaming (in bytes). Documents larger than this will use streaming by default.

Class Method Summary

Instance Method Summary

Constructor Details

#initialize ⇒ StreamingParser

Initialize the streaming parser



# File 'lib/uniword/streaming_parser.rb', line 21

def initialize
  @paragraph_limit = nil
  @table_limit = nil
end

Class Method Details

.should_stream?(file_path) ⇒ Boolean

Determine whether a file should use streaming, based on its size

Parameters:

  • file_path (String)

    Path to the file

Returns:

  • (Boolean)

    true if file is large enough for streaming



# File 'lib/uniword/streaming_parser.rb', line 40

def self.should_stream?(file_path)
  File.size(file_path) > STREAMING_THRESHOLD
rescue Errno::ENOENT, Errno::EACCES
  false
end
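The size check above is easy to exercise on its own. The snippet below mirrors `should_stream?` with the same 10 MiB threshold; the standalone names (`THRESHOLD`, `stream_candidate?`) are illustrative, not part of the library:

```ruby
require "tempfile"

# Mirrors Uniword::StreamingParser::STREAMING_THRESHOLD (10 MiB)
THRESHOLD = 10 * 1024 * 1024

# Standalone sketch of the should_stream? check shown above
def stream_candidate?(file_path)
  File.size(file_path) > THRESHOLD
rescue Errno::ENOENT, Errno::EACCES
  # Missing or unreadable files fall back to the non-streaming path
  false
end

small = Tempfile.new("doc")
small.write("x" * 1024)
small.flush

stream_candidate?(small.path)     # a 1 KiB file stays below the threshold
stream_candidate?("missing.docx") # a nonexistent path returns false, not an error
```

Rescuing `Errno::ENOENT`/`Errno::EACCES` means a bad path simply routes the caller to the non-streaming parser instead of raising.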

Instance Method Details

#parse_metadata_only(zip_content) ⇒ Hash

Parse document metadata without loading the full content

This is useful for quickly gathering document statistics

Parameters:

  • zip_content (Hash)

    Extracted ZIP content

Returns:

  • (Hash)

    Document metadata



# File 'lib/uniword/streaming_parser.rb', line 259

def parse_metadata_only(zip_content)
  document_xml = zip_content["word/document.xml"]
  return {} unless document_xml

  metadata = {
    paragraph_count: 0,
    table_count: 0,
    has_images: false,
  }

  # Use simple regexes for fast counting (no full DOM parse)
  metadata[:paragraph_count] = document_xml.scan(/<w:p[ >]/).size
  metadata[:table_count] = document_xml.scan(/<w:tbl[ >]/).size
  metadata[:has_images] = document_xml.include?("<w:drawing")

  metadata
end
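The regex counting can be seen on a small hand-written WordprocessingML fragment (sample XML for illustration, not library output). The `[ >]` character class is what keeps property tags like `<w:pPr>` from being miscounted as paragraphs:

```ruby
document_xml = <<~XML
  <w:body>
    <w:p><w:pPr></w:pPr><w:r><w:t>Hello</w:t></w:r></w:p>
    <w:p><w:r><w:t>World</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc><w:p><w:r><w:t>Cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
  </w:body>
XML

metadata = {
  # /<w:p[ >]/ matches <w:p> and <w:p attr="..."> but not <w:pPr>
  paragraph_count: document_xml.scan(/<w:p[ >]/).size,
  table_count: document_xml.scan(/<w:tbl[ >]/).size,
  has_images: document_xml.include?("<w:drawing"),
}

metadata  # => { paragraph_count: 3, table_count: 1, has_images: false }
```

Note the paragraph inside the table cell is counted too: the scan sees every `<w:p>` start tag, regardless of nesting.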

#parse_streaming(zip_content) ⇒ Document

Parse a large DOCX document with streaming

Parameters:

  • zip_content (Hash)

    Extracted ZIP content from DOCX

Returns:

  • (Document)

    The parsed document



# File 'lib/uniword/streaming_parser.rb', line 50

def parse_streaming(zip_content)
  document = Document.new
  document_xml = zip_content["word/document.xml"]

  return document unless document_xml

  # Use Nokogiri's SAX parser for streaming
  handler = DocumentSaxHandler.new(document, @paragraph_limit, @table_limit)
  parser = Nokogiri::XML::SAX::Parser.new(handler)
  parser.parse(document_xml)

  document
end

#set_limits(paragraphs: nil, tables: nil) ⇒ void

This method returns an undefined value.

Set limits on how many elements to parse

Parameters:

  • paragraphs (Integer, nil) (defaults to: nil)

    Maximum paragraphs to parse

  • tables (Integer, nil) (defaults to: nil)

    Maximum tables to parse



# File 'lib/uniword/streaming_parser.rb', line 31

def set_limits(paragraphs: nil, tables: nil)
  @paragraph_limit = paragraphs
  @table_limit = tables
end
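Inside the SAX handler, a `nil` limit means "parse everything". A minimal sketch of how such a cap might be enforced, in plain Ruby rather than the library's actual handler code:

```ruby
paragraph_limit = 2  # as if set via set_limits(paragraphs: 2)
parsed = []

%w[intro body conclusion appendix].each do |para|
  # A nil limit disables the cap; otherwise stop once it is reached
  break if paragraph_limit && parsed.size >= paragraph_limit
  parsed << para
end

parsed  # => ["intro", "body"]
```

Capping element counts this way lets a caller preview the first few paragraphs or tables of a very large document without paying for a full parse.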