Class: Uniword::StreamingParser

Inherits: Object
Defined in:
lib/uniword/streaming_parser.rb

Overview

Streaming parser for large DOCX documents.

This parser uses SAX-based parsing to process large documents without loading the entire DOM into memory. It’s designed for scenarios where memory efficiency is critical.

Examples:

Parse a large document

parser = Uniword::StreamingParser.new
document = parser.parse_file('large_document.docx')

Defined Under Namespace

Classes: DocumentSaxHandler

Constant Summary

STREAMING_THRESHOLD = 10 * 1024 * 1024

Threshold for when to use streaming (in bytes). Documents larger than this will use streaming by default.

Class Method Summary

Instance Method Summary

Constructor Details

#initialize ⇒ StreamingParser

Initialize the streaming parser



# File 'lib/uniword/streaming_parser.rb', line 21

def initialize
  @paragraph_limit = nil
  @table_limit = nil
end

Class Method Details

.should_stream?(file_path) ⇒ Boolean

Determine whether a file should use streaming, based on its size

Parameters:

  • file_path (String)

    Path to the file

Returns:

  • (Boolean)

    true if file is large enough for streaming



# File 'lib/uniword/streaming_parser.rb', line 40

def self.should_stream?(file_path)
  File.size(file_path) > STREAMING_THRESHOLD
rescue Errno::ENOENT, Errno::EACCES
  false
end
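The size check above is easy to exercise on its own. The snippet below mirrors `should_stream?` with the same 10 MiB threshold; the standalone names (`THRESHOLD`, `stream_candidate?`) are illustrative, not part of the library:

```ruby
require "tempfile"

# Mirrors Uniword::StreamingParser::STREAMING_THRESHOLD (10 MiB)
THRESHOLD = 10 * 1024 * 1024

# Standalone sketch of the should_stream? check shown above
def stream_candidate?(file_path)
  File.size(file_path) > THRESHOLD
rescue Errno::ENOENT, Errno::EACCES
  # Missing or unreadable files fall back to the non-streaming path
  false
end

small = Tempfile.new("doc")
small.write("x" * 1024)
small.flush

stream_candidate?(small.path)     # a 1 KiB file stays below the threshold
stream_candidate?("missing.docx") # a nonexistent path returns false, not an error
```

Rescuing `Errno::ENOENT`/`Errno::EACCES` means a bad path simply routes the caller to the non-streaming parser instead of raising.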

Instance Method Details

#parse_metadata_only(zip_content) ⇒ Hash

Parse document metadata without loading the full content

This is useful for quickly gathering document statistics

Parameters:

  • zip_content (Hash)

    Extracted ZIP content

Returns:

  • (Hash)

    Document metadata



# File 'lib/uniword/streaming_parser.rb', line 259

def parse_metadata_only(zip_content)
  document_xml = zip_content["word/document.xml"]
  return {} unless document_xml

  metadata = {
    paragraph_count: 0,
    table_count: 0,
    has_images: false,
  }

  # Use simple regexes for fast counting (no full DOM parse)
  metadata[:paragraph_count] = document_xml.scan(/<w:p[ >]/).size
  metadata[:table_count] = document_xml.scan(/<w:tbl[ >]/).size
  metadata[:has_images] = document_xml.include?("<w:drawing")

  metadata
end
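The regex counting can be seen on a small hand-written WordprocessingML fragment (sample XML for illustration, not library output). The `[ >]` character class is what keeps property tags like `<w:pPr>` from being miscounted as paragraphs:

```ruby
document_xml = <<~XML
  <w:body>
    <w:p><w:pPr></w:pPr><w:r><w:t>Hello</w:t></w:r></w:p>
    <w:p><w:r><w:t>World</w:t></w:r></w:p>
    <w:tbl><w:tr><w:tc><w:p><w:r><w:t>Cell</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
  </w:body>
XML

metadata = {
  # /<w:p[ >]/ matches <w:p> and <w:p attr="..."> but not <w:pPr>
  paragraph_count: document_xml.scan(/<w:p[ >]/).size,
  table_count: document_xml.scan(/<w:tbl[ >]/).size,
  has_images: document_xml.include?("<w:drawing"),
}

metadata  # => { paragraph_count: 3, table_count: 1, has_images: false }
```

Note the paragraph inside the table cell is counted too: the scan sees every `<w:p>` start tag, regardless of nesting.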

#parse_streaming(zip_content) ⇒ Document

Parse a large DOCX document with streaming

Parameters:

  • zip_content (Hash)

    Extracted ZIP content from DOCX

Returns:

  • (Document)

    The parsed document



# File 'lib/uniword/streaming_parser.rb', line 50

def parse_streaming(zip_content)
  document = Document.new
  document_xml = zip_content["word/document.xml"]

  return document unless document_xml

  # Use Nokogiri's SAX parser for streaming
  handler = DocumentSaxHandler.new(document, @paragraph_limit, @table_limit)
  parser = Nokogiri::XML::SAX::Parser.new(handler)
  parser.parse(document_xml)

  document
end

#set_limits(paragraphs: nil, tables: nil) ⇒ void

This method returns an undefined value.

Set limits on how many elements to parse

Parameters:

  • paragraphs (Integer, nil) (defaults to: nil)

    Maximum paragraphs to parse

  • tables (Integer, nil) (defaults to: nil)

    Maximum tables to parse



# File 'lib/uniword/streaming_parser.rb', line 31

def set_limits(paragraphs: nil, tables: nil)
  @paragraph_limit = paragraphs
  @table_limit = tables
end
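Inside the SAX handler, a `nil` limit means "parse everything". A minimal sketch of how such a cap might be enforced, in plain Ruby rather than the library's actual handler code:

```ruby
paragraph_limit = 2  # as if set via set_limits(paragraphs: 2)
parsed = []

%w[intro body conclusion appendix].each do |para|
  # A nil limit disables the cap; otherwise stop once it is reached
  break if paragraph_limit && parsed.size >= paragraph_limit
  parsed << para
end

parsed  # => ["intro", "body"]
```

Capping element counts this way lets a caller preview the first few paragraphs or tables of a very large document without paying for a full parse.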