Class: Uniword::Batch::DocumentProcessor

Inherits:
Object
Defined in:
lib/uniword/batch/document_processor.rb

Overview

Orchestrates batch document processing through configurable pipeline stages.

Responsibility: load the pipeline configuration and coordinate stage execution. Following the Single Responsibility Principle, this class only orchestrates processing and delegates the actual work to the stages.

It also follows the Open/Closed Principle: new stages can be added via configuration without modifying this class.

Examples:

Process a batch of documents

processor = DocumentProcessor.new(pipeline_config: 'config/pipeline.yml')
results = processor.process_batch(
  input_dir: 'documents/input/',
  output_dir: 'documents/output/'
)
puts results.summary_text

Process with custom stages

processor = DocumentProcessor.new
processor.add_stage(CustomStage.new(enabled: true))
results = processor.process_file('document.docx', 'output.docx')

Constant Summary

STAGE_CLASSES =

Stage class registry. Maps stage names to their class implementations.

{
  normalize_styles: "NormalizeStylesStage",
  update_metadata: "UpdateMetadataStage",
  validate_links: "ValidateLinksStage",
  quality_check: "QualityCheckStage",
  convert_format: "ConvertFormatStage",
  compress_images: "CompressImagesStage",
}.freeze
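
The registry stores class names as strings rather than class objects, which suggests the names are resolved to constants when the pipeline is loaded. The sketch below shows one way such a lookup could work. The `RegistryDemo` module, its stub classes, and the `resolve_stage` helper are hypothetical stand-ins for illustration, not Uniword's actual loader.

```ruby
# Hypothetical stand-in namespace; the real stage classes live in Uniword.
module RegistryDemo
  class NormalizeStylesStage; end
  class UpdateMetadataStage; end

  # Same shape as STAGE_CLASSES: symbol name => class name string.
  REGISTRY = {
    normalize_styles: "NormalizeStylesStage",
    update_metadata: "UpdateMetadataStage",
  }.freeze

  # Look up a stage class by its registry name.
  # Raises KeyError for names that are not in the registry.
  def self.resolve_stage(name)
    const_get(REGISTRY.fetch(name.to_sym))
  end
end

RegistryDemo.resolve_stage(:normalize_styles) # => RegistryDemo::NormalizeStylesStage
RegistryDemo.resolve_stage("update_metadata") # => RegistryDemo::UpdateMetadataStage
```

Using `fetch` instead of `[]` makes an unknown stage name fail loudly with a `KeyError` rather than passing `nil` on to `const_get`.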

Constructor Details

#initialize(pipeline_config: nil, config: nil, parallel: false, max_workers: 4) ⇒ DocumentProcessor

Initialize document processor

Parameters:

  • pipeline_config (String, nil) (defaults to: nil)

    Path to pipeline configuration file

  • config (Hash, nil) (defaults to: nil)

    Direct configuration hash (overrides pipeline_config)

  • parallel (Boolean) (defaults to: false)

    Enable parallel processing

  • max_workers (Integer) (defaults to: 4)

    Maximum number of parallel workers



# File 'lib/uniword/batch/document_processor.rb', line 48

def initialize(pipeline_config: nil, config: nil, parallel: false,
               max_workers: 4)
  @config = load_configuration(pipeline_config, config)
  @parallel = parallel || @config.dig(:pipeline, :parallel,
                                      :enabled) || false
  @max_workers = max_workers || @config.dig(:pipeline, :parallel,
                                            :max_workers) || 4
  @stages = load_stages
  @custom_stages = []
end
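
The two `||` chains in the constructor decide whether the keyword argument or the configuration value takes effect. The sketch below reproduces just that logic in a standalone helper so the precedence can be checked. `resolve_parallel_settings` is a hypothetical name, and the config hash shape is inferred only from the `dig` keys shown above.

```ruby
# Mirrors the `||` chains in #initialize. The config shape is inferred from
# the dig keys (:pipeline, :parallel, :enabled / :max_workers).
def resolve_parallel_settings(config, parallel: false, max_workers: 4)
  {
    parallel: parallel || config.dig(:pipeline, :parallel, :enabled) || false,
    max_workers: max_workers || config.dig(:pipeline, :parallel, :max_workers) || 4,
  }
end

config = { pipeline: { parallel: { enabled: true, max_workers: 8 } } }

resolve_parallel_settings(config)
# => { parallel: true, max_workers: 4 }  (the default 4 is truthy, so it wins)

resolve_parallel_settings(config, max_workers: nil)
# => { parallel: true, max_workers: 8 }  (nil lets the config value through)

resolve_parallel_settings({}, parallel: true, max_workers: 2)
# => { parallel: true, max_workers: 2 }
```

Two consequences follow from the code as written: the configured `max_workers` is consulted only when the caller passes `max_workers: nil` explicitly, and passing `parallel: false` cannot switch off parallel processing that the configuration enables.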

Instance Attribute Details

#config ⇒ Object (readonly)

Returns the value of attribute config.



# File 'lib/uniword/batch/document_processor.rb', line 29

def config
  @config
end

#stages ⇒ Object (readonly)

Returns the value of attribute stages.



# File 'lib/uniword/batch/document_processor.rb', line 29

def stages
  @stages
end

Instance Method Details

#add_stage(stage) ⇒ self

Add a custom processing stage

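Parameters:

  • stage (ProcessingStage)

    Stage to add; must be a ProcessingStage instance

Returns:

  • (self)

Raises:

  • (ArgumentError)

    If stage is not a ProcessingStage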


# File 'lib/uniword/batch/document_processor.rb', line 138

def add_stage(stage)
  unless stage.is_a?(ProcessingStage)
    raise ArgumentError,
          "Stage must inherit from ProcessingStage"
  end

  @custom_stages << stage
  self
end
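
A custom stage has to pass the `is_a?(ProcessingStage)` check and respond to the calls the pipeline makes on it (`enabled?`, `name`, and `process(document, context)`). The sketch below uses a stand-in base class with that assumed interface. `CustomStageDemo`, `AuditStage`, and `accept_stage` are hypothetical names, and the real `ProcessingStage` may define more than is shown here.

```ruby
module CustomStageDemo
  # Stand-in for Uniword's ProcessingStage base class (assumed interface).
  class ProcessingStage
    def initialize(enabled: true)
      @enabled = enabled
    end

    def enabled?
      @enabled
    end

    def name
      self.class.name
    end

    def process(_document, _context)
      raise NotImplementedError, "#{self.class} must implement #process"
    end
  end

  # Hypothetical custom stage: records each file name it is asked to process.
  class AuditStage < ProcessingStage
    attr_reader :seen

    def initialize(**options)
      super
      @seen = []
    end

    def process(_document, context)
      @seen << context[:filename]
    end
  end

  # Same type check as #add_stage, extracted so it can run on its own.
  def self.accept_stage(stages, stage)
    unless stage.is_a?(ProcessingStage)
      raise ArgumentError, "Stage must inherit from ProcessingStage"
    end

    stages << stage
  end
end

stages = []
CustomStageDemo.accept_stage(stages, CustomStageDemo::AuditStage.new(enabled: true))
```

Because `#add_stage` returns `self`, calls on the real processor can be chained, for example `processor.add_stage(a).add_stage(b)`.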

#disabled_stages ⇒ Array<String>

Get list of disabled stage names

Returns:

  • (Array<String>)

    Names of disabled stages



# File 'lib/uniword/batch/document_processor.rb', line 159

def disabled_stages
  all_stages = @stages + @custom_stages
  all_stages.reject(&:enabled?).map(&:name)
end

#enabled_stages ⇒ Array<String>

Get list of enabled stage names

Returns:

  • (Array<String>)

    Names of enabled stages



# File 'lib/uniword/batch/document_processor.rb', line 151

def enabled_stages
  all_stages = @stages + @custom_stages
  all_stages.select(&:enabled?).map(&:name)
end
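
`#enabled_stages` and `#disabled_stages` filter the same combined list (configured stages first, then custom stages), so every stage name appears in exactly one of the two results. The sketch below shows the equivalent split with `Enumerable#partition`. `DemoStage` and `stage_names_by_state` are hypothetical stand-ins that implement only the two methods used here.

```ruby
# Minimal stand-in for a stage: only #name and #enabled? are needed.
DemoStage = Struct.new(:name, :enabled) do
  def enabled?
    enabled
  end
end

# Split stages into enabled and disabled names, preserving pipeline order.
def stage_names_by_state(stages)
  enabled, disabled = stages.partition(&:enabled?)
  { enabled: enabled.map(&:name), disabled: disabled.map(&:name) }
end

configured = [DemoStage.new("normalize_styles", true), DemoStage.new("validate_links", false)]
custom = [DemoStage.new("audit", true)]

stage_names_by_state(configured + custom)
# => { enabled: ["normalize_styles", "audit"], disabled: ["validate_links"] }
```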

#process_batch(input_dir:, output_dir:, pattern: "*.{docx,doc}") ⇒ BatchResult

Process a batch of documents from input directory

Parameters:

  • input_dir (String)

    Input directory path

  • output_dir (String)

    Output directory path

  • pattern (String) (defaults to: "*.{docx,doc}")

    File pattern to match (default: '*.{docx,doc}')

Returns:

  • (BatchResult)



# File 'lib/uniword/batch/document_processor.rb', line 65

def process_batch(input_dir:, output_dir:, pattern: "*.{docx,doc}")
  validate_directories!(input_dir, output_dir)

  # Create output directory if it doesn't exist
  FileUtils.mkdir_p(output_dir)

  # Find all matching files
  files = Dir.glob(File.join(input_dir, pattern))

  result = BatchResult.new

  if @parallel && files.size > 1
    process_parallel(files, input_dir, output_dir, result)
  else
    process_sequential(files, input_dir, output_dir, result)
  end

  result.complete!
end
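
File selection relies on `Dir.glob` with the pattern joined onto `input_dir`, so the default `*.{docx,doc}` matches only files directly inside that directory. The sketch below demonstrates this in a temporary directory. `matching_files` is a hypothetical helper, and the file names are made up for the example.

```ruby
require "fileutils"
require "tmpdir"

# Return the base names in `dir` that match `pattern`, sorted for stable output.
def matching_files(dir, pattern = "*.{docx,doc}")
  Dir.glob(File.join(dir, pattern)).map { |path| File.basename(path) }.sort
end

Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, "nested"))
  %w[a.docx b.doc notes.txt nested/c.docx].each do |name|
    FileUtils.touch(File.join(dir, name))
  end

  p matching_files(dir)               # prints ["a.docx", "b.doc"]
  p matching_files(dir, "**/*.docx")  # prints ["a.docx", "c.docx"]
end
```

Passing a recursive pattern such as `**/*.docx` is one way to include subdirectories. How the real `process_batch` maps nested input paths to output paths is not shown in this section.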

#process_file(input_path, output_path) ⇒ BatchResult

Process a single document file

Parameters:

  • input_path (String)

    Input file path

  • output_path (String)

    Output file path

Returns:

  • (BatchResult)



# File 'lib/uniword/batch/document_processor.rb', line 90

def process_file(input_path, output_path)
  result = BatchResult.new
  start_time = Time.now

  begin
    # Load document
    document = DocumentFactory.from_file(input_path)

    # Create context
    context = {
      input_path: input_path,
      output_path: output_path,
      filename: File.basename(input_path),
    }

    # Execute pipeline
    executed_stages = []
    all_stages = @stages + @custom_stages

    all_stages.each do |stage|
      next unless stage.enabled?

      stage.process(document, context)
      executed_stages << stage.name
    end

    # Save output
    output_dir = File.dirname(output_path)
    FileUtils.mkdir_p(output_dir)
    document.save(output_path)

    duration = Time.now - start_time
    result.add_success(
      file: input_path,
      duration: duration,
      stages: executed_stages,
    )
  rescue StandardError => e
    handle_error(e, input_path, result)
  end

  result.complete!
end
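
In the method above, disabled stages are skipped, enabled stages run in order, and the first `StandardError` aborts the remaining stages and the save step. The sketch below isolates that control flow with stand-in objects. `PipelineDemo`, its `Stage` struct, and the returned hash are hypothetical, since the behavior of `handle_error` and the fields of `BatchResult` are not shown in this section.

```ruby
module PipelineDemo
  # Stand-in stage: `action` is a lambda called with (document, context).
  Stage = Struct.new(:name, :enabled, :action) do
    def enabled?
      enabled
    end

    def process(document, context)
      action.call(document, context)
    end
  end

  # Run enabled stages in order; stop at the first error and report it.
  def self.run(stages, document, context)
    executed = []
    stages.each do |stage|
      next unless stage.enabled?

      stage.process(document, context)
      executed << stage.name
    end
    { status: :success, stages: executed }
  rescue StandardError => e
    { status: :failure, error: e.message, stages: executed }
  end
end

upcase  = PipelineDemo::Stage.new("upcase", true, ->(doc, _ctx) { doc[:text] = doc[:text].upcase })
skipped = PipelineDemo::Stage.new("skipped", false, ->(_doc, _ctx) { raise "never called" })
broken  = PipelineDemo::Stage.new("broken", true, ->(_doc, _ctx) { raise "bad link" })

PipelineDemo.run([upcase, skipped], { text: "hi" }, {})
# => { status: :success, stages: ["upcase"] }

PipelineDemo.run([upcase, broken, upcase], { text: "hi" }, {})
# => { status: :failure, error: "bad link", stages: ["upcase"] }
```

Stages that ran before the failure may already have modified the in-memory document, but because saving happens after the loop, nothing is written to `output_path` when a stage raises.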