Module: Rpdfium

Defined in:
lib/rpdfium.rb,
lib/rpdfium/raw.rb,
lib/rpdfium/page.rb,
lib/rpdfium/errors.rb,
lib/rpdfium/io/png.rb,
lib/rpdfium/version.rb,
lib/rpdfium/document.rb,
lib/rpdfium/form/form.rb,
lib/rpdfium/table/cells.rb,
lib/rpdfium/table/edges.rb,
lib/rpdfium/table/table.rb,
lib/rpdfium/util/cluster.rb,
lib/rpdfium/search/search.rb,
lib/rpdfium/image/embedded.rb,
lib/rpdfium/structure/tree.rb,
lib/rpdfium/table/debugger.rb,
lib/rpdfium/table/extractor.rb,
lib/rpdfium/util/word_merger.rb,
lib/rpdfium/structure/element.rb,
lib/rpdfium/structure/outline.rb,
lib/rpdfium/util/label_matcher.rb,
lib/rpdfium/util/word_extractor.rb,
lib/rpdfium/structure/attachment.rb,
lib/rpdfium/util/text_extraction.rb,
lib/rpdfium/annotation/annotation.rb,
lib/rpdfium/util/column_inference.rb

Overview

rpdfium - Ruby bindings to PDFium with table extraction.

Top-level API:

Rpdfium.open(path_or_io_or_bytes) { |doc| ... }
Rpdfium.extract_text(path)
Rpdfium.extract_tables(path)
Rpdfium.render_to_pngs(path, output_dir:)
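
The top-level calls listed above can be combined into a small report. The following is only a sketch: `summarize_pdf` and its `api:` keyword are hypothetical names introduced for illustration (they are not part of rpdfium), and the code assumes that `extract_text` returns one string per page and `extract_tables` returns hashes with a `:page` key, as documented further down this page.

```ruby
# Hypothetical helper, not part of rpdfium.
# `api:` lets a caller inject any object that responds to extract_text and
# extract_tables; when omitted, the real Rpdfium module is loaded lazily.
def summarize_pdf(path, api: nil, password: nil)
  api ||= begin
    require "rpdfium" # assumes the gem and a PDFium binary are installed
    Rpdfium
  end

  texts  = api.extract_text(path, password: password)   # one String per page
  tables = api.extract_tables(path, password: password) # [{ page:, rows: }, ...]

  {
    pages: texts.length,
    characters: texts.sum { |t| t.to_s.length },
    tables: tables.length,
    table_pages: tables.map { |t| t[:page] }.uniq
  }
end

# Usage (needs a real PDF and the native library):
#   p summarize_pdf("report.pdf")
```

Injecting the API object keeps the helper easy to exercise without the native library.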

Defined Under Namespace

Modules: Form, IO, Image, Raw, Structure, Table, Util

Classes: Annotation, Attachment, Document, Error, FormError, LoadError, Outline, Page, PageError, PasswordError, Search, TextPage

Constant Summary

PDFIUM_ERRORS =
{
  0 => "Success",
  1 => "Unknown error",
  2 => "File not found or could not be opened",
  3 => "File not in PDF format or corrupted",
  4 => "Password required or incorrect",
  5 => "Unsupported security scheme",
  6 => "Page not found or content error"
}.freeze
VERSION =
"0.4.1"

Class Method Summary

Class Method Details

.extract_tables(input, password: nil, keep_blank_rows: false, **opts) ⇒ Object

Extracts every table from every page. Returns Array<{ page: Integer, rows: Array<Array<String>> }>.

`keep_blank_rows: false` (the default) drops the completely empty rows that the `:text` strategy of words_to_edges_h produces by construction: each visual row yields two edges (top and bottom), and spurious rows, as tall as the inter-line gap, form between pairs of adjacent edges. With `keep_blank_rows: true` you get the raw output of Table#extract.
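
The blank-row rule described above can be tried on plain arrays, with no PDF involved. This is a minimal sketch: the sample rows are made up, and `blank_row?` is a hypothetical helper name that mirrors the predicate used in the method source.

```ruby
# Made-up rows shaped like the documented output: Array<Array<String>>.
raw_rows = [
  ["Name", "Qty"],
  ["", ""],        # spurious row between two text lines
  ["Bolt", "12"],
  [nil, ""],       # nil cells count as blank too
  ["Nut", ""]      # kept: at least one cell has content
]

# A row is blank only when every cell is nil or the empty string.
def blank_row?(row)
  row.all? { |cell| cell.nil? || cell.empty? }
end

cleaned = raw_rows.reject { |row| blank_row?(row) }
p cleaned
# => [["Name", "Qty"], ["Bolt", "12"], ["Nut", ""]]
```

Note that a cell holding only whitespace is not empty in Ruby's sense, so a row of `" "` cells would survive this filter.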



# File 'lib/rpdfium.rb', line 70

def self.extract_tables(input, password: nil, keep_blank_rows: false, **opts)
  open(input, password: password) do |doc|
    doc.flat_map do |page|
      Table::Extractor.new(page, **opts).extract.map do |rows|
        rows = rows.reject { |r| r.all? { |c| c.nil? || c.empty? } } unless keep_blank_rows
        { page: page.index, rows: rows }
      end
    end
  end
end

.extract_text(input, password: nil) ⇒ Object

Extracts the text of every page, returning one string per page.



# File 'lib/rpdfium.rb', line 58

def self.extract_text(input, password: nil)
  open(input, password: password) { |doc| doc.map(&:text) }
end

.init! ⇒ Object



# File 'lib/rpdfium/errors.rb', line 21

def init!
  @init_mutex ||= Mutex.new
  @init_mutex.synchronize do
    return if @initialized

    unless Raw.native_loaded?
      raise LoadError, <<~MSG.strip
        PDFium native library not loaded.
        Set ENV["PDFIUM_LIBRARY_PATH"] to libpdfium.{so,dylib,dll}, or
        install the rpdfium-binary gem.
        Original load error: #{Raw.load_error&.message}
      MSG
    end

    Raw.FPDF_InitLibrary
    @initialized = true
    # Automatic cleanup at process exit. Guaranteed order: all Ruby
    # finalizers run before our at_exit blocks.
    at_exit { Raw.FPDF_DestroyLibrary if @initialized }
  end
end
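
The once-only initialization guarded by a mutex can be illustrated without the native library. This is a sketch under stated assumptions: `FakeNative` and `OnceInit` are hypothetical stand-ins, not rpdfium code, and the call counter exists only to show that the guarded section runs a single time.

```ruby
# Hypothetical stand-in for the native binding; it only counts calls.
module FakeNative
  @init_calls = 0
  @counter_lock = Mutex.new

  class << self
    attr_reader :init_calls

    def init_library
      @counter_lock.synchronize { @init_calls += 1 }
    end
  end
end

# Hypothetical module reproducing the guard pattern shown above.
module OnceInit
  @init_mutex = Mutex.new # created eagerly, when the module body is evaluated
  @initialized = false

  class << self
    def init!
      @init_mutex.synchronize do
        return if @initialized # later callers leave without re-initializing

        FakeNative.init_library
        @initialized = true
      end
    end

    def initialized?
      @initialized == true
    end
  end
end

threads = 8.times.map { Thread.new { OnceInit.init! } }
threads.each(&:join)
puts FakeNative.init_calls
```

The sketch creates its mutex when the module is defined rather than with `||=` inside the method, a design choice that avoids having to reason about two threads reaching the lazy assignment at the same time.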

.initialized? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/rpdfium/errors.rb', line 43

def initialized?
  @initialized == true
end

.last_error_code ⇒ Object



# File 'lib/rpdfium/errors.rb', line 47

def last_error_code
  Raw.FPDF_GetLastError
end

.last_error_message ⇒ Object



# File 'lib/rpdfium/errors.rb', line 51

def last_error_message
  PDFIUM_ERRORS[last_error_code] || "Unknown PDFium error (#{last_error_code})"
end
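
The lookup-with-fallback behavior can be checked in isolation. In this sketch `ERROR_MESSAGES` is a local copy of the PDFIUM_ERRORS table from the Constant Summary, and `describe_error` is a hypothetical name that takes the code as an argument instead of asking the native library for it.

```ruby
# Local copy of the documented table, so the example runs without rpdfium.
ERROR_MESSAGES = {
  0 => "Success",
  1 => "Unknown error",
  2 => "File not found or could not be opened",
  3 => "File not in PDF format or corrupted",
  4 => "Password required or incorrect",
  5 => "Unsupported security scheme",
  6 => "Page not found or content error"
}.freeze

# Known codes map to their message; anything else gets a generic
# message that still carries the numeric code.
def describe_error(code)
  ERROR_MESSAGES[code] || "Unknown PDFium error (#{code})"
end

puts describe_error(4)  # Password required or incorrect
puts describe_error(99) # Unknown PDFium error (99)
```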

.open(input, password: nil, &block) ⇒ Object



# File 'lib/rpdfium.rb', line 53

def self.open(input, password: nil, &block)
  Document.open(input, password: password, &block)
end

.render_to_pngs(input, output_dir:, scale: 2.0, password: nil) ⇒ Object

Renders each page to a PNG file inside output_dir.



# File 'lib/rpdfium.rb', line 82

def self.render_to_pngs(input, output_dir:, scale: 2.0, password: nil)
  Dir.mkdir(output_dir) unless Dir.exist?(output_dir)
  open(input, password: password) do |doc|
    doc.map do |page|
      path = File.join(output_dir, format("page_%04d.png", page.index + 1))
      page.render_to_png(path, scale: scale)
      path
    end
  end
end
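
The file-naming scheme used above can be previewed without rendering anything. This sketch assumes that `page.index` is zero-based, which the `+ 1` in the source suggests but this page does not state; `page_png_path` is a hypothetical helper name.

```ruby
# Builds the output path for a page, given its (assumed zero-based) index.
# Page numbers in file names are one-based and zero-padded to four digits.
def page_png_path(output_dir, zero_based_index)
  File.join(output_dir, format("page_%04d.png", zero_based_index + 1))
end

paths = (0..2).map { |i| page_png_path("out", i) }
puts paths
# out/page_0001.png
# out/page_0002.png
# out/page_0003.png
```

Because the method calls Dir.mkdir, which creates only the last path component, a nested output_dir whose parent does not exist has to be created beforehand, for example with FileUtils.mkdir_p.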