Module: Rpdfium
- Defined in:
- lib/rpdfium.rb,
lib/rpdfium/raw.rb,
lib/rpdfium/page.rb,
lib/rpdfium/errors.rb,
lib/rpdfium/io/png.rb,
lib/rpdfium/version.rb,
lib/rpdfium/document.rb,
lib/rpdfium/form/form.rb,
lib/rpdfium/table/cells.rb,
lib/rpdfium/table/edges.rb,
lib/rpdfium/table/table.rb,
lib/rpdfium/util/cluster.rb,
lib/rpdfium/search/search.rb,
lib/rpdfium/image/embedded.rb,
lib/rpdfium/structure/tree.rb,
lib/rpdfium/table/debugger.rb,
lib/rpdfium/table/extractor.rb,
lib/rpdfium/util/word_merger.rb,
lib/rpdfium/structure/element.rb,
lib/rpdfium/structure/outline.rb,
lib/rpdfium/util/label_matcher.rb,
lib/rpdfium/util/word_extractor.rb,
lib/rpdfium/structure/attachment.rb,
lib/rpdfium/util/text_extraction.rb,
lib/rpdfium/annotation/annotation.rb,
lib/rpdfium/util/column_inference.rb
Overview
rpdfium - Ruby bindings to PDFium with table extraction.
Top-level API:
Rpdfium.open(path_or_io_or_bytes) { |doc| ... }
Rpdfium.extract_text(path)
Rpdfium.extract_tables(path)
Rpdfium.render_to_pngs(path, output_dir:)
Defined Under Namespace
Modules: Form, IO, Image, Raw, Structure, Table, Util
Classes: Annotation, Attachment, Document, Error, FormError, LoadError, Outline, Page, PageError, PasswordError, Search, TextPage
Constant Summary
- PDFIUM_ERRORS =
{
  0 => "Success",
  1 => "Unknown error",
  2 => "File not found or could not be opened",
  3 => "File not in PDF format or corrupted",
  4 => "Password required or incorrect",
  5 => "Unsupported security scheme",
  6 => "Page not found or content error"
}.freeze
- VERSION =
"0.4.1"
Class Method Summary
- .extract_tables(input, password: nil, keep_blank_rows: false, **opts) ⇒ Object
  Extracts every table from every page.
- .extract_text(input, password: nil) ⇒ Object
  Extracts all text from every page, one string per page.
- .init! ⇒ Object
- .initialized? ⇒ Boolean
- .last_error_code ⇒ Object
- .last_error_message ⇒ Object
- .open(input, password: nil, &block) ⇒ Object
- .render_to_pngs(input, output_dir:, scale: 2.0, password: nil) ⇒ Object
  Renders each page to a PNG file inside output_dir.
Class Method Details
.extract_tables(input, password: nil, keep_blank_rows: false, **opts) ⇒ Object
Extracts every table from every page. Returns Array<{ page: Integer, rows: Array<Array<String>> }>.
`keep_blank_rows: false` (the default) drops the completely empty rows that the `:text` strategy of words_to_edges_h produces by construction: each visual row yields two edges, top and bottom, and between pairs of adjacent edges "spurious rows" appear whose height equals the line-spacing gap. With `keep_blank_rows: true` you get the raw output of Table#extract.
# File 'lib/rpdfium.rb', line 70
def self.extract_tables(input, password: nil, keep_blank_rows: false, **opts)
  open(input, password: password) do |doc|
    doc.flat_map do |page|
      Table::Extractor.new(page, **opts).extract.map do |rows|
        rows = rows.reject { |r| r.all? { |c| c.nil? || c.empty? } } unless keep_blank_rows
        { page: page.index, rows: rows }
      end
    end
  end
end
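The blank-row filter can be tried in isolation on plain arrays. The following is a minimal sketch, not part of the gem: the helper name `drop_blank_rows` and the sample rows are made up for illustration, and only the `reject` expression is taken from the source of `extract_tables`.

```ruby
# Hypothetical helper (not part of rpdfium): removes rows in which every
# cell is nil or an empty string, mirroring the reject call in extract_tables.
def drop_blank_rows(rows)
  rows.reject { |row| row.all? { |cell| cell.nil? || cell.empty? } }
end

# Made-up sample data standing in for the raw rows of one extracted table.
raw_rows = [
  ["Item", "Qty"],
  ["", nil],
  ["Widget", "3"],
  [nil, nil],
  ["Gadget", ""]
]

# Rows 2 and 4 are entirely blank and are dropped; the last row is kept
# because it has one non-empty cell.
drop_blank_rows(raw_rows).each { |row| p row }
```

A row containing a whitespace-only cell such as `" "` would be kept, since `empty?` is false for it.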
.extract_text(input, password: nil) ⇒ Object
Extracts all text from every page, one string per page.
# File 'lib/rpdfium.rb', line 58
def self.extract_text(input, password: nil)
  open(input, password: password) { |doc| doc.map(&:text) }
end
.init! ⇒ Object
# File 'lib/rpdfium/errors.rb', line 21
def init!
  @init_mutex ||= Mutex.new
  @init_mutex.synchronize do
    return if @initialized

    unless Raw.native_loaded?
      raise LoadError, <<~MSG.strip
        PDFium native library not loaded.
        Set ENV["PDFIUM_LIBRARY_PATH"] to libpdfium.{so,dylib,dll},
        or install the rpdfium-binary gem.
        Original load error: #{Raw.load_error&.}
      MSG
    end

    Raw.FPDF_InitLibrary
    @initialized = true
    # Automatic cleanup at process exit. Ordering is guaranteed: all Ruby
    # finalizers run before our at_exit blocks.
    at_exit { Raw.FPDF_DestroyLibrary if @initialized }
  end
end
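The pattern used by `init!` is a mutex-guarded, idempotent one-time initialization. The sketch below illustrates that pattern with a stand-in module; `OneTimeInit` and its `init_calls` counter are hypothetical and replace the real native setup, so nothing here calls PDFium. Unlike the source, the sketch creates the mutex eagerly when the module is defined, which is one possible design choice rather than the gem's behaviour.

```ruby
# Stand-in module (hypothetical, not rpdfium): demonstrates mutex-guarded,
# idempotent one-time initialization.
module OneTimeInit
  @init_mutex = Mutex.new
  @initialized = false
  @init_calls = 0

  class << self
    # Number of times the guarded setup body actually ran.
    attr_reader :init_calls

    def init!
      @init_mutex.synchronize do
        # Early return still releases the mutex.
        return if @initialized

        @init_calls += 1 # placeholder for the expensive native setup
        @initialized = true
      end
    end

    def initialized?
      @initialized == true
    end
  end
end

# Eight threads race to initialize; the setup body runs only once.
threads = Array.new(8) { Thread.new { OneTimeInit.init! } }
threads.each(&:join)
puts OneTimeInit.init_calls
```

Calling `OneTimeInit.init!` again afterwards is a no-op, which is the property that makes it safe to invoke from every entry point.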
.initialized? ⇒ Boolean
# File 'lib/rpdfium/errors.rb', line 43
def initialized?
  @initialized == true
end
.last_error_code ⇒ Object
# File 'lib/rpdfium/errors.rb', line 47
def last_error_code
  Raw.FPDF_GetLastError
end
.last_error_message ⇒ Object
# File 'lib/rpdfium/errors.rb', line 51
def last_error_message
  PDFIUM_ERRORS[last_error_code] || "Unknown PDFium error (#{last_error_code})"
end
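The lookup-with-fallback in `last_error_message` can be exercised without the native library. This is a sketch under stated assumptions: `PDFIUM_ERROR_TABLE` is a local copy of the `PDFIUM_ERRORS` hash from the Constant Summary, and `error_message_for` is a hypothetical helper that takes the code as an argument instead of calling `Raw.FPDF_GetLastError`.

```ruby
# Local copy of the PDFIUM_ERRORS table listed in the Constant Summary.
PDFIUM_ERROR_TABLE = {
  0 => "Success",
  1 => "Unknown error",
  2 => "File not found or could not be opened",
  3 => "File not in PDF format or corrupted",
  4 => "Password required or incorrect",
  5 => "Unsupported security scheme",
  6 => "Page not found or content error"
}.freeze

# Hypothetical helper (not part of rpdfium): same lookup and fallback as
# last_error_message, with the error code passed in explicitly.
def error_message_for(code)
  PDFIUM_ERROR_TABLE[code] || "Unknown PDFium error (#{code})"
end

puts error_message_for(4)  # known code: looked up in the table
puts error_message_for(99) # unknown code: falls back to the generic message
```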
.open(input, password: nil, &block) ⇒ Object
# File 'lib/rpdfium.rb', line 53
def self.open(input, password: nil, &block)
  Document.open(input, password: password, &block)
end
.render_to_pngs(input, output_dir:, scale: 2.0, password: nil) ⇒ Object
Renders each page to a PNG file inside output_dir.
# File 'lib/rpdfium.rb', line 82
def self.render_to_pngs(input, output_dir:, scale: 2.0, password: nil)
  Dir.mkdir(output_dir) unless Dir.exist?(output_dir)
  open(input, password: password) do |doc|
    doc.map do |page|
      path = File.join(output_dir, format("page_%04d.png", page.index + 1))
      page.render_to_png(path, scale: scale)
      path
    end
  end
end
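The output file naming can be checked on its own. A minimal sketch, assuming `page.index` is zero-based (the `+ 1` in the source suggests this, but it is not stated); `png_path_for` is a hypothetical helper that reuses the `File.join` and `format` expression from `render_to_pngs`.

```ruby
# Hypothetical helper (not part of rpdfium): builds the PNG path for a page,
# assuming a zero-based page index as the "+ 1" in render_to_pngs suggests.
def png_path_for(output_dir, page_index)
  File.join(output_dir, format("page_%04d.png", page_index + 1))
end

# Page numbers are zero-padded to four digits, so the files sort
# lexicographically in page order up to page 9999.
[0, 9, 122].each { |index| puts png_path_for("out", index) }
```

Because the source uses `Dir.mkdir`, which creates a single directory level, a nested `output_dir` whose parent does not exist would need to be created beforehand, for example with `FileUtils.mkdir_p`.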