Class: Rubino::Documents::Converters::Pdf

Inherits:

Object

Defined in: lib/rubino/documents/converters/pdf.rb

Overview

PDF -> Markdown via `pdf-reader` (pure Ruby, MIT, optional dependency). Text-first: each page's text is extracted and the pages are joined with a blank line. Honest limits (documented in the specs):

- No OCR. A scanned / image-only PDF yields no extractable text; we
  return a clear "no extractable text (scanned?)" note, not a crash.
- Multi-column / complex layout: pdf-reader gives reading order by
  token position, which is imperfect for multi-column pages -- word
  order may differ from the visual layout. Best-effort, not exact.
- Tables: the token-position table heuristic that markitdown applies via
  pdfplumber is intentionally deferred; it is the hard, low-ceiling part.
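
The text-first behaviour described above can be sketched without the gem. This is only an illustration under stated assumptions: `join_pages` and `SCANNED_NOTE` are hypothetical names, the note wording is a stand-in for whatever the library returns, and stripping each page is assumed rather than confirmed by this page.

```ruby
# Hypothetical stand-in for the converter's note; the exact wording used
# by the library is not shown in this document.
SCANNED_NOTE = "no extractable text (scanned?)"

# Join per-page text with a blank line, skipping pages that have no text.
# Falls back to the note when nothing extractable remains.
def join_pages(page_texts)
  text = page_texts.map { |t| t.to_s.strip }.reject(&:empty?).join("\n\n")
  text.empty? ? SCANNED_NOTE : text
end

puts join_pages(["Page one", "", "Page three"])
puts join_pages(["", "   "])
```

The second call models a scanned, image-only PDF: every page is blank, so the caller gets the note instead of an empty string or an exception.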

Constant Summary

MIMES = %w[application/pdf].freeze

Instance Method Details

#accepts?(mime, path) ⇒ `Boolean`

Returns:

(Boolean)

# File 'lib/rubino/documents/converters/pdf.rb', line 26

def accepts?(mime, path)
  return true if MIMES.include?(mime.to_s)

  File.extname(path.to_s).downcase == ".pdf"
end
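
To make the matching rules concrete, here is a free-standing copy of the same check. `pdf_candidate?` and `PDF_MIMES` are hypothetical names used only for this sketch; the logic mirrors the listing above.

```ruby
PDF_MIMES = %w[application/pdf].freeze

# Same two-step check as the listing: exact MIME match first, then a
# case-insensitive look at the file extension.
def pdf_candidate?(mime, path)
  return true if PDF_MIMES.include?(mime.to_s)

  File.extname(path.to_s).downcase == ".pdf"
end

p pdf_candidate?("application/pdf", "upload.bin") # => true  (MIME wins, extension ignored)
p pdf_candidate?(nil, "REPORT.PDF")               # => true  (extension is downcased)
p pdf_candidate?("Application/PDF", "upload.bin") # => false (MIME comparison is case-sensitive)
p pdf_candidate?("text/plain", "notes.txt")       # => false
```

Note the asymmetry: the extension is normalised with `downcase`, but the MIME string is compared as given, so a caller that may receive mixed-case MIME types would need to normalise them first.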

#available? ⇒ `Boolean`

Returns:

(Boolean)

# File 'lib/rubino/documents/converters/pdf.rb', line 19

def available?
  require "pdf/reader"
  true
rescue LoadError
  false
end
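
The listing above is an instance of a common optional-dependency probe: attempt the `require` and turn `LoadError` into `false`. A generic version, with the hypothetical helper name `feature_available?`, might look like this:

```ruby
# Returns true if the named feature can be loaded, false otherwise.
# Only LoadError is rescued, so other failures raised while loading the
# file still propagate to the caller.
def feature_available?(feature)
  require feature
  true
rescue LoadError
  false
end

p feature_available?("set")                       # standard library
p feature_available?("no_such_feature_for_sketch") # not installed
```

Ruby records a successfully required feature, so repeated calls after the first success do not load the file again.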

#convert(path) ⇒ `Object`

# File 'lib/rubino/documents/converters/pdf.rb', line 32

def convert(path)
  require "pdf/reader"
  reader = PDF::Reader.new(path)
  pages = reader.pages.map { |page| page_text(page) }
  text = pages.reject(&:empty?).join("\n\n")
  return scanned_note if text.strip.empty?

  text
rescue PDF::Reader::MalformedPDFError, PDF::Reader::UnsupportedFeatureError
  scanned_note
end
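
A caller would typically combine the three methods in order: `available?`, then `accepts?`, then `convert`. The sketch below shows that flow with a stub. `StubPdfConverter` and `to_markdown` are hypothetical and exist only so the example runs without the optional gem; the stub's `convert` returns canned text rather than reading a PDF.

```ruby
# Stub with the same three-method interface as the converter documented
# above. It is NOT the real class: convert returns canned text.
class StubPdfConverter
  def available?
    true
  end

  def accepts?(mime, path)
    mime.to_s == "application/pdf" || File.extname(path.to_s).downcase == ".pdf"
  end

  def convert(path)
    "text of #{File.basename(path.to_s)}"
  end
end

# Hypothetical caller: returns the converted text, or nil when the
# converter is unavailable or does not accept the input.
def to_markdown(converter, mime, path)
  return nil unless converter.available?
  return nil unless converter.accepts?(mime, path)

  converter.convert(path)
end

p to_markdown(StubPdfConverter.new, "application/pdf", "docs/report.pdf")
p to_markdown(StubPdfConverter.new, "text/plain", "notes.txt")
```

With the real converter, keep in mind that the listing above returns the scanned note both for text-free PDFs and for `MalformedPDFError` / `UnsupportedFeatureError`, so the return value alone does not tell those cases apart.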