Class: Rubino::Documents::Converters::Pdf
- Inherits: Object
  - Object
    - Rubino::Documents::Converters::Pdf
- Defined in: lib/rubino/documents/converters/pdf.rb
Overview
PDF -> Markdown via `pdf-reader` (pure Ruby, MIT, OPTIONAL). Text-first: each page’s text is extracted and pages are joined with a blank line. Honest limits (documented in specs):
- No OCR. A scanned / image-only PDF yields no extractable text; we
return a clear "no extractable text (scanned?)" note, not a crash.
- Multi-column / complex layout: pdf-reader gives reading order by
token position, which is imperfect for multi-column pages -- word
order may differ from the visual layout. Best-effort, not exact.
- The token-position table heuristic markitdown does with pdfplumber is
intentionally deferred; it is the hard, low-ceiling part.
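The text-first behaviour described above can be sketched in isolation, without `pdf-reader` installed. This is a minimal illustration, not the class's actual code: `join_pages` and `SCANNED_NOTE` are hypothetical names, and the note's wording is an assumption based on the description above.

```ruby
# Hypothetical stand-in for the converter's note; the exact wording is an assumption.
SCANNED_NOTE = "no extractable text (scanned?)"

# Joins already-extracted page texts with a blank line, skipping empty pages.
# Falls back to the note when nothing but whitespace remains.
def join_pages(page_texts)
  text = page_texts.reject(&:empty?).join("\n\n")
  return SCANNED_NOTE if text.strip.empty?

  text
end

join_pages(["Page one", "", "Page three"]) # => "Page one\n\nPage three"
join_pages(["", "  "])                      # => SCANNED_NOTE
```

A scanned PDF reaches the second case because every page yields an empty or whitespace-only string, so the caller receives a readable note rather than an exception.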
Constant Summary
- MIMES = %w[application/pdf].freeze
Instance Method Summary
- #accepts?(mime, path) ⇒ Boolean
- #available? ⇒ Boolean
- #convert(path) ⇒ Object
Instance Method Details
#accepts?(mime, path) ⇒ Boolean
    # File 'lib/rubino/documents/converters/pdf.rb', line 26
    def accepts?(mime, path)
      return true if MIMES.include?(mime.to_s)

      File.extname(path.to_s).downcase == ".pdf"
    end
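To make the two acceptance paths concrete, here is a standalone sketch of the same check. `PDF_MIMES` and `pdf_candidate?` are hypothetical names chosen so the snippet runs outside the class; the logic mirrors `#accepts?` as listed above.

```ruby
# Hypothetical standalone copy of the MIME list, for illustration only.
PDF_MIMES = %w[application/pdf].freeze

# True when the MIME type matches, or failing that, when the extension is .pdf
# (compared case-insensitively).
def pdf_candidate?(mime, path)
  return true if PDF_MIMES.include?(mime.to_s)

  File.extname(path.to_s).downcase == ".pdf"
end

pdf_candidate?("application/pdf", "upload.bin") # => true  (MIME wins)
pdf_candidate?(nil, "Report.PDF")               # => true  (extension fallback)
pdf_candidate?("text/plain", "notes.txt")       # => false
```

Calling `to_s` on both arguments means a `nil` MIME type or a `Pathname` path is handled without a separate guard.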
#available? ⇒ Boolean
    # File 'lib/rubino/documents/converters/pdf.rb', line 19
    def available?
      require "pdf/reader"
      true
    rescue LoadError
      false
    end
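The optional-dependency probe used by `#available?` generalises to any library. The sketch below uses a hypothetical helper, `library_available?`, and probes the standard library's `set` so that it runs without `pdf-reader` installed.

```ruby
# Returns true when the named library can be required, false when it is missing.
# Only LoadError is rescued, so errors raised while the library loads still surface.
def library_available?(feature)
  require feature
  true
rescue LoadError
  false
end

library_available?("set")                       # => true
library_available?("no_such_library_for_docs") # => false
```

`require` is idempotent, so probing repeatedly is cheap once the library has been loaded.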
#convert(path) ⇒ Object
    # File 'lib/rubino/documents/converters/pdf.rb', line 32
    def convert(path)
      require "pdf/reader"
      reader = PDF::Reader.new(path)
      pages = reader.pages.map { |page| page_text(page) }
      text = pages.reject(&:empty?).join("\n\n")
      return scanned_note if text.strip.empty?

      text
    rescue PDF::Reader::MalformedPDFError, PDF::Reader::UnsupportedFeatureError
      scanned_note
    end
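One plausible way for a caller to combine the three methods is to check `available?`, then `accepts?`, and only then call `convert`. How Rubino actually dispatches to its converters is not shown on this page, so the following is an assumption: `convert_if_possible` and `StubPdfConverter` are hypothetical, and the stub only duck-types the documented interface so the example runs without `pdf-reader`.

```ruby
# Hypothetical stub that duck-types the three documented methods.
class StubPdfConverter
  def available?
    true
  end

  def accepts?(mime, path)
    mime.to_s == "application/pdf" || File.extname(path.to_s).downcase == ".pdf"
  end

  def convert(path)
    "converted #{File.basename(path.to_s)}"
  end
end

# Returns the converted text, or nil when the converter is unavailable
# or does not accept the input.
def convert_if_possible(converter, mime, path)
  return nil unless converter.available?
  return nil unless converter.accepts?(mime, path)

  converter.convert(path)
end

convert_if_possible(StubPdfConverter.new, "application/pdf", "/tmp/a.pdf") # => "converted a.pdf"
convert_if_possible(StubPdfConverter.new, "text/plain", "notes.txt")       # => nil
```

Checking availability first keeps the optional dependency optional: a missing gem results in `nil` instead of a `LoadError` inside `convert`.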