Pdfsink
Ruby wrapper for pdfsink-rs, a fast pure-Rust PDF extraction tool. Extract text, words, page objects, tables, hyperlinks, and regex matches from PDFs — without a Python runtime.
The gem shells out to the bundled pdfsink-rs binary and parses its JSON
output, so there is nothing to load into your Ruby process and no FFI.
Requirements
- Ruby >= 3.2
- For source installs only: a Rust toolchain (
cargo) to compile the binary. Precompiled platform gems bundle the binary and need no toolchain.
Installation
# Gemfile
gem "pdfsink"
bundle install
# or
gem install pdfsink
On supported platforms (x86_64-linux, aarch64-linux, x86_64-darwin,
arm64-darwin) a precompiled gem ships the binary. Elsewhere the source gem
compiles it on install via cargo install pdfsink-rs.
Building from source
git clone https://github.com/AccountAim/pdfsink-ruby
cd pdfsink-ruby
bundle install
bundle exec rake cargo:build # compiles the binary into lib/pdfsink/
bundle exec rspec
To point the gem at a binary you built elsewhere, set PDFSINK_BIN:
export PDFSINK_BIN=/path/to/pdfsink-rs
Usage
Open a document
doc = Pdfsink.open("report.pdf")
doc.page_count # => 12
doc.pages # => [#<Pdfsink::Page number=1 612.0x792.0>, ...]
doc.each_page { |page| puts page.extract_text }
Extract text
page = doc.page(1)
page.extract_text # => "Quarterly Report\n..."
# One-shot, without holding a Document:
Pdfsink.extract_text("report.pdf", page: 1)
Words with positions
page.extract_words
# => [{"text" => "Quarterly", "x0" => 72.0, "top" => 90.1, ...}, ...]
Page objects
page.object_counts # => {"chars" => 812, "lines" => 14, "rects" => 3, ...}
page.objects # => {"chars" => [...], "lines" => [...], ...}
Tables
page.tables # default strategy ("lines")
page.tables(strategy: :text) # infer from text alignment
# => [["Name", "Age", "City"], ["Alice", "31", "Oakland"], ...]
Strategies: :lines, :lines_strict, :text, :explicit
(see Pdfsink::TableStrategy).
Hyperlinks
page.links
# => [{"uri" => "https://example.com", "x0" => 72.0, ...}]
Search
page.search(/total:\s*\$\d+/i)
# => [{"text" => "Total: $420", "x0" => ..., ...}]
Binary version
Pdfsink.version # => "0.2.8"
Rails Integration
# config/application.rb (or an initializer)
config.pdfsink.default_table_strategy = :text
config.pdfsink.binary_path = Rails.root.join("bin/pdfsink-rs").to_s
Configuration
Pdfsink.configure do |config|
config.default_table_strategy = :text # used when Page#tables gets no strategy
end
The binary is located in this order:
PDFSINK_BINenvironment variablelib/pdfsink/pdfsink-rsinside the gem (bundled / built)ext/pdfsink/bin/pdfsink-rs(dev location)pdfsink-rsonPATH
Development
bundle install
bundle exec rake cargo:build
bundle exec rspec
License
MIT — see LICENSE.