Class: Pdfsink::Page

Inherits:
Object
Defined in:
lib/pdfsink/page.rb

Overview

A single page of a Document.

Each accessor shells out to the pdfsink-rs binary for that page. The argument-free extractors (extract_text, extract_words, objects, links) cache their results, so repeated reads don’t re-spawn the process; search and tables spawn the binary on every call. Page-level metadata (dimensions, rotation, bbox, object counts) comes from the document’s info payload and needs no extra spawn.

Examples:

doc  = Pdfsink::Document.open("report.pdf")
page = doc.page(1)
page.width            # => 612.0
page.extract_text     # => "Quarterly Report\n..."
page.tables           # => [[["Q1", "Q2"], ["10", "20"]]]
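
The caching described in the overview can be illustrated with a small stand-in. This is only a sketch of the `||=` memoization pattern: FakeCli and CachedPage are hypothetical names, not part of Pdfsink, and the counter exists purely to show that the expensive call runs once.

```ruby
# Hypothetical stand-in for the CLI wrapper; it only counts how often it is called.
class FakeCli
  @calls = 0

  class << self
    attr_reader :calls

    def text(_path, _number)
      @calls += 1
      "Quarterly Report\n"
    end
  end
end

# Hypothetical page-like object using the same ||= memoization as the accessors.
class CachedPage
  def initialize(path, number)
    @path   = path
    @number = number
  end

  def extract_text
    @extract_text ||= FakeCli.text(@path, @number)
  end
end

page   = CachedPage.new("report.pdf", 1)
first  = page.extract_text
second = page.extract_text

puts FakeCli.calls          # 1
puts first.equal?(second)   # true
```

One property of `||=` worth keeping in mind: a nil or false result is not remembered, so an accessor that legitimately returns nil would call through again on the next read.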

Constructor Details

#initialize(document, number, meta) ⇒ Page

Returns a new instance of Page.

Parameters:

  • document (Document)
  • number (Integer)

    1-based page number

  • meta (Hash)

    the per-page slice of the document info payload



# File 'lib/pdfsink/page.rb', line 24

def initialize(document, number, meta)
  @document = document
  @number   = number
  @meta     = meta
end

Instance Attribute Details

#number ⇒ Integer (readonly)

Returns 1-based page number.

Returns:

  • (Integer)

    1-based page number



# File 'lib/pdfsink/page.rb', line 19

def number
  @number
end

Instance Method Details

#bbox ⇒ Hash

Returns the page bounding box (“x0”, “top”, “x1”, “bottom”).

Returns:

  • (Hash)

    the page bounding box (“x0”, “top”, “x1”, “bottom”)



# File 'lib/pdfsink/page.rb', line 40

def bbox = @meta["bbox"]
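
As a rough illustration of working with a bounding-box hash, the sketch below derives a width and height from opposite edges. The sample hash, including the "x0" key and all of the values, is an assumption made for the example and is not taken from a real document.

```ruby
# Sample bbox hash; keys and values are assumptions for illustration only.
bbox = { "x0" => 0.0, "top" => 0.0, "x1" => 612.0, "bottom" => 792.0 }

# Width and height follow from opposite edges of the box.
# fetch raises KeyError on a missing key instead of silently yielding nil.
bbox_width  = bbox.fetch("x1") - bbox.fetch("x0")
bbox_height = bbox.fetch("bottom") - bbox.fetch("top")

puts bbox_width    # 612.0
puts bbox_height   # 792.0
```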

#extract_text ⇒ String

The page’s text in reading order.

Returns:

  • (String)


# File 'lib/pdfsink/page.rb', line 48

def extract_text
  @extract_text ||= Cli.text(path, number)
end

#extract_words ⇒ Array<Hash>

Words with positions and font metadata.

Returns:

  • (Array<Hash>)


# File 'lib/pdfsink/page.rb', line 55

def extract_words
  @extract_words ||= Cli.words(path, number)
end

#height ⇒ Float

Returns page height in PDF points.

Returns:

  • (Float)

    page height in PDF points



# File 'lib/pdfsink/page.rb', line 34

def height = @meta["height"]

#inspect ⇒ Object



# File 'lib/pdfsink/page.rb', line 89

def inspect
  "#<Pdfsink::Page number=#{number} #{width}x#{height}>"
end

#links ⇒ Array<Hash>

Hyperlinks on the page.

Returns:

  • (Array<Hash>)


# File 'lib/pdfsink/page.rb', line 69

def links
  @links ||= Cli.links(path, number)
end

#object_counts ⇒ Hash

Returns counts of each object kind on the page.

Returns:

  • (Hash)

    counts of each object kind on the page



# File 'lib/pdfsink/page.rb', line 43

def object_counts = @meta["object_counts"]

#objects ⇒ Hash

Every page object (chars, lines, rects, curves, images, annots, …).

Returns:

  • (Hash)

    keyed by object kind



# File 'lib/pdfsink/page.rb', line 62

def objects
  @objects ||= Cli.objects(path, number)
end
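
A hash keyed by object kind lends itself to simple tallies and filters. The sketch below uses a hand-written sample payload; the assumption that each value is an Array of Hashes, and the fields inside those hashes, are illustrative guesses rather than the documented shape.

```ruby
# Hypothetical payload: a Hash keyed by object kind, each value assumed to be an Array.
objects = {
  "chars" => [{ "text" => "Q" }, { "text" => "1" }],
  "lines" => [{ "x0" => 0.0, "x1" => 612.0 }],
  "rects" => []
}

# Tally how many objects of each kind are present.
counts = objects.transform_values(&:size)
puts counts.inspect

# Keep only the kinds that actually occur on the page.
present = objects.reject { |_kind, items| items.empty? }.keys
puts present.inspect
```

When only the tallies are needed, #object_counts is likely the cheaper choice, since it reads from the info payload and does not spawn the binary.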

#rotation ⇒ Integer

Returns clockwise rotation in degrees (0, 90, 180, 270).

Returns:

  • (Integer)

    clockwise rotation in degrees (0, 90, 180, 270)



# File 'lib/pdfsink/page.rb', line 37

def rotation = @meta["rotation"]

#search(pattern) ⇒ Array<Hash>

Regex search matches within the page text.

Parameters:

  • pattern (String, Regexp)

    the pattern to search for

Returns:

  • (Array<Hash>)


# File 'lib/pdfsink/page.rb', line 77

def search(pattern)
  Cli.search(path, number, pattern.is_a?(Regexp) ? pattern.source : pattern.to_s)
end
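
The pattern conversion in the source above has one consequence worth seeing in isolation: Regexp#source returns only the pattern text, so flags such as /i are not carried over. The sketch below reproduces that one expression in a standalone helper; normalize_pattern is a hypothetical name used for illustration. Whether the binary's regex dialect matches Ruby's is not stated here, so treat Ruby-specific syntax with care.

```ruby
# Mirrors the conversion used by #search; normalize_pattern is a hypothetical helper name.
def normalize_pattern(pattern)
  pattern.is_a?(Regexp) ? pattern.source : pattern.to_s
end

puts normalize_pattern(/total:\s+\d+/)   # total:\s+\d+
puts normalize_pattern(/total/i)         # total (the i flag is dropped)
puts normalize_pattern(:total)           # total
```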

#tables(strategy: nil) ⇒ Array<Array>?

The page’s largest detected table, or nil if none is found.

Parameters:

  • strategy (Symbol, String, nil) (defaults to: nil)

    table-detection strategy

Returns:

  • (Array<Array>, nil)

    rows of cells



# File 'lib/pdfsink/page.rb', line 85

def tables(strategy: nil)
  Cli.table(path, number, TableStrategy.resolve(strategy))
end
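
Rows of cells are often easier to consume as one Hash per data row. The sketch below is a post-processing helper under two assumptions: the first row is a header, and a nil result means no table was found. rows_to_records is a hypothetical name and not part of Pdfsink.

```ruby
# Turn rows of cells into one Hash per data row, keyed by the header row.
# Returns [] for nil or empty input so callers can iterate unconditionally.
def rows_to_records(rows)
  return [] if rows.nil? || rows.empty?

  header, *body = rows
  body.map { |row| header.zip(row).to_h }
end

sample  = [["Q1", "Q2"], ["10", "20"]]
records = rows_to_records(sample)

puts records.first["Q1"]            # 10
puts rows_to_records(nil).length    # 0
```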

#width ⇒ Float

Returns page width in PDF points.

Returns:

  • (Float)

    page width in PDF points



# File 'lib/pdfsink/page.rb', line 31

def width = @meta["width"]